All nodes bootstrap properly and reach 100%; both
authorities manage to vote and exchange information.
The relays and the client also bootstrap to 100%.
When are these messages logged?
Sorry, I must update this: the authorities bootstrap to 100%, but the
relays and the client are stuck at 80% (sometimes reaching 85%).
Nevertheless, the consensus seems to lack
relays with the Guard flag:
Feb 12 10:35:56.000 [notice] I learned some more
directory information, but not enough to build a circuit:
We need more microdescriptors: we have 2/2,
This log message says that there are only 2 nodes
in the consensus at that time.
and can only build 0% of likely paths.
(We have 0% of guards bw, 100% of midpoint bw, and 100%
of end bw (no exits in consensus,
This log message says that there are no exits in the
consensus at that time.
Right now there are even fewer available nodes and less bandwidth
showing up in the logs. This changes between runs, but never to more
promising numbers.
using mid) = 0% of path bw.)
Because of this, no default
circuits can be built in the client or the relays.
When there are only 2 nodes in the network, you
can't build a 3-hop path.
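(Tor multiplies the bandwidth fractions along a path, so the log
line above works out as 0% guard bw × 100% midpoint bw × 100% end
bw = 0% of path bw: with no usable guards, every candidate path
fails at hop #1.)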
There should be 8 nodes in total, so it's kind of strange that only 2
seem to be available to this relay.
In all logs, the following message
appears every second:
[warn] Failed to find node for hop #1 of our path.
Discarding this circuit.
…
In the data_dir/state file I see several guard entries:
Guard in=default rsa_id=[...] nickname=auth01
sampled_on=2019-01-17T18:33:12 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=[...] nickname=relay03
sampled_on=2019-01-22T17:17:10 sampled_by=0.3.5.7
unlisted_since=2019-01-27T11:00:36 listed=0
Guard in=default rsa_id=[...] nickname=relay02
sampled_on=2019-01-24T22:19:10 sampled_by=0.3.5.7
unlisted_since=2019-01-29T09:08:59 listed=0
Guard in=default rsa_id=[...] nickname=relay03
sampled_on=2019-02-06T21:07:36 sampled_by=0.3.5.7
listed=1
Guard in=default rsa_id=[...] nickname=relay05
sampled_on=2019-01-27T16:37:38 sampled_by=0.3.5.7 listed=1
The state file says that there were some nodes
in some previous consensuses. None of these nodes come from
the current consensus at the time of your log messages.
I use a bash script that manages all the VMs. It kills Tor on all
machines, then waits 5 seconds just to be sure
(ShutdownWaitLength 0), then removes all cached files, old logs, the
state file, ... and some more things on the authorities (see below).
ssh auth01 rm /var/lib/tor/cached*
ssh auth01 rm /var/lib/tor/*.log
ssh auth01 rm /var/lib/tor/state
ssh auth01 rm -r /var/lib/tor/router-stability
ssh auth01 rm -r /var/lib/tor/sr-state
ssh auth01 rm -r /var/lib/tor/v3-status-votes
ssh auth01 rm -r /var/lib/tor/diff-cache
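In sketch form, the generic cleanup that runs on every machine looks
like this (hostnames are examples; the real list covers all 8 VMs):

# Stop Tor everywhere, wait, then wipe caches, logs and state.
HOSTS="auth01 auth02 relay01 relay02 relay03 client01"  # adjust to your setup
for host in $HOSTS; do ssh "$host" "pkill -x tor || true"; done
sleep 5   # ShutdownWaitLength is 0, so 5 seconds is plenty
for host in $HOSTS; do
  ssh "$host" "rm -f /var/lib/tor/cached* /var/lib/tor/*.log /var/lib/tor/state"
done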
The client also seems to receive
a complete consensus; at least, all fingerprints from my
setup show up when I fetch the file manually.
How do you fetch the file manually, and from where?
wget http://authip:7000/tor/server/all
which should be the cached-descriptors.new file on the authority
(which also means it gets deleted on each new startup and must be
fresh).
In this file I see all the fingerprints that are supposed to be
there. It's also possible to connect to the client's control port
and manually build circuits through all the relays that should be
there. This indicates that the client knows the relays (using a
fingerprint that is not in the consensus would not work).
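Note that /tor/server/all returns the router descriptors the
authority knows about, not the consensus itself, so the two can
disagree. To see what the current consensus actually contains
(including the Guard and Exit flags), you can fetch it from the same
DirPort, for example:

# Fetch the current consensus; each relay's "r" line is followed by
# an "s" line listing its flags (Guard, Exit, Stable, ...).
wget -qO- http://authip:7000/tor/status-vote/current/consensus | grep -E '^(r|s) '

Since your client is using microdescriptors, the microdesc consensus
at /tor/status-vote/current/consensus-microdesc is the one to check.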
Again, guards also show up in the relays' state files:
Guard in=default rsa_id=C122CBB79DC660621E352D401AD7F781F8F6D62D
nickname=relay03 sampled_on=2019-02-07T16:24:21 sampled_by=0.3.5.7
listed=1
Guard in=default rsa_id=2B74825BE33752B21D17713F88D101F3BADC79BC
nickname=relay06 sampled_on=2019-02-03T22:16:29 sampled_by=0.3.5.7
listed=1
Guard in=default rsa_id=E4B1152CDF0E5FE697A3E916716FC363A2A0ACF3
nickname=relay07 sampled_on=2019-02-12T18:51:00 sampled_by=0.3.5.7
listed=1
Guard in=default rsa_id=911EDA6CB639AAE955517F02AA4D651E0F7F6EFD
nickname=relay02 sampled_on=2019-02-11T22:58:28 sampled_by=0.3.5.7
listed=1
Guard in=default rsa_id=8E574F0C428D235782061F44B2D20A66E4336993
nickname=relay05 sampled_on=2019-02-01T17:46:05 sampled_by=0.3.5.7
listed=1
The dates are still old, even though I delete all state in the big
cleanup procedure. Are there more old caches I need to remove?
Where does the date information come from?
I'm not sure what is happening here. It looks like
some consensuses only have 2 nodes. But other consensuses have
most of the nodes.
You might have a bug in your network setup, or you
may have found a bug in Tor.
I think it's a bug somewhere in the setup but I just can't find it
:(
The most likely explanation is that you had a
working network at some time, which gave you the state file. And
you had a failed network at some time, which gave you the log
messages.
I suggest that you start again with the same
config, but remove all previous state.
(Move the cached state, consensuses, descriptors,
and log files somewhere else. Do not remove the keys.)
Then you'll know if your current network actually
works.
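For example, on each machine (relay01 stands in for each of your
hosts; the keys directory is deliberately left in place):

# Move the old state aside rather than deleting it, keeping the keys.
ssh relay01 "mkdir -p /var/lib/tor-old &&
  mv /var/lib/tor/cached* /var/lib/tor/state /var/lib/tor/*.log /var/lib/tor-old/"
# /var/lib/tor/keys is untouched, so identities and fingerprints survive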
My questions are: why does the client know all the relays'
fingerprints while the network still fails to finish bootstrapping
and build a complete circuit? Are there any other things I should
look into and check to understand the problem?
T