On 14 Feb 2019, at 02:57, Katharina Kohls <katharina.kohls@rub.de> wrote:

All nodes bootstrap properly and reach 100%, the authorities both manage to vote and exchange information. Also the relays and the client bootstrap to 100%.

When are these messages logged?

Sorry, I must update this: The authorities bootstrap to 100%, but the relays and the client are stuck at 80% (and sometimes reach 85%).

We recently changed the bootstrap percentages and messages in Tor.
Please paste the log lines that contain these bootstrap messages.
And any error messages near those lines.
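
One way to collect those lines is a small grep wrapper. This is only a sketch: the function name bootstrap_lines is made up for illustration, and it assumes Tor writes its log to a file whose path you pass in.

```shell
# Print bootstrap progress lines, plus any warnings and errors, from a Tor log.
# Line numbers (-n) make it easy to see which errors sit near which
# bootstrap message.
# Usage: bootstrap_lines /path/to/tor.log   (the path is an assumption)
bootstrap_lines() {
    grep -n -E 'Bootstrapped [0-9]+%|\[warn\]|\[err\]' "$1"
}
```

Running it against the log of each relay and the client, and pasting the output, should be enough to see where bootstrap stalls.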

You might get better bootstrap messages using Tor master.

Nevertheless, the consensus seems to lack relays with guard flags:

Feb 12 10:35:56.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 2/2,

This log message says that there are only 2 nodes in the consensus at that time.
and can only build 0% of likely paths. (We have 0% of guards bw, 100% of midpoint bw, and 100% of end bw (no exits in consensus,

This log message says that there are no exits in the consensus at that time.

Right now there are even fewer available nodes and less bandwidth showing up in the logs. This changes between runs but never to more promising numbers.

To get good bandwidth numbers, you'll need to pass some traffic
through your network. To get measured bandwidth in the votes,
you'll need to run a bandwidth authority, like sbws:
https://git.torproject.org/sbws.git

using mid) = 0% of path bw.)

Because of this, no default circuits can be built in the client or the relays

When there are only 2 nodes in the network, you can't build a 3-hop path.
There should be 8 nodes in total so it's kind of strange that only 2 seem to be available in this relay.

It would help to know what's actually in the consensus. (See below.)



In the data_dir/state file I see several guard entries:
Guard in=default rsa_id=[...] nickname=auth01 sampled_on=2019-01-17T18:33:12 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=[...] nickname=relay03 sampled_on=2019-01-22T17:17:10 sampled_by=0.3.5.7 unlisted_since=2019-01-27T11:00:36 listed=0


The state file says that there were some nodes in some previous consensuses. None of these nodes come from the current consensus at the time of your log messages.

I use a bash script that manages all the VMs. It kills Tor on all machines, then waits for 5 seconds just to be sure (ShutdownWaitLength 0),

Maybe there's a bug in ShutdownWaitLength.
We changed that code recently.
Is Tor actually shut down when you remove the files?

When you start Tor, what is actually in the data directory?
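
Instead of sleeping for a fixed 5 seconds, the script could wait until the Tor process is really gone before it removes anything. A minimal sketch, meant to run on the machine where Tor runs; the name wait_for_exit and the 30 second default are my own choices, and how you obtain the PID (for example from a PidFile) is left to your setup.

```shell
# Wait until the process with the given PID has exited.
# Returns 0 once it is gone, 1 if it is still alive after the timeout.
# Usage: wait_for_exit PID [TIMEOUT_SECONDS]
wait_for_exit() {
    pid=$1
    remaining=${2:-30}
    # kill -0 sends no signal; it only checks whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

If this returns 1, the cleanup should stop rather than delete files underneath a running Tor. Listing the data directory (ls -la) right before starting Tor again shows what is actually left.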

then removes all cached files, old logs, the state file, ... and some more stuff on the authorities (see below).

ssh auth01 rm /var/lib/tor/cached*
ssh auth01 rm /var/lib/tor/*.log
ssh auth01 rm /var/lib/tor/state
ssh auth01 rm -r /var/lib/tor/router-stability
ssh auth01 rm -r /var/lib/tor/sr-state
ssh auth01 rm -r /var/lib/tor/v3-status-votes
ssh auth01 rm -r /var/lib/tor/diff-cache

The client also seems to receive a complete consensus, at least all fingerprints of my setup show up if I fetch the file manually.

How do you fetch the file manually, and from where?

wget http://authip:7000/tor/server/all

which should be the cached-descriptors.new file on the authority (which also means it gets deleted on each new startup and must be fresh).

In this file I see all the fingerprints that are supposed to be there.

tor/server/all is a list of all relay descriptors that the authority knows about.

But the consensus is different: it contains the relays from the authorities'
votes, but only if those relays are reachable from the authorities
(the Running flag), and the authorities agree on enough info about the
relays.

Please check the votes and consensuses on each authority:
http://<hostname>/tor/status-vote/current/authority
http://<hostname>/tor/status-vote/current/consensus
http://<hostname>/tor/status-vote/current/consensus-microdesc

Source:
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3867
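
Fetching all three documents from every authority could be scripted roughly like this. It is a sketch under assumptions: the function names are made up, curl is assumed to be installed, and the host:port argument (for example auth01:7000) is taken from the names you used earlier in this thread.

```shell
# Print the three status document URLs for one authority.
# Usage: status_urls HOST:DIRPORT
status_urls() {
    for doc in authority consensus consensus-microdesc; do
        printf 'http://%s/tor/status-vote/current/%s\n' "$1" "$doc"
    done
}

# Fetch each document into OUTDIR/<host:port>.<document name>.
# Returns non-zero at the first download that fails.
# Usage: fetch_status HOST:DIRPORT OUTDIR
fetch_status() {
    for url in $(status_urls "$1"); do
        curl -sf -o "$2/$1.${url##*/}" "$url" || return 1
    done
}
```

Calling fetch_status once per authority and then diffing the saved files shows whether the authorities agree with each other.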

Then, check the cached consensus-microdesc files on each client.
(Clients and relays use the microdesc consensus by default.)
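
To see quickly how many relays, guards, and exits a consensus contains, you can count its "r" lines (one per relay) and the flags on its "s" lines. A rough sketch; the function names are made up, and the file path in the usage line assumes the client keeps its cache in /var/lib/tor.

```shell
# Count relay entries ("r " lines) in a consensus file.
# Usage: count_relays FILE
count_relays() {
    grep -c '^r ' "$1"
}

# Count relays whose flag line ("s " line) lists the given flag as a
# whole word, so that "Exit" does not match "BadExit".
# Usage: count_flag FILE Guard
count_flag() {
    grep -c -E "^s (.* )?$2( |\$)" "$1"
}

# Example (path is an assumption):
#   count_relays /var/lib/tor/cached-microdesc-consensus
#   count_flag   /var/lib/tor/cached-microdesc-consensus Guard
#   count_flag   /var/lib/tor/cached-microdesc-consensus Exit
```

Note that grep -c exits with a non-zero status when the count is 0, which matters if the calling script runs under set -e.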

It's also possible to connect to the client's control port and manually build circuits to all relays that should be there. This is an indicator that the client knows the relays (using a fingerprint that is not in the consensus would not work).

That's not how Tor works:

Clients randomly select relays from the consensus.

But when someone else specifies the relay, clients will happily
connect to that relay by fingerprint and IP address, even if the
relay isn't in the consensus. (The fingerprint is a hash of the
relay's identity key, which the client checks when it connects
to the relay.)

This feature exists so that the network still works when clients
tell relays and onion services about new relays. (There are a few
valid consensuses on the network at each point in time, and they
can contain different relays.)

Can you copy and paste the code you're using?

Again, guards also show up in the state files of the relays

Guard in=default rsa_id=C122CBB79DC660621E352D401AD7F781F8F6D62D nickname=relay03 sampled_on=2019-02-07T16:24:21 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=2B74825BE33752B21D17713F88D101F3BADC79BC nickname=relay06 sampled_on=2019-02-03T22:16:29 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=E4B1152CDF0E5FE697A3E916716FC363A2A0ACF3 nickname=relay07 sampled_on=2019-02-12T18:51:00 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=911EDA6CB639AAE955517F02AA4D651E0F7F6EFD nickname=relay02 sampled_on=2019-02-11T22:58:28 sampled_by=0.3.5.7 listed=1
Guard in=default rsa_id=8E574F0C428D235782061F44B2D20A66E4336993 nickname=relay05 sampled_on=2019-02-01T17:46:05 sampled_by=0.3.5.7 listed=1

The dates are still old, but I delete all states in the big cleanup procedure. Are there more old caches I need to remove? Where does the date information come from?

The dates are the time when Tor chose the guard.
Maybe you're not actually deleting the state file?
Maybe there's an undocumented state.new file?

What's in the directory after you run the script?

Removing specific files is inherently fragile: future Tor versions
may add new files.

Instead, configure different directories for CacheDirectory,
DataDirectory, and KeyDirectory. Then, delete and re-create
CacheDirectory and DataDirectory. Fail and refuse to start Tor
if the deletion and re-creation fails.
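
The delete-and-re-create step could look something like the sketch below. The function names are made up, and the directory paths in the usage line are placeholders for whatever you set CacheDirectory and DataDirectory to. If Tor runs as a different user than the script, the new directories also need the right owner, which this sketch does not handle.

```shell
# Delete and re-create one directory; return non-zero if either step fails.
reset_dir() {
    dir=${1:?reset_dir needs a directory}
    # Refuse the one argument that would be catastrophic.
    [ "$dir" != "/" ] || return 1
    rm -rf -- "$dir" || return 1
    # Tor expects its directories to be private, hence mode 700.
    mkdir -p -m 700 -- "$dir" || return 1
}

# Reset every directory given; stop at the first failure so that the
# caller can refuse to start Tor.
reset_tor_dirs() {
    for d in "$@"; do
        reset_dir "$d" || { echo "could not reset $d" >&2; return 1; }
    done
}

# Example (paths are placeholders):
#   reset_tor_dirs /var/lib/tor/cache /var/lib/tor/data && tor -f /etc/tor/torrc
```

Chaining the Tor start with && means Tor never starts on top of a directory that still holds files from a previous run.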

(Normally, relay operators want to keep info from previous
runs in DataDirectory, but your setup is a special case.)

You can also safely delete the short and medium term keys
in KeyDirectory. But it probably doesn't hurt to keep them.

For more info, see:
https://www.torproject.org/docs/tor-manual.html.en


I suggest that you start again with the same config, but remove all previous state.
(Move the cached state, consensuses, descriptors, and log files somewhere else. Do not remove the keys.)

Then you'll know if your current network actually works.

Questions are: Why does the client know all the relays' fingerprints but the network still has problems finishing the bootstrapping and building a complete circuit? Are there any other things I should look into and check to understand the problem?

I think I answered these questions above in context.

Let me know if you're still having trouble.

T