Hi all,
Not content to let you have all the fun, I decided to run my own Tor network!
Kidding ;) But the Directory Authorities, the crappy experiment leading up to Black Hat, and the promise that one can recreate the Tor Network in the event of some catastrophe interest me enough that I decided to investigate it. I'm aware of Chutney and Shadow, but I wanted it to feel as authentic as possible, so I forwent those and just ran full-featured, independent tor daemons. I explicitly wanted to avoid setting TestingTorNetwork. I did have to edit a few other parameters, but very few. [0]
I plan on doing a blog post, giving a HOWTO, but I thought I'd write about my experience so far. I've found a number of interesting issues that arise in the bootstrapping of a non-TestingTorNetwork, mostly around reachability testing.
-----
One of the first things I ran into was a problem where I could not get any routers to upload descriptors. Before uploading a descriptor, an OR tests its own reachability by building a circuit back to itself - a check that is bypassed with AssumeReachable or TestingTorNetwork. This works fine for Chutney and Shadow, as they reach into the OR and set AssumeReachable. But if the Tor Network were to be rebooted... most nodes out there would _not_ have AssumeReachable set, and they would not be able to perform self-testing with a consensus consisting of just Directory Authorities. I think nodes left running would be okay, but nodes restarted would be stuck in a startup loop. I imagine what would actually happen is Noisebridge and TorServers and a few other close friends would set the flag, they would get into the consensus, and then the rest of the network would start coming back... (Or possibly a few nodes could anticipate this problem ahead of time, and set it now.)
What I had to do was make one of my Directory Authorities an exit - this let the other nodes start building circuits through the authorities and upload descriptors. Maybe an OR should have logic that if it has a valid consensus with no Exit nodes, it should assume it's reachable and send a descriptor - and then let the Directory Authorities perform reachability tests for whether or not to include it? From the POV of an intentional DoS - an OR doesn't have to obey the reachability test of course, so no change there. It could potentially lead to an unintentional DoS where all several thousand routers start slamming the DirAuths as soon as a usable-but-blank consensus is found... but AFAIK routers probe for a consensus based on semi-random timing anyway, so that may mitigate that?
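To illustrate, here's roughly the torrc I'd expect such a 'close friend' relay to use during a reboot. AssumeReachable and DirAuthority are real options, but the nicknames, addresses, ports, and fingerprints below are placeholders from my test setup:

  # Sketch only: skip self-testing and publish a descriptor immediately.
  AssumeReachable 1
  # Point at whichever authorities survived (all values here are made up):
  DirAuthority auth1 orport=15001 v3ident=<auth1 v3 key ID> 10.0.0.1:15030 <auth1 identity fingerprint>
  DirAuthority auth2 orport=15001 v3ident=<auth2 v3 key ID> 10.0.0.2:15030 <auth2 identity fingerprint>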
-----
Another problem I ran into was that nodes couldn't conduct reachability tests when I had exits that were only using the Reduced Exit Policy - because it doesn't list the ORPort/DirPort! (I was using nonstandard ports actually, but indeed the reduced exit policy does not include 9001 or 9030.) Looking at the current consensus, there are 40 exits that exit to all ports, and 400-something exits that use the ReducedExitPolicy. It seems like 9001 and 9030 should probably be added to that for reachability tests?
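For concreteness, the change I'm imagining is just two accept lines added ahead of the final reject (shown with the default ports; my network's were nonstandard):

  # Sketch: let reduced-exit relays carry reachability-test streams.
  ExitPolicy accept *:9001   # default ORPort
  ExitPolicy accept *:9030   # default DirPort
  # ... the rest of the Reduced Exit Policy as usual ...
  ExitPolicy reject *:*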
-----
Continuing in this thread, another problem I hit was that (I believe) nodes expect the 'Stable' flag when conducting certain reachability tests. I'm not 100% certain - it may not prevent the relay from uploading a descriptor, but it seems like if no acceptable exit node is Stable - some reachability tests will be stuck. I see these sorts of errors when there is no stable Exit node (the node generating the errors is in fact a Stable Exit though, so it clearly uploaded its descriptor and keeps running):

Oct 13 14:49:46.000 [warn] Making tunnel to dirserver failed.
Oct 13 14:49:46.000 [warn] We just marked ourself as down. Are your external addresses reachable?
Oct 13 14:50:47.000 [notice] No Tor server allows exit to [scrubbed]:25030. Rejecting.
Since ORPort/DirPort are not in the ReducedExitPolicy, this (may?) restrict the number of nodes available for conducting a reachability test. I think the Stable flag is calculated off the average age of the network though, so the only time this would cause a big problem is when the network (DirAuths) has been running for a little while and a full exit node hasn't yet been added - the network would have to wait longer for the full exit node to earn the Stable flag.
-----
Getting a BWAuth running was... nontrivial. Some of the things I found:

- SQLAlchemy 0.7.x is no longer supported. 0.9.x does not work, nor 0.8.x. 0.7.10 does.
- Several quasi-bugs with the code/documentation (the earliest three commits here: https://github.com/tomrittervg/torflow/commits/tomedits)
- The bandwidth scanner actively breaks in certain situations of divide-by-zero (https://github.com/tomrittervg/torflow/commit/053dfc17c0411dac0f6c4e43954f90...)
- The scanner will be perpetually stuck if you're sitting on the same /16 and you don't perform the equivalent of EnforceDistinctSubnets 0 [1]
Ultimately, while I successfully produced a bandwidth file [2], I wasn't convinced it was meaningful. There is a tremendous amount of code complexity buried beneath the statement 'Scan the nodes and see how fast they are', and a tremendous amount of informational complexity behind 'Weight the nodes so users can pick a good stream'.
-----
I tested what it would look like if an imposter DirAuth started trying to participate in the consensus. It generated the warning you would expect:
Oct 14 00:04:31.000 [warn] Got a vote from an authority (nickname authimposter, address W.X.Y.Z) with authority key ID Z. This key ID is not recognized. Known v3 key IDs are: A, B, C, D
But it also generated a warning you would not expect, and sent me down a rabbit hole for a while:
Oct 10 21:44:56.000 [debug] directory_handle_command_post(): Received POST command.
Oct 10 21:44:56.000 [debug] directory_handle_command_post(): rewritten url as '"/tor/post/consensus-signature"'.
Oct 10 21:44:56.000 [notice] Got a signature from W.X.Y.Z. Adding it to the pending consensus.
Oct 10 21:44:56.000 [info] dirvote_add_signatures_to_pending_consensus(): Have 1 signatures for adding to ns consensus.
Oct 10 21:44:56.000 [info] dirvote_add_signatures_to_pending_consensus(): Added -1 signatures to consensus.
Oct 10 21:44:56.000 [info] dirvote_add_signatures_to_pending_consensus(): Have 1 signatures for adding to microdesc consensus.
Oct 10 21:44:56.000 [info] dirvote_add_signatures_to_pending_consensus(): Added -1 signatures to consensus.
Oct 10 21:44:56.000 [warn] Unable to store signatures posted by W.X.Y.Z: Mismatched digest.
Over on the imposter:
Oct 14 00:19:32.000 [warn] http status 400 ("Mismatched digest.") response after uploading signatures to dirserver 'W.X.Y.Z:15030'. Please correct.
The imposter DirAuth is sending up a signature for a consensus that is not the same consensus that the rest of the DirAuths computed. Specifically, the imposter DirAuth lists itself as a dir-source and the signature covers this line. (Everything else matches because the imposter has been outvoted and respects that.)
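To make that concrete: per dir-spec, the consensus carries one dir-source line per voting authority, and the imposter's version of the document has an extra one for itself. Something like this sketch (all values invented):

  dir-source auth1 <v3 key ID> 10.0.0.1 10.0.0.1 15030 15001
  dir-source auth2 <v3 key ID> 10.0.0.2 10.0.0.2 15030 15001
  dir-source authimposter <v3 key ID> W.X.Y.Z W.X.Y.Z 25030 25001   <- only in the imposter's copy

so the digest its signature covers can never match the real consensus.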
I guess the lesson is, if you see the "Mismatched digest" warning in conjunction with the unrecognized key ID - it's just one issue, not two.
-----
The notion and problems of an imposter DirAuth also come up when considering how the network behaves while adding a DirAuth. I started with 4 authorities, then started a fifth (auth5). Not interesting - it behaved just as in the imposter scenario.
I then added auth5 to a single DirAuth (auth1) as a trusted DirAuth. This resulted in a consensus with 3 signatures, as auth1 did not sign the consensus. On auth1 I got warn messages:

A consensus needs 3 good signatures from recognized authorities for us to accept it. This one has 2 (auth1 auth5). 3 (auth2 auth3 auth4) of the authorities we know didn't sign it.
I then added auth5 to a second DirAuth (auth2) as a trusted DirAuth. This resulted in a consensus for auth1, auth2, and auth5 - but auth3 and auth4 did not sign it or produce a consensus. Because the consensus was only signed by 2 of the 4 Auths (i.e., not a majority) - it was rejected by the relays (which did not list auth5). At this point something interesting and unexpected happened:
The other 2 DirAuths (not knowing about auth5) did not have a consensus. This tricked dirvote_recalculate_timing into thinking we should use the TestingV3AuthInitialVotingInterval parameters, so they got out of sync with the other 3 DirAuths (that did know about auth5). That if/else statement seems very odd, and the parameters seem odd as well. First off, I'm not clear what the parameters are intended to represent. The man page says:
TestingV3AuthInitialVotingInterval N minutes|hours
    Like V3AuthVotingInterval, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 30 minutes)

TestingV3AuthInitialVoteDelay N minutes|hours
    Like TestingV3AuthInitialVoteDelay, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 5 minutes)

TestingV3AuthInitialDistDelay N minutes|hours
    Like TestingV3AuthInitialDistDelay, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 5 minutes)
Notice that the first says "Like V3AuthVotingInterval", but the other two just repeat their name? And how there _is no_ V3AuthInitialVotingInterval? And that you can't modify these parameters without turning on TestingTorNetwork (despite the fact that they will be used even without TestingTorNetwork)? And also, unrelated to the naming, these parameters are a fallback case for when we don't have a consensus, but if they're not kept in sync with V3AuthVotingInterval and their kin - the DirAuth can wind up completely out of sync and be unable to recover (except by luck).
It seems like these parameters should be renamed to V3AuthInitialXXX, keep their existing defaults, lose the requirement on TestingTorNetwork, and be documented as needing to be divisors of the corresponding V3AuthXXX parameters, so that a DirAuth that has tripped into them is able to recover.
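As a sketch of what I mean, using the hypothetical post-rename option names (these don't exist today):

  # Hypothetical torrc for a DirAuth; the Initial intervals divide evenly
  # into the steady-state ones, so an authority that falls back to them
  # eventually lands on a real voting boundary and can rejoin the schedule.
  V3AuthVotingInterval 10 minutes
  V3AuthInitialVotingInterval 5 minutes    # divisor of V3AuthVotingInterval
  V3AuthVoteDelay 30 seconds
  V3AuthInitialVoteDelay 30 seconds
  V3AuthDistDelay 30 seconds
  V3AuthInitialDistDelay 30 seconds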
I have a number of other situations I want to test around adding, subtracting, and manipulating traffic to a DirAuth to see if there are other strange situations that can arise.
-----
Other notes:

- I was annoyed by TestingAuthDirTimeToLearnReachability several times (as I refused to turn on TestingTorNetwork) - I wanted to override it. I thought maybe that should be an option, but ultimately convinced myself that in the event of a network reboot, the 30 minutes would likely still be needed.
- The Directory Authority information is a bit out of date. Specifically, I was most confused by V1 vs V2 vs V3 Directories. I am not sure if the actual network's DirAuths set V1AuthoritativeDirectory or V2AuthoritativeDirectory - but I eventually convinced myself that only V3AuthoritativeDirectory was needed.
- It seems like an Authority will not vote for itself as an HSDir or Stable... but I couldn't find precisely where that was in the code. (It makes sense to not vote itself Stable, but I'm not sure why HSDir...)
- The networkstatus-bridges file is not included in the tor man page.
- I feel like the log message "Consensus includes unrecognized authority" (currently info) is worthy of being upgraded to notice.
- While debugging, I feel this patch would be helpful. [3]
- I've had my eye on Proposal 164 for a bit, so I'm keeping that in mind.
- I wanted the https://consensus-health.torproject.org/ page for my network, but didn't want to run the java code, so I ported it to python. This project is growing, and right now I've been editing consensus_health_checker.py as well. https://github.com/tomrittervg/doctor/commits/python-website I have a few more TODOs for it (like download statistics), but it's coming along.
-----
Finally, something I wanted to ask after was the idea of a node (an OR, not a client) belonging to two or more Tor networks. From the POV of the node operator, I would see it as a node would add some config lines (maybe 'AdditionalDirServer' to add to, rather than redefining, the default DirServers), and it would upload its descriptors to those as well, fetch a consensus from all AdditionalDirServers, and allow connections from and to nodes in either. I'm still reading through the code to see which areas would be particularly confusing in the context of multiple consensuses, but I thought I'd throw it out there.
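As a strawman, the torrc for such a node might look like this (AdditionalDirServer is invented, and the syntax is borrowed from DirAuthority; everything here is hypothetical):

  # Hypothetical: participate in the default network as usual, and ALSO
  # publish to / fetch from a second network's authorities.
  AdditionalDirServer netB-auth1 orport=15001 v3ident=<key ID> 10.1.0.1:15030 <fingerprint>
  AdditionalDirServer netB-auth2 orport=15001 v3ident=<key ID> 10.1.0.2:15030 <fingerprint>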
-tom
-----
[0]

AuthoritativeDirectory 1
V3AuthoritativeDirectory 1
VersioningAuthoritativeDirectory 1
RecommendedClientVersions [stuff]
RecommendedServerVersions [stuff]
ConsensusParams [stuff]
AuthDirMaxServersPerAddr 0
AuthDirMaxServersPerAuthAddr 0
V3AuthVotingInterval 5 minutes
V3AuthVoteDelay 30 seconds
V3AuthDistDelay 30 seconds
V3AuthNIntervalsValid 3
MinUptimeHidServDirectoryV2 1 hour
-----
[1]

diff --git a/NetworkScanners/BwAuthority/bwauthority_child.py b/NetworkScanners/BwAuthority/bwauthority_child.py
index 28b89c2..e07718f 100755
--- a/NetworkScanners/BwAuthority/bwauthority_child.py
+++ b/NetworkScanners/BwAuthority/bwauthority_child.py
@@ -60,7 +60,7 @@ __selmgr = PathSupport.SelectionManager(
       percent_fast=100,
       percent_skip=0,
       min_bw=1024,
-      use_all_exits=False,
+      use_all_exits=True,
       uniform=True,
       use_exit=None,
       use_guards=False,
-----
[2]

node_id=$C447A9E99C66A96E775A5EF7A8B0DF96C414D0FE bw=37 nick=relay4 measured_at=1413257649 updated_at=1413257649 pid_error=0.0681488657221 pid_error_sum=0 pid_bw=59064 pid_delta=0 circ_fail=0.0
node_id=$70145044B8C20F46F991B7A38D9F27D157B1CB9D bw=37 nick=relay5 measured_at=1413257649 updated_at=1413257649 pid_error=0.0583026021009 pid_error_sum=0 pid_bw=59603 pid_delta=0 circ_fail=0.0
node_id=$7FAC1066DCCC0C62984B8E579C5AABBBAE8146B2 bw=37 nick=exit2 measured_at=1413257649 updated_at=1413257649 pid_error=0.0307144741235 pid_error_sum=0 pid_bw=55938 pid_delta=0 circ_fail=0.0
node_id=$9838F41EB01BA62B7AA67BDA942AC4DC3B2B0F98 bw=37 nick=exit3 measured_at=1413257649 updated_at=1413257649 pid_error=0.0124944051714 pid_error_sum=0 pid_bw=55986 pid_delta=0 circ_fail=0.0
node_id=$49090AC6DB52AD8FFF95AF1EC1E898126A9E5CA6 bw=37 nick=relay3 measured_at=1413257649 updated_at=1413257649 pid_error=0.0030073731241 pid_error_sum=0 pid_bw=56489 pid_delta=0 circ_fail=0.0
node_id=$F5C43BB6AD2256730197533596930A8DD7BEC367 bw=37 nick=exit1 measured_at=1413257649 updated_at=1413257649 pid_error=-0.0032777385693 pid_error_sum=0 pid_bw=55114 pid_delta=0 circ_fail=0.0
node_id=$3D53FF771CC3CB9DE0A55C33E5E8DA4238C96AB5 bw=37 nick=relay2 measured_at=1413257649 updated_at=1413257649 pid_error=-0.0418210520821 pid_error_sum=0 pid_bw=51021 pid_delta=0 circ_fail=0.0
-----
[3]

diff --git a/src/or/networkstatus.c b/src/or/networkstatus.c
index 890da0a..4d72add 100644
--- a/src/or/networkstatus.c
+++ b/src/or/networkstatus.c
@@ -1442,6 +1442,8 @@ networkstatus_note_certs_arrived(void)
                                   waiting_body,
                                   networkstatus_get_flavor_name(i),
                                   NSSET_WAS_WAITING_FOR_CERTS)) {
+        log_info(LD_DIR, "After fetching certificates, we were able to "
+                 "accept the consensus.");
         tor_free(waiting_body);
       }
     }
18:21 < nickm> tjr: are you still around?
18:26 < tjr> nickm: Yup
18:27 < nickm> So, speaking generally as a reaction: Yeah! Bootstrapping Tor from zero should work better and be easier. If you want to push us that way, we'll get there. If not, there are other ways you can help us get there.
18:28 < nickm> My first suggestion would be: make a master ticket on trac, then open a bunch of child tickets.
18:28 < nickm> (Or open a bunch of tickets with the same keyword)
18:28 < nickm> and then let's solve this foolishness and bring about the new golden age^W^W^Wrestorable tor network
18:29 < nickm> Also: Cool! Thanks for doing these tests!
18:30 < mrphs> GeKo: yeah but eh, brade's comment :/ I should probbaly reply there
18:30 < tjr> haha awesome. I will go the master ticket route, and attach some easy tickets with suggestions/patches and harder tickets that may have patches eventually
18:30 < nickm> wrt a dirauth accepted by some but not all dirauths: This is explicitly not handled by the dirauth design. But it would be cool if our response were better in that case.
18:31 < tjr> nickm: Ah okay, that probably explains some stuff
18:32 < tjr> When you add/subtract one, do all the DirAuths have a flag day?
18:32 < nickm> basically yeah.
18:32 < nickm> It would be nice to make that a more tolerant flag day...
18:32 < tjr> The voting interval on Prod is an hour, so if you time it right, the Running flag issue won't arise, but otherwise it seems risky
18:32 < nickm> but the problem of how to have everybody who has participated in a vote agree on the outcome of the vote when they can't agree on who the voters are... is not a solved problem today
18:34 < nickm> Joining two tor networks is a cool idea.
18:34 < nickm> I don't think it's supported though
18:34 < nickm> Please though, spam my inbox with a huge pile of trac emails!
18:34 < tjr> RE: agreeing on consensus but not voters. Yea, definetly I'm pretty nervous about mucking around in that area - I'm going to have to think about it quite a bit and do a lot of simulations
18:35 < tjr> I'm sure it's not supported, I'm just not entirely sure how much logic inside an OR would get deathly confused by having to support it.
18:35 < tjr> Mostly I'm wondering about parameters being on for one and off for the other
18:35 < nickm> Think also about the basic results in byzantine fault tolerance. With >1/3 parties corrupt, no consensus can be reached with any protocol.
18:36 < nickm> btw, okay with you if I copy-and-paste this conversation to tor-dev in response to your email? :)
18:36 < tjr> Of course
Hi Tom!
Neat stuff. Let me try to point you in useful directions.
On Wed, Oct 15, 2014 at 08:39:12PM -0500, Tom Ritter wrote:
> One of the first things I ran into was a problem where I could not get any routers to upload descriptors. [...] I imagine what would actually happen is Noisebridge and TorServers and a few other close friends would set the flag, they would get into the consensus, and then the rest of the network would start coming back...
Yep -- that seems like an adequate plan. Given that the Tor network has been running for the last 12 or 13 years with exactly zero downtime, and we have a plausible way of easily getting it back going if we need to, I'm not worried.
> What I had to do was make one of my Directory Authorities an exit - this let the other nodes start building circuits through the authorities and upload descriptors.
This part seems surprising to me -- directory authorities always publish their dirport whether they've found it reachable or not, and relays publish their descriptors directly to the dirport of each directory authority (not through the Tor network).
So maybe there's a bug that you aren't describing, or maybe you are misunderstanding what you saw?
See also https://trac.torproject.org/projects/tor/ticket/11973
> Another problem I ran into was that nodes couldn't conduct reachability tests when I had exits that were only using the Reduced Exit Policy - because it doesn't list the ORPort/DirPort! (I was using nonstandard ports actually, but indeed the reduced exit policy does not include 9001 or 9030.) Looking at the current consensus, there are 40 exits that exit to all ports, and 400-something exits that use the ReducedExitPolicy. It seems like 9001 and 9030 should probably be added to that for reachability tests?
The reachability tests for the ORPort involve extending the circuit to the ORPort -- which doesn't use an exit stream. So your relays should have been able to find themselves reachable, and published a descriptor, even with no exit relays in the network.
But I think you're right that they would have opted to list their dirport as 0, since they would not have been able to verify that it's reachable. And that in turn would have caused clients to skip over them and ask their questions to the directory authorities, since they're the only ones advertising (with a non-zero dirport) that they know how to answer directory questions.
So it would work, but it would be non-ideal from a scalability perspective.
And once https://trac.torproject.org/projects/tor/ticket/12538 is resolved it will work more smoothly anyway.
> Continuing in this thread, another problem I hit was that (I believe) nodes expect the 'Stable' flag when conducting certain reachability tests. I'm not 100% certain - it may not prevent the relay from uploading a descriptor, but it seems like if no acceptable exit node is Stable - some reachability tests will be stuck. I see these sorts of errors when there is no stable Exit node (the node generating the errors is in fact a Stable Exit though, so it clearly uploaded its descriptor and keeps running):
In consider_testing_reachability() we call
circuit_launch_by_extend_info(CIRCUIT_PURPOSE_TESTING, ei, CIRCLAUNCH_NEED_CAPACITY|CIRCLAUNCH_IS_INTERNAL);
So the ORPort reachability test doesn't require the Stable flag.
The DirPort reachability test just launches a new stream that attaches to circuits like normal, so whether it prefers the Stable flag will be a function of whether the destination DirPort is in the LongLivedPorts set -- usually not I think.
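For reference, the default LongLivedPorts set is (if I'm remembering the man page right) all interactive application ports, nothing directory-flavored:

  LongLivedPorts 21, 22, 706, 1863, 5050, 5190, 5222, 5223, 6523, 6667, 6697, 8300

so a DirPort like your 25030 would only get the Stable preference if somebody added it there explicitly.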
> Oct 13 14:49:46.000 [warn] Making tunnel to dirserver failed.
> Oct 13 14:49:46.000 [warn] We just marked ourself as down. Are your external addresses reachable?
> Oct 13 14:50:47.000 [notice] No Tor server allows exit to [scrubbed]:25030. Rejecting.
That sure looks like a failed dirport reachability test. Nothing necessarily to do with the Stable flag.
> Getting a BWAuth running was... nontrivial.
> [...]
> There is a tremendous amount of code complexity buried beneath the statement 'Scan the nodes and see how fast they are', and a tremendous amount of informational complexity behind 'Weight the nodes so users can pick a good stream'.
Yeah, no kidding. And worse, its voodoo is no longer correctly tuned for the current network. And it's not robust to intentional lying attacks either. See e.g. the paragraph at the very end of https://lists.torproject.org/pipermail/tor-reports/2014-October/000675.html
> dirvote_add_signatures_to_pending_consensus(): Added -1 signatures to consensus.
This one looks like a simple (harmless) bug. The code is
  r = networkstatus_add_detached_signatures(pc->consensus, sigs, source,
                                            severity, msg_out);
  log_info(LD_DIR,"Added %d signatures to consensus.", r);
and it shouldn't be logging that if r is < 0.
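The fix is presumably just to gate the log line on the return value, something like:

  r = networkstatus_add_detached_signatures(pc->consensus, sigs, source,
                                            severity, msg_out);
  if (r >= 0)
    log_info(LD_DIR, "Added %d signatures to consensus.", r);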
> I then added auth5 to a second DirAuth (auth2) as a trusted DirAuth. This resulted in a consensus for auth1, auth2, and auth5 - but auth3 and auth4 did not sign it or produce a consensus. Because the consensus was only signed by 2 of the 4 Auths (i.e., not a majority) - it was rejected by the relays (which did not list auth5).
Right -- when you change the set of directory authorities, you need to get a sufficient clump of them to change all at once. This coordination has been a real hassle as we grow the number of directory authorities, and it's one of the main reasons we don't have more currently.
> At this point something interesting and unexpected happened:
>
> The other 2 DirAuths (not knowing about auth5) did not have a consensus. This tricked dirvote_recalculate_timing into thinking we should use the TestingV3AuthInitialVotingInterval parameters, so they got out of sync with the other 3 DirAuths (that did know about auth5). That if/else statement seems very odd, and the parameters seem odd as well. First off, I'm not clear what the parameters are intended to represent. The man page says:
>
> TestingV3AuthInitialVotingInterval N minutes|hours
>     Like V3AuthVotingInterval, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 30 minutes)
>
> TestingV3AuthInitialVoteDelay N minutes|hours
>     Like TestingV3AuthInitialVoteDelay, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 5 minutes)
>
> TestingV3AuthInitialDistDelay N minutes|hours
>     Like TestingV3AuthInitialDistDelay, but for initial voting interval before the first consensus has been created. Changing this requires that TestingTorNetwork is set. (Default: 5 minutes)
Basically, if you didn't make a consensus, you try to make one every half hour rather than every hour, on the theory that the network should recover faster if a lot of authorities are in this boat.
> Notice that the first says "Like V3AuthVotingInterval", but the other two just repeat their name?
This was fixed in git commit c03cfc05, and I think the fix went into Tor 0.2.4.13-alpha. What ancient version is your man page from?
> And how there _is no_ V3AuthInitialVotingInterval? And that you can't modify these parameters without turning on TestingTorNetwork (despite the fact that they will be used even without TestingTorNetwork)? And also, unrelated to the naming, these parameters are a fallback case for when we don't have a consensus, but if they're not kept in sync with V3AuthVotingInterval and their kin - the DirAuth can wind up completely out of sync and be unable to recover (except by luck).
Yeah, don't mess with them unless you know what you're doing.
As for the confusing names, you're totally right: https://trac.torproject.org/projects/tor/ticket/11967
> Other notes:
>
> - I was annoyed by TestingAuthDirTimeToLearnReachability several times (as I refused to turn on TestingTorNetwork) - I wanted to override it. I thought maybe that should be an option, but ultimately convinced myself that in the event of a network reboot, the 30 minutes would likely still be needed.
Right. If you want to crank down this '30 minutes' value while actually relying on the reachability tests, you will also need to crank up the fraction of the network that gets tested on each call to dirserv_test_reachability() -- i.e. REACHABILITY_MODULO_PER_TEST and REACHABILITY_TEST_INTERVAL.
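(For scale: if I'm remembering the defaults right, REACHABILITY_TEST_INTERVAL is 10 seconds and REACHABILITY_MODULO_PER_TEST is 128, so a full pass over the relay identity space takes 128 * 10 = 1280 seconds, a bit over 21 minutes -- which is why 30 minutes is about the floor for TestingAuthDirTimeToLearnReachability with the stock code.)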
> - The Directory Authority information is a bit out of date. Specifically, I was most confused by V1 vs V2 vs V3 Directories. I am not sure if the actual network's DirAuths set V1AuthoritativeDirectory or V2AuthoritativeDirectory - but I eventually convinced myself that only V3AuthoritativeDirectory was needed.
Correct. Can you submit a ticket to fix this, wherever you found it? Assuming it wasn't from your ancient man page that is? :)
> - It seems like an Authority will not vote for itself as an HSDir or Stable... but I couldn't find precisely where that was in the code. (It makes sense to not vote itself Stable, but I'm not sure why HSDir...)
I think this is a bug. Mostly a harmless one in practice, but it might be surprising in a tiny test network.
> - The networkstatus-bridges file is not included in the tor man page.
Yep. Please file a ticket.
> - I feel like the log message "Consensus includes unrecognized authority" (currently info) is worthy of being upgraded to notice.
I don't think this is wise -- it is fine to have a consensus that has been signed by a newer authority than you know about, so long as it has enough signatures from ones you do know about.
If we made this a notice, then every time we added a new authority, all the users running stable would see scary-sounding log messages and report them to us over and over.
> - I wanted the https://consensus-health.torproject.org/ page for my network, but didn't want to run the java code, so I ported it to python. This project is growing, and right now I've been editing consensus_health_checker.py as well. https://github.com/tomrittervg/doctor/commits/python-website I have a few more TODOs for it (like download statistics), but it's coming along.
Neat! Karsten has been wanting to get rid of the consensus-health page for a while now. Maybe you want to run the replacement?
> Finally, something I wanted to ask after was the idea of a node (an OR, not a client) belonging to two or more Tor networks. From the POV of the node operator, I would see it as a node would add some config lines (maybe 'AdditionalDirServer' to add to, rather than redefining, the default DirServers), and it would upload its descriptors to those as well, fetch a consensus from all AdditionalDirServers, and allow connections from and to nodes in either. I'm still reading through the code to see which areas would be particularly confusing in the context of multiple consensuses, but I thought I'd throw it out there.
This idea should work in theory. In fact, back when Ironkey was running their own Tor network, I joked periodically about just dumping the cached-descriptors file from their network into moria1's cached-descriptors file. I think that by itself would have been sufficient to add all of those relays into our Tor network.
We're slowly accumulating situations where we want all the relays to know about all the relays (e.g. RefuseUnknownExits), but I don't think the world ends when it isn't quite true.
Thanks! --Roger
On 22 October 2014 05:48, Roger Dingledine <arma@mit.edu> wrote:
>> What I had to do was make one of my Directory Authorities an exit - this let the other nodes start building circuits through the authorities and upload descriptors.
> This part seems surprising to me -- directory authorities always publish their dirport whether they've found it reachable or not, and relays publish their descriptors directly to the dirport of each directory authority (not through the Tor network).
>
> So maybe there's a bug that you aren't describing, or maybe you are misunderstanding what you saw?
>
> See also https://trac.torproject.org/projects/tor/ticket/11973
>> Another problem I ran into was that nodes couldn't conduct reachability tests when I had exits that were only using the Reduced Exit Policy - because it doesn't list the ORPort/DirPort! (I was using nonstandard ports actually, but indeed the reduced exit policy does not include 9001 or 9030.) Looking at the current consensus, there are 40 exits that exit to all ports, and 400-something exits that use the ReducedExitPolicy. It seems like 9001 and 9030 should probably be added to that for reachability tests?
> The reachability tests for the ORPort involve extending the circuit to the ORPort -- which doesn't use an exit stream. So your relays should have been able to find themselves reachable, and published a descriptor, even with no exit relays in the network.
I think I traced down the source of the behavior I saw. In brief, I don't think reachability tests happen when there are no Exit nodes because of a quirk in the bootstrapping process, where we never think we have a minimum of directory information:
Nov 09 22:10:26.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more descriptors: we have 5/5, and can only build 0% of likely paths. (We have 100% of guards bw, 100% of midpoint bw, and 0% of exit bw.)
In long form: https://trac.torproject.org/projects/tor/ticket/13718
>> Continuing in this thread, another problem I hit was that (I believe) nodes expect the 'Stable' flag when conducting certain reachability tests. I'm not 100% certain - it may not prevent the relay from uploading a descriptor, but it seems like if no acceptable exit node is Stable - some reachability tests will be stuck. I see these sorts of errors when there is no stable Exit node (the node generating the errors is in fact a Stable Exit though, so it clearly uploaded its descriptor and keeps running):
> In consider_testing_reachability() we call
>
>   circuit_launch_by_extend_info(CIRCUIT_PURPOSE_TESTING, ei, CIRCLAUNCH_NEED_CAPACITY|CIRCLAUNCH_IS_INTERNAL);
>
> So the ORPort reachability test doesn't require the Stable flag.
You're right, reachability doesn't depend on Stable, sorry.
>> I then added auth5 to a second DirAuth (auth2) as a trusted DirAuth. This resulted in a consensus for auth1, auth2, and auth5 - but auth3 and auth4 did not sign it or produce a consensus. Because the consensus was only signed by 2 of the 4 Auths (i.e., not a majority) - it was rejected by the relays (which did not list auth5).
> Right -- when you change the set of directory authorities, you need to get a sufficient clump of them to change all at once. This coordination has been a real hassle as we grow the number of directory authorities, and it's one of the main reasons we don't have more currently.
I'm going to try thinking more about this problem.
> This was fixed in git commit c03cfc05, and I think the fix went into Tor 0.2.4.13-alpha. What ancient version is your man page from?
/looks sheepish I was using http://linux.die.net/man/1/tor because it's very quick to pull up :-p
>> And how there _is no_ V3AuthInitialVotingInterval? And that you can't modify these parameters without turning on TestingTorNetwork (despite the fact that they will be used even without TestingTorNetwork)? And also, unrelated to the naming, these parameters are a fallback case for when we don't have a consensus, but if they're not kept in sync with V3AuthVotingInterval and their kin - the DirAuth can wind up completely out of sync and be unable to recover (except by luck).
> Yeah, don't mess with them unless you know what you're doing.
>
> As for the confusing names, you're totally right: https://trac.torproject.org/projects/tor/ticket/11967
Ahha.
>> - The Directory Authority information is a bit out of date. Specifically, I was most confused by V1 vs V2 vs V3 Directories. I am not sure if the actual network's DirAuths set V1AuthoritativeDirectory or V2AuthoritativeDirectory - but I eventually convinced myself that only V3AuthoritativeDirectory was needed.
> Correct. Can you submit a ticket to fix this, wherever you found it? Assuming it wasn't from your ancient man page that is? :)
It was.
>> - The networkstatus-bridges file is not included in the tor man page.
> Yep. Please file a ticket.
https://trac.torproject.org/projects/tor/ticket/13713
>> - I feel like the log message "Consensus includes unrecognized authority" (currently info) is worthy of being upgraded to notice.
> I don't think this is wise -- it is fine to have a consensus that has been signed by a newer authority than you know about, so long as it has enough signatures from ones you do know about.
>
> If we made this a notice, then every time we added a new authority, all the users running stable would see scary-sounding log messages and report them to us over and over.
That's fair.
>> - I wanted the https://consensus-health.torproject.org/ page for my network, but didn't want to run the java code, so I ported it to python. This project is growing, and right now I've been editing consensus_health_checker.py as well. https://github.com/tomrittervg/doctor/commits/python-website I have a few more TODOs for it (like download statistics), but it's coming along.
> Neat! Karsten has been wanting to get rid of the consensus-health page for a while now. Maybe you want to run the replacement?
Yes, I think that is going to happen: https://trac.torproject.org/projects/tor/ticket/13637
>> Finally, something I wanted to ask after was the idea of a node (an OR, not a client) belonging to two or more Tor networks. From the POV of the node operator, I would see it as a node would add some config lines (maybe 'AdditionalDirServer' to add to, rather than redefining, the default DirServers), and it would upload its descriptors to those as well, fetch a consensus from all AdditionalDirServers, and allow connections from and to nodes in either. I'm still reading through the code to see which areas would be particularly confusing in the context of multiple consensuses, but I thought I'd throw it out there.
> This idea should work in theory. In fact, back when Ironkey was running their own Tor network, I joked periodically about just dumping the cached-descriptors file from their network into moria1's cached-descriptors file. I think that by itself would have been sufficient to add all of those relays into our Tor network.
Curious. I will try running this idea down a bit more.
> We're slowly accumulating situations where we want all the relays to know about all the relays (e.g. RefuseUnknownExits), but I don't think the world ends when it isn't quite true.
Sure, but that only matters if you're trying to bridge Tor networks without cooperation - a Tor node that wants to sit on two networks wouldn't have a problem knowing about all the nodes in each. And a Tor client using network A or B would route only through that network. I didn't imagine them as interleaving, I imagined them as separate, with some relay operators opting to move traffic for both.
-tom