metrics: collecting circuit build failures from relays to detect network reachability issues and broken relays - tor-dev

18 Oct 2017


Hello,
In [1] David describes his preliminary results from scanning a portion
of the tor network to detect connectivity problems (partitions) in the
presumed tor mesh network (where every relay is expected to be able to
reach every other relay).
I believe this information (How far off are we from a complete mesh
network?) is crucial for the anonymity properties of the tor network and
collecting the data to answer that question should be an integral and
continuous part of tor.
Actively scanning the tor network for connectivity issues is resource
intensive. What if we could collect this data without actively scanning
for it?
This could be achieved by having relays collect reachability information
passively themselves and upload it via extra-info descriptors.
This data could help reduce the overall scanning effort. So instead of
scanning _all_ relays we can reduce the scanning to relays where some
threshold of relays consistently reported that they are hard to reach
(to confirm their measurements).
Due to the mutual nature of the information collection, a single lying
relay (or a small number of them) would not be a big issue.
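To illustrate how such a threshold could work, here is a minimal Python
sketch. It assumes the collected reports have already been reduced to
(reporter, target) fingerprint pairs; the function name and the
min_reporters value are hypothetical and not part of any existing tool.

```python
from collections import defaultdict

def select_scan_targets(reports, min_reporters=3):
    """Return targets flagged as hard to reach by enough distinct reporters.

    reports: iterable of (reporter_fingerprint, target_fingerprint) pairs.
    min_reporters: hypothetical threshold, would need tuning on real data.
    """
    reporters_by_target = defaultdict(set)
    for reporter, target in reports:
        if reporter != target:  # ignore self-reports
            reporters_by_target[target].add(reporter)
    # Counting distinct reporters means repeated claims from one
    # (possibly lying) relay cannot push a target over the threshold.
    return sorted(
        target
        for target, reporters in reporters_by_target.items()
        if len(reporters) >= min_reporters
    )
```

Only the relays returned here would then be actively scanned to confirm
the passive measurements.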
Relays could:
- aggregate reachability issues over the past 24 hours (or week?) per
outbound destination relay (which relay failed what percentage of
circuit build attempts)
- if a relay failed more often than some threshold value, check if the
relay in question:
    - had its uptime reset during that timeframe (reachability problem
expected)
    - dropped out of consensus during that timeframe
- only include it in uploaded data if it didn't drop out of consensus
and did not reset its uptime
(using a week instead of a day would help with reducing the amount of
data that we need to process after collecting it)
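The steps above could be sketched roughly as follows. This is only an
illustration in Python, not how tor would implement it: the names, the
50% threshold and the min_attempts cut-off (added so that a handful of
failed attempts does not count as a signal) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class DestinationStats:
    attempts: int = 0
    failures: int = 0

    @property
    def failure_rate(self):
        return self.failures / self.attempts if self.attempts else 0.0

def aggregate(events):
    """events: iterable of (destination_fingerprint, succeeded) pairs
    observed during the aggregation window (24 hours or a week)."""
    stats = {}
    for dest, succeeded in events:
        entry = stats.setdefault(dest, DestinationStats())
        entry.attempts += 1
        if not succeeded:
            entry.failures += 1
    return stats

def reportable(stats, restarted, left_consensus,
               threshold=0.5, min_attempts=10):
    """Return {fingerprint: failure_rate} for destinations worth uploading.

    restarted / left_consensus: sets of fingerprints whose uptime was
    reset or which dropped out of consensus during the window.
    threshold and min_attempts are hypothetical values.
    """
    result = {}
    for dest, entry in stats.items():
        if entry.attempts < min_attempts or entry.failure_rate <= threshold:
            continue
        if dest in restarted or dest in left_consensus:
            continue  # failures are expected here, so not reported
        result[dest] = round(entry.failure_rate, 2)
    return result
```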
In addition to that passive collection, relays - when idle/not
overloaded - could actively attempt to create circuits to measure their
reachability to relays for which they did not collect any passive
data. (maybe limit it to fast and stable destination relays)
To reduce the load on the tor network we could limit the active outbound
connection tests based on relay flags:
- Guard-only relays are more likely to establish outbound connections to
non-exit relays
- middle-only relays (neither guard nor exit flag) are more likely to
establish connections to exits than to guards
- exits are more likely to get inbound connections from non-guards
(maybe skip active probes from exits since they are a bottleneck already?)
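A rough Python sketch of how a relay might pick its active probe targets
under these rules. The function and parameter names are made up for
illustration; the flag names are the usual consensus flags, and the
choice to skip exits entirely follows the "maybe" above rather than a
settled decision.

```python
def probe_targets(own_flags, consensus, have_passive_data, overloaded=False):
    """Return fingerprints this relay could actively probe.

    own_flags: set of consensus flags of the probing relay.
    consensus: dict mapping fingerprint -> set of flags.
    have_passive_data: fingerprints already covered by passive data.
    """
    # Only probe when idle, and skip exits since they are a bottleneck.
    if overloaded or "Exit" in own_flags:
        return []
    if "Guard" in own_flags:
        # Guard-only relays mostly connect to non-exit relays.
        def wanted(flags):
            return "Exit" not in flags
    else:
        # Middle-only relays mostly connect to exits.
        def wanted(flags):
            return "Exit" in flags
    return sorted(
        fingerprint
        for fingerprint, flags in consensus.items()
        if fingerprint not in have_passive_data
        and {"Fast", "Stable"} <= flags  # limit to fast and stable relays
        and wanted(flags)
    )
```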
Reachability issues could also be shown to relay operators via log
entries warning them about the potential problem, so they could
actively work on debugging problems themselves.
There is no doubt that this information is also valuable for (powerful)
adversaries (it could help them reduce their effort when they know weak
spots in the network). So you would have to decide what data you collect
and what you would publish (collector.tpo) - even if that means I might
never get to see that interesting data ;)
This is a medium/long term goal, with the usual steps:
- proposal
- implementation
(- deployment
- data collection - the amount of data could be huge
- data analysis)
If there is a consensus that this makes sense and if someone would
actually implement it I would be happy to work/help on a proposal.
This entire idea would be an opt-in torrc setting at the beginning and an
opt-out feature once we are more confident about its implications and
safety.
Please let me know what you think about this idea.
regards,
nusenu
[1]
https://lists.torproject.org/pipermail/tor-project/2017-October/001492.html
related trac tickets:
https://trac.torproject.org/projects/tor/ticket/12131
https://trac.torproject.org/projects/tor/ticket/19068
-- 
https://mastodon.social/@nusenu
twitter: @nusenu_