On 15/01/16 23:00, Rob Jansen wrote:
Hello,
Hi Rob,
I'm moving this discussion from metrics-team@ to tor-dev@, because I think it's relevant for little-t-tor devs who are not subscribed to metrics-team@. Hope you don't mind.
I was recently reviewing the statistics that Tor allows relays to collect and report to the dir servers [1], which then get published in extra-info documents [2]. Most of this can be enabled by simply setting a torrc option. There are quite a few statistics that I feel should not be collected. I'm wondering if the original purpose for collecting many of these statistics still exists, and if we still feel that the privacy compromises that were made when the collection was implemented are still valid in most cases.
Here are the stats I am most worried about, and why:
[unique ips per country code]
*-ips (there are many of these, e.g. "entry-ips")
Usually this involves storing individual user IP addresses in memory (in order to track uniqueness) over some period of time (usually 24 hours), sometimes for longer than the user would have otherwise been known to Tor (if a user's session is 1 hour, Tor could remember the IP for at most 23 additional hours). This is reported, e.g., per entry; there are many cases in the data where it is very likely that only one user is connecting to a guard from a given country (because it is rounded up to 8). Users in small countries have the greatest risk (intersection attacks become really easy).
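To make the concern above concrete, here is a minimal Python sketch of per-country unique-IP counting with round-up to a multiple of 8. It is not Tor's actual implementation; the function names and the sample addresses are hypothetical, and the step of 8 is taken only from the description above.

```python
from collections import defaultdict

def round_up_to_multiple(n, step=8):
    """Round n up to the next multiple of step (0 stays 0)."""
    return ((n + step - 1) // step) * step

def report_unique_ips(connections, step=8):
    """connections: iterable of (ip, country_code) pairs seen in one interval."""
    # To deduplicate, every IP has to stay in memory for the whole interval.
    seen = defaultdict(set)
    for ip, cc in connections:
        seen[cc].add(ip)
    # Only the rounded per-country counts are published.
    return {cc: round_up_to_multiple(len(ips), step) for cc, ips in seen.items()}

# Hypothetical sample: two distinct users from "de", a single user from "li".
connections = [
    ("198.51.100.7", "de"), ("198.51.100.8", "de"), ("198.51.100.7", "de"),
    ("203.0.113.5", "li"),
]
report = report_unique_ips(connections)
print(report)  # {'de': 8, 'li': 8}
```

Rounding hides whether a country had 1 or 8 users, but the mere presence of a country in the report still reveals that at least one user from that country connected to this relay during the interval.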
I agree that we might just lose these statistics. We used them in the past as a first approximation to counting users, but obviously that only works as long as clients only connect to a single relay. The only place where we're still using them is in a workaround for estimating bridge users. See #15469 for more details and #8786 for something we'd have to implement before taking these statistics out.
[exit statistics by port number]
exit-kibibytes-written
exit-kibibytes-read
exit-streams-opened
Tor classifies its traffic by port, which could uniquely identify the application being used by the client. It also tracks bandwidth usage per port (and per exit); again, this is bad for those using random or unique-looking ports (that a given exit does not see very often) because it could be used to create a fingerprint. Intersection attacks become easier with this information.
Agreed, I can see us dropping these statistics, too. We're currently not using them. But also see my suggestion below.
The less problematic stats:
[circuit-based cell statistics]
cell-processed-cells
cell-queued-cells
cell-time-in-queue
cell-circuits-per-decile
This provides queue timings and number of cells being processed at a relay. The number of cells can be used to compute bandwidth of circuits. It may be possible to launch some attacks that create several circuits with the intent of moving which decile buckets some legitimate circuits get placed into, but this is less worrisome of an attack than the others.
I'm less worried about this one. But, suggestion below.
Should Tor still be collecting these things? Should Tor disable the collection of these statistics until we have a more privacy-preserving way to collect and aggregate them?
The good news is that privacy-preserving techniques exist that can reduce information leakage. I'm developing a tool based on the secret-sharing variant of PrivEx [3] to collect some of these types of statistics while providing privacy guarantees. We are currently using it to collect only those stats that are useful for producing Tor traffic models. A great advantage of this tool is that the various counters that we store during the collection phase get noise added and are randomized during initialization; only the aggregates are ever known and revealed by the aggregation server, limiting the information that is lost if a relay is compromised. This is a large improvement over the current collection method, which only adds noise before publication and reveals statistics on a per-relay basis.
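The idea of blinded, noisy counters can be sketched in a few lines of Python. This is a simplified illustration of additive secret sharing, not the PrivEx protocol itself and not the tool mentioned above: the modulus `Q`, the names `init_relay` and `aggregate`, the "share keeper" role, and the Gaussian noise with `sigma=2.0` are all assumptions made for the example, and encryption of the shares in transit is omitted.

```python
import random
import secrets

Q = 2**61 - 1  # modulus for share arithmetic (arbitrary choice for this sketch)

def init_relay(num_keepers, sigma):
    """Initialize one relay's counter with noise and random blinding values."""
    noise = round(random.gauss(0, sigma))  # noise is added once, at initialization
    blinds = [secrets.randbelow(Q) for _ in range(num_keepers)]
    counter = (noise + sum(blinds)) % Q    # stored value looks uniformly random
    # The blinds are handed to the share keepers and then forgotten by the relay.
    # The noise is returned only so this demo can check the arithmetic.
    return counter, blinds, noise

def aggregate(counters, keeper_sums):
    """Combine all blinded counters; only the noisy total becomes visible."""
    total = (sum(counters) - sum(keeper_sums)) % Q
    return total - Q if total > Q // 2 else total  # map back to a signed value

# Simulate three relays with hypothetical true counts and two share keepers.
true_counts = [3, 0, 12]
num_keepers = 2
keeper_sums = [0] * num_keepers
counters = []
total_noise = 0
for count in true_counts:
    counter, blinds, noise = init_relay(num_keepers, sigma=2.0)
    for k, blind in enumerate(blinds):
        keeper_sums[k] = (keeper_sums[k] + blind) % Q
    counters.append((counter + count) % Q)  # the relay increments during collection
    total_noise += noise

result = aggregate(counters, keeper_sums)
print(result)  # sum of the true counts plus the accumulated noise
```

A single relay's stored counter reveals nothing useful on its own, because the blinding values are uniformly random and held elsewhere; only when every counter and every keeper sum are combined does the (noisy) total appear.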
Suggestion: How about we evaluate these statistics published by relays in the past years to see if there are other benefits or risks we didn't think of, and then we decide whether to leave them in, modify them, or take them out?
The reason is that I'd want to avoid removing this code only to realize shortly after that we overlooked a good reason for keeping it. These statistics have been collected for years now, and it might take another year or so for relays to upgrade to stop collecting them. So what's another month?
Thanks for (re-)starting this discussion!
All the best, Rob
All the best, Karsten
[1] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt
[2] https://collector.torproject.org/recent/relay-descriptors/extra-infos/
[3] www.cypherpunks.ca/~iang/pubs/privex-ccs14.pdf
On Jan 19, 2016, at 3:45 AM, Karsten Loesing karsten@torproject.org wrote:
On 15/01/16 23:00, Rob Jansen wrote:
Hello,
Hi Rob,
I'm moving this discussion from metrics-team@ to tor-dev@, because I think it's relevant for little-t-tor devs who are not subscribed to metrics-team@. Hope you don't mind.
No problem.
Suggestion: How about we evaluate these statistics published by relays in the past years to see if there are other benefits or risks we didn't think of, and then we decide whether to leave them in, modify them, or take them out?
Sounds great, though I'm not sure how this evaluation will happen.
The reason is that I'd want to avoid removing this code only to realize shortly after that we overlooked a good reason for keeping it.
The problem is that it is unlikely that anyone will speak up until *after* we remove them, so it may be difficult to realize all use cases until they have already been removed. At least for me, it's not just a matter of thinking hard enough about it.
That said, I think that for some of these stats, the risk is such that it is hard to imagine collecting them the way Tor does currently.
These statistics have been collected for years now, and it might take another year or so for relays to upgrade to stop collecting them. So what's another month?
Agreed.
To be clear, I am not suggesting that we simply remove everything and never look back. I'm actually suggesting using secure aggregation to *replace* the current method for counting and aggregating. Maybe the secure counting/aggregation happens occasionally, or maybe continuously. The details there still need to be worked out (working on it).
I would suggest that we wait until those details are in fact worked out and we discuss a transition plan before removing the old collection methods, but I think that some stats have enough risk that it may not be worth waiting. Maybe we can remove the riskiest stats (IP addresses, exit ports, exit bytes) and wait to remove the others until I have more details about a replacement.
Thanks for (re-)starting this discussion!
Cheers, Rob
On Feb 11, 2016, at 2:51 PM, Rob Jansen rob.g.jansen@nrl.navy.mil wrote:
These statistics have been collected for years now, and it might take another year or so for relays to upgrade to stop collecting them. So what's another month?
Agreed.
Hi Karsten,
Could you please summarize your current plans regarding stopping or replacing the collection method for sensitive statistics, based on discussions in Valencia? I'm particularly interested in the plans for the client IP and the exit stats that are categorized by exit port.
Thanks, Rob
On 25/03/16 16:24, Rob Jansen wrote:
Hi Karsten,
Could you please summarize your current plans regarding stopping or replacing the collection method for sensitive statistics, based on discussions in Valencia? I'm particularly interested in the plans for the client IP and the exit stats that are categorized by exit port.
Hi Rob,
are you asking for a summary of our discussion in Valencia or for the current state of things?
Here's a quick update on the two stats you're mentioning:
- I'm working on replacing client IP stats by implementing stats on directory requests by transport and IP version (#8786). I got a bit distracted by the fact that we're currently not counting IPv6 directory requests at all (#18460), but once that's solved, I'll resume working on the other ticket.
- I asked a metrics team volunteer to look at exit stats and the possible benefits from gathering them and then forgot about that task. I'll look at these stats myself this week.
I also added an item to the metrics roadmap, so that we can revisit this discussion at metrics team meetings:
https://trac.torproject.org/projects/tor/wiki/org/teams/MetricsTeam#RoadmapO...
All the best, Karsten
On Mar 28, 2016, at 5:04 AM, Karsten Loesing karsten@torproject.org wrote:
Hi Rob,
are you asking for a summary of our discussion in Valencia or for the current state of things?
Here's a quick update on the two stats you're mentioning:
- I'm working on replacing client IP stats by implementing stats on directory requests by transport and IP version (#8786). I got a bit distracted by the fact that we're currently not counting IPv6 directory requests at all (#18460), but once that's solved, I'll resume working on the other ticket.
- I asked a metrics team volunteer to look at exit stats and the possible benefits from gathering them and then forgot about that task. I'll look at these stats myself this week.
I also added an item to the metrics roadmap, so that we can revisit this discussion at metrics team meetings:
https://trac.torproject.org/projects/tor/wiki/org/teams/MetricsTeam#RoadmapO...
Ahh, great! This is what I was looking for, thank you Karsten :)
-Rob