tl;dr: We propose collecting data from exit nodes to improve the Tor network, using differential privacy and secure multiparty computation to do it in a privacy-sensitive manner.
Hi tor-dev,
In the ongoing effort to make Tor faster, more secure, and more resilient, network data plays an important role. If we know how the network is being used, what its clients' needs are, and the threats that it faces, we can deal with these in an intelligent manner. While the Tor Project does collect some statistics from guards, it does not currently collect and share potentially sensitive exit statistics. This data includes destination statistics and client timing behaviour, among many other potentially interesting, but privacy-sensitive, data points.
This reticence to collect data is due to the (well-founded) risk to clients and OR operators that this data could pose, such as correlation and coercion attacks. This is unfortunate since, as we observe above, in order to make improvements to the Tor network and its feature set, it would be beneficial to know what is going on inside it and with its users.
To that end, it would be great if we were able to learn about network and client trend data. Some concrete examples include circuit-level data volumes, guard traffic usage, lengths of internal buffers, and latencies at relays. Indeed, if it can be counted then we should be able to collect and report it in a privacy-preserving manner.
Which brings me to the reason for this email: I have had the good fortune to work with George Danezis at UCL and my supervisor Ian Goldberg at the University of Waterloo on coming up with a solution to this private data collection problem. We have created a system, PrivEx, that uses modern privacy-preserving techniques such as differential privacy and secure multiparty computation to address this thorny set of challenges; we have written up the details in a tech report that can be found here: http://cacr.uwaterloo.ca/techreports/2014/cacr2014-08.pdf
We have also created implementations of the two variants of PrivEx as described in the tech report. We are currently putting in the finishing touches and will be releasing them soon as open source in a git repo.
We would like to start by rolling out our own PrivEx-enabled exits in the Tor network and begin collecting destination visit statistics. We expect that PrivEx will be generally useful to all exit operators and the Tor network in general but there is no requirement to deploy it everywhere. We hope to deploy PrivEx on a handful of exits during the June-August timeframe.
What we would really like, in order of importance, is 1) a design review of our proposal and 2) an implementation review (once we release it). We hope that these reviews will address the main concerns of the community at large as well as give it, and us, a measure of confidence that collecting data with PrivEx is inherently good and is being done in a responsible and intelligent manner. We anticipate that this would make PrivEx an attractive addition for the Tor Project and their data collection needs.
Please don't hesitate to give us your feedback, either to the list or to me via email.
Cheers,
Tariq
Tariq Elahi:
What we would really like in order of importance is 1) a design review of our proposal, […]
Maybe it's a non-issue but “better ask than sorry”.
If I get it, the idea is to publish a list of visited domain names and how frequently they have been visited (with noise). Would the privacy of users be respected if a website contains a link to a per-visitor generated hostname (using DNS wildcards)?
Lunar:
Tariq Elahi:
What we would really like in order of importance is 1) a design review of our proposal, […]
Maybe it's a non-issue but “better ask than sorry”.
If I get it, the idea is to publish a list of visited domain names and how frequently they have been visited (with noise). Would the privacy of users be respected if a website contains a link to a per-visitor generated hostname (using DNS wildcards)?
Please ignore me:
“To further reduce the risk of inadvertent disclosures, it collects only information about destinations that appear in a list of known censored websites.”
On 11/06/14 20:54, Tariq Elahi wrote:
tl;dr: We propose collecting data from exit nodes to improve the Tor network, using differential privacy and secure multiparty computation to do it in a privacy-sensitive manner.
Hi tor-dev,
In the ongoing effort to make Tor faster, secure and more resilient, network data plays an important role. If we know how the network is being used, what its clients' needs are and the threats that it faces we can deal with these in an intelligent manner. While the Tor Project does collect some statistics from guards, it does not currently collect and share potentially sensitive exit statistics. This data includes destination statistics and client timing behaviour, among many other potentially interesting, but privacy sensitive, data points.
This reticence to collect data is due to the (well-founded) risk to clients and OR operators that this data could pose, such as correlation and coercion attacks. This is unfortunate since, as we observe above, in order to make improvements to the Tor network and its feature set, it would be beneficial to know what is going on inside it and with its users.
To that end, it would be great if we were able to learn about network and client trend data. Some concrete examples include circuit-level data volumes, guard traffic usage, lengths of internal buffers, and latencies at relays. Indeed, if it can be counted then we should be able to collect and report it in a privacy-preserving manner.
Which brings me to the reason for this email; I have had the good fortune to work with George Danezis at UCL and my supervisor Ian Goldberg at the University of Waterloo on coming up with a solution to this private data collection problem. We have created a system, PrivEx, that uses modern privacy-preserving techniques such as differential privacy and secure multiparty computation to address this thorny set of challenges; we have written up the details in a tech report that can be found here:http://cacr.uwaterloo.ca/techreports/2014/cacr2014-08.pdf .
First of all, thanks for doing this research! Having such a system in place would be very useful. If it can be done securely.
I don't feel competent enough to review the crypto parts in that report, so I'll have to leave that to others on this list.
Just one question from taking a quick look over the report: how resilient are the two designs to failing tally key servers? It seems that the plan is to have around 10 of those, which is about the number of directory authorities. And even those are sometimes having difficulty producing a consensus every hour. We even have a dedicated service that watches out for problems with the consensus process. So, what if a subset of the tally key servers break temporarily or even permanently? I guess what I'm asking is how much coordination effort does it take to run your system? Would we need a new tally-key-server-health service?
We have also created implementations of the two variants of PrivEx as described in the tech report. We are currently putting in the finishing touches and will be releasing them soon as open source in a git repo.
We would like to start by rolling out our own PrivEx-enabled exits in the Tor network and begin collecting destination visit statistics. We expect that PrivEx will be generally useful to all exit operators and the Tor network in general but there is no requirement to deploy it everywhere. We hope to deploy PrivEx on a handful of exits during the June-August timeframe.
What we would really like in order of importance is 1) a design review of our proposal, 2) an implementation review would be nice (once we release it). We hope that these reviews will address the main concerns of the community at large as well as give it, and us, a measure of confidence that collecting data with PrivEx is inherently good and is being done in a responsible and intelligent manner. We anticipate that this would make PrivEx an attractive addition for the Tor Project and their data collection needs.
What's the timeline here? You say that the code will be released soon, that you hope to deploy exits during the June-August timeframe, and that you're hoping to get some review on design and implementation. In what order will these things happen? Stated differently: will people have sufficient time to look out for implementation flaws before you deploy your exits?
Please don't hesitate to give us your feedback, either to the list or to me via email.
Thanks for announcing your plans here in advance!
All the best, Karsten
Hi Karsten, Thanks for your feedback. I will try to address your comments inline below. What follows is terse, and will require referring back to the tech report.
On 14-06-16 04:40 PM, Karsten Loesing wrote:
Just one question from taking a quick look over the report: how resilient are the two designs to failing tally key servers? It seems that the plan is to have around 10 of those, which is about the number of directory authorities. And even those are sometimes having difficulty producing a consensus every hour. We even have a dedicated service that watches out for problems with the consensus process. So, what if a subset of the tally key servers break temporarily or even permanently? I guess what I'm asking is how much coordination effort does it take to run your system? Would we need a new tally-key-server-health service?
The tally servers are indeed a point of failure, but there don't have to be so many of them online at once and when they are online, they only need be around for the duration of the epoch. Let's elaborate.
In both schemes new keys are generated, in S2 by the exits and in D2 by the TKSs, and sent to their respective recipients. That initializes the epoch and lets us know who is around. For instance, in S2 each exit will enumerate all the known TKSs and send them keys if it is able to connect to the PrivEx listener on the TKSs. This way only online TKSs will take part in that epoch. In D2, each TKS has to generate an ephemeral key and send it to the PBB. This way, if the TKS is down for an epoch, there will not be a key for it in the PBB and hence an exit will not include it in its key creation process.
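The S2 start-of-epoch step described above can be sketched roughly as follows. This is only an illustration: `start_epoch`, the `is_reachable` probe (standing in for a connection attempt to the PrivEx listener), the TKS names, and the 32-byte key size are all assumptions made for the example and are not taken from the tech report or the PrivEx code.

```python
import secrets

def start_epoch(known_tks, is_reachable):
    """Return {tks_name: fresh_key} for the TKSs that answered this epoch."""
    participants = {}
    for tks in known_tks:
        # Hypothetical probe: stands in for connecting to the TKS listener.
        if is_reachable(tks):
            # Fresh per-epoch secret, shared with this TKS only.
            participants[tks] = secrets.token_bytes(32)
    return participants

# Example: tks2 is offline, so it simply does not take part in this epoch.
online = {"tks1", "tks3"}
keys = start_epoch(["tks1", "tks2", "tks3"], lambda t: t in online)
print(sorted(keys))
```

An offline TKS is never handed a key, so nothing later in the epoch depends on it.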
Granted, a TKS may go down during an epoch, and then that epoch's data will be lost. We can mitigate this by reducing the size of the TKS pool and also by only bestowing TKS-hood on those servers with generally high uptime. It is not going to be foolproof, but at the very least it will fail secure.
The key takeaway is that only single epochs will be affected by this, and the general utility of the system can be maintained over the long term.
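To illustrate why losing a TKS mid-epoch costs that epoch's data without exposing anything, here is a minimal sketch that assumes plain additive secret sharing modulo a prime. The modulus, the function name `run_epoch`, and the way noise is passed in are assumptions for the example; the tech report has the actual S2 construction.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (an assumption)

def run_epoch(true_count, noise, tks_names, failed=()):
    """Return the noisy tally, or None if the epoch must be discarded."""
    # The exit draws one random share per TKS that joined this epoch.
    shares = {t: secrets.randbelow(P) for t in tks_names}
    # The stored counter is blinded by every share from the start.
    counter = (true_count + noise - sum(shares.values())) % P
    # At the end of the epoch only surviving TKSs publish their share.
    published = [s for t, s in shares.items() if t not in failed]
    if len(published) < len(shares):
        # A missing share leaves the counter masked by a uniformly random
        # value: nothing is revealed, but this epoch's data is lost.
        return None
    return (counter + sum(published)) % P

tks = ["tks1", "tks2", "tks3"]
ok = run_epoch(true_count=42, noise=3, tks_names=tks)
lost = run_epoch(true_count=42, noise=3, tks_names=tks, failed=("tks2",))
print(ok, lost)
```

Because fresh shares are drawn every epoch, a failure in one epoch has no effect on the next.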
What's the timeline here? You say that the code will be released soon, that you hope to deploy exits during the June-August timeframe, and that you're hoping to get some review on design and implementation. In what order will these things happen? Stated differently: will people have sufficient time to look out for implementation flaws before you deploy your exits?
What we would like is your comments on the design and (as soon as I can get the code cleaned up) on the code as well, if it is not too much trouble. It is my intention to keep in close contact with the Tor community about deployment efforts so as to ensure transparency. I will be in touch about more developments as they occur.
Cheers, Tariq