"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
Hello all,
<snip>
We put in some simple obfuscation so as not to reveal overly sensitive data: we multiplied the actual values by a random number in [0.9, 1.1] before including the obfuscated values in extra-info descriptors. Maybe there's something smarter we could do? Or is this okay?
I actually think that additive rather than multiplicative noise (i.e. randomness) makes sense here. Let's suppose that you would like to obscure any individual connection that contains C cells or fewer (obscuring extremely and unusually large connections seems hopeless but also unnecessary). That is, you don't want the distribution of the RP cell count from any relay to change by much whether or not C cells are removed. The standard differential-privacy approach would be to *add* noise from the Laplace distribution Lap(C/\epsilon), where \epsilon controls how much the statistic's *distribution* can multiplicatively differ. I'm not saying that we need to add noise exactly from that distribution (maybe we weaken the guarantee slightly to get better accuracy), but the same idea applies. This would apply the same to both large and small relays. You *want* to learn roughly how much RP traffic each relay has - you just want to obscure the exact number within some tolerance.
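To make the additive mechanism concrete, here is a toy sketch in numpy; C, the ε value, and the helper `obfuscate_cell_count` are all made up for illustration and are not from the proposal:

```python
import numpy as np

# Toy sketch of the additive Laplace mechanism described above: to hide
# any single connection of up to C cells, add noise with scale b = C/eps.
# C and EPS are made-up illustration values, not proposal numbers.
C = 1000   # cells in the largest connection we want to obscure
EPS = 0.3  # privacy parameter

def obfuscate_cell_count(true_count, sensitivity=C, eps=EPS):
    """Return the true RP cell count plus zero-mean additive Laplace noise."""
    return true_count + np.random.laplace(0.0, sensitivity / eps)

# The *same* absolute noise is added for large and small relays, so a busy
# relay's total is still learned with good relative accuracy, while the
# presence of any single connection stays obscured on both.
print(obfuscate_cell_count(250000))  # large relay: noise is relatively small
print(obfuscate_cell_count(3000))    # small relay: noise dominates
```

Multiplicative noise in [0.9, 1.1], by contrast, scales with the relay's total, which can be far smaller than C on a lightly used relay, so it fails to hide a single connection exactly where hiding matters most.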
Hello Aaron,
I posted an initial draft of the proposal here: https://lists.torproject.org/pipermail/tor-dev/2014-November/007863.html Any feedback would be awesome.
Specifically, I would be interested in understanding the concept of additive noise a bit better. As you can see, the proposal draft is still using multiplicative noise, and if you think that additive is better we should change it. Unfortunately, I couldn't find any good resources on the Internet explaining the difference between additive and multiplicative noise. Could you expand a bit on what you said above? Or link to a paper that explains more? Or link to some other system that is doing additive noise (or even better its implementation)?
Thanks!
Hi George,
I posted an initial draft of the proposal here: https://lists.torproject.org/pipermail/tor-dev/2014-November/007863.html Any feedback would be awesome.
OK, I’ll have a chance to look at this in the next few days.
Specifically, I would be interested in understanding the concept of additive noise a bit better. As you can see, the proposal draft is still using multiplicative noise, and if you think that additive is better we should change it. Unfortunately, I couldn't find any good resources on the Internet explaining the difference between additive and multiplicative noise. Could you expand a bit on what you said above? Or link to a paper that explains more? Or link to some other system that is doing additive noise (or even better its implementation)?
The technical argument for differential privacy is explained in http://research.microsoft.com/en-us/projects/databaseprivacy/dwork.pdf. The definition appears in Def. 2, the Laplace mechanism is given in Eq. 3 of Sec. 5, and Thm. 4 shows why that mechanism achieves differential privacy.
But that stuff is pretty dry. The basic idea is that you're trying to hide the contribution of any one sensitive input (e.g. a single user's data or a single component of a single user's data). The noise that you need to cover that contribution doesn't scale with the number of other users, and so you use additive noise.
Hope that helps, Aaron
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
Hi George,
I posted an initial draft of the proposal here: https://lists.torproject.org/pipermail/tor-dev/2014-November/007863.html Any feedback would be awesome.
OK, I’ll have a chance to look at this in the next few days.
Specifically, I would be interested in understanding the concept of additive noise a bit better. As you can see, the proposal draft is still using multiplicative noise, and if you think that additive is better we should change it. Unfortunately, I couldn't find any good resources on the Internet explaining the difference between additive and multiplicative noise. Could you expand a bit on what you said above? Or link to a paper that explains more? Or link to some other system that is doing additive noise (or even better its implementation)?
The technical argument for differential privacy is explained in http://research.microsoft.com/en-us/projects/databaseprivacy/dwork.pdf. The definition appears in Def. 2, the Laplace mechanism is given in Eq. 3 of Sec. 5, and Thm. 4 shows why that mechanism achieves differential privacy.
But that stuff is pretty dry. The basic idea is that you're trying to hide the contribution of any one sensitive input (e.g. a single user's data or a single component of a single user's data). The noise that you need to cover that contribution doesn't scale with the number of other users, and so you use additive noise.
Thanks for the resources!
I think I now get the general idea. I don't really understand why it works or why Laplace is the best distribution for this job, but maybe it doesn't matter too much for now.
The next problem is how to find the proper parameters for the Laplace distribution. I guess the mean μ needs to be 0, but the hard part is 'b'. In a few papers I read, they set 'b' to (Δf/ε).
In the above, Δf is the "largest change a single participant could have on the output" of the query. Trying to fit this database paradigm to our use case, the largest change a single HS could cause to the HSDir HS counting stats is to change the result by 1. So Δf is 1, and I think that ε is a kind of privacy parameter; let's set that to 0.3 or something.
So this gives us approximately Laplace(0, 4), which is shown in blue here: https://upload.wikimedia.org/wikipedia/commons/0/0a/Laplace_pdf_mod.svg At the end of this post, I put a few samples from this distribution [0]. The generated noise seems reasonable for this job.
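For what it's worth, the parameter arithmetic above can be sketched in a few lines (the descriptor count of 42 is a hypothetical example value):

```python
import numpy as np

# Checking the parameters above: Δf = 1 (one HS changes the HSDir count
# by at most 1) and ε = 0.3 give scale b = Δf/ε ≈ 3.33, i.e. roughly
# the Laplace(0, 4) used for the samples in [0].
delta_f = 1.0
epsilon = 0.3
b = delta_f / epsilon
print(b)

# A noisy count would then be reported as an integer, e.g.:
true_count = 42  # hypothetical number of descriptors seen
noisy_count = int(round(true_count + np.random.laplace(0, b)))
print(noisy_count)
```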
Now, I'm wondering how to do the same thing for the RP cell statistics. In this case, Δf would have to be the largest number of cells we hope to obfuscate in an RP circuit. This is a chicken-and-egg situation, since we don't really know how many cells we usually get without doing these stats first.
Maybe we can use the preliminary stats from #13192, which contain both RP and IP cells (though IP cells will probably be a minority). Or maybe we can fit the distribution dynamically based on the number of cells we receive every day (does this even make sense)? Or what?
BTW, I plan to start implementing this early next week, so that it's ready by mid-December. I hope we have a good solution to this by then; otherwise I will have to do something else (round the stats up to the nearest multiple or something) :/
Thanks!
[0]: Samples generated with:

    for i in xrange(100): print numpy.random.laplace(0, 4)

3.75587440621 -4.28136229035 4.76311443928 4.05142557505 1.70198910055
-3.37374208295 1.12837234927 -0.905282823974 7.66083097188 0.246385660561
-3.52939581339 -1.3368353768 -1.7482807282 2.98489896819 2.87155179984
-2.72961210143 -3.04409210121 -1.1975804202 -0.34861261134 0.953918739146
-14.3586324803 0.272984575989 3.41377347603 6.48752681038 -4.74036696099
0.668294672995 3.15847434594 -1.58855932489 1.65921624515 -0.529373859224
1.1739048689 -2.2201602699 -0.510111160097 2.58474424973 -7.4773321899
-13.4406958005 -1.34083931335 0.34051030906 -1.09939649788 0.647560027442
2.05240761873 -0.275439053432 10.1238334205 9.0960448449 3.20236196087
-2.27093832694 -19.187310803 0.894898545361 3.62459774003 2.10979313978
-0.633823085078 3.32591049399 3.11206489604 6.52626921692 2.68590966921
1.64033470377 0.997911309606 2.39357922671 0.308907976786 -3.02768280735
3.07096999256 -0.907608650976 1.72587291595 -0.838153001361 -1.23764100768
-9.56662634071 7.89275256421 10.4346665539 -0.522605672578 1.88585708734
1.77708545023 -0.301420228241 8.69964251692 -3.35490635732 -3.14148766097
1.73070057195 0.0426008469217 -1.74373108092 4.18116416817 0.139266645962
-1.32024236062 -2.40639481448 -0.364266143555 -3.6882489347 5.79025078063
0.386467380832 2.5388775702 1.60630747885 3.53930934459 -3.2270856708
4.15611732496 6.53669582267 3.83838409062 -2.62835636891 -1.36484455975
5.02827935505 0.693370215176 1.91312352565 1.93007931702 3.24710666718
The next problem is how to find the proper parameters for the Laplace distribution. I guess the mean μ needs to be 0, but the hard part is 'b'. In a few papers I read, they set 'b' to (Δf/ε).
In the above, Δf is the "largest change a single participant could have on the output" of the query. Trying to fit this database paradigm to our use case, the largest change a single HS could cause to the HSDir HS counting stats is to change the result by 1. So Δf is 1, and I think that ε is a kind of privacy parameter; let's set that to 0.3 or something.
Yes, you're right on about how to set these parameters. ε should probably be less than one, but if it is too small (say, below 0.1) the accuracy is horrible.
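The accuracy/privacy trade-off is easy to see empirically; a small sketch (ε values chosen only for illustration):

```python
import numpy as np

np.random.seed(0)  # reproducible sketch

# Empirical look at the trade-off for a count with sensitivity Δf = 1:
# the mean absolute error of Laplace(0, b) noise is exactly b = Δf/ε,
# so ε = 0.1 is ten times noisier than ε = 1.0.
delta_f = 1.0
for eps in (1.0, 0.3, 0.1):
    b = delta_f / eps
    noise = np.random.laplace(0.0, b, size=100000)
    print("eps=%4.1f  scale=%5.2f  mean |error|=%5.2f"
          % (eps, b, np.abs(noise).mean()))
```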
Now, I'm wondering how to do the same thing for the RP cell statistics. In this case, Δf would have to be the largest number of cells we hope to obfuscate in an RP circuit. This is a chicken-and-egg situation, since we don't really know how many cells we usually get without doing these stats first.
Maybe we can use the preliminary stats from #13192, which contain both RP and IP cells (though IP cells will probably be a minority). Or maybe we can fit the distribution dynamically based on the number of cells we receive every day (does this even make sense)? Or what?
There is a problem here in that RPs are distributed over all relays, and so each relay must contribute a number (and therefore noise) while probably carrying relatively little traffic. Ignoring that, the number of cells to obfuscate depends on our privacy goal. If it is to hide whether or not a "typical" rendezvous circuit passed through a given RP, then Δf should be a number of cells that would cover most typical circuits. We can also look at this another way and say that we will hide, say, 10 MiB of traffic, and beyond that the RP you used might start to become revealed in the stats.
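Turning the 10 MiB goal into concrete parameters is simple arithmetic (the ε value below is assumed to match the one used for the HS counts, and the 512-byte cell size is the classic on-wire Tor cell size):

```python
# Translating the "hide 10 MiB of traffic" goal into a sensitivity Δf
# for the RP cell counts. A Tor cell is 512 bytes on the wire, so
# 10 MiB is 20480 cells; ε = 0.3 is assumed, as for the HS counts.
CELL_BYTES = 512
hide_bytes = 10 * 1024 * 1024
delta_f = hide_bytes // CELL_BYTES   # number of cells to obscure
epsilon = 0.3
b = delta_f / epsilon                # Laplace scale for the cell-count stat
print(delta_f, round(b))             # -> 20480 68267
```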
Aaron
Comments on proposal 238:

1. I'm not convinced that the proposed amount of obfuscation is sufficient for the HS descriptor count. Adding noise to cover the contribution in a single period of any single HS doesn't cover its vector of contributions over time. Thus, if over time the number of HSes stays the same (or follows some other pattern that can be guessed by the adversary), then the randomness of the noise in the descriptor counts can effectively be removed by, say, taking the average. The best solution to this that I can think of is to bin every k consecutive integers and report the bin of the count after noise has been added. Then over time an adversary can at worst determine that the number of HSes lies within a range of k. This applies to the cell counts also.
2. In 2.3, what exactly are “unique hidden-service identities”? .onion addresses?
3. It would hugely improve statistics accuracy to aggregate the statistics and only add noise once. However, this would require that the relays participate in a distributed protocol (e.g. [0]) rather than stick numbers in their extra-info docs.
4. Some possible privacy issues with revealing descriptor publication counts:

- You wish to use hidden services in a way that involves a lot of .onion addresses for your service. This will blow past our noise, which I am assuming is calibrated to hide any single publication (or a small constant number of them). Then the total count could reveal when this new service appeared and became active (assuming the number of other descriptor publications is stable or otherwise predictable, say because they correspond to public HSes whose status can be determined via a connection attempt).

- You can factor out the noise over time if the total count is stable or otherwise predictable. This is the same issue as #1 above, and using bins could work here as well.
[0] "Our Data, Ourselves: Privacy via Distributed Noise Generation", by Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. EUROCRYPT 2006. http://research.microsoft.com/pubs/65086/odo.pdf
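The averaging attack from comment 1, and the bin-after-noising defence, can both be sketched in a few lines (the count, bin width k, and ε are made-up illustration values, and `noisy_binned_count` is a hypothetical helper):

```python
import numpy as np

np.random.seed(0)  # reproducible sketch; all parameters are made up

TRUE_COUNT = 137  # hypothetical stable number of HSes seen per period
B = 1.0 / 0.3     # Laplace scale from sensitivity Δf = 1 and ε = 0.3
K = 8             # bin width for the proposed defence

# Averaging attack: fresh zero-mean noise each period averages away,
# so a stable count is recovered almost exactly from enough reports.
reports = TRUE_COUNT + np.random.laplace(0.0, B, size=365)
print("average of 365 noisy reports: %.1f" % reports.mean())

# Proposed defence: report only which bin of k consecutive integers the
# noisy count falls in, so over time the adversary at worst learns a
# range of width about k rather than the exact count.
def noisy_binned_count(true_count, k=K, b=B):
    noisy = true_count + np.random.laplace(0.0, b)
    return int(round(noisy / k)) * k  # snap to nearest multiple of k

binned = [noisy_binned_count(TRUE_COUNT) for _ in range(365)]
print("binned reports are multiples of %d: %s" % (K, sorted(set(binned))))
```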
On Nov 25, 2014, at 5:14 PM, A. Johnson aaron.m.johnson@nrl.navy.mil wrote:
<snip>