Ah, I see - PCSA can actually keep track of unique IP's without actually revealing them. Your first link cleared it up a lot for me. PCSA is a really cool technique!
I'd love to work on this as a GSoC project. I'll write up a proposal and send it out soon.
On Tue, Mar 28, 2017 at 2:55 PM, Jaskaran Singh jvsg1303@gmail.com wrote:
Hi Samir,
Brute force does affect Bloom filter/hashed-values as you rightly mentioned, but not Probabilistic Counting by Stochastic Averaging (PCSA).
PCSA works on the principle that in an input the probability of n consecutive bits having value '0' from the left side(could be right as well, but for now assume it left) is 2^(-(n+1)). Bit 'i' of the Bitmap(which is our main data structure) is set if a the number of consecutive zeros (from left) is 'i'.
We keep repeating it for every input(IP address). We then end up with a Bitmap whose most significant '1' can be computed to give us an approximate number of inputs that must have been gone into the algorithm.
In simple words, if I tell you that I have seen the value 1010000 out of a total of 'x' values I examined. You could guess that I had examined a total of 2^5 values before I saw that particular value.
We would tweak the algorithm to store only the significant most '1' in bitmap instead of storing '1' at every iteration. This would mean that all that adversary could get hold of is a bitmap whose just one of the bit is '1'.
Example, the adversary might get a data structure that looks like: 000001000000 and would have no way tell what IP addresses were used as an input.
This was just the basic idea behind PCSA. The actual PCSA makes use of complicated looking formula to get the approximate number of unique IP addresses in order to keep error rate low.
I hope this makes sense.
For some more information and simulation, please check
[0] https://research.neustar.biz/2013/04/02/sketch-of-the- day-probabilistic-counting-with-stochastic-averaging-pcsa/ [1] http://content.research.neustar.biz/blog/runs.html
Regards, Jaskaran
On Wednesday 29 March 2017 01:54 AM, samir menon wrote:
This ticket [1] was suggested as a GSoC project, but I think there might be an issue with the security model/perceived threat.
To summarize the ticket and its child [1], basically, we currently store all the IP's seen by a node so that we can count unique IP's. The idea is that this is dangerous; if a node is compromised, then all of those IP addresses can be retrieved from memory. Therefore, a variety of mitigation methods have been proposed (most prominently, the 'Probabilistic Counting Algorithm' from [2])
Here's my issue: what about brute force?
No matter what method we use, we will arrive at a data structure that should be able to, given an IP address, tell us whether it is new (and we should increment the unique counter) or old (and we should leave the unique counter the same), with some reasonably small false positive rate. Basically, we're supposed to use some kind of Bloom filter like structure.
Then can't that structure then be brute-forced, offline, by an attacker? IPv4 addresses are 32-bits (~4.3 billion of them), so an attacker could just run whatever method we use to check membership over and over, and then recover the set of IP's. The same happens if we hash the IP's beforehand.
So, is this attack acceptable? The only mitigation I've seen is the one referenced by 'Aaron' in the ticket, which is the system that git uses, cryptolog; there, they have a random salt that changes daily. Then, an attacker can only learn the IP's for one day. This sounds like a reasonable compromise to me, but then the implementation becomes rather simple; just hash the IP's with a random salt that changes daily before putting them in the set.
IPv6 also solves this (128 bits), but there again, the solution is just to hash the IP's before storing them - the Bloom filter/'Probabilistic Counting Algorithm' is unnecessary.
I think I must be missing something about how the 'Probabilistic Counting Algorithm' works - somehow, it needs to keep track of the # of unique IP's without knowing (with a high probability) whether any 1 individual IP has been seen.
Any help/pointing out of errors in my reasoning would be useful.
Thanks, Samir Menon menon.samir@gmail.com mailto:menon.samir@gmail.com samir2@stanford.edu mailto:samir2@stanford.edu
[1] https://trac.torproject.org/projects/tor/ticket/7532 [2] http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Probab/
1985-Flajolet-Probabilistic-counting.pdf
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
-- Jaskaran Veer Singh (jvsg) jvsg1303 at gmail dot com PGP 2814 3FB7 A32D 429B 092E 27F0 8AA3 C532 9E1A 6AD8
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev