This ticket [1] was suggested as a GSoC project, but I think there might be an issue with the security model/perceived threat.
To summarize the ticket and its child [1], basically, we currently store all the IP's seen by a node so that we can count unique IP's. The idea is that this is dangerous; if a node is compromised, then all of those IP addresses can be retrieved from memory. Therefore, a variety of mitigation methods have been proposed (most prominently, the 'Probabilistic Counting Algorithm' from [2])
Here's my issue: what about brute force?
No matter what method we use, we will arrive at a data structure that should be able to, given an IP address, tell us whether it is new (and we should increment the unique counter) or old (and we should leave the unique counter the same), with some reasonably small false positive rate. Basically, we're supposed to use some kind of Bloom filter like structure.
Then can't that structure then be brute-forced, offline, by an attacker? IPv4 addresses are 32-bits (~4.3 billion of them), so an attacker could just run whatever method we use to check membership over and over, and then recover the set of IP's. The same happens if we hash the IP's beforehand.
So, is this attack acceptable? The only mitigation I've seen is the one referenced by 'Aaron' in the ticket, which is the system that git uses, cryptolog; there, they have a random salt that changes daily. Then, an attacker can only learn the IP's for one day. This sounds like a reasonable compromise to me, but then the implementation becomes rather simple; just hash the IP's with a random salt that changes daily before putting them in the set.
IPv6 also solves this (128 bits), but there again, the solution is just to hash the IP's before storing them - the Bloom filter/'Probabilistic Counting Algorithm' is unnecessary.
I think I must be missing something about how the 'Probabilistic Counting Algorithm' works - somehow, it needs to keep track of the # of unique IP's without knowing (with a high probability) whether any 1 individual IP has been seen.
Any help/pointing out of errors in my reasoning would be useful.
Thanks, Samir Menon menon.samir@gmail.com samir2@stanford.edu
[1] https://trac.torproject.org/projects/tor/ticket/7532 [2] http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Probab/1985-Flajolet-Pro...