Hi,
Please have a look at this proposal. I will replace "xyz" with more meaningful numbers once this is finalized. Comments, suggestions, and criticism are welcome.
-------------
Filename: xxx-Count-unique-IPs-in-anonymous-way.txt
Title: Count Unique IP addresses in an anonymous way
Author: Jaskaran Veer Singh
Created: 14 March 2017
Status: Draft
§0. Introduction
Currently, guard relays and bridges maintain a list of the IP addresses of the devices that connect to them, for various reasons such as letting a bridge check which countries have blocked it. This is dangerous because if any of these Tor instances is compromised, clients will be de-anonymized. To solve this issue, this document proposes a new data structure that keeps track of the number of unique IP addresses seen, but does not directly keep a list of them.
§1. Specification
§1.1. Notation
Let `a^b` denote the exponentiation of a to the bth power.
Let `a == b` denote the equality of a and b.
Let `a := b` be the assignment of the value of b to the variable a.
Let `a / b` denote the division of a by b.
Let `a <= b` denote that a is less than or equal to b.
Let `a >= b` denote that a is greater than or equal to b.
§2. Research
There are three candidate ways to solve this problem, all of them big-data algorithms. A few problems arise for each of them, since they were designed for big data while the data we would feed them is not necessarily big.
§2.1. Bloom Filter[1]
A Bloom filter maps each input to positions in a bitmap after passing it through two or more hash functions. Later, any new input is mapped onto the bitmap in the same way to check whether that value is already present in the set. A key property of this bitmap is that collisions can happen, and those collisions create deniability. When a collision happens, on the one hand one of the inputs does not count as unique (although in reality it is); on the other hand, this is beneficial because it creates deniability: a person who gets hold of this data structure can never be 100% sure about the original inputs. So we get the job done, at the cost of some error rate.
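As an illustration, a minimal Bloom filter can be sketched in a few lines of Python. The parameters (m_bits, k_hashes) and the use of disjoint SHA-256 slices as the k hash functions are illustrative choices, not part of the proposal:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with k hash functions over an m-bit array.
    m_bits, k_hashes and the SHA-256-slice hashes are illustrative choices."""

    def __init__(self, m_bits=64, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item):
        # Derive k positions from disjoint 4-byte slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Collisions make this probabilistic: false positives are possible,
        # false negatives are not -- which is exactly the deniability property.
        return all(self.bits[pos] for pos in self._positions(item))
```

Querying an address that was never added can still return true if other inputs happened to set the same k bits; that collision is what gives the structure its deniability.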
§2.1.1. Obstacle
Suppose the number of inputs is small; say we receive just one connection per day from some small, less busy country like Estonia. In that case there might be no chance of a collision at all, and the adversary could recover the IP address with some brute force. Hence this algorithm alone isn't suited for us.
§2.2. RAPPOR[2]
Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) is an algorithm in which the system adds some deterministic and some nondeterministic noise to the data before storing it. This creates deniability. In our case we don't need the deterministic noise of the first stage, so we would stick to adding nondeterministic noise and storing the result in a Bloom filter.
One thing to note here is that we should not accept any output of the nondeterministic randomizer that maps to an IP address in Group D or Group E, since those addresses are not in use and the adversary could easily tell that such an output was produced by adding random noise.
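A RAPPOR-style nondeterministic randomizer for addresses might look like the following sketch. The noise probability and the function name are assumptions for illustration, not values from the proposal:

```python
import random

def randomize_ip(ip, p_noise=0.25, rng=random):
    """With probability p_noise, report a uniformly random IPv4 address
    instead of the real one (RAPPOR-style nondeterministic noise).
    p_noise = 0.25 is an illustrative value, not from the proposal.
    A real implementation would redraw whenever the random address falls
    in unused space (e.g. Group D/E), per the note above."""
    if rng.random() < p_noise:
        return ".".join(str(rng.randrange(256)) for _ in range(4))
    return ip
```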
§2.2.1. Obstacle
Using brute force, the adversary could check whether a stored IP address is a real one or was produced by random noise: she could compare the addresses obtained via brute force against the directory of which IP addresses are allotted to which country. Any address that does not match is one that was faked with random noise.
§2.3. Probabilistic Counting with Stochastic Averaging (PCSA)[3]
It is based on FM (Flajolet-Martin) sketches. The algorithm goes as follows:
    m = 2^b                # with b in [4...16]
    bitmaps = [[0]*32]*m   # initialize m 32-bit-wide bitmaps to 0s

    # Construct the PCSA bitmaps
    for h in hashed(data):
        bitmap_index = 1 + get_bitmap_index( h,b )  # binary address of the rightmost b bits
        run_length = run_of_zeros( h,b )            # length of the run of zeros starting at bit b+1
        bitmaps[bitmap_index][run_length] = 1       # set the bitmap bit based on the run length observed

    # Determine the cardinality
    phi = 0.77351
    DV = m / phi * 2 ^ (sum( least_sig_bit( bitmap ) ) / m)  # the DV estimate
The standard error of this estimate is bounded by 0.78/sqrt(m).
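For concreteness, here is a runnable Python sketch of the pseudocode above. The choice of SHA-256 and b = 6 are illustrative; note also that in Python the bitmaps must be allocated with a comprehension, since `[[0]*32]*m` would alias a single row m times:

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def trailing_zeros(x, cap=31):
    """Length of the run of zeros starting at bit 0 of x, capped at `cap`."""
    run = 0
    while run < cap and not (x >> run) & 1:
        run += 1
    return run

def pcsa_estimate(items, b=6):
    """PCSA sketch: the low b bits of each item's hash select one of
    m = 2^b bitmaps; the trailing-zero run of the remaining bits sets
    one position in that bitmap."""
    m = 2 ** b
    bitmaps = [[0] * 32 for _ in range(m)]  # NOT [[0]*32]*m (row aliasing)
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        index = h & (m - 1)                 # low b bits choose the bitmap
        run = trailing_zeros(h >> b)        # run of zeros starting at bit b+1
        bitmaps[index][run] = 1
    # Z = sum over bitmaps of the position of the least significant 0 bit
    z = 0
    for bm in bitmaps:
        pos = 0
        while pos < 32 and bm[pos]:
            pos += 1
        z += pos
    return (m / PHI) * 2 ** (z / m)
```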
§2.3.1. Obstacle
The algorithm stated above is made for use on large databases. In fact, these sketches were invented to save time and space while doing basic set operations on data of high cardinality. But the data we would provide as input is not necessarily of high cardinality. Since we would be counting numbers for each country separately, the expected range of the input cardinality would be:
0 <= C <= 2500
where C is the actual cardinality of the input data.
§2.3.2. Workaround
As a workaround, we could introduce a correction term for small values of cardinality. So our final formula could look something like:
DV(S1,...,Sm) := m * (2^(Z/m) - 2^(-κ*Z/m)) / ρ
where κ is chosen to be around 1.75.
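The corrected estimator can be sketched directly from the formula; taking ρ to be the same FM constant 0.77351 used in the pseudocode above is an assumption:

```python
PHI = 0.77351  # FM correction constant, also used for rho here (an assumption)

def corrected_estimate(z, m, kappa=1.75, rho=PHI):
    """Small-cardinality correction: the extra 2^(-kappa*z/m) term pulls the
    estimate down to 0 as z -> 0, while vanishing for large z."""
    return m * (2 ** (z / m) - 2 ** (-kappa * z / m)) / rho
```

For large Z the correction term is negligible and the formula reduces to the plain PCSA estimate, while at Z = 0 it yields 0 instead of m/ρ.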
§4. Implementation
The data structure shown below, in geoip.c, needs to be removed, since it stores clients' IP addresses.
    struct clientmap_entry_t {
      HT_ENTRY(clientmap_entry_t) node;
      tor_addr_t addr;
      char *transport_name;
      unsigned int last_seen_in_minutes:30;
      unsigned int action:2;
    };
It then needs to be replaced by a data structure that contains the list of countries together with the number of unique IP addresses seen for each country.
The whole system can be represented by this diagram.
        ------------------
        | Input IP Addr  |
        ------------------
                |
                |
       ---------------------
       | Determine Country |
       ---------------------
                |
                |
      ----------------------
      | Determine Transport|
      ----------------------
                |
                |
   ------------------------------
   | Get salted hash of IP addr |
   ------------------------------
                |
                |
      --------------------------
      | Apply the formula      |
      | on the hash obtained   |
      --------------------------
                |
                |
  ---------------------------------------
  | Increment that country's counter    |
  | if the cardinality estimation       |
  | is greater than the estimation      |
  | done previously                     |
  ---------------------------------------
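The flow in the diagram above could be sketched as follows. Everything here is a stand-in: `geolocate()` replaces Tor's real geoip lookup, and the per-country set of salted hashes replaces the PCSA sketch, purely to keep the example runnable:

```python
import hashlib

# Per-(country, transport) state: the set of salted hashes stands in for the
# PCSA sketch; geolocate() stands in for Tor's real geoip lookup.
# SALT would be rotated at each measurement period.
counters = {}
SALT = b"rotated-each-measurement-period"

def geolocate(ip):
    # Placeholder: bucket by first octet instead of a real geoip database.
    return "ee" if ip.startswith("91.") else "zz"

def process_connection(ip, transport="obfs4"):
    country = geolocate(ip)                                  # determine country
    entry = counters.setdefault((country, transport),        # determine transport
                                {"sketch": set(), "estimate": 0})
    salted = hashlib.sha256(SALT + ip.encode()).hexdigest()  # salted hash of IP
    entry["sketch"].add(salted)                              # feed the estimator
    estimate = len(entry["sketch"])                          # "apply the formula"
    if estimate > entry["estimate"]:                         # bump the counter only
        entry["estimate"] = estimate                         # when the estimate grows
    return country, entry["estimate"]
```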
§5. Security Considerations
We might as well go ahead and implement this, but keep in mind that implementing it would not completely protect a client's identity from an adversary. First, an adversary who has gained access to a guard node or bridge could still obtain the random value being used and de-anonymize clients by brute force. Or, even more simply: why would the adversary need that random value when she can just log all network connections coming into the compromised system?
So, the thing to note is that this functionality keeps anonymous those clients who connected to the compromised system in the past, but not those who will connect to it in the future.
§6. References
[1] https://en.wikipedia.org/wiki/Bloom_filter
[2] https://www.chromium.org/developers/design-documents/rappor
[3] https://research.neustar.biz/2013/04/02/sketch-of-the-day-probabilistic-counting-with-stochastic-averaging-pcsa/
-------------
Regards, Jaskaran Veer Singh
On Fri, 17 Mar 2017 18:12:11 +0000, Jaskaran Singh wrote: ...
Currently, guard relays and bridges maintain a list of the IP addresses of the devices that connect to them, for various reasons such as letting a bridge check which countries have blocked it. This is dangerous because if any of these Tor instances is compromised, clients will be de-anonymized.
As an adversary, I wouldn't take down the bridge but either monitor the traffic to it ($country can also do this on its border gateways), or modify it to tell me the connecting IP addresses.
End users tend to be on dynamic IP addresses, so stored IP addresses aren't worth much when you don't know when they were used; that is a reason why $adversary might be more interested in snooping than in compromising the bridge.
(Although I don't know how prevalent changing IP addresses still are when you're online permanently. E.g. here in Germany, Telekom is changing to all-IP, and then no longer disconnects after 24h, so you don't change IPs every day.)
...
present in the set. The feature of this bitmap is that collisions could happen. And this collision creates deniability. When collisions happen,
The problem is that for the accounting purposes you don't want (too many) collisions, and also that state agencies don't necessarily care for plausible deniability - if an IP address is found by enumeration and probing the bloom filter they might still decide to put that user on closer watch. (I've heard that a lot of the traditional telephone tapping isn't used as evidence in court but produces leads to where to investigate next.)
On the other hand, you can indeed keep the filter rather small, because one bridge doesn't get that many connections, and you don't need to make it anywhere near big enough to avoid collisions with 2^32 entries. It could also be dynamically sized depending on the number of clients seen - you need aging anyway, so the next table can have a different size.
You can also go and poison the bloom filter with some random addresses, even a lot, actually. If we're talking of 2000 users you can easily throw in another 2000 random addresses without decreasing the precision of the statistics much - only on a size comparable to collisions in the bloom filter itself.
- Andreas
Hi Andreas,
On Saturday 18 March 2017 10:06 AM, Andreas Krey wrote:
As an adversary, I wouldn't take down the bridge but either monitor the traffic to it ($country can also do this on its border gateways), or modify it to tell me the connecting IP addresses.
Absolutely correct. In fact I mentioned this toward the end of the proposal.
Or even better, why would the adversary need that random value when she can simply log all network connections coming into the (compromised) system?
But I think our users don't expect us to profile them, so I believe it is bad practice for any anonymity system to keep a list of something as important as IP addresses.
But there's a thing to be happy about: if we replace the list of IP addresses with something better, we might be able to prevent the adversary from learning the IP addresses that connected in the past. Suppose an adversary (say, a government) suspects someone used a bridge to connect to the Tor network. They would only be able to know that for sure if they logged that activity there and then; this means they would not be able to de-anonymize users retrospectively. Of course, if the user connects again from the same IP while the adversary is monitoring the relay/bridge, that might cause a problem.
End users tend to be on dynamic IP address, so stored IP addresses aren't of much worth when you don't know when they were used; that is a reason why $adversary might be more interested in snooping than in compromising the bridge.
(Although I don't know how prevalent changing IP addresses still are when you're online permanently. E.g. here in germany telekom changes to all-ip, and there no longer disconnects after 24h, and thus you don't change IPs every day.)
...
present in the set. The feature of this bitmap is that collisions could happen. And this collision creates deniability. When collisions happen,
The problem is that for the accounting purposes you don't want (too many) collisions, and also that state agencies don't necessarily care for plausible deniability - if an IP address is found by enumeration and probing the bloom filter they might still decide to put that user on closer watch. (I've heard that a lot of the traditional telephone tapping isn't used as evidence in court but produces leads to where to investigate next.)
On the other hand side you can indeed keep the filter rather small because one bridge doesn't get that many collisions, and you don't need to make it anywhere as big as to avoid collision with 2^32 entries. Could also be dynamically sized depending on the number of clients seen
- you need aging anyway, so the next table can have a different size.
I feel that this isn't a nice solution. Suppose you have 10 cells and 3 hash functions at the beginning. Later, when the inputs exceed a threshold, you increase the size of the Bloom filter to 20 cells. Now those 3 hash functions map over the whole new range, which means the inputs that were mapped into 10 cells now map to something completely different. Hence the error rate would be, I guess, close to 100%.
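The remapping concern can be demonstrated numerically; the SHA-256-slice hash functions here are illustrative, matching nothing in any real implementation:

```python
import hashlib

def positions(item, m, k=3):
    """k bit positions for `item` in a filter of m cells (illustrative
    SHA-256-slice hash functions, as in a toy Bloom filter)."""
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m
            for i in range(k)]

# Grow the filter from 10 to 20 cells and count how many of 50 addresses
# would land in different positions after the resize.
ips = ["192.0.2.%d" % i for i in range(50)]
moved = sum(positions(ip, 10) != positions(ip, 20) for ip in ips)
```

Each position survives the resize only when the hash value mod 20 is below 10, so with 3 hash functions roughly 7/8 of entries move; queries against bits set under the old size then fail, which is the near-100% error rate described above.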
You can also go and poison the bloom filter with some random addresses, even a lot, actually. If we're talking of 2000 users you can easily throw in another 2000 random addresses without decreasing the precision of the statistics much - only on a size comparable to collisions in the bloom filter itself.
- Andreas
Regards, Jaskaran
On Tue, 21 Mar 2017 16:06:59 +0000, Jaskaran Singh wrote: ...
On the other hand side you can indeed keep the filter rather small because one bridge doesn't get that many collisions, and you don't need to make it anywhere as big as to avoid collision with 2^32 entries. Could also be dynamically sized depending on the number of clients seen
- you need aging anyway, so the next table can have a different size.
I feel that this isn't a nice solution. Suppose you have 10 cells and 3 hash functions at the beginning. Later when inputs exceed a threshold, you increase the size of bloom filter to make it to 20 cells.
10 is way too low for any population (if 'cell' means 'bit'); I played with a bit of code for that.
3 hash functions would map to the whole range which means the inputs that were mapped to 10 cells would now map to something completely different. Hence, error rate would be, I guess exactly 100%.
No, basically you need to retain the old bloom filter while seeding the new one for the amount of time of your entry timeout.
(And given the other discussion regarding brute-forcing the 2^32 space, either you need a really time-consuming hash, or you count on the pre-seeding with random IP addresses as the only means to cover the real ones.)
I wonder what would happen if you implement the aging by just randomly clearing bits in the bloom filter at an appropriate rate.
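That aging idea can be sketched as follows; the clearing probability per aging interval is an arbitrary illustration:

```python
import random

def age_filter(bits, clear_prob, rng=random):
    """Age a Bloom filter in place by clearing each set bit with probability
    clear_prob (an arbitrary rate chosen per aging interval). Old entries
    decay geometrically, at the cost of false negatives for recent ones."""
    for i, bit in enumerate(bits):
        if bit and rng.random() < clear_prob:
            bits[i] = 0
    return bits
```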
Andreas
Hi Jaskaran,
On 17 Mar 2017, at 23:42, Jaskaran Singh jvsg1303@gmail.com wrote:
§2. Research
There are three ways to solve this problem. All the three ways are actually Big Data Algorithms. A few problems arises for each of these algorithms since they are made for big data but the data we would provide is not necessarily big.
As we discussed on IRC, there's a simpler way to solve the problem of storing IP addresses in memory: store a (keyed) hash of the IP address instead.
The hash can be tuned to be sufficiently hard to make brute-forcing impractical. (As long as each 'country' has sufficiently many IP addresses. And as long as the threat model excludes adversaries which only want to confirm a few addresses.)
The key can be rotated at a suitable interval, ensuring that past addresses can not be discovered by future attackers.
Noise can be added to the hash if we wish.
And there's no need for a correction factor: the hash is an exact mapping.
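A minimal sketch of such a keyed hash, using BLAKE2b's keyed mode; the digest size and key handling are illustrative choices, not part of any proposal text:

```python
import hashlib
import os

def keyed_ip_hash(ip, key):
    """Keyed hash of an IP address using BLAKE2b's keyed mode. Without the
    key, stored digests cannot be brute-forced over the 2^32 address space;
    rotating and destroying the key makes past entries unrecoverable.
    The 16-byte digest size is an illustrative choice."""
    return hashlib.blake2b(ip.encode(), key=key, digest_size=16).hexdigest()

key = os.urandom(32)  # rotated at a suitable interval, then destroyed
entry = keyed_ip_hash("198.51.100.7", key)
```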
Including a hash-based scheme would make the proposal more comprehensive. It could also help justify the complexity of the other schemes in terms of the benefits they provide over and above a simple hash (if any).
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B ricochet:ekmygaiu4rzgsk6n xmpp: teor at torproject dot org ------------------------------------------------------------------------
teor teor2345@gmail.com writes:
Hi Jaskaran,
On 17 Mar 2017, at 23:42, Jaskaran Singh jvsg1303@gmail.com wrote:
§2. Research
There are three ways to solve this problem. All the three ways are actually Big Data Algorithms. A few problems arises for each of these algorithms since they are made for big data but the data we would provide is not necessarily big.
As we discussed on IRC, there's a simpler way to solve the problem of storing IP addresses in memory: store a (keyed) hash of the IP address instead.
The hash can be tuned to be sufficiently hard to make brute-forcing impractical. (As long as each 'country' has sufficiently many IP addresses. And as long as the threat model excludes adversaries which only want to confirm a few addresses.)
Hey teor,
I feel like replying to this thread is basically moot without a precise threat model, since it seems like each person here has a different threat model in mind. There are many applicable attacks and scenarios, and it's unclear which ones we are trying or hoping to cover.
For example, I'm confused about how a keyed hash is better than a simple hash (or scrypt) of the IP address (which most people agree is not a good solution). That is, if an attacker pwns a box and steals the hash key and the list of hashes, the attacker can easily brute-force the 2^32 keyed hashes at that point and derive the list of connected IP addresses. What's the difference?
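A back-of-the-envelope version of this brute-force argument; both rates are assumed figures for illustration, not measurements:

```python
# Brute-forcing the IPv4 space once the key (if any) is stolen.
SPACE = 2 ** 32                  # the whole IPv4 address space

fast_rate = 10 ** 9              # ~1e9 hashes/s for a fast keyed hash (assumed)
slow_rate = 10                   # ~10 evaluations/s for a memory-hard KDF (assumed)

hours_fast = SPACE / fast_rate / 3600            # seconds of work, essentially
years_slow = SPACE / slow_rate / (86400 * 365)   # over a decade
```

With a fast hash the key adds essentially nothing once it is stolen, which is the point above; only a deliberately slow hash changes the economics of enumeration.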
The key can be rotated at a suitable interval, ensuring that past addresses can not be discovered by future attackers.
Hmm, is this attack in the threat model? And how does it work exactly?
Also how does salted hash vs normal hash make a difference here, since even in the normal hash variant, if you memwipe/rotate the hash list everyday, an attacker should not be able to discover past addresses.
All in all, I think it will be hard to choose a scheme here without a precise threat model, and I don't see the original proposal making an attempt to define that.
Greetings from Amsterdam!
Hi,
So here's the updated part of the proposal.
------------
§ Threat model & Security Considerations
Consider the adversary with the following powers:
- Has sufficient computational and storage power to brute-force any method that can be brute-forced.
- Can gain recurrent control of the concerned guard node/bridge.
- Can interact with the data structure that stores the unique IP addresses/hash values/Bloom filter/bitmaps etc.
- Can also log incoming connections and IP addresses outside the realm of Tor (i.e. at the system level, at gateways, etc.).
- Can manipulate incoming connections with made-up IP addresses so as to observe the behaviour of our proposed solution.
- As a consequence of the previous power, can also inject patterns of IP addresses to look for corresponding patterns in the stored data structure.
An ideal solution would not involve hashing, or if it does, it would manipulate the hash before storing it in such a way that the adversary cannot learn the IP addresses even with a brute-force attack.
An ideal solution would also not help the adversary observe any pattern in the stored data structure. This could be accomplished by incorporating a salted hash (or a variation of it) into the proposed solution, with the salt changed every time we start tracking unique IP addresses anew.
There is a fundamental limitation to what we can do: we cannot stop an adversary from learning IP addresses at the system level, at gateways, etc. But the thing to cheer about is that, this way, the adversary cannot learn about users retrospectively.
------------
Regards, Jaskaran
This looks plausible to me. As a formatting note: we usually don't give the non-proposed solutions equal treatment with the proposed solutions. Instead, we usually mark them as not proposed. From this document, I _think_ you're proposing PCSA, and suggesting that we not do the others? But I'm not sure.
For the PCSA portion, we should really have citations to the actual papers where these formulas and algorithms come from.
We should also consider how this proposal would interact with other proposed secure aggregation solutions, like Privcount [1] and/or some other kind of PrivEx [2]. I'd like to hear what the designers of those ideas think of this one.
[1] https://github.com/privcount/privcount [2] http://cacr.uwaterloo.ca/techreports/2014/cacr2014-08.pdf
We should also consider how this proposal would interact with other proposed secure aggregation solutions, like Privcount [1] and/or some other kind of PrivEx [2]. I'd like to hear what the designers of those ideas think of this one.
As you know, PrivCount is a secure aggregation system. In particular, it allows you to count some value (i.e. compute some sum) across relays and over time such that the only output anyone learns is a noisy (and differentially-private) version of the sum.
As you also know, PCSA allows one to locally store a count of unique items in a way that provides some privacy if the machine is compromised. PrivCount isn't designed to support such counting of unique items, but it does provide such "forward privacy" (actually it provides perfect, information-theoretic privacy).
So, it seems that PCSA and PrivCount do similar but different counting and provide similar but different kinds of forward privacy. These ideas could be combined to provide some combination of their functionality and privacy features.

To do so, have each relay maintain an FM sketch as a sequence of PrivCount counters (recall that these counters are blinded and thus appear random at any given time). Represent a 0 in each counter by storing (in a blinded way) 0, and represent 1 by storing (in a blinded way) a random value (with b bits in the counter it is non-zero with probability 1-2^(-b)). Given a new item, a relay turns bits from 0 to 1 in the sketch by adding a random value. The PrivCount aggregation protocol would reveal each bit of the FM sketch aggregated across all relays, which would thus reveal the PCSA count of unique IPs seen across all relays.

Note that this output would not be differentially private - it isn't obvious how FM sketches can support differential privacy with any accuracy, as some single user does cause each bit to flip to one. The output would provide whatever privacy FM sketches provide, and this idea would support the privacy enhancement of flipping random bits proposed by Tschorsch and Scheuermann.
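A toy sketch of the bit encoding described above, using two-party additive secret sharing as a stand-in for PrivCount's blinded counters; the modulus and the two-share split are simplifications, not the actual PrivCount protocol:

```python
import random

Q = 2 ** 61 - 1  # toy modulus; PrivCount's actual counter arithmetic differs

def blind(value, rng):
    """Split `value` into two additive shares mod Q: each share alone
    looks uniformly random, as a blinded counter should."""
    r = rng.randrange(Q)
    return r, (value - r) % Q

def set_fm_bit(rng):
    # A 1 bit of the FM sketch is stored as a blinded random value
    # (non-zero with overwhelming probability).
    return blind(rng.randrange(1, Q), rng)

def clear_fm_bit(rng):
    # A 0 bit is stored as a blinded zero.
    return blind(0, rng)

def reveal(shares):
    # Aggregation: summing the shares mod Q recovers zero vs. non-zero,
    # i.e. whether the sketch bit is set.
    return sum(shares) % Q
```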
This scheme would provide better forward privacy than PCSA alone, because PrivCount would secure the counters perfectly during the measurement period. It would provide better unique counting than per-relay PCSA because it would be counting uniquely across all relays. Again, it wouldn’t provide the differential privacy guarantee of PrivCount, but it would provide whatever privacy is provided by PCSA.
This is an idea I had considered a while ago but gave up on because of the lack of a differential-privacy guarantee for the output and my general lack of confidence in the privacy provided by PCSA. However, it may well be an improvement on what Tor is doing currently :-)
Best, Aaron