Hi,
Please have a look at this proposal. I will replace "xyz" with more meaningful numbers once this is finalized. Comments, suggestions, and criticism are welcome.
-------------
Filename: xxx-Count-unique-IPs-in-anonymous-way.txt
Title: Count Unique IP addresses in an anonymous way
Author: Jaskaran Veer Singh
Created: 14 March 2017
Status: Draft
§0. Introduction
Currently, guard relays and bridges maintain a list of the IP addresses of the devices that connect to them, for various reasons such as letting a bridge check which countries have blocked it. This is dangerous because if any of these Tor instances is compromised, clients will be de-anonymized. To solve this issue, this document proposes a new data structure that keeps track of the number of unique IP addresses seen, but does not directly keep a list of them.
§1. Specification
§1.1. Notation
Let `a^b` denote the exponentiation of a to the bth power.
Let `a == b` denote the equality of a and b.
Let `a := b` be the assignment of the value of b to the variable a.
Let `a / b` denote the division of a by b.
Let `a <= b` denote that a is less than or equal to b.
Let `a >= b` denote that a is greater than or equal to b.
§2. Research
There are three candidate ways to solve this problem, all of them big-data algorithms. A few problems arise for each of them, since they were designed for big data while the data we would feed them is not necessarily big.
§2.1. Bloom Filter[1]
A Bloom filter maps each input to positions in a bitmap after passing it through two or more hash functions. Later, any new input is mapped onto the bitmap in the same way to check whether that value is already present in the set. A key property of this bitmap is that collisions can happen, and those collisions create deniability. When a collision happens, on the one hand one of the inputs does not count as unique (although in reality it is); on the other hand, this is beneficial because it creates deniability: a person who gets hold of this data structure can never be 100% sure about the original inputs. So we get the job done, at the cost of some error rate.
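As an illustration, a minimal Bloom filter can be sketched in a few lines of Python. The parameters (m_bits, k_hashes) and the use of disjoint SHA-256 slices as the k hash functions are illustrative choices, not part of the proposal:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with k hash functions over an m-bit array.
    m_bits, k_hashes and the SHA-256-slice hashes are illustrative choices."""

    def __init__(self, m_bits=64, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item):
        # Derive k positions from disjoint 4-byte slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Collisions make this probabilistic: false positives are possible,
        # false negatives are not -- which is exactly the deniability property.
        return all(self.bits[pos] for pos in self._positions(item))
```

Querying an address that was never added can still return true if other inputs happened to set the same k bits; that collision is what gives the structure its deniability.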
§2.1.1. Obstacle
Suppose the number of inputs is small; say we receive just one connection per day from some small, less busy country like Estonia. In that case there might be no chance of a collision at all, and the adversary could recover the IP address with some brute force. Hence this algorithm alone isn't suited for us.
§2.2. RAPPOR[2]
Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) is an algorithm in which the system adds some deterministic and some nondeterministic noise to the data before storing it. This creates deniability. In our case we don't need the deterministic noise of the first stage, so we would stick to adding nondeterministic noise and storing the result in a Bloom filter.
One thing to note here is that we should not accept any output of the nondeterministic randomizer that maps to an IP address in Group D or Group E, since those addresses are not in use and the adversary could easily tell that such an output was produced by adding random noise.
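A RAPPOR-style nondeterministic randomizer for addresses might look like the following sketch. The noise probability and the function name are assumptions for illustration, not values from the proposal:

```python
import random

def randomize_ip(ip, p_noise=0.25, rng=random):
    """With probability p_noise, report a uniformly random IPv4 address
    instead of the real one (RAPPOR-style nondeterministic noise).
    p_noise = 0.25 is an illustrative value, not from the proposal.
    A real implementation would redraw whenever the random address falls
    in unused space (e.g. Group D/E), per the note above."""
    if rng.random() < p_noise:
        return ".".join(str(rng.randrange(256)) for _ in range(4))
    return ip
```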
§2.2.1. Obstacle
Using brute force, the adversary could check whether a stored IP address is a real one or was produced by random noise: she could compare the addresses obtained via brute force against the directory of which IP addresses are allotted to which country. Any address that does not match is one that was faked with random noise.
§2.3. Probabilistic Counting with Stochastic Averaging (PCSA)[3]
It is based on FM (Flajolet-Martin) sketches. The algorithm goes as follows:
    m = 2^b                # with b in [4...16]
    bitmaps = [[0]*32]*m   # initialize m 32-bit-wide bitmaps to 0s

    # Construct the PCSA bitmaps
    for h in hashed(data):
        bitmap_index = 1 + get_bitmap_index( h,b )  # binary address of the rightmost b bits
        run_length = run_of_zeros( h,b )            # length of the run of zeros starting at bit b+1
        bitmaps[bitmap_index][run_length] = 1       # set the bitmap bit based on the run length observed

    # Determine the cardinality
    phi = 0.77351
    DV = m / phi * 2 ^ (sum( least_sig_bit( bitmap ) ) / m)  # the DV estimate
The standard error of this estimate is bounded by 0.78/sqrt(m).
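For concreteness, here is a runnable Python sketch of the pseudocode above. The choice of SHA-256 and b = 6 are illustrative; note also that in Python the bitmaps must be allocated with a comprehension, since `[[0]*32]*m` would alias a single row m times:

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def trailing_zeros(x, cap=31):
    """Length of the run of zeros starting at bit 0 of x, capped at `cap`."""
    run = 0
    while run < cap and not (x >> run) & 1:
        run += 1
    return run

def pcsa_estimate(items, b=6):
    """PCSA sketch: the low b bits of each item's hash select one of
    m = 2^b bitmaps; the trailing-zero run of the remaining bits sets
    one position in that bitmap."""
    m = 2 ** b
    bitmaps = [[0] * 32 for _ in range(m)]  # NOT [[0]*32]*m (row aliasing)
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        index = h & (m - 1)                 # low b bits choose the bitmap
        run = trailing_zeros(h >> b)        # run of zeros starting at bit b+1
        bitmaps[index][run] = 1
    # Z = sum over bitmaps of the position of the least significant 0 bit
    z = 0
    for bm in bitmaps:
        pos = 0
        while pos < 32 and bm[pos]:
            pos += 1
        z += pos
    return (m / PHI) * 2 ** (z / m)
```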
§2.3.1. Obstacle
The algorithm stated above is made for use on large databases. In fact, these sketches were invented to save time and space while doing basic set operations on data of high cardinality. But the data we would provide as input is not necessarily of high cardinality. Since we would be counting numbers for each country separately, the expected range of the input cardinality would be:
0 <= C <= 2500
where C is the actual cardinality of the input data.
§2.3.2. Workaround
As a workaround, we could introduce a correction term for small values of cardinality. So our final formula could look something like:
DV(S1,...,Sm) := m * (2^(Z/m) - 2^(-κ*Z/m)) / ρ
where κ is chosen to be around 1.75.
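The corrected estimator can be sketched directly from the formula; taking ρ to be the same FM constant 0.77351 used in the pseudocode above is an assumption:

```python
PHI = 0.77351  # FM correction constant, also used for rho here (an assumption)

def corrected_estimate(z, m, kappa=1.75, rho=PHI):
    """Small-cardinality correction: the extra 2^(-kappa*z/m) term pulls the
    estimate down to 0 as z -> 0, while vanishing for large z."""
    return m * (2 ** (z / m) - 2 ** (-kappa * z / m)) / rho
```

For large Z the correction term is negligible and the formula reduces to the plain PCSA estimate, while at Z = 0 it yields 0 instead of m/ρ.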
§4. Implementation
The data structure shown below, in geoip.c, needs to be removed, since it stores clients' IP addresses.
    struct clientmap_entry_t {
      HT_ENTRY(clientmap_entry_t) node;
      tor_addr_t addr;
      char *transport_name;
      unsigned int last_seen_in_minutes:30;
      unsigned int action:2;
    };
It then needs to be replaced by a data structure that contains the list of countries together with the number of unique IP addresses seen for each country.
The whole system can be represented by this diagram.
        ------------------
        | Input IP Addr  |
        ------------------
                |
                |
       ---------------------
       | Determine Country |
       ---------------------
                |
                |
      ----------------------
      | Determine Transport|
      ----------------------
                |
                |
   ------------------------------
   | Get salted hash of IP addr |
   ------------------------------
                |
                |
      --------------------------
      | Apply the formula      |
      | on the hash obtained   |
      --------------------------
                |
                |
  ---------------------------------------
  | Increment that country's counter    |
  | if the cardinality estimation       |
  | is greater than the estimation      |
  | done previously                     |
  ---------------------------------------
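The flow in the diagram above could be sketched as follows. Everything here is a stand-in: `geolocate()` replaces Tor's real geoip lookup, and the per-country set of salted hashes replaces the PCSA sketch, purely to keep the example runnable:

```python
import hashlib

# Per-(country, transport) state: the set of salted hashes stands in for the
# PCSA sketch; geolocate() stands in for Tor's real geoip lookup.
# SALT would be rotated at each measurement period.
counters = {}
SALT = b"rotated-each-measurement-period"

def geolocate(ip):
    # Placeholder: bucket by first octet instead of a real geoip database.
    return "ee" if ip.startswith("91.") else "zz"

def process_connection(ip, transport="obfs4"):
    country = geolocate(ip)                                  # determine country
    entry = counters.setdefault((country, transport),        # determine transport
                                {"sketch": set(), "estimate": 0})
    salted = hashlib.sha256(SALT + ip.encode()).hexdigest()  # salted hash of IP
    entry["sketch"].add(salted)                              # feed the estimator
    estimate = len(entry["sketch"])                          # "apply the formula"
    if estimate > entry["estimate"]:                         # bump the counter only
        entry["estimate"] = estimate                         # when the estimate grows
    return country, entry["estimate"]
```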
§5. Security Considerations
We might as well go ahead and implement this, but keep in mind that implementing it would not completely protect a client's identity from an adversary. First, an adversary who has gained access to a guard node or bridge could still obtain the random value being used and de-anonymize clients by brute force. Or, even more simply: why would the adversary need that random value when she can just log all network connections coming into the compromised system?
So, the thing to note is that this functionality keeps anonymous those clients who connected to the compromised system in the past, but not those who will connect to it in the future.
§6. References
[1] https://en.wikipedia.org/wiki/Bloom_filter
[2] https://www.chromium.org/developers/design-documents/rappor
[3] https://research.neustar.biz/2013/04/02/sketch-of-the-day-probabilistic-counting-with-stochastic-averaging-pcsa/
-------------
Regards, Jaskaran Veer Singh
On Fri, 17 Mar 2017 18:12:11 +0000, Jaskaran Singh wrote: ...
Currently, guard relays and bridges maintain a list of the IP addresses of the devices that connect to them, for various reasons such as letting a bridge check which countries have blocked it. This is dangerous because if any of these Tor instances is compromised, clients will be de-anonymized.
As an adversary, I wouldn't take down the bridge but either monitor the traffic to it ($country can also do this on its border gateways), or modify it to tell me the connecting IP addresses.
End users tend to be on dynamic IP addresses, so stored IP addresses aren't worth much when you don't know when they were used; that is a reason why $adversary might be more interested in snooping than in compromising the bridge.
(Although I don't know how prevalent changing IP addresses still are when you're online permanently. E.g. here in Germany, Telekom is changing to all-IP, and then no longer disconnects after 24h, so you don't change IPs every day.)
...
present in the set. The feature of this bitmap is that collisions could happen. And this collision creates deniability. When collisions happen,
The problem is that for the accounting purposes you don't want (too many) collisions, and also that state agencies don't necessarily care for plausible deniability - if an IP address is found by enumeration and probing the bloom filter they might still decide to put that user on closer watch. (I've heard that a lot of the traditional telephone tapping isn't used as evidence in court but produces leads to where to investigate next.)
On the other hand, you can indeed keep the filter rather small, because one bridge doesn't get that many connections, and you don't need to make it anywhere near big enough to avoid collisions with 2^32 entries. It could also be dynamically sized depending on the number of clients seen - you need aging anyway, so the next table can have a different size.
You can also go and poison the bloom filter with some random addresses, even a lot, actually. If we're talking of 2000 users you can easily throw in another 2000 random addresses without decreasing the precision of the statistics much - only on a size comparable to collisions in the bloom filter itself.
- Andreas
Hi Andreas,
On Saturday 18 March 2017 10:06 AM, Andreas Krey wrote:
As an adversary, I wouldn't take down the bridge but either monitor the traffic to it ($country can also do this on its border gateways), or modify it to tell me the connecting IP addresses.
Absolutely correct. In fact I mentioned this toward the end of the proposal.
Or even better, why would the adversary need that random value when she can simply log all network connections coming into the (compromised) system?
But I think our users don't expect us to profile them, so I believe it is bad practice for any anonymity system to keep a list of something as important as IP addresses.
But there's a thing to be happy about: if we replace the list of IP addresses with something better, we might be able to prevent the adversary from learning the IP addresses that connected in the past. Suppose an adversary (say, a government) suspects someone used a bridge to connect to the Tor network. They would only be able to know that for sure if they logged that activity there and then; this means they would not be able to de-anonymize users retrospectively. Of course, if the user connects again from the same IP while the adversary is monitoring the relay/bridge, that might cause a problem.
End users tend to be on dynamic IP address, so stored IP addresses aren't of much worth when you don't know when they were used; that is a reason why $adversary might be more interested in snooping than in compromising the bridge.
(Although I don't know how prevalent changing IP addresses still are when you're online permanently. E.g. here in germany telekom changes to all-ip, and there no longer disconnects after 24h, and thus you don't change IPs every day.)
...
present in the set. The feature of this bitmap is that collisions could happen. And this collision creates deniability. When collisions happen,
The problem is that for the accounting purposes you don't want (too many) collisions, and also that state agencies don't necessarily care for plausible deniability - if an IP address is found by enumeration and probing the bloom filter they might still decide to put that user on closer watch. (I've heard that a lot of the traditional telephone tapping isn't used as evidence in court but produces leads to where to investigate next.)
On the other hand side you can indeed keep the filter rather small because one bridge doesn't get that many collisions, and you don't need to make it anywhere as big as to avoid collision with 2^32 entries. Could also be dynamically sized depending on the number of clients seen
- you need aging anyway, so the next table can have a different size.
I feel that this isn't a nice solution. Suppose you have 10 cells and 3 hash functions at the beginning. Later, when the inputs exceed a threshold, you increase the size of the Bloom filter to 20 cells. Now those 3 hash functions map over the whole new range, which means the inputs that were mapped into 10 cells now map to something completely different. Hence the error rate would be, I guess, close to 100%.
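The remapping concern can be demonstrated numerically; the SHA-256-slice hash functions here are illustrative, matching nothing in any real implementation:

```python
import hashlib

def positions(item, m, k=3):
    """k bit positions for `item` in a filter of m cells (illustrative
    SHA-256-slice hash functions, as in a toy Bloom filter)."""
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m
            for i in range(k)]

# Grow the filter from 10 to 20 cells and count how many of 50 addresses
# would land in different positions after the resize.
ips = ["192.0.2.%d" % i for i in range(50)]
moved = sum(positions(ip, 10) != positions(ip, 20) for ip in ips)
```

Each position survives the resize only when the hash value mod 20 is below 10, so with 3 hash functions roughly 7/8 of entries move; queries against bits set under the old size then fail, which is the near-100% error rate described above.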
You can also go and poison the bloom filter with some random addresses, even a lot, actually. If we're talking of 2000 users you can easily throw in another 2000 random addresses without decreasing the precision of the statistics much - only on a size comparable to collisions in the bloom filter itself.
- Andreas
Regards, Jaskaran
On Tue, 21 Mar 2017 16:06:59 +0000, Jaskaran Singh wrote: ...
On the other hand side you can indeed keep the filter rather small because one bridge doesn't get that many collisions, and you don't need to make it anywhere as big as to avoid collision with 2^32 entries. Could also be dynamically sized depending on the number of clients seen
- you need aging anyway, so the next table can have a different size.
I feel that this isn't a nice solution. Suppose you have 10 cells and 3 hash functions at the beginning. Later when inputs exceed a threshold, you increase the size of bloom filter to make it to 20 cells.
10 is way too low for any population (if 'cell' means 'bit'); I played with a bit of code for that.
3 hash functions would map to the whole range which means the inputs that were mapped to 10 cells would now map to something completely different. Hence, error rate would be, I guess exactly 100%.
No, basically you need to retain the old bloom filter while seeding the new one for the amount of time of your entry timeout.
(And given the other discussion regarding brute-forcing the 2^32 space, either you need a really time-consuming hash, or you count on the pre-seeding with random IP addresses as the only means to cover the real ones.)
I wonder what would happen if you implement the aging by just randomly clearing bits in the bloom filter at an appropriate rate.
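That aging idea can be sketched as follows; the clearing probability per aging interval is an arbitrary illustration:

```python
import random

def age_filter(bits, clear_prob, rng=random):
    """Age a Bloom filter in place by clearing each set bit with probability
    clear_prob (an arbitrary rate chosen per aging interval). Old entries
    decay geometrically, at the cost of false negatives for recent ones."""
    for i, bit in enumerate(bits):
        if bit and rng.random() < clear_prob:
            bits[i] = 0
    return bits
```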
Andreas
Hi Jaskaran,
On 17 Mar 2017, at 23:42, Jaskaran Singh jvsg1303@gmail.com wrote:
§2. Research
There are three ways to solve this problem. All the three ways are actually Big Data Algorithms. A few problems arises for each of these algorithms since they are made for big data but the data we would provide is not necessarily big.
As we discussed on IRC, there's a simpler way to solve the problem of storing IP addresses in memory: store a (keyed) hash of the IP address instead.
The hash can be tuned to be sufficiently hard to make brute-forcing impractical. (As long as each 'country' has sufficiently many IP addresses. And as long as the threat model excludes adversaries which only want to confirm a few addresses.)
The key can be rotated at a suitable interval, ensuring that past addresses can not be discovered by future attackers.
Noise can be added to the hash if we wish.
And there's no need for a correction factor: the hash is an exact mapping.
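A minimal sketch of such a keyed hash, using BLAKE2b's keyed mode; the digest size and key handling are illustrative choices, not part of any proposal text:

```python
import hashlib
import os

def keyed_ip_hash(ip, key):
    """Keyed hash of an IP address using BLAKE2b's keyed mode. Without the
    key, stored digests cannot be brute-forced over the 2^32 address space;
    rotating and destroying the key makes past entries unrecoverable.
    The 16-byte digest size is an illustrative choice."""
    return hashlib.blake2b(ip.encode(), key=key, digest_size=16).hexdigest()

key = os.urandom(32)  # rotated at a suitable interval, then destroyed
entry = keyed_ip_hash("198.51.100.7", key)
```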
Including a hash-based scheme would make the proposal more comprehensive. It could also help justify the complexity of the other schemes in terms of the benefits they provide over and above a simple hash (if any).
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B ricochet:ekmygaiu4rzgsk6n xmpp: teor at torproject dot org ------------------------------------------------------------------------
teor teor2345@gmail.com writes:
Hi Jaskaran,
On 17 Mar 2017, at 23:42, Jaskaran Singh jvsg1303@gmail.com wrote:
§2. Research
There are three ways to solve this problem. All the three ways are actually Big Data Algorithms. A few problems arises for each of these algorithms since they are made for big data but the data we would provide is not necessarily big.
As we discussed on IRC, there's a simpler way to solve the problem of storing IP addresses in memory: store a (keyed) hash of the IP address instead.
The hash can be tuned to be sufficiently hard to make brute-forcing impractical. (As long as each 'country' has sufficiently many IP addresses. And as long as the threat model excludes adversaries which only want to confirm a few addresses.)
Hey teor,
I feel like replying to this thread is basically moot without a precise threat model, since it seems like each person here has a different threat model in mind. There are many applicable attacks and scenarios, and it's unclear which ones we are trying or hoping to cover.
For example, I'm confused about how a keyed hash is better than a simple hash (or scrypt) of the IP address (which most people agree is not a good solution). That is, if an attacker pwns a box and steals the hash key and the list of hashes, the attacker can easily brute-force the 2^32 keyed hashes at that point and derive the list of connected IP addresses. What's the difference?
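A back-of-the-envelope version of this brute-force argument; both rates are assumed figures for illustration, not measurements:

```python
# Brute-forcing the IPv4 space once the key (if any) is stolen.
SPACE = 2 ** 32                  # the whole IPv4 address space

fast_rate = 10 ** 9              # ~1e9 hashes/s for a fast keyed hash (assumed)
slow_rate = 10                   # ~10 evaluations/s for a memory-hard KDF (assumed)

hours_fast = SPACE / fast_rate / 3600            # seconds of work, essentially
years_slow = SPACE / slow_rate / (86400 * 365)   # over a decade
```

With a fast hash the key adds essentially nothing once it is stolen, which is the point above; only a deliberately slow hash changes the economics of enumeration.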
The key can be rotated at a suitable interval, ensuring that past addresses can not be discovered by future attackers.
Hmm, is this attack in the threat model? And how does it work exactly?
Also how does salted hash vs normal hash make a difference here, since even in the normal hash variant, if you memwipe/rotate the hash list everyday, an attacker should not be able to discover past addresses.
All in all, I think it will be hard to choose a scheme here without a precise threat model, and I don't see the original proposal making an attempt to define that.
Greetings from Amsterdam!
Hi,
So here's the updated part of the proposal.
------------
§ Threat model & Security Considerations
Consider the adversary with the following powers:
- Has sufficient computational and storage power to brute-force any method that can be brute-forced.
- Can gain recurrent control of the concerned guard node/bridge.
- Can interact with the data structure that stores the unique IP addresses/hash values/Bloom filter/bitmaps etc.
- Can also log incoming connections and IP addresses outside the realm of Tor (i.e. at the system level, at gateways, etc.).
- Can manipulate incoming connections with made-up IP addresses so as to observe the behaviour of our proposed solution.
- As a consequence of the previous power, can also inject patterns of IP addresses to look for corresponding patterns in the stored data structure.
An ideal solution would not involve hashing, or if it does, it would manipulate the hash before storing it in such a way that the adversary cannot learn the IP addresses even with a brute-force attack.
An ideal solution would also not help the adversary observe any pattern in the stored data structure. This could be accomplished by incorporating a salted hash (or a variation of it) into the proposed solution, with the salt changed every time we start tracking unique IP addresses anew.
There is a fundamental limitation to what we can do: we cannot stop an adversary from learning IP addresses at the system level, at gateways, etc. But the thing to cheer about is that, this way, the adversary cannot learn about users retrospectively.
------------
Regards, Jaskaran
This looks plausible to me. As a formatting note: we usually don't give the non-proposed solutions equal treatment with the proposed solutions. Instead, we usually mark them as not proposed. From this document, I _think_ you're proposing PCSA, and suggesting that we not do the others? But I'm not sure.
For the PCSA portion, we should really have citations to the actual papers where these formulas and algorithms come from.
We should also consider how this proposal would interact with other proposed secure aggregation solutions, like Privcount [1] and/or some other kind of PrivEx [2]. I'd like to hear what the designers of those ideas think of this one.
[1] https://github.com/privcount/privcount [2] http://cacr.uwaterloo.ca/techreports/2014/cacr2014-08.pdf
We should also consider how this proposal would interact with other proposed secure aggregation solutions, like Privcount [1] and/or some other kind of PrivEx [2]. I'd like to hear what the designers of those ideas think of this one.
As you know, PrivCount is a secure aggregation system. In particular, it allows you to count some value (i.e. compute some sum) across relays and over time such that the only output anyone learns is a noisy (and differentially-private) version of the sum.
As you also know, PCSA allows one to locally store a count of unique items in a way that provides some privacy if the machine is compromised. PrivCount isn't designed to support such counting of unique items, but it does provide such "forward privacy" (actually it provides perfect, information-theoretic privacy).
So, it seems that PCSA and PrivCount do similar but different counting and provide similar but different kinds of forward privacy. These ideas could be combined to provide some combination of their functionality and privacy features.

To do so, have each relay maintain an FM sketch as a sequence of PrivCount counters (recall that these counters are blinded and thus appear random at any given time). Represent a 0 in each counter by storing (in a blinded way) 0, and represent 1 by storing (in a blinded way) a random value (with b bits in the counter it is non-zero with probability 1-2^(-b)). Given a new item, a relay turns bits from 0 to 1 in the sketch by adding a random value. The PrivCount aggregation protocol would reveal each bit of the FM sketch aggregated across all relays, which would thus reveal the PCSA count of unique IPs seen across all relays.

Note that this output would not be differentially private - it isn't obvious how FM sketches can support differential privacy with any accuracy, as some single user does cause each bit to flip to one. The output would provide whatever privacy FM sketches provide, and this idea would support the privacy enhancement of flipping random bits proposed by Tschorsch and Scheuermann.
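A toy sketch of the bit encoding described above, using two-party additive secret sharing as a stand-in for PrivCount's blinded counters; the modulus and the two-share split are simplifications, not the actual PrivCount protocol:

```python
import random

Q = 2 ** 61 - 1  # toy modulus; PrivCount's actual counter arithmetic differs

def blind(value, rng):
    """Split `value` into two additive shares mod Q: each share alone
    looks uniformly random, as a blinded counter should."""
    r = rng.randrange(Q)
    return r, (value - r) % Q

def set_fm_bit(rng):
    # A 1 bit of the FM sketch is stored as a blinded random value
    # (non-zero with overwhelming probability).
    return blind(rng.randrange(1, Q), rng)

def clear_fm_bit(rng):
    # A 0 bit is stored as a blinded zero.
    return blind(0, rng)

def reveal(shares):
    # Aggregation: summing the shares mod Q recovers zero vs. non-zero,
    # i.e. whether the sketch bit is set.
    return sum(shares) % Q
```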
This scheme would provide better forward privacy than PCSA alone, because PrivCount would secure the counters perfectly during the measurement period. It would provide better unique counting than per-relay PCSA because it would be counting uniquely across all relays. Again, it wouldn’t provide the differential privacy guarantee of PrivCount, but it would provide whatever privacy is provided by PCSA.
This is an idea I had considered a while ago but gave up on because of the lack of a differential-privacy guarantee for the output and my general lack of confidence in the privacy provided by PCSA. However, it may well be an improvement on what Tor is doing currently :-)
Best, Aaron