Hi Karsten,
I'm working on https://bugs.torproject.org/9316, which will make BridgeDB export usage statistics. I would like these statistics to be public, privacy-preserving, and -- ideally -- added to Tor Metrics. I wanted to hear your thoughts on 1) what statistics we should collect, 2) how we can collect these statistics safely, and 3) what format these statistics should have.
Broadly speaking, these statistics should answer the following questions:
* How many requests does BridgeDB see per day?
* What obfuscation protocols are the most popular?
* What bridge distribution mechanisms are the most popular?
* From what countries do we see the most bridge requests?
* How many BridgeDB requests fail and succeed, respectively?
* How many requests does BridgeDB see from Yahoo/Gmail/Riseup?
* How many HTTPS requests are coming from proxies?
* How many requests are suspicious, and likely issued by bots?
Each request to BridgeDB carries with it some information, which allows us to answer the above questions. I suggest that we collect the following:
* The distribution mechanism. Currently, this is HTTPS, email, or Moat.
* The requested transport. Currently this is vanilla, fte, obfs3, obfs4, or scramblesuit.
* The request's origin. For Moat and HTTPS, it's the two-letter country code, e.g., IT for Italy. For email, it's the user's email domain (Gmail, Yahoo, or Riseup).
* Whether the request was successful or unsuccessful, i.e., resulted in BridgeDB handing out bridges or not.
* Whether the request was issued by a user or a bot. David suggested heuristics (https://bugs.torproject.org/9316#comment:19) that would allow us to estimate whether a request came from a bot. I like these suggestions, but I'm not sure yet how to encode them -- it's more complex than a simple binary flag.
The combination of these statistics results in ~16,800 buckets (3 mechanisms * 5 transports * ~280 ISO country codes * 2 success states * 2 bot states). We only need to export statistics with non-empty buckets. To protect users whose request is the only one in a given bucket (e.g., there may be only one user in Turkmenistan who successfully requested an FTE bridge over HTTPS on 2019-04-02), we should bin the statistics by rounding them up to the next multiple of, say, 10. We should further export statistics infrequently -- maybe once a day.
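To make the binning concrete, here is a minimal Python sketch. The function names, the tuple layout of a bucket, and the choice to leave exact multiples of 10 unchanged are my assumptions for illustration, not existing BridgeDB code:

```python
from collections import Counter

BIN_SIZE = 10  # assumed bin width, as suggested above


def round_up_to_bin(count, bin_size=BIN_SIZE):
    """Round a positive count up to the next multiple of bin_size.

    Assumption: a count that already is a multiple stays unchanged
    (10 -> 10, 11 -> 20).
    """
    return ((count + bin_size - 1) // bin_size) * bin_size


def bin_requests(requests, bin_size=BIN_SIZE):
    """Count requests per bucket and bin every non-empty bucket.

    Each request is a hypothetical tuple:
    (mechanism, transport, country or domain, success state, user or bot).
    Empty buckets never appear because Counter only holds observed keys.
    """
    raw = Counter(requests)
    return {bucket: round_up_to_bin(n, bin_size) for bucket, n in raw.items()}


# One lone request from Turkmenistan plus 41 email requests.
requests = [("https", "fte", "tm", "successful", "user")]
requests += [("email", "vanilla", "yahoo.com", "successful", "user")] * 41

binned = bin_requests(requests)
```

With this, the single request from Turkmenistan is reported as 10, and the 41 email requests are reported as 50.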
Here's an example of a simple CSV format that takes into account the above:
timestamp,mechanism,transport,country|domain,success,count,origin
1555977600,https,vanilla,it,successful,40,user
1555977600,https,obfs4,ca,unsuccessful,10,user
1555977600,email,vanilla,yahoo.com,successful,50,user
...
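For illustration, here is a rough Python sketch of how already-binned counts could be serialised into this format. The function name export_csv and the dictionary layout are hypothetical, and I'm assuming that the last column ("origin") carries the user/bot label, as in the rows above:

```python
import csv
import io

# Column order taken from the example header above.
FIELDS = ["timestamp", "mechanism", "transport", "country|domain",
          "success", "count", "origin"]


def export_csv(timestamp, binned):
    """Serialise binned bucket counts into the proposed CSV format.

    binned maps a hypothetical bucket tuple
    (mechanism, transport, country or domain, success state, user or bot)
    to its already-binned count.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    # Sort the buckets so that the output is deterministic.
    for bucket, count in sorted(binned.items()):
        mechanism, transport, place, success, requester = bucket
        writer.writerow([timestamp, mechanism, transport, place,
                         success, count, requester])
    return out.getvalue()


example = {
    ("https", "vanilla", "it", "successful", "user"): 40,
    ("https", "obfs4", "ca", "unsuccessful", "user"): 10,
    ("email", "vanilla", "yahoo.com", "successful", "user"): 50,
}
report = export_csv(1555977600, example)
```

Only non-empty buckets are present in the dictionary, so only those end up in the exported file.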
What are your thoughts?
Thanks,
Philipp