Hi Karsten,
I'm working on https://bugs.torproject.org/9316, which will make BridgeDB export usage statistics. I would like these statistics to be public, privacy-preserving, and -- ideally -- added to Tor Metrics. I wanted to hear your thoughts on 1) what statistics we should collect, 2) how we can collect these statistics safely, and 3) what format these statistics should have.
Broadly speaking, these statistics should answer the following questions:
* How many requests does BridgeDB see per day?
* What obfuscation protocols are the most popular?
* What bridge distribution mechanisms are the most popular?
* From what countries do we see the most bridge requests?
* How many BridgeDB requests fail and succeed, respectively?
* How many requests does BridgeDB see from Yahoo/Gmail/Riseup?
* How many HTTPS requests are coming from proxies?
* How many requests are suspicious, and likely issued by bots?
Each request to BridgeDB carries with it some information, which allows us to answer the above questions. I suggest that we collect the following:
* The distribution mechanism. Currently, this is HTTPS, email, or Moat.
* The requested transport. Currently this is vanilla, fte, obfs3, obfs4, or scramblesuit.
* The request's origin. For Moat and HTTPS, it's the two-letter country code, e.g., IT for Italy. For email, it's the user's email domain (Gmail, Yahoo, or Riseup).
* Whether the request was successful or unsuccessful, i.e., resulted in BridgeDB handing out bridges or not.
* Whether the request was issued by a user or a bot. David suggested heuristics that would allow us to estimate whether a request came from a bot (https://bugs.torproject.org/9316#comment:19). I like these suggestions, but I'm not sure yet how to encode them -- it's more complex than a simple binary flag.
The combination of these statistics results in ~16,800 buckets (3 mechanisms * 5 transports * ~280 ISO country codes * 2 success states * 2 bot states). We only need to export statistics with non-empty buckets. To protect users whose request is the only one in a given bucket (e.g., there may be only one user in Turkmenistan who successfully requested an FTE bridge over HTTPS on 2019-04-02), we should bin the statistics by rounding them up to the next multiple of, say, 10. We should further export statistics infrequently -- maybe once a day.
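To make the rounding-up step concrete, here is a minimal sketch in Python (the function name and default bin size are just illustrative):

    def bin_count(count, bin_size=10):
        # Round a positive count up to the next multiple of bin_size,
        # e.g. 1..10 -> 10, 11..20 -> 20. A count of zero stays zero
        # (empty buckets are not exported anyway).
        if count <= 0:
            return 0
        return ((count + bin_size - 1) // bin_size) * bin_size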
Here's an example of a simple CSV format that takes into account the above:
timestamp,mechanism,transport,country|domain,success,count,origin
1555977600,https,vanilla,it,successful,40,user
1555977600,https,obfs4,ca,unsuccessful,10,user
1555977600,email,vanilla,yahoo.com,successful,50,user
...
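For illustration, here is a rough sketch of how BridgeDB could aggregate requests into these buckets and emit the daily CSV; all function and variable names are hypothetical:

    import csv
    import sys
    from collections import Counter

    # (mechanism, transport, country_or_domain, success, origin) -> raw count
    buckets = Counter()

    def record_request(mechanism, transport, country_or_domain, success, origin):
        # Called once per incoming BridgeDB request.
        buckets[(mechanism, transport, country_or_domain, success, origin)] += 1

    def export_csv(timestamp, out=sys.stdout, bin_size=10):
        # One row per non-empty bucket, with counts rounded up to bin_size.
        writer = csv.writer(out)
        writer.writerow(["timestamp", "mechanism", "transport",
                         "country|domain", "success", "count", "origin"])
        for key, count in sorted(buckets.items()):
            mechanism, transport, country_or_domain, success, origin = key
            binned = ((count + bin_size - 1) // bin_size) * bin_size
            writer.writerow([timestamp, mechanism, transport, country_or_domain,
                             success, binned, origin])

In practice, the HTTPS, email, and Moat frontends would call record_request(), and a daily job would call export_csv() and then reset the counters.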
What are your thoughts?
Thanks,
Philipp
Hi Philipp, Karsten,
On 24 Apr 2019, at 10:50, Philipp Winter <phw@nymity.ch> wrote:
> I'm working on https://bugs.torproject.org/9316, which will make BridgeDB
> export usage statistics. I would like these statistics to be public,
> privacy-preserving, and -- ideally -- added to Tor Metrics. [...]
>
> What are your thoughts?
Over the next few months, Nick and I are going to work on PrivCount for statistics generated by tor relays and bridges. (I'll be on leave from today until late May.)
We haven't done the detailed design of PrivCount's API yet.
For Tor relay/bridge statistics, we'll have some Rust code embedded in the tor binary (Data Collectors), which will add noise, bin, and blind the statistics.
Then we'll have some aggregation servers (Tally Reporters) which will aggregate and un-blind the results.
If we design the interfaces correctly, we should be able to re-use the noise and bin code for BridgeDB. (The blinding is redundant until we have more than one BridgeDB.)
I imagine we could pass results to a command-line tool for noise and binning. This tool would also be useful for tests. (Tests are *so* much simpler when there's no network in the middle.)
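To sketch what such a tool could look like: a small filter that reads "label,count" lines on stdin, adds Laplace noise, rounds the result up to a bin, and writes it to stdout. The flag names, default noise scale, and input format here are made up for illustration; the real PrivCount noise design may well differ.

    #!/usr/bin/env python3
    import argparse
    import math
    import random
    import sys

    def laplace_noise(scale):
        # The difference of two i.i.d. exponentials is Laplace(0, scale).
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def round_up(value, bin_size):
        # Round up to the next multiple of bin_size; clamp negatives to zero.
        return max(0, math.ceil(value / bin_size)) * bin_size

    def main():
        parser = argparse.ArgumentParser(description="Add noise to counts and bin them.")
        parser.add_argument("--scale", type=float, default=5.0)
        parser.add_argument("--bin-size", type=int, default=10)
        args = parser.parse_args()
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            label, count = line.rsplit(",", 1)
            noisy = float(count) + laplace_noise(args.scale)
            print("%s,%d" % (label, round_up(noisy, args.bin_size)))

    if __name__ == "__main__":
        main()

BridgeDB could then pipe its per-bucket counts through a filter like this before publishing them, and the same filter could be exercised directly in tests.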
That way, all of Tor's relay, bridge, and BridgeDB statistics will be noised, binned, and reported in the same way.
I'm not sure if the timeframes will work out though: I'll be doing the noise and binning when I get back at the end of May.
So we might need to do something quick and dirty until then.
T