Hi,
I'd like to help improve the Tor Censorship Detector. I've read some background material and think I understand the basics of George Danezis' detection algorithm [1, 2].
Is anyone still working on this? Two tickets from a year ago talk about experimenting with various detection algorithms and turning one of them into a standalone utility [3, 4]. Has anything happened since then?
My background: I'm a graduate student at Georgia Tech studying network censorship circumvention and measurement. Although I've met Tor developers on various occasions, I haven't directly contributed to the project; I'd like to change that.
Thanks!
Sam
[1] https://lists.torproject.org/pipermail/tor-dev/2011-September/002923.html
[2] https://metrics.torproject.org/papers/detector-2011-08-11.pdf
[3] https://trac.torproject.org/projects/tor/ticket/3718
[4] https://trac.torproject.org/projects/tor/ticket/4180
On 5/7/13 12:44 AM, Sam Burnett wrote:
> Hi,
Hi Sam,
> I'd like to help improve the Tor Censorship Detector. I've read some background material and think I understand the basics of George Danezis' detection algorithm [1, 2].
Great! Trivial nitpick: here's a better URL for George's tech report:
https://research.torproject.org/techreports/detector-2011-09-09.pdf
> Is anyone still working on this? Two tickets from a year ago talk about experimenting with various detection algorithms and turning one of them into a standalone utility [3, 4]. Has anything happened since then?
I don't think that anyone made progress on the detection algorithm or tool. We're still running this code:
https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-2718
What did change, however, is that we'll soon have better input data for a new detection algorithm available:
https://metrics.torproject.org/csv/userstats.csv
This file contains by-country statistics for directly connecting users and for bridge users. Here are the header line and the first five rows, to give you an idea (a small parsing sketch follows the column descriptions):
date,node,country,transport,version,frac,users
2013-05-06,relay,,,,22,798301
2013-05-06,relay,??,,,22,10045
2013-05-06,relay,a1,,,22,692
2013-05-06,relay,a2,,,22,204
2013-05-06,relay,ad,,,22,162
- The date column is, well, the ISO 8601 date.
- The node column contains either 'relay' or 'bridge'.
- The country column contains either the empty string for all countries or the ISO 3166 two-letter lower-case country code plus some MaxMind-specific codes plus '??' for unknown.
- You can safely ignore the transport and version columns for the moment. These are for pluggable transport users and for users by IP version. In the future it may be interesting to see sudden changes in usage by transport, but so far these values are not stable enough.
- You can also ignore the frac column. It says what fraction of relays or bridges we're basing our estimate on, from 0 to 100. A value of 10 should be sufficient for the censorship detector, because we want it to warn as early as possible.
- The users column is the estimated number of users.
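If it helps, here is a minimal Python sketch of how one might extract a per-country time series from this file (the helper name and the country code are just examples, and the format may still change):

    import csv

    # Minimal sketch: extract a per-country time series of estimated
    # users from userstats.csv. The country code 'ir' is just an
    # example; rows with transport/version breakdowns are skipped.
    def load_series(path, node='relay', country='ir', min_frac=10):
        series = {}
        with open(path) as f:
            for row in csv.DictReader(f):
                if (row['node'] == node
                        and row['country'] == country
                        and not row['transport']
                        and not row['version']
                        and int(row['frac'] or 0) >= min_frac):
                    series[row['date']] = int(row['users'] or 0)
        return series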
If you want to learn more about how we compute these estimates, here are the code and the tech report that the code is based on:
https://trac.torproject.org/projects/tor/ticket/8462
https://research.torproject.org/techreports/counting-daily-bridge-users-2012...
Just keep in mind that this is still work in progress.
> My background: I'm a graduate student at Georgia Tech studying network censorship circumvention and measurement. Although I've met Tor developers on various occasions, I haven't directly contributed to the project; I'd like to change that.
Cool! Let me know if I can give you any more details or provide any assistance. Thanks for working on the censorship detector!
Best,
Karsten
> Thanks!
> Sam
Sam Burnett <sam.burnett@gatech.edu> writes:
> Hi,
> I'd like to help improve the Tor Censorship Detector. I've read some background material and think I understand the basics of George Danezis' detection algorithm [1, 2].
> Is anyone still working on this? Two tickets from a year ago talk about experimenting with various detection algorithms and turning one of them into a standalone utility [3, 4]. Has anything happened since then?
> My background: I'm a graduate student at Georgia Tech studying network censorship circumvention and measurement. Although I've met Tor developers on various occasions, I haven't directly contributed to the project; I'd like to change that.
Hi there,
this project has been pretty stale lately indeed.
Some random thoughts:
* George Danezis started designing another censorship anomaly detection algorithm, but he never implemented it.
He did not like how the first detection algorithm triggered multiple times after a censorship event was reported (since it used a 7-day delta, all subsequent days after censorship deviated from the week before). Furthermore, he did not like that the algorithm only considered clients from a week ago; maybe a day ago would be a better model in some cases.
I think I have an email with his ideas on the new model; I will ask George whether I can send it to you.
* IIRC, the current model compares the rate of change of the rate of clients of a jurisdiction (between t_i and t_{i-1}) with the same value in other ("stable") jurisdictions.
This means that the current model only cares about the trends of a jurisdiction relative to the "global trends". While this makes sense, it might also be useful to compare the current parameters of a jurisdiction with past (old) parameters from the same jurisdiction (for example, see how this week's rate of change of the number of clients compares with that of two months ago). A toy sketch of the cross-jurisdiction comparison follows at the end of these notes.
* You can find daily results of the algorithm here: https://lists.torproject.org/pipermail/tor-censorship-events/
As you can see, the signal-to-noise ratio is pretty low. Maybe there are heuristics that we could use to weed out useless events?
* If we get better at anomaly detection, it would be fun to use our algorithms to find anomalies in various properties of the Tor network. For example, finding anomalies in the values of the relay flag thresholds might help us detect attacks on the network. Related Tor tickets:
https://trac.torproject.org/projects/tor/ticket/8164
https://trac.torproject.org/projects/tor/ticket/8151
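To make the second thought concrete, here is a toy Python sketch (mine, not the deployed task-2718 code) of the cross-jurisdiction comparison: flag a jurisdiction whose latest day-over-day change falls far outside what the "stable" jurisdictions saw. The names and the threshold are illustrative:

    import numpy as np

    # Toy sketch, not the deployed code: `counts` maps a country code
    # to an array of daily user counts; `stable` lists jurisdictions
    # we do not expect to censor. Threshold `z` is illustrative.
    def looks_anomalous(counts, country, stable, z=3.0):
        def latest_change(series):
            s = np.asarray(series, dtype=float)
            return (s[-1] - s[-2]) / s[-2]  # relative change t_{i-1} -> t_i
        change = latest_change(counts[country])
        pool = np.array([latest_change(counts[c]) for c in stable])
        return abs(change - pool.mean()) > z * pool.std()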
George Kadianakis <desnacked@riseup.net> writes:
> Sam Burnett <sam.burnett@gatech.edu> writes:
>> Hi,
>> I'd like to help improve the Tor Censorship Detector. I've read some background material and think I understand the basics of George Danezis' detection algorithm [1, 2].
>> Is anyone still working on this? Two tickets from a year ago talk about experimenting with various detection algorithms and turning one of them into a standalone utility [3, 4]. Has anything happened since then?
>> My background: I'm a graduate student at Georgia Tech studying network censorship circumvention and measurement. Although I've met Tor developers on various occasions, I haven't directly contributed to the project; I'd like to change that.
> Hi there,
> this project has been pretty stale lately indeed.
> Some random thoughts:
> * George Danezis started designing another censorship anomaly detection algorithm, but he never implemented it.
> He did not like how the first detection algorithm triggered multiple times after a censorship event was reported (since it used a 7-day delta, all subsequent days after censorship deviated from the week before). Furthermore, he did not like that the algorithm only considered clients from a week ago; maybe a day ago would be a better model in some cases.
> I think I have an email with his ideas on the new model; I will ask George whether I can send it to you.
With permission from George, I post his ideas for the second version of the censorship detector. He told me I should mention that the results of the second version were not much better than the results of the first version.
Inlining:
Subject: v2 model for censorship detection
Date: Sat, 24 Sep 2011 19:16:44 +0000
From: George Danezis <gdane@microsoft.com>
Hello Karsten (and Roger),
Following our discussions about the current detector, data feeds, and estimation of clients, I tried to improve our existing model using the data feeds already available. We can see how this new model addresses the issues of quality of detection, and then think carefully about whether improving the client estimation from the directory authorities will improve it (since counting connections seems more complex than I had originally understood - thanks for the report explaining the intricacies).
In a nutshell, I used some of the principles we have used before - namely that the current number of clients is related to the number of clients at some point in the past, and that deviation in a particular country from global trends is a noteworthy event. Yet, our previous model was simple-minded: (a) it only considered a global fixed previous number of clients (that we had set to be a week earlier to accommodate periodic effects from South Korea and Taiwan); (b) once censorship was detected, multiple events were reported (about 7), since all subsequent days after censorship deviated from the week before. The new model addresses these two issues: it tries to fit the data to the best available model (one day before, or one week before), and then once a censorship event is detected it uses a censorship model for the estimation of new data points. Under the hood it is a Bayesian model, which means that we get probabilities of censorship events, not all-or-nothing alarms. These can be used along with your preference for false positives / false negatives to tune the reporting (a point that Roger made in previous emails).
The detector is based on the following model:
- At each point in time a country sees clients connecting at a certain hidden rate "r". The observation "o" you see is distributed according to a Poisson distribution with this parameter "r". This means that countries with few clients do not report censorship when none of them connect on a certain day - this is part of normal behaviour, and the detector should not be upset by jurisdictions with few clients.
- At any point in time we use one of many models, each model relating the current rate "r" to a rate at a previous time ("r-1" or "r-7"; more previous times can be configured). Which model we use is determined by a parameter "t" which changes between time periods. We assume that these are rather stable, so the probability of changing the model from one time to the next is called "beta", and is set rather low (<= 0.01). We restrict rates to be positive numbers (negative rates make no sense).
- At each point in time we also record whether a censorship event has just occurred, by a Boolean variable "c". This is set according to a prior that expresses the fact that censorship is rare ("alpha" = Pr[c = True] < 0.01). The censorship event is independent from previous censorship events.
- We observe the relative change in volumes of traffic from a selected number of large countries - about 50 - from which we do not expect censorship. Using this data we build, for each of our models, a distribution of the percentage change of traffic from one time to the next. So in practice we now have a distribution that tells us what we expect the rate "r" to be, given the rate the previous day or the week before. These are simply normal distributions over the percent change, with a mean and standard deviation that are estimated directly from the data series.
- The variance of the change observed depends on whether we are under conditions of censorship. We multiply the standard deviation of the normal models by a factor of 4 under non-censorship conditions, and by a factor of 10 under censorship conditions. This variable is called "m". Basically, that says: if we are under honest conditions the variance is small, but if we are under censorship we have little idea what happens.
- Given the current censorship flag "c" and the current model "t", we decide whether we are under censorship conditions by ORing all the previous "t" censorship flags. This ensures that we do not get those trains of events - once we are under censorship, any model that is based on data before the censorship event will see its variance increase like crazy without reporting new events.
- Given a previous rate "r_p" and a sample "v" from this normal distribution model, with the appropriate modification "m" to the variance to account for censorship, the new rate is r = r_p * (1 + v). The previous rate is chosen according to "t" (which model we are using, the day-before model or the week-before model).
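To make the process concrete, here is a toy Python rendition of a single forward step of this generative model (parameter names are mine, and it is much simpler than the attached script):

    import numpy as np

    # Toy rendition of one forward step of the generative model above;
    # parameter names are mine, simpler than the attached script.
    def generate_step(r_day, r_week, t, c_history, mu, sigma,
                      alpha=0.01, beta=0.01, rng=np.random.default_rng()):
        if rng.random() < beta:               # rarely switch models ("t")
            t = 7 if t == 1 else 1
        c = rng.random() < alpha              # censorship events are rare
        censored = c or any(c_history[-t:])   # OR of the previous "t" flags
        m = 10.0 if censored else 4.0         # variance inflation ("m")
        v = rng.normal(mu, m * sigma)         # sampled percent change
        r_prev = r_day if t == 1 else r_week  # day-before or week-before
        r = max(r_prev * (1.0 + v), 1e-6)     # rates must stay positive
        o = rng.poisson(r)                    # observation is Poisson(r)
        return r, o, t, c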
The attached figure summarises the process using a standard graphical model. It depicts the variables at a particular time, and how they relate to other times and model parameters.
All the above is nice, but of course out of all the variables only the observations "o" are known, along with the parameters of the normal distributions relating rates (not observed) between time periods (a week or a day). We set the priors "alpha", "beta", and a fixed "m" as well. All other parameters, the rates "r", the censorship flags "c", the model selection "t" must be inferred for all times.
We infer those parameters using sequential Monte Carlo techniques (also known as particle filters). In a nutshell: you create a population of, say, 100 particles with a reasonable (but not necessarily correct) set of parameters for time 0. At time 1, and subsequent times, you sample new parameters for these particles according to the distributions above: the model selection flag "t" changes with some probability, the censorship flag "c" is activated with some probability, and the rate is the result of sampling the normal distribution according to the model and the censorship conditions so far. This can be done by direct sampling (even though for technical convergence reasons I use importance sampling for "t" and "c"). After a new set of candidate particles is generated in this manner, they are weighted according to the likelihood that they generated the observed data point "o". The final set of particles for the new time is re-sampled according to these weights. The procedure is repeated for the next time, given the particles of the previous time.
Once a set of particles is available for all time periods, you can simply compute the fraction of particles with a censorship event, or more generally under censorship conditions (= a censorship event in the previous "t" times), to get the probability of censorship. I also recommend you plot the fraction of particles under censorship conditions, to prevent one censorship event masking another happening shortly after. Furthermore, some instances of censorship are not acute enough in a single time period to raise an alarm, but evidence of them accumulates as time goes by, through a few censorship particles being selected to overwhelmingly explain future events.
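For illustration, here is a rough sketch of a single filter update along these lines, reusing the generate_step() toy above (again a simplification of the attached script; a full implementation would also carry the last seven rates for the week-before model):

    import numpy as np
    from scipy.stats import poisson

    # Rough sketch of one particle-filter update using generate_step():
    # propagate every particle, weight by the Poisson likelihood of the
    # observation, then resample. The week-old rate is carried along
    # unchanged here, which a real implementation would update.
    def filter_step(particles, observation, mu, sigma,
                    rng=np.random.default_rng()):
        proposed, weights = [], []
        for p in particles:
            r, _, t, c = generate_step(p['r_day'], p['r_week'], p['t'],
                                       p['c_history'], mu, sigma, rng=rng)
            proposed.append({'r_day': r, 'r_week': p['r_week'], 't': t,
                             'c_history': p['c_history'] + [c]})
            weights.append(poisson.pmf(observation, r))
        w = np.asarray(weights) + 1e-300      # guard against zero weights
        w /= w.sum()
        idx = rng.choice(len(proposed), size=len(proposed), p=w)
        resampled = [proposed[i] for i in idx]
        # probability of censorship conditions = fraction of particles
        p_cens = float(np.mean([any(q['c_history'][-q['t']:])
                                for q in resampled]))
        return resampled, p_cens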
The sampling seems to work well, BUT it takes a longer time than the previous detector to provide results. The speed depends on the number of particles used for the estimation - but the fewer particles the coarser / more noisy the results. I would recommend around 100 particles for quick diagnostics, and over 1000 particles for long term stats or jurisdictions that are likely to exhibit censorship. Of course, we might also get speed ups by carefully profiling the code, or re-writing it in C++ (I make extensive use of dynamic data structures for fast prototyping - simplifying those would help).
I attach 6 selected annotated graphs for China, Burma, Libya, and Egypt, where we should observe censorship, as well as Germany and South Korea, where we should not observe censorship (but Korea is weird, as we have discussed). Triangles and percentages indicate censorship events, while pink zones denote a high number of particles being under censorship conditions (20% and 50% according to the intensity of pink).
I also attach the python script and data used to generate the graphs. You will need scipy, numpy and matplotlib to run it - as for the previous script. The function "plot_country" should give an idea how to use it, and the function "get_events" processes the particles to return the statistics used to plot the censorship events.
Let me know what you think,
George