Hi Karsten,
On 14 Sep 2011, at 07:17, Karsten Loesing wrote:
> However, as one can see, George's script also detects quite a few false positives. Whenever there's a red or blue dot, the script would have issued a warning for a human to check. It would be neat to reduce these false warnings while still catching the really suspicious events.
>
> Want to help us make our censorship-detection system better? Any suggestion to improve George's algorithm or to come up with an alternative approach to detect possible censorship events in our data would be much appreciated! Let us know if we can help you get started.
Well, the easiest thing to do would be to change the parameter that decides whether to send out an alert. According to the paper: "We consider that a ratio of connections is typical if it falls within the 99.99 % percentile of the Normal distribution N(m, v) modelling ratios." Maybe 99.995% or 99.999% would be better?
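To make that concrete, here is a rough Python sketch of how the cut-off moves as the percentile changes. The values of m and v below are placeholders, not the parameters George's model actually fits, and I'm assuming a two-sided test:

from scipy.stats import norm

m, v = 1.0, 0.01  # hypothetical mean and variance of the ratio model

for p in (0.9999, 0.99995, 0.99999):
    # Interval containing fraction p of N(m, v); a ratio falling
    # outside it would trigger an alert.
    lo, hi = norm.interval(p, loc=m, scale=v ** 0.5)
    print("p=%.5f: alert if ratio outside [%.4f, %.4f]" % (p, lo, hi))

If the paper's test is actually one-sided, the cut-off would be norm.ppf(p, loc=m, scale=v ** 0.5) instead; either way, raising p widens the "typical" band and so trades missed events for fewer false alarms.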
We could look at the alerts and categorise them into ones which were not censorship events (false positives) and ones which were events that we would like to be alerted about (true positives), and also look for any censorship events which were missed (false negatives). Then, for each event, see how far the ratio of connections diverges from what the model predicts. If the divergence is larger for true positives than for false positives, and there are few false negatives, then the model can be left unchanged and only the alert threshold needs to be raised. If instead the divergences of the two categories are clustered tightly together, so that no threshold separates them, then the model itself would have to be changed.
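Here is a sketch of how that comparison might look, assuming we had a hand-labelled list of alerts; the events list and the model parameters below are made up for illustration:

import statistics

m, v = 1.0, 0.01  # hypothetical model parameters, as above

def z_score(ratio):
    # Standard deviations between an observed ratio and the model mean.
    return abs(ratio - m) / v ** 0.5

# Made-up, hand-labelled alerts: (observed ratio, label).
events = [(0.55, "tp"), (0.40, "tp"), (0.72, "fp"), (0.68, "fp"), (0.75, "fp")]

tp = [z_score(r) for r, label in events if label == "tp"]
fp = [z_score(r) for r, label in events if label == "fp"]
print("true positives:  median z = %.1f" % statistics.median(tp))
print("false positives: median z = %.1f" % statistics.median(fp))

If the true-positive scores sit well above the false-positive ones, a threshold placed between the two clusters would cut false alarms without losing real events; if the scores overlap, no threshold will separate them and the model needs revisiting.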
But do remember this is a very challenging problem. The vast majority of the time, a censorship event has not happened, which means that even an extremely accurate detector will still produce quite a few false positives (perhaps more than true positives). For the reasoning, see http://en.wikipedia.org/wiki/Base_rate_fallacy. It may be that we will have to filter out the false positives manually by asking people in country.
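A back-of-the-envelope calculation shows the effect; all the numbers below are made up purely to illustrate the low base rate:

obs = 100000         # hypothetical number of country-day observations
base_rate = 0.00001  # assume 1 in 100,000 is a real censorship event
sensitivity = 0.99   # detector catches 99% of real events
fp_rate = 0.0001     # fires on 0.01% of normal observations

real = obs * base_rate                 # expected real events: 1
true_alerts = real * sensitivity       # ~0.99
false_alerts = (obs - real) * fp_rate  # ~10.0
share = 100 * true_alerts / (true_alerts + false_alerts)
print("expected true alerts:  %.2f" % true_alerts)
print("expected false alerts: %.2f" % false_alerts)
print("share of alerts that are real: %.0f%%" % share)

Under those assumptions, roughly ten false alarms accompany every real event even with a very accurate detector, which is why some manual confirmation step seems unavoidable.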
Steven.