Hi Daniel,
This is indeed a very important problem that we have also been facing in our analysis, and we have some ideas on how to address it.
Originally the http_requests test worked much better than it does now. This was mainly because not many people used Cloudflare when we started, but as the adoption of Cloudflare increased, so did the unreliability of the ooni-probe test.
I think we have now reached a point where it doesn’t really make much sense to perform that control request over Tor, since in most cases it will lead to a false positive, and even when it doesn’t we will probably already have to eliminate some false positives from that report.
For this reason we are considering removing the Tor aspect from the http_requests test and relying solely on control measurements that are done from a network vantage point we know not to be censored.
The idea is to have daily measurements, taken from one of our control machines, for all the URLs users are interested in testing. In the future we could even allow the client to request that a certain URL be tested. The control machine would check whether it has a fresh result and, if so, serve it; if not, it would perform a new request.
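To make the "serve a fresh result, otherwise perform a request" idea a bit more concrete, here is a rough sketch. It is not ooni code: the class name, the `fetch` callable, and the one-day freshness window are all assumptions I am making for illustration.

```python
import time

# Assumption: a result counts as "fresh" if it is less than a day old.
FRESHNESS_SECONDS = 24 * 60 * 60


class ControlCache:
    """Serve a cached control result if it is fresh, otherwise fetch a new one."""

    def __init__(self, fetch, max_age=FRESHNESS_SECONDS, clock=time.time):
        self._fetch = fetch      # hypothetical callable: url -> control result
        self._max_age = max_age  # maximum age in seconds for a result to be served
        self._clock = clock      # injectable clock, which makes testing easier
        self._results = {}       # url -> (timestamp, result)

    def get(self, url):
        now = self._clock()
        cached = self._results.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]     # fresh enough: serve the stored result
        result = self._fetch(url)  # missing or stale: perform a new request
        self._results[url] = (now, result)
        return result
```

A client asking for a URL would then just call `cache.get(url)` and not care whether the result was cached or newly measured.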
Unfortunately this plan will require a bit of development effort, so I can’t tell you exactly when it will be ready. However, I can offer 2 concrete ways that you can make use of logic along these lines today.
1) Use existing uncensored results
Currently we get measurements every week from around 10-50 distinct network vantage points. The majority of these vantage points are inside countries that do little or no censorship. You could use the HTTP API to get the IDs of measurements from countries that are known not to block the site you are interested in testing. We shall soon add support to the HTTP API (which is also not yet released, but that too will happen very soon) for querying by input.
So you could say:
“Give me the measurements for URL http://google.com/ in the time period of 2015-03 from the countries US, NL”
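Since the HTTP API is not released yet, I can't show a real query, but the selection logic could look roughly like the sketch below once you have measurement records locally. The field names (`id`, `input`, `start_time`, `probe_cc`) and the function name are assumptions for illustration, not the API's actual schema.

```python
def select_control_measurements(measurements, url, month, countries):
    """Return measurements for `url` taken in `month` (YYYY-MM) from `countries`.

    Assumes each measurement is a dict with hypothetical keys:
    "input" (the tested URL), "start_time" (ISO 8601 string) and
    "probe_cc" (two-letter country code).
    """
    wanted = {c.upper() for c in countries}
    return [
        m for m in measurements
        if m["input"] == url
        and m["start_time"].startswith(month)
        and m["probe_cc"].upper() in wanted
    ]
```

For the example above you would call it with `url="http://google.com/"`, `month="2015-03"` and `countries=["US", "NL"]`.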
2) Use whitelist-based matching
Another strategy would be to use the data David is extracting to determine which URLs are a sign of a false positive (because they are Cloudflare CAPTCHA pages) and eliminate them from your results.
This will perhaps be more accurate in the specific case of Cloudflare and Google CAPTCHAs, but may not work as well with other, newer CAPTCHA technologies.
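In code, that filtering step could be as simple as the sketch below. I am assuming here that David's data boils down to a list of URLs known to serve CAPTCHA pages, and that each of your results carries an `input` URL and a `blocked` flag; both field names and the function name are hypothetical.

```python
def drop_known_false_positives(results, captcha_urls):
    """Remove results flagged as blocked whose URL is a known CAPTCHA source.

    `results` is a list of dicts with hypothetical keys "input" and "blocked";
    `captcha_urls` is an iterable of URLs known to serve CAPTCHA pages.
    """
    known = set(captcha_urls)
    return [
        r for r in results
        # Keep everything except "blocked" verdicts on known CAPTCHA URLs.
        if not (r["blocked"] and r["input"] in known)
    ]
```

Results that are not flagged as blocked are left alone, so the list only ever suppresses suspected false positives.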
I hope this helps you out a bit. I am more than happy to help anybody who is interested in developing the HTTP request control test helper.
Have fun!
~ Arturo
On Jun 21, 2015, at 6:04 PM, Daniel Ramsay daniel@dretzq.org.uk wrote:
That's very interesting - I'm working on a project that aims to provide real-time blocking detection using ooni-probe's http_requests test, and the false-positives caused by TOR-blocking are a big problem there.
Being able to identify TOR-blocking would definitely help us with the elimination of false-positives, although I was also meaning to ask the list if there are any other tests that you'd recommend to use in combination with http_requests to improve results accuracy?
Many thanks,