I want to search OONI reports for cases of Tor exits being blocked by the server (things like the CloudFlare 403 captcha). The http_requests test is great for that because it fetches a bunch of web pages with Tor and without.
The attached script is my first-draft attempt at finding block pages. Its output is at the end of this message. You can see it finds a lot of CloudFlare captchas and other blocks.
First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I ran the script, which does this:
 * Skip the first YAML document, because it's a header.
 * For all other documents:
   * Skip it if it has non-None control_failure or experiment_failure -- there are a few of these.
   * Look for exactly two non-failed requests, one with is_tor:false and one with is_tor:true. Skip it if it lacks these.
   * Classify the blocked status of the is_tor:false and is_tor:true responses. 400-series and 500-series status codes are classified as blocked and all others are unblocked.
   * Print an output line if the blocked status of is_tor:false does not match the blocked status of is_tor:true.
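A simplified sketch of that loop (not the attached script itself; take the exact entry field names here with a grain of salt, they are my reading of the yamloo layout and may need adjusting):

    # Simplified sketch of the steps above; field names like control_failure,
    # requests, tor/is_tor, response/code and input are assumptions about the
    # yamloo entry layout, not copied from the attached script.
    import sys
    import yaml

    def is_blocked(code):
        # 400- and 500-series status codes count as block pages.
        return 400 <= code < 600

    docs = yaml.safe_load_all(open(sys.argv[1]))
    next(docs)  # the first document is the report header
    for entry in docs:
        if not entry:
            continue
        if entry.get("control_failure") or entry.get("experiment_failure"):
            continue
        ok = [r for r in entry.get("requests", []) if not r.get("failure")]
        if len(ok) != 2:
            continue
        codes = {r["request"]["tor"]["is_tor"]: r["response"]["code"] for r in ok}
        if set(codes) != {True, False}:
            continue
        if is_blocked(codes[False]) != is_blocked(codes[True]):
            print(codes[False], codes[True], entry.get("input"))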
I have a few questions.
Is this a reasonable way to process reports? Is there a more standard way to do e.g. YAMLOO processing?
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
Here is the output of the script. 403-CLOUDFLARE is the famous "Attention Required!" captcha page. I investigated some of the others manually and they are mostly custom block pages or generic web server 403s. (There are also a couple of CloudFlare pages that have a different form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a block page over Tor.
I'm not sure what's up with the non-Tor 503s from Amazon. They just look like localized internal service error pages ("ist ein technischer Fehler aufgetreten" -- a technical error has occurred; "une erreur de système interne a été décelée" -- an internal system error was detected). The one for blog.com is a generic Nginx "Bad Gateway" page.
non-Tor    Tor             domain
302        403-OTHER       yandex.ru
302        403-OTHER       craigslist.org
301        403-CLOUDFLARE  thepiratebay.se
503-OTHER  301             amazon.de
200        403-CLOUDFLARE  adf.ly
301        403-OTHER       squidoo.com
301        410-OTHER       myspace.com
303        503-OTHER       yelp.com
302        403-CLOUDFLARE  typepad.com
503-OTHER  301             amazon.fr
301        403-CLOUDFLARE  digitalpoint.com
301        403-CLOUDFLARE  extratorrent.com
200        403-OTHER       ezinearticles.com
200        403-OTHER       hubpages.com
200        403-OTHER       2ch.net
200        403-OTHER       hdfcbank.com
302        403-CLOUDFLARE  meetup.com
302        403-CLOUDFLARE  1channel.ch
200        403-CLOUDFLARE  multiply.com
301        403-CLOUDFLARE  clixsense.com
301        403-OTHER       zillow.com
301        403-CLOUDFLARE  odesk.com
301        403-CLOUDFLARE  elance.com
301        403-CLOUDFLARE  youm7.com
200        403-CLOUDFLARE  jquery.com
200        403-CLOUDFLARE  sergey-mavrodi.com
301        403-CLOUDFLARE  templatemonster.com
302        403-CLOUDFLARE  4tube.com
301        403-CLOUDFLARE  mp3skull.com
301        403-CLOUDFLARE  porntube.com
200        403-OTHER       tutsplus.com
200        403-CLOUDFLARE  bitshare.com
301        403-OTHER       sears.com
200        403-CLOUDFLARE  zwaar.net
502-OTHER  200             blog.com
302        403-CLOUDFLARE  myegy.com
301        400-OTHER       mercadolibre.com.ve
302        403-OTHER       jabong.com
301        403-CLOUDFLARE  free-tv-video-online.me
302        403-CLOUDFLARE  traidnt.net
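For reference, the CLOUDFLARE/OTHER split boils down to looking for that "Attention Required!" title in the response body; roughly like the following hypothetical helper (assuming the response dict carries a body field -- not necessarily how the script does it):

    # Hypothetical helper, assumes each response dict has a "body" field.
    def classify_block(code, body):
        kind = "CLOUDFLARE" if "Attention Required!" in (body or "") else "OTHER"
        return "%d-%s" % (code, kind)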
That's very interesting - I'm working on a project that aims to provide real-time blocking detection using ooni-probe's http_requests test, and the false positives caused by Tor blocking are a big problem there.
Being able to identify Tor blocking would definitely help us eliminate false positives. I was also meaning to ask the list whether there are any other tests you'd recommend using in combination with http_requests to improve the accuracy of the results.
Many thanks,
Daniel.
On 19/06/15 18:03, David Fifield wrote:
I want to search OONI reports for cases of Tor exits being blocked by the server (things like the CloudFlare 403 captcha). The http_requests test is great for that because it fetches a bunch of web pages with Tor and without.
The attached script is my first-draft attempt at finding block pages. Its output is at the end of this message. You can see it finds a lot of CloudFlare captchas and other blocks.
First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I ran the script, which does this:
 * Skip the first YAML document, because it's a header.
 * For all other documents:
   * Skip it if it has non-None control_failure or experiment_failure -- there are a few of these.
   * Look for exactly two non-failed requests, one with is_tor:false and one with is_tor:true. Skip it if it lacks these.
   * Classify the blocked status of the is_tor:false and is_tor:true responses. 400-series and 500-series status codes are classified as blocked and all others are unblocked.
   * Print an output line if the blocked status of is_tor:false does not match the blocked status of is_tor:true.
I have a few questions.
Is this a reasonable way to process reports? Is there a more standard way to do e.g. YAMLOO processing?
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
Here is the output of the script. 403-CLOUDFLARE is the famous "Attention Required!" captcha page. I investigated some of the others manually and they are mostly custom block pages or generic web server 403s. (There are also a couple of CloudFlare pages that have a different form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a block page over Tor.
I'm not sure what's up with the non-Tor 503s from Amazon. They just look like localized internal service error pages ("ist ein technischer Fehler aufgetreten", "une erreur de système interne a été décelée"). The one for blog.com is a generic Nginx "Bad Gateway" page.
non-Tor    Tor             domain
302        403-OTHER       yandex.ru
302        403-OTHER       craigslist.org
301        403-CLOUDFLARE  thepiratebay.se
503-OTHER  301             amazon.de
200        403-CLOUDFLARE  adf.ly
301        403-OTHER       squidoo.com
301        410-OTHER       myspace.com
303        503-OTHER       yelp.com
302        403-CLOUDFLARE  typepad.com
503-OTHER  301             amazon.fr
301        403-CLOUDFLARE  digitalpoint.com
301        403-CLOUDFLARE  extratorrent.com
200        403-OTHER       ezinearticles.com
200        403-OTHER       hubpages.com
200        403-OTHER       2ch.net
200        403-OTHER       hdfcbank.com
302        403-CLOUDFLARE  meetup.com
302        403-CLOUDFLARE  1channel.ch
200        403-CLOUDFLARE  multiply.com
301        403-CLOUDFLARE  clixsense.com
301        403-OTHER       zillow.com
301        403-CLOUDFLARE  odesk.com
301        403-CLOUDFLARE  elance.com
301        403-CLOUDFLARE  youm7.com
200        403-CLOUDFLARE  jquery.com
200        403-CLOUDFLARE  sergey-mavrodi.com
301        403-CLOUDFLARE  templatemonster.com
302        403-CLOUDFLARE  4tube.com
301        403-CLOUDFLARE  mp3skull.com
301        403-CLOUDFLARE  porntube.com
200        403-OTHER       tutsplus.com
200        403-CLOUDFLARE  bitshare.com
301        403-OTHER       sears.com
200        403-CLOUDFLARE  zwaar.net
502-OTHER  200             blog.com
302        403-CLOUDFLARE  myegy.com
301        400-OTHER       mercadolibre.com.ve
302        403-OTHER       jabong.com
301        403-CLOUDFLARE  free-tv-video-online.me
302        403-CLOUDFLARE  traidnt.net
Hi Daniel,
This is indeed a very important problem that we have also been facing in our analysis and have some ideas on how to address it.
Originally the http_requests test worked much better than it does now. This was mainly because not many sites used CloudFlare when we started, but as CloudFlare adoption increased, so did the unreliability of the ooni-probe test.
I think we have reached a point where it doesn't really make much sense to perform that control request over Tor, since in most cases it will lead to a false positive, and even when it doesn't we will probably have to eliminate some false positives from that report anyway.
For this reason we are considering removing the Tor aspect from the http_requests test and relying solely on control measurements that are done from a network vantage point we know is not censored.
The idea is to have daily measurements, from one of our control machines, of all the URLs users are interested in testing. In the future we could even allow the client to request that a certain URL be tested: the control machine would check whether it has a fresh result and, if so, serve it; if not, it would perform the request itself.
Unfortunately this plan will require a bit of development effort, so I can't tell you exactly when it will be ready; however, I can offer two concrete ways you can make use of logic along these lines today.
1) Use existing uncensored results
Currently we get measurements every week from around 10-50 distinct network vantage points. The majority of these vantage points are inside countries that do little or no censorship. You could use the HTTP API to get the IDs of measurements from countries that are known not to block the site you are interested in testing. We shall soon add support to the HTTP API (which is also not yet released, but that too will happen very soon) for querying by input.
So you could say:
“Give me the measurements for URL http://google.com/ in the time period of 2015-03 from the countries US, NL”
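Expressed with the filter syntax the API already uses (see below), that might end up looking something like this; note that the probe_cc and input filter keys here are hypothetical, since querying by input is not released yet:

    # Hypothetical example only: "probe_cc" and "input" are guesses at how the
    # not-yet-released input querying might look, not a documented API.
    import json
    from urllib.parse import quote

    filter_spec = {
        "test_name": "http_requests",
        "probe_cc": ["US", "NL"],        # hypothetical filter key
        "input": "http://google.com/",   # hypothetical filter key
    }
    print("http://api.ooni.io/api/reports?filter=" + quote(json.dumps(filter_spec)))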
2) Use whitelist-based matching
Another strategy would be to use the data David is extracting to determine which URLs are a sign of a false positive (because they are CloudFlare CAPTCHA pages) and eliminate them from your results.
This will perhaps be more accurate in the specific case of CloudFlare and Google CAPTCHAs, but may not work as well with other, newer CAPTCHA technologies.
I hope this helps you out a bit. I am more than happy to help anybody develop the HTTP request control test helper if someone is interested.
Have fun!
~ Arturo
On Jun 21, 2015, at 6:04 PM, Daniel Ramsay daniel@dretzq.org.uk wrote:
That's very interesting - I'm working on a project that aims to provide real-time blocking detection using ooni-probe's http_requests test, and the false positives caused by Tor blocking are a big problem there.
Being able to identify Tor blocking would definitely help us eliminate false positives. I was also meaning to ask the list whether there are any other tests you'd recommend using in combination with http_requests to improve the accuracy of the results.
Many thanks,
On Jun 19, 2015, at 7:03 PM, David Fifield david@bamsoftware.com wrote:
I want to search OONI reports for cases of Tor exits being blocked by the server (things like the CloudFlare 403 captcha). The http_requests test is great for that because it fetches a bunch of web pages with Tor and without.
The attached script is my first-draft attempt at finding block pages. Its output is at the end of this message. You can see it finds a lot of CloudFlare captchas and other blocks.
First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I ran the script, which does this:
 * Skip the first YAML document, because it's a header.
 * For all other documents:
   * Skip it if it has non-None control_failure or experiment_failure -- there are a few of these.
   * Look for exactly two non-failed requests, one with is_tor:false and one with is_tor:true. Skip it if it lacks these.
   * Classify the blocked status of the is_tor:false and is_tor:true responses. 400-series and 500-series status codes are classified as blocked and all others are unblocked.
   * Print an output line if the blocked status of is_tor:false does not match the blocked status of is_tor:true.
I have a few questions.
Is this a reasonable way to process reports? Is there a more standard way to do e.g. YAMLOO processing?
If you are looking for robust standalone code for parsing YAML reports you should look at: https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/helpe...
It also handles skipping over badly formatted report entries and already takes care of separating the header from the entries.
If it becomes useful to do so we may eventually refactor that code out of the pipeline, though my dream is that people will not have to parse YAML reports themselves, but can rely on the data pipeline for getting the information they want in JSON (more on that below).
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
Currently the torproject mirror of reports is not the most up-to-date repository, because it does not yet sync with the EC2-based pipeline.
You may find the most up to date reports (that are published daily) here:
If you open the web console you will see a series of HTTP requests being made to the backend. With similar requests you can hopefully obtain the IDs of the specific tests you need and then download them.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
With this query you will get all the tests named “http_requests”:
http://api.ooni.io/api/reports?filter=%7B%22test_name%22%3A%20%22http_reques...
The returned list of dicts also contains an attribute called “report_filename”; you can use that to download the actual YAML report via:
http://api.ooni.io/reportFiles/$DATE/$REPORT_FILENAME.gz
Note: don’t forget to fill in $DATE (the date in ISO format, YYYY-MM-DD) and to add the .gz extension.
So this is what you can do today, by writing a small amount of code and without having to depend on us.
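Roughly, putting those two URLs together, it could look like the sketch below; the shape of the JSON listing and the date field name ("test_start_time") are my guesses, so adapt as needed:

    # Rough sketch, untested: assumes the listing returned by /api/reports is
    # JSON and that each dict carries an ISO date usable for the reportFiles
    # path ("test_start_time" is an assumed field name).
    import json
    from urllib.parse import quote
    from urllib.request import urlopen, urlretrieve

    filter_spec = quote(json.dumps({"test_name": "http_requests"}))
    listing = json.loads(urlopen("http://api.ooni.io/api/reports?filter=" + filter_spec).read())

    for report in listing:
        filename = report["report_filename"]
        date = report["test_start_time"][:10]  # assumed field; ISO YYYY-MM-DD
        url = "http://api.ooni.io/reportFiles/%s/%s.gz" % (date, filename)
        urlretrieve(url, filename + ".gz")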
However I think this test is something that would be quite useful to us too in order to identify the various false positives we have in our reports, so I would like to add this intelligence to our database.
The best way to add support for processing this sort of information is writing a batch Spark task that will look for these report entries and add them to our database.
We have currently implemented only one such filter, but will soon add support for the basic heuristics of other tests too.
You can see how this is done here: https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/batch...
Basically in the find_interesting method you get passed an RDD (entries) that you can run various querying and filtering operations on: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.
To add support for spotting captchas I would add new classes called:
HTTPRequestsCaptchasFind(FindInterestingReports)
and
HTTPRequestsCaptchasToDB(InterestingToDB)
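As a very rough sketch, the find-interesting half could reuse the Tor-vs-non-Tor heuristic from David's mail; the import path and the test_name attribute below are placeholders, so check the pipeline code for the real ones:

    # Very rough sketch only; import path and test_name are placeholders.
    from pipeline.batch import FindInterestingReports  # assumed import path

    def blocked(code):
        return 400 <= code < 600

    def looks_like_tor_block(entry):
        if entry.get("control_failure") or entry.get("experiment_failure"):
            return False
        ok = [r for r in entry.get("requests", []) if not r.get("failure")]
        if len(ok) != 2:
            return False
        codes = {r["request"]["tor"]["is_tor"]: r["response"]["code"] for r in ok}
        if set(codes) != {True, False}:
            return False
        return blocked(codes[True]) and not blocked(codes[False])

    class HTTPRequestsCaptchasFind(FindInterestingReports):
        test_name = "http_requests"  # assumed attribute

        def find_interesting(self, entries):
            # entries is a pyspark RDD of report entries.
            return entries.filter(looks_like_tor_block)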
If you do this and it gets merged, then we can run this on an ephemeral hadoop cluster and/or set it up to run automatically every day.
Here is the output of the script. 403-CLOUDFLARE is the famous "Attention Required!" captcha page. I investigated some of the others manually and they are mostly custom block pages or generic web server 403s. (There are also a couple of CloudFlare pages that have a different form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a block page over Tor.
I'm not sure what's up with the non-Tor 503s from Amazon. They just look like localized internal service error pages ("ist ein technischer Fehler aufgetreten", "une erreur de système interne a été décelée"). The one for blog.com is a generic Nginx "Bad Gateway" page.
non-Tor    Tor             domain
302        403-OTHER       yandex.ru
302        403-OTHER       craigslist.org
301        403-CLOUDFLARE  thepiratebay.se
503-OTHER  301             amazon.de
200        403-CLOUDFLARE  adf.ly
301        403-OTHER       squidoo.com
301        410-OTHER       myspace.com
303        503-OTHER       yelp.com
302        403-CLOUDFLARE  typepad.com
503-OTHER  301             amazon.fr
301        403-CLOUDFLARE  digitalpoint.com
301        403-CLOUDFLARE  extratorrent.com
200        403-OTHER       ezinearticles.com
200        403-OTHER       hubpages.com
200        403-OTHER       2ch.net
200        403-OTHER       hdfcbank.com
302        403-CLOUDFLARE  meetup.com
302        403-CLOUDFLARE  1channel.ch
200        403-CLOUDFLARE  multiply.com
301        403-CLOUDFLARE  clixsense.com
301        403-OTHER       zillow.com
301        403-CLOUDFLARE  odesk.com
301        403-CLOUDFLARE  elance.com
301        403-CLOUDFLARE  youm7.com
200        403-CLOUDFLARE  jquery.com
200        403-CLOUDFLARE  sergey-mavrodi.com
301        403-CLOUDFLARE  templatemonster.com
302        403-CLOUDFLARE  4tube.com
301        403-CLOUDFLARE  mp3skull.com
301        403-CLOUDFLARE  porntube.com
200        403-OTHER       tutsplus.com
200        403-CLOUDFLARE  bitshare.com
301        403-OTHER       sears.com
200        403-CLOUDFLARE  zwaar.net
502-OTHER  200             blog.com
302        403-CLOUDFLARE  myegy.com
301        400-OTHER       mercadolibre.com.ve
302        403-OTHER       jabong.com
301        403-CLOUDFLARE  free-tv-video-online.me
302        403-CLOUDFLARE  traidnt.net
Do you have an output that also includes the report_ids and exit IP?
I believe this data would be of great use to us too.
~ Arturo
On Mon, Jun 22, 2015 at 12:12:50PM +0200, Arturo Filastò wrote:
On Jun 19, 2015, at 7:03 PM, David Fifield david@bamsoftware.com wrote:
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
Currently the torproject mirror of reports is not the most up to date repository of reports, because it does not yet sync with the EC2 based pipeline.
You may find the most up to date reports (that are published daily) here:
This is great! I had no idea it existed.
Thank you for the detailed reply.
If you open the web console you will see a series of HTTP requests being made to the backend. With similar requests you can hopefully obtain the IDs of the specific tests you need and then download them.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
With this query you will get all the tests named “http_requests”:
http://api.ooni.io/api/reports?filter=%7B%22test_name%22%3A%20%22http_reques...
The returned list of dicts also contains an attribute called “report_filename”; you can use that to download the actual YAML report via:
http://api.ooni.io/reportFiles/$DATE/$REPORT_FILENAME.gz
Note: don’t forget to fill in $DATE (the date in ISO format, YYYY-MM-DD) and to add the .gz extension.
That's perfect.
So this is what you can do today, by writing a small amount of code and without having to depend on us.
However I think this test is something that would be quite useful to us too in order to identify the various false positives we have in our reports, so I would like to add this intelligence to our database.
The best way to add support for processing this sort of information is writing a batch Spark task that will look for these report entries and add them to our database.
We have currently implemented only one such filter, but will soon add support for the basic heuristics of other tests too.
You can see how this is done here: https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/batch...
Basically in the find_interesting method you get passed an RDD (entries) that you can run various querying and filtering operations on: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.
To add support for spotting captchas I would add new classes called:
HTTPRequestsCaptchasFind(FindInterestingReports)
and
HTTPRequestsCaptchasToDB(InterestingToDB)
If you do this and it gets merged, then we can run this on an ephemeral hadoop cluster and/or set it up to run automatically every day.
Okay, thanks, perhaps it will be merged somewhere down the line.
Do you have an output that also includes the report_ids and exit IP?
I believe this data would be of great use to us too.
Do you mean in this specific example? It just comes from my manual run of ooniprobe. I'm not sure how to get the report ID. Of course that information is available to the script.
On Fri, Jun 19, 2015 at 10:03:38AM -0700, David Fifield wrote:
The attached script is my first-draft attempt at finding block pages. Its output is at the end of this message. You can see it finds a lot of CloudFlare captchas and other blocks.
Hi David,
Once you have a reliable CloudFlare detector, please consider whether you could do bulk scans of how websites handle Tor without even needing to use Tor -- the idea would be to run a Tor exit relay, wait for it to accumulate a bunch of hate from places like CloudFlare, and then do much more efficient scans of the Internet using its IP address (as well as, in parallel, a nearby but unrelated-to-Tor IP address).
In a sense this is exactly the same as the "is that website censored" test that used Tor as the control, except it's the other way around.
(This is one of the few examples I've come up with where researchers really genuinely do need to run a Tor exit relay, to do their research, yet the research does not involve wiretapping other people's traffic or similarly bad behavior.)
--Roger