On Mon, Jun 22, 2015 at 12:12:50PM +0200, Arturo Filastò wrote:
On Jun 19, 2015, at 7:03 PM, David Fifield david@bamsoftware.com wrote:
I know there are many reports at https://ooni.torproject.org/reports/. Is that all of them? I think I heard from Arturo that some reports are not online because of storage issues.
Currently the torproject mirror is not the most up-to-date repository of reports, because it does not yet sync with the EC2-based pipeline.
You may find the most up to date reports (that are published daily) here:
This is great! I had no idea it existed.
Thank you for the detailed reply.
If you open the web console you will see a series of HTTP requests being made to the backend. With similar requests you can hopefully obtain the IDs of the specific tests you need and then download them.
What's the best way for me to get the reports for processing? Just download all *http_requests* files from the web server?
With this query you will get all the tests named “http_requests”:
http://api.ooni.io/api/reports?filter=%7B%22test_name%22%3A%20%22http_reques...
The returned list of dicts also contains an attribute called “report_filename”; you can use that to download the actual YAML report via:
http://api.ooni.io/reportFiles/$DATE/$REPORT_FILENAME.gz
Note: don’t forget to fill in $DATE (the date in ISO format, YYYY-MM-DD) and to add the .gz extension.
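For concreteness, here is a minimal sketch of that query-and-download loop in Python using the requests library. The filter value and the reportFiles URL pattern are the ones described above; how to derive $DATE from each returned entry is not specified here, so it is left as a placeholder, and any field name beyond "report_filename" is an assumption.

import json
import requests

API = "http://api.ooni.io"

# Ask the backend for all reports whose test_name is "http_requests".
resp = requests.get(API + "/api/reports",
                    params={"filter": json.dumps({"test_name": "http_requests"})})
entries = resp.json()  # a list of dicts, each with a "report_filename" attribute

for entry in entries:
    filename = entry["report_filename"]
    # $DATE must be the report's date in ISO format (YYYY-MM-DD). How to get it
    # from the entry is an assumption -- fill in whatever field the API provides.
    date = "YYYY-MM-DD"  # placeholder
    url = "%s/reportFiles/%s/%s.gz" % (API, date, filename)
    r = requests.get(url)
    with open(filename + ".gz", "wb") as f:
        f.write(r.content)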
That's perfect.
So this is what you can do today by writing a small amount of code, without having to depend on us.
However, I think this test would be quite useful to us too for identifying the various false positives in our reports, so I would like to add this intelligence to our database.
The best way to add support for processing this sort of information is to write a batch Spark task that looks for these report entries and adds them to our database.
We have currently implemented only one such filter, but will soon add support for the basic heuristics of the other tests too.
You can see how this is done here: https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/batch...
Basically, in the find_interesting method you are passed an RDD (entries) on which you can run various querying and filtering operations: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.
To add support for spotting captchas I would add two new classes:
HTTPRequestsCaptchasFind(FindInterestingReports)
and
HTTPRequestsCaptchasToDB(InterestingToDB)
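A rough sketch of what those two classes might look like, assuming find_interesting only has to return a filtered RDD, and using a hypothetical looks_like_captcha helper for the actual detection; the base-class interface and the entry field names are assumptions based on the description above, not the pipeline's actual API:

# Sketch only: FindInterestingReports and InterestingToDB come from the
# ooni-pipeline-ng batch code linked above; their exact interfaces are assumed.

def looks_like_captcha(entry):
    # Hypothetical heuristic: flag http_requests entries whose response bodies
    # contain a CAPTCHA marker. A real check would be more careful.
    for request in entry.get("requests", []):
        body = (request.get("response") or {}).get("body") or ""
        if "captcha" in body.lower():
            return True
    return False


class HTTPRequestsCaptchasFind(FindInterestingReports):
    def find_interesting(self, entries):
        # entries is a pyspark RDD of report entries; keep only the ones
        # that look like they hit a CAPTCHA page.
        return entries.filter(looks_like_captcha)


class HTTPRequestsCaptchasToDB(InterestingToDB):
    # Writes the entries found above into the database; the real class would
    # define whatever table/columns the pipeline expects.
    pass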
If you do this and it gets merged, then we can run it on an ephemeral Hadoop cluster and/or set it up to run automatically every day.
Okay, thanks, perhaps it will be merged somewhere down the line.
Do you have an output that also includes the report_ids and exit IP?
I believe this data would be of great use to us too.
Do you mean in this specific example? It just comes from my manual run of ooniprobe. I'm not sure how to get the report ID. Of course that information is available to the script.