On 2/11/15 1:34 AM, Kevin Murray wrote:
> Hi,
Hi Kevin,
Thanks for your interest in OONI!
> I'd love to get involved with the OONI project. I'm doing a PhD in high-performance/scientific computing, working with large experimental datasets. I have experience coding in C, python and R, and a modest understanding of statistics. I have also contributed code (mostly test cases) to little-t tor.
These are all very useful skills, especially in light of what we see as the highest-priority next steps (i.e. analytics on the data collected by ooniprobe, and visualizations).
Do you have experience working with MongoDB or similar NoSQL databases?
> Is there a particular part of the OONI infrastructure that would like a volunteer? If possible, it would be great to have a longer-term project, working with a mentor or similar, though I know everyone is very busy. I'm happy to work on any part of OONI.
I would be very happy to mentor you through working on the ooni-pipeline (https://github.com/thetorproject/ooni-pipeline), which is currently where most of the development effort is going.
The next steps on that front I believe are:
1) Refactor the data structure of the reports and measurements placed inside of the mongodb database.
We have learned the hard way that MongoDB does not behave like a relational database: JOIN-style operations across tables are not particularly efficient. For this reason I think that, instead of splitting the report header and the measurements into two different tables, we should just put everything into one. This single table will have all of the report fields plus a "measurements" list that contains all the measurements (which were previously stored in another table that referenced the report entry).
This task is actually already implemented here: https://github.com/TheTorProject/ooni-pipeline/commit/3be900736472a15b33e67a...
and I have run the import task on the new pipeline.
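To illustrate the single-table layout described in 1), here is a minimal sketch in plain Python (no database needed). It is not the code from the commit above; the "_id" and "report_id" keys and the sample values are assumptions made up for the example.

```python
from collections import defaultdict


def embed_measurements(reports, measurements):
    """Merge separately stored measurements into their parent report documents.

    Assumes (hypothetically) that each report has an "_id" key and that each
    measurement points back at its report through a "report_id" key.
    """
    by_report = defaultdict(list)
    for m in measurements:
        # Drop the back-reference: it is redundant once the measurement is embedded.
        body = {k: v for k, v in m.items() if k != "report_id"}
        by_report[m["report_id"]].append(body)

    merged = []
    for r in reports:
        doc = dict(r)  # shallow copy so the input report is left untouched
        doc["measurements"] = by_report.get(r["_id"], [])
        merged.append(doc)
    return merged


# Made-up sample data, only to show the shape of the result.
reports = [{"_id": "r1", "test_name": "http_requests", "probe_cc": "IT"}]
measurements = [
    {"report_id": "r1", "input": "http://example.com/"},
    {"report_id": "r1", "input": "http://example.org/"},
]
docs = embed_measurements(reports, measurements)
```

With this shape, reading a report and all of its measurements is a single lookup rather than two queries stitched together on the client.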
What now needs to change is the frontend and HTTP API to the database, which can be found here: https://github.com/hellais/ooni-app
In particular what needs to change is: https://github.com/hellais/ooni-app/blob/master/app/controllers/reports.serv...
and
https://github.com/hellais/ooni-app/blob/master/app/controllers/reports.serv...
2) Come up with queries that will give us all the reports that are to be considered "interesting".
Depending on the type of OONI test, some elements of the result are symptoms of a network anomaly that can be a sign of internet censorship. We should develop a set of MongoDB queries that give us, for every test, the measurements that contain "interesting" results.
If you look at the "entry_filter" method of oonireader you can see what the signs of network anomalies are for some of the most common ooni-probe measurements: https://github.com/TheTorProject/ooni-reader/blob/master/oonireader/nettests...
These should be refactored to either be part of the ooni-pipeline or, even better, be a set of MongoDB queries to be run against the database (bonus points if we can come up with a smart way of caching the results of these queries).
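As a rough sketch of what one such query could look like for the HTTP Requests test, assuming the single-table layout from 1) and assuming each measurement carries boolean "body_length_match" and "headers_match" fields (these names are assumptions here, and should be checked against the actual reports and the "entry_filter" code):

```python
def interesting_http_requests_query():
    """Build a MongoDB filter document for reports with a suspicious measurement.

    $elemMatch selects reports where at least one element of the embedded
    "measurements" list satisfies the inner condition.
    """
    return {
        "test_name": "http_requests",
        "measurements": {
            "$elemMatch": {
                "$or": [
                    {"body_length_match": False},
                    {"headers_match": False},
                ]
            }
        },
    }


def is_interesting(report):
    """Pure-Python equivalent of the filter above, handy for unit tests."""
    if report.get("test_name") != "http_requests":
        return False
    return any(
        m.get("body_length_match") is False or m.get("headers_match") is False
        for m in report.get("measurements", [])
    )
```

The filter document can be passed as the first argument to a pymongo collection's find() method. Keeping one small builder function per test would also give a natural key for caching the results later.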
3) Devise a methodology for removing the dependency on Tor in the HTTP Requests test, while still having control measurements close enough to the probing time.
Currently, when you run the HTTP Requests test, the operator will perform one HTTP request on the local network and one via Tor. The two results are compared to assess whether the expected result (the one over Tor) matches the experiment result (the one over the local network).
This presents a variety of different problems: some sites will block the Tor network, some operators live in a country where Tor is blocked, etc.
For this reason I think we should expose an HTTP service that allows an ooniprobe to request some metadata associated with a certain website (for example, which HTTP headers it returns and how long the body is) or to obtain the full payload of the response. If another operator requests the same site within a close enough time window, we should not issue the request again, but just serve a cached copy of it.
An alternative would be to just query all of the sites that we want users to probe at a fixed interval (say 1 hour) and do all the analysis post submission (so we would look at the control measurement that is closest in time to the experiment measurement).
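The caching behaviour of such a service could be sketched roughly as follows. This is only an in-memory illustration: the class name, the 300 second default and the injected "fetch" callable are all hypothetical, and a real service would still need the HTTP layer, persistence and eviction on top.

```python
import time


class ControlCache:
    """Serve a cached control response when a recent one exists for the URL."""

    def __init__(self, fetch, max_age=300.0, clock=time.time):
        self._fetch = fetch        # callable: url -> response metadata
        self._max_age = max_age    # seconds a cached entry stays fresh
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # url -> (fetched_at, response)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] <= self._max_age:
            return entry[1]        # fresh enough: do not hit the site again
        response = self._fetch(url)
        self._entries[url] = (now, response)
        return response
```

Passing the fetcher and the clock in as arguments keeps the cache logic testable without touching the network.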
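For the fixed-interval alternative, picking the control measurement closest to a given experiment could look roughly like this. It is a sketch that assumes control timestamps are kept as a sorted list of numbers (for example Unix seconds); the function name is made up.

```python
from bisect import bisect_left


def closest_control(control_times, experiment_time):
    """Return the control timestamp nearest to experiment_time.

    control_times must be sorted in ascending order.
    """
    if not control_times:
        raise ValueError("no control measurements available")
    i = bisect_left(control_times, experiment_time)
    if i == 0:
        return control_times[0]    # experiment predates every control
    if i == len(control_times):
        return control_times[-1]   # experiment is after every control
    before, after = control_times[i - 1], control_times[i]
    # On a tie, prefer the earlier control.
    return before if experiment_time - before <= after - experiment_time else after
```

With hourly controls, the chosen control is at most about 30 minutes away from any experiment that falls between the first and last control.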
These are the first things that come to mind, though feel free to also look at the open trac tickets for ooni-probe and tell me if there is something in particular that sparks your interest: https://trac.torproject.org/projects/tor/query?status=accepted&status=as...
In particular, you may find this interesting: https://trac.torproject.org/projects/tor/ticket/13731
I would also be very happy to discuss other possible options with you further, either on IRC (#ooni on irc.oftc.net) or on another channel of communication of your choice.
Have fun!
~ Arturo