Hello Oonitarians!
In the past months we have been working on re-engineering the data processing pipeline for OONI. As a result, you may have noticed that the publishing of reports via https://ooni.torproject.org/reports/ has been paused.
Do not fear: we have not been losing reports, and we will soon resume publishing them, but I would like to ask what would be the most convenient way to do so.
The major improvement is that the reports will from now on be published in JSON as opposed to YAML. The report format will change slightly to make parsing a bit easier. Each report will be a JSON stream (that is, a series of JSON documents separated by newlines) where every document also contains every key present in the report header. This adds a little overhead to the file size, but allows you to store offsets into the files without always having to seek back to the header to get all the common information for that measurement.
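To illustrate the offset idea, here is a minimal sketch of indexing a JSON stream and jumping straight to one document. The sample data and the key names (`probe_cc`, `test_name`, `input`) are made up for the example and are not the final report schema.

```python
import io
import json

# Hypothetical sample: the header keys (probe_cc, test_name) are repeated
# in every document, so each line can be understood on its own.
SAMPLE = (
    b'{"probe_cc": "NL", "test_name": "http_requests", "input": "http://example.com/"}\n'
    b'{"probe_cc": "NL", "test_name": "http_requests", "input": "http://example.org/"}\n'
)


def index_stream(fh):
    """Return the byte offset at which each JSON document starts."""
    offsets = []
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line:
            break
        if line.strip():
            offsets.append(pos)
    return offsets


def read_at(fh, offset):
    """Seek to a stored offset and decode the single document found there."""
    fh.seek(offset)
    return json.loads(fh.readline().decode("utf-8"))


stream = io.BytesIO(SAMPLE)  # stands in for a report file opened in binary mode
offsets = index_stream(stream)
second = read_at(stream, offsets[1])
print(second["test_name"], second["input"])
```

Because the second document carries the header keys itself, there is no need to go back to the start of the file to find out which test or country it belongs to.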
Running some benchmarks on a small sample of the reports collected in one day we can see that the performance increase is huge:
87M 2015-12-22.json
97M 2015-12-22.yaml

vanilla json: 1.37932395935
ultra json: 0.421966075897
simple json: 1.23581790924
pyyaml (without CLoader): 193.864903927
pyyaml (with CLoader): 4.40925312042
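A comparison of this kind can be reproduced roughly as follows. This is only a sketch using the standard library parser on generated data, not the script that produced the numbers above; `make_sample` and `time_parser` are names invented for the example.

```python
import json
import time


def make_sample(n):
    """Build n newline-separated JSON documents with made-up content."""
    docs = [{"test_name": "example", "id": i, "body": "x" * 100} for i in range(n)]
    return "\n".join(json.dumps(d) for d in docs)


def time_parser(loads, data):
    """Parse every line of data with loads; return (elapsed seconds, document count)."""
    start = time.perf_counter()
    count = 0
    for line in data.splitlines():
        loads(line)
        count += 1
    return time.perf_counter() - start, count


data = make_sample(10000)
elapsed, count = time_parser(json.loads, data)
print("vanilla json: %f (%d documents)" % (elapsed, count))
```

Passing `ujson.loads` or `simplejson.loads` to `time_parser` instead, if those packages are installed, would time the other JSON parsers on the same input.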
Currently, for our data processing needs, we have begun to bucket reports by date (each date corresponds to when a report was submitted to the collector). What I would like to know is which of the two following options would be most convenient for you for accessing the data.
The options are:
OPTION A: Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B: Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected
ex.
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS32...
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS32...
- https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-IT-AS32...
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-NL-AS32...
- https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-US-AS32...
- https://ooni.torproject.org/reports/json/2016-01-03/20160103T204732Z-IT-AS32...
etc.
Since we are internally using the daily batches for processing and analysing reports, we will probably end up going for option A unless there is an explicit request to publish them on a per-test-run basis, so don’t be shy to reply :)
~ Arturo
On Tue, Jan 19, 2016 at 05:47:35PM +0100, Arturo Filastò wrote:
OPTION A: Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B: Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected
ex.
https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS32...
https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS32...
https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-IT-AS32...
https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-NL-AS32...
https://ooni.torproject.org/reports/json/2016-01-02/20160102T204732Z-US-AS32...
https://ooni.torproject.org/reports/json/2016-01-03/20160103T204732Z-IT-AS32...
etc.
I have a small preference for option B, because when I was processing OONI data, I was only interested in one type of report, so I didn't have to download all the other types.
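For comparison, with option A picking out one report type would mean downloading the daily file and filtering it afterwards. A rough sketch of that, assuming each document carries a `test_name` key (my guess at the field name, not something confirmed above) and using made-up data in place of a real daily file:

```python
import gzip
import io
import json


def iter_test(fh, test_name):
    """Yield documents from a JSON stream whose test_name matches."""
    for line in fh:
        if not line.strip():
            continue
        doc = json.loads(line)
        if doc.get("test_name") == test_name:
            yield doc


# Stand-in for a downloaded, gzipped daily file (contents are invented).
raw = "\n".join(json.dumps(d) for d in [
    {"test_name": "http_requests", "probe_cc": "NL"},
    {"test_name": "dns_consistency", "probe_cc": "US"},
    {"test_name": "http_requests", "probe_cc": "IT"},
]).encode("utf-8")
daily = gzip.compress(raw)

with gzip.open(io.BytesIO(daily), "rt", encoding="utf-8") as fh:
    matches = list(iter_test(fh, "http_requests"))

print([d["probe_cc"] for d in matches])
```

The filtering itself is cheap, but the whole day still has to be fetched first, which is the part option B would avoid.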