I just downloaded all the http_requests reports from https://measurements.ooni.torproject.org/. It took quite a long time and I wonder if we can make things more efficient by compressing the reports on the server.
This is the command I ran to download the reports:

    wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive https://measurements.ooni.torproject.org/

This resulted in 309 GB and 6387 files.

If I compress the files with xz,

    xz -v *.json

they only take up 29 GB (9%).
Processing xz-compressed files is pretty easy, as long as you don't have to seek. Just do something like this:

    import json
    import subprocess

    def open_xz(filename):
        # Stream the decompressed contents through a pipe from xz.
        p = subprocess.Popen(["xz", "-dc", filename], stdout=subprocess.PIPE, bufsize=-1)
        return p.stdout

    for line in open_xz("report.json.xz"):
        doc = json.loads(line)
        ...

Of course you can do the same thing with gzip.
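The same streaming read can probably be done without a subprocess, using the lzma and gzip modules from the Python 3 standard library. This is only a sketch: the name iter_reports is made up for illustration, and it assumes the reports hold one JSON document per line (as the loop above does) and that the file suffix tells you which compressor was used.

```python
import gzip
import json
import lzma


def iter_reports(path):
    """Yield one parsed JSON document per line of a compressed report."""
    # Assumption of this sketch: the suffix identifies the compressor.
    opener = lzma.open if path.endswith(".xz") else gzip.open
    # "rt" decodes to text and streams line by line, so no seeking is needed.
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

Usage would look like `for doc in iter_reports("report.json.xz"): ...`, and the same call works for a hypothetical "report.json.gz".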
Would you be okay with monthly archives of all tests, or would you want the archives to be separated by test type?
On Fri, Mar 18, 2016 at 10:16 PM, David Fifield david@bamsoftware.com wrote:
On Fri, Mar 18, 2016 at 10:21:21PM -0700, Will wrote:
Would you be okay with monthly archives of all tests, or would you want the archives to be separated by test type?
You don't need to make batched archives. I'm just thinking of compressing the files that are already there, one-to-one. I don't think a tar or zip containing multiple files would be more useful.
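To make the one-to-one idea concrete, here is a rough sketch of what it could look like in Python 3. The function name compress_one_to_one and the flat directory layout are assumptions for illustration, not how the server is actually organized. Unlike running xz directly, this sketch leaves the original .json files in place.

```python
import lzma
import os
import shutil


def compress_one_to_one(directory):
    """Write NAME.json.xz next to every NAME.json in directory.

    Returns the list of newly written paths. The originals are kept.
    """
    written = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        src = os.path.join(directory, name)
        dst = src + ".xz"
        if os.path.exists(dst):
            continue  # already compressed on an earlier run
        # Copy in chunks so large reports never have to fit in memory.
        with open(src, "rb") as fin, lzma.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

Each report keeps its own name and URL structure, so someone who wants only one test type can still select files by name, as in the wget command earlier in the thread.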
Personally I could live with having multiple test types bundled, but that's because I'm dealing with http_requests, which are huge reports, so the extra types don't cause much additional overhead. Someone working with smaller reports would be annoyed at having to download all the http_requests files.