Currently you can do queries with order_by=test_start_time,
order_by=probe_cc, etc., but you cannot do order_by=index.
https://measurements.ooni.torproject.org/api/v1/files?limit=1&order_by=index
{
"error_code": 400,
"error_message": "Invalid order_by"
}
As I understand it, the difference between index and test_start_time is
that index is always increasing over time (newly uploaded reports always
get a higher index than existing reports), while newly uploaded reports
can have a test_start_time that is in the past (if the probe was not
able to upload for a time, for example).
The ability to order_by=index would allow a slight robustness
enhancement in ooni-sync, in the case when a new report is uploaded
while ooni-sync is running. Currently ooni-sync always queries with
order=asc&order_by=test_start_time&limit=1000
That is, starting with the oldest reports, get a page of 1000 reports at
a time. The issue is what happens when a report from the past is
uploaded while ooni-sync is downloading. In this case ooni-sync will not
notice the new report right away. Here is an example with made-up
indexes and dates:
1. ooni-sync starts downloading page 0 from index=5000 (2016-01-01) to index=5999 (2016-03-31)
2. new report with index=9999 (2016-02-01) appears, gets inserted into page 0
3. ooni-sync finishes downloading page 0
4. ooni-sync starts downloading page 1 from index=5999 (2016-03-31) to index=6998 (2016-04-05)
5. ooni-sync finishes downloading page 1
In this example, ooni-sync never downloads the report with index=9999.
Also, it sees index=5999 twice, because index=9999 pushed index=5999
from page 0 to page 1.
An order_by=index option would prevent newly uploaded reports from
misaligning the pages like that (at least when order=asc is used).
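To make the misalignment concrete, here is a small Python simulation of
offset-based paging. It is only a sketch: the in-memory list, the
fetch_page and simulate_sync helpers, and the tiny page size of 3 are
made up for illustration and are not the OONI API or ooni-sync code. A
late report (index=6, dated in the past) is uploaded between the two
page fetches.

```python
def make_reports(n):
    """Hypothetical data: index i has test_start_time 2016-01-01, -03, -05, ..."""
    return [{"index": i, "test_start_time": f"2016-01-{2 * i + 1:02d}"}
            for i in range(n)]


def fetch_page(reports, order_by, page, limit):
    """Stand-in for the server: sort ascending by `order_by`, slice by offset."""
    ordered = sorted(reports, key=lambda r: r[order_by])
    return ordered[page * limit:(page + 1) * limit]


def simulate_sync(reports, late_report, order_by, limit):
    """Fetch page 0, let `late_report` arrive, then fetch page 1.

    Returns the list of report indexes the client saw, in order.
    """
    seen = [r["index"] for r in fetch_page(reports, order_by, 0, limit)]
    reports = reports + [late_report]  # upload happens between the two fetches
    seen += [r["index"] for r in fetch_page(reports, order_by, 1, limit)]
    return seen


existing = make_reports(6)
late = {"index": 6, "test_start_time": "2016-01-04"}  # new index, old date

by_time = simulate_sync(existing, late, "test_start_time", limit=3)
by_index = simulate_sync(existing, late, "index", limit=3)
print(by_time)   # [0, 1, 2, 2, 3, 4]  (index 2 seen twice, index 6 never seen)
print(by_index)  # [0, 1, 2, 3, 4, 5]  (no repeats; index 6 would be on page 2)
```

With order_by=test_start_time the late report lands inside page 0 and
shifts everything after it by one slot. With order_by=index it can only
land at the end, so pages that were already fetched stay aligned.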
The reasons why this is minor minor minor and hardly worth mentioning:
* index=9999 will get downloaded the next time you run ooni-sync
* it can't cause ooni-sync to skip any already uploaded reports (it
would, with order=desc, but that's why ooni-sync uses order=asc)
* ooni-sync will see but won't actually download index=5999 twice
* newly uploaded reports are likely to be on the last page anyway
I wrote a program that uses the OONI API to download reports and keep a
local directory of reports up to date. It's much faster than the Wget
loop I used to use and it finishes quickly when there is nothing new to
download.
git clone https://www.bamsoftware.com/git/ooni-sync.git
For example, lately I've had to download a lot of tcp_connect reports. I
run it like this:
ooni-sync -xz -directory reports.tcp_connect/ test_name=tcp_connect
This command fetches the index of tcp_connect reports and downloads only
the ones that are not already present in the directory. It compresses
the downloaded files with xz. The next time I need to update, I run the
same command again, and it downloads only the reports that are new since
the last run.
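The "only download what is missing" step can be sketched in Python
roughly as follows. This is not ooni-sync's actual code. The helper
names local_name and reports_to_download are hypothetical, and I am
assuming that a report is stored under the last path component of its
download URL, with ".xz" appended when compression is on.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_name(download_url, compress=False):
    """Assumed naming scheme: URL basename, plus ".xz" when compressing."""
    name = posixpath.basename(urlsplit(download_url).path)
    return name + ".xz" if compress else name


def reports_to_download(download_urls, directory, compress=False):
    """Return the URLs whose local file is not yet present in `directory`."""
    existing = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [url for url in download_urls
            if local_name(url, compress) not in existing]
```

On a second run with the same directory, every URL that already has a
matching file is filtered out, so only new reports are fetched.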
You can use other query parameters supported by the API, like probe_cc,
probe_asn, since, and until. For example:
ooni-sync -xz -directory reports.is/ probe_cc=IS since=2017-01-01
ooni-sync -xz -directory reports.as25/ probe_asn=AS25
ooni-sync -xz -directory reports.tor-turkey/ test_name=vanilla_tor probe_cc=TR
ooni-sync -xz -directory reports.web_connectivity/ test_name=web_connectivity since=2017-01-01 until=2017-01-02
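As a rough illustration of how key=value arguments like these could
turn into an API query, here is a Python sketch. The endpoint URL and
the order, order_by, and limit values are the ones mentioned in this
text. The build_query_url helper is hypothetical, and passing the
arguments through unchanged as query parameters is my assumption about
the mapping. The paging parameter is left out.

```python
from urllib.parse import urlencode

API_FILES_URL = "https://measurements.ooni.torproject.org/api/v1/files"


def build_query_url(args, limit=1000):
    """Turn ["probe_cc=IS", "since=2017-01-01"] into a files query URL."""
    # Split each argument on the first "=" only, so values may contain "=".
    params = dict(arg.split("=", 1) for arg in args)
    # Fixed ordering parameters, as in the query quoted in this text.
    params.update({"order": "asc",
                   "order_by": "test_start_time",
                   "limit": str(limit)})
    return API_FILES_URL + "?" + urlencode(params)


print(build_query_url(["probe_cc=IS", "since=2017-01-01"]))
```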
I prefer to keep all the reports compressed on disk, so I always use the
-xz option, but by default reports are saved unmodified.