I was experimenting with adapting ooni-sync to the /api/v1/measurements endpoint. A minimal proof of concept patch is attached. While trying it, I found that the API was returning duplicate measurements and measurements that don't seem to match the query. I'm using this command:
./ooni-sync -xz -directory measurements.archive input=archive.org since=2017-01-01
Here is a query that at the moment happens to return two results with the same measurement_id and measurement_url, but different input and measurement_start_time. There are a few more examples of this phenomenon (I found it 6 times in the first 1000 measurements I downloaded).
https://measurements-beta.ooni.io/api/v1/measurements?input=archive.org&...
{
  "input": "http://archive.org",
  "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
  "measurement_start_time": "2017-01-04T01:55:06Z",
  "measurement_url": "https://measurements.ooni.torproject.org/api/v1/measurement/51daa51b-07d2-49...",
  "probe_asn": "AS3243",
  "probe_cc": "PT",
  "report_id": "20170104T105911Z_AS3243_OadZCx9yRNvqKYsLQaQDa3c1swLofXEQNtcplXQ14QrXemKcCT",
  "test_name": "web_connectivity"
},
{
  "input": "http://wayback.archive.org",
  "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
  "measurement_start_time": "2017-01-04T09:19:42Z",
  "measurement_url": "https://measurements.ooni.torproject.org/api/v1/measurement/51daa51b-07d2-49...",
  "probe_asn": "AS3243",
  "probe_cc": "PT",
  "report_id": "20170104T105911Z_AS3243_OadZCx9yRNvqKYsLQaQDa3c1swLofXEQNtcplXQ14QrXemKcCT",
  "test_name": "web_connectivity"
},
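For anyone who wants to check their own downloads for this, here is a minimal sketch of spotting repeated IDs on the client side. It assumes the search response has already been parsed into a list of dicts with the field names shown above; the helper name `find_duplicate_ids` and the third sample entry are made up for illustration and are not part of ooni-sync or the API.

```python
from collections import defaultdict


def find_duplicate_ids(results):
    """Return {measurement_id: [entries]} for ids that occur more than once."""
    by_id = defaultdict(list)
    for entry in results:
        by_id[entry["measurement_id"]].append(entry)
    return {mid: entries for mid, entries in by_id.items() if len(entries) > 1}


# The two entries from the query above, trimmed to the relevant fields,
# plus one hypothetical entry (invented id) that is not a duplicate.
results = [
    {
        "input": "http://archive.org",
        "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
        "measurement_start_time": "2017-01-04T01:55:06Z",
    },
    {
        "input": "http://wayback.archive.org",
        "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
        "measurement_start_time": "2017-01-04T09:19:42Z",
    },
    {
        "input": "http://archive.org",
        "measurement_id": "hypothetical-unique-id",
        "measurement_start_time": "2017-01-05T00:00:00Z",
    },
]

duplicates = find_duplicate_ids(results)
for mid, entries in duplicates.items():
    # Print each repeated id with the inputs it was attached to.
    print(mid, [e["input"] for e in entries])
```

Grouping rather than just counting keeps the conflicting entries together, so the differing input and measurement_start_time values can be compared directly.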
I also found that the results included some entries whose "input" field didn't seem to match the query. Here is a small sample of them. So far I've found 57/5783 (10%) of downloads whose input doesn't contain "archive.org".
https://measurements-beta.ooni.io/api/v1/measurement/00868be9-2441-42fb-9691... "http://www.imdb.com"
https://measurements-beta.ooni.io/api/v1/measurement/0214f18c-058c-44ef-b291... "http://666games.net"
https://measurements-beta.ooni.io/api/v1/measurement/03b771b2-9f2c-4eee-8835... "http://www.cesr.org"
https://measurements-beta.ooni.io/api/v1/measurement/0cc491bb-30a0-4dea-9271... "http://adultfriendfinder.com"
https://measurements-beta.ooni.io/api/v1/measurement/0fdfa57f-836f-4a75-8543... "http://last.fm"
https://measurements-beta.ooni.io/api/v1/measurement/10f1f5ad-91c4-46f5-9d6a... "http://www.earthwatch.org"
https://measurements-beta.ooni.io/api/v1/measurement/14520322-b00d-437c-be29... "http://abpr2.railfan.net"
https://measurements-beta.ooni.io/api/v1/measurement/1bb1aa36-f4fe-440c-be03... "http://666games.net"
https://measurements-beta.ooni.io/api/v1/measurement/210703f1-e52c-4740-99b3... "http://amphetamines.com"
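The check behind that count can be sketched roughly as follows. This assumes each downloaded measurement has been parsed into a dict with an "input" field and that a plain substring test is the right notion of "matches the query"; the function name `split_by_input` is hypothetical, and the sample inputs are a few of the values listed above.

```python
def split_by_input(entries, needle):
    """Split entries into those whose "input" contains needle and those that don't."""
    matching, stray = [], []
    for entry in entries:
        # Treat a missing or null input as an empty string.
        value = entry.get("input") or ""
        (matching if needle in value else stray).append(entry)
    return matching, stray


downloaded = [
    {"input": "http://archive.org"},
    {"input": "http://wayback.archive.org"},
    {"input": "http://www.imdb.com"},
    {"input": "http://666games.net"},
]

matching, stray = split_by_input(downloaded, "archive.org")
print(len(stray), "of", len(downloaded), "entries do not contain the query string")
```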
On July 25, 2017 at 2:46:07 AM, David Fifield (david@bamsoftware.com) wrote:
I was experimenting with adapting ooni-sync to the /api/v1/measurements endpoint. A minimal proof of concept patch is attached. While trying it, I found that the API was returning duplicate measurements and measurements that don't seem to match the query. I'm using this command:
./ooni-sync -xz -directory measurements.archive input=archive.org since=2017-01-01
Excellent, thanks for trying it out!
Here is a query that at the moment happens to return two results with the same measurement_id and measurement_url, but different input and measurement_start_time. There are a few more examples of this phenomenon (I found it 6 times in the first 1000 measurements I downloaded).
https://measurements-beta.ooni.io/api/v1/measurements?input=archive.org&...
{
  "input": "http://archive.org",
  "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
  "measurement_start_time": "2017-01-04T01:55:06Z",
  "measurement_url": "https://measurements.ooni.torproject.org/api/v1/measurement/51daa51b-07d2-49...",
  "probe_asn": "AS3243",
  "probe_cc": "PT",
  "report_id": "20170104T105911Z_AS3243_OadZCx9yRNvqKYsLQaQDa3c1swLofXEQNtcplXQ14QrXemKcCT",
  "test_name": "web_connectivity"
},
{
  "input": "http://wayback.archive.org",
  "measurement_id": "51daa51b-07d2-491e-ba2b-9189e1a08146",
  "measurement_start_time": "2017-01-04T09:19:42Z",
  "measurement_url": "https://measurements.ooni.torproject.org/api/v1/measurement/51daa51b-07d2-49...",
  "probe_asn": "AS3243",
  "probe_cc": "PT",
  "report_id": "20170104T105911Z_AS3243_OadZCx9yRNvqKYsLQaQDa3c1swLofXEQNtcplXQ14QrXemKcCT",
  "test_name": "web_connectivity"
},
So I have confirmed that this is in fact an issue (see: https://github.com/TheTorProject/ooni-pipeline/issues/54). We were already somewhat aware of it, but its full extent only became apparent after digging into it further; see: https://github.com/TheTorProject/ooni-pipeline/issues/70.
I also found that the results included some entries whose "input" field didn't seem to match the query. Here is a small sample of them. So far I've found 57/5783 (10%) of downloads whose input doesn't contain "archive.org".
https://measurements-beta.ooni.io/api/v1/measurement/00868be9-2441-42fb-9691... "http://www.imdb.com"
https://measurements-beta.ooni.io/api/v1/measurement/0214f18c-058c-44ef-b291... "http://666games.net"
https://measurements-beta.ooni.io/api/v1/measurement/03b771b2-9f2c-4eee-8835... "http://www.cesr.org"
https://measurements-beta.ooni.io/api/v1/measurement/0cc491bb-30a0-4dea-9271... "http://adultfriendfinder.com"
https://measurements-beta.ooni.io/api/v1/measurement/0fdfa57f-836f-4a75-8543... "http://last.fm"
https://measurements-beta.ooni.io/api/v1/measurement/10f1f5ad-91c4-46f5-9d6a... "http://www.earthwatch.org"
https://measurements-beta.ooni.io/api/v1/measurement/14520322-b00d-437c-be29... "http://abpr2.railfan.net"
https://measurements-beta.ooni.io/api/v1/measurement/1bb1aa36-f4fe-440c-be03... "http://666games.net"
https://measurements-beta.ooni.io/api/v1/measurement/210703f1-e52c-4740-99b3... "http://amphetamines.com"
So the problem here is that, since `measurement_id` is not actually unique, retrieving an individual measurement by that ID gives you the first entry with that ID instead of the actual measurement you were searching for (see this bit of the code: https://github.com/TheTorProject/ooni-measurements/blob/master/measurements/...).
I think at this point the best thing to do is to use some other key to point to the actual measurement you care about and expose that in place of the measurement_id in the top level `/measurements` API search endpoint.
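To make the failure mode concrete, here is a rough sketch using the two entries quoted earlier in the thread. The first helper only mimics a "first entry with this ID wins" lookup and is not the actual ooni-measurements code. The second uses the pair (report_id, input) as a stand-in for "some other key"; that choice is purely an assumption for illustration (it happens to separate the two entries in this sample) and not the key the API will necessarily expose. Both function names are hypothetical.

```python
def first_by_measurement_id(results, measurement_id):
    """Mimic a lookup that returns the first entry with the given id."""
    return next(e for e in results if e["measurement_id"] == measurement_id)


def index_by_report_and_input(results):
    """Index entries by (report_id, input), an assumed candidate key."""
    index = {}
    for entry in results:
        key = (entry["report_id"], entry["input"])
        if key in index:
            # The assumed key is not guaranteed unique either; fail loudly.
            raise ValueError("duplicate key: %r" % (key,))
        index[key] = entry
    return index


REPORT_ID = "20170104T105911Z_AS3243_OadZCx9yRNvqKYsLQaQDa3c1swLofXEQNtcplXQ14QrXemKcCT"
MEASUREMENT_ID = "51daa51b-07d2-491e-ba2b-9189e1a08146"

results = [
    {"input": "http://archive.org", "measurement_id": MEASUREMENT_ID, "report_id": REPORT_ID},
    {"input": "http://wayback.archive.org", "measurement_id": MEASUREMENT_ID, "report_id": REPORT_ID},
]

# Looking up by measurement_id can only ever reach the first entry.
by_id = first_by_measurement_id(results, MEASUREMENT_ID)

# Looking up by the assumed composite key reaches the intended entry.
index = index_by_report_and_input(results)
by_key = index[(REPORT_ID, "http://wayback.archive.org")]

print(by_id["input"])
print(by_key["input"])
```

Raising on a repeated key is deliberate: if the assumed key turns out not to be unique in real data, it is better to find out immediately than to silently keep only one of the entries.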
~ Arturo
On Mon, Jul 24, 2017 at 05:45:39PM -0700, David Fifield wrote:
I also found that the results included some entries whose "input" field didn't seem to match the query. Here is a small sample of them. So far I've found 57/5783 (10%) of downloads whose input doesn't contain "archive.org".
Oops, I meant "1%", not "10%".