Parts of the ooni-pipeline are using datetime.fromtimestamp instead of datetime.utcfromtimestamp. It means the timestamp will be parsed differently according to the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it looks like the timestamps were read and written by someone in UTC+1, so the conversion evens out in the final JSON. But it could go wrong if run by someone in a different time zone.
I noticed this because when I process the new JSON files, I get timestamps that are an hour different than I got parsing the YAML files. To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"), which gives the correct answer because of the reason in the previous paragraph. My old YAML times were off by an hour, because I used datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if it were UTC.
Here's some related information I found while researching the above. It looks like individual test results are *also* using local timestamps, or have their clocks set wrong, or something. Because when I compare the "test_start_time" field to the "Date" header (which is always supposed to be UTC) in a response to an http_requests test, I get some widely divergent values (I saw up to around 9 hours).
For example consider https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS29... It has a difference of about 6 hours: { "test_start_time":"2015-01-01 07:00:29", "test_keys": { "requests": [ { "response": { "headers": { "Date": "Thu, 01 Jan 2015 01:00:35 GMT" ... } } } ] }, probe_cc: "BE" } probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't know where the 6-hour discrepancy is coming from. Note also that the timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0 CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6 CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0 CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0 CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0 CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0 CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0 CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0 CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1 CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json import sys for line in sys.stdin: doc = json.loads(line) try: probe_cc = doc["probe_cc"] test_start_time = doc["test_start_time"] for request in doc["test_keys"]["requests"]: response_time = request["response"]["headers"]["Date"] print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time) except KeyError: pass
So I looked a bit into this matter and it seems like the root of the problem is actually in ooni-probe:
The problem is this function here: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/otime.py#L87
that is called with the result of time.time(). Since time.time() should return a seconds since UTC Epoch, the result of then converting it again with utcfromtimestamp() leads it to actually being converted back into the local timezone of the probe.
This issue seems like something much harder to fix in historical data, but I am going to include a fix for this in versions 1.4.0 of ooni-probe together with the rest of the JSON data format changes.
~ Arturo
On Mar 17, 2016, at 07:04, David Fifield david@bamsoftware.com wrote:
Parts of the ooni-pipeline are using datetime.fromtimestamp instead of datetime.utcfromtimestamp. It means the timestamp will be parsed differently according to the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it looks like the timestamps were read and written by someone in UTC+1, so the conversion evens out in the final JSON. But it could go wrong if run by someone in a different time zone.
I noticed this because when I process the new JSON files, I get timestamps that are an hour different than I got parsing the YAML files. To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"), which gives the correct answer because of the reason in the previous paragraph. My old YAML times were off by an hour, because I used datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if it were UTC.
Here's some related information I found while researching the above. It looks like individual test results are *also* using local timestamps, or have their clocks set wrong, or something. Because when I compare the "test_start_time" field to the "Date" header (which is always supposed to be UTC) in a response to an http_requests test, I get some widely divergent values (I saw up to around 9 hours).
For example consider https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS29... It has a difference of about 6 hours: { "test_start_time":"2015-01-01 07:00:29", "test_keys": { "requests": [ { "response": { "headers": { "Date": "Thu, 01 Jan 2015 01:00:35 GMT" ... } } } ] }, probe_cc: "BE" } probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't know where the 6-hour discrepancy is coming from. Note also that the timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0 CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6 CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0 CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0 CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0 CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0 CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0 CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0 CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1 CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json import sys for line in sys.stdin: doc = json.loads(line) try: probe_cc = doc["probe_cc"] test_start_time = doc["test_start_time"] for request in doc["test_keys"]["requests"]: response_time = request["response"]["headers"]["Date"] print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time) except KeyError: pass _______________________________________________ ooni-dev mailing list ooni-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/ooni-dev
On Thu, Mar 17, 2016 at 02:54:49PM +0100, Arturo Filastò wrote:
So I looked a bit into this matter and it seems like the root of the problem is actually in ooni-probe:
The problem is this function here: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/otime.py#L87
that is called with the result of time.time(). Since time.time() should return a seconds since UTC Epoch, the result of then converting it again with utcfromtimestamp() leads it to actually being converted back into the local timezone of the probe.
This issue seems like something much harder to fix in historical data, but I am going to include a fix for this in versions 1.4.0 of ooni-probe together with the rest of the JSON data format changes.
Thank you for checking it out. I agree that the past historical timestamps are probably not easily recoverable. I will keep parsing the timestamps as if they are UTC.
I don't understand your explanation that "utcfromtimestamp() leads it to actually being converted back into the local timezone." That's what I would expect of fromtimestamp, not utcfromtimestamp. Consider these:
# fromtimestamp gives different values according to TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 12:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 19:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 18:35:19
# utcfromtimestamp gives the same value regardless of TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19
However what you did with defining a UTC tzinfo is probably equivalent. I still think there should not be calls to fromtimestamp without a tzinfo, otherwise the output will differ depending on the time zone it is run in.
On Mar 17, 2016, at 19:40, David Fifield david@bamsoftware.com wrote:
I don't understand your explanation that "utcfromtimestamp() leads it to actually being converted back into the local timezone." That's what I would expect of fromtimestamp, not utcfromtimestamp. Consider these:
# fromtimestamp gives different values according to TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 12:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 19:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 18:35:19
# utcfromtimestamp gives the same value regardless of TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19
This is the problem, since you are setting the TZ to be Denver, Rome and then UTC, you would expect that the output of the datetime call would differ, since when you cast it to string it should evaluate to the time in the timezone of the caller.
You can better understand what I mean with this if you try to cast it to seconds since epoch again, as we used to do in ooni-probe < 1.4:
$ TZ=America/Denver python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s”)’ 1458261319
$ TZ=Europe/Rome python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s")’ 1458236119
$ TZ=UTC python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s")’ 1458239719
~ Arturo
On Thu, Mar 17, 2016 at 08:37:21PM +0100, Arturo Filastò wrote:
On Mar 17, 2016, at 19:40, David Fifield david@bamsoftware.com wrote:
I don't understand your explanation that "utcfromtimestamp() leads it to actually being converted back into the local timezone." That's what I would expect of fromtimestamp, not utcfromtimestamp. Consider these:
# fromtimestamp gives different values according to TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 12:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 19:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.fromtimestamp(1458239719))' 2016-03-17 18:35:19
# utcfromtimestamp gives the same value regardless of TZ: $ TZ=America/Denver python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=Europe/Rome python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19 $ TZ=UTC python -c 'import datetime; print str(datetime.datetime.utcfromtimestamp(1458239719))' 2016-03-17 18:35:19
This is the problem, since you are setting the TZ to be Denver, Rome and then UTC, you would expect that the output of the datetime call would differ, since when you cast it to string it should evaluate to the time in the timezone of the caller.
You can better understand what I mean with this if you try to cast it to seconds since epoch again, as we used to do in ooni-probe < 1.4:
$ TZ=America/Denver python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s”)’ 1458261319
$ TZ=Europe/Rome python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s")’ 1458236119
$ TZ=UTC python -c 'import datetime; print datetime.datetime.utcfromtimestamp(1458239719).strftime("%s")’ 1458239719
I have always had trouble with the Python date libraries. So I trust you if you say it is working :)