The YAML reports had two time fields:
    start_time: timestamp of the start of ooni-probe run
    test_start_time: timestamp of the start of each individual test
Within a single report file, start_time was constant, while test_start_time would advance with each successive test, depending on how long each test took to run.
The JSON format reports have just one of the fields, test_start_time, but it confusingly appears to have the same meaning as start_time in the old YAML reports (it doesn't change within a report file):
    test_start_time: timestamp of the start of ooni-probe run
It might be because of this code:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf...
    entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
Some tests can take many minutes or hours to run, so by the end, the JSON test_start_time might be far off from the real time when it was run. Is there a way we could get both timestamps in each record again?
What I'm doing now is incrementing a counter according to the test_runtime field of each record, and adding that counter to the test_start_time in order to estimate the individual test's start time. But I feel that is only approximate, and some older reports do not have test_runtime.
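A minimal sketch of that workaround, for illustration only. It assumes the records of one report file are read in the order the tests ran, that the tests ran back to back, and that test_runtime is in seconds; the function name estimate_measurement_starts is hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp format used by the JSON test_start_time field.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def estimate_measurement_starts(entries):
    """Yield (entry, estimated_start) for the records of one report file.

    Assumes records are in execution order and that test_runtime,
    when present, is a number of seconds (both are assumptions).
    """
    elapsed = timedelta(0)
    for entry in entries:
        report_start = datetime.strptime(entry["test_start_time"], TIME_FORMAT)
        yield entry, report_start + elapsed
        runtime = entry.get("test_runtime")
        if runtime is not None:
            # Without test_runtime the running offset cannot advance,
            # so later estimates in such reports will be too early.
            elapsed += timedelta(seconds=runtime)
```

The estimate ignores any idle time between tests, which is one reason it can only be approximate.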
On Mar 17, 2016, at 07:28, David Fifield david@bamsoftware.com wrote:
> The YAML reports had two time fields:
>     start_time: timestamp of the start of ooni-probe run
>     test_start_time: timestamp of the start of each individual test
> Within a single report file, start_time was constant, while test_start_time would advance with each successive test, depending on how long each test took to run.
> The JSON format reports have just one of the fields, test_start_time, but it confusingly appears to have the same meaning as start_time in the old YAML reports (it doesn't change within a report file):
>     test_start_time: timestamp of the start of ooni-probe run
> It might be because of this code:
> https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf...
>     entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
Gosh you are right. This is a pretty serious bug in the data pipeline.
The goal there was actually to retain only the ‘test_start_time’ field and drop ‘start_time’, since that is no longer relevant now that the discrete unit is that of a measurement.
I guess since losing information is perhaps not ideal I will add both of them back in the final JSONs.
Since we are currently using the test_start_time in a lot of the database views for generating some aggregates, the most straightforward thing to do is to add another field called “measurement_start_time” that will represent the value of what used to be “test_start_time”, while “test_start_time” will continue meaning what previously was “start_time”.
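A rough sketch of what that renaming could look like, not the actual pipeline code. It assumes each YAML-derived entry carries start_time and test_start_time as epoch seconds; the helper names format_epoch and normalize_timestamps are hypothetical, and the sketch formats in UTC, whereas the line quoted above uses the local timezone.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_epoch(seconds):
    """Format epoch seconds as a UTC timestamp string (UTC is an assumption)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(TIME_FORMAT)


def normalize_timestamps(entry):
    """Keep both timestamps under the field names proposed above.

    measurement_start_time <- old per-measurement test_start_time
    test_start_time        <- old per-report start_time
    """
    # Pop both old values first, because test_start_time is reused as a key.
    report_start = entry.pop("start_time", None)
    measurement_start = entry.pop("test_start_time", None)
    if measurement_start is not None:
        entry["measurement_start_time"] = format_epoch(measurement_start)
    if report_start is not None:
        entry["test_start_time"] = format_epoch(report_start)
    return entry
```

Leaving a field out when its source value is missing, rather than defaulting to 0, avoids emitting a misleading 1970 timestamp.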
> Some tests can take many minutes or hours to run, so by the end, the JSON test_start_time might be far off from the real time when it was run. Is there a way we could get both timestamps in each record again?
> What I'm doing now is incrementing a counter according to the test_runtime field of each record, and adding that counter to the test_start_time in order to estimate the individual test's start time. But I feel that is only approximate, and some older reports do not have test_runtime.
I will re-run the pipeline again on all historical data to re-populate these fields according to the changes mentioned above.
~ Arturo
On Thu, Mar 17, 2016 at 02:33:03PM +0100, Arturo Filastò wrote:
> On Mar 17, 2016, at 07:28, David Fifield david@bamsoftware.com wrote:
>> The YAML reports had two time fields:
>>     start_time: timestamp of the start of ooni-probe run
>>     test_start_time: timestamp of the start of each individual test
>> Within a single report file, start_time was constant, while test_start_time would advance with each successive test, depending on how long each test took to run.
>> The JSON format reports have just one of the fields, test_start_time, but it confusingly appears to have the same meaning as start_time in the old YAML reports (it doesn't change within a report file):
>>     test_start_time: timestamp of the start of ooni-probe run
>> It might be because of this code:
>> https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf...
>>     entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
> Gosh you are right. This is a pretty serious bug in the data pipeline.
> The goal there was actually to retain only the ‘test_start_time’ field and drop ‘start_time’, since that is no longer relevant now that the discrete unit is that of a measurement.
> I guess since losing information is perhaps not ideal I will add both of them back in the final JSONs.
> Since we are currently using the test_start_time in a lot of the database views for generating some aggregates, the most straightforward thing to do is to add another field called “measurement_start_time” that will represent the value of what used to be “test_start_time”, while “test_start_time” will continue meaning what previously was “start_time”.
That's fine with me. And I agree, if there is only one, the individual measurement times are more useful than the overall report time. I don't think I'm using the latter in my processing scripts.
>> Some tests can take many minutes or hours to run, so by the end, the JSON test_start_time might be far off from the real time when it was run. Is there a way we could get both timestamps in each record again?
>> What I'm doing now is incrementing a counter according to the test_runtime field of each record, and adding that counter to the test_start_time in order to estimate the individual test's start time. But I feel that is only approximate, and some older reports do not have test_runtime.
> I will re-run the pipeline again on all historical data to re-populate these fields according to the changes mentioned above.
I'll get ready to download again :)