New subject: Feedback on OONI data collection, aggregation, and visualization

19 Oct 2014


      Hi everyone,
I just subscribed to this list, because Arturo asked me to comment on
two postings here.  As a very quick introduction, and because I don't
know how distinct the Tor community and the OONI community are: I'm the
developer behind the Tor network data collector CollecTor [-1] and the
Tor Metrics website that aggregates and visualizes Tor network data [0].
Here's what Arturo asked on another thread:
...
With OONI what we are currently focusing on is Bridge Reachability
measurements. We have at this time 1 meter in China, 1 in Iran (a second
one is going to be setup soon), 1 in Russia and 1 in Ukraine. We have
some ideas of the sorts of information we would like to extract from
this data, but it would also be very good to have some more feedback
from you on what would be useful [1].
Long mail is long.  Some random thoughts:
- For Tor network data it has turned out to be quite useful to strictly
separate data collection from data aggregation from data visualization.
 That is, don't worry too much about visualizing the right thing, but
start with something, and if you don't like it, throw it away and do it
differently.  And if you're aggregating the wrong thing, then aggregate
the previously collected data in a different way.  Of course, if you
figure out you collected the wrong thing, then you won't be able to go
back in time and fix that.
- I saw some discussion of "The pool from where the bridge has been
extracted (private, tbb, BridgeDB https, BridgeDB email)".  Note that
isis and I are currently talking about removing sanitized bridge pool
assignments from CollecTor.  We're thinking about adding a new config
line to tor that states the preferred bridge pool, which could be used
here instead.  Just as a heads-up, six months or so in advance.  I can
probably provide more details if this is relevant to you.
...
Another area that perhaps overlaps with the needs of the metrics is data
storage. Currently we have around 16 GB of uncompressed raw report data
that needs to be archived (currently it's being stored and published on
staticiforme, but I have a feeling that is not ideal especially when the
data will become much bigger) and indexed in some sort of database.
Once we put the data (or a subset of it) in a database producing
visualizations and exposing the data to end users will be much simpler.
The question is if this is a need also for
Metrics/BwAuth/ExitScanner/DocTor and if we can perhaps work out some
shared infrastructure to fit both of our goals.
Currently we have placed the data inside of MongoDB, but some concerns
with it have been raised [2].
Again, some random thoughts:
- For Metrics, the choice of database is entirely an internal decision,
and no user would ever see that.  It's part of the aggregation part.  If
we ever decide to pick something else (than PostgreSQL in this case),
we'd have to rewrite the aggregation scripts, which would then produce
the same or similar output (which is an .csv file in our case).  That
being said, trying out MongoDB or another NoSQL variant might be
worthwhile, but don't rely on it too much.
- Would you want to add bridge reachability statistics to Tor Metrics?
 I'm currently working on opening it up and making it easier for people
to contribute metrics.  Maybe take a look at the website prototype that
I posted to tor-dev@ a week ago [3] (and if you want, comment there).  I
could very well imagine adding a new section "Reachability" right next
to "Diversity" with one or more graphs/tables provided by you.  Please
see the new "Contributing to Tor Metrics" section on the About page for
the various options for contributing data or metrics.
- Please ask weasel for a VM to host those 16 GB of report data; having
it on staticiforme is probably a bad idea.  Also, do you have any plans
to synchronize reports between hosts?  I'm planning such a thing for
CollecTor where two or more instances fetch relay descriptors from
directory authorities and automatically exchange missing descriptors.
- I could imagine extending CollecTor to also collect and archive OONI
reports, as a long-term thing.  Right now CollecTor does that for Tor
relay and bridge descriptors, TORDNSEL exit lists, BridgeDB pool
assignment files, and Torperf performance measurement results.  But note
that it's written in Java and that I hardly have development time to
keep it afloat; so somebody else would have to extend it towards
supporting OONI reports.  I'd be willing to review and merge things.  We
should also keep CollecTor pure Java, because I want to make it easier
for others to run their own mirror and help us make data more redundant.
 Anyway, I can also imagine keeping the OONI report collector distinct
from CollecTor and only exchange design ideas and experiences if that's
easier.
Lots of ideas.  What do you think?
All the best,
Karsten
[-1] https://collector.torproject.org/
[0] https://metrics.torproject.org/
[1] https://lists.torproject.org/pipermail/ooni-dev/2014-October/000176.html
[2] https://lists.torproject.org/pipermail/ooni-dev/2014-October/000178.html
[3] https://kloesing.github.io/metrics-2.0/