I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
The choice of YAML parser really matters: it can mean something like a
30× difference in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow.
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
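As a rough illustration of the loader choice described above, here is a minimal sketch. The iter_records helper and the inline sample document are made up for this example and are not taken from ooni.py; the fallback to yaml.SafeLoader assumes you might be on a PyYAML build without libyaml.

```python
import io

import yaml

# Prefer the libyaml-backed loader. Falling back to the pure-Python
# SafeLoader is an assumption for installs built without libyaml.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)


def iter_records(f):
    """Yield each YAML document from an open report file (hypothetical helper)."""
    for doc in yaml.load_all(f, Loader=Loader):
        yield doc


# A tiny stand-in for a report: a header document followed by one record.
sample = io.StringIO(
    "---\n"
    "test_name: http_requests\n"
    "---\n"
    "input: http://example.com/\n"
)
records = list(iter_records(sample))
```

This sketch does nothing about the invalid Unicode escapes mentioned above; a record containing one will raise an exception unless you handle it yourself.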
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
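One simple way to match all four spellings is a set lookup. This is only a sketch: the is_http_requests helper and the assumption that the report header is a dict with a "test_name" key are mine, not necessarily how findblocks does it.

```python
# The four spellings listed above; all of them refer to the same test.
HTTP_REQUESTS_NAMES = frozenset([
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
])


def is_http_requests(header):
    """Return True if a report header dict names the http_requests test.

    Assumes (hypothetically) that `header` is a dict parsed from the
    first YAML document of a report.
    """
    return header.get("test_name") in HTTP_REQUESTS_NAMES
```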
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In some requests, the special URL scheme "shttp" also indicates a
Tor request; e.g. "shttp://example.com/". The ooni.py module fixes up
some of these issues, but only for the http_requests test. You'll have
to figure it out on your own for other tests.
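To give an idea of what such a fix-up can look like, here is a small sketch that folds the variants above into a single boolean. It is not the code from ooni.py; the request_uses_tor name and the assumption that a request is a dict with optional "tor" and "url" keys are hypothetical.

```python
def request_uses_tor(request):
    """Best-effort check of whether a request record went through Tor.

    `request` is assumed to be a dict with optional "tor" and "url"
    keys; the real record layout may differ.
    """
    tor = request.get("tor")
    if isinstance(tor, dict):
        # Covers both {is_tor: true} and the longer form with exit_ip etc.
        if tor.get("is_tor"):
            return True
    elif tor:
        # Covers the bare "tor: true" form.
        return True
    # Some reports mark Tor requests only through the "shttp" scheme.
    url = request.get("url") or ""
    return url.startswith("shttp://")
```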
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
Hi,
Pabs made me aware of the github repository being quite ahead of the one
on git.torproject.org. Having 'Source code[link] is available (GitHub
mirror[link]).' is confusing, because I didn't expect such a discrepancy.
3 possible solutions to avoid this confusion:
- have some little automatic bot pushing changes from git.torproject to
github (post-hook for example)
- write a little mail to Github, to get the same treatment as, for
example, most Apache[1] projects so they pull (twice a day) changes from
git.torproject to github
- change the wording, and make the one on git.torproject the 'mirror
which reflects the state of a foregone era'
Ciao,
kwadro
[1] https://github.com/apache/parquet-mr
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Dear all,
Here's what we did during the mini hackathon in Rome (1-2 October
2015). This was collaboratively written and edited by me.
The two-day hackathon was attended by eight people: alangiu, alemela,
dalla, duncan, gfutia, nuke, poly, and me (sbs).
AirVPN folks joined us on the morning of October 1. With them we
discussed possible avenues of cooperation between OONI and the
NeuMon project (http://www.neumon.org) to measure censorship.
During the hackathon we worked on the following projects:
# MeasurementKit
- - people: nuke, alangiu, sbs
- - brief overview of what we did:
- Finished a beta working version of the app, added cocoapods in
the Xcode project.
- Removed the build-ios repo and added the scripts to the main
measurement-kit repo.
- Tested the new async measurement-kit implementation; everything
works well, and the logs are even separated test by test.
- Added DNS support for iOS devices: the app now gets the local DNS
and uses it for the tests (previously this worked only in the simulator)
- - relevant github repos:
- https://github.com/measurement-kit/measurement-kit
- https://github.com/measurement-kit/measurement-kit-build-ios/
- https://github.com/measurement-kit/measurement-kit-app-ios
- - relevant pull requests:
- https://github.com/measurement-kit/measurement-kit/pull/182
- https://github.com/measurement-kit/measurement-kit/pull/185
- https://github.com/measurement-kit/measurement-kit/pull/187
# NetworkMeter
- - people: poly, sbs
- - brief overview of what we did:
- Added preliminary support for invoking OONI
- Added support for running tests in parallel
- Implemented homepage, showing currently running tests and tests
that have finished
- Added ability to retrieve output and reports of previous tests
- Various small bugfixes, like removing redundant caching
- Decided on how to handle visualization
- - relevant github repos:
https://github.com/measurement-kit/network-meter
- - relevant pull requests:
https://github.com/measurement-kit/network-meter/pull/34
https://github.com/measurement-kit/network-meter/pull/32
https://github.com/measurement-kit/network-meter/pull/23
https://github.com/measurement-kit/network-meter/pull/21
# World Censorship Map
- - people: duncan
- - brief overview of what we did:
Made a map showing which regions of the world OONI probes are running
in. Each point represents one report collected.
The data is from 2014 mostly, and there is a hidden time slider on the
left. Probes tend to come online, collect a burst of measurements then
disappear again shortly after. Exceptions include the UK, Germany and
Italy where a huge number of measurements were collected year-round.
- - demo:
- https://vtduncan.github.io/ooni-globe
- - relevant github repos:
- https://github.com/vtduncan/ooni-globe
- https://github.com/vtduncan/asn-geo
# World Censorship Report
- - people: dalla, gfutia, alemela
- - brief overview of what we did:
Analyzed some tests in order to understand which data could be useful
to show in an aggregated report. Wrote some node.js code that:
- unzips and parses YAML aggregated reports
- splits reports by country and month, adding some values
- generates a markdown report with an overview of all tests done
for a specified country in a specified month
Thought about which kind of physical data model would be best suited
to storing OONI reports.
- - relevant github repos:
https://github.com/alemela/ooni-report
- - - - end projects - - -
Thank you,
- -sbs
- --
Simone Basso
https://nexa.polito.it/people/sbasso
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWLk38AAoJEIC2kSd3M9lbkN8P/1QrvcdBU/H86yax1tO2uWRt
NgkLaKX8mhhznuqUM9h6LGASHsFyXecDkQGPtQDVmdqds1+ny7XSX2jfgwoSAtG2
zrphrST51cDTcXJV/keLYwVuSDPI9BBtdwMH1McbGm1sq8NEbCQbFkLfprmAyUZQ
zH2XiiTyBRkAjDaYDqhu1RXWXTwbsTheXP4Zdq2XmC81zgJt9Pao6RQi7JCnYhpb
ycbN2j++POEcGqZzEpXsSGMFSQOrkNQHN9heMUd0Ro6y5GVxUcn5RMv9eKHP8lzO
c6wT5hj4Rn6pcf/6rCdg9hxQObUaEeD/rbOJM52ovDmzPri050nmEyw4Gutved8c
kmt3aUhjPJtdk618nYstIhvUTKM2rLWBwC8czG9pAemjivu6b2soculTtASZHIuo
hBRm3cDlVPZGFwJSj0KRvuwTjPrS7rbIbCvetxjAuuuTRFioO9fkDpCk3XNiLFDP
8j8XpFOxOIh7tOPcCu6gZOzqNlK8P2/GS2kPvTiE6jbOqeGXZK03YiZrHIvgFHC5
5/4ZR7bJuIWxYtzct7u4BCVgPYYBgCZOw038nlTgZlK9+aafua5eAfUcYohQKvc3
uiH+HQMmCk14m2F1RjAIILSSa98iNR7jmJaIyAa/gnNC46FWSMNk89R0fMtZjuPu
v5HxUrd2ppIbG9HT0MO9
=jJk8
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Dear all,
This is a reminder that today there will be the weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 18:00
UTC (19:00 CEST, 14:00 EST, 11:00 PST).
Everybody is welcome to join us and bring their questions and feedback.
Cheers,
- -sbs
- --
Simone Basso
http://nexa.polito.it/people/sbasso
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWLjN2AAoJEIC2kSd3M9lbl0gP/ReT3P0nIDBHS63wfG0E3yYx
WxcgeeaIEoPrl4t91WWVzY/xM1SdHL8bG9OwivKQLxzYf/JP/QjdHcf0RFAYmwSl
1+8D0pgfzG3rgBYxuqiXEAwCnTf147MIVnaeYDTy6ixpbumLc6Owust3xDN45xio
MThS6n78GLFSecS1ithK3WkZIrIVG9QJgeQu6FF/bpVb9ZDbloty5eJszOb1d8Ix
1qjvpkf0q8EPzEputoQO1ZI1Q1lqSJzT/pn847D3mCiZPFNtnYaCCzlGX5DzHk21
wzfpy+kdBSMpP3IZBJatn/VMDcnP8Lh5DhgjLv+2rvoQI9T41R+exojWk8KeaCu2
H9cGQslkix4P7d+bAERWXM8y9lVOq02n9R63vRnYMIOTzsBDoBr9rCpq7LTLE1Ja
MBem1Vj7/lgI29BVK54q+HoeUDjuVOuf+OknCwXkIK1n9GhQeC4sksRc0SCYWX1c
YXB3m94jLW7trP2htgClFUfBqegmZikuxEhvDpyp9ioqyB8a8kQPz3TTmRc7FG4f
6gK7YaWjAG0IADdouhhuzWobIMX1pnB+5Q1Lj2bpbnYKyD1/3QQI1vHsmAyKji+9
n0KqgMGndN/RoCEuCuI6FdUrx2WDVtki6VMJqmHkMbB/Sk6b1CEUfQFVrh+G6EEo
Ol8SvY2A76emesmUCA71
=bSVG
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hello,
I am not sure if anyone has noticed this, but as a heads up, it looks
like the AWS API key pair associated with ooni-api has expired
(https://github.com/TheTorProject/ooni-api). One of the buckets known to
be associated with this key is ooni-public.
Test URL:
http://api.ooni.io/reportFiles/2013-10-14/20131014T065313Z-AS6327-http_requ…
This XML file does not appear to have any style information associated
with it. The document tree is shown below.
<Error>
<Code>InvalidAccessKeyId</Code>
<Message>
The AWS Access Key Id you provided does not exist in our records.
</Message>
<AWSAccessKeyId>AKIAICLIHHE7CXMFVGDA</AWSAccessKeyId>
<RequestId>D520976FAD96E3D3</RequestId>
<HostId>
HwaC8KIExmEF7hZS2QEOLSlXLa3AA/EIoeSXU43+XQLX6PDKCp+IBmx3WeUAKLche0/yYKAwWAI=
</HostId>
</Error>
- ---
Cheers,
Tyler Fisher
[1] https://github.com/TheTorProject/ooni-api
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWLXylAAoJEGMeb8NGhvDr4q0P/jfqpOOADhvw3Sp+qRcOQ9+U
BC04EyMh4UA7ZNMiUpOAgFacool2jMQa2tLpT8wgUQFbRJdTWBi9m+fP4ZY3L5ku
1lHuL/Y/+geiX8iYayXTVUsl5FBiDcZmGEq/EME4hEtAsO3yyhtEqqAlb4Sq6gv+
eO55U1BNocXLZB4zNkguXQbot1tHxxv7DJu6WuKUFuur/BB805Qy1yizfgxfbVkP
Jd3o9Um4wl3cA1TI4+ZPaHlCw3vW56JXTuALYy/cwmrruJpnJ2X4seINCmDLuchZ
l6dBjl7PDWuqhy3dTCDhFpRItA8rvMyBObvqY07koTwwZfKtvj5zwuTrHMjWlZUr
yj/D3lc8oXWQZK3ySqc3kuSis+WpV1gwx5xvSCc+htZdwmd6NPlkzft/m4dt6h1g
UTT+H0fvAUXtn6ok5emaQcHKc9ib+55vv4ftleVyvIKI/RfVJn46vjy84v3dslC5
TLfL9DjYQjOtbgpI+DyoirfdRqn8mJdc0SYZcdvn7iX91yHD9UOEIsE2Z5CllsNq
vNraKUVPytnNwIHGQCYZFGHK8mJppB3p9FqmkjuUU84ZTlnsVxDa2GYmVvpgmZHL
TwEASA2n99mH96Tkh8pdW2JWTKshTeqByBa5yEAmjQhnmzuQFbi+500TMgtZE6Q1
3Bb8JIHicDl9JCsbplyI
=n2vy
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Hello,
This is a reminder that today there will be the weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (19:00 CET, 13:00 EST, 10:00 PST).
Everybody is welcome to join us and bring their questions and feedback.
See you later!
Vasilis
-----BEGIN PGP SIGNATURE-----
iQIcBAEBCgAGBQJWJR4XAAoJEF+/cLHRJgFilxQP/AlKFzBNQXVhguaKyaExu6BU
o/OyaVDsC9kXBBOdHJgKDzEAh7QoyDVoC4Yez6nbco3DFAgh+BCpgs2MJ7KDa7Xd
MLOi5gHsWdY/nW3bHyqfriVzTxcBR7QopRclGvO/oVH0oLEOlEMnePlGRYwO2smm
Fal2duInhzfdUWMxjqPiSQB1ODWXYgiEG/kcldZyzpk0U1pCR0ggubhxYPtx92WK
ZosnMVymgEIkmamCjoHSet4XOkmG2At7Cf8wnpGiCSuyqMTrjx1RfYEjPrl9QGvw
CVsEfY+lkikeG+xvXNfhkeBZWlUg2872XQZeYkGuHq5z4oOiKuTuYdxkRt52Xgiy
ruR+RTbMNdqmnUx817XbSLv3LLS4KI/Ib7rKsP1F2ofSX5bmHJWEUL30ps1Yw+Sm
Q23Yw4oLDNR0LqF3vJsdj3G7cwh79zzRSspWtR/5vZA4ckkepih82oH/voRS5lgY
QZdtgjU+7A38QBhsZZD1Uzs6UBpsbDP5i8I2wwMCPky2Sc/OLbQUelsf/AYciYn2
x+V3puMmFGSnwegUnSPabng0la/YuunCKlsgRR6rbjM14rf9tOWDxq6z+//2iya0
JXayGG6sjwxas9PpLTc6kYKSm51gWyNOV9fHYr69TedgcM7EdQ8ywkq7GOuhN+At
BqpufsdFJwiwN+zttI2N
=sRjs
-----END PGP SIGNATURE-----
What email address(es) does OONI have registered at https://access.ripe.net ?
Give me an email address and I can give you OONI co-management.
It's likely that other people will want to do this as well. And I'm
confident there's useful things OONI could do with a fuckton of Atlas
credits.
-V
Hello,
This mail doesn't seem to have ended up on the tor-dev list as intended, but
— in either case — this list is probably a better place for it. Since Pabs'
patch is a one-character change which should fix a SyntaxError in one of the
nettests, I've gone ahead and applied it to the Github ooni-probe master
branch and pushed there. (The git-rw.tpo repo master branch is out of sync
with the GH one, FWIW. I didn't understand why this would be the case, but I
assumed that you all have some organisation reasons for doing so and so I
didn't update the git-rw.tpo repo.)
I hope this was okay to do.
Cheers!
--
♥Ⓐ isis agora lovecruft
_________________________________________________________
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://blog.patternsinthevoid.net/isis.txt
----- Forwarded message from Paul Wise <pabs3(a)bonedaddy.net> -----
> From: Paul Wise <pabs3(a)bonedaddy.net>
> To: isis(a)torproject.org, tor-dev(a)lists.torproject.org
> Cc: Paul Wise <pabs3(a)bonedaddy.net>
> Subject: [ooni-probe] [PATCH] Fix typo in the meek tests
> Date: Thu, 15 Oct 2015 17:31:09 +0800
> Message-Id: <1444901469-19326-1-git-send-email-pabs3(a)bonedaddy.net>
> X-Mailer: git-send-email 2.6.1
>
> ---
> ooni/nettests/blocking/meek_fronted_requests.py | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ooni/nettests/blocking/meek_fronted_requests.py b/ooni/nettests/blocking/meek_fronted_requests.py
> index 3795008..2ccd72e 100644
> --- a/ooni/nettests/blocking/meek_fronted_requests.py
> +++ b/ooni/nettests/blocking/meek_fronted_requests.py
> @@ -48,7 +48,7 @@ class meekTest(httpt.HTTPTest):
> """
>
> if self.input:
> - if (isinstance(self.input, tuple) or isinstace(self.input, list)):
> + if (isinstance(self.input, tuple) or isinstance(self.input, list)):
> self.domainName, self.header = self.input
> else:
> self.domainName, self.header = self.input.split(':')
> --
> 2.6.1
>
----- End forwarded message -----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Dear all,
This is a reminder that today there will be the weekly OONI meeting.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (19:00 CEST, 13:00 EST, 10:00 PST).
Everybody is welcome to join us and bring their questions and feedback.
Cheers,
- --
Simone Basso
http://nexa.polito.it/people/sbasso
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWG6kZAAoJEIC2kSd3M9lbTJUP/1Lo/HQWuSHn+W7ycwsDfbNV
YyI+wBmR/WMoTifHYEU33MXI6q0JXhveTdAEGcnnW/z5S+r+u565eM4qoeIszHUO
VgyJu9lRgpJMgqKKVRbEww5Hbh2oUhtCx+VMs4ws5oAuzwLpTK2S3J+ZQv2KE+ek
RdjQ06Pb0w19WTnek7KdGZzo0GI9Tn2TqGQPNpq4KFKJ83r7HV5LSgDriHa9We0j
Vkl4LcOujX6BifaIziAjgrqxrSKa8zUwAn7Zp0ZFfpW3L0F4+npWYypZ824KDFCP
Q4/IgOsAcVvPYV+TajLtlYNjtP/KZ23PfqYpk3nsAJg7alYGDHab95R4iYRCqoio
+YSQXZ90tM2XQRzHtKcuLRTAP4gJt65/zhsrFPTGP4iafTfjOkNrocyB/ym4eVi5
c1xqq6fQaCDKEDM0elWqvdpS6IVBWL3JXdJU6xywQ1PmA1Ta0qC5bgGcJ+1kTt+7
0Kh0dMNjdZhzXDhMX3a3SUXUooQBOBaO5pE09tjfDwYhVw5d8VKvW6U7GovP3nEF
a6jxDMwPzV6CEI6GhQp7v7vSnqCPUX1bFX94WFjJTqiwmyyBO8+GR3qeQeZsGtgj
3JgHBchs7/Ib0ryYzDMukspq4UlkA5Qis2/u0L7bTK9Cq4cli5OT2Nv9bsu5H1CV
8J2UwCXLUvPIN6K18x6E
=R3ES
-----END PGP SIGNATURE-----
Hello,
It might be worthwhile to take a step back and discuss why
ooni-pipeline-ng and related tooling make use of Apache Spark + Hadoop
in lieu of an RDBMS/NoSQL database before diving into the Hadoop
ecosystem. In my experience, early adoption of Hadoop typically incurs
a significant degree of technical debt, both in maintaining the
associated cluster and in increased development time.
If I am correct in my assumption, ooni-pipeline-ng adopted Apache Spark
largely due to its flexibility, and (relatively) low learning curve.
It is worth mentioning that I am currently working on a service in parallel
that makes a subset of the OONI metrics available to developers, and other
users (specifically the YAML reports associated with different ooni-probe
measurements). I've been pretty busy lately, but I should have a service up
shortly that I can open up to peer review. After working with the MongoDB
aggregation framework for a few weeks, I've opted to lean more towards
using PostgreSQL given that it supports conventional aggregate queries that
many developers (including myself) are familiar with, which will make it
easier for other developers to contribute.
The use of PostgreSQL may be favourable in lieu of using a NoSQL database,
or adopting Hadoop given that for the most part, ooni-probe reports can be
modeled in a relational form. The only sparse element of a given ooni-probe
report is the result associated with a given test, and even that follows a
relatively predictable schema which can be targeted using an aggregate
query.
One of the neat features of PostgreSQL is that it supports aggregate
queries over nested JSON documents, meaning that you can perform aggregate
queries on nested JSON in tandem with aggregates on scalar fields. This is
pretty useful, not to mention performant when proper indices are used. This
may be sufficient for the purposes of what the metrics team is doing, but
of course, I'd have to hear more about what the hurdles you're trying to
cross are.
Before tackling scalability, we should verify that the services in question
fit your use case. If I am correct in my assumption, the ooni-probe metrics
are less than 1GB overall, as opposed to being several terabytes.
[1] ooni-base format:
https://raw.githubusercontent.com/TheTorProject/ooni-spec/master/data-forma…
---
Cheers,
Tyler
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
PS: Hopefully I am replying to this e-mail thread properly - I am not too
familiar with mailing lists.
----------------------------------------------------------------------
Message: 1
Date: Thu, 1 Oct 2015 15:50:48 +0200
From: thomas lörtsch <tl(a)rat.io>
To: ooni-dev(a)lists.torproject.org
Subject: [ooni-dev] hadoop?
Message-ID: <BC4AEC9F-CEEF-4FB7-B313-B87D960A1B66(a)rat.io>
Content-Type: text/plain; charset=windows-1252
Hi,
The measurement team is thinking about setting up a server with metrics
data and an environment that allows everybody (everybody with a login,
that is) to analyze metrics data and do crazy research with it.
Hadoop, as a well-established Big Data solution, seems like a good choice
to base that environment on, enhanced by R and probably more stuff. The
problem is that Hadoop is not in Debian stable (and doesn't seem likely to
get in anytime soon [1]). The only alternatives we could find are
PostgreSQL and MongoDB, but MongoDB is too shoddy and PostgreSQL will
likely struggle with the kind of data we intend to throw at it and won't
be fun to work with.
OONI does use Hadoop and we'd like to know why and how. Didn't you, like
us, find any viable alternative to Hadoop that is available in Debian
stable? How did you get around Hadoop not being in stable? Can you advise
us to do the same or look somewhere else? (Where?)
Cheers
Thomas
[1] https://wiki.debian.org/Hadoop