Hello,

It might be worthwhile to take a step back and discuss why ooni-pipeline-ng and its related tooling use Apache Spark + Hadoop instead of an RDBMS or NoSQL database before diving into the Hadoop ecosystem. In my experience, early adoption of Hadoop typically incurs a significant amount of technical debt, both in maintaining the cluster and in increased development time.

If my assumption is correct, ooni-pipeline-ng adopted Apache Spark largely for its flexibility and (relatively) low learning curve.

It is worth mentioning that I am currently working in parallel on a service that makes a subset of the OONI metrics available to developers and other users (specifically the YAML reports associated with the various ooni-probe measurements). I've been pretty busy lately, but I should have a service up shortly that I can open to peer review. After working with the MongoDB aggregation framework for a few weeks, I've opted to lean towards PostgreSQL instead, since it supports the conventional aggregate queries that many developers (including myself) are already familiar with, which will make it easier for others to contribute.

PostgreSQL may be preferable to a NoSQL database or to adopting Hadoop, given that ooni-probe reports can, for the most part, be modeled in relational form. The only sparse element of a given ooni-probe report is the result associated with a given test, and even that follows a relatively predictable schema (see the base report format [1]) which can be targeted with an aggregate query.
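
To make that a bit more concrete, here is a rough sketch (in Python with psycopg2, purely illustrative and not the schema of the service I mentioned): header fields from the base report format [1] become ordinary columns, while the per-test result lands in a JSONB column since its shape varies from test to test. The connection parameters, table name, and column names below are all placeholders, and JSONB needs PostgreSQL 9.4 or newer.

    import psycopg2

    # Connection parameters are placeholders; adjust for your setup.
    conn = psycopg2.connect(dbname="ooni", user="ooni", host="localhost")

    with conn, conn.cursor() as cur:
        # Header fields from the base report format become plain columns;
        # the per-test result is kept as JSONB since its shape varies by test.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS reports (
                id         BIGSERIAL PRIMARY KEY,
                report_id  TEXT,
                test_name  TEXT,
                probe_cc   TEXT,
                probe_asn  TEXT,
                start_time TIMESTAMPTZ,
                test_keys  JSONB
            )
        """)
    conn.close()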

One of the neat features of PostgreSQL is that it supports aggregate queries over nested JSON documents, meaning you can aggregate on fields nested inside the JSON in tandem with aggregates on scalar columns. This is pretty useful, and quite performant when proper indices are used. It may be sufficient for what the metrics team is doing, but of course I'd have to hear more about the hurdles you're trying to cross.
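
As a hedged example of what I mean (again using made-up names; 'status' just stands in for whatever key a given test result actually exposes), the query below groups on a scalar column while pulling a field out of the nested JSON document with the ->> operator:

    import psycopg2

    conn = psycopg2.connect(dbname="ooni", user="ooni", host="localhost")

    with conn, conn.cursor() as cur:
        # Group on a scalar column (probe_cc) and on a field extracted
        # from the nested JSONB result; the test name is illustrative.
        cur.execute("""
            SELECT probe_cc,
                   test_keys ->> 'status' AS status,
                   count(*) AS n
            FROM reports
            WHERE test_name = %s
            GROUP BY probe_cc, status
            ORDER BY n DESC
        """, ("http_requests",))
        for probe_cc, status, n in cur.fetchall():
            print(probe_cc, status, n)
    conn.close()

If you end up filtering on a particular JSON key a lot, an expression index on (test_keys ->> 'status') keeps those lookups cheap, which is what I had in mind when mentioning proper indices.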

Before tackling scalability, we should verify that the services in question fit your use case. If my assumption is correct, the ooni-probe metrics amount to less than 1 GB overall, rather than several terabytes.

[1] ooni-base format: https://raw.githubusercontent.com/TheTorProject/ooni-spec/master/data-formats/df-000-base.md

---
Cheers,
Tyler

GPG fingerprint: 8931 45DF 609B EE2E BC32  5E71 631E 6FC3 4686 F0EB

PS: Hopefully I am replying to this e-mail thread properly - I am not too familiar with mailing lists.
----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Oct 2015 15:50:48 +0200
From: thomas lörtsch <tl@rat.io>
To: ooni-dev@lists.torproject.org
Subject: [ooni-dev] hadoop?
Message-ID: <BC4AEC9F-CEEF-4FB7-B313-B87D960A1B66@rat.io>
Content-Type: text/plain; charset=windows-1252

Hi,

The measurement team is thinking about setting up a server with metrics data and an environment that allows everybody (everybody with a login, that is) to analyze metrics data and do crazy research with it.
Hadoop as a well established Big Data solution seems like a good choice to base that environment on, enhanced by R and probably more stuff. The problem is that Hadoop is not in Debian stable (and doesn't seem to get in anytime soon [1]). The only alternatives we could find are PostgreSQL and MongoDB, but MongoDB is too shoddy and PostgreSQL will likely struggle with the kind of data we intend to throw at it and won't be fun to work with.

Ooni does use Hadoop and we'd like to know why and how. Didn't you, like us, find any viable alternative to Hadoop that is available in Debian stable? How did you get around Hadoop not being in stable? Can you advise us to do the same or look somewhere else? (Where?)


Cheers
Thomas


[1] https://wiki.debian.org/Hadoop