Hi there,
I had an email exchange with Arturo and he recommended I post this on the list.
Arturo, have you looked at https://zeppelin.incubator.apache.org/ ? Also, remember I mentioned SlamData? It claims to do very fast MongoDB analytics (http://slamdata.com/use-cases/#operational-analytics) - and Michele, they are functional heads (using PureScript for the front-end, https://github.com/slamdata/slamdata, and talking about relational algebra, http://slamdata.com/murray/). Also, I think you mentioned that you tried to put your terabyte of data in Mongo but that there were size leaks? Can you talk more about that? And have you looked at http://gopulsar.io/ ? Finally, what is the best place to ask questions about 1/ the data processing pipeline systems and 2/ the type of analytics that could be done on the data?
These are all great suggestions, and I think the only one I had briefly glanced at is Zeppelin. Since it is still an incubator project it may not be something we should rely on, but it could be interesting to play around with (for similar tasks we currently import a partition of the data into Elasticsearch and then use Kibana to explore it).

Makes sense. I haven’t tried any of them myself, and just stumbled upon SlamData because of my interest in functional front-end approaches; they seem to be doing interesting stuff with Mongo and a kind of Python-notebook approach to creating analysis documents, which I thought was cool.
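By the way, for my own understanding, I imagine the “import a partition of the data into Elasticsearch and explore it in Kibana” step looks roughly like the sketch below (a minimal sketch on my part, using the Python elasticsearch client; the “ooni-reports” index name and the newline-delimited reports.json input are just placeholders, not your actual setup):

    import json

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])

    def actions(path, index="ooni-reports"):
        """Yield one bulk-index action per JSON line in the input file."""
        with open(path) as fh:
            for line in fh:
                yield {"_index": index, "_source": json.loads(line)}

    # Bulk-index the documents; Kibana can then be pointed at the "ooni-reports" index.
    helpers.bulk(es, actions("reports.json"))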
It would be great if this information were posted to the ooni-dev mailing list (https://lists.torproject.org/cgi-bin/mailman/listinfo/ooni-dev), as perhaps some people there with more knowledge than me can comment on them too.

Done!
Regarding the MongoDB size leaks, I didn’t do much investigation into it, though I still have a copy of the original database if you would like to check it out. Basically the problem was that we had something like ~100GB of raw data (~10GB compressed) that would grow to ~500GB once placed inside the database. I didn’t dig too much into the issue, as that iteration of the pipeline was also presenting other scaling issues (the processing of the batch jobs was not being distributed, just done on a single machine), so we ditched the MongoDB approach and went for Elasticsearch, which a community member had recommended.
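If anyone wants to dig into that copy, a starting point would be something like the following (pymongo; the “ooni” database and “reports” collection names are just placeholders), which compares the logical data size against the allocated storage and index size:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["ooni"]

    # collStats reports the uncompressed data size, the storage actually
    # allocated on disk, and the total size of the indexes.
    stats = db.command("collStats", "reports")
    gib = 1024 ** 3
    print("documents (uncompressed):", stats["size"] / gib, "GiB")
    print("allocated storage:       ", stats["storageSize"] / gib, "GiB")
    print("indexes:                 ", stats["totalIndexSize"] / gib, "GiB")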
From what I hear, Elasticsearch is generally doing the job. I’d be interested in understanding the current and planned split between what is done by Elasticsearch (I guess simple full-text queries, or maybe also faceted ones?), what is done by the Hadoop cluster, and what is done by the Storm cluster.
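To be concrete about what I mean by faceted: something like an aggregation on top of a full-text match. A rough sketch (the probe_cc and test_name field names are guesses on my part, not the actual report schema):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    resp = es.search(
        index="ooni-reports",
        body={
            "query": {"match": {"test_name": "http_requests"}},
            "aggs": {"by_country": {"terms": {"field": "probe_cc", "size": 20}}},
            "size": 0,  # we only want the facet counts, not the documents
        },
    )
    for bucket in resp["aggregations"]["by_country"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])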
One of the main questions I have with regard to the overall analysis capability of the data pipeline is: is there an approach that would allow analysts (rather than developers) to rapidly deploy complex queries into the data pipeline? In other words, can we shorten the feedback loop between thinking “hey, that would be a cool analysis to try” and “look, here’s the result!”? It seems that this would help us iterate and be more creative with analysing the data.
Maybe it’s something like a staging/analysis cluster, which is a replica of the production cluster but with additional ‘analysis’ processes running and write access for analysts to deploy their experiments? I guess something where you can deploy “queries” as map-reduce jobs, to both a batch-processing cluster and a stream-processing cluster?
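To make that a bit less hand-wavy, here is roughly the kind of analyst-deployable “query” I have in mind, sketched with mrjob (the probe_cc field is again a guess about the report schema). The same script runs locally for quick iteration and can be pointed at a Hadoop cluster with a -r hadoop switch:

    import json

    from mrjob.job import MRJob


    class MRCountByCountry(MRJob):
        def mapper(self, _, line):
            # Emit one (country, 1) pair per report line.
            report = json.loads(line)
            yield report.get("probe_cc", "unknown"), 1

        def reducer(self, country, counts):
            # Sum the per-country counts.
            yield country, sum(counts)


    if __name__ == "__main__":
        MRCountByCountry.run()

Running "python count_by_country.py reports.json" gives a quick local result, and the same script with "-r hadoop" runs against data sitting on the cluster, which is the kind of short feedback loop I was getting at.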
Anyhow. This is surely not a priority but maybe it’ll be interesting to some folks here.
Cheers,
Jun
--
Jun Matsushita
Founder, CEO - information innovation lab
iilab.org - @iilab
+49 157 530 86174 - jun@iilab.org - skype: junjulien