To be honest, when I wrote the email I knew that MongoDB provided sharding, which should give horizontal scaling, but I didn't know how it worked because I hadn't had time to dig through the docs. Now, after learning a bit more about MongoDB, I still don't know if I know :) but I agree with you, this is not distributed.
Ah, interesting: MongoDB has built-in sharding: http://docs.mongodb.org/manual/core/sharding-introduction/
Perhaps you are correct about MongoDB, in that it does seem like it would scale well.
However, we have to carefully evaluate several more criteria before choosing a data store. For instance, operational costs should always be evaluated:
Is it a pain to set up? (Sharded MongoDB seems heavyweight!) Is it a pain to add a new replica to a replica set? How are additional shards added? Does balancing the cluster after adding additional shards kill performance and take a long time? (Most likely yes.)
So, I think we should index the reports to provide a query API; this still applies. But should we build a distributed datastore that fits with every deployed collector, or a central repository that grabs the reports from the collectors and indexes them? Should we care at all?
Yes... "indexed" reports sound much easier to work with than just the reports... however, it is not yet clear that we really need the datastore to be distributed. High (or mostly high) availability might be a requirement for this project. That is much easier to accomplish!
OK... so if we go with one of these CF (column family) data stores... then we must keep in mind the types of queries we will need when creating the schema. Another possibility would be Redis. It supports set operations (union, intersection, difference) that can serve as a simple set-theoretic query language... Also, I've heard good things about CouchDB. I think we should look at these different datastore possibilities and discuss potential schema and query design for our project. I suspect a discussion of schema and query patterns will be more useful than discussing operational properties, especially if a centralized datastore is good enough.
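To make the "set-theoretic" idea a bit more concrete, here is a rough sketch in plain Python of how indexed reports could be queried with set operations. It is only an in-memory stand-in for what Redis sets would give us, not a proposal for the actual schema: the ReportIndex class and the report fields (collector, test, country) are made up for illustration, and it assumes every field value is hashable.

```python
from collections import defaultdict


class ReportIndex:
    """Hypothetical in-memory index: one set of report ids per (field, value) pair."""

    def __init__(self):
        self._sets = defaultdict(set)   # (field, value) -> set of report ids
        self._reports = {}              # report id -> original report dict

    def add(self, report_id, report):
        """Store a report and index it under every one of its fields."""
        self._reports[report_id] = report
        for field, value in report.items():
            self._sets[(field, value)].add(report_id)

    def _ids(self, pair):
        # Return a copy so callers cannot mutate the index by accident.
        return set(self._sets.get(pair, ()))

    def inter(self, *pairs):
        """Ids of reports matching ALL of the (field, value) pairs."""
        if not pairs:
            return set()
        return set.intersection(*(self._ids(p) for p in pairs))

    def union(self, *pairs):
        """Ids of reports matching ANY of the (field, value) pairs."""
        return set().union(*(self._ids(p) for p in pairs))

    def diff(self, first, *others):
        """Ids matching the first pair but none of the others."""
        return self._ids(first).difference(*(self._ids(p) for p in others))


# Made-up reports, purely for illustration.
index = ReportIndex()
index.add("r1", {"collector": "c1", "test": "http", "country": "IT"})
index.add("r2", {"collector": "c2", "test": "http", "country": "US"})
index.add("r3", {"collector": "c1", "test": "dns", "country": "IT"})

http_from_it = index.inter(("test", "http"), ("country", "IT"))
c2_or_dns = index.union(("collector", "c2"), ("test", "dns"))
c1_without_dns = index.diff(("collector", "c1"), ("test", "dns"))
```

The point of the sketch is that each query we want to support has to map onto sets we decided to maintain at write time, which is the same "design the schema around the queries" constraint as with the CF stores.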
cheers,
david