On 4/14/13 1:26 PM, Praveen Kumar wrote:
Hi Karsten,
Hi Praveen,
I am so sorry for replying late. I had a seminar presentation on Friday and have another on Monday, so I was a little busy studying for it.
No worries!
I had downloaded about 1 GB of server descriptors from the metrics website. I thought of generating some performance metrics for a search application in Django, once with a MySQL backend and once with a MongoDB backend. So, I implemented two basic Django apps, one for each backend. I processed each server descriptor file and extracted router_nickname, router_ip, tor_version and platform_os as searchable fields. At the time of writing this email, I have processed around 330,000 files for MySQL and have the data of 670,000 files in MongoDB. I cannot process all the files, as that 1 GB of data is composed of millions of files and processing is slow on my system. My aim is to issue the same queries to both apps and see which one performs better. Both databases are indexed on the same fields. I will send you the metrics the day after tomorrow, i.e., on Tuesday.
Sounds like a fine start. Be sure to include results of this performance comparison in your GSoC application!
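As an illustration of the extraction step described above, here is a minimal sketch in Python. It assumes each file contains a "router" line and a "platform" line laid out as on the descriptor formats page; the function name extract_fields and the sample descriptor are made up for this example and are not taken from the apps discussed in this thread.

```python
import re

# Line layouts assumed here (simplified from the descriptor format):
#   router <nickname> <address> <ORPort> <SOCKSPort> <DirPort>
#   platform Tor <version> [...] on <operating system>
ROUTER_RE = re.compile(r"^router (\S+) (\S+) \d+ \d+ \d+$", re.MULTILINE)
PLATFORM_RE = re.compile(r"^platform Tor (\S+).*? on (.+)$", re.MULTILINE)


def extract_fields(descriptor_text):
    """Return the four searchable fields from one server descriptor.

    Hypothetical helper: tor_version and platform_os stay None when the
    optional platform line is missing or does not match the assumed layout.
    """
    router = ROUTER_RE.search(descriptor_text)
    if router is None:
        raise ValueError("no router line found")
    fields = {
        "router_nickname": router.group(1),
        "router_ip": router.group(2),
        "tor_version": None,
        "platform_os": None,
    }
    platform = PLATFORM_RE.search(descriptor_text)
    if platform is not None:
        fields["tor_version"] = platform.group(1)
        fields["platform_os"] = platform.group(2)
    return fields


# Invented sample descriptor (documentation IP address, not a real relay).
SAMPLE = """router ExampleRelay 192.0.2.1 9001 0 9030
platform Tor 0.2.3.25 on Linux
published 2013-04-14 10:00:00
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

The resulting dictionary could then be written to either backend, so both databases are filled from exactly the same parsed values.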
But, theoretically speaking, MongoDB is fast because every document is stored as a JSON-like document (BSON), it is schemaless, and it doesn't have to perform any joins, etc. The indexes that are built are based on B-trees, which have a worst-case time complexity of O(log(n)) for insertion, lookup and deletion. MongoDB also keeps the indexes in RAM as required, for faster searches and to reduce disk reads. MongoDB also has the capability of scaling efficiently.
Well, performance of MongoDB vs. MySQL really depends on the problem you're trying to solve. For example, we'll have to perform joins when storing a network status consensus that references 0..n server descriptors each of which references 0..1 extra-info descriptors. See the descriptor formats page for details:
https://metrics.torproject.org/formats.html
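To make the join requirement concrete, here is a rough sketch of such a relational layout. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and all table names, column names and rows are invented for illustration; a real schema would carry many more fields.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- 0..1 extra-info descriptor per server descriptor
        CREATE TABLE extrainfo (
            digest TEXT PRIMARY KEY,
            bytes_written INTEGER
        );
        CREATE TABLE descriptor (
            digest TEXT PRIMARY KEY,
            nickname TEXT,
            extrainfo_digest TEXT   -- NULL when there is no extra-info
        );
        -- one row per relay listed in a consensus (0..n per consensus)
        CREATE TABLE statusentry (
            valid_after TEXT,
            descriptor_digest TEXT
        );
    """)
    conn.execute("INSERT INTO extrainfo VALUES ('E1', 1000)")
    conn.executemany(
        "INSERT INTO descriptor VALUES (?, ?, ?)",
        [("D1", "relayA", "E1"), ("D2", "relayB", None)],
    )
    conn.executemany(
        "INSERT INTO statusentry VALUES (?, ?)",
        [
            ("2013-04-14 12:00:00", "D1"),
            ("2013-04-14 12:00:00", "D2"),
            ("2013-04-14 13:00:00", "D1"),
        ],
    )
    return conn


def relays_in_consensus(conn, valid_after):
    """List (nickname, bytes_written) for every relay in one consensus.

    The LEFT JOIN keeps relays that have no extra-info descriptor.
    """
    return conn.execute(
        """
        SELECT d.nickname, e.bytes_written
        FROM statusentry s
        JOIN descriptor d ON d.digest = s.descriptor_digest
        LEFT JOIN extrainfo e ON e.digest = d.extrainfo_digest
        WHERE s.valid_after = ?
        ORDER BY d.nickname
        """,
        (valid_after,),
    ).fetchall()


if __name__ == "__main__":
    print(relays_in_consensus(build_example_db(), "2013-04-14 12:00:00"))
```

In a document store, the same question would need either embedded copies of each descriptor inside every consensus document or several lookups performed in application code.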
Also, with respect to scaling, the plan would be to run this application on a single server along with other services.
So, in general, I'd be careful with "MongoDB is fast because" statements. Some of them may be correct in this specific case. But there may also be cases where good old SQL has performance advantages over shiny new NoSQL.
I am now somewhat in favor of Django Haystack with Solr as the search engine. Using MongoDB would require us to spend considerable time developing a search interface that handles complicated queries, and then creating appropriate indices to support those queries.
Sounds good! You should include your preliminary results in your GSoC application, too.
Best, Karsten