Re: [tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

16 Apr 2013


      On 4/14/13 1:26 PM, Praveen Kumar wrote:
...
Hi Karsten,
Hi Praveen,
...
I am so sorry for replying late. I had a seminar presentation on Friday
and have another on Monday, so I was a little busy preparing for them.
No worries!
...
I downloaded about 1 GB of server descriptors from the metrics website.
I thought of generating some performance metrics for a search
application in Django, once with a MySQL backend and once with a MongoDB
backend, so I implemented two basic Django apps, one per backend. I
processed each file and extracted router_nickname, router_ip,
tor_version and platform_os as searchable fields for each server
descriptor file. At the time of writing this email, I have processed
around 330,000 files for MySQL and have the data of 670,000 files in
MongoDB. I cannot process all the files, as that 1 GB of data is
composed of millions of files and processing is slow on my system.
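
For illustration, the field extraction described above might look roughly like the following. This is a minimal sketch, not the code used for the experiment: the function name extract_fields and the sample descriptor are hypothetical, and it assumes the usual "router <nickname> <address> ..." and "platform Tor <version> on <os>" lines of a server descriptor. Platform lines in any other shape are simply skipped.

```python
import re

# Assumed shape of the platform line: "Tor <version> on <operating system>".
PLATFORM_RE = re.compile(r"^Tor (\S+) on (.+)$")


def extract_fields(descriptor_text):
    """Pull the four searchable fields out of one server descriptor.

    Hypothetical helper: the keys mirror the field names in the email.
    """
    fields = {}
    for line in descriptor_text.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword == "router" and "router_nickname" not in fields:
            parts = rest.split()
            if len(parts) >= 2:
                fields["router_nickname"] = parts[0]
                fields["router_ip"] = parts[1]
        elif keyword == "platform":
            match = PLATFORM_RE.match(rest.strip())
            if match:
                fields["tor_version"] = match.group(1)
                fields["platform_os"] = match.group(2)
    return fields


# Made-up descriptor excerpt (documentation IP address, not a real relay).
SAMPLE = """router ExampleRelay 192.0.2.10 9001 0 9030
platform Tor 0.2.3.25 on Linux
published 2013-04-14 12:00:00
"""

print(extract_fields(SAMPLE))
```

The resulting dictionary can be handed to either backend, so both databases are loaded from the same parsed fields.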
My aim is to issue the same queries to both apps and see which one
performs better. Both databases are indexed on the same fields. I will
send you the metrics the day after tomorrow, i.e. on Tuesday.
Sounds like a fine start.  Be sure to include results of this
performance comparison in your GSoC application!
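
One way to make such a comparison repeatable is a small timing harness that sends the same queries to each backend and reports the median latency. The sketch below is only an assumption about how that could be set up: time_queries and make_sqlite_backend are hypothetical names, and an in-memory SQLite table stands in for the MySQL and MongoDB backends so that the example runs on its own.

```python
import sqlite3
import statistics
import time


def time_queries(run_query, queries, repeats=5):
    """Return the median wall-clock seconds per query for one backend."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            run_query(query)
            samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def make_sqlite_backend(rows):
    """Stand-in backend: an indexed in-memory table of descriptor fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE descriptor (router_nickname TEXT, router_ip TEXT,"
        " tor_version TEXT, platform_os TEXT)"
    )
    conn.executemany("INSERT INTO descriptor VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_nickname ON descriptor (router_nickname)")

    def run_query(nickname):
        cursor = conn.execute(
            "SELECT router_ip FROM descriptor WHERE router_nickname = ?",
            (nickname,),
        )
        return cursor.fetchall()

    return run_query


# Synthetic rows; real measurements would use the parsed descriptors.
rows = [
    ("relay%d" % i, "192.0.2.%d" % (i % 256), "0.2.3.25", "Linux")
    for i in range(10000)
]
backend = make_sqlite_backend(rows)
nicknames = ["relay42", "relay9999", "missing"]
median_seconds = time_queries(backend, nicknames)
print("median seconds per query: %.6f" % median_seconds)
```

A second backend only needs its own run_query callable; passing both through time_queries with the same query list keeps the comparison like for like.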
...
But, theoretically speaking, MongoDB is fast because every document is
stored as a JSON-like document, it is schemaless and it doesn't have to
perform any joins, etc. The indexes are built on B-trees, which have a
worst-case time complexity of O(log(n)) for insertion, lookup and
deletion. MongoDB also keeps the indexes in RAM as required, for faster
searches and to reduce disk reads. MongoDB also has the capability of
scaling efficiently.
Well, performance of MongoDB vs. MySQL really depends on the problem
you're trying to solve.  For example, we'll have to perform joins when
storing a network status consensus that references 0..n server
descriptors each of which references 0..1 extra-info descriptors.  See
the descriptor formats page for details:
https://metrics.torproject.org/formats.html
Also, with respect to scaling, the plan would be to run this application
on a single server along with other services.
So, in general, I'd be careful with "MongoDB is fast because"
statements.  Some of them may be correct in this specific case.  But
there may also be cases where good old SQL has performance advantages
over shiny new NoSQL.
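
To make the join case concrete, here is a minimal sketch of the relationships described above, using an in-memory SQLite database. The table and column names are hypothetical and the rows are made up. It only illustrates that a consensus references 0..n server descriptors and that each server descriptor references 0..1 extra-info descriptors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE extra_info (digest TEXT PRIMARY KEY, published TEXT);
CREATE TABLE server_descriptor (
    digest TEXT PRIMARY KEY,
    nickname TEXT,
    extra_info_digest TEXT REFERENCES extra_info(digest)  -- 0..1
);
CREATE TABLE consensus (valid_after TEXT PRIMARY KEY);
CREATE TABLE consensus_entry (  -- 0..n descriptors per consensus
    valid_after TEXT REFERENCES consensus(valid_after),
    descriptor_digest TEXT REFERENCES server_descriptor(digest)
);
""")

valid_after = "2013-04-14 12:00:00"
conn.execute("INSERT INTO extra_info VALUES ('e1', '2013-04-14 11:00:00')")
conn.executemany(
    "INSERT INTO server_descriptor VALUES (?, ?, ?)",
    [("d1", "relayA", "e1"), ("d2", "relayB", None)],
)
conn.execute("INSERT INTO consensus VALUES (?)", (valid_after,))
conn.executemany(
    "INSERT INTO consensus_entry VALUES (?, ?)",
    [(valid_after, "d1"), (valid_after, "d2")],
)

# Inner join for the descriptors listed in the consensus, left join for the
# optional extra-info descriptor (NULL when a relay has none).
rows = conn.execute(
    """
    SELECT s.nickname, e.published
    FROM consensus_entry c
    JOIN server_descriptor s ON s.digest = c.descriptor_digest
    LEFT JOIN extra_info e ON e.digest = s.extra_info_digest
    WHERE c.valid_after = ?
    ORDER BY s.nickname
    """,
    (valid_after,),
).fetchall()
print(rows)
```

In a document store, the same question needs either embedded copies of the referenced descriptors or several lookups in application code, which is the trade-off the comparison should measure.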
...
I am now somewhat in favor of Django Haystack with Solr as the search
engine. Using MongoDB would require us to spend considerable time
developing a search interface that handles complicated queries, and then
creating appropriate indexes to support those queries.
Sounds good!  You should include your preliminary results in your GSoC
application, too.
Best,
Karsten
