Re: [tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

14 Apr 2013


Hi Karsten,
I am sorry for the late reply. I had a seminar presentation on Friday and
have another on Monday, so I have been a little busy preparing for them.
I downloaded about 1 GB of server descriptors from the metrics website. My
plan is to measure the performance of a search application in Django with
a MySQL backend and with a MongoDB backend, so I implemented a basic app
for each. I processed each server descriptor file and extracted
router_nickname, router_ip, tor_version and platform_os as searchable
fields. At the time of writing this email, I have processed around 330,000
files for MySQL and have the data of 670,000 files in MongoDB. I cannot
process all the files, because that 1 GB of data is composed of millions
of files and processing is slow on my system.
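A minimal sketch of the extraction step described above, not the actual
code from my apps. It assumes the descriptor has a "router" line (nickname,
then address) and a "platform" line of the form "Tor <version> ... on <OS>".
The sample text, the function name extract_fields and the regular
expression are illustrative assumptions.

```python
import re

# Made-up sample in the general shape of a server descriptor.
SAMPLE_DESCRIPTOR = """\
router ExampleRelay 203.0.113.7 9001 0 9030
platform Tor 0.2.3.25 on Linux
published 2013-04-12 10:15:00
"""

# Assumed platform layout: "Tor <version> [extra] on <operating system>".
PLATFORM_RE = re.compile(r"^Tor (\S+).*? on (.+)$")


def extract_fields(descriptor_text):
    """Return the four searchable fields; missing ones stay None."""
    fields = {
        "router_nickname": None,
        "router_ip": None,
        "tor_version": None,
        "platform_os": None,
    }
    for line in descriptor_text.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword == "router":
            parts = rest.split()
            if len(parts) >= 2:
                fields["router_nickname"] = parts[0]
                fields["router_ip"] = parts[1]
        elif keyword == "platform":
            match = PLATFORM_RE.match(rest.strip())
            if match:
                fields["tor_version"] = match.group(1)
                fields["platform_os"] = match.group(2)
    return fields


fields = extract_fields(SAMPLE_DESCRIPTOR)
```

The resulting dictionary can be saved as a MySQL row or as a MongoDB
document, which keeps the two backends comparable.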
My aim is to issue the same queries to both apps and see which one
performs better. Both databases are indexed on the same fields. I will
send you the metrics the day after tomorrow, i.e. on Tuesday.
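The comparison could be driven by a small harness like the sketch below.
It times the same queries against each backend and reports the median of
several runs. The names time_query and compare_backends are hypothetical,
and the two stand-in search functions only mark where the real MySQL and
MongoDB apps would be called.

```python
import statistics
import time


def time_query(run_query, query, repeats=5):
    """Return the median wall-clock seconds that run_query(query) takes."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare_backends(backends, queries, repeats=5):
    """Map each backend name to a {query: median seconds} dictionary."""
    return {
        name: {q: time_query(run_query, q, repeats) for q in queries}
        for name, run_query in backends.items()
    }


# Stand-ins only: the real functions would query the Django apps.
def fake_mysql_search(query):
    return [query]


def fake_mongodb_search(query):
    return [query]


results = compare_backends(
    {"mysql": fake_mysql_search, "mongodb": fake_mongodb_search},
    ["ExampleRelay", "203.0.113.7"],
)
```

Taking the median of repeated runs makes the numbers less sensitive to
one-off effects such as a cold cache on the first query.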
But, theoretically speaking, MongoDB should be fast because every record
is stored as a JSON-like (BSON) document, it is schemaless, and it does
not have to perform any joins. Its indexes are B-trees, which have a
worst-case time complexity of O(log(n)) for insertion, lookup and
deletion. MongoDB also keeps the indexes in RAM as far as memory allows,
for faster searches and fewer disk reads, and it can scale out across
machines through sharding.
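To illustrate why an index makes lookups logarithmic, here is a small
sketch. It is not a B-tree and not how MongoDB is implemented: it swaps in
binary search over a sorted list, which has the same O(log(n)) lookup
bound but needs O(n) for insertion. The class name SortedIndex and the
sample records are assumptions.

```python
import bisect


class SortedIndex:
    """Index one field of a list of records using a sorted key list."""

    def __init__(self, records, field):
        # Pair each key with the record's position, then sort by key.
        self._entries = sorted(
            (record[field], position) for position, record in enumerate(records)
        )
        self._keys = [key for key, _ in self._entries]

    def lookup(self, value):
        """Return positions of all records whose field equals value."""
        low = bisect.bisect_left(self._keys, value)
        high = bisect.bisect_right(self._keys, value)
        return [position for _, position in self._entries[low:high]]


records = [
    {"router_nickname": "relayB"},
    {"router_nickname": "relayA"},
    {"router_nickname": "relayB"},
]
index = SortedIndex(records, "router_nickname")
matches = index.lookup("relayB")
```

Each lookup does two binary searches, so a million indexed keys need
about 20 comparisons per search instead of a full scan.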
I am now somewhat in favor of Django Haystack with Solr as the search
engine. Using MongoDB would require us to spend considerable time
developing the search interface that handles complicated queries, and
then creating appropriate indices to serve those queries.
On Thu, Apr 11, 2013 at 1:50 PM, Karsten Loesing karsten@torproject.org wrote:
On 4/10/13 11:44 PM, Praveen Kumar wrote:
I am Praveen Kumar from India. I want to work on the project "Searchable
Tor descriptor and Metrics data archive". I have participated in past
instances of GSoC with Melange and e-cidadania, and have extensive
experience in development with Python.
For the search application, I propose using Django with MongoDB as a
NoSQL database backend. We have 100GB+ of data, which grows every day,
so using a NoSQL backend will help ensure that our application scales
well with the increase in data as well as user traffic.
The application will have various interfaces such as:
1) Data Updater: This end will connect to the metrics website and
retrieve data periodically via rsync. It will also be responsible for
pre-processing the data into a format suitable for our search
application.
2) Storage End: A relay descriptor can be searched by nickname,
fingerprint, IP address and various other attributes that define a relay
descriptor. So we can preprocess the whole data set, extract the
attributes that define a descriptor and then save them in an appropriate
model that MongoDB provides. Since queries are very fast in a NoSQL
datastore, our searches will be very fast.
3) Search Front End: This will be exposed to the user, who provides a
search query to us.
4) Search Query Processor: This end will process the user's query and
determine its type, e.g. whether the query is an IP address or a
nickname. It will then connect to our Storage End and return the
appropriate data to the Search Front End.
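The type detection in the Search Query Processor could look roughly like
the sketch below. The function name classify_query and the returned
labels are hypothetical. The patterns assume that a fingerprint is
written as 40 hexadecimal characters without spaces and that a nickname
is 1 to 19 alphanumeric characters; both should be checked against the
directory specification before relying on them.

```python
import ipaddress
import re

# Assumed formats, see the note above.
FINGERPRINT_RE = re.compile(r"[0-9A-Fa-f]{40}")
NICKNAME_RE = re.compile(r"[A-Za-z0-9]{1,19}")


def classify_query(query):
    """Guess whether a query is an IP address, fingerprint or nickname."""
    text = query.strip()
    try:
        ipaddress.ip_address(text)  # accepts IPv4 and IPv6 literals
        return "ip_address"
    except ValueError:
        pass
    if FINGERPRINT_RE.fullmatch(text):
        return "fingerprint"
    if NICKNAME_RE.fullmatch(text):
        return "nickname"
    return "unknown"


kinds = [
    classify_query(q)
    for q in ["203.0.113.7", "A" * 40, "ExampleRelay", "not a relay!"]
]
```

The Storage End can then pick the matching indexed field for each label
and reject "unknown" queries before they reach the database.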
Above is a very high-level view of my approach to this project. We can
also use Django Haystack as a search application framework (I did some
research on existing search frameworks). I can implement this app in an
object-oriented way in Python. Since Python is a readable and easy to
understand language, it will be easy for others to understand the
application and make changes to it in little time.
I would like to know if I am thinking in the right direction and would
like to know what Karsten has to say about this.
Hi Praveen!
Glad to see that you're interested in this project!
Your high-level description makes sense to me.  I guess the point where
I'd expect more details in a GSoC application is where you say: "Since
queries are very fast in a NoSQL datastore, our searches will be very
fast."
See also the last paragraph in the project idea: "Applications for this
project should come with a design of the proposed search application,
ideally with a proof-of-concept based on a subset of the available data
to show that it will be able to handle the 100G+ of data."  I'd like to
understand why you think MongoDB will handle searches sufficiently fast.
An alternative to relying on NoSQL databases doing magic is to
investigate Django Haystack and other existing search application
frameworks.
Note that I don't know what's the best tool or design here.  But I ran
into too many pitfalls in the past when I thought a database design was
fast enough to provide data for an Internet-facing service.  That's why
I'd like to see a convincing design first, bonus points if it comes with
a proof of concept.
Best,
Karsten
-- 
Cheers

Praveen
