Re: [tor-dev] Using MaxMind's GeoIP2 databases in tor, BridgeDB, metrics-*, Onionoo, etc.

29 Jan 2014

      On Thu, Jan 16, 2014 at 5:15 AM, Karsten Loesing karsten@torproject.org wrote:
...
Hi devs,
you probably know that we use MaxMind's GeoIP database in various of our
products (list may not be exhaustive):
[...]
...
How do we switch?  First option is to ship their binary database files
and include their APIs [7] in our products.  Looks like there are APIs for C,
Java, and Python, so all the languages we need for the tools listed
above.  Pros: we can kick out our parsing and lookup code.  Cons: we
need to check if their licenses are compatible, we have to kick out our
parsing and lookup code and learn their APIs, and we add new dependencies.
I'm not too thrilled with this option.  The C code in question, while
not terrible, would need quite a bit of auditing, and auditing binary
format parser code written in C is not how we should be spending our
days.
Further, older versions of Tor still need old formats, so we'd
probably want to consider some kind of conversion tool anyway.
Also, (and less importantly) the new binary format contains lots of
extra information we don't actually use.  Sure, our current format is
a bit pointlessly verbose, but it includes less data.  Here are file
sizes in KB, compressed:
76 geoip6.gz
596 geoip.gz
804 GeoLite2-Country.mmdb.gz
Note that uncompressed, their file is shorter, since our current
format does indeed kind of suck.  But file transfer sizes matter for
us more than disk sizes, I think.
(For some work on a terser format still, see
https://trac.torproject.org/projects/tor/ticket/2506 .)
...
Another option is to write a new tool that parses their full databases
and converts them into file formats we already support.  (This would
also allow us to provide a custom format with multiple database versions
which would be pretty useful for metrics, see #6471.)  Also, it looks
like their license, Creative Commons Attribution-ShareAlike 3.0
Unported, allows converting their database to a different format.  If we
want to write such a tool, we have a few options:

- We use their database specification [8] and write our own parser
  using a language of our choice (read: whoever writes it pretty much
  decides).  We could skip the binary search tree part of their files and
  only process the contents.  Whenever they change their format, we'll
  have to adapt.

- We use their Python API [9] to build our parser, though it looks like
  that requires pip or easy_install and compiling their C API.  I don't
  know enough about Python to assess what headaches that's going to cause.

- We use their Java API [10] to build our parser, though we're probably
  forced to use Maven rather than Ant.  I don't have much experience with
  Maven.  Also, using Java probably makes me the default (and only)
  maintainer, which I'd want to avoid if possible.
Writing our own parser seems goofy.
If we're feeling paranoid about memory safety, we might want to avoid
their Python code, since it appears to work by invoking their C code.
The Java code is apparently pure Java. (AFAICT).
(For a tool that's only used for converting files before we ship them,
do we care about memory safety?  Probably a bit.)
One more thing to consider here is the actual API.  I looked at the
Java API for a bit, and it seems to expose functions for doing
specific geoip queries, but not to expose functions for iterating over
the whole binary tree. Once we get the binary tree, converting a
binary tree to an array of ranges requires a little basic algorithmic
thought, but first, we've got to *get* the binary tree. So some
hacking might be needed.
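
To make the tree-to-ranges step concrete, here is a minimal sketch in Python. It does not read MaxMind's file format at all: it assumes the tree has already been extracted somehow into a hypothetical in-memory form, where a node is None (no data), a string (a country code leaf), or a (left, right) tuple for the 0 and 1 branches. The function names are made up for illustration.

```python
def trie_to_ranges(node, bits=32, prefix=0, depth=0):
    """Yield (start, end, value) for each leaf, in ascending address order.

    Hypothetical node shapes (an assumption, not MaxMind's layout):
      None          -> no data for this subtree
      (left, right) -> inner node; left is bit 0, right is bit 1
      anything else -> leaf value, e.g. a country code string
    """
    if node is None:
        return
    if isinstance(node, tuple):
        left, right = node
        # The bit chosen at this level sits at position bits - depth - 1.
        one_bit = 1 << (bits - depth - 1)
        yield from trie_to_ranges(left, bits, prefix, depth + 1)
        yield from trie_to_ranges(right, bits, prefix | one_bit, depth + 1)
        return
    # A leaf at this depth covers every address sharing the prefix.
    start = prefix
    end = prefix | ((1 << (bits - depth)) - 1)
    yield (start, end, node)


def merge_adjacent(ranges):
    """Merge consecutive ranges that touch and carry the same value."""
    merged = []
    for start, end, value in ranges:
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    return merged


# Toy 4-bit address space: two sibling leaves for "AA" collapse into one range.
toy_tree = (("AA", "AA"), ("BB", None))
toy_ranges = merge_adjacent(trie_to_ranges(toy_tree, bits=4))
```

Because the walk visits the 0 branch before the 1 branch, the ranges come out sorted, so merging only needs to look at the previous entry.
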
Another thought is to divide the problem up: write a minimal tool in
Java to convert the document to json or something, and a nice friendly
maintainable conversion tool in Python.
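
As a rough illustration of the Python half of that split, the sketch below turns a JSON dump into range lines. Two things here are assumptions: the JSON layout (a list of objects with "network" and "country" keys) is a hypothetical intermediate format that the Java side would have to produce, and the output is assumed to be the "low,high,CC" layout with integer IPv4 addresses that the current geoip file is understood to use. IPv6 entries are simply skipped.

```python
import ipaddress
import json


def json_to_geoip_lines(text):
    """Convert a hypothetical JSON dump into 'low,high,CC' lines (IPv4 only)."""
    rows = []
    for record in json.loads(text):
        network = ipaddress.ip_network(record["network"])
        if network.version != 4:
            continue  # IPv6 would go to a separate file; not handled here
        low = int(network.network_address)
        high = int(network.broadcast_address)
        rows.append((low, high, record["country"]))
    rows.sort()  # keep output ordered by starting address
    return ["%d,%d,%s" % row for row in rows]


# Made-up sample input, for illustration only.
sample = json.dumps([
    {"network": "1.0.1.0/24", "country": "CN"},
    {"network": "1.0.0.0/24", "country": "AU"},
    {"network": "2001:200::/32", "country": "JP"},
])
lines = json_to_geoip_lines(sample)
```
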
-- 
Nick
