Hi devs,
you probably know that we use MaxMind's GeoIP database in various of our products (list may not be exhaustive):
- tor: We ship little-t-tor with a geoip and a geoip6 file for clients to support excluding relays by country code and for relays to generate by-country statistics.
- BridgeDB: I vaguely recall that the BridgeDB service uses GeoIP data to return only bridges that are not blocked in a user's country. Or maybe that was a feature yet to be implemented.
- Onionoo: The Onionoo service uses MaxMind's city database to provide location information of relays. (It also uses MaxMind's ASN database to provide information on AS number and name.)
- metrics-db: I'm planning to use GeoIP data to resolve bridge IP addresses to country codes in the bridge descriptor sanitizing process.
- metrics-web: We have been using GeoIP data to provide statistics on relays by country. This is currently disabled because the implementation was eating too many resources, but I plan to put these statistics back.
However, the GeoIP database that we currently use has a big shortcoming: it replaces valid country codes with A1 or A2 whenever MaxMind thinks that a relay is an "anonymizing proxy" or "satellite provider".
That's why we currently repair their database by either automatically guessing what country code an A1 entry could have had [1, 2], or by manually looking it up in RIR delegation files [3, 4]. This is just a workaround. Also, I think BridgeDB doesn't repair its GeoIP database.
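To make the "automatically guessing" step concrete, here is a minimal sketch of one plausible heuristic: an A1 range inherits the country code of its two direct neighbors when they agree. This is an assumption for illustration only and is not taken from deanonymind.py; the function name and the (start, end, code) tuple layout are hypothetical.

```python
def guess_a1_countries(ranges):
    """Return a copy of ranges with some A1 entries replaced.

    ranges is a list of (start, end, code) tuples, sorted by start and
    non-overlapping. An A1 range is rewritten only if the ranges directly
    before and after it are contiguous with it and share one real code.
    """
    repaired = []
    for i, (start, end, code) in enumerate(ranges):
        if code == "A1" and 0 < i < len(ranges) - 1:
            _, prev_end, prev_code = ranges[i - 1]
            next_start, _, next_code = ranges[i + 1]
            neighbors_agree = (prev_code == next_code
                               and prev_code not in ("A1", "A2"))
            contiguous = prev_end + 1 == start and end + 1 == next_start
            if neighbors_agree and contiguous:
                code = prev_code
        repaired.append((start, end, code))
    return repaired


# Made-up sample data: the first A1 range sits between two DE ranges,
# the second one sits between DE and US and is left alone.
sample = [(0, 9, "DE"), (10, 19, "A1"), (20, 29, "DE"),
          (30, 39, "A1"), (40, 49, "US")]
repaired = guess_a1_countries(sample)
```

Anything such a heuristic cannot resolve would still need a manual lookup, which is why this remains a workaround.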
Here's the good news: MaxMind now provides their databases in new formats which provide the A1/A2 information in *addition* to the correct country codes [5, 6]. We should switch!
How do we switch? First option is to ship their binary database files and include their APIs [7] in our products. Looks like there are APIs for C, Java, and Python, so all the languages we need for the tools listed above. Pros: we can kick out our parsing and lookup code. Cons: we need to check if their licenses are compatible, we have to kick out our parsing and lookup code and learn their APIs, and we add new dependencies.
Another option is to write a new tool that parses their full databases and converts them into file formats we already support. (This would also allow us to provide a custom format with multiple database versions which would be pretty useful for metrics, see #6471.) Also, it looks like their license, Creative Commons Attribution-ShareAlike 3.0 Unported, allows converting their database to a different format. If we want to write such a tool, we have a few options:
- We use their database specification [8] and write our own parser using a language of our choice (read: whoever writes it pretty much decides). We could skip the binary search tree part of their files and only process the contents. Whenever they change their format, we'll have to adapt.
- We use their Python API [9] to build our parser, though it looks like that requires pip or easy_install and compiling their C API. I don't know enough about Python to assess what headaches that's going to cause.
- We use their Java API [10] to build our parser, though we're probably forced to use Maven rather than Ant. I don't have much experience with Maven. Also, using Java probably makes me the default (and only) maintainer, which I'd want to avoid if possible.
Thoughts? What other options did I miss, and what pros and cons did I overlook?
And is this something that people on this list would want to help with, once we've agreed on one of the options? If so, please feel free to join the discussion now and maybe influence which path we're going to take.
All the best, Karsten
[1] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/deanonymind.py
[2] https://gitweb.torproject.org/onionoo.git/blob/HEAD:/geoip/deanonymind.py
[3] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/geoip-manual
[4] https://gitweb.torproject.org/onionoo.git/blob/HEAD:/geoip/geoip-manual
[5] http://dev.maxmind.com/geoip/geoip2/whats-new-in-geoip2/
[6] http://dev.maxmind.com/geoip/geoip2/geolite2/
[7] http://dev.maxmind.com/geoip/geoip2/downloadable/
[8] https://github.com/marklr/MaxMind-IPDB-perl/blob/master/docs/MaxMind-IPDB-sp...
[9] https://pypi.python.org/pypi/geoip2
[10] https://github.com/maxmind/MaxMind-DB-Reader-java
On Thu, Jan 16, 2014 at 5:15 AM, Karsten Loesing karsten@torproject.org wrote:
Hi devs,
you probably know that we use MaxMind's GeoIP database in various of our products (list may not be exhaustive):
[...]
How do we switch? First option is to ship their binary database files and include their APIs [7] in our products. Looks like there are APIs for C, Java, and Python, so all the languages we need for the tools listed above. Pros: we can kick out our parsing and lookup code. Cons: we need to check if their licenses are compatible, we have to kick out our parsing and lookup code and learn their APIs, and we add new dependencies.
I'm not too thrilled with this option. The C code in question, while not terrible, would need quite a bit of auditing, and auditing binary format parser code written in C is not how we should be spending our days.
Further, older versions of Tor still need old formats, so we'd probably want to consider some kind of conversion tool anyway.
Also, (and less importantly) the new binary format contains lots of extra information we don't actually use. Sure, our current format is a bit pointlessly verbose, but it includes less data. Here are file sizes in KB, compressed:
 76 geoip6.gz
596 geoip.gz
804 GeoLite2-Country.mmdb.gz
Note that uncompressed, their file is shorter, since our current format does indeed kind of suck. But file transfer sizes matter for us more than disk sizes, I think.
(For some work on a terser format still, see https://trac.torproject.org/projects/tor/ticket/2506 .)
Another option is to write a new tool that parses their full databases and converts them into file formats we already support. (This would also allow us to provide a custom format with multiple database versions which would be pretty useful for metrics, see #6471.) Also, it looks like their license, Creative Commons Attribution-ShareAlike 3.0 Unported, allows converting their database to a different format. If we want to write such a tool, we have a few options:
- We use their database specification [8] and write our own parser using a language of our choice (read: whoever writes it pretty much decides). We could skip the binary search tree part of their files and only process the contents. Whenever they change their format, we'll have to adapt.
- We use their Python API [9] to build our parser, though it looks like that requires pip or easy_install and compiling their C API. I don't know enough about Python to assess what headaches that's going to cause.
- We use their Java API [10] to build our parser, though we're probably forced to use Maven rather than Ant. I don't have much experience with Maven. Also, using Java probably makes me the default (and only) maintainer, which I'd want to avoid if possible.
Writing our own parser seems goofy.
If we're feeling paranoid about memory safety, we might want to avoid their Python code, since it appears to work by invoking their C code. The Java code is apparently pure Java (AFAICT).
(For a tool that's only used for converting files before we ship them, do we care about memory safety? Probably a bit.)
One more thing to consider here is the actual API. I looked at the Java API for a bit, and it seems to expose functions for doing specific geoip queries, but not to expose functions for iterating over the whole binary tree. Once we get the binary tree, converting a binary tree to an array of ranges requires a little basic algorithmic thought, but first, we've got to *get* the binary tree. So some hacking might be needed.
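The tree-to-ranges step mentioned above can be sketched roughly as follows. This is not the MaxMind format or any existing API: the tree here is a hypothetical in-memory structure where an inner node is a (left, right) tuple, a leaf is a country-code string, and None means "no data". The function name is also made up for illustration.

```python
def tree_to_ranges(node, bits=32, prefix=0, depth=0, out=None):
    """Flatten a binary prefix tree into sorted (start, end, code) ranges.

    node   -- (left, right) tuple, a country-code string, or None
    bits   -- address width (32 for IPv4, 128 for IPv6)
    prefix -- the address bits consumed so far, as an integer
    depth  -- how many bits have been consumed
    """
    if out is None:
        out = []
    if node is None:
        return out
    if isinstance(node, str):
        # A leaf at this depth covers every address sharing the prefix.
        start = prefix << (bits - depth)
        end = start + (1 << (bits - depth)) - 1
        if out and out[-1][2] == node and out[-1][1] + 1 == start:
            # Merge with the previous range if contiguous and same code.
            out[-1] = (out[-1][0], end, node)
        else:
            out.append((start, end, node))
        return out
    left, right = node
    # Visiting the 0-branch before the 1-branch yields ascending order.
    tree_to_ranges(left, bits, prefix << 1, depth + 1, out)
    tree_to_ranges(right, bits, (prefix << 1) | 1, depth + 1, out)
    return out


# Toy 4-bit address space: 0-7 is DE (two sibling leaves that merge),
# 8-11 is US, 12-15 has no data.
toy_tree = (("DE", "DE"), ("US", None))
toy_ranges = tree_to_ranges(toy_tree, bits=4)
```

The in-order walk is what makes the output sorted for free, so the only extra work is merging neighboring leaves that carry the same code.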
Another thought is to divide the problem up: write a minimal tool in Java to convert the document to json or something, and a nice friendly maintainable conversion tool in Python.
On Wed, Jan 29, 2014 at 1:53 PM, Nick Mathewson nickm@alum.mit.edu wrote:
On Thu, Jan 16, 2014 at 5:15 AM, Karsten Loesing karsten@torproject.org wrote:
[...]
Another option is to write a new tool that parses their full databases and converts them into file formats we already support
[...]
Writing our own parser seems goofy.
Then again, I'm a goofy guy, and though the format's kinda ugly, I've seen much worse.
https://github.com/nmathewson/mmdb-convert
If we use this thing, we should move it over into src/config in Tor.
It needs more hacking and testing, but at least it's pure python.
svv,
On 30/01/14 19:31, Nick Mathewson wrote:
On Wed, Jan 29, 2014 at 1:53 PM, Nick Mathewson nickm@alum.mit.edu wrote:
On Thu, Jan 16, 2014 at 5:15 AM, Karsten Loesing karsten@torproject.org wrote:
[...]
Another option is to write a new tool that parses their full databases and converts them into file formats we already support
[...]
Writing our own parser seems goofy.
Then again, I'm a goofy guy, and though the format's kinda ugly, I've seen much worse.
https://github.com/nmathewson/mmdb-convert
If we use this thing, we should move it over into src/config in Tor.
It needs more hacking and testing, but at least it's pure python.
This is great!
I started trying it out and reviewing it. I'm planning to improve it this week until it writes geoip and geoip6 files that we can then place into src/config/. (And I'm thinking about using the same file in Onionoo and dropping city information there, for simplicity.)
Thanks for starting this!
All the best, Karsten
On 03/02/14 16:23, Karsten Loesing wrote:
On 30/01/14 19:31, Nick Mathewson wrote:
On Wed, Jan 29, 2014 at 1:53 PM, Nick Mathewson nickm@alum.mit.edu wrote:
On Thu, Jan 16, 2014 at 5:15 AM, Karsten Loesing karsten@torproject.org wrote:
[...]
Another option is to write a new tool that parses their full databases and converts them into file formats we already support
[...]
Writing our own parser seems goofy.
Then again, I'm a goofy guy, and though the format's kinda ugly, I've seen much worse.
https://github.com/nmathewson/mmdb-convert
If we use this thing, we should move it over into src/config in Tor.
It needs more hacking and testing, but at least it's pure python.
This is great!
I started trying it out and reviewing it. I'm planning to improve it this week until it writes geoip and geoip6 files that we can then place into src/config/. (And I'm thinking about using the same file in Onionoo and dropping city information there, for simplicity.)
Thanks for starting this!
Keeping this list in the loop on how most of our IP-to-country resolution problems have been solved by now. Quoting from my original mail sent to this list on January 16, 2014:
- tor: We ship little-t-tor with a geoip and a geoip6 file for clients to support excluding relays by country code and for relays to generate by-country statistics.
We added a slightly revised version of Nick's mmdb-convert tool to tor master [0]. We also produced a new geoip file [1] for IPv4 addresses and a new geoip6 file [2] for IPv6 addresses based on the February 7, 2014 MaxMind GeoLite2 Country database.
[0] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/mmdb-convert.py
[1] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/geoip
[2] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/geoip6
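For readers who want to consume the IPv4 file from their own tools, here is a minimal lookup sketch. It assumes each data line has the form "low,high,CC" with low and high as decimal integers, and that lines starting with "#" are comments; check the actual file before relying on this. The function names and the sample lines are illustrative only.

```python
import bisect
import ipaddress


def parse_geoip(lines):
    """Parse "low,high,CC" lines into a sorted list of (low, high, code)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        low, high, code = line.split(",")
        entries.append((int(low), int(high), code))
    entries.sort()
    return entries


def lookup(entries, starts, address):
    """Return the country code for a dotted-quad address, or None."""
    value = int(ipaddress.IPv4Address(address))
    # Find the last range whose start is <= value.
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and entries[i][1] >= value:
        return entries[i][2]
    return None


# Illustrative sample lines, not copied from the shipped file.
sample_lines = [
    "# comment line",
    "16777216,16777471,AU",   # 1.0.0.0 - 1.0.0.255
    "16777472,16778239,CN",   # 1.0.1.0 - 1.0.3.255
]
entries = parse_geoip(sample_lines)
starts = [low for low, _, _ in entries]  # precomputed once for bisect
```

With the sample above, lookup(entries, starts, "1.0.2.1") falls into the second range, and an address outside both ranges returns None.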
- BridgeDB: I vaguely recall that the BridgeDB service uses GeoIP data to return only bridges that are not blocked in a user's country. Or maybe that was a feature yet to be implemented.
Haven't heard from BridgeDB folks. If they need IP-to-country data, they can probably use tor's geoip and geoip6 files.
- Onionoo: The Onionoo service uses MaxMind's city database to provide location information of relays. (It also uses MaxMind's ASN database to provide information on AS number and name.)
Onionoo now uses the GeoLite2 City database to resolve relay IP addresses to country codes, country names, region names, city names, latitudes, and longitudes. I decided against using tor's geoip and geoip6 files, because suddenly MaxMind decided to put CSV files on their website, and updating Onionoo's parser for those was not too hard.
- metrics-db: I'm planning to use GeoIP data to resolve bridge IP addresses to country codes in the bridge descriptor sanitizing process.
I'd probably re-use the code from Onionoo for this task.
- metrics-web: We have been using GeoIP data to provide statistics on relays by country. This is currently disabled because the implementation was eating too many resources, but I plan to put these statistics back.
Same as metrics-db, unless somebody wants to do this in Python. But we have two working solutions that can be adapted for metrics-web.
So, guess that resolves this thread. Thanks again, Nick, for being a goofy guy!
All the best, Karsten