On 5/28/13 1:50 AM, Damian Johnson wrote:
Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z

% file all.z
all.z: data

% gzip -d all.z
gzip: all.z: not in gzip format

% zcat all.z
gzip: all.z: not in gzip format

% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
Maybe a fresh set of eyes will spot something I'm obviously missing. Do you spot anything?
Hmmm, that's a fine question. I remember this was tricky in Java and took me a while to figure out. I did a quick Google search, but I didn't find a way to decompress tor's .z files using shell commands or Python. :/
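One guess, in case it helps: that "incomplete or truncated stream" error suggests the trailing checksum is missing, and zlib.decompressobj() is more forgiving about that than zlib.decompress(). Something like the following might work, though I haven't tested it against tor's .z files:

import zlib

with open('all.z', 'rb') as desc_file:
  data = desc_file.read()

# Unlike zlib.decompress(), a decompressobj hands back whatever it could
# decode even if the stream's trailing checksum never arrives.
decompressor = zlib.decompressobj()
print decompressor.decompress(data) + decompressor.flush()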
How about we focus on the API first and ignore the fact that compressed responses exist?
Speaking of remote descriptor fetching, any thought on the API? I'm thinking of a 'stem/descriptor/remote.py' module with...
- get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
- fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
(Minor note: if possible, let's separate methods like this into one method that makes a network request and another method that works only locally.)
- get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
- query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
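To make that a bit more concrete, here's a rough, untested sketch of how those pieces could hang together in stem/descriptor/remote.py. Everything below is illustrative rather than working code: only moria1's address (the one from the wget example above) is real, the consensus handling leans on Controller.get_network_statuses() if I recall the controller API right, and _download_from stands in for the actual HTTP request and parsing.

import random

# (IP, DirPort) for the directory authorities; only moria1 is filled in
# here, the rest would be copied from the authorities array in config.c.
DIRECTORY_AUTHORITIES = [
  ('128.31.0.34', 9131),  # moria1
]

_directory_cache = list(DIRECTORY_AUTHORITIES)


def get_directory_authorities():
  return list(DIRECTORY_AUTHORITIES)


def get_directory_cache():
  # directory mirrors if fetch_directory_mirrors() has been called,
  # otherwise the directory authorities
  return list(_directory_cache)


def fetch_directory_mirrors(controller = None):
  # Populates the mirror cache from the present consensus, using a local
  # Controller if one is provided rather than polling an authority.
  global _directory_cache

  if controller:
    statuses = controller.get_network_statuses()
  else:
    statuses = _download_from(random.choice(DIRECTORY_AUTHORITIES),
                              'network-status-consensus-3 1.0')

  _directory_cache = [(desc.address, desc.dir_port) for desc in statuses
                      if 'V2Dir' in desc.flags and desc.dir_port]


def query(descriptor_type, fingerprint = None, retries = 5):
  for attempt in range(retries):
    if attempt == retries - 1:
      directory = random.choice(get_directory_authorities())  # last try is an authority
    else:
      directory = random.choice(get_directory_cache())

    try:
      return _download_from(directory, descriptor_type, fingerprint)
    except IOError:
      continue  # fail over to another randomly selected directory

  raise IOError('Unable to download %s descriptors after %i attempts'
                % (descriptor_type, retries))


def _download_from(directory, descriptor_type, fingerprint = None):
  # placeholder for the actual request, decompression, and parsing
  raise NotImplementedError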
========================================
I'd imagine use of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this was a longer living script doing this would
# relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
This API looks like a fine way to manually download descriptors, but I wonder if we can make the downloader smarter than that.
The two main use cases I have in mind are:
1. Download and archive relay descriptors: metrics-db uses different sources to archive relay descriptors including gabelmoo's cached-* files. But there's always a chance of missing a descriptor that is referenced from another descriptor. metrics-db (or the Python equivalent) would initialize the downloader by telling it which descriptors it's missing, and the downloader would go fetch them.
2. Monitor the consensus process for issues: DocTor downloads the current consensus from all directory authorities and all votes from any directory authority. It doesn't care about server or extra-info descriptors, but in contrast to metrics-db it cares about having the consensus from all directory authorities. Its Python equivalent would tell the downloader which descriptors it's interested in, let it fetch those descriptors, and then evaluate the result.
So, the question is: should we generalize these two use cases and make the downloader smart enough to handle them and maybe future use cases, or should we leave the specifics in metrics-db and DocTor and keep the API simple?
Here's how a generalized downloader API might look:
Phase 1: configure the downloader by telling it:
- what descriptor types we're interested in;
- whether we only care about the descriptor content or about downloading descriptors from specific directory authorities or mirrors;
- whether we're only interested in descriptors that we didn't know before, either by asking the downloader to use an internal download history file or by passing identifiers of descriptors we already know;
- to prefer directory mirrors over directory authorities as soon as it has learned about them, and to memorize directory mirrors for future runs;
- to use directory mirrors from the soon-to-be-added fallback directory list (#8374);
- parameters like timeouts and maximum retries; and
- parameters to the descriptor parser that will handle downloaded contents.
Phase 2: run downloads and pass retrieved descriptors (including information about the directory it downloaded from, the download time, and maybe other meta data) in an iterator similar to what the descriptor reader does.
Phase 3: when all downloads are done and downloaded descriptors are processed by the application:
- query the download history or
- ask the downloader to store its download history.
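To make the three phases concrete, here's a very rough sketch; every class, method, and attribute name below is made up purely for illustration, and the actual fetching is left as a stub.

# every name below is hypothetical, just to illustrate the three phases

class DescriptorDownloader(object):
  def __init__(self, descriptor_types, prefer_mirrors = True,
               known_digests = None, timeout = 60, max_retries = 3):
    # phase 1: configuration only, no network activity yet
    self.descriptor_types = descriptor_types
    self.prefer_mirrors = prefer_mirrors
    self.known_digests = set(known_digests or [])
    self.timeout = timeout
    self.max_retries = max_retries
    self._downloaded = []

  def download(self):
    # phase 2: generator yielding (descriptor, source_directory, download_time)
    # tuples as requests complete, much like the descriptor reader's iterator
    for descriptor_type in self.descriptor_types:
      for desc, source, when in self._fetch(descriptor_type):
        if desc.digest() not in self.known_digests:
          self._downloaded.append(desc.digest())
          yield desc, source, when

  def save_history(self, path):
    # phase 3: persist what we downloaded so the next run can skip it
    with open(path, 'w') as history_file:
      history_file.write('\n'.join(self._downloaded))

  def _fetch(self, descriptor_type):
    # the batching, mirror selection, and retry logic would live here
    raise NotImplementedError

metrics-db could then seed known_digests with the descriptors it's missing, while DocTor would ask for the consensus and votes from every authority and simply iterate over what comes back.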
Note that the downloader could do all kinds of smart things in phase 2, like concatenating up to 96 descriptors in a single request, switching to all.z if there are many more descriptors to download, round-robining between directories, making requests in parallel, etc.
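(For what it's worth, if I remember dir-spec correctly the concatenation just means joining fingerprints with '+' in the request path, roughly like this; the fingerprints below are placeholders:)

fingerprints = ['<fp1>', '<fp2>', '<fp3>']  # placeholders, up to 96 of them
url = 'http://128.31.0.34:9131/tor/server/fp/%s.z' % '+'.join(fingerprints)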
If we go for the simple API you suggest above, the application would have to implement this smart stuff itself.
All the best, Karsten
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes