Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
% file all.z
all.z: data
% gzip -d all.z
gzip: all.z: not in gzip format
% zcat all.z
gzip: all.z: not in gzip format
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
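For what it's worth, zlib.decompressobj() is more forgiving than zlib.decompress(): it returns whatever decompresses cleanly rather than failing outright, which would help tell 'wrong format' apart from 'stream cut short'. A minimal sketch, assuming the payload really is zlib data that's merely truncated...

import zlib

with open('all.z') as desc_file:
  data = desc_file.read()

# unlike zlib.decompress(), this won't raise on a truncated stream - it
# simply returns the prefix that decompressed cleanly
decompressor = zlib.decompressobj()
partial = decompressor.decompress(data)

print "recovered %i decompressed bytes" % len(partial)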
Maybe a fresh set of eyes will spot something obvious that I'm missing. Do you see anything?
Speaking of remote descriptor fetching, any thoughts on the API? I'm thinking of a 'stem/descriptor/remote.py' module with the following (a rough skeleton follows the list)...
* get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
* fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
* get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
* query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
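To make that concrete, here's a rough, hypothetical skeleton of the module surface (everything below is illustrative: the moria1 address is just the one from the wget above, and the bodies are placeholders rather than an implementation)...

DIRECTORY_AUTHORITIES = [
  ('128.31.0.34', 9131),  # moria1 (placeholder - the full listing would go here)
]

_directory_mirror_cache = list(DIRECTORY_AUTHORITIES)

def get_directory_authorities():
  return list(DIRECTORY_AUTHORITIES)

def fetch_directory_mirrors(controller = None, consensus_file = None):
  # fills _directory_mirror_cache with (IP, DirPort) tuples for V2Dir relays
  pass

def get_directory_cache():
  return list(_directory_mirror_cache)

def query(descriptor_type, fingerprint = None, retries = 5):
  # picks a random cache entry and fetches the requested descriptors
  pass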
========================================
I'd imagine usage of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this were a longer lived script, doing this
# would relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
Cheers! -Damian
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes
On 5/28/13 1:50 AM, Damian Johnson wrote:
Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
% file all.z
all.z: data
% gzip -d all.z
gzip: all.z: not in gzip format
% zcat all.z
gzip: all.z: not in gzip format
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
Maybe a fresh set of eyes will spot something obvious that I'm missing. Do you see anything?
Hmmm, that's a fine question. I remember this was tricky in Java and took me a while to figure out. I did a quick Google search, but I didn't find a way to decompress tor's .z files using shell commands or Python. :/
How about we focus on the API first and ignore the fact that compressed responses exist?
Speaking of remote descriptor fetching, any thoughts on the API? I'm thinking of a 'stem/descriptor/remote.py' module with...
- get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
- fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
(Minor note: if possible, let's separate methods like this into one method that makes a network request and another method that works only locally.)
- get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
- query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
========================================
I'd imagine usage of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this were a longer lived script, doing this
# would relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
This API looks like a fine way to manually download descriptors, but I wonder if we can make the downloader smarter than that.
The two main use cases I have in mind are:
1. Download and archive relay descriptors: metrics-db uses different sources to archive relay descriptors including gabelmoo's cached-* files. But there's always the chance to miss a descriptor that is referenced from another descriptor. metrics-db (or the Python equivalent) would initialize the downloader by telling it which descriptors it's missing, and the downloader would go fetch them.
2. Monitor consensus process for any issues: DocTor downloads the current consensus from all directory authorities and all votes from any directory authority. It doesn't care about server or extra-info descriptors, but in contrast to metrics-db it cares about having the consensus from all directory authorities. Its Python equivalent would tell the downloader which descriptors it's interested in, let it fetch those descriptors, and then evaluate the result.
So, the question is: should we generalize these two use cases and make the downloader smart enough to handle them and maybe future use cases, or should we leave the specifics in metrics-db and DocTor and keep the API simple?
Here's how a generalized downloader API might look like:
Phase 1: configure the downloader by telling it:

- what descriptor types we're interested in;
- whether we only care about the descriptor content or about downloading descriptors from specific directory authorities or mirrors;
- whether we're only interested in descriptors that we didn't know before, either by asking the downloader to use an internal download history file or by passing identifiers of descriptors we already know;
- to prefer directory mirrors over directory authorities as soon as it has learned about them, and to memorize directory mirrors for future runs;
- to use directory mirrors from the soon-to-be-added fallback directory list (#8374);
- parameters like timeouts and maximum retries; and
- parameters to the descriptor parser that will handle downloaded contents.
Phase 2: run downloads and pass retrieved descriptors (including information about the directory it downloaded from, the download time, and maybe other meta data) in an iterator similar to what the descriptor reader does.
Phase 3: when all downloads are done and downloaded descriptors are processed by the application:

- query the download history or
- ask the downloader to store its download history.
Note that the downloader could do all kinds of smart things in phase 2, like concatenating up to 96 descriptors in a single request, switching to all.z if there are many more descriptors to download, round-robining between directories, making requests in parallel, etc.
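To visualize, the three phases might translate into something like the following (every name here is made up for illustration; archive() and known_digests stand in for application code)...

downloader = DescriptorDownloader()

# phase 1: configure what to download and how
downloader.set_descriptor_types(['server-descriptor 1.0'])
downloader.set_known_digests(known_digests)   # only fetch what we're missing
downloader.set_prefer_mirrors(True)
downloader.set_timeout(60)

# phase 2: iterate over descriptors (plus download meta data) as they arrive
for desc in downloader.run():
  archive(desc)

# phase 3: persist the download history for the next run
downloader.save_download_history('download_history')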
If we go for the simple API you suggest above, the application would have to implement this smart stuff itself.
All the best, Karsten
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes
Thanks Karsten, thanks Kostas! It's a little disturbing that moria1 is providing truncated responses, but I guess we'll dig into that more later.
Great points about needing a more flexible downloader. Here's another attempt, this time with a DescriptorDownloader class that's a bit similar to our present DescriptorReader...
https://trac.torproject.org/projects/tor/wiki/doc/stem#RemoteDescriptorFetch...
It still feels a little clunky to me, and I'm not yet sure how best to handle the use case you mentioned concerning votes. Thoughts?
Feel free to edit the page (that's what wikis are there for!). -Damian
On 6/1/13 9:18 PM, Damian Johnson wrote:
Thanks Karsten, thanks Kostas! It's a little disturbing that moria1 is providing truncated responses, but I guess we'll dig into that more later.
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Great points about needing a more flexible downloader. Here's another attempt, this time with a DescriptorDownloader class that's a bit similar to our present DescriptorReader...
https://trac.torproject.org/projects/tor/wiki/doc/stem#RemoteDescriptorFetch...
It still feels a little clunky to me, and I'm not yet sure how best to handle the use case you mentioned concerning votes. Thoughts?
Feel free to edit the page (that's what wikis are there for!). -Damian
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
- There could be two methods get/set_compression(compression) that define whether to use compression. Assuming we get it working.
- If possible, the downloader should support parallel downloads, with at most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
- I'd want to set a global timeout for all things requested from the directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
- I don't like get_directory_mirrors() as much, because it does two things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
- Related note: I always look if the Dir port is non-zero to decide whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
- All methods starting at get_consensus() should be renamed to fetch_* or query_* to make it clear that these are no getters but perform actual network requests.
- All methods starting at get_consensus() could have an additional parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
- Can you add a fetch|query_votes(fingerprints) method to request vote documents?
All in all, looks like a fine API. When can I use it? :D
Thanks!
Best, Karsten
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty much unused for years, in which case a regression there doesn't seem that surprising. I guess we'll see.
If Kostas wants to lead this investigation then that would be fantastic. :)
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
What kind of additional smartness were you hoping for the downloader to have?
- There could be two methods get/set_compression(compression) that
define whether to use compression. Assuming we get it working.
Good idea. Added.
- If possible, the downloader should support parallel downloads, with at
most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
Usually I'd be all for paralleling our requests to both improve performance and distribute load. However, tor's present interface doesn't really encourage it. There's no way of saying "get half of the server descriptors from location X and the other half from location Y". You can only request specific descriptors or all of them.
Are you thinking that the get_server_descriptors() and friends should only try to parallelize when given a set of fingerprints? If so then that sounds like a fine idea.
- I'd want to set a global timeout for all things requested from the
directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
How does the global timeout differ from our present set_timeout()?
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
Yup, clarified.
- I don't like get_directory_mirrors() as much, because it does two
things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
Very good point. Changed to a use_directory_mirrors() method, callers can then call get_endpoints() if they're really curious what the present directory mirrors are (which I doubt they often will).
- Related note: I always look if the Dir port is non-zero to decide
whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
Sounds good. We'll go for that instead.
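For instance (attribute names assumed to match stem's router status entries, and 'routers' standing in for whatever iterable we parse the consensus into)...

# keep every relay that advertises a non-zero DirPort
mirrors = [(router.address, router.dir_port) for router in routers if router.dir_port]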
- All methods starting at get_consensus() should be renamed to fetch_*
or query_* to make it clear that these are no getters but perform actual network requests.
Going with fetch_*.
- All methods starting at get_consensus() could have an additional
parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in
the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
This sounds really specialized. If callers cared about the download times then that seems best done via something like...
endpoints = ['location1', 'location2'... etc]
for endpoint in endpoints:
  try:
    start_time = time.time()
    downloader.set_endpoints([endpoint])
    downloader.get_consensus()

    print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
  except IOError, exc:
    print "failed to use %s: %s" % (endpoint, exc)
- Can you add a fetch|query_votes(fingerprints) method to request vote
documents?
Added a fetch_vote(authority) to provide an authority's NetworkStatusDocument vote by querying 'http://<hostname>/tor/status-vote/next/authority.z'. However, I'm not clear from the spec how you can query for specific relays (unless you mean fingerprints to be the authority fingerprints).
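Under the hood that boils down to roughly the following (a standalone sketch using urllib2, with the decompression caveat from earlier still applying)...

import urllib2
import zlib

def fetch_vote(address, dirport):
  url = 'http://%s:%i/tor/status-vote/next/authority.z' % (address, dirport)
  response = urllib2.urlopen(url, timeout = 30).read()

  # decompressobj() copes better with the truncation issue discussed earlier
  return zlib.decompressobj().decompress(response)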
Cheers! -Damian
Hi Damian!
On 6/9/13 5:07 AM, Damian Johnson wrote:
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty much unused for years, in which case a regression there doesn't seem that surprising. I guess we'll see.
If Kostas wants to lead this investigation then that would be fantastic. :)
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
What kind of additional smartness were you hoping for the downloader to have?
I had the idea of configuring the downloader to tell it what downloads I'm interested in, let it start downloading, and parse returned descriptors as they come in. But never mind, I think the current API is a fine abstraction that leaves application-specific logic where it belongs.
- There could be two methods get/set_compression(compression) that
define whether to use compression. Assuming we get it working.
Good idea. Added.
- If possible, the downloader should support parallel downloads, with at
most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
Usually I'd be all for paralleling our requests to both improve performance and distribute load. However, tor's present interface doesn't really encourage it. There's no way of saying "get half of the server descriptors from location X and the other half from location Y". You can only request specific descriptors or all of them.
Are you thinking that the get_server_descriptors() and friends should only try to parallelize when given a set of fingerprints? If so then that sounds like a fine idea.
I was only thinking of parallelizing requests for a given set of fingerprints. We can only request at most 96 descriptors at a time, so it's easy to make requests in parallel.
I agree that this doesn't apply to requests for all descriptors.
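For illustration, chunking a fingerprint list into requests of at most 96 might look like this (the '/tor/server/fp/' URL form is my reading of dir-spec)...

def batch_fingerprints(fingerprints, batch_size = 96):
  # yield consecutive slices of at most batch_size fingerprints
  for i in xrange(0, len(fingerprints), batch_size):
    yield fingerprints[i:i + batch_size]

# each batch then becomes a single request, e.g.
#   http://<address>:<dirport>/tor/server/fp/<fp1>+<fp2>+...+<fpN>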
- I'd want to set a global timeout for all things requested from the
directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
How does the global timeout differ from our present set_timeout()?
AIUI, the current timeout is for a single HTTP request, whereas the global timeout is for all HTTP requests made for a single API method.
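Something like this is what I have in mind (names made up), where every individual request gets whatever remains of one overall budget...

import time

class GlobalTimeout(object):
  def __init__(self, seconds):
    self._deadline = time.time() + seconds

  def remaining(self):
    # seconds left in the overall budget, raising once it's exhausted
    remaining = self._deadline - time.time()

    if remaining <= 0:
      raise IOError('global download timeout elapsed')

    return remaining

# per-request usage: urllib2.urlopen(url, timeout = budget.remaining())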
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
Yup, clarified.
- I don't like get_directory_mirrors() as much, because it does two
things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
Very good point. Changed to a use_directory_mirrors() method, callers can then call get_endpoints() if they're really curious what the present directory mirrors are (which I doubt they often will).
- Related note: I always look if the Dir port is non-zero to decide
whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
Sounds good. We'll go for that instead.
- All methods starting at get_consensus() should be renamed to fetch_*
or query_* to make it clear that these are no getters but perform actual network requests.
Going with fetch_*.
- All methods starting at get_consensus() could have an additional
parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in
the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
This sounds really specialized. If callers cared about the download times then that seems best done via something like...
endpoints = ['location1', 'location2'... etc]
for endpoint in endpoints:
  try:
    start_time = time.time()
    downloader.set_endpoints([endpoint])
    downloader.get_consensus()

    print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
  except IOError, exc:
    print "failed to use %s: %s" % (endpoint, exc)
The downside of that approach is that it doesn't make requests in parallel, unless the application parallelizes requests, which I hear isn't trivial in Python. If we can, we should help the application make requests in parallel. Some directories are really slow and can block the application too long if it's only doing one request at a time. (I agree that timeouts can solve that problem to some extent, but a timeout that's chosen too low can also be problematic.)
So, I guess we should either have the API do parallel requests, or describe in a tutorial how to write an application that uses the API to make parallel requests.
By the way, here's an idea how the API could add meta data to descriptors: it could add annotations like "@downloaded-from" and "@downloaded-millis" to descriptors. (Speaking of, is there an easy way to extract the descriptor string without annotations?)
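A tiny sketch of what I mean (the annotation names are just suggestions)...

def annotate(descriptor_text, address, dirport, download_millis):
  # prepend download meta data in the same style as @type annotations
  header = '@downloaded-from %s:%i\n@downloaded-millis %i\n' % (address, dirport, download_millis)
  return header + descriptor_text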
- Can you add a fetch|query_votes(fingerprints) method to request vote
documents?
Added a fetch_vote(authority) to provide an authority's NetworkStatusDocument vote by querying 'http://<hostname>/tor/status-vote/next/authority.z'. However, I'm not clear from the spec how you can query for specific relays (unless you mean fingerprints to be the authority fingerprints).
I meant fingerprints of authorities.
Thanks! Karsten
Hi folks!
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
I did check each downloaded file, each was different in size etc., but not all of them were valid, from a shallow look at things (just chucking the file to zlib and seeing what comes out).
Ok, I'll try looking into this. :) Do note that exams etc. are still ongoing, so this will get pushed back; if anybody figures things out earlier, then great!
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty well unused for years, in which case a regression there doesn't seem that surprising. Guess we'll see.
Noted - OK, will see!
Re: python url request parallelization: @Damian: in the past when I wanted to do concurrent urllib requests, I simply used threading.Thread. There might be caveats here; I'm not familiar with the specifics. I can (again, maybe quite a bit later) try cooking something up to see if such a simple parallelization approach would work. (I should probably just try to do it when I have time; maybe it will turn out that some specific solution is needed, and you guys will have solved it by then anyway.)
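Something along these lines is what I had in mind (a rough, untested sketch; the all.z URL is just the one from earlier in the thread)...

import threading
import urllib2

def fetch(url, results, index):
  try:
    results[index] = urllib2.urlopen(url, timeout = 30).read()
  except IOError, exc:
    results[index] = exc

urls = ['http://128.31.0.34:9131/tor/server/all.z']  # one entry per directory
results = [None] * len(urls)
threads = []

for i, url in enumerate(urls):
  thread = threading.Thread(target = fetch, args = (url, results, i))
  thread.start()
  threads.append(thread)

for thread in threads:
  thread.join()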
Cheers Kostas.
Hi Karsten, I've finally finished implementing stem's module for remote descriptor fetching. Its usage is pleasantly simple - see the example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html https://gitweb.torproject.org/stem.git/commitdiff/7f050eb?hp=b6c23b0
The only part of our wiki plans that I regretted needing to drop is a local filesystem cache (the part that made me think twice was figuring out when to invalidate cached resources). Otherwise this turned into a pleasantly slick module.
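(If the cache ever comes back, maybe plain time-based staleness would be enough; a hypothetical helper...)

import os
import time

def is_cache_stale(path, max_age = 3600):
  # treat anything older than an hour (arbitrary) as stale
  return not os.path.exists(path) or time.time() - os.path.getmtime(path) > max_age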
Cheers! -Damian
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
On 7/22/13 5:32 AM, Damian Johnson wrote:
Hi Karsten, I've finally finished implementing stem's module for remote descriptor fetching. Its usage is pleasantly simple
Great stuff!
- see the
example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html
These two links don't work for me for some reason.
https://gitweb.torproject.org/stem.git/commitdiff/7f050eb?hp=b6c23b0
I'll look into this in more detail on Thursday or Friday, or maybe after the dev meeting. I'm excited to see this, but I'm not yet sure when to find some undisturbed place to focus on anything code-related.
The only part of our wiki plans that I regretted needing to drop is a local filesystem cache (the part that made me think twice was figuring out when to invalidate cached resources). Otherwise this turned into a pleasantly slick module.
Okay. I don't think I'll need the local filesystem cache for the two metrics use cases (consensus-health checker and descriptor archiver). So, this could be something to add later if we want, IMHO.
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
The v3 identity is what v3 directory authorities use to sign their votes and consensuses. Here's a better explanation of v3 identity keys:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/attic/v3-authority-howt...
Best, Karsten
- see the
example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html
These two links don't work for me for some reason.
Very strange. It didn't work when I just tried clicking them from another system but when I did a full refresh (ctrl+shift+r) it did. Probably browser side caching.
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
The v3 identity is what v3 directory authorities use to sign their votes and consensuses. Here's a better explanation of v3 identity keys:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/attic/v3-authority-howt...
I spotted that the v3ident is the same thing as the 'fingerprint' line from the authority key certificates. In my humble opinion this overloaded meaning of a relay fingerprint is confusing, and I'm not clear why we'd reference authorities by the key fingerprint rather than the relay fingerprint. But oh well. If there's anything we can improve in the module pydocs then let me know.
On Tue, May 28, 2013 at 2:50 AM, Damian Johnson <atagar@torproject.org> wrote:
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
[...]
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
This seemed peculiar, so I tried it out. Each time I wget all.z from that address, it's always a different one; I guess that's how it should be, but it seems that sometimes not all of it gets downloaded (hence the actually legit zlib error.)
I was able to make it work after my second download attempt (with your exact code); zlib handles it well. So far it's worked every time since.
This is probably not good if the source may sometimes deliver an incomplete stream.
TL;DR try wget'ing multiple times and getting even more puzzled (?)