On 5/28/13 1:50 AM, Damian Johnson wrote:
Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z

% file all.z
all.z: data

% gzip -d all.z
gzip: all.z: not in gzip format

% zcat all.z
gzip: all.z: not in gzip format

% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
Maybe a fresh set of eyes will spot something I'm obviously missing. Do you spot anything?
Hmmm, that's a fine question. I remember this was tricky in Java and took me a while to figure out. I did a quick Google search, but I didn't find a way to decompress tor's .z files using shell commands or Python. :/
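One guess, in case it helps: that "incomplete or truncated stream" error suggests the trailing checksum is missing, and zlib.decompressobj() is more forgiving about that than zlib.decompress(). Something like the following might work, though I haven't tested it against tor's .z files:

import zlib

with open('all.z', 'rb') as desc_file:
  data = desc_file.read()

# Unlike zlib.decompress(), a decompressobj hands back whatever it could
# decode even if the stream's trailing checksum never arrives.
decompressor = zlib.decompressobj()
print decompressor.decompress(data) + decompressor.flush()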
How about we focus on the API first and ignore the fact that compressed responses exist?
Speaking of remote descriptor fetching, any thought on the API? I'm thinking of a 'stem/descriptor/remote.py' module with...
- get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
- fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
(Minor note: if possible, let's separate methods like this into one method that makes a network request and another method that works only locally.)
- get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
- query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
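To make that a bit more concrete, here's a rough, untested sketch of how those pieces could hang together in stem/descriptor/remote.py. Everything below is illustrative rather than working code: only moria1's address (the one from the wget example above) is real, the consensus handling leans on Controller.get_network_statuses() if I recall the controller API right, and _download_from stands in for the actual HTTP request and parsing.

import random

# (IP, DirPort) for the directory authorities; only moria1 is filled in
# here, the rest would be copied from the authorities array in config.c.
DIRECTORY_AUTHORITIES = [
  ('128.31.0.34', 9131),  # moria1
]

_directory_cache = list(DIRECTORY_AUTHORITIES)


def get_directory_authorities():
  return list(DIRECTORY_AUTHORITIES)


def get_directory_cache():
  # directory mirrors if fetch_directory_mirrors() has been called,
  # otherwise the directory authorities
  return list(_directory_cache)


def fetch_directory_mirrors(controller = None):
  # Populates the mirror cache from the present consensus, using a local
  # Controller if one is provided rather than polling an authority.
  global _directory_cache

  if controller:
    statuses = controller.get_network_statuses()
  else:
    statuses = _download_from(random.choice(DIRECTORY_AUTHORITIES),
                              'network-status-consensus-3 1.0')

  _directory_cache = [(desc.address, desc.dir_port) for desc in statuses
                      if 'V2Dir' in desc.flags and desc.dir_port]


def query(descriptor_type, fingerprint = None, retries = 5):
  for attempt in range(retries):
    if attempt == retries - 1:
      directory = random.choice(get_directory_authorities())  # last try is an authority
    else:
      directory = random.choice(get_directory_cache())

    try:
      return _download_from(directory, descriptor_type, fingerprint)
    except IOError:
      continue  # fail over to another randomly selected directory

  raise IOError('Unable to download %s descriptors after %i attempts'
                % (descriptor_type, retries))


def _download_from(directory, descriptor_type, fingerprint = None):
  # placeholder for the actual request, decompression, and parsing
  raise NotImplementedError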
========================================
I'd imagine use of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this was a longer living script doing this would
# relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
This API looks like a fine way to manually download descriptors, but I wonder if we can make the downloader smarter than that.
The two main use cases I have in mind are:
1. Download and archive relay descriptors: metrics-db uses different sources to archive relay descriptors including gabelmoo's cached-* files. But there's always a chance of missing a descriptor that is referenced from another descriptor. metrics-db (or the Python equivalent) would initialize the downloader by telling it which descriptors it's missing, and the downloader would go fetch them.
2. Monitor the consensus process for issues: DocTor downloads the current consensus from all directory authorities and all votes from any directory authority. It doesn't care about server or extra-info descriptors, but in contrast to metrics-db it cares about having the consensus from all directory authorities. Its Python equivalent would tell the downloader which descriptors it's interested in, let it fetch those descriptors, and then evaluate the result.
So, the question is: should we generalize these two use cases and make the downloader smart enough to handle them and maybe future use cases, or should we leave the specifics in metrics-db and DocTor and keep the API simple?
Here's how a generalized downloader API might look:
Phase 1: configure the downloader by telling it:
- what descriptor types we're interested in;
- whether we only care about the descriptor content or about downloading descriptors from specific directory authorities or mirrors;
- whether we're only interested in descriptors that we didn't know before, either by asking the downloader to use an internal download history file or by passing identifiers of descriptors we already know;
- to prefer directory mirrors over directory authorities as soon as it has learned about them, and to memorize directory mirrors for future runs;
- to use directory mirrors from the soon-to-be-added fallback directory list (#8374);
- parameters like timeouts and maximum retries; and
- parameters to the descriptor parser that will handle downloaded contents.
Phase 2: run downloads and pass retrieved descriptors (including information about the directory it downloaded from, the download time, and maybe other meta data) in an iterator similar to what the descriptor reader does.
Phase 3: when all downloads are done and downloaded descriptors are processed by the application:
- query the download history or
- ask the downloader to store its download history.
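To make the three phases concrete, here's a very rough sketch; every class, method, and attribute name below is made up purely for illustration, and the actual fetching is left as a stub.

# every name below is hypothetical, just to illustrate the three phases

class DescriptorDownloader(object):
  def __init__(self, descriptor_types, prefer_mirrors = True,
               known_digests = None, timeout = 60, max_retries = 3):
    # phase 1: configuration only, no network activity yet
    self.descriptor_types = descriptor_types
    self.prefer_mirrors = prefer_mirrors
    self.known_digests = set(known_digests or [])
    self.timeout = timeout
    self.max_retries = max_retries
    self._downloaded = []

  def download(self):
    # phase 2: generator yielding (descriptor, source_directory, download_time)
    # tuples as requests complete, much like the descriptor reader's iterator
    for descriptor_type in self.descriptor_types:
      for desc, source, when in self._fetch(descriptor_type):
        if desc.digest() not in self.known_digests:
          self._downloaded.append(desc.digest())
          yield desc, source, when

  def save_history(self, path):
    # phase 3: persist what we downloaded so the next run can skip it
    with open(path, 'w') as history_file:
      history_file.write('\n'.join(self._downloaded))

  def _fetch(self, descriptor_type):
    # the batching, mirror selection, and retry logic would live here
    raise NotImplementedError

metrics-db could then seed known_digests with the descriptors it's missing, while DocTor would ask for the consensus and votes from every authority and simply iterate over what comes back.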
Note that the downloader could do all kinds of smart things in phase 2, like concatenating up to 96 descriptors in a single request, switching to all.z if there are many more descriptors to download, round-robining between directories, making requests in parallel, etc.
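(For what it's worth, if I remember dir-spec correctly the concatenation just means joining fingerprints with '+' in the request path, roughly like this; the fingerprints below are placeholders:)

fingerprints = ['<fp1>', '<fp2>', '<fp3>']  # placeholders, up to 96 of them
url = 'http://128.31.0.34:9131/tor/server/fp/%s.z' % '+'.join(fingerprints)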
If we go for the simple API you suggest above, the application would have to implement this smart stuff itself.
All the best, Karsten
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes