Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
% file all.z
all.z: data
% gzip -d all.z
gzip: all.z: not in gzip format
% zcat all.z
gzip: all.z: not in gzip format
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
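For what it's worth, zlib.decompressobj() is more forgiving than zlib.decompress(): it returns whatever decompresses cleanly rather than failing outright, which would help tell 'wrong format' apart from 'stream cut short'. A minimal sketch, assuming the payload really is zlib data that's merely truncated...

import zlib

with open('all.z') as desc_file:
  data = desc_file.read()

# unlike zlib.decompress(), this won't raise on a truncated stream - it
# simply returns the prefix that decompressed cleanly
decompressor = zlib.decompressobj()
partial = decompressor.decompress(data)

print "recovered %i decompressed bytes" % len(partial)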
Maybe a fresh set of eyes will spot something obvious that I'm missing. Do you see anything?
Speaking of remote descriptor fetching, any thoughts on the API? I'm thinking of a 'stem/descriptor/remote.py' module with the following (a rough skeleton follows the list)...
* get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
* fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
* get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
* query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
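To make that concrete, here's a rough, hypothetical skeleton of the module surface (everything below is illustrative: the moria1 address is just the one from the wget above, and the bodies are placeholders rather than an implementation)...

DIRECTORY_AUTHORITIES = [
  ('128.31.0.34', 9131),  # moria1 (placeholder - the full listing would go here)
]

_directory_mirror_cache = list(DIRECTORY_AUTHORITIES)

def get_directory_authorities():
  return list(DIRECTORY_AUTHORITIES)

def fetch_directory_mirrors(controller = None, consensus_file = None):
  # fills _directory_mirror_cache with (IP, DirPort) tuples for V2Dir relays
  pass

def get_directory_cache():
  return list(_directory_mirror_cache)

def query(descriptor_type, fingerprint = None, retries = 5):
  # picks a random cache entry and fetches the requested descriptors
  pass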
========================================
I'd imagine usage of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this were a longer lived script, doing this
# would relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
Cheers! -Damian
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes
On 5/28/13 1:50 AM, Damian Johnson wrote:
Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
% file all.z
all.z: data
% gzip -d all.z
gzip: all.z: not in gzip format
% zcat all.z
gzip: all.z: not in gzip format
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
Maybe a fresh set of eyes will spot something obvious that I'm missing. Do you see anything?
Hmmm, that's a fine question. I remember this was tricky in Java and took me a while to figure out. I did a quick Google search, but I didn't find a way to decompress tor's .z files using shell commands or Python. :/
How about we focus on the API first and ignore the fact that compressed responses exist?
Speaking of remote descriptor fetching, any thoughts on the API? I'm thinking of a 'stem/descriptor/remote.py' module with...
- get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nice method of determining the present authorities besides polling the authorities array of 'src/or/config.c' [5]?
- fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
(Minor note: if possible, let's separate methods like this into one method that makes a network request and another method that works only locally.)
- get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
- query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
========================================
I'd imagine usage of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this were a longer lived script, doing this
# would relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
This API looks like a fine way to manually download descriptors, but I wonder if we can make the downloader smarter than that.
The two main use cases I have in mind are:
1. Download and archive relay descriptors: metrics-db uses different sources to archive relay descriptors including gabelmoo's cached-* files. But there's always the chance to miss a descriptor that is referenced from another descriptor. metrics-db (or the Python equivalent) would initialize the downloader by telling it which descriptors it's missing, and the downloader would go fetch them.
2. Monitor consensus process for any issues: DocTor downloads the current consensus from all directory authorities and all votes from any directory authority. It doesn't care about server or extra-info descriptors, but in contrast to metrics-db it cares about having the consensus from all directory authorities. Its Python equivalent would tell the downloader which descriptors it's interested in, let it fetch those descriptors, and then evaluate the result.
So, the question is: should we generalize these two use cases and make the downloader smart enough to handle them and maybe future use cases, or should we leave the specifics in metrics-db and DocTor and keep the API simple?
Here's how a generalized downloader API might look like:
Phase 1: configure the downloader by telling it:

- what descriptor types we're interested in;
- whether we only care about the descriptor content or about downloading descriptors from specific directory authorities or mirrors;
- whether we're only interested in descriptors that we didn't know before, either by asking the downloader to use an internal download history file or by passing identifiers of descriptors we already know;
- to prefer directory mirrors over directory authorities as soon as it has learned about them, and to memorize directory mirrors for future runs;
- to use directory mirrors from the soon-to-be-added fallback directory list (#8374);
- parameters like timeouts and maximum retries; and
- parameters to the descriptor parser that will handle downloaded contents.
Phase 2: run downloads and pass retrieved descriptors (including information about the directory it downloaded from, the download time, and maybe other meta data) in an iterator similar to what the descriptor reader does.
Phase 3: when all downloads are done and downloaded descriptors are processed by the application:

- query the download history or
- ask the downloader to store its download history.
Note that the downloader could do all kinds of smart things in phase 2, like concatenating up to 96 descriptors in a single request, switching to all.z if there are many more descriptors to download, round-robining between directories, making requests in parallel, etc.
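To visualize, the three phases might translate into something like the following (every name here is made up for illustration; archive() and known_digests stand in for application code)...

downloader = DescriptorDownloader()

# phase 1: configure what to download and how
downloader.set_descriptor_types(['server-descriptor 1.0'])
downloader.set_known_digests(known_digests)   # only fetch what we're missing
downloader.set_prefer_mirrors(True)
downloader.set_timeout(60)

# phase 2: iterate over descriptors (plus download meta data) as they arrive
for desc in downloader.run():
  archive(desc)

# phase 3: persist the download history for the next run
downloader.save_download_history('download_history')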
If we go for the simple API you suggest above, the application would have to implement this smart stuff itself.
All the best, Karsten
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes
Thanks Karsten, thanks Kostas! It's a little disturbing that moria1 is providing truncated responses, but I guess we'll dig into that more later.
Great points about needing a more flexible downloader. Here's another attempt, this time with a DescriptorDownloader class that's a bit similar to our present DescriptorReader...
https://trac.torproject.org/projects/tor/wiki/doc/stem#RemoteDescriptorFetch...
It still feels a little clunky to me, and I'm not yet sure how best to handle the use case you mentioned concerning votes. Thoughts?
Feel free to edit the page (that's what wikis are there for!). -Damian
On 6/1/13 9:18 PM, Damian Johnson wrote:
Thanks Karsten, thanks Kostas! It's a little disturbing that moria1 is providing truncated responses, but I guess we'll dig into that more later.
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Great points about needing a more flexible downloader. Here's another attempt, this time with a DescriptorDownloader class that's a bit similar to our present DescriptorReader...
https://trac.torproject.org/projects/tor/wiki/doc/stem#RemoteDescriptorFetch...
It still feels a little clunky to me, and I'm not yet sure how best to handle the use case you mentioned concerning votes. Thoughts?
Feel free to edit the page (that's what wikis are there for!). -Damian
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
- There could be two methods get/set_compression(compression) that define whether to use compression. Assuming we get it working.
- If possible, the downloader should support parallel downloads, with at most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
- I'd want to set a global timeout for all things requested from the directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
- I don't like get_directory_mirrors() as much, because it does two things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
- Related note: I always look if the Dir port is non-zero to decide whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
- All methods starting at get_consensus() should be renamed to fetch_* or query_* to make it clear that these are no getters but perform actual network requests.
- All methods starting at get_consensus() could have an additional parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
- Can you add a fetch|query_votes(fingerprints) method to request vote documents?
All in all, looks like a fine API. When can I use it? :D
Thanks!
Best, Karsten
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty much unused for years, in which case a regression there doesn't seem that surprising. I guess we'll see.
If Kostas wants to lead this investigation then that would be fantastic. :)
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
What kind of additional smartness were you hoping for the downloader to have?
- There could be two methods get/set_compression(compression) that
define whether to use compression. Assuming we get it working.
Good idea. Added.
- If possible, the downloader should support parallel downloads, with at
most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
Usually I'd be all for paralleling our requests to both improve performance and distribute load. However, tor's present interface doesn't really encourage it. There's no way of saying "get half of the server descriptors from location X and the other half from location Y". You can only request specific descriptors or all of them.
Are you thinking that the get_server_descriptors() and friends should only try to parallelize when given a set of fingerprints? If so then that sounds like a fine idea.
- I'd want to set a global timeout for all things requested from the
directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
How does the global timeout differ from our present set_timeout()?
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
Yup, clarified.
- I don't like get_directory_mirrors() as much, because it does two
things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
Very good point. Changed to a use_directory_mirrors() method, callers can then call get_endpoints() if they're really curious what the present directory mirrors are (which I doubt they often will).
- Related note: I always look if the Dir port is non-zero to decide
whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
Sounds good. We'll go for that instead.
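For instance (attribute names assumed to match stem's router status entries, and 'routers' standing in for whatever iterable we parse the consensus into)...

# keep every relay that advertises a non-zero DirPort
mirrors = [(router.address, router.dir_port) for router in routers if router.dir_port]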
- All methods starting at get_consensus() should be renamed to fetch_*
or query_* to make it clear that these are no getters but perform actual network requests.
Going with fetch_*.
- All methods starting at get_consensus() could have an additional
parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in
the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
This sounds really specialized. If callers cared about the download times then that seems best done via something like...
endpoints = ['location1', 'location2'... etc]
for endpoint in endpoints:
  try:
    start_time = time.time()
    downloader.set_endpoints([endpoint])
    downloader.get_consensus()

    print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
  except IOError, exc:
    print "failed to use %s: %s" % (endpoint, exc)
- Can you add a fetch|query_votes(fingerprints) method to request vote
documents?
Added a fetch_vote(authority) to provide an authority's NetworkStatusDocument vote by querying 'http://<hostname>/tor/status-vote/next/authority.z'. However, I'm not clear from the spec how you can query for specific relays (unless you mean fingerprints to be the authority fingerprints).
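Under the hood that boils down to roughly the following (a standalone sketch using urllib2, with the decompression caveat from earlier still applying)...

import urllib2
import zlib

def fetch_vote(address, dirport):
  url = 'http://%s:%i/tor/status-vote/next/authority.z' % (address, dirport)
  response = urllib2.urlopen(url, timeout = 30).read()

  # decompressobj() copes better with the truncation issue discussed earlier
  return zlib.decompressobj().decompress(response)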
Cheers! -Damian
Hi Damian!
On 6/9/13 5:07 AM, Damian Johnson wrote:
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty much unused for years, in which case a regression there doesn't seem that surprising. I guess we'll see.
If Kostas wants to lead this investigation then that would be fantastic. :)
So, this isn't the super smart downloader that I had in mind, but maybe there should still be some logic left in the application using this API. I can imagine how both DocTor and metrics-db-R could use this API with some modifications. A few comments/suggestions:
What kind of additional smartness were you hoping for the downloader to have?
I had the idea of configuring the downloader to tell it what downloads I'm interested in, let it start downloading, and parse returned descriptors as they come in. But never mind, I think the current API is a fine abstraction that leaves application-specific logic where it belongs.
- There could be two methods get/set_compression(compression) that
define whether to use compression. Assuming we get it working.
Good idea. Added.
- If possible, the downloader should support parallel downloads, with at
most one parallel download per directory. But it's possible to ask multiple directories at the same time. There could be two methods get/set_max_parallel_downloads(max) with a default of 1.
Usually I'd be all for paralleling our requests to both improve performance and distribute load. However, tor's present interface doesn't really encourage it. There's no way of saying "get half of the server descriptors from location X and the other half from location Y". You can only request specific descriptors or all of them.
Are you thinking that the get_server_descriptors() and friends should only try to parallelize when given a set of fingerprints? If so then that sounds like a fine idea.
I was only thinking of parallelizing requests for a given set of fingerprints. We can only request at most 96 descriptors at a time, so it's easy to make requests in parallel.
I agree that this doesn't apply to requests for all descriptors.
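For illustration, chunking a fingerprint list into requests of at most 96 might look like this (the '/tor/server/fp/' URL form is my reading of dir-spec)...

def batch_fingerprints(fingerprints, batch_size = 96):
  # yield consecutive slices of at most batch_size fingerprints
  for i in xrange(0, len(fingerprints), batch_size):
    yield fingerprints[i:i + batch_size]

# each batch then becomes a single request, e.g.
#   http://<address>:<dirport>/tor/server/fp/<fp1>+<fp2>+...+<fpN>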
- I'd want to set a global timeout for all things requested from the
directories, so a get/set_global_timeout(seconds) would be nice. The downloader could throw an exception when the global download timeout elapses. I need such a timeout for hourly running cronjobs to prevent them from overlapping when things are really, really slow.
How does the global timeout differ from our present set_timeout()?
AIUI, the current timeout is for a single HTTP request, whereas the global timeout is for all HTTP requests made for a single API method.
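Something like this is what I have in mind (names made up), where every individual request gets whatever remains of one overall budget...

import time

class GlobalTimeout(object):
  def __init__(self, seconds):
    self._deadline = time.time() + seconds

  def remaining(self):
    # seconds left in the overall budget, raising once it's exhausted
    remaining = self._deadline - time.time()

    if remaining <= 0:
      raise IOError('global download timeout elapsed')

    return remaining

# per-request usage: urllib2.urlopen(url, timeout = budget.remaining())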
- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
Yup, clarified.
- I don't like get_directory_mirrors() as much, because it does two
things: make a network request and parse it. I'd prefer a method use_v2dirs_as_endpoints(consensus) that takes a consensus document and uses the contained v2dirs as endpoints for future downloads. The documentation could suggest to use this approach to move some load off the directory authorities and to directory mirrors.
Very good point. Changed to a use_directory_mirrors() method, callers can then call get_endpoints() if they're really curious what the present directory mirrors are (which I doubt they often will).
- Related note: I always look if the Dir port is non-zero to decide
whether a relay is a directory. Not sure if there's a difference to looking at the V2Dir flag.
Sounds good. We'll go for that instead.
- All methods starting at get_consensus() should be renamed to fetch_*
or query_* to make it clear that these are no getters but perform actual network requests.
Going with fetch_*.
- All methods starting at get_consensus() could have an additional
parameter for the number of copies (from different directories) to download. The default would be 1. But in some cases people might be interested in having 2 or 3 copies of a descriptor to compare if there are any differences, or to compare download times (more on this below). Also, a special value of -1 could mean to download every requested descriptor from every available directory. That's what I'd do in DocTor to download the consensus from all directory authorities.
- As for download times, is there a way to include download meta data in
the result of get_consensus() and friends? I'd be interested in the directory that a descriptor was downloaded from and in the download time in millis. This is similar to how I'm interested in file meta data in the descriptor reader, like file name or last modified time of the file containing a descriptor.
This sounds really specialized. If callers cared about the download times then that seems best done via something like...
endpoints = ['location1', 'location2'... etc]
for endpoint in endpoints:
  try:
    start_time = time.time()
    downloader.set_endpoints([endpoint])
    downloader.get_consensus()

    print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
  except IOError, exc:
    print "failed to use %s: %s" % (endpoint, exc)
The downside of that approach is that it doesn't make requests in parallel, unless the application parallelizes requests, which I hear isn't trivial in Python. If we can, we should help the application make requests in parallel. Some directories are really slow and can block the application too long if it's only doing one request at a time. (I agree that timeouts can solve that problem to some extent, but a timeout that's chosen too low can also be problematic.)
So, I guess we should either have the API do parallel requests, or describe in a tutorial how to write an application that uses the API to make parallel requests.
By the way, here's an idea how the API could add meta data to descriptors: it could add annotations like "@downloaded-from" and "@downloaded-millis" to descriptors. (Speaking of, is there an easy way to extract the descriptor string without annotations?)
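A tiny sketch of what I mean (the annotation names are just suggestions)...

def annotate(descriptor_text, address, dirport, download_millis):
  # prepend download meta data in the same style as @type annotations
  header = '@downloaded-from %s:%i\n@downloaded-millis %i\n' % (address, dirport, download_millis)
  return header + descriptor_text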
- Can you add a fetch|query_votes(fingerprints) method to request vote
documents?
Added a fetch_vote(authority) to provide an authority's NetworkStatusDocument vote by querying 'http://<hostname>/tor/status-vote/next/authority.z'. However, I'm not clear from the spec how you can query for specific relays (unless you mean fingerprints to be the authority fingerprints).
I meant fingerprints of authorities.
Thanks! Karsten
Hi folks!
Indeed, this would be pretty bad. I'm not convinced that moria1 provides truncated responses though. It could also be that it compresses results for every new request and that compressed responses randomly differ in size, but are still valid compressions of the same input. Kostas, do you want to look more into this and open a ticket if this really turns out to be a bug?
I did check each downloaded file, each was different in size etc., but not all of them were valid, from a shallow look at things (just chucking the file to zlib and seeing what comes out).
Ok, I'll try looking into this. :) Do note that exams etc. are still ongoing, so this will get pushed back; if anybody figures things out earlier, then great!
Tor clients use the ORPort to fetch descriptors. As I understand it the DirPort has been pretty well unused for years, in which case a regression there doesn't seem that surprising. Guess we'll see.
Noted - OK, will see!
Re: python url request parallelization: @Damian: in the past when I wanted to do concurrent urllib requests, I simply used threading.Thread. There might be caveats here; I'm not familiar with the specifics. I can (again, maybe quite a bit later) try cooking something up to see if such a simple parallelization approach would work. (I should probably just try to do it when I have time; maybe it will turn out that some specific solution is needed, and you guys will have solved it by then anyway.)
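Something along these lines is what I had in mind (a rough, untested sketch; the all.z URL is just the one from earlier in the thread)...

import threading
import urllib2

def fetch(url, results, index):
  try:
    results[index] = urllib2.urlopen(url, timeout = 30).read()
  except IOError, exc:
    results[index] = exc

urls = ['http://128.31.0.34:9131/tor/server/all.z']  # one entry per directory
results = [None] * len(urls)
threads = []

for i, url in enumerate(urls):
  thread = threading.Thread(target = fetch, args = (url, results, i))
  thread.start()
  threads.append(thread)

for thread in threads:
  thread.join()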
Cheers Kostas.
Hi Karsten, I've finally finished implementing stem's module for remote descriptor fetching. Its usage is pleasantly simple - see the example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html https://gitweb.torproject.org/stem.git/commitdiff/7f050eb?hp=b6c23b0
The only part of our wiki plans that I regretted needing to drop is a local filesystem cache (the part that made me think twice was figuring out when to invalidate cached resources). Otherwise this turned into a pleasantly slick module.
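(If the cache ever comes back, maybe plain time-based staleness would be enough; a hypothetical helper...)

import os
import time

def is_cache_stale(path, max_age = 3600):
  # treat anything older than an hour (arbitrary) as stale
  return not os.path.exists(path) or time.time() - os.path.getmtime(path) > max_age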
Cheers! -Damian
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
On 7/22/13 5:32 AM, Damian Johnson wrote:
Hi Karsten, I've finally finished implementing stem's module for remote descriptor fetching. Its usage is pleasantly simple
Great stuff!
- see the
example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html
These two links don't work for me for some reason.
https://gitweb.torproject.org/stem.git/commitdiff/7f050eb?hp=b6c23b0
I'll look into this in more detail on Thursday or Friday, or maybe after the dev meeting. I'm excited to see this, but I'm not yet sure when to find some undisturbed place to focus on anything code-related.
The only part of our wiki plans that I regretted needing to drop is a local filesystem cache (the part that made me think twice was figuring out when to invalidate cached resources). Otherwise this turned into a pleasantly slick module.
Okay. I don't think I'll need the local filesystem cache for the two metrics use cases (consensus-health checker and descriptor archiver). So, this could be something to add later if we want, IMHO.
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
The v3 identity is what v3 directory authorities use to sign their votes and consensuses. Here's a better explanation of v3 identity keys:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/attic/v3-authority-howt...
Best, Karsten
- see the
example at the start of its docs...
https://stem.torproject.org/api/descriptor/remote.html https://stem.torproject.org/_modules/stem/descriptor/remote.html
These two links don't work for me for some reason.
Very strange. It didn't work when I just tried clicking them from another system but when I did a full refresh (ctrl+shift+r) it did. Probably browser side caching.
PS. Where does an authority's v3ident come from? Presently I reference users to the values in config.c but that's mostly because I'm confused about what it is and how it differs from their fingerprint.
The v3 identity is what v3 directory authorities use to sign their votes and consensuses. Here's a better explanation of v3 identity keys:
https://gitweb.torproject.org/torspec.git/blob/HEAD:/attic/v3-authority-howt...
I spotted that the v3ident is the same thing as the 'fingerprint' line from the authority key certificates. In my humble opinion this overloaded meaning of a relay fingerprint is confusing, and I'm not clear why we'd reference authorities by the key fingerprint rather than the relay fingerprint. But oh well. If there's anything we can improve in the module pydocs then let me know.
On Tue, May 28, 2013 at 2:50 AM, Damian Johnson <atagar@torproject.org> wrote:
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z
[...]
% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
This seemed peculiar, so I tried it out. Each time I wget all.z from that address, it's always a different one; I guess that's how it should be, but it seems that sometimes not all of it gets downloaded (hence the actually legit zlib error.)
I was able to make it work after my second download attempt (with your exact code); zlib handles it well. So far it's worked every time since.
This is probably not good if the source may sometimes deliver an incomplete stream.
TL;DR try wget'ing multiple times and getting even more puzzled (?)