Hi Damian,
I plan to make a few changes to the bridge descriptor sanitizer to implement changes discussed on this list and in various Trac tickets. The result will be format version 1.0. Here's what will change compared to the current (unversioned) format. Could you check whether stem would be happy with these descriptors, and whether there are other tweaks I should make to make it even happier?
- Bridge network statuses contain a "published" line containing the publication timestamp, so that parsers don't have to learn that timestamp from the file name anymore.
- Bridge network status entries are ordered by hex-encoded fingerprint, not by base64-encoded fingerprint, which is mostly a cosmetic change.
- Server descriptors and extra-info descriptors are stored under the SHA1 hashes of the descriptor identifiers of their non-scrubbed forms. Previously, descriptors were (supposed to be; see #5607) stored under the digests of their scrubbed forms. The reason for hashing digests is to prevent anyone from looking up an existing descriptor from the bridge authority by its non-scrubbed descriptor digest. With this change, we don't have to repair references between statuses, server descriptors, and extra-info descriptors anymore, which turned out to be error-prone (#5608). Server descriptors and extra-info descriptors contain a new "router-digest" line with the hex-formatted descriptor identifier. These lines are necessary because we can no longer calculate the identifier from the sanitized descriptor and because we don't want to rely on the file name.
- Bridge nicknames (#5684) in all descriptor types and dirreq-* statistics lines (#5807) in extra-info descriptors are not sanitized anymore.
- All sanitized bridge descriptors contain @type annotations (#5651).
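A minimal sketch of the digest-hashing step described in the list above. This is an illustration, not the sanitizer's actual code: the function name `sanitized_identifier` is hypothetical, the input digest is made up, and the sketch assumes the SHA1 is taken over the raw 20-byte digest rather than over its hex string.

```python
import hashlib

def sanitized_identifier(original_digest_hex: str) -> str:
    """Return the hex SHA1 of a descriptor digest (assumption: hashed as raw bytes)."""
    raw_digest = bytes.fromhex(original_digest_hex)
    return hashlib.sha1(raw_digest).hexdigest().upper()

# Made-up digest of a non-scrubbed descriptor, for illustration only.
original = "0123456789ABCDEF0123456789ABCDEF01234567"

# Under the assumption above, this value would serve both as the file name
# and as the content of the "router-digest" line.
file_name = sanitized_identifier(original)
```

Because SHA1 is one-way, knowing the published identifier does not reveal the original digest that the bridge authority would accept in a lookup.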
Please let me know what you think about these changes. I plan to start sanitizing descriptors with the described changes tomorrow or the day after and make them available on May 30 (or later if the sanitizing/compressing/uploading takes much longer than expected).
Thanks, Karsten
Hi Karsten.
- Bridge network statuses contain a "published" line
Oh, I didn't realize that there was a consensus that included bridges. Mind explaining where they come from and what they're for? Which category can I find these in on the metrics data page?
I haven't implemented network status entries yet so changes there aren't a concern, though it would be useful for me to have one as an example.
Server descriptors and extra-info descriptors are stored under the SHA1 hashes of the descriptor identifiers of their non-scrubbed forms.
Stem provides its caller with the descriptor's path but doesn't try to do anything with it, so this isn't a concern.
Server descriptors and extra-info descriptors contain a new "router-digest" line with the hex-formatted descriptor identifier.
Not following. Is this new 'router-digest' entry only in the bridge descriptors? Is it a bridge equivalent for a relay server descriptor's 'fingerprint' field? Again, an example of the new descriptors would be nice to have.
Bridge nicknames (#5684) in all descriptor types
Minor tweak for the is_scrubbed() method, but that's all.
... and dirreq-* statistics lines (#5807) in extra-info descriptors are not sanitized anymore.
I didn't realize that bridge extrainfo descriptors _were_ sanitized. What section of the format page details the scrubbing for those?
I've never tried running the stem parser over a bridge extrainfo descriptor, so again an example would be useful. :)
Cheers! -Damian
Hi Damian,
On 5/21/12 5:55 PM, Damian Johnson wrote:
Hi Karsten.
- Bridge network statuses contain a "published" line
Oh, I didn't realize that there was a consensus that included bridges. Mind explaining where they come from and what they're for?
The bridge authority generates a bridge network status that it copies to the BridgeDB host and to the metrics server twice per hour. The bridge network status contains (relay) flags like Running and Stable that BridgeDB uses to decide which bridges to give out. Metrics uses the bridge network status to graph the number of running bridges.
Which category can I find these in on the metrics data page?
The bridge descriptor tarballs contain bridge network statuses, server descriptors, and extra-info descriptors. See:
https://metrics.torproject.org/data.html#bridgedesc
I haven't implemented network status entries yet so changes there aren't a concern, though it would be useful for me to have one as an example.
You'll find an example here:
https://metrics.torproject.org/formats.html#bridgedesc
(I'll also include an example of the suggested format below.)
Server descriptors and extra-info descriptors are stored under the SHA1 hashes of the descriptor identifiers of their non-scrubbed forms.
Stem provides its caller with the descriptor's path but doesn't try to do anything with it, so this isn't a concern.
Okay.
Server descriptors and extra-info descriptors contain a new "router-digest" line with the hex-formatted descriptor identifier.
Not following. Is this new 'router-digest' entry only in the bridge descriptors?
Yes, it would be only in bridge server descriptors and in bridge extra-info descriptors. For relay descriptors, you'd determine the descriptor identifier by calculating the SHA1 of "router [...]\nrouter-signature\n" or "extra-info [...]\nrouter-signature\n". This wouldn't be possible with bridge descriptors anymore, because we'd change some lines or line parts in the sanitizing process. Therefore the extra "router-digest" line.
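The relay-descriptor calculation mentioned above could be sketched roughly as follows. The helper name `relay_descriptor_digest` and the toy descriptor are made up for illustration; a real descriptor has many more lines and a real signature block.

```python
import hashlib
import re

def relay_descriptor_digest(descriptor: str, first_keyword: str = "router") -> str:
    """Hex SHA1 over the text from the first keyword line through "router-signature\n"."""
    # The digested span starts at the line beginning with the first keyword ...
    start = re.search(r"^" + re.escape(first_keyword) + r" ", descriptor, re.M).start()
    # ... and ends right after the "router-signature" line, before the signature itself.
    end_marker = "\nrouter-signature\n"
    end = descriptor.index(end_marker, start) + len(end_marker)
    return hashlib.sha1(descriptor[start:end].encode("ascii")).hexdigest().upper()

# Toy descriptor with made-up content.
toy = (
    "router example 192.0.2.1 9001 0 0\n"
    "published 2012-04-15 22:04:22\n"
    "router-signature\n"
    "-----BEGIN SIGNATURE-----\n"
    "...\n"
    "-----END SIGNATURE-----\n"
)
digest = relay_descriptor_digest(toy)
```

Passing `first_keyword="extra-info"` covers the extra-info case. Any change to a line inside the digested span, such as a replaced IP address, yields a different digest, which is why sanitized descriptors cannot be checked this way.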
Is it a bridge equivalent for a relay server descriptor's 'fingerprint' field?
No, the fingerprint is the identity key digest, whereas the descriptor identifier is the descriptor digest.
Again, an example of the new descriptors would be nice to have.
Sure. The bridge network status entry below references the server descriptor via AG/Za6N (base64) = 006FD96B (hex), which in turn references the extra-info descriptor via 068A2E28.
@type bridge-network-status 1.0
published 2012-04-16 11:37:05
[...]
r ididnteditheconfig Pp+Rv3MgCzkgeoaIx4uHnGaz0Yo AG/Za6Ned4Wmo7i3X+LiQ1oTvbQ 2012-04-15 22:04:22 10.32.143.78 40187 0
s Fast Guard Running Stable Valid
w Bandwidth=55
p reject 1-65535
[...]

@type bridge-server-descriptor 1.0
router ididnteditheconfig 10.32.143.78 40187 0 0
platform Tor 0.2.2.35 (git-73ff13ab3cc9570d) on Linux x86_64
opt protocols Link 1 2 Circuit 1
published 2012-04-15 22:04:22
opt fingerprint 3E9F 91BF 7320 0B39 207A 8688 C78B 879C 66B3 D18A
uptime 2
bandwidth 204800 204800 55794
opt extra-info-digest 068A2E28D4C934D9490303B7A645BA068DCA0504
opt hidden-service-dir
reject *:*
router-digest 006FD96BA35E7785A6A3B8B75FE2E2435A13BDB4

@type bridge-extra-info 1.0
extra-info ididnteditheconfig 3E9F91BF73200B39207A8688C78B879C66B3D18A
published 2012-04-15 22:04:22
[...]
router-digest 068A2E28D4C934D9490303B7A645BA068DCA0504
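As a rough sketch, the references in the example above can be followed like this. The helper names `base64_digest_to_hex` and `keyword_value` are hypothetical and not part of stem; the descriptor strings are abbreviated copies of the example.

```python
import base64

def base64_digest_to_hex(digest_b64: str) -> str:
    """Convert an unpadded base64 digest from an "r" line to upper-case hex."""
    padded = digest_b64 + "=" * (-len(digest_b64) % 4)
    return base64.b64decode(padded).hex().upper()

def keyword_value(descriptor: str, keyword: str) -> str:
    """Return the rest of the first line starting with keyword, ignoring a leading "opt "."""
    for line in descriptor.splitlines():
        if line.startswith("opt "):
            line = line[len("opt "):]
        if line.startswith(keyword + " "):
            return line[len(keyword) + 1:]
    raise KeyError(keyword)

status_r_line = ("r ididnteditheconfig Pp+Rv3MgCzkgeoaIx4uHnGaz0Yo "
                 "AG/Za6Ned4Wmo7i3X+LiQ1oTvbQ 2012-04-15 22:04:22 10.32.143.78 40187 0")
server_descriptor = (
    "router ididnteditheconfig 10.32.143.78 40187 0 0\n"
    "opt extra-info-digest 068A2E28D4C934D9490303B7A645BA068DCA0504\n"
    "router-digest 006FD96BA35E7785A6A3B8B75FE2E2435A13BDB4\n"
)
extra_info = (
    "extra-info ididnteditheconfig 3E9F91BF73200B39207A8688C78B879C66B3D18A\n"
    "router-digest 068A2E28D4C934D9490303B7A645BA068DCA0504\n"
)

fields = status_r_line.split(" ")
status_fingerprint = base64_digest_to_hex(fields[2])        # third field: fingerprint
status_descriptor_digest = base64_digest_to_hex(fields[3])  # fourth field: descriptor digest

# status entry -> server descriptor -> extra-info descriptor
links_ok = (
    status_descriptor_digest == keyword_value(server_descriptor, "router-digest")
    and keyword_value(server_descriptor, "extra-info-digest")
    == keyword_value(extra_info, "router-digest")
)
```

With the "router-digest" lines in place, a parser can match the three documents without recomputing any hash and without looking at file names.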
Bridge nicknames (#5684) in all descriptor types
Minor tweak for the is_scrubbed() method, but that's all.
Great.
... and dirreq-* statistics lines (#5807) in extra-info descriptors are not sanitized anymore.
I didn't realize that bridge extrainfo descriptors _were_ sanitized. What section of the format page details the scrubbing for those?
Aha, good catch, that's not mentioned on the format page. Right now, dirreq-*, cell-*, and exit-* lines are completely removed. #5807 is about leaving dirreq-* lines in. I'll update the format page next week when the new tarballs are available.
I've never tried running the stem parser over a bridge extrainfo descriptor, so again an example would be useful. :)
Plenty of examples available, e.g.,
https://metrics.torproject.org/data/bridge-descriptors-2012-05.tar.bz2
Thanks, Karsten
On 5/21/12 7:19 PM, Karsten Loesing wrote:
On 5/21/12 5:55 PM, Damian Johnson wrote:
I didn't realize that bridge extrainfo descriptors _were_ sanitized. What section of the format page details the scrubbing for those?
Aha, good catch, that's not mentioned on the format page. Right now, dirreq-*, cell-*, and exit-* lines are completely removed. #5807 is about leaving dirreq-* lines in. I'll update the format page next week when the new tarballs are available.
After thinking more about it, I came to the conclusion that we should stop sanitizing *-stats lines at all. As you pointed out, we never said that we'd sanitize them, so I tried to draft a sentence or two why we remove cell-* and exit-* lines. But I failed to come up with a good reason. Removing those lines doesn't hide bridge locations any better than leaving them in.
As a result, the only thing that's sanitized in extra-info descriptors is the bridge fingerprint, similar to how it's sanitized in server descriptors.
Best, Karsten
Thanks, Karsten!
The bridge descriptor tarballs contain bridge network statuses, server descriptors, and extra-info descriptors. See:
Oops, I read 'contain similar documents as the relay descriptor archives' as being server descriptors. Maybe in this first sentence it should explicitly say that it's a bundled batch of network status, server descriptors, and extra-info descriptors?
You'll find an example here:
https://metrics.torproject.org/formats.html#bridgedesc
(I'll also include an example of the suggested format below.)
Oops again. Didn't figure that we'd use the same scrubbing description for both. Personally I'd find it more intuitive if we had separate sections for both, though I see why you did it this way.
No, the fingerprint is the identity key digest, whereas the descriptor identifier is the descriptor digest.
Gotcha. Added support for the router-digest lines and flagged them as being required for bridge server descriptors... https://gitweb.torproject.org/stem.git/commitdiff/e7e03d2f61d6dcc7bc5e5ad4de...
Minor tweak for the is_scrubbed() method, but that's all.
Great.
Changed... https://gitweb.torproject.org/stem.git/commitdiff/f7fb726cc3dea8bfd294833b15...
After thinking more about it, I came to the conclusion that we should stop sanitizing *-stats lines at all.
In that case the 'router-signature' lines are the only ones being scrubbed out of bridge extra-info descriptors, right? If so then we don't need a 'router-digest' here since the digest can be calculated from the (now unscrubbed) content - right?
Cheers! -Damian
Hi Damian,
On 5/23/12 7:27 PM, Damian Johnson wrote:
The bridge descriptor tarballs contain bridge network statuses, server descriptors, and extra-info descriptors. See:
Oops, I read 'contain similar documents as the relay descriptor archives' as being server descriptors. Maybe in this first sentence it should explicitly say that it's a bundled batch of network status, server descriptors, and extra-info descriptors?
I tweaked the paragraph a bit. Please feel free to edit it more and send me a patch.
https://gitweb.torproject.org/metrics-web.git/commitdiff/3dbf9ae
You'll find an example here:
https://metrics.torproject.org/formats.html#bridgedesc
(I'll also include an example of the suggested format below.)
Oops again. Didn't figure that we'd use the same scrubbing description for both. Personally I'd find it more intuitive if we had separate sections for both, though I see why you did it this way.
It's probably a matter of taste. Organizing the description by descriptor type would mean we'd repeat a few things. For example, we replace bridge identities in all three descriptor types and IP addresses in two of them (where the third type doesn't contain the bridge IP address). I think it's easier to list the changes made to all descriptor types.
After thinking more about it, I came to the conclusion that we should stop sanitizing *-stats lines at all.
In that case the 'router-signature' lines are the only ones being scrubbed out of bridge extra-info descriptors, right? If so then we don't need a 'router-digest' here since the digest can be calculated from the (now unscrubbed) content - right?
No, the extra-info descriptors contain hashed bridge fingerprints, not the original ones. That's why we need the "router-digest" line.
Best, Karsten
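Karsten's last point can be illustrated with a small sketch. It assumes, for illustration only, that sanitizing replaces the fingerprint on the "extra-info" line with the SHA1 of the original fingerprint bytes; the helper names and the descriptor contents are made up.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest().upper()

def scrub_fingerprint(extra_info: str) -> str:
    """Replace the fingerprint on the "extra-info" line with its hash (assumed scheme)."""
    lines = extra_info.split("\n")
    keyword, nickname, fingerprint = lines[0].split(" ")
    hashed = sha1_hex(bytes.fromhex(fingerprint))
    lines[0] = " ".join([keyword, nickname, hashed])
    return "\n".join(lines)

# Made-up, heavily shortened extra-info descriptor (signature block omitted).
original = (
    "extra-info example 0123456789ABCDEF0123456789ABCDEF01234567\n"
    "published 2012-04-15 22:04:22\n"
    "router-signature\n"
)
scrubbed = scrub_fingerprint(original)

# Digest the bridge authority knows vs. what a reader of the scrubbed text would compute.
original_digest = sha1_hex(original.encode("ascii"))
recomputed_digest = sha1_hex(scrubbed.encode("ascii"))
```

Only the first line differs between the two texts, yet the two digests no longer match, so the identifier has to be carried in a "router-digest" line instead of being recomputed.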