Hi Damian,
I plan to make a few changes to the bridge descriptor sanitizer to implement changes discussed on this list and in various Trac tickets. The result will be format version 1.0. Here's what will change compared to the current (unversioned) format. Could you check whether stem would be happy with these descriptors, and whether there are other tweaks I should make to make it even happier?
- Bridge network statuses contain a "published" line with the publication timestamp, so that parsers don't have to learn that timestamp from the file name anymore.
- Bridge network status entries are ordered by hex-encoded fingerprint, not by base64-encoded fingerprint, which is mostly a cosmetic change.
- Server descriptors and extra-info descriptors are stored under the SHA1 hashes of the descriptor identifiers of their non-scrubbed forms. Previously, descriptors were (supposed to be; see #5607) stored under the digests of their scrubbed forms. The reason for hashing the digests is to prevent anyone from looking up an existing descriptor from the bridge authority by its non-scrubbed descriptor digest. With this change, we no longer have to repair references between statuses, server descriptors, and extra-info descriptors, which turned out to be error-prone (#5608). Server descriptors and extra-info descriptors contain a new "router-digest" line with the hex-formatted descriptor identifier. These lines are necessary because the identifier can no longer be calculated from the sanitized descriptor and because we don't want to rely on the file name.
- Bridge nicknames (#5684) in all descriptor types and dirreq-* statistics lines (#5807) in extra-info descriptors are not sanitized anymore.
- All sanitized bridge descriptors contain @type annotations (#5651).
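To make the identifier and ordering changes above concrete, here is a minimal Python sketch. It is not the sanitizer's actual code and does not use stem; all function names are made up for illustration. It also assumes that the SHA1 is taken over the raw 20 digest bytes (not over the hex string) and that identifiers are written as upper-case hex, neither of which is spelled out above.

```python
import base64
import hashlib


def sanitized_identifier(original_digest_hex):
    """Return SHA1(original descriptor digest) as upper-case hex.

    Assumption: the hash is computed over the 20 raw digest bytes.
    """
    raw = bytes.fromhex(original_digest_hex)
    if len(raw) != 20:
        raise ValueError("expected a 20-byte SHA1 digest")
    return hashlib.sha1(raw).hexdigest().upper()


def fingerprint_hex_from_base64(b64_fingerprint):
    """Decode a base64 fingerprint with stripped padding to upper-case hex."""
    # Restore the '=' padding that status entries typically omit.
    padded = b64_fingerprint + "=" * (-len(b64_fingerprint) % 4)
    return base64.b64decode(padded).hex().upper()


def sort_by_hex_fingerprint(b64_fingerprints):
    """Order status entries by hex-encoded rather than base64-encoded fingerprint."""
    return sorted(b64_fingerprints, key=fingerprint_hex_from_base64)
```

Under these assumptions, a parser would read the identifier from the "router-digest" line instead of recomputing it, and sorting by the hex form can give a different order than sorting the base64 strings directly, because the base64 alphabet is not in ASCII order.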
Please let me know what you think about these changes. I plan to start sanitizing descriptors with the described changes tomorrow or the day after and make them available on May 30 (or later if the sanitizing/compressing/uploading takes much longer than expected).
Thanks, Karsten