Filename: 285-utf-8.txt
Title: Directory documents should be standardized as UTF-8
Author: Nick Mathewson
Created: 13 November 2017
Status: Open
1. Summary and motivation
People frequently want to include non-ASCII text in
their router
descriptors. The Contact line is a favorite place to do
this, but in
principle the platform line would also be pretty
logical.
Unfortunately, there's no specified way to encode
non-ASCII in our
directory documents.
Fortunately, almost everybody who does it, uses UTF-8
anyway.
As we move towards Rust support in Tor, we gain another
motivation
for standarding on UTF-8, since Rust's native strings
strongly prefer
UTF-8.
So, in this proposal, we describe a migration path to
having all
directory documents be fully UTF-8.
2. Proposal
First, we should have Tor relays reject ContactInfo
lines (and any
other lines copied directly into router descriptors)
that are not
UTF-8.
At the same time, we should have authorities reject any
router
descriptors or extrainfo documents that are not valid
UTF-8.
Simultaneously, we can have all Tor instances reject all
non-directory-descriptor directory documents that are
not UTF-8,
since none should exist today.
Finally, once the authorities have updated, we should
have all Tor
instances reject all directory documents that are not
UTF-8. (We
should not take this step until the authorities have
upgraded, or
else the behavior of updated and non-updated clients
could be
distinguished.)
2.1. Hidden service descriptors' encrypted bodies
For the encrypted bodies of hidden service descriptors,
we cannot
reject them at the authority level, and so we need to
take a slightly
different approach to prevent client fingerprinting
attacks.
First, we should make Tor instances start warning about
any hidden
service descriptors whose bodies, post-decryption,
contain non-utf-8
plaintext. At the same time, we add a consensus
parameter to
indicate that hidden service descriptors with non-utf-8
plantexts
should be rejected entirely:
"reject-encrypted-non-utf-8". If that
parameter is set to 1, then hidden service clients will
not only
warn, but reject the descriptors.
Once the vast majority of clients are running versions
that support
the "reject-encrypted-non-utf-8" parameter, that
parameter can be set
to 1.