Re: [tor-dev] Proposal 285: Directory documents should be standardized as UTF-8

9 Jan 2018

Hi, Teor, and sorry for the long delay!  You had a lot of good
questions on this proposal, and I didn't know how to answer them all.
So in hopes of making progress here, I'm taking wild guesses and
asking for help in making the wild guesses better :)

On Mon, Nov 13, 2017 at 5:28 PM, teor <teor2345@gmail.com> wrote:
...
On 14 Nov 2017, at 05:51, Nick Mathewson <nickm@torproject.org> wrote:
Filename: 285-utf-8.txt
Title: Directory documents should be standardized as UTF-8
Author: Nick Mathewson
Created: 13 November 2017
Status: Open

Summary and motivation
People frequently want to include non-ASCII text in their router
descriptors.  The Contact line is a favorite place to do this, but in
principle the platform line would also be pretty logical.
Unfortunately, there's no specified way to encode non-ASCII in our
directory documents.
Fortunately, almost everybody who does it uses UTF-8 anyway.

How many current descriptors will be rejected as non-UTF-8?
I think that when last I checked, the number was something like 3.
...
As we move towards Rust support in Tor, we gain another motivation
   for standardizing on UTF-8, since Rust's native strings strongly prefer
   UTF-8.
So, in this proposal, we describe a migration path to having all
   directory documents be fully UTF-8.

Proposal
First, we should have Tor relays reject ContactInfo lines (and any
other lines copied directly into router descriptors) that are not
UTF-8.

How do we define UTF-8?
I tried to do so as follows:
We define the allowable set of UTF-8 as:
        * Encoding the codepoints U+01 through U+10FFFF,
        * but excluding the codepoints U+D800 through U+DFFF,
        * each encoded with the shortest possible encoding,
        * without any BOM.
Are there other restrictions we should make?  If so, how should we phrase them?
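
As a rough illustration of the rules listed above, here is a minimal
Python sketch.  The function name is hypothetical, and it assumes that
"without any BOM" means the input must not begin with the encoded
U+FEFF.  It leans on Python's strict UTF-8 decoder, which already
rejects overlong encodings, surrogate codepoints, and values above
U+10FFFF; it is not what Tor itself does.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def is_allowed_utf8(data: bytes) -> bool:
    """Hypothetical check for the restricted UTF-8 set sketched above."""
    # U+0000 is outside the allowed range U+01..U+10FFFF.
    if b"\x00" in data:
        return False
    # Assumption: "no BOM" means no leading byte order mark.
    if data.startswith(BOM):
        return False
    try:
        # Strict decoding fails on overlong forms, surrogates
        # (U+D800..U+DFFF), and anything beyond U+10FFFF.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, is_allowed_utf8("Zoë".encode("utf-8")) is True, while the
overlong sequence b"\xc0\xaf" and the encoded surrogate b"\xed\xa0\x80"
are both rejected.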
[...]
...
How do we carry forward existing ASCII restrictions into UTF-8?
I don't understand this question.
...
We will need to update the directory spec to acknowledge that
contact and platform lines may be parsed as UTF-8 or
ASCII-including-arbitrary-bytes-except-NUL, and that they are
terminated by single-byte newlines regardless.
Ack.
...
How do we deal with format confusion attacks?
UTF-8 has a few alternative whitespace characters. These could
be used in an attack that confuses either humans viewing the file,
or automated software:
If a human uses a UTF-8 compatible viewer or editor, it likely shows
Unicode newlines and ASCII newlines in an identical way. Similarly,
it may show Unicode spaces and ASCII spaces in the same way.
This may confuse the human reader.
Right.  I don't see an obvious attack here, but we should keep it in mind.
Do you have a different suggestion of what to do here?
...
Similarly, if automated software parses using a Unicode whitespace
or newline character class, it will mis-parse directory documents.
(Our Rust protover code looks for ASCII spaces, so it appears to
be fine.)
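
To make the mis-parsing concern concrete, here is a small Python
sketch comparing a byte-level parser that only treats ASCII space and
ASCII newline as delimiters with one that uses Unicode-aware
splitting.  The function names and the sample line are made up for
illustration; neither function reflects Tor's actual parser.

```python
def parse_ascii_delimited(doc: bytes) -> list:
    """Split on the single bytes 0x0A and 0x20 only."""
    return [line.split(b" ") for line in doc.split(b"\n")]


def parse_unicode_whitespace(doc: bytes) -> list:
    """Split using Unicode line-break and whitespace classes."""
    text = doc.decode("utf-8")
    return [line.split() for line in text.splitlines()]


# Hypothetical input: U+00A0 (no-break space) and U+2028 (line separator)
# are embedded inside what is a single line at the byte level.
sample = "contact Alice\u00a0Example\u2028router Fake".encode("utf-8")

ascii_view = parse_ascii_delimited(sample)
unicode_view = parse_unicode_whitespace(sample)
```

With this input the byte-level parser sees one line with three
tokens, while the Unicode-aware parser sees two lines, the second of
which appears to start with a "router" keyword.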
Note that we already have this issue with line feeds and carriage
returns, which I thought we had solved by banning carriage returns
in directory documents. But it appears we allow "any printing ASCII
character". (We will have to edit this to include Unicode.)
Also let's consider all the nonprinting ASCII: it's already a
potential display problem if you're using a bad editor, or whatever.
...
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218
At the same time, we should have authorities reject any router
   descriptors or extrainfo documents that are not valid UTF-8.
   Simultaneously, we can have all Tor instances reject all
   non-directory-descriptor directory documents that are not UTF-8,
   since none should exist today.
If we apply the existing restrictions in dir-spec, which require
non-directory-descriptor directory documents to be ASCII, they will
also be UTF-8.
Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
Do we expect to migrate these to non-ASCII UTF-8 at some point?
I think having non-ASCII in extrainfos is a reasonable possibility.
I'm not so sure about the others: there could be reasons in the
future.
My rationale for declaring everything to be UTF-8 was that it seemed
more reasonable to have a single set of rules for parsing everything
than to have different rules for different documents.
...
Also, does "non-directory-descriptor directory documents" mean we
can reject non-UTF-8 microdescriptors? I think we should.
I think so.
...
Does the NS consensus contain any lines that are copied verbatim from
descriptors?
I don't think so.
[...]
...
should be rejected entirely: "reject-encrypted-non-utf-8".  If that
   parameter is set to 1, then hidden service clients will not only
   warn, but reject the descriptors.
Once the vast majority of clients are running versions that support
   the "reject-encrypted-non-utf-8" parameter, that parameter can be set
   to 1.
We also can't reject bridge descriptors at the authority level.
(Bridge clients download bridge descriptors directly from bridges.)
Do we need bridge clients to also use this consensus parameter?
I added an extra section for this, basically saying "bridge clients
should do that too":
2.2. Bridge descriptors
Since clients download bridge descriptors directly from the bridges, they
   also need a two-phase plan as for hidden service descriptors above.  Here
   we take the same approach as in section 2.1 above, except using the
   parameter "reject-bridge-descriptor-non-utf-8".
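
The two-phase behaviour described for both parameters could be
sketched roughly as follows.  This is only an illustration in Python:
the function name, the dictionary representation of consensus
parameters, and the default of 0 when a parameter is absent are all
assumptions, and plain strict UTF-8 decoding stands in for whatever
validity rules are finally chosen.

```python
def check_descriptor(body: bytes, consensus_params: dict, param_name: str) -> str:
    """Return "accept", "warn", or "reject" for a downloaded descriptor.

    param_name would be "reject-encrypted-non-utf-8" for hidden service
    descriptors or "reject-bridge-descriptor-non-utf-8" for bridges.
    """
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        # Phase two: the parameter is set to 1, so refuse the descriptor.
        if consensus_params.get(param_name, 0) == 1:
            return "reject"
        # Phase one: only warn (assumed default when the parameter is unset).
        return "warn"
    return "accept"
```

A client would call this with the parameter matching the kind of
descriptor it just fetched, so the switch from warning to rejecting
can happen separately for hidden services and bridges.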
