Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
1. What passes for "canonicalised" "utf-8" in C will be different to
what passes for "canonicalised" "utf-8" in Rust. In C, the
following will not be allowed (whereas they are allowed in Rust):
- NUL (0x00)
- Byte Order Mark (0xFEFF)
Much of the metrics software is written in Java. Java strings allow for
NUL to appear, but assume that there is no BOM. If a BOM appears, then
this would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and then carry on parsing as if it
never happened?
Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: the descriptor
is seen as truncated, so it won't parse.)
We should specify rejection for BOM as well.
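To make the "reject" option concrete, here is a minimal sketch of a check on the raw bytes before any parsing happens. The names `DocumentRejected` and `check_document_bytes` are made up for this example, and it assumes the document arrives as UTF-8 encoded bytes and that a NUL or BOM anywhere in the document (not only at the start) is grounds for rejection, which the proposal would still need to spell out.

```python
import codecs


class DocumentRejected(ValueError):
    """Hypothetical error type for this sketch."""


def check_document_bytes(raw: bytes) -> str:
    """Return the decoded document, or raise DocumentRejected."""
    # Reject NUL anywhere, mirroring the truncation behaviour of the C parser.
    if b"\x00" in raw:
        raise DocumentRejected("document contains NUL")
    # U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
    if codecs.BOM_UTF8 in raw:
        raise DocumentRejected("document contains a BOM")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise DocumentRejected("document is not valid UTF-8") from exc
```

Rejecting on the bytes rather than on the decoded string keeps the check independent of how a given language's decoder treats a BOM.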
2. Directory document keywords MUST be printable ASCII.
This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?
If parsers want to be consistent with the Tor implementation, they should
reject.
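A rough sketch of what such a keyword check could look like follows. It is not taken from any existing parser: the function names are hypothetical, "printable ASCII" is assumed to mean the range 0x21 to 0x7E, and the first space-delimited token of every non-empty line is treated as the keyword, which ignores multi-line objects.

```python
from typing import Optional


def keyword_is_printable_ascii(keyword: str) -> bool:
    # Assumption: "printable ASCII" means 0x21..0x7E (no space, no controls).
    return len(keyword) > 0 and all(0x21 <= ord(ch) <= 0x7E for ch in keyword)


def first_bad_keyword(document: str) -> Optional[str]:
    """Return the first keyword that is not printable ASCII, or None."""
    # Split on "\n" only; str.splitlines() would also split on other
    # separators such as U+2028.
    for line in document.split("\n"):
        if not line:
            continue
        # Simplification: the keyword is everything before the first space.
        keyword = line.split(" ", 1)[0]
        if not keyword_is_printable_ascii(keyword):
            return keyword
    return None
```

A parser following the "reject" behaviour would drop the whole document as soon as `first_bad_keyword` returns something other than `None`.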
I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.
+1
3. This change may break some descriptor/consensus/document parsers.
If you are the maintainer of a parser, you may want to start
thinking about this now.
For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers have probably
got a good understanding of unicode/str/bytes by now. (In Python 3: when
using UTF-8, BOM will not be stripped and will be interpreted as data,
and you can have a NUL in a str).
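The Python 3 behaviour can be seen with a few lines. The sample bytes here are made up for illustration; the only library feature used is the standard "utf-8" and "utf-8-sig" codecs.

```python
# Made-up sample: a UTF-8 BOM, some text, and an embedded NUL.
raw = b"\xef\xbb\xbfrouter example\x00\n"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
text = raw.decode("utf-8")
has_bom = text.startswith("\ufeff")
has_nul = "\x00" in text

# The "utf-8-sig" codec strips a leading BOM, but the NUL is still there.
stripped = raw.decode("utf-8-sig")
```

So a Python parser that decodes with plain "utf-8" sees the BOM as data at the start of the first line, and has to check for NUL itself either way.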
Python for txtorcon
Rust for Tor's experimental protover implementation
And perhaps others: