Hi,
On 12/02/18 23:55, isis agora lovecruft wrote:
- What passes for "canonicalised" "utf-8" in C will be different to what passes for "canonicalised" "utf-8" in Rust. In C, the following will not be allowed (whereas they are allowed in Rust):
  - NUL (0x00)
  - Byte Order Mark (0xFEFF)
Much of the metrics software is written in Java. Java strings allow NUL to appear, but assume that there is no BOM. If a BOM appears, it would be interpreted as data and, I assume, parsing would probably fail. Should the whole document be rejected if it contains a NUL or BOM, or should these values be stripped and parsing carry on as if they had never been there?
- Directory document keywords MUST be printable ASCII.
This can be validated. Should a single document keyword containing printable non-ASCII characters be enough to reject the whole document, or should a parser try to recover?
I'd really like to see a section in the proposal about how parsers should react when they find something unexpected. Otherwise, the parsers may all end up doing different things.
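
To make the two options concrete, here is a rough sketch of what "reject" versus "strip and carry on" could look like. Everything in it is hypothetical and not taken from the proposal: the names (DocumentRejected, decode_document, bad_keywords), the reading of "printable ASCII" as 0x21 to 0x7E, and the simplification that a keyword is the first space-delimited token on a line.

```python
BOM = "\ufeff"
NUL = "\x00"


class DocumentRejected(ValueError):
    """Raised in strict mode when a document breaks one of the rules."""


def is_printable_ascii(token: str) -> bool:
    # Assumption: "printable ASCII" means 0x21..0x7E (no space, no controls).
    return len(token) > 0 and all(0x21 <= ord(ch) <= 0x7E for ch in token)


def decode_document(raw: bytes, strict: bool = True) -> str:
    # Invalid UTF-8 raises UnicodeDecodeError in either mode.
    text = raw.decode("utf-8")
    if BOM in text or NUL in text:
        if strict:
            # Option 1: reject the whole document.
            raise DocumentRejected("document contains NUL or BOM")
        # Option 2: strip the offending values and carry on.
        text = text.replace(BOM, "").replace(NUL, "")
    return text


def bad_keywords(text: str) -> list:
    # Simplification: treat the first space-delimited token of each
    # non-empty line as the keyword.
    bad = []
    for line in text.split("\n"):
        if not line:
            continue
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            bad.append(keyword)
    return bad
```

Whichever behaviour the proposal picks, the point is that the choice sits in one place (the strict flag here) and that every parser makes the same one.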
- This change may break some descriptor/consensus/document parsers. If you are the maintainer of a parser, you may want to start thinking about this now.
For the metrics tools there are some guidelines on this we can follow: https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other language would be Python (for stem), but Python developers have probably got a good understanding of unicode/str/bytes by now. (In Python 3, when decoding with the utf-8 codec, a BOM is not stripped and is interpreted as data, and a str can contain a NUL.)
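
The Python 3 behaviour is easy to check. A small sketch, where the sample bytes are made up for illustration and "utf-8-sig" is the standard library codec that does strip a leading BOM:

```python
# Made-up sample: UTF-8 encoded BOM, some text, then a NUL byte.
raw = b"\xef\xbb\xbfrouter example\x00\n"

plain = raw.decode("utf-8")      # BOM is kept as the character U+FEFF
sig = raw.decode("utf-8-sig")    # a leading BOM is stripped

# With plain utf-8 the BOM is the first character of the str.
assert plain[0] == "\ufeff"
assert not plain.startswith("router")

# With utf-8-sig the text starts where you would expect.
assert sig.startswith("router")

# NUL is an ordinary character in a str under both codecs.
assert "\x00" in plain and "\x00" in sig
```

So a parser that decodes with plain utf-8 and then matches on the first keyword would see the BOM glued to the front of it.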
Thanks, Iain.