Hello everyone, I'm a long time user of Tor, first time poster here.
Over the last few months, I have been working on a light weight C++ client only implementation of the Tor protocol, intended to be used as an embedded library in other applications. It is now at the stage where it can complete bootstrapping, build circuits, as well as connect to services / host hidden services (v3). I have been building this primarily off the spec documents (which have generally been extremely helpful), and well as the assistance of stepping through the official Tor implementation when needed for troubleshooting and to confirm specifics. Before I release this to the wider world, I'd like to confirm a few points that may not be explicitly stated in the specs.
1. In general, what are the things to look out for when implementing the Tor protocol beyond "making it work" and validating all data (signatures, timestamps, etc)? One thing I'm concerned about is the risk of fingerprinting where the spec does not completely specify behaviour, e.g. the order in which link specifiers are passed in an extend cell, exact criteria for when circuits are explicitly destroyed etc. (I'm very excited to see the proposals around CBOR on this point which would help greatly with knowing that a canonical data representation was used).
2. When it comes to bootstrapping, the official implementation appears to favour accessing directories via plaintext HTTP rather than connecting on the OR port and using create fast / begin dir. What is the motivation for using the plaintext option (and for that matter, having a plaintext http service open at all)?. While the OR will learn just as much about the client regardless, it seems like the default plaintext access to directory information unnecessarily gives away details of how clients engage with the Tor network to third parties.
3. When using bridges and in particular pluggable transports, how is the client intended to safely bootstrap in the cold start case where it does not know up front which bridge/relay it will be connected to (e.g. when using Snowflake)? The RSA identity can be accepted in blind faith based on the Tor handshake, and it's then possible to get the full details with create fast / begin dir, but how does a client know that it has been connected to a bridge that is "blessed" by the Tor network rather than a MITM actor?
4. Finally, if anyone reading has been involved with or close to the development of other unofficial Tor implementations, what are the lessons learned on this front? I'm aware of among others Orchid (updated last in 2016), node-Tor (does not implement ECC) and torpy (does not implement hidden services v3). What makes these fail / stall?
Many thanks, P