On Tuesday, June 23, 2020 7:34 PM, Nick Mathewson nickm@torproject.org wrote:
On Fri, Jun 19, 2020 at 3:59 PM The Paranoia Project info@paranoia.tools wrote:
Hello everyone, I'm a long-time user of Tor, first-time poster here. Over the last few months, I have been working on a lightweight, client-only C++ implementation of the Tor protocol, intended to be used as an embedded library in other applications. It is now at the stage where it can complete bootstrapping, build circuits, and connect to services as well as host hidden services (v3). I have been building this primarily off the spec documents (which have generally been extremely helpful), as well as stepping through the official Tor implementation when needed for troubleshooting and to confirm specifics. Before I release this to the wider world, I'd like to confirm a few points that may not be explicitly stated in the specs.
Hello, P! It'll be neat to have a look at this when it comes out; building a Tor implementation is a lot of work.
Many thanks for your reply! The intent is to release this as open source once reasonably stable, and I would be very happy to get more eyes on it when it is ready.
I started out thinking I was making good progress with the core binary protocol and crypto primitives, but quickly found out that there are many layers to doing this well. The directory protocol formats were a whole thing in themselves, not to mention path selection, the DHT for hidden services, etc. As it stands, I have around 20k LOC of C++ and an implementation that functionally performs the basics up to using hidden services (as a client, or hosting one), but I'm sure there's a lot more to come to make it stable and network-friendly, as you mention below.
- In general, what are the things to look out for when implementing the Tor protocol beyond "making it work" and validating all data (signatures, timestamps, etc)? One thing I'm concerned about is the risk of fingerprinting where the spec does not completely specify behaviour, e.g. the order in which link specifiers are passed in an extend cell, exact criteria for when circuits are explicitly destroyed etc. (I'm very excited to see the proposals around CBOR on this point which would help greatly with knowing that a canonical data representation was used).
There are a lot of these, I'm afraid, and they're not all perfectly documented. Some of the trickier ones are about "being kind to the network" -- not making too many circuits, not over-using resources when idle, and so on. These are under-documented, but I believe Roger has talked a few times about starting a document to collect these. Roger, do you remember if this ever went anywhere, and produced a draft or something?
For your first few versions, I'd suggest that your best bet is to label your software loudly as experimental and alpha, since there will almost certainly be ways to distinguish your software and surprising bugs. Trying to be completely indistinguishable is probably impossible, due to timing issues at least -- about the best you can do is try to avoid easy ways to passively distinguish your software.
This all makes sense, and there are a lot of active behaviour parts that I definitely do not yet have implemented in an identical way (e.g. around padding). I've tried to mimic what the official client sends on the wire where possible. Other cases I've found include the specific format of HTTP requests, including the headers used and their ordering (for anonymous HsDir access); base64 encoding in the directory formats, which is sometimes explicitly specified as being without trailing padding =s; canonical ordering of link specifiers; and ordering of entries in directory documents. In rend-spec-v3, there is an order in which directory keys are described, but I do not think it's explicitly stated that this should be the order in which the keys are given.
If there are any more examples that come to mind where fingerprinting of data contracts would be possible, I'd appreciate any pointers.
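To illustrate the base64 padding point above, here is a minimal C++ sketch of emitting and accepting unpadded base64. The function names are hypothetical and not taken from any Tor codebase, and which directory fields require the unpadded form has to be checked against dir-spec; this only shows the mechanical part.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: drop trailing '=' padding from an already-encoded
// base64 string, for fields that the spec says are written without padding.
std::string strip_base64_padding(std::string_view encoded) {
    while (!encoded.empty() && encoded.back() == '=') {
        encoded.remove_suffix(1);
    }
    return std::string(encoded);
}

// Hypothetical helper: restore padding so that a strict base64 decoder
// accepts the value. A length of 1 modulo 4 can never be valid base64,
// so that case is rejected instead of being padded.
std::optional<std::string> restore_base64_padding(std::string_view unpadded) {
    if (unpadded.size() % 4 == 1) {
        return std::nullopt;
    }
    std::string out(unpadded);
    while (out.size() % 4 != 0) {
        out.push_back('=');
    }
    return out;
}
```

Emitting padding where the official client omits it (or the reverse) is exactly the kind of passive difference discussed above, so it seems safer to route every such field through one pair of helpers than to handle padding ad hoc at each call site.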
In general, I wouldn't mind taking patches to enhance the specs by describing a preferred behavior whenever there is more than one possible behavior.
- When it comes to bootstrapping, the official implementation appears to favour accessing directories via plaintext HTTP rather than connecting on the OR port and using create fast / begin dir. What is the motivation for using the plaintext option (and for that matter, having a plaintext HTTP service open at all)? While the OR will learn just as much about the client regardless, it seems like the default plaintext access to directory information unnecessarily gives away details of how clients engage with the Tor network to third parties.
That isn't right. It's preferred for clients to download directory material over the ORPort via begindir. Plaintext DirPorts are supposed to be used by relays and authorities only.
What part of the spec or the implementation says that the plaintext dirport should be preferred? I'd like to correct that.
I may have misinterpreted the spec here. Dir-spec 5.1 states that "The client does not build circuits until it has a live network-status consensus document, and it has descriptors for a significant proportion of the routers that it believes are running (this is configurable using torrc options and consensus parameters)."
I took this to mean that _all_ bootstrapping would be done over HTTP until the point of having "enough" descriptors, but I assume now that this is meant to describe only "real" anonymous circuits?
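The gate described in that dir-spec sentence could be sketched roughly as follows in C++. Everything here is an assumption for illustration: the struct, the function names, and the `min_fraction` parameter are hypothetical, and a plain descriptor count is used as a simplification, so the criteria the official client actually applies may well differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical snapshot of what the client currently knows.
struct BootstrapState {
    std::int64_t now;              // current time, seconds since the epoch
    std::int64_t valid_after;      // consensus valid-after time
    std::int64_t valid_until;      // consensus valid-until time
    bool have_consensus;           // a consensus has been fetched and verified
    std::size_t routers_running;   // routers the consensus lists as running
    std::size_t descriptors_held;  // of those, how many descriptors we hold
};

// A consensus is treated as live while the current time lies between its
// valid-after and valid-until times.
bool consensus_is_live(const BootstrapState& s) {
    return s.have_consensus && s.valid_after <= s.now && s.now <= s.valid_until;
}

// Gate for building "real" anonymous circuits. min_fraction stands in for
// the "significant proportion" mentioned in the spec and is an assumption.
bool may_build_anonymous_circuits(const BootstrapState& s, double min_fraction) {
    if (!consensus_is_live(s) || s.routers_running == 0) {
        return false;
    }
    const double held = static_cast<double>(s.descriptors_held) /
                        static_cast<double>(s.routers_running);
    return held >= min_fraction;
}
```

Under this reading, one-hop directory fetches over begindir would not be subject to the gate; only multi-hop circuits carrying user traffic would wait for it.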
In my current implementation, I do use begindir, and currently always use create fast (as this is required in the cold start case). Is there a meaningful preference here between using create/create fast for the non-anonymous circuits used for directory info? What is the official client doing?
- When using bridges and in particular pluggable transports, how is the client intended to safely bootstrap in the cold start case where it does not know up front which bridge/relay it will be connected to (e.g. when using Snowflake)? The RSA identity can be accepted in blind faith based on the Tor handshake, and it's then possible to get the full details with create fast / begin dir, but how does a client know that it has been connected to a bridge that is "blessed" by the Tor network rather than a MITM actor?
Bridge addresses and identities need to be discovered out-of-band, by some means like bridgedb.torproject.org, personal communication, or bundling with software. The provenance of this information is the only way to tell whether you're getting a likely-to-be-good bridge or likely-to-be-run-by-your-enemy bridge.
For normal bridges I understand this point, though I am not clear on how it is intended to work with pluggable transports like Meek or Snowflake. Unless I've missed it, I've not seen any secure identity being specified when choosing a Meek or Snowflake setup, so how does the OP know who it is (or should be) connected to on the first hop?
- Finally, if anyone reading has been involved with or close to the development of other unofficial Tor implementations, what are the lessons learned on this front? I'm aware of, among others, Orchid (last updated in 2016), node-Tor (does not implement ECC) and torpy (does not implement hidden services v3). What makes these fail / stall?
I'll let developers answer here -- part of the issue is that it can be hard to maintain feature parity over time.
For a longer list of implementations, see https://gitlab.torproject.org/legacy/trac/-/wikis/doc/ListOfTorImplementatio... [warning -- wiki migration in progress].
Thank you for that, I hadn't seen this specific list before!
Many thanks, P