Greetings tor-dev!
This email is a discussion on adding tracing to little-t tor. Tracing can be a very abstract notion sometimes, so I'll start with a quick overview of what it is, what we can achieve with it, and its use cases within tor. Then I'll go over a last point, which is safety.
This email doesn't go into the technical details of userspace tracing, nor into how it would be added to tor. That is for another discussion.
1. Overview
Long story short, you can see tracing as a specific type of logging: it records information about the application at runtime using tracepoints (similar to logging statements) so that it can be used later. It differs from logging in two main ways: performance and API stability.
Usually, tracing implies high performance: it adds very little overhead, so as to disrupt the normal behavior of the application as little as possible. This is extremely useful when you want to catch race conditions or performance bottlenecks.
Userspace tracers usually have an "in-process library" which, in short, records raw data from the application and moves it to an outside buffer. That buffer is then emptied, either to disk or over the network, by the outside component of the tracer, and the data can be analyzed after collection.
So all a tracer does, when a tracepoint is hit within the application, is copy some data into a buffer and yield back to the application.
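To make the "copy into a buffer and yield back" idea concrete, here is a minimal single-threaded sketch in C. Everything in it (the `trace_event` layout, `trace_record()`, `trace_drain()`, the ring size) is hypothetical and only meant to illustrate the principle; real tracers typically use per-CPU or shared-memory buffers with atomic operations, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size event record; real tracers define their own. */
struct trace_event {
  uint64_t timestamp_ns; /* when the tracepoint was hit */
  uint32_t event_id;     /* which tracepoint fired */
  uint32_t payload;      /* raw value copied from the application */
};

#define TRACE_RING_SLOTS 1024 /* must be a power of two */

struct trace_ring {
  struct trace_event slots[TRACE_RING_SLOTS];
  size_t head;    /* total events written (producer side) */
  size_t tail;    /* total events read (consumer side) */
  size_t dropped; /* events discarded because the ring was full */
};

/* Called at a tracepoint: copy the raw data and return immediately.
 * No string formatting and no I/O happen here.
 * Returns 1 if the event was recorded, 0 if it was dropped. */
int
trace_record(struct trace_ring *ring, uint64_t now_ns,
             uint32_t event_id, uint32_t payload)
{
  if (ring->head - ring->tail == TRACE_RING_SLOTS) {
    ring->dropped++;
    return 0;
  }
  struct trace_event *ev = &ring->slots[ring->head & (TRACE_RING_SLOTS - 1)];
  ev->timestamp_ns = now_ns;
  ev->event_id = event_id;
  ev->payload = payload;
  ring->head++;
  return 1;
}

/* Called by the outside component: move up to `max` events into `out`,
 * from where they would be written to disk or sent over the network.
 * Returns the number of events copied. */
size_t
trace_drain(struct trace_ring *ring, struct trace_event *out, size_t max)
{
  size_t n = 0;
  while (n < max && ring->tail != ring->head) {
    out[n++] = ring->slots[ring->tail & (TRACE_RING_SLOTS - 1)];
    ring->tail++;
  }
  return n;
}
```

The point of the design is that the cost paid inside the application is a bounds check and a few stores; everything expensive happens on the consumer side.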
The other part is API stability. Very often, logs (let's say at DEBUG level) don't have strict stability requirements between released versions. But tracing events (tracepoints) are exposed to the outside for tracers to hook onto, and for people to run analysis tools on the recorded data. Thus, stability is usually strongly encouraged. In other words, what a tracepoint exposes, once released as stable, should really not change much over time.
With a proper abstraction in the application, we can offer stable tracepoints that a variety of tracers can hook onto at runtime. It is all about providing an interface to the outside world.
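As a rough illustration of what such an abstraction could look like, here is a small C sketch in which application code only ever calls one stable entry point, and whichever tracer is in use registers itself behind it. All the names here (`tp_event`, `tp_emit()`, `tp_set_backend()`, the counting backend) are made up for the example and are not a proposal for the actual tor API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stable event identifiers: once released, a value keeps
 * its meaning so that external analysis tools keep working. */
enum tp_event {
  TP_CELL_QUEUED = 1,
  TP_CELL_FLUSHED = 2
};

/* A backend is whatever tracer hooks onto the tracepoints. */
typedef void (*tp_backend_fn)(enum tp_event ev, uint64_t arg, void *ctx);

static tp_backend_fn tp_backend = NULL;
static void *tp_backend_ctx = NULL;

/* Register (or clear, with fn == NULL) the active tracer backend. */
void
tp_set_backend(tp_backend_fn fn, void *ctx)
{
  tp_backend = fn;
  tp_backend_ctx = ctx;
}

/* The only call that appears in application code. */
void
tp_emit(enum tp_event ev, uint64_t arg)
{
  if (tp_backend != NULL)
    tp_backend(ev, arg, tp_backend_ctx);
}

/* Example backend that just counts events and remembers the last value. */
struct tp_counter {
  uint64_t hits;
  uint64_t last_arg;
};

void
tp_counter_backend(enum tp_event ev, uint64_t arg, void *ctx)
{
  struct tp_counter *c = ctx;
  (void)ev; /* this toy backend ignores the event type */
  c->hits++;
  c->last_arg = arg;
}
```

Swapping tracers then means registering a different backend; the `tp_emit()` call sites, and the meaning of the event identifiers, stay the same.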
2. Why Tracing in Tor
The tor software is a very complex beast. It has dozens of subsystems with various interactions between them. One of the main jobs of tor is to relay data as fast as possible in order to keep latency low. This means there are code paths considered "fast path", implying that they must remain light and fast. One example is the crypto code that is hit for each cell.
Tracing comes in extremely handy to hunt down race conditions, performance issues, or even multithreading problems. On a fast relay, let's say 25MB/s, if we wanted to record cell timing in order to hunt down such issues, it simply can _not_ be done with logging at debug level: it slows tor down considerably and also fills the disk in a matter of minutes.
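For a rough sense of scale, here is a back-of-the-envelope calculation written as a small C sketch. The inputs are assumptions for illustration only: 25MB/s taken as 25,000,000 bytes per second, 514-byte fixed-size cells, and a guessed 100 bytes of log text per line. The function names are hypothetical.

```c
#include <stdint.h>

/* Cells relayed per second at a given throughput (integer division). */
uint64_t
cells_per_second(uint64_t bytes_per_second, uint64_t cell_size)
{
  return bytes_per_second / cell_size;
}

/* Bytes of log text written per minute if every cell produces
 * `lines_per_cell` log lines of `bytes_per_line` bytes each. */
uint64_t
log_bytes_per_minute(uint64_t cells_per_sec, uint64_t lines_per_cell,
                     uint64_t bytes_per_line)
{
  return cells_per_sec * lines_per_cell * bytes_per_line * 60;
}
```

Under those assumptions, `cells_per_second(25000000, 514)` gives 48638 cells per second, and a single 100-byte line per cell already comes to about 292 MB of log text per minute. Debug-level logging emits more than one line per cell, and the volume scales linearly with that.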
And using the control port is not a good solution for two main reasons: it requires string formatting at each event, and the control port is part of the mainloop. So anything you ask to go through the control port adds overhead to the overall behavior of tor, which is not good when you are hunting down races.
One concrete example where tracing was used in the past in tor is the rewrite of the cell scheduler (KIST). In order to measure cell timings within tor so that bottlenecks could be found, tracing had to be added so that millions of events could be recorded within a few minutes of running a fast relay in production.
It is in such high-pressure situations that tracing comes in handy. Tracing was also used recently to find onion service v3 reachability issues. In order to correlate connection, cell and circuit level problems with the higher level HS subsystem, we were able to record events in all those subsystems, match them using their precise timing (offered by tracing) and analyze the results later on, after recording the data.
3. Safety Discussion
Onto the last part I wanted to raise. Since tracing allows anyone to record very low-level data from tor, there is an obvious safety question that must be addressed.
Over the years, I've talked about tracing with many people in Tor and the consensus was always that it should never be enabled in production. As in, the packages shipped by Tor or by distros should _never_ build the tracepoints.
In other words, it should be considered a development option only. Not just an option: the tracepoints are compiled _out_ in production builds, and one has to explicitly build them into tor.
For example (nothing final, just to show the idea):
$ ./configure --enable-tracing
I personally think that should be enough, since the presence of the code upstream won't stop people from using it (for bad purposes or not), but we can prevent it from being in any legit Tor packages out there. See it a bit like the obsolete Tor2web option, which was never enabled in any packages published by the Tor Project or distros: one had to explicitly enable it at configure time.
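To show what "compiled out" could mean in practice, here is a minimal C sketch, again nothing final. It assumes a configure-time option would define a preprocessor symbol; the symbol `EXAMPLE_TRACING_ENABLED`, the macro `EXAMPLE_TRACE()` and the functions are all hypothetical names invented for this example.

```c
#include <stdint.h>

/* HYPOTHETICAL: a build configured with tracing would define this symbol.
 * Default (production) builds would leave it undefined. */
/* #define EXAMPLE_TRACING_ENABLED 1 */

#ifdef EXAMPLE_TRACING_ENABLED
/* Provided by whichever tracer the build was configured with. */
void example_tracer_record(const char *subsystem, const char *event,
                           uint64_t arg);
#define EXAMPLE_TRACE(subsystem, event, arg) \
  example_tracer_record(#subsystem, #event, (uint64_t)(arg))
#else
/* Compiled out: expands to an empty statement, so no tracing code and
 * no tracer symbols end up in the binary. Arguments are not evaluated. */
#define EXAMPLE_TRACE(subsystem, event, arg) ((void)0)
#endif

/* Application code looks the same in both kinds of build. */
uint64_t
relay_process_cell_example(uint64_t cell_count)
{
  cell_count++;
  EXAMPLE_TRACE(relay, cell_processed, cell_count);
  return cell_count;
}
```

With the symbol undefined, as above, the tracepoint disappears at preprocessing time, so a packaged binary would contain nothing for a tracer to hook onto.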
The ControlPort is allowed in production and if a malicious actor gets access to it, then it is game over. I see tracing the same way, but at least we can control its availability as a feature, which we can't do for the ControlPort as of today.
Any feedback is very welcome! Concerns, questions, thoughts.
Cheers! David