Hello everyone!
DISCLAIMER: The following is enormous and tries to describe in some level of detail the situation in tor with connection<->channel<->scheduler. It comes after we merged the KIST scheduler and realized that many things weren't what they were supposed to be or meant for. In the end, I'm asking questions so we can move forward with development and fixing things.
One last thing before you start your journey into the depths of Tor: the 3 subsystems I'm going to talk about, and how they interact, are quite complicated, so it is very possible that I got some things wrong or missed some details. Please point them out so we can document better, be better informed, and make good decisions. I plan to document as much as I can from this process in a new file in the torguts.git repository.
This is part of the work from ticket #23993.
Enjoy!
== Part One - The Distant Past ==
Once upon a time, Tor was 0.2.3 years old. It was before the "big change", and it was using connections and circuits in a simple yet clever way. I'll briefly describe here how it worked and why we wanted to change it.
Roughly, once a cell comes in, it is put in the inbound buffer ("inbuf") of a connection_t and processed. Then, if it needs to be relayed, that is, sent along the circuit it came in on, it is put in that circuit's queue. Each connection_t had a priority queue of "active circuits", prioritized using the EWMA policy. A certain number of cells would be flushed from the first circuit onto the connection_t outbound buffer ("outbuf"), from which the libevent main loop takes over to actually write the bytes to the network. Then the circuit pqueue is updated and we move on. The following is a high level diagram of a relayed cell:
+--------------+    +-----------+    +---------------+
|connection_t A|    | circuit_t |    | connection_t B|
|   (inbuf)    |--->| (pqueue)  |--->|   (outbuf)    |---> Network
+--------------+    +-----------+    +---------------+
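To make that flow a bit more concrete, here is a rough sketch of the idea. None of these names or types are the real 0.2.3 code; it is just an illustration of a per-connection priority queue keyed on the EWMA cell count:

  /* Illustration only -- not the actual 0.2.3 symbols or layout.  Each
   * connection owns a priority queue of its own active circuits, keyed on
   * the EWMA cell count (lower count == higher priority). */
  struct old_circuit {
    double ewma_cell_count;       /* decayed count of recently relayed cells */
    int n_queued_cells;           /* cells waiting in this circuit's queue */
  };

  struct old_connection {
    struct old_circuit **active;  /* pqueue of active circuits, best first */
    int n_active;
    /* outbuf omitted: libevent writes it to the network later */
  };

  /* Flush up to max_cells from this connection's best circuit onto its
   * outbuf.  Circuits on *other* connections are never considered, which
   * is the problem described below. */
  static void
  old_conn_flush(struct old_connection *conn, int max_cells)
  {
    if (conn->n_active == 0)
      return;
    struct old_circuit *circ = conn->active[0];        /* best circuit */
    int n = max_cells < circ->n_queued_cells ?
            max_cells : circ->n_queued_cells;
    circ->n_queued_cells -= n;   /* pretend we copied n cells to the outbuf */
    circ->ewma_cell_count += n;  /* decayed over time elsewhere */
    /* then re-insert circ into the pqueue at its new priority */
  }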
That had at least one big problem. Because the entire set of circuits was never considered, one connection (with some circuits on it) could bloat the relay's network link. It is actually worse than that if the relay has a bandwidth limitation: you want to flush the connection that carries the highest-priority circuit, not just whichever connection you happen to be looking at.
All in all, tor needed to move to a logic where _all_ circuits are considered using the EWMA policy, and the cell(s) of the picked circuit are flushed onto its connection's outbuf so it is prioritized on the network.
Then in 0.2.4, everything changed!
== Part Two - The Big Change ==
Two new abstractions were added in Tor 0.2.4 (#6465): channel and circuitmux (short for circuit multiplexer or multiplexing, history is unclear ;). To this day, this is what tor is using.
The channel_t object was intended to provide a "transport" abstraction layer for a relay to relay connection, which is represented by a connection_t object. Each channel has a circuitmux_t object that encapsulates the queuing logic and the circuit selection policy of that channel. In other words, the circuits that go through that connection are listed in that object so we can do cell prioritization.
To summarize, it roughly looks like this now:
+------------+
|connection_t|
+------------+
      |
      v 1
 +---------+
 |channel_t|
 +---------+
      |
      v 0..n
 +---------+
 |circuit_t|
 +---------+
Channels are supposed to work as a protocol abstraction on top of that connection. The original intent was to make it easier to implement different protocols, for instance to help researchers experiment with other transports. The TLS handling was to be decoupled from connection_t and put in a subclass of channel_t, and thus channel_tls_t was born. It implements the channel_t interface and is, in theory, designed to handle the TLS layer of a relay to relay connection (handshake, cells, etc.). So far, tor only has that one channel class.
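For readers who (like me) need a picture, here is a tiny sketch of the intended layering. These are not the real definitions in channel.h / channeltls.h, just an illustration of a base "class" with a method table and one concrete subclass:

  /* Illustration only -- the real structs are much bigger.  A channel
   * exposes a small set of "virtual methods"; the TLS channel is the one
   * concrete implementation and points back to the connection_t that owns
   * the TCP/TLS state. */
  struct cell_t;
  struct circuitmux_t;       /* queuing logic + circuit selection policy */
  struct connection_t;       /* underlying TCP (and, today, TLS) connection */

  typedef struct channel_t {
    struct circuitmux_t *cmux;  /* circuits multiplexed on this channel */
    int (*write_cell)(struct channel_t *chan, struct cell_t *cell);
    void (*close)(struct channel_t *chan);
  } channel_t;

  typedef struct channel_tls_t {
    channel_t base;               /* implements the channel_t interface */
    struct connection_t *conn;    /* the OR connection doing the TLS work */
  } channel_tls_t;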
Now that we had this layer and a way to multiplex circuits on a single connection through the use of channels, we needed one last piece: a way to prioritize cells on the network considering all circuits. Along came the "cell scheduler" (scheduler.c) in Tor 0.2.6, now called the Vanilla scheduler.
In a nutshell, the libevent main loop calls the scheduler at a constant rate and the subsystem compares all channels against each other, using circuit priority under the EWMA policy (an interface was created to allow more policies but I won't get into that in this email).
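As a reminder of how EWMA works (simplified, and not the literal circuitmux_ewma.c code): each circuit keeps a cell count that decays exponentially with a configured half-life, and the circuit, or the channel whose best circuit, with the lowest count wins, so quiet circuits get priority over busy ones. Something like:

  #include <math.h>

  /* Decay a circuit's cell count after elapsed_sec seconds, then account
   * for cells_sent newly flushed cells.  halflife_sec comes from the
   * consensus/configuration in the real implementation. */
  static double
  ewma_update(double cell_count, double elapsed_sec, double cells_sent,
              double halflife_sec)
  {
    double decay = pow(0.5, elapsed_sec / halflife_sec);
    return cell_count * decay + cells_sent;
  }

  /* Conceptual comparison between two channels: the one whose best
   * (lowest-count) circuit has the smaller EWMA value should be
   * scheduled first. */
  static int
  ewma_compare(double best_circ_a, double best_circ_b)
  {
    if (best_circ_a < best_circ_b) return -1;
    if (best_circ_a > best_circ_b) return 1;
    return 0;
  }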
So now, tor finally had a good infrastructure to prioritize cells on the network by looking at all circuits at once. TCP connections would be established between relays, the channel layer would handle the TLS business, and the scheduler would flush cells onto the network by considering all circuits on all channels. It is a really nice modularized design and should improve tor's performance.
But, that's not entirely the reality of things right now.
== Part Three - The Truth ==
In 2014, a paper came out titled "Never Been KIST: Tor's Congestion Management Blossoms with Kernel-Informed Socket Transport" (http://www.robgjansen.com/publications/kist-sec2014.pdf). It described an improved way for tor to schedule data transmission properly by asking the kernel for information about the state of the TCP buffers, so tor could flush data onto the network without bloating the local link.
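The core of the KIST idea, very roughly and on Linux only (this is not the code that ended up in scheduler_kist.c, just a sketch of the kernel query it is built around):

  #include <stdint.h>
  #include <sys/socket.h>
  #include <netinet/tcp.h>   /* struct tcp_info, TCP_INFO (Linux) */

  /* Ask the kernel how much this TCP connection can actually put on the
   * wire right now.  Writing more than this just sits in kernel buffers
   * where tor can no longer reprioritize it. */
  static int64_t
  kist_writable_estimate(int sock)
  {
    struct tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
      return -1;
    /* Bytes the congestion window still allows: window size minus what
     * has been sent but not yet acked. */
    int64_t cwnd_bytes = (int64_t)info.tcpi_snd_cwnd * info.tcpi_snd_mss;
    int64_t unacked_bytes = (int64_t)info.tcpi_unacked * info.tcpi_snd_mss;
    int64_t space = cwnd_bytes - unacked_bytes;
    return space > 0 ? space : 0;
  }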
Then, from 2016 up to now, Matthew Traudt implemented KIST in tor, which required an important refactoring of the scheduler subsystem in order to make it work with multiple scheduling types. It is today the Scheduler option in torrc, and 0.3.2 tor uses KIST by default. Many things were discovered by Matt during that piece of work. For instance, the very important bug #20459, where we were comparing cmux policies incorrectly and thus ultimately prioritizing our circuits wrongly.
But fundamental issues were found in the design of channels and the scheduler. I'll list some of them here; they ultimately lead me to Part Four of this email about our next steps moving forward.
1. Channels implemented cell queues that weren't really queues.
If you look at the channel_t object, you'll notice "incoming_queue" and "outgoing_queue", which basically allow the transition of a cell from a connection_t inbuf to the circuit queue and then to the connection_t outbuf. For a relayed cell, it looks like this:
conn inbuf -> chan in_queue -> circ queue -> chan out_queue -> conn outbuf
The idea behind this is to quickly empty the connection inbuf into an intermediary object (the channel), leaving as much room as we can in the kernel's inbound TCP queue. In other words, offload the data from the kernel buffer to user space, since we need to run some processing on it anyway and there is no need to leave it in the kernel buffer during that time.
Then, from the channel, a cell would be put on the circuit queue and in turn be processed by the scheduler. The scheduler would select a circuit (EWMA policy), flush cells from its queue onto the channel outgoing queue, and from there they would be flushed to the connection outbuf, which libevent takes care of writing to the socket (the network).
One could argue here that, in terms of design, we have too many queues, and that would be true if the channel queues were actually acting like queues. It turns out that the incoming and outgoing queues of a channel are always empty (#23709), because the code queues a cell and then processes it immediately, meaning it either puts it on the circuit queue or on the connection outbuf.
Each cell is memcpy()ed onto a channel queue and then processed right away, just to be copied again into the circuit queue or the outbuf. This has a performance cost over time, as a relay sees thousands of cells a second. It also makes the channel subsystem much more complicated in terms of code, since it tries to handle those queues at every point on the way in and out of the channel subsystem. But the worst part is that the scheduler subsystem based some decisions on the size of the outgoing queue, and when that size is always 0, issues arise, like #23676.
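A minimal sketch of the pattern #23709 describes (not the literal channel.c code): the cell is copied into the channel's queue and handled right away, so the queue is never observed non-empty, yet we still pay for the copies and the bookkeeping:

  #define CELL_LEN 512

  struct cell_t { unsigned char body[CELL_LEN]; };

  struct channel_queue {
    struct cell_t cells[8];
    int n;
  };

  /* Stand-in for "put it on the circuit queue or the connection outbuf". */
  static void
  process_cell(struct cell_t *cell)
  {
    (void)cell;
  }

  static void
  channel_queue_and_process(struct channel_queue *q, const struct cell_t *in)
  {
    q->cells[q->n++] = *in;                /* first copy, into the "queue"... */
    struct cell_t out = q->cells[--q->n];  /* ...second copy, right back out */
    process_cell(&out);                    /* q->n is 0 again already */
  }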
2. DESTROY cells handling
Within a circuitmux object, there is a "destroy cell queue" onto which a DESTROY cell is put for one of the circuits on the cmux. An important thing for tor is that when it needs to send a DESTROY, it needs to _stop_ sending any queued cells on that circuit, dump them, and send only the DESTROY cell.
Many things are problematic currently. For starters, sending a DESTROY cell bypasses the scheduler (#23712). Tor will immediately try to flush the cell onto the network *if* the channel doesn't have queued *writes*, which means if the connection outbuf is empty. If not, once the outbuf is sufficiently drained, the channel will be scheduled in and the DESTROY cell finally sent. I've observed this process taking from 0.5 seconds up to 10 seconds on a normal relay if the outbuf has data in it.
Second, because of this separate queue for DESTROY cells, tor will go back and forth between that queue and the normal queue of a circuit. Remember that the destroy queue is *not* per circuit but per circuitmux. This alternation is driven by "cmux->last_cell_was_destroy", which is set to 1 if the last cell the cmux handled was a DESTROY and 0 if not.
This has a side effect. Before sending a DESTROY cell, the circuit queue is cleared out without changing the circuit's EWMA priority, in order to make sure we don't send cells from that point on. However, because we go back and forth, the EWMA policy will pick the right channel based on the overall state of its circuits (see scheduler_compare_channels()), but it will not use that policy to prioritize the circuit the DESTROY cell belongs to. Furthermore, the destroy cell queue is a FIFO, so if 10 DESTROY cells are queued but the 10th one is actually for the circuit with the highest priority, we won't process it until the 9 that came in before it are flushed.
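To illustrate (this is not the real circuitmux code, just the behaviour as described above): the cmux holds one FIFO of DESTROY cells for the whole channel and flips between that FIFO and the per-circuit queues, so a DESTROY for a high-priority circuit still waits behind every DESTROY queued before it:

  struct destroy_fifo { int n_cells; /* queued DESTROYs, oldest first */ };

  struct cmux_sketch {
    struct destroy_fifo destroys;   /* one FIFO for the whole channel */
    int n_active_circuits;          /* circuits with cells queued */
    int last_cell_was_destroy;      /* 1 if the previous cell came from the FIFO */
  };

  /* Decide where the next cell comes from: alternate between the DESTROY
   * FIFO and the normal circuit queues.  Note that the FIFO order ignores
   * the EWMA priority of the circuits the DESTROYs belong to. */
  static int
  cmux_next_is_destroy(struct cmux_sketch *m)
  {
    if (m->destroys.n_cells == 0)
      return 0;
    if (m->n_active_circuits == 0)
      return 1;
    return !m->last_cell_was_destroy;
  }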
3. Connection and Channel decoupling is kind of a lie
The connection_t object is bound to be a TCP connection, except for DNSPort, which uses UDP. That object also has a "TLS connection state" (tor_tls_t) and a channel_tls_t object (chan).
First, it turns out that the connection subsystem establishes the TLS session, not the channel tls layer. When opening a circuit, it looks like this:

channel_connect_for_circuit()
  |- channel_connect()
    |- channel_tls_connect()
      |- connection_or_connect()
        |- connection_tls_start_handshake()
Second, when processing a cell from the connection inbuf in connection_or_process_cells_from_inbuf(), we'll directly call channel_tls_handle_cell(). The channel_tls_t layer handles PADDING, VERSIONS, CERTS, AUTH_CHALLENGE, AUTHENTICATE and NETINFO cells. The other cell types will be queued on the corresponding circuit queue.
Third, once the TLS session has been established, a VERSIONS cell can be sent, which is done by the connection subsystem with connection_or_send_versions(). This bypasses the channel_tls_t layer, which processes those cells and should be writing them using channel_tls_write_cell_method(). Furthermore, the channel_tls_t layer uses the connection layer to send those cells, via connection_or_send_certs_cell() and co.
My point here is that the channel abstraction is violated in many ways: by directly calling into the TLS channel layer, or by not using it to send the specific cells that belong to the TLS handshake. On top of that, the TLS state is baked into the connection subsystem. This means that anyone wanting to implement a new channel will need to do a major refactoring first, which unfortunately leaves this abstraction kind of pointless. And do not make the mistake of labelling the channel subsystem a "transport layer": it is at best a protocol layer; connection_t handles transport, not channels.
4. Vanilla scheduler is not really a scheduler.
It turns out that the Vanilla scheduler is not really scheduling things but rather shoving as much as it can onto the connection outbuf. This isn't ideal, because tor selects a circuit based on a policy (EWMA), and once the circuit is selected, we push onto the connection outbuf by either writing as much as we can or stopping at 1000 cells (see MAX_FLUSH_CELLS), which is 512 * 1000 bytes == 0.5MB. One can argue about whether writing 0.5MB per scheduler tick is too little or actually too much.
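A sketch of that flush behaviour (again, not the literal scheduler.c code): once a circuit is picked, cells are shoved onto the outbuf until either the cap or the outbuf limit is hit, with no notion of how much that circuit should get relative to the others:

  #define MAX_FLUSH_CELLS 1000   /* cells flushed per picked circuit */
  #define CELL_LEN        512    /* bytes per cell */

  struct sched_circ { int n_queued_cells; };
  struct sched_conn { int outbuf_len; int outbuf_limit; };

  static int
  vanilla_flush_circuit(struct sched_circ *circ, struct sched_conn *conn)
  {
    int flushed = 0;
    while (flushed < MAX_FLUSH_CELLS &&
           circ->n_queued_cells > 0 &&
           conn->outbuf_len + CELL_LEN <= conn->outbuf_limit) {
      circ->n_queued_cells--;          /* "copy" one cell onto the outbuf */
      conn->outbuf_len += CELL_LEN;
      flushed++;
    }
    return flushed;   /* up to 1000 cells == 0.5MB from a single circuit */
  }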
So imagine tor flushing 1000 cells every time it picks a circuit, and then moving on to the next circuit. I'm not sure this plays well with a relay that has a bandwidth limitation, for instance. If you have 100 circuits and 100KB to spend, you probably don't want to spend it all on one single circuit.
Also, because this scheduler runs as often as possible, only one channel is usually considered each time; Matt's experiments showed that in practice. A scheduler that only ever considers one channel can't make good priority decisions, because there is no real choice to make in the first place.
Finally, by bloating the outbuf as much as possible (and I'm not sure exactly how libevent operates here), we leave the decision of when and what to flush onto the network to whatever handles the outbuf. In other words, the scheduler doesn't actually control when the data is written to the socket, so the scheduling decisions aren't necessarily followed through at the transport layer. This is an area where the scheduler and connection subsystems should be tightly connected.
== Part Four - The Conclusion ==
Through this epic journey, we've discovered some issues as well as design problems. Now the question is: what should and can we do about it?
In a nutshell, there are a couple of questions we should ask ourselves and try to answer so we can move forward:
* I now believe that we should seriously discuss the relevance of channels. Originally, the idea was good: provide an abstraction layer for the relay to relay handshake and for sending/processing cells related to the protocol. But, as of now, they only half do it.
There is an important cost in code and maintenance for something that is not properly implemented/finished (the channel abstraction) and that is also unused. An abstraction implemented for only one thing is not really useful, except maybe to offer an example for others? But we aren't providing a good example right now imo...
That being said, we can spend time fixing the channel subsystem, trying to turn it into a nicer interface and fixing all the issues I've described above (and I suspect there are more) so the cell scheduler can play nicely with channels. Or we could rip channels out, eliminating lots of code and reducing our technical debt. I would like us to think seriously about what we want, because that channel subsystem is _complicated_ and very few of us fully understand it afaict.
Ripping them out would bring us back to the following (which is btw basically what we have now, considering the channel queues are useless):
conn inbuf -> circ queue -> conn outbuf
If we don't want to get rid of channels, the fixes are non-trivial. For starters, we have to decide whether we want to keep the channel queues or not, and if yes, we almost need to start from square one in terms of testing, because we would basically be introducing a new layer of cell queuing.
* Dealing with the DESTROY cell design issue will require somewhat trickier work, I think. We must not starve a circuit with a DESTROY cell pending to be sent, or else the other side keeps sending data. But we should also not starve all the other circuits, because if we ever need to send a gazillion DESTROY cells as a priority, we'll make the relay useless (a DoS vector).
The question is: do we trust our EWMA policy to be wise enough to pick the circuit within a reasonable amount of time so we can flush the DESTROY cell from the circuit queue? Or do we really need to bypass the policy, or somehow prioritize that cell, in order to send it asap and avoid load on the network because the other side of the circuit is still sending?
* In the short term, we should get rid of the Vanilla scheduler, because it complexifies the scheduler code a lot by adding unneeded things to channel_t, and it also bloats the scheduler interface with pointless function pointers, for instance. And in my opinion, it is not helping performance the way it is done right now.
Cheers!
David