Sections 3 and 6 of the Quux paper have some relevant discussion [1].
Unfortunately, it appears that the Quux paper studied QUIC at the wrong position - between relays only - and the QuTor implementation is unclear. It may therefore still be an open question whether QUIC's congestion control algorithms will behave well in a Tor-like network under heavy load.
As Reardon and Goldberg noted in their concluding remarks, approaches other than hop-by-hop will incur an extra cost for retransmissions, since lost data must be rerouted through a larger part of the network [RG09].
As Tschorsch and Scheuermann discuss [TS12], end-to-end approaches will also take longer to “ramp up” through slow start to a steady state, due to the longer RTT of the end-to-end connection.
Both of these factors (not to mention the increased security risk of information leakage [DM09]) suggest that hop-by-hop designs are likely to yield better results. In fact, the hop-by-hop approach may be viewed as an instance of the Split TCP Performance-Enhancing Proxy design, whereby arbitrary TCP connections are split in two to negate the issues noted above.
Unfortunately, the Quux implementation appears to use QUIC at a suboptimal position: it replaces Tor's TLS connections with QUIC and uses QUIC streams in place of Tor's circuit IDs, but only between relays. This means it does not benefit from QUIC's end-to-end congestion control over the entire path of the circuit. Such a design will not solve the queuing and associated OOM problems at Tor relays, since relays would be unable to drop cells to signal backpressure to endpoints. Drops will instead block every circuit on a connection between relays, and even then, due to end-to-end reliability, relays will still need to queue without bound, subject to Tor's current (and inadequate) flow control.
A fully QUIC relay path (with a slight modification to fix a limit on internal buffer sizes) would allow end-to-end backpressure to be applied from the client application's TCP stream all the way to the exit's TCP stream. Leaving aside Tor's inbound rate limit mechanism but retaining the global outbound limit, this design would allow max-min fairness to be achieved in the network, as outlined by Tschorsch and Scheuermann [TS11].
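To make the max-min fairness goal concrete, here is a small illustrative sketch (not Tor or Quux code, and not the [TS11] algorithm itself) of the classic "water-filling" allocation that the max-min objective describes: flows whose demand is below the fair share keep their demand, and the leftover capacity is re-split among the rest.

```rust
// Max-min fair ("water-filling") allocation of one shared link among flows.
// Illustrative only: real networks compute this implicitly via congestion
// control and backpressure, not via a central allocator.
fn max_min_fair(capacity: f64, demands: &[f64]) -> Vec<f64> {
    let mut alloc = vec![0.0; demands.len()];
    let mut remaining = capacity;
    let mut active: Vec<usize> = (0..demands.len()).collect();
    while !active.is_empty() {
        let share = remaining / active.len() as f64;
        // Flows whose demand fits under the current fair share are satisfied.
        let satisfied: Vec<usize> =
            active.iter().copied().filter(|&i| demands[i] <= share).collect();
        if satisfied.is_empty() {
            // Every remaining flow is bottlenecked here: split evenly.
            for &i in &active {
                alloc[i] = share;
            }
            break;
        }
        for &i in &satisfied {
            alloc[i] = demands[i];
            remaining -= demands[i];
        }
        active.retain(|i| !satisfied.contains(i));
    }
    alloc
}

fn main() {
    // Three circuits sharing a 10-unit link, demanding 2, 8 and 8 units:
    // the light circuit keeps its 2, the heavy pair split the remaining 8.
    println!("{:?}", max_min_fair(10.0, &[2.0, 8.0, 8.0])); // [2.0, 4.0, 4.0]
}
```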
...
Once implemented, however, backpressure would allow Tor to adopt a significantly improved internal design. In such a design, a Tor relay could read a single cell from one QUIC stream's read buffer, onion crypt it, and immediately place it onto the write buffer of the next stream in the circuit. This process can operate at the granularity of a single cell because QUIC reads and writes are cheap user-space function calls, rather than syscalls as with host TCP.
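As a rough sketch of that per-cell relay step, with mock buffer types standing in for a real QUIC library's stream API and a trivial XOR as a placeholder for the actual onion crypto (the cell size and buffer bound are illustrative, not Tor's real constants):

```rust
use std::collections::VecDeque;

const CELL_LEN: usize = 512; // placeholder fixed cell size
const WRITE_BUF_CELLS: usize = 32; // bounded per-stream write buffer

// Stand-in for a QUIC stream's per-stream read/write buffers; a real QUIC
// implementation exposes these through its own stream API.
struct Stream {
    read_buf: VecDeque<[u8; CELL_LEN]>,
    write_buf: VecDeque<[u8; CELL_LEN]>,
}

// Placeholder for one layer of onion crypto (not the real cipher).
fn onion_crypt(cell: &mut [u8; CELL_LEN], key: u8) {
    for b in cell.iter_mut() {
        *b ^= key;
    }
}

// Forward at most one cell from `inbound` to `outbound`. Returns false when
// there is nothing to read or no room to write; leaving the cell unread in
// that case is exactly what propagates backpressure to the previous hop,
// with no unbounded relay-side queue.
fn relay_one_cell(inbound: &mut Stream, outbound: &mut Stream, key: u8) -> bool {
    if outbound.write_buf.len() >= WRITE_BUF_CELLS {
        return false; // next hop is full: stop reading, apply backpressure
    }
    match inbound.read_buf.pop_front() {
        Some(mut cell) => {
            onion_crypt(&mut cell, key);
            outbound.write_buf.push_back(cell);
            true
        }
        None => false,
    }
}

fn main() {
    let mut a = Stream { read_buf: VecDeque::new(), write_buf: VecDeque::new() };
    let mut b = Stream { read_buf: VecDeque::new(), write_buf: VecDeque::new() };
    a.read_buf.push_back([0u8; CELL_LEN]);
    println!("forwarded: {}", relay_one_cell(&mut a, &mut b, 0x5a));
}
```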
This action would be scheduled by the existing EWMA circuit scheduler, across circuits that have both a readable stream and a writeable stream (and as permitted by a global outgoing token bucket), providing good quality of service for circuits.
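A minimal sketch of that scheduling rule, with illustrative types and an illustrative per-tick decay constant (Tor's actual EWMA implementation uses a configurable time-based half-life): among circuits that are ready at both ends, service the one with the lowest EWMA of recently sent cells, so quiet interactive circuits get priority over bulk ones.

```rust
// A circuit eligible for scheduling, with an EWMA of its recent cell count.
struct Circuit {
    id: u32,
    ewma: f64,
    readable: bool, // inbound stream has a cell waiting
    writable: bool, // outbound stream has buffer space
}

const DECAY: f64 = 0.9; // illustrative per-tick decay, not Tor's real half-life

// Periodically decay every circuit's EWMA so old activity is forgotten.
fn tick(circuits: &mut [Circuit]) {
    for c in circuits.iter_mut() {
        c.ewma *= DECAY;
    }
}

// Charge a circuit for one relayed cell.
fn on_cell_sent(c: &mut Circuit) {
    c.ewma += 1.0;
}

// Pick the next circuit to service: lowest EWMA among circuits that have
// both a readable stream and a writeable stream.
fn pick_next(circuits: &[Circuit]) -> Option<usize> {
    circuits
        .iter()
        .enumerate()
        .filter(|(_, c)| c.readable && c.writable)
        .min_by(|(_, a), (_, b)| a.ewma.partial_cmp(&b.ewma).unwrap())
        .map(|(i, _)| i)
}

fn main() {
    let circuits = vec![
        Circuit { id: 1, ewma: 5.0, readable: true, writable: true },
        Circuit { id: 2, ewma: 1.0, readable: true, writable: false },
        Circuit { id: 3, ewma: 3.0, readable: true, writable: true },
    ];
    // Circuit 2 has the lowest EWMA but no writable stream, so circuit 3 wins.
    println!("next: {:?}", pick_next(&circuits).map(|i| circuits[i].id));
}
```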
It’s expected that backpressure implemented in this way would yield significant performance and fairness gains on top of the performance improvement found in this thesis.
One issue for Quux was that it used the Chromium demo QUIC server code as the basis for its implementation, which was adequate for performance research but a poor fit as a production networking stack for Tor.
Several Rust QUIC implementations have since been released that support server-side (not just client-side) usage, so I expect this to be much less of an issue today.
io_uring is also a significant development since Quux was written: it can reduce the overhead of host-TCP syscalls, or of falling back to recvmsg with QUIC where the implementation makes recvmmsg difficult to use on the listener side.
[1] https://www.benthamsgaze.org/wp-content/uploads/2016/09/393617_Alastair_Clar...
The following paper has in-depth discussion, but I don't have a copy to hand unfortunately:
Ali Clark. Tor network performance — transport and flow control. Technical report, University College London, April 2016