Sections 3 and 6 of the Quux paper have some relevant discussion [1].
> Unfortunately, it appears that the Quux QUIC paper studied QUIC at the
> wrong position - between relays, and the QuTor implementation is
> unclear. This means that it may still be an open question as to if
> QUIC's congestion control algorithms will behave optimally in a
> Tor-like network under heavy load.
As Reardon and Goldberg noted in their concluding remarks, approaches
other than hop-by-hop will incur an extra cost for retransmissions, since
these must be rerouted through a larger part of the network [RG09].
As Tschorsch and Scheuermann discuss [TS12], due to the longer RTT of
TCP connections, end-to-end approaches will also take longer to “ramp
up” through slow start and up to a steady state.
Both of these factors (not to mention increased security risk of
information leakage [DM09]) suggest that hop-by-hop designs are likely
to yield better results. In fact, the hop-by-hop approach may be viewed as
an instance of the Split TCP Performance-Enhancing Proxy design, whereby
arbitrary TCP connections are split in two to negate the issues noted
above.
> Unfortunately, the Quux implementation appears to use QUIC at a
> suboptimal position -- they replace Tor's TLS connections with QUIC,
> and use QUIC streams in place of Tor's circuit ID -- but only between
> relays. This means that it does not benefit from QUIC's end-to-end
> congestion control for the entire path of the circuit. Such a design
> will not solve the queuing and associated OOM problems at Tor relays,
> since relays would be unable to drop cells to signal backpressure to
> endpoints. Drops will instead block every circuit on a connection
> between relays, and even then, due to end-to-end reliability, relays
> will still need to queue without bound, subject to Tor's current (and
> inadequate) flow control.
A fully QUIC relay path (with slight modification to fix a limit on
internal buffer sizes) would allow end-to-end backpressure to be used
from the client application TCP stream up to the exit TCP stream.
Leaving aside Tor’s inbound rate limit mechanism but retaining the
global outbound limit, this design would allow max-min fairness to be
achieved in the network, as outlined by Tschorsch and Scheuermann
[TS11].
...
Once implemented, however, backpressure would allow Tor to adopt a
significantly improved internal design. In such a design, a Tor relay
could read a single cell from one QUIC stream’s read buffer, onion crypt
it, and immediately place it onto the write buffer of the next stream in
the circuit. This process could operate at the granularity of a single
cell because QUIC's read and write operations are cheap user-space
function calls, not syscalls as with host TCP.
The scheduling of this action would be governed by the existing EWMA
scheduler, applied to circuits that have both a readable stream and a
writable stream (and as allowed by a global outgoing token bucket),
allowing optimal quality of service for circuits.
It’s expected that backpressure implemented in this way will yield
significant performance and fairness gains on top of the performance
improvement found in this thesis.
One issue for Quux was that it used the Chromium demo QUIC server code as the
basis for its implementation, which was fine for performance research but not
such a good choice for Tor's networking stack.
Several Rust implementations have since been released that support
server-side (not just client-side) usage, so I expect that to be much
less of an issue today.
io_uring is also a significant development since Quux was written: it
can reduce the performance hit of host-TCP syscalls, or of using
recvmsg instead of recvmmsg with QUIC if the implementation makes it
difficult to use recvmmsg on the listener side.
The following paper has in-depth discussion, but I don't have a copy to
hand unfortunately:
Ali Clark. Tor network performance — transport and flow control. Technical report, University College London, April 2016.