On Tue, Sep 27, 2022 at 08:22:21PM +0200, Linus Nordberg wrote:
David Fifield <david@bamsoftware.com> wrote Tue, 27 Sep 2022 08:54:53 -0600:
I checked the number of sockets connected to the haproxy frontend port, thinking that we may be running out of localhost 4-tuples. It's still in bounds: with one source and one destination address, the 15000-64000 ephemeral port range allows about 49000 simultaneous connections to the frontend, and we're at about 27000 (but we may have to figure something out for that eventually).
# ss -n | grep -c '127.0.0.1:10000\s*$'
27314
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000 64000
Would more IP addresses and DNS round robin work?
By more IP addresses you mean more localhost IP addresses, I guess? All of 127.0.0.0/8 is localhost, so we can expand the range of four-tuples by using more addresses from that address range in either the source or destination address position. haproxy probably has an option to listen on multiple addresses. The trick is actually using the multiple addresses.

I don't think DNS will work directly, because snowflake-server gets the address of its upstream from the TOR_PT_ORPORT environment variable, which is specified to take an IP:port, not a DNS name (and is implemented that way in goptlib).
https://gitweb.torproject.org/torspec.git/tree/pt-spec.txt?id=ec77ae643f3e47...
https://gitweb.torproject.org/pluggable-transports/goptlib.git/tree/pt.go?h=...

You could try using more addresses from 127.0.0.0/8 in the *source* address position, by specifying the second parameter of net.DialTCP to set the source address here:
https://gitweb.torproject.org/pluggable-transports/goptlib.git/tree/pt.go?h=...
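For illustration, here is a rough sketch of that kind of change (not the actual goptlib code; the 127.0.0.2 source address and the 127.0.0.1:10000 destination are just examples):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        // Example only: use a second loopback address as the source, so
        // each local address gets its own pool of ephemeral ports.
        // Port 0 means the kernel picks the source port as usual.
        laddr := &net.TCPAddr{IP: net.ParseIP("127.0.0.2")}

        // Example destination; in goptlib this comes from TOR_PT_ORPORT.
        raddr, err := net.ResolveTCPAddr("tcp", "127.0.0.1:10000")
        if err != nil {
            log.Fatal(err)
        }

        conn, err := net.DialTCP("tcp", laddr, raddr)
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        log.Printf("connected %v -> %v", conn.LocalAddr(), conn.RemoteAddr())
    }

To actually spread the load you'd rotate laddr over several 127.0.0.0/8 addresses, one choice per connection; on Linux, binding to those addresses needs no extra configuration.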
It may be something inside snowflake-server, for example some central scheduling algorithm that cannot run any faster. (Though if that were the case, I'd expect to see one CPU core at 100%, which I do not.) I suggest doing another round of profiling now that we have taken care of the more obvious hotspots in https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
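If it helps, a low-effort way to get such a profile is Go's built-in net/http/pprof. This is just a sketch, assuming we add a localhost-only debug listener to snowflake-server (if it doesn't already have one); the port 6060 is arbitrary:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/ on the default mux
    )

    func main() {
        // Listen only on localhost so the profiling endpoints are not
        // reachable from outside the bridge.
        go func() {
            log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
        }()

        // ... the rest of snowflake-server's startup would go here ...
        select {}
    }

Then running "go tool pprof http://127.0.0.1:6060/debug/pprof/profile?seconds=30" collects 30 seconds of CPU samples, and /debug/pprof/heap does the same for memory.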
After an interesting chat with anarcat, I think that we are CPU bound, in particular by the cost of handling so many interrupts from the NIC and by the high number of context switches. I have two suggestions for how to move forward with this.
First, let's patch tor to get rid of the extor processes, as suggested by David earlier when discussing RAM pressure. This should bring down context switches.
The easiest way to do this is probably to comment out the re-randomization of the ExtORPort auth cookie file on startup, and replace the existing cookie files with static files. Or even just comment out the failure case in connection_ext_or_auth_handle_client_hash. https://gitweb.torproject.org/tor.git/tree/src/feature/relay/ext_orport.c?h=...
The uncontrollable rerandomization of auth cookies is the whole reason for extor-static-cookie: https://forum.torproject.net/t/tor-relays-how-to-reduce-tor-cpu-load-on-a-si...
Here's my post requesting support in core tor: https://lists.torproject.org/pipermail/tor-dev/2022-February/014695.html