On 01/02/2015 05:20 AM, Yawning Angel wrote:
> So while optimization is cool and all, I'm not seeing why this specifically is the underlying issue.
A lot of people have been reporting underperformance on OpenBSD, and time syscalls are a very common source of performance discrepancy between Linux and *BSD:
https://www.freebsd.org/doc/en/books/porters-handbook/dads-avoiding-linuxism...
They're also the predominant form of syscall made by Tor IME, so I thought it would be worth investigating.
> Each cell can contain 498 bytes of user payload. Looking at things simplistically, this is 800 KiB/s -> 1644 cells/sec, leaving you with approximately 608 microseconds of processing time per cell.
> On my i5-4250U box, gettimeofday() takes 22 ns on Linux, and 2441 ns on FreeBSD. I'm not sure how accurate the FreeBSD results are as it was in a VirtualBox VM (getpid() on the same VM takes 124 ns). If someone has an OpenBSD box they should benchmark gettimeofday() and see how long the call takes.
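The per-cell budget estimate quoted above can be reproduced with a short sketch. The function and constant names are only illustrative, and it assumes, as the estimate does, that the whole 800 KiB/s is carried as 498-byte cell payloads:

```python
# Figures taken from the estimate above.
CELL_PAYLOAD_BYTES = 498            # user payload per cell
THROUGHPUT_BYTES_PER_SEC = 800 * 1024  # 800 KiB/s


def cells_per_second(throughput=THROUGHPUT_BYTES_PER_SEC,
                     payload=CELL_PAYLOAD_BYTES):
    """Cells needed per second to carry the given payload throughput."""
    return throughput / payload


def per_cell_budget_us(throughput=THROUGHPUT_BYTES_PER_SEC,
                       payload=CELL_PAYLOAD_BYTES):
    """Processing time available per cell, in microseconds."""
    return 1_000_000 / cells_per_second(throughput, payload)


if __name__ == "__main__":
    print(f"cells/sec:       {cells_per_second():.0f}")
    print(f"budget per cell: {per_cell_budget_us():.0f} us")
```

Changing the throughput argument shows how quickly the budget shrinks: doubling the bandwidth halves the time available per cell.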
This benchmark:
https://github.com/rtsisyk/time-bench
Gave these outputs:
Count: 1000000
call                         0.01 ns/call
memset                       2.95 ns/call
gettimeofday              1368.05 ns/call
time                      1374.71 ns/call
CLOCK_REALTIME            1344.93 ns/call
CLOCK_MONOTONIC           1226.56 ns/call
CLOCK_PROCESS_CPUTIME_ID  1259.13 ns/call
CLOCK_THREAD_CPUTIME_ID   1308.11 ns/call
This was on OpenBSD 5.6 with an Intel Atom D2700.
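For anyone who wants a rough cross-check without building the C benchmark, here is a minimal sketch in Python. It is not the time-bench tool linked above, the helper names are made up for illustration, and the numbers include interpreter overhead, so they are only useful for comparing the same script across operating systems rather than as absolute syscall costs:

```python
import time


def ns_per_call(fn, count=1_000_000):
    """Average wall-clock nanoseconds per call of fn, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(count):
        fn()
    return (time.perf_counter_ns() - start) / count


def clock_candidates():
    """Clock reads to time; clock_gettime entries are skipped where unavailable."""
    candidates = {"time.time": time.time}
    if hasattr(time, "clock_gettime"):
        for name in ("CLOCK_REALTIME", "CLOCK_MONOTONIC"):
            clock_id = getattr(time, name, None)
            if clock_id is not None:
                # Bind clock_id as a default so each lambda keeps its own clock.
                candidates[name] = lambda cid=clock_id: time.clock_gettime(cid)
    return candidates


if __name__ == "__main__":
    # The empty call gives a baseline for the loop and call overhead.
    baseline = ns_per_call(lambda: None)
    print(f"{'empty call':<16}{baseline:10.2f} ns/call")
    for name, fn in clock_candidates().items():
        print(f"{name:<16}{ns_per_call(fn):10.2f} ns/call")
```

Subtracting the empty-call baseline from each row gives a rough idea of what the clock read itself costs on that system.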
> Taking the FreeBSD case (since we know that tor works fine on Linux), a single gettimeofday() call takes approximately 0.40% of the per-cell processing budget.
> For reference (assuming gettimeofday() in *BSD really is this shit performance-wise), 7000 calls to gettimeofday() is 17.09 ms worth of calls.
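The same arithmetic can be run against each of the per-call timings reported in this thread. This is only a sketch: the function names are illustrative, the 608 microsecond budget and the 7000-call count are the figures quoted above, and the timings come from different machines, so the rows are not directly comparable:

```python
PER_CELL_BUDGET_US = 608   # per-cell budget from the estimate above
CALL_COUNT = 7000          # number of calls used in the comparison above

# gettimeofday() cost per call in nanoseconds, as reported in this thread.
MEASUREMENTS_NS = {
    "Linux (i5-4250U)": 22,
    "FreeBSD (VirtualBox VM, i5-4250U)": 2441,
    "OpenBSD 5.6 (Atom D2700)": 1368.05,
}


def fraction_of_budget(call_ns, budget_us=PER_CELL_BUDGET_US):
    """Share of the per-cell budget consumed by one call (0.0 to 1.0)."""
    return call_ns / (budget_us * 1_000)


def total_ms(calls, call_ns):
    """Total time spent in `calls` calls, in milliseconds."""
    return calls * call_ns / 1_000_000


if __name__ == "__main__":
    for system, call_ns in MEASUREMENTS_NS.items():
        pct = fraction_of_budget(call_ns) * 100
        print(f"{system:<36}{pct:6.3f}% of budget per call, "
              f"{total_ms(CALL_COUNT, call_ns):7.3f} ms per {CALL_COUNT} calls")
```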
I don't know if this is the case, but could all these syscalls increase latency and thereby lower the perceived available bandwidth? I know that my relay can move 10.5 MB/s in each direction, but its advertised bandwidth stays at around 2.8-4.8 MB/s.
I don't know much about TCP, but it seems that if the syscalls were made right when we wanted to read or write each cell (which seems to be the case), it would cause latency greater than if they were just part of a homogeneous workload. I'm probably not using the proper terms here, but do you see what I'm saying?
> The clock code in tor does need love, so I wouldn't object to cleanup, but I'm not sure it's in the state where it's causing the massive performance degradation that you are seeing.
I agree that this probably isn't the sole cause of OpenBSD's Tor woes. Even if not, though, it could still contribute.