On 1 Jan 2015, at 07:39 , Greg Troxel gdt@lexort.com wrote:
Libertas libertas@mykolab.com writes:
Some of the people at tor-bsd@lists.nycbug.org and I are trying to figure out why Tor relays under-perform when running on OpenBSD. Many such relays aren't even close to being network-bound, file-descriptor-bound, memory-bound, or CPU-bound, but relay at least 33-50% less traffic than would be expected of a Linux machine in the same situation.
I'm more familiar with NetBSD, but hopefully my comments are helpful.
For those not familiar, a Tor relay will eventually have an open TCP connection for each of the other >6,000 active relays, and (if it allows exit traffic) must make outside TCP connections for the user's requests, so it's pretty file-hungry and crypto-intensive.
It may also have something to do with TCP. A few thoughts:
- run netstat -f inet and look at the send queues. That's not really cleanly diagnostic, but if they are all huge, it's a clue.
- run netstat -m and vmstat -m (not sure those map from NetBSD). Look for running out of mbufs and mbuf clusters. Perhaps bump up NMBCLUSTERS in the kernel if it's not dynamic.
Tor 0.2.6.2-alpha (just in the process of being released) has some changes to queuing behaviour using the KIST algorithm.
The KIST algorithm keeps the queues inside tor, and makes prioritisation decisions from there, rather than writing as much as possible to the OS TCP queues. I'm not sure how functional it is on *BSDs, but Nick Mathewson should be able to comment on that. (I've cc'd tor-dev and Nick.)
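For anyone wondering what KIST actually needs from the kernel: as I understand it, on Linux it can query per-socket TCP state with getsockopt(TCP_INFO) and only hand the kernel roughly what fits in the congestion window, keeping the rest queued inside tor where it can still be reprioritised. Below is a minimal sketch of that query. It is illustrative only; the helper name and the write-limit formula are mine, not the exact KIST calculation, and it assumes the Linux tcp_info fields.

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_INFO and struct tcp_info (Linux) */

    /* Illustrative helper (not tor code): ask the kernel how much we
     * could write to this socket without building a deep kernel queue. */
    static ssize_t
    estimate_tcp_write_space(int fd)
    {
      struct tcp_info info;
      socklen_t len = sizeof(info);
      uint64_t cwnd, inflight;

      memset(&info, 0, sizeof(info));
      if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
        return -1; /* no such interface on this platform */

      /* Congestion window minus bytes already in flight, converted
       * from packets to bytes with the sender MSS. */
      cwnd     = (uint64_t)info.tcpi_snd_cwnd * info.tcpi_snd_mss;
      inflight = (uint64_t)info.tcpi_unacked  * info.tcpi_snd_mss;
      return cwnd > inflight ? (ssize_t)(cwnd - inflight) : 0;
    }

Presumably the per-OS differences described below come down to how much of this kind of socket state each kernel is willing to expose.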
On Thu, 1 Jan 2015 14:19:08 +1100 teor teor2345@gmail.com wrote:
Tor 0.2.6.2-alpha (just in the process of being released) has some changes to queuing behaviour using the KIST algorithm.
The KIST algorithm keeps the queues inside tor, and makes prioritisation decisions from there, rather than writing as much as possible to the OS TCP queues. I'm not sure how functional it is on *BSDs, but Nick Mathewson should be able to comment on that. (I've cc'd tor-dev and Nick.)
I don't think we merged that branch yet, since it's not ready for general use. Additionally, it's not currently functional on the *BSDs; last I checked, the KIST code is only used under Linux. The full portability story is in #12890, but it looks roughly like:
* Linux - Supported.
* Windows - Possible, needs code in tor.
* Darwin - Possible, uses interfaces marked as undocumented/internal.
* FreeBSD - Requires a trivial kernel patch (interface is there, information exposed is incomplete).
* Other BSDs - Requires a kernel patch, which is more involved than the FreeBSD one (implementing the required interface vs exposing more information). The patch is still trivial for anyone that's familiar with the TCP/IP code.
I don't think we should be in the business of maintaining kernel patches either, so I'm not sure what the right thing to do would be for non-Darwin *BSD.
Regards,
I just ran ktrace/kdump (used for observing system calls) on the Tor process of my exit node, which relays about 800 KB/s. It listed >400,000 calls to gettimeofday(). The list was swamped with them.
I think I remember reading somewhere that that sort of system call is way slower in OpenBSD than Linux. Could this be related to the issue? I've found a lot of similarly mysterious slowdowns related to *BSD gettimeofday() on other projects' bug trackers, but nothing definitive.
As you've likely noticed, although I started this discussion I'm very new to system-level performance tuning. Let me know if I'm not making sense, or if there's something else I should be focusing on.
On Dec 31, 2014 10:38 PM, "Yawning Angel" yawning@schwanenlied.me wrote:
I don't think we merged that branch yet, since it's not ready for general use. Additionally, it's not currently functional on the *BSDs. The KIST code last I checked only is used under Linux.
This is right; the KIST code is not currently in Tor at all, and the patch I wrote is not running on bsd at all.
Two questions I have:
1. Is this behavior a regression against some older tor version?
2. What part of tor is accounting for all these time/gettimeofday calls?
I'm working on question #2 now - I'm less well equipped to work on #1 at the moment.
It's probably also worth mentioning that the dump with >400,000 calls to gettimeofday() also had 30,714 calls to clock_gettime().
On 01/01/2015 03:47 PM, Nick Mathewson wrote:
What part of tor is accounting for all these time/gettimeofday calls?
I set up an OpenBSD 5.6 VM and built Tor 0.2.5.10 on it. I couldn't get chutney working because of a number of strange errors that seemed platform-specific, so I went for a less thorough solution. I can't run a Tor relay from home, so the only activity the Tor process had was some Firefox traffic I created.
There are surprisingly few syscall analysis tools available on OpenBSD, and I'm new to syscall/interrupt tuning, so I eventually ended up selecting the advanced "printf()" technique.
As a quick refresher for those not familiar, there is a linear hierarchy of time functions in the Tor code, each calling the one below it:
gettimeofday() < tor_gettimeofday() < tor_gettimeofday_cached() < tor_gettimeofday_cached_monotonic()
The first two will always make the syscall, while the other two will first try the cache. I applied print statements to instances of all four.
There was a steady stream of tor_gettimeofday() calls from circuit_expire_building() - about two per second with a single idle TCP connection.
When there was active traffic flow, there were many more gettimeofday() calls that were almost exclusively the product of tor_gettimeofday_cached_monotonic(). This function is called in three places:
* once in src/or/relay.c
* once in src/or/buffers.c
* once in src/or/circuitlist.c
The first two account for the bulk of the calls, as they are in the core data relaying logic.
Ultimately, the problem seems to be that the caching is very weak. At most, only half of the calls to tor_gettimeofday_cached_monotonic() use the cache. It appears in the vomiting print statements that loading a single simple HTML page (http://www.openbsd.org/faq/ports/guide.html to be exact) will cause >30 gettimeofday() syscalls. You can imagine how that would accumulate for an exit carrying 800 KB/s if the caching doesn't improve much with additional circuits.
If I understand correctly, this implies either a bug in the caching logic or the simple fact that libevent calls back and clears the cache too much for the caching to be very useful.
The cache is only cleared in two places (both pretty self-explanatory, and both triggered by libevent callbacks):
* connection_handle_read()
* connection_handle_write()
Apparently, the above functions are called >30 times when fetching a JS-less non-HTTPS HTML page. I don't know enough about TCP to comment on whether this makes sense.
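For readers without the source handy, the pattern looks roughly like the sketch below. This is a deliberately simplified illustration of the caching scheme described above, not the actual Tor code; the "sketch_" names are mine, and the real functions also deal with monotonicity and error handling.

    #include <string.h>
    #include <sys/time.h>

    static struct timeval cached_now;
    static int cache_valid = 0;

    /* Like tor_gettimeofday(): always makes the syscall. */
    static void
    sketch_gettimeofday(struct timeval *tv)
    {
      gettimeofday(tv, NULL);
    }

    /* Like tor_gettimeofday_cached(): reuse the last value until the
     * cache is cleared; only fall back to the syscall on a miss. */
    static void
    sketch_gettimeofday_cached(struct timeval *tv)
    {
      if (!cache_valid) {
        sketch_gettimeofday(&cached_now);
        cache_valid = 1;
      }
      memcpy(tv, &cached_now, sizeof(*tv));
    }

    /* Cleared from connection_handle_read()/connection_handle_write(),
     * i.e. on every libevent read/write callback, which is why the
     * hit rate is so poor under real traffic. */
    static void
    sketch_clear_time_cache(void)
    {
      cache_valid = 0;
    }

    int
    main(void)
    {
      struct timeval tv;
      sketch_gettimeofday_cached(&tv); /* miss: one real syscall */
      sketch_gettimeofday_cached(&tv); /* hit: no syscall */
      sketch_clear_time_cache();       /* what each read/write callback does */
      sketch_gettimeofday_cached(&tv); /* miss again */
      return 0;
    }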
Sorry for the magnum opus. I thought I'd try to clarify the context for those interested, and typing this helped organize my thoughts.
Let me know if I've made any mistakes thus far. I'll start looking into potential improvements. I'm eager to work with someone on this if anyone's interested. Also, if you have any ideas or suggestions, please share.
Thanks for reading!
Libertas
I also completely forgot to mention this comment:
https://trac.torproject.org/projects/tor/ticket/14056#comment:6
I'll look into doing the replacements. If anyone has concerns about whether approx_time is exact enough, please let me know - I had never heard of it previously. I'll look into it myself as well.
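For context on what that trade-off looks like: as I understand it, approx_time() returns a process-wide time_t that tor refreshes only about once per second from the main loop, so it costs no syscall but is only accurate to roughly a second. A simplified sketch of the idea (not the actual Tor code; the "sketch_" names are mine):

    #include <time.h>

    /* Coarse process-wide clock, refreshed about once per second by
     * the main loop, so hot-path callers pay no syscall at all. */
    static time_t cached_approx_time = 0;

    /* Called from the once-per-second event-loop callback. */
    static void
    sketch_update_approx_time(time_t now)
    {
      cached_approx_time = now;
    }

    /* Cheap, but only accurate to roughly one second. */
    static time_t
    sketch_approx_time(void)
    {
      return cached_approx_time;
    }

So whether it is exact enough probably depends on whether the call site needs sub-second resolution (e.g. timestamps used for queue timing) or just a rough wall-clock value.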
On Thu, 01 Jan 2015 23:42:42 -0500 Libertas libertas@mykolab.com wrote:
The first two account for the bulk of the calls, as they are in the core data relaying logic.
Ultimately, the problem seems to be that the caching is very weak. At most, only half of the calls to tor_gettimeofday_cached_monotonic() use the cache. It appears in the vomiting print statements that loading a single simple HTML page (http://www.openbsd.org/faq/ports/guide.html to be exact) will cause >30 gettimeofday() syscalls. You can imagine how that would accumulate for an exit carrying 800 KB/s if the caching doesn't improve much with additional circuits.
So while optimization is cool and all, I'm not seeing why this specifically is the underlying issue.
Each cell can contain 498 bytes of user payload. Looking at things simplistically this is 800 KiB/s -> 1644 cells/sec, leaving you with approximately 608 microseconds of processing time per cell.
On my i5-4250U box, gettimeofday() takes 22 ns on Linux, and 2441 ns on FreeBSD. I'm not sure how accurate the FreeBSD results are as it was in a VirtualBox VM (getpid() on the same VM takes 124 ns). If someone has an OpenBSD box, they should benchmark gettimeofday() and see how long the call takes.
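If anyone wants to reproduce this on an OpenBSD box, a micro-benchmark along these lines should be close enough (timing is done with CLOCK_MONOTONIC, loop overhead is ignored, and the exact numbers will of course vary with hardware and VM setup):

    #include <stdio.h>
    #include <sys/time.h>
    #include <time.h>

    #define ITERS 1000000

    int
    main(void)
    {
      struct timespec start, end;
      struct timeval tv;
      volatile long sink = 0; /* keep the compiler from dropping the loop */
      double ns;
      int i;

      clock_gettime(CLOCK_MONOTONIC, &start);
      for (i = 0; i < ITERS; i++) {
        gettimeofday(&tv, NULL);
        sink += tv.tv_usec;
      }
      clock_gettime(CLOCK_MONOTONIC, &end);

      ns = (end.tv_sec - start.tv_sec) * 1e9 +
           (end.tv_nsec - start.tv_nsec);
      printf("gettimeofday: %.1f ns/call\n", ns / ITERS);
      (void)sink;
      return 0;
    }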
Taking the FreeBSD case (since we know that tor works fine on Linux), a single gettimeofday() call takes approximately 0.39% of the per-cell processing budget.
For reference (assuming gettimeofday() in *BSD really is this shit performance wise), 7000 calls to gettimeofday() is 17.09 ms worth of calls.
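Spelling the arithmetic out, in case anyone wants to plug in their own measurements (the 498-byte payload and the 2441 ns figure are taken from above; everything else is just division):

    #include <stdio.h>

    int
    main(void)
    {
      const double bytes_per_sec   = 800.0 * 1024.0; /* 800 KiB/s */
      const double payload         = 498.0;          /* user bytes per cell */
      const double gettimeofday_ns = 2441.0;         /* measured FreeBSD cost */
      double cells_per_sec, budget_ns;

      cells_per_sec = bytes_per_sec / payload;   /* roughly 1.6k cells/sec */
      budget_ns     = 1e9 / cells_per_sec;       /* roughly 608 us per cell */

      printf("cells/sec:          %.0f\n", cells_per_sec);
      printf("per-cell budget:    %.0f us\n", budget_ns / 1e3);
      printf("one gettimeofday(): %.2f%% of the budget\n",
             100.0 * gettimeofday_ns / budget_ns);
      printf("7000 calls:         %.2f ms\n", 7000.0 * gettimeofday_ns / 1e6);
      return 0;
    }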
The clock code in tor does need love, so I wouldn't object to cleanup, but I'm not sure it's in the state where it's causing the massive performance degradation that you are seeing.
Regards,
I would be willing to take a look at the clock cleanup and work on it if you were able to give me some ideas on where I should start and what to focus on for the cleanup and rewrite.
Sent from my iPhone
On 01/02/2015 05:20 AM, Yawning Angel wrote:
So while optimization is cool and all, I'm not seeing why this specifically is the underlying issue.
A lot of people have been reporting underperformance on OpenBSD, and time syscalls are a very common source of performance discrepancy between Linux and *BSD:
https://www.freebsd.org/doc/en/books/porters-handbook/dads-avoiding-linuxism...
They're also the predominant form of syscall made by Tor IME, so I thought it would be worth investigating.
Each cell can contain 498 bytes of user payload. Looking at things simplistically this is 800 KiB/s -> 1644 cells/sec, leaving you with approximately 608 microseconds of processing time per cell.
On my i5-4250U box, gettimeofday() takes 22 ns on Linux, and 2441 ns on FreeBSD. I'm not sure how accurate the FreeBSD results are as it was in a VirtualBox VM (getpid() on the same VM takes 124 ns). If someone has an OpenBSD box, they should benchmark gettimeofday() and see how long the call takes.
This benchmark (https://github.com/rtsisyk/time-bench) gave these outputs:
Count: 1000000
call                        0.01 ns/call
memset                      2.95 ns/call
gettimeofday             1368.05 ns/call
time                     1374.71 ns/call
CLOCK_REALTIME           1344.93 ns/call
CLOCK_MONOTONIC          1226.56 ns/call
CLOCK_PROCESS_CPUTIME_ID 1259.13 ns/call
CLOCK_THREAD_CPUTIME_ID  1308.11 ns/call
This was on OpenBSD 5.6 with an Intel Atom D2700.
Taking the FreeBSD case (since we know that tor works fine on Linux), a single gettimeofday() call takes approximately 0.39% of the per-cell processing budget.
For reference (assuming gettimeofday() in *BSD really is this shit performance wise), 7000 calls to gettimeofday() is 17.09 ms worth of calls.
I don't know if this is the case, but could all these syscalls increase latency and thereby lower the perceived available bandwidth? I know that my relay can move 10.5 MB/s in each direction, but its advertised bandwidth stays at around 2.8-4.8 MB/s.
I don't know much about TCP, but it seems that if the syscalls are made right when we want to read or write each cell (which seems to be the case), they would add more latency than if they were simply part of a homogeneous background workload. I'm probably not using the proper terms here, but do you see what I'm saying?
The clock code in tor does need love, so I wouldn't object to cleanup, but I'm not sure it's in the state where it's causing the massive performance degradation that you are seeing.
I agree that this probably isn't the sole cause of OpenBSD's Tor woes. Even if not, though, it could still contribute.
Yawning Angel yawning@schwanenlied.me writes:
I don't think we should be in the business of maintaining kernel patches either, so I'm not sure what the right thing to do would be for non-Darwin *BSD.
I think it makes sense to get these patches into the various systems. To ease that, it would be good to have (on the experimental track, perhaps) an RFC or Internet-Draft or the equivalent with the rules for the mechanism and the API specs. That both makes it easier for people and makes it more likely that the new mechanism won't be a moving target.