On 01/01/2015 03:47 PM, Nick Mathewson wrote:
What part of tor is accounting for all these time/gettimeofday calls?
I set up an OpenBSD 5.6 VM and built Tor 0.2.5.10 on it. I couldn't get chutney working because of a number of strange errors that seemed platform-specific, so I went for a less thorough solution. I can't run a Tor relay from home, so the only activity the Tor process had was some Firefox traffic I created.
There are surprisingly few syscall analysis tools available on OpenBSD, and I'm new to syscall/interrupt tuning, so I eventually settled on the advanced "printf()" technique.
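For anyone who wants to reproduce this, the technique is nothing fancier than a logging wrapper whose output can be counted with grep/wc on stderr. The sketch below is only illustrative (the names are made up); in practice I just dropped fprintf() calls directly into the Tor functions listed below.

#include <stdio.h>
#include <sys/time.h>

/* Hypothetical wrapper: do the real syscall, then log who asked for it
 * so the calls can be counted later with grep/wc on stderr. */
static void
logged_gettimeofday(struct timeval *tv, const char *caller)
{
  gettimeofday(tv, NULL);
  fprintf(stderr, "gettimeofday from %s\n", caller);
}

int
main(void)
{
  struct timeval now;
  logged_gettimeofday(&now, "main");  /* example call site */
  printf("%ld.%06ld\n", (long) now.tv_sec, (long) now.tv_usec);
  return 0;
}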
As a quick refresher for those not familiar, there is a linear hierarchy of time functions in the Tor code, each calling the one below it:
gettimeofday() < tor_gettimeofday() < tor_gettimeofday_cached() < tor_gettimeofday_cached_monotonic()
The first two will always make the syscall, while the other two will first try the cache. I applied print statements to instances of all four.
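To make the caching relationship concrete, here is a rough, simplified sketch of the pattern. This is not Tor's actual implementation - the real functions deal with more details, and the monotonic variant (as I understand it) additionally makes sure the value it returns never moves backwards - but it shows how the levels relate:

#include <stdio.h>
#include <sys/time.h>

static struct timeval cached_tv;
static struct timeval last_monotonic;
static int cache_valid = 0;

/* Always makes the syscall (the tor_gettimeofday() level). */
static void
my_gettimeofday(struct timeval *tv)
{
  gettimeofday(tv, NULL);
}

/* Returns the cached value if there is one, otherwise fetches and
 * caches it (the tor_gettimeofday_cached() level). */
static void
my_gettimeofday_cached(struct timeval *tv)
{
  if (!cache_valid) {
    my_gettimeofday(&cached_tv);
    cache_valid = 1;
  }
  *tv = cached_tv;
}

/* Like the cached version, but also refuses to move backwards relative
 * to the last value it handed out (the tor_gettimeofday_cached_monotonic()
 * level, as I understand it). */
static void
my_gettimeofday_cached_monotonic(struct timeval *tv)
{
  struct timeval now;
  my_gettimeofday_cached(&now);
  if (timercmp(&now, &last_monotonic, <))
    now = last_monotonic;
  last_monotonic = now;
  *tv = now;
}

/* Invalidates the cache; in Tor this is effectively what the libevent
 * read/write callbacks discussed further down do. */
static void
my_cache_clear(void)
{
  cache_valid = 0;
}

int
main(void)
{
  struct timeval a, b, c;
  my_gettimeofday_cached(&a);            /* syscall */
  my_gettimeofday_cached_monotonic(&b);  /* served from the cache */
  my_cache_clear();
  my_gettimeofday_cached(&c);            /* syscall again */
  printf("%ld %ld %ld\n", (long) a.tv_sec, (long) b.tv_sec, (long) c.tv_sec);
  return 0;
}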
There was a steady stream of tor_gettimeofday() calls from circuit_expire_building() - about two per second with a single idle TCP connection.
When there was active traffic flow, there were many more gettimeofday() calls that were almost exclusively the product of tor_gettimeofday_cached_monotonic(). This function is called in three places:
* once in src/or/relay.c
* once in src/or/buffers.c
* once in src/or/circuitlist.c
The first two account for the bulk of the calls, as they are in the core data relaying logic.
Ultimately, the problem seems to be that the caching is very weak. At most, only half of the calls to tor_gettimeofday_cached_monotonic() use the cache. Judging from the flood of print statements, loading a single simple HTML page (http://www.openbsd.org/faq/ports/guide.html, to be exact) causes >30 gettimeofday() syscalls. You can imagine how that would accumulate for an exit carrying 800 KB/s if the caching doesn't improve much with additional circuits.
If I understand correctly, this implies either a bug in the caching logic or the simple fact that libevent calls back and clears the cache too much for the caching to be very useful.
The cache is only cleared in two places (both pretty self-explanatory, and both triggered by libevent callbacks):
* connection_handle_read()
* connection_handle_write()
Apparently, the above functions are called >30 times when fetching a JS-less non-HTTPS HTML page. I don't know enough about TCP to comment on whether this makes sense.
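To make the arithmetic concrete, here is a toy simulation of that interaction (the numbers and structure are entirely made up, not taken from Tor): each simulated read/write callback clears the cache on entry, so every callback costs at least one fresh syscall no matter how many cells it handles, and the syscall count ends up tracking the number of callbacks rather than the amount of data.

#include <stdio.h>
#include <sys/time.h>

static struct timeval cached_tv;
static int cache_valid = 0;
static unsigned long syscalls = 0;

/* Cached lookup, counting how often we actually hit the kernel. */
static void
cached_now(struct timeval *tv)
{
  if (!cache_valid) {
    gettimeofday(&cached_tv, NULL);
    syscalls++;
    cache_valid = 1;
  }
  *tv = cached_tv;
}

/* Stand-in for connection_handle_read()/connection_handle_write():
 * clear the cache once per callback, then let every cell handled in
 * that callback ask for the (now shared) timestamp. */
static void
handle_io_callback(int cells)
{
  struct timeval now = { 0, 0 };
  int i;
  cache_valid = 0;      /* cache cleared at callback entry */
  for (i = 0; i < cells; i++)
    cached_now(&now);   /* relay/buffer code asking for the time */
}

int
main(void)
{
  int cb;
  /* Pretend a small page fetch produces ~35 read/write callbacks,
   * each moving a handful of cells. */
  for (cb = 0; cb < 35; cb++)
    handle_io_callback(4);
  printf("gettimeofday() syscalls: %lu\n", syscalls);  /* prints 35 */
  return 0;
}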
Sorry for the magnum opus. I thought I'd try to clarify the context for those interested, and typing this helped organize my thoughts.
Let me know if I've made any mistakes thus far. I'll start looking into potential improvements. I'm eager to work with someone on this if anyone's interested. Also, if you have any ideas or suggestions, please share.
Thanks for reading!
Libertas