Hello everyone!
For a while now, I've noted that when running a non-exit relay under fairly heavy load that is allowed to be utilised as a client, it has a tendency to stall when trying to create new circuits.
If the relay has not been used as a client for a while and has a need to create new circuits, CPU load shoots up to 100 % and the stalls for some amount of time. I've seen stalls of over 1 minute.
I've taken two profiling traces of the relay, one during normal operations and during the stalling behaviour. This was done on version 0.4.4.6 and each profile is approximately a 10 second recording.
Normal case:
- 84,34% event_base_loop
- 80,19% event_process_active_single_queue
- 42,87% scheduler_evt_callback
- 42,86% kist_scheduler_run
- 20,23% connection_handle_write
- 9,47% update_socket_info_impl
- 5,07% channel_flush_some_cells
- 2,49% smartlist_pqueue_pop
- 2,19% outbuf_table_add
- 1,06% smartlist_pqueue_add
- 28,38% conn_read_callback
- 5,27% periodic_event_dispatch
- 3,32% second_elapsed_callback
- 1,69% run_connection_housekeeping
- 0,98% circuit_build_needed_circs
- 1,73% save_state_callback
- 1,50% conn_write_callback
- 1,02% connection_ap_attach_pending
- 4,02% epoll_dispatch
Stalling case:
- 86,27% event_base_loop
- 85,52% event_process_active_single_queue
- 31,78% periodic_event_dispatch
- 30,00% second_elapsed_callback
- 28,84% circuit_build_needed_circs
- 15,35% connection_ap_attach_pending - 15,34% connection_ap_handshake_attach_circuit - 15,34% circuit_get_open_circ_or_launch 10,14% circuit_get_best - 4,99% circuit_launch_by_extend_info - 4,88% circuit_establish_circuit + 2,11% router_choose_random_node + 2,05% nodelist_add_node_and_family 0,69% count_acceptable_nodes.isra.0
- 13,39% circuit_launch_by_extend_info
- 0,80% run_connection_housekeeping
- 1,74% save_state_callback
- 30,76% connection_ap_attach_pending
- connection_ap_handshake_attach_circuit
- 30,74% circuit_get_open_circ_or_launch
- 30,63% circuit_get_best 1,18% circuit_get_global_list
0,75% epoll_dispatch
- 11,50% conn_read_callback
- 10,47% scheduler_evt_callback
- 10,47% scheduler_evt_callback
- kist_scheduler_run
- 6,27% connection_handle_write
- 1,54% channel_flush_some_cells
- 0,98% update_socket_info_impl
- 0,91% smartlist_pqueue_pop
(The traces above are meant to be indented for legibility, but some mail clients think they know better.)
Has anyone seen similar behaviour or have an inkling as to what's going on? I thought it best to ask before diving into the code.
Cordially, Andreas Kempe