Hello everyone!
For a while now, I've noted that when running a non-exit relay under
fairly heavy load that is allowed to be utilised as a client, it has a
tendency to stall when trying to create new circuits.
If the relay has not been used as a client for a while and has a need
to create new circuits, CPU load shoots up to 100 % and the stalls for
some amount of time. I've seen stalls of over 1 minute.
I've taken two profiling traces of the relay, one during normal
operations and during the stalling behaviour. This was done on version
0.4.4.6 and each profile is approximately a 10 second recording.
Normal case:
>- 84,34% event_base_loop
> - 80,19% event_process_active_single_queue
> - 42,87% scheduler_evt_callback
> - 42,86% kist_scheduler_run
> + 20,23% connection_handle_write
> + 9,47% update_socket_info_impl
> + 5,07% channel_flush_some_cells
> + 2,49% smartlist_pqueue_pop
> + 2,19% outbuf_table_add
> + 1,06% smartlist_pqueue_add
> + 28,38% conn_read_callback
> - 5,27% periodic_event_dispatch
> - 3,32% second_elapsed_callback
> + 1,69% run_connection_housekeeping
> + 0,98% circuit_build_needed_circs
> + 1,73% save_state_callback
> + 1,50% conn_write_callback
> + 1,02% connection_ap_attach_pending
> + 4,02% epoll_dispatch
Stalling case:
>- 86,27% event_base_loop
> - 85,52% event_process_active_single_queue
> - 31,78% periodic_event_dispatch
> - 30,00% second_elapsed_callback
> - 28,84% circuit_build_needed_circs
> - 15,35% connection_ap_attach_pending
> - 15,34% connection_ap_handshake_attach_circuit
> - 15,34% circuit_get_open_circ_or_launch
> 10,14% circuit_get_best
> - 4,99% circuit_launch_by_extend_info
> - 4,88% circuit_establish_circuit
> + 2,11% router_choose_random_node
> + 2,05% nodelist_add_node_and_family
> 0,69% count_acceptable_nodes.isra.0
> + 13,39% circuit_launch_by_extend_info
> + 0,80% run_connection_housekeeping
> + 1,74% save_state_callback
> - 30,76% connection_ap_attach_pending
> - connection_ap_handshake_attach_circuit
> - 30,74% circuit_get_open_circ_or_launch
> - 30,63% circuit_get_best
> 1,18% circuit_get_global_list
> + 11,50% conn_read_callback
> + 10,47% scheduler_evt_callback
> - 10,47% scheduler_evt_callback
> - kist_scheduler_run
> + 6,27% connection_handle_write
> + 1,54% channel_flush_some_cells
> + 0,98% update_socket_info_impl
> + 0,91% smartlist_pqueue_pop
> 0,75% epoll_dispatch
(The traces above are meant to be indented for legibility, but some
mail clients think they know better.)
Has anyone seen similar behaviour or have an inkling as to what's
going on? I thought it best to ask before diving into the code.
Cordially,
Andreas Kempe