0. Summary
As part of my end-of-month deliverables, I've been trying to
figure out how Tor fits together, in order to identify the most
important areas to test, the hardest areas to test, and the best
ways to modularize Tor in the future. Here's my current draft.
Some of these ideas are half-baked; many will need more rounds of
revision before they can be acted on.
For more information see tickets #15236 and #15234.
1. Testing Tor: Where we should focus
1.1. Let's focus on complicated, volatile things.
These functions are the highest in length and in complexity:
circuit_expire_building
parse_port_config
run_scheduled_events
connection_ap_handshake_rewrite_and_attach
options_act
networkstatus_parse_vote_from_string
connection_dir_client_reached_eof
directory_handle_command_get
networkstatus_compute_consensus
options_validate
If we rank functions by the number of commits that have code
still live in them, the same functions top the list.
And if we focus on average commits per line of code in a
function, weighted in various ways, these stand out:
add_stream_log_impl
connection_close_immediate
connection_connect_sockaddr
connection_dir_client_reached_eof
connection_edge_destroy
connection_exit_begin_conn
connection_handle_write_impl
consider_testing_reachability
directory_info_has_arrived
init_keys
options_act
options_validate
router_add_to_routerlist
router_dump_router_to_string
tor_free_all
router_rebuild_descriptor
run_scheduled_events
If we focus on how many changes have *ever* been made to a
function, we get these as the top offenders:
router_rebuild_descriptor
onion_extend_cpath
directory_handle_command
dns_resolve
assert_connection_ok
connection_handle_write
connection_handle_listener_read
dns_found_answer
circuit_extend
main
connection_edge_process_inbuf
connection_ap_handshake_attach_circuit
config_assign
init_keys
circuit_send_next_onion_skin
connection_dir_process_inbuf
prepare_for_poll
connection_exit_begin_conn
connection_ap_handshake_process_socks
router_dump_router_to_string
fetch_from_buf_socks
getconfig
run_scheduled_events
connection_edge_process_relay_cell
do_main_loop
No single list above is uniquely the right one to focus on;
instead, we should probably use them in tandem to identify areas
to work on.
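To make the commits-per-line metric concrete, here is a small
Python sketch. The blame data is invented (in practice it would
come from parsing 'git blame' output), and the two function names
are just examples taken from the lists above.

```python
def commits_per_line(blame):
    """Rank functions by churn density: the number of distinct
    commits touching a function, divided by its line count.
    `blame` maps function name -> one commit id per line."""
    scores = {}
    for func, line_commits in blame.items():
        scores[func] = len(set(line_commits)) / len(line_commits)
    return sorted(scores, key=scores.get, reverse=True)

# Invented blame data: options_act has 4 distinct commits over 5
# lines; tor_free_all has 1 commit over 4 lines.
blame = {
    "options_act": ["a", "b", "c", "d", "a"],
    "tor_free_all": ["e", "e", "e", "e"],
}
print(commits_per_line(blame))  # → ['options_act', 'tor_free_all']
```

A real run would weight this by line counts or recency, as the
text suggests, but the ranking idea is the same.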
1.2. What will be hardest to test?
The entire "blob" described in section 2.1 below should be
considered hard-to-test because of the number of functions
reachable from one another. Removing some of this callgraph
complexity will help us test, but it's not completely trivial.
See Appendix A for a list of functions of this kind.
2. Untangling Tor: Modules and You
If you've taken a software engineering class, you were probably
taught how to diagram the modules or layers of a software
project. I tried this with Tor back in 2005 and my head spun;
things have not gotten easier since then.
Looking at Tor's call graphs tells us a lot about our issues with
modularizing Tor. In the abstract, we'd like to divide Tor into
layers and modules, with most modules only calling a few others,
and the "central" functions limited to as little code as possible.
This will make modules more independent and help us test them in
isolation. This will also help us with future design ideas like
splitting Tor into separate processes for testing or security or
robustness purposes.
2.1. Fun with callgraphs
I did some analysis of Tor's callgraph to figure out where we
stand today. About 27% of the functions that Tor calls
(internally and externally) can reach 10 or fewer other functions
that Tor knows about. About 82% can reach only a limited portion
of Tor.
Then there comes a block of functions that can reach _nearly every
other function_ in the block, or in Tor. They are all reachable
from one another. There are about 450 of these, or 11% of the
functions in Tor. I'll call them "The Blob."
Finally, there are functions that can reach the Blob, and all the
functions that the Blob calls, but which are not themselves part
of the Blob. These are about 100 functions, or 2% of the
functions in Tor.
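In graph terms, the Blob is a large strongly connected component
(SCC) of the callgraph, and the Blob-callers are nodes that can
reach that component without belonging to it. Here is a minimal
Python sketch of how both sets can be computed; the toy graph and
its function names are invented stand-ins, not Tor's real
callgraph.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm: return the SCCs of a directed graph
    given as {node: [successors]}."""
    seen, order = set(), []

    def dfs(node, g, out):
        # Iterative DFS recording nodes in post-order.
        stack = [(node, iter(g.get(node, ())))]
        seen.add(node)
        while stack:
            n, it = stack[-1]
            advanced = False
            for m in it:
                if m not in seen:
                    seen.add(m)
                    stack.append((m, iter(g.get(m, ()))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(n)

    nodes = set(graph) | {m for s in graph.values() for m in s}
    for n in nodes:
        if n not in seen:
            dfs(n, graph, order)
    rev = {}
    for n, succs in graph.items():
        for m in succs:
            rev.setdefault(m, []).append(n)
    seen.clear()
    sccs = []
    for n in reversed(order):
        if n not in seen:
            comp = []
            dfs(n, rev, comp)
            sccs.append(set(comp))
    return sccs

def reachable(graph, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, ()))
    return seen

# Invented mini-callgraph: three mutually reachable functions
# (the "Blob"), one caller, and one leaf callee.
calls = {
    "conn_read": ["conn_write"],
    "conn_write": ["run_events"],
    "run_events": ["conn_read", "log_msg"],
    "main_loop": ["conn_read"],
    "log_msg": [],
}
blob = max(strongly_connected_components(calls), key=len)
# Blob-callers: functions outside the Blob that can still reach it.
callers = {n for n in calls if n not in blob and reachable(calls, n) & blob}
print(sorted(blob), sorted(callers))
```

On this toy graph the Blob is {conn_read, conn_write, run_events}
and main_loop is its only outside caller, mirroring the 450/100
split described above.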
Breaking up the Blob is a high priority for making Tor more
modular.
So, how can we do that?
* Let's start by looking at which functions, if they were made
unnecessary, or if they stopped calling into the Blob, would
remove the most other functions from the Blob.
* Let's also start by hand-annotating the Blob with a list of
functions that logically shouldn't be part of the Blob, and then
seeing what they call that pulls them into the Blob.
Here are some preliminary results for the best functions to try to
remove from the Blob, and some antipatterns to avoid:
* The following functions pull many, many other functions into
the Blob:
control_event_*
router_rebuild_descriptor_
directory_fetches_from_authorities
*mark_for_close*
*free
*free_
If we could stop them from calling back into the Blob, we would
shrink the Blob down to 254 functions, and reduce the number of
outside functions calling into the Blob to 74.
Also worth investigating is whether to remove the call from
'connection_write_to_buf_impl_' to 'connection_handle_write',
and the call from 'connection_handle_write_impl' to
'connection_handle_read'. Either of these changes would sharply
reduce the number of functions reachable from Blob functions,
and would shift many functions from Blob members to Blob
callers.
Finally, these might be more work to unblob:
'connection_or_change_state', 'tor_cleanup', 'init_keys', and
'connection_or_close_for_error'. But doing so would take the
Blob down to 60 functions, and the number of Blob callers up
to 118.
* Many functions are meant simply to note that an event has
occurred so that some work can happen later. We could refactor
most of these to use a deferred-callback mechanism rather than
invoking the functions that do the work directly. This would
also help us decouple Blob functions from one another.
* The Blob functions are listed in Appendix A.1, with the ones
that would still be in the Blob after the changes above listed
in Appendix A.2.
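One way to evaluate edge-removal candidates like the ones above
is to delete the call edge from the callgraph and re-measure the
largest strongly connected component. A toy Python sketch, with
an invented three-function graph and shortened hypothetical names
(not Tor's real functions or numbers):

```python
def reachable(graph, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, ()))
    return seen

def largest_scc_size(graph):
    """Size of the largest strongly connected component, via a
    naive mutual-reachability check (fine for a toy graph)."""
    nodes = set(graph) | {m for succs in graph.values() for m in succs}
    best = 0
    for n in nodes:
        comp = {m for m in reachable(graph, n)
                if n in reachable(graph, m)}
        best = max(best, len(comp))
    return best

# Invented mini-callgraph: three mutually recursive functions.
calls = {
    "write_to_buf": ["handle_write"],
    "handle_write": ["note_event"],
    "note_event": ["write_to_buf", "handle_write"],
}
before = largest_scc_size(calls)
calls["write_to_buf"] = []   # cut write_to_buf -> handle_write
after = largest_scc_size(calls)
print(before, after)  # → 3 2
```

Cutting the edge turns write_to_buf from a Blob member into a
mere callee, which is exactly the member-to-caller shift the text
describes.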
2.2. Separating modules into processes and libraries; module-level APIs.
Let's imagine this architecture: Tor runs as a series of
conceptually separate processes that communicate over a reasonable
IPC mechanism. These processes correspond to module boundaries;
their APIs need to be carefully designed. There is a central
communications process that does the work of transferring commands
and replies from A to B.
APIs would be divided into publish/subscribe types and
remote-procedure-call types. Modules would be divided into
stateful and non-stateful; non-stateful ones could be shared
libraries, or, if their processing is sensitive, they could be
kept in heavily sandboxed processes.
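A publish/subscribe API of the kind imagined here might look like
the following Python sketch. The hub, the topic name, and the
subscribing "modules" are all invented for illustration; the
point is that a publisher never calls its consumers directly.

```python
class Bus:
    """Toy publish/subscribe hub: modules subscribe to named
    topics, and publishers stay ignorant of their consumers."""
    def __init__(self):
        self._subs = {}

    def subscribe(self, topic, callback):
        self._subs.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for cb in self._subs.get(topic, []):
            cb(payload)

bus = Bus()
log = []
# A "directory" module and a "control" module each react to a
# descriptor event without the networking code knowing about them.
bus.subscribe("descriptor_uploaded", lambda d: log.append(("dir", d)))
bus.subscribe("descriptor_uploaded", lambda d: log.append(("control", d)))
bus.publish("descriptor_uploaded", "router-xyz")
print(log)  # → [('dir', 'router-xyz'), ('control', 'router-xyz')]
```

With this shape, adding a new consumer changes only the consumer,
which is what would let modules like the controller sit as a
layer on top of the rest.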
How do we get there?
To make the proposal concrete, I suggest for a 0th-order draft
that we imagine the following simplified version:
* That all src/common modules become shared libraries with no
state.
* That we divide (or imagine how to divide) the contents of
src/or into separate modules, with more clear connections
between them. We imagine doing this one part at a time,
starting by adding a nice simple publish/subscribe mechanism.
* That the controller API be reimagined as a layer on top of
the rest of this.
* That any multiprocess version of this be imagined, for
concreteness, as ProtocolBufs over nanomsg. (I'm not tied to
this particular format/transport combination; I'm just
proposing it for concreteness.)
We would need a transition plan here. I'd suggest that some of
the easiest modules to extract first would be voting, rephist,
node information management, path selection, directory data, and
hidden services. Most of these have the virtue of being more
easily testable once they are extracted.
Once we have some significant pieces split into libraries and
submodules, we could start doing process-level refactoring,
moving toward an architecture built from processes of these
kinds:
* Network: Has a libevent-driven main loop. Accepts commands in
a pool, maybe from a subthread.
* Utility: Is just event-driven. Answers requests, gives
responses.
* Manager: Starts up processes as needed and connects them to
one another.
* Dispatcher: Relays messages from their sources to their
destinations.
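The Dispatcher role can be sketched as a router keyed by
destination, with each "process" simulated here as a request
handler. Everything below (the class, the process name, the
request shape) is a hypothetical stand-in for real IPC such as
protobufs over nanomsg.

```python
class Router:
    """Toy central dispatcher: relays a request to a named
    'process' and hands back the reply."""
    def __init__(self):
        self._procs = {}

    def register(self, name, handler):
        self._procs[name] = handler

    def send(self, dest, request):
        # In a real system this would serialize the request and
        # ship it over an IPC channel; here it is a direct call.
        return self._procs[dest](request)

router = Router()
# A "Utility" process: purely event-driven, answers requests.
router.register("path_select",
                lambda req: {"guard": "guard-1", "exit": req["exit"]})
reply = router.send("path_select", {"exit": "exit-9"})
print(reply)  # → {'guard': 'guard-1', 'exit': 'exit-9'}
```

Because callers only name a destination and a request, the
Manager could later move "path_select" into its own sandboxed
process without touching the callers.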
Our best next steps in this area, in my opinion, are to identify
one or two key areas and abstractions, and to begin isolating
them in practice, so that we get a feel for how this kind of
transition works and can iterate on our design.
[Appendices in attachment so as to keep this to a reasonable size.]