Re: [tor-dev] QUIC TOR Debugging Question (no attach)

29 Apr 2016

Tim:
Sorry for not being specific enough on my questions. I'll try to give more
detailed questions later instead of higher-level problems.
Regarding the frequency of my emails, I apologize for the long intervals.
I'm not full-time on this project, I've had exams, and I can only work on
the QUIC TOR project for a couple of days every week. Fortunately, I'm now
nearly done with all my finals for this semester, so I can put more time
into this project from now on.
Right now, I have two specific questions:
1. We just switched from chutney to testing on EmuLab (a testing framework
where each node is a standalone machine). After the switch, a particular
bug we saw on chutney disappeared: on chutney, some nodes used to crash
mysteriously with no log output (the logs simply stop, with no stack trace
or anything). This bug only occurred when there was an existing cache (the
first run after chutney configure was fine). On EmuLab, using an almost
identical torrc file, everything runs fine so far, so we are ignoring this
bug for now. Have you seen similar issues on chutney?
2. The circuit building process is taking too long and many circuits
expire. We have 4 relays, 2 of which are also authorities. From the
logs, I'm seeing a lot of the following lines:
   - circuit_expire_building(): Abandoning circ XX XXXX:XX:12345 (state
     0,0: doing handshakes, purpose 5, len 3)
   - router_choose_random_node(): We found 3 running nodes.
     router_choose_random_node(): We removed 1 excludednodes, leaving 2 nodes.
     router_choose_random_node(): We removed 2 excludedsmartlist, leaving 0
     nodes.
The first line appears when we have connected to the first node and are
waiting for a response from the second or sometimes the third relay. The
second group appears when we are trying to choose the path for a circuit.
*What could I do to increase the number of available nodes? Should I
increase the frequency of reachability tests?*
After looking at the code, *there's a comment at circuitbuild.c line 2172
describing why some nodes are excluded, which I don't quite understand*.
Specifically, the comment says: "XXXX025 use the using_as_guard flag to
accomplish this." Where can I find more information on this XXXX025 issue
(committed here
https://lists.torproject.org/pipermail/tor-commits/2013-March/053977.html)?
*Why are these routers being excluded?*
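
To see which exclusion step empties the candidate pool most often, the
router_choose_random_node() lines can be tallied. This is only a sketch: it
assumes each message sits on one log line in the shape quoted above, and
the function name exhausting_steps is made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, copied from the log lines quoted in this email.
_REMOVED = re.compile(
    r"router_choose_random_node\(\): We removed (\d+) (\w+), leaving (\d+)\s+nodes"
)

def exhausting_steps(log_lines):
    """Count, per exclusion list, how often it left zero candidate nodes."""
    counts = Counter()
    for line in log_lines:
        match = _REMOVED.search(line)
        if match and int(match.group(3)) == 0:
            counts[match.group(2)] += 1
    return counts
```

Passing an open log file (any iterable of lines) returns a Counter keyed by
the exclusion list name, such as excludednodes or excludedsmartlist.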
Please let me know if you want more specific information on those issues.
Thank you!
Li.
On Sun, Apr 24, 2016 at 11:33 PM, Tim Wilson-Brown - teor <
teor2345@gmail.com> wrote:
...
...
On 25 Apr 2016, at 06:44, Xiaofan Li xli2@andrew.cmu.edu wrote:
Hi Tim and everyone on tor-dev,
Our QUIC + TOR project is almost fully implemented. We are debugging the
last few bugs. Update:
...
  • We're now able to build many complete circuits with QUIC as the
    underlying protocol.
...
  • We have not debugged the actual communication part yet. We are
    aware of certain failure cases for QUIC (e.g. line 15642 of the log is
    being debugged right now). So we cannot send actual client data yet.
...
  • The current state uses QUIC for OR connections only. Thus a
    dual-path is implemented as suggested in my last email thread.
...
  • TLS is completely bypassed and important state (that is set up
    in tls_handshake functions) is preserved and refactored out. e.g.
    conn->/chan->state purpose, etc.
...
  • Some tinkering and re-designing of QUIC itself is also underway.
    The fact that QUIC is a transport protocol at the application layer
    makes it painful to interact with the event and timer systems of TOR.
    We are trying to improve this aspect now.
...
The debugging log I was attaching was too big for the tor-dev list. So
if you are interested in taking a look at the file, let me know.
Large debug logs contain too much information to be helpful to you or to
us.
Try warning, notice, or info level logs, in that order.
Using high-level logs makes it easier to work out where your attempts to
send data have broken down.
Once you've identified where communication has broken down, try to fix it.
If you can't fix it, you're welcome to ask for advice.
Please quote a small number of relevant log messages, tell us what you
think they mean, and what you've tried to do to fix it.
Also feel free to provide a link to logs at that level for people to look
through.
This makes it more likely that people will recognise your issue and
respond by helping you to fix it.
...
Some particularly concerning things in the log:
  • circuit_get_by_circid_channel_impl(): found nothing for circ_id
    14801, channel ID 2 (0x7f758bb6b740)
...
Then it just attaches this circ onto this channel. Is this normal?
  • Line 4901 circuit_receive_relay_cell(): Passing on unrecognized
    cell.
...
It happens a lot. Is this normal?
  • This sequence happened a lot around line 7500:
    relay_send_command_from_edge_(): delivering 10 cell forward.
    circuit_package_relay_cell(): crypting a layer of the relay cell.
    circuit_package_relay_cell(): crypting a layer of the relay cell.
    circuit_package_relay_cell(): crypting a layer of the relay cell.
It seems like it's decrypting and forwarding cells along. Is it normal
for TOR (with TCP) to do this in a burst? I'm seeing about 1s of
repeated calls.
I honestly don't think these are concerning at all. But I don't really
know.
And I can't find out, because I don't know which version of tor you've
based your changes on.
Here's how you can find out whether these log messages are typical or not:
Run the original version of Tor that you've based your QUIC changes on,
with the same network configuration.
(Does it work? If not, your QUIC network will likely never work either.)
Then compare the warning, notice, and info logs to tor with QUIC.
Stop at the first log that differs in non-trivial ways.
This is a log level that's useful for you.
(High-level logs will also cause you less concern about spurious messages.)
This way, you can answer your own questions about which logs and
behaviours are normal, and which ones you've introduced.
Feel free to report back with any log messages from the unmodified version
of Tor that might indicate bugs.
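
The comparison described above can be partly automated. The sketch below is
one possible approach, not an existing Tor tool: it assumes log lines look
like "Apr 29 12:00:00.000 [notice] message", masks values that change
between runs (timestamps, pointers, numbers), and reports the first line
where the two logs disagree. The names normalize and first_difference are
hypothetical.

```python
import re
from itertools import zip_longest

# Assumed log line layout: "Apr 29 12:00:00.000 [notice] message".
_TIMESTAMP = re.compile(r"^\w{3} +\d+ \d\d:\d\d:\d\d\.\d+ ")
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")

def normalize(line):
    """Strip the timestamp and mask values that vary between runs."""
    line = _TIMESTAMP.sub("", line.strip())
    line = _HEX.sub("<hex>", line)  # mask pointers before plain numbers
    return _NUM.sub("<n>", line)

def first_difference(baseline_lines, quic_lines):
    """Return (index, baseline, quic) for the first differing line, or None."""
    pairs = zip_longest(baseline_lines, quic_lines, fillvalue="")
    for index, (base, quic) in enumerate(pairs):
        if normalize(base) != normalize(quic):
            return index, base.strip(), quic.strip()
    return None
```

A strict line-by-line comparison is fragile when events interleave
differently between runs, so this works best on warning or notice logs,
where there are few lines and their order is fairly stable.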
...
Some more general questions:
  • Internal circuits: any docs? What are they used for? Measuring
    bandwidth?
Relay bandwidth testing, relay reachability testing (default chutney
configs skip this using AssumeReachable), client directory fetches, hidden
service directory document uploads, onion services (hidden services), …
Read the ~12 instances of CIRCLAUNCH_IS_INTERNAL in the tor source code
for more details.
...
How many internal circuits are required by the system?
As many as are necessary to support the operation of the Tor client /
relay / onion service at the current time.
Initially, 2 or 3 (read circuit_predict_and_launch_new for more details).
...
  • Circuit wide ID format. We had a bug regarding this last week.
    The check in process_create_cell always fails because the check at
    lines 281-295 in command.c (for CIRC_ID_TYPE and id_is_high) always
    fails. We have currently commented out this check. What does it
    affect? And can we do this?
I can't see how this could be your client communication issue. It's only
an issue if the circuit IDs collide, which should be unlikely in small
networks.
When two relays create circuits on a connection, one uses the lower half
of the circuit id space, and one uses the upper half. This prevents circuit
IDs colliding. Read the definitions of circ_id_type, circ_id_type_t, and
channel_set_circid_type for details.
The version of the link protocol determines how this decision is made.
I assume that your tor has chan->conn->link_proto >=
MIN_LINK_PROTO_FOR_WIDE_CIRC_IDS.
(You can check this by printing out the value of chan->conn->link_proto
everywhere channel_set_circid_type is called.)
So you've removed TLS client identity and TLS server identity keys.
What do get_tlsclient_identity_key and get_server_identity_key return?
Null bytes?
Is there a publicly known key in QUIC that's known by both sides and
stable for the life of a connection?
If so, use that.
If not, always pass 0 for consider_identity to channel_set_circid_type, so
that the initiator uses the upper half of the circuit IDs, regardless of
keys.
Breaking other parts of the circuit management code could also cause this
issue.
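
The half-space scheme described above can be modelled in a few lines. This
is a simplified illustration, not Tor's code: it covers only the case where
consider_identity is 0 and 4-byte (wide) circuit IDs are in use, and the
function names and the "higher"/"lower" strings are stand-ins chosen for
this sketch.

```python
WIDE_CIRC_ID_HIGH_BIT = 1 << 31  # top bit of a 4-byte circuit ID

def circ_id_type(started_here):
    """Without identity keys, the connection initiator takes the upper half."""
    return "higher" if started_here else "lower"

def create_cell_id_is_acceptable(circ_id, our_type):
    """A CREATE cell from the peer must use the half we do not allocate from."""
    id_is_high = bool(circ_id & WIDE_CIRC_ID_HIGH_BIT)
    if our_type == "higher":
        return not id_is_high
    return id_is_high
```

In this model, if both ends believe they own the same half, every incoming
CREATE cell fails the check, which would look like a check that "always
fails".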
...
  • From a high level, when a client sends data using a circuit,
    what is its code path? Which special (as in, specific to
    client-initiated communication) functions are called?
I'm not sure how to answer this question. The unhelpful but accurate
answer is "not many codepaths are client-specific, if there are any at all".
Regardless of its role in the network, every tor instance performs common
operations like retrieving consensus documents and building circuits. And,
if configured to do so, tor instances can perform multiple roles.
Here are some high-level differences between client and server
communication in the tor network that could be causing your issues:
Typically, clients, onion services, and bridges retrieve directory
documents using "begindir", a TLS connection to the ORPort. Relays and
authorities do this unencrypted over the DirPort. If you haven't replaced
TLS with QUIC correctly, clients may fail to bootstrap or retrieve
directory documents. There should be log messages about this.
Clients have a SOCKSPort open, and in response to application requests
they make an AP (application) connection that's linked to a stream on a
circuit that's been extended to the destination exit relay. They then send
requests received on the SOCKSPort to the destination relay, and receive
responses that they forward to the application. (The onion service setup is
slightly more complex, but transmits data in a similar way.)
Have you read torguts?
https://gitweb.torproject.org/user/nickm/torguts.git/
Any part of this process could break and cause client communication to
fail.
Parts of the relay code could also break in ways that cause client
communication to fail.
I can't see how to describe specific code paths without more specific (and
precise) detail about what's failing, and whether it's failing on clients
or relays. You can find this in the logs, if you log sensibly. Let us know
what you find, and what you tried to do to fix it.
What high-level success or failure message (warning, notice, info) is
logged on the client right after you try to make an application connection?
Does the connection reach a relay? The exit? DNS? The remote site?
What warning, notice, or info-level message is logged on the last tor node
where the connection stops working?
(Or what DNS or HTTP request is sent to the remote server / site?)
...
Any other comment on the log is greatly appreciated, since everyone here
is probably more familiar than me with what a normal bootstrapping process
would look like.
Don't worry too much about the log messages. They're designed to be used
for debugging once there is a known issue.
The vast majority are harmless, and many need context to interpret. You
can find this context by searching the tor code for unique words or phrases
in the log message. (But keep in mind that log strings are often composed
of shorter strings.)
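
A plain grep over the source tree does this job. As an equivalent sketch in
Python, the function below (the name find_log_phrase is made up here) walks
a checkout and lists every .c or .h line containing a phrase. Because log
strings are often assembled from pieces, searching for a short distinctive
fragment tends to work better than pasting a whole message.

```python
from pathlib import Path

def find_log_phrase(source_root, phrase, suffixes=(".c", ".h")):
    """Yield (path, line_number, text) for each source line containing phrase."""
    for path in sorted(Path(source_root).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, text in enumerate(handle, start=1):
                if phrase in text:
                    yield path, number, text.strip()
```

For example, list(find_log_phrase("src", "Passing on unrecognized")) would
search a directory named src, which is a placeholder for wherever the tor
sources live.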
Some general requests for future questions:
It would be much easier and faster for me (and perhaps others) to help you
if you asked questions after trying to identify and fix issues yourself. I
encourage you to try some of the things I've suggested, and ask more
precise questions next time.
Personally, I would find it easier to respond to targeted questions that
come one at a time, every few days or every week, rather than a large email
every few weeks.
It might also be helpful to be able to see the source code you're working
on, rather than trying to guess what changes you've made from what I
remember of what you said about your design in previous emails.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP 968F094B
ricochet:ekmygaiu4rzgsk6n
