Hi all,
I am working on integrating uTCP and uTLS (http://arxiv.org/abs/1103.0463) into Tor to see if we can lower the latency due to head-of-line blocking across circuits. This is Tracker Item #7679 (https://trac.torproject.org/projects/tor/ticket/7679). I want to first test the Tor code's ability to process cells from different circuits out of order with respect to how TCP delivered them. To test this, I made some changes to src/or/channel.c.
1. I forced queueing of cells into the channel's incoming_queue:

    channel_queue_cell(...) {
      ...
      need_to_queue = 1; // Force all cells into the queue, so they
                         // do NOT go directly to the handler.
      ...
    }
2. Then, when two cells from different circuits are present in the incoming_queue, process the second cell before the first (a fleshed-out sketch follows below):

    channel_process_cells(...) {
      ...
      while (NULL != (q = TOR_SIMPLEQ_FIRST(&chan->incoming_queue))) {
        // Remove q from the incoming_queue.
        // If the queue is still not empty, get the next entry;
        // if the circuit ids do not match, swap the cells.
      }
      ...
    }
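Here is roughly what I'm attempting, as my own sketch rather than existing tor code. It assumes every queued entry holds a fixed-size cell reachable as q->u.fixed.cell with a circ_id field, and that tor_queue.h provides TOR_SIMPLEQ_NEXT; both appear to hold in the tree I'm working from, but double-check before reusing this:

    // Sketch only: inject cross-circuit reordering while draining the queue.
    while (NULL != (q = TOR_SIMPLEQ_FIRST(&chan->incoming_queue))) {
      cell_queue_entry_t *q2 = TOR_SIMPLEQ_NEXT(q, next);
      if (q2 && q->u.fixed.cell->circ_id != q2->u.fixed.cell->circ_id) {
        // Two queued cells from different circuits: swap their payloads
        // so the second circuit's cell reaches the handler first.
        cell_t *tmp = q->u.fixed.cell;
        q->u.fixed.cell = q2->u.fixed.cell;
        q2->u.fixed.cell = tmp;
      }
      TOR_SIMPLEQ_REMOVE_HEAD(&chan->incoming_queue, next);
      chan->cell_handler(chan, q->u.fixed.cell);
      tor_free(q); // free the dequeued entry
    }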
My problem is that number 2 above never occurs. I have not observed a case where the incoming_queue ever has two cells from different circuits; in fact, I don't think I've ever seen the incoming_queue hold more than one cell. I am hammering a small tor test network with 30+ curl requests at once. I have two proxies, each using the same entry node and the same exit node, and there is only one other node in the network, so every circuit that is built should follow the same path. Am I missing something? Will this function (channel_process_cells) ever process more than one cell at a time? I've checked the logs to verify that different circuits are actually set up for the different requests (rather than the proxies just reusing an existing circuit and giving each new request a new stream id).
Thanks for the help!
Fitz
On Thu, May 02, 2013 at 10:58:54AM -0400, MF Nowlan wrote:
I am working on integrating uTCP and uTLS (http://arxiv.org/abs/1103.0463) into Tor to see if we can lower the latency due to head-of-line blocking across circuits.
You have to be careful to preserve cell ordering *within* circuits, of course.
[Do you know about TCP-over-DTLS and PCTCP?]
- Then, when two cells from different circuits are present in the incoming_queue, process the second cell before the first:

    channel_process_cells(...) {
      ...
      while (NULL != (q = TOR_SIMPLEQ_FIRST(&chan->incoming_queue))) {
        // Remove q from the incoming_queue.
        // If the queue is still not empty, get the next entry;
        // if the circuit ids do not match, swap the cells.
      }
      ...
    }
My problem is that number 2 above never occurs. I have not observed a case where the incoming_queue ever has two cells from different circuits; in fact, I don't think I've ever seen the incoming_queue hold more than one cell. I am hammering a small tor test network with 30+ curl requests at once. I have two proxies, each using the same entry node and the same exit node, and there is only one other node in the network, so every circuit that is built should follow the same path. Am I missing something? Will this function (channel_process_cells) ever process more than one cell at a time? I've checked the logs to verify that different circuits are actually set up for the different requests (rather than the proxies just reusing an existing circuit and giving each new request a new stream id).
I'm not very familiar with the new channel code, unfortunately (it's on our "todo" list to port TCP-over-DTLS and/or PCTCP to channels, though). But at least in the pre-channel code, the input queues would indeed almost always be empty. And why wouldn't they be? As soon as TLS reports a cell has arrived, you should be able to process it immediately. Where the HOL blocking occurs is in the Tor *output* queue on the sending side and/or the TCP receive buffer on the receiving side. If an IP packet containing (...TCP...TLS...) a cell gets dropped, all future data on that connection will get stalled in the TCP receive buffer (if the window is open) or the Tor output queue (if the window is closed). By the time a cell gets delivered to Tor, the stall has already been resolved.
- Ian
Yes, the initial plan is to maintain ordering within a circuit, although even this is not strictly necessary if the application is willing to handle out-of-order packets. For example, a web server could theoretically handle requests for different resources in any order. But that's a different story.
Yes, we are aware of TCP-over-DTLS and Defenestrator. I'm not familiar with PCTCP. I'll read up on that.
What you're saying about HOL blocking in the output queue for a relay makes sense if the receive window fills up, but I didn't explain how uTCP actually works. uTCP (paired with uTLS) is a kernel patch that exposes to the application (in this case, Tor) segments that have arrived even if they are NOT the next logical segments in the stream. uTCP breaks TCP's delivery semantics without affecting the on-the-wire protocol; thus, it prevents HOL blocking from occurring in the TCP receive buffer. Now it should be clear why the queue is important: I need to keep a cell in the queue if it is out of order within its circuit, but if it is the "next expected cell" for its circuit, I can process it immediately, regardless of whether TCP delivered it in or out of order.
I'm implementing this by changing the wire format to include a sequence number for each cell. The sequence number increments within a circuit, and the receiver advances its "expected" sequence number with each in-order arrival on that circuit.
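To make the receive-side decision concrete, here is a rough sketch of what I have in mind. The seq field on cell_t and the expected_in_seq counter on the circuit are fields I'm adding (they don't exist in tor today), and I'm assuming circuit_get_by_circid_channel() for the circuit lookup:

    // Sketch: on each arrival from uTCP/uTLS, either process the cell
    // now or hold it. "seq" and "expected_in_seq" are my added fields.
    static void
    channel_handle_or_hold(channel_t *chan, cell_t *cell)
    {
      circuit_t *circ = circuit_get_by_circid_channel(cell->circ_id, chan);
      if (circ && cell->seq == circ->expected_in_seq) {
        circ->expected_in_seq++;        // advance the per-circuit cursor
        chan->cell_handler(chan, cell); // in order within its circuit
        // Then re-scan the incoming_queue: a held cell on this circuit
        // may now have become the next expected one.
      } else {
        // Out of order within its circuit: hold it on the incoming_queue
        // until the missing cell arrives.
        channel_queue_cell(chan, cell);
      }
    }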
It makes sense why the code is written as it is to process cells immediately, since, as you point out, the HOL issue has already been resolved by the time it gets to Tor. I was really just trying to manually inject some cell reordering to test Tor's ability to handle cells from different circuits out-of-order with respect to how TCP delivered them. But I'll skip this testing step and go to the next one, which is just running it atop uTCP/uTLS directly and using the incoming_queue to hold cells as needed. I'll have to change a bit more of the code, particularly the asserts that the incoming_queue is empty after a call to channel_process_cells(…).
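For example, I expect to relax checks along these lines (num_held_ooo_cells is a hypothetical counter I'd maintain whenever I park an out-of-order cell):

    // Before: assumes a processing pass always drains the queue.
    tor_assert(TOR_SIMPLEQ_EMPTY(&chan->incoming_queue));

    // After (sketch): cells held for in-circuit reordering may remain.
    tor_assert(TOR_SIMPLEQ_EMPTY(&chan->incoming_queue) ||
               chan->num_held_ooo_cells > 0);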
If you have any other thoughts on this issue, please send them along. I appreciate your insight.
Regards,
Fitz
On May 4, 2013, at 9:05 AM, Ian Goldberg <iang@cs.uwaterloo.ca> wrote:
On Thu, May 02, 2013 at 10:58:54AM -0400, MF Nowlan wrote:
I am working on integrating uTCP and uTLS (http://arxiv.org/abs/1103.0463) into Tor to see if we can lower the latency due to head-of-line blocking across circuits.
You have to be careful to preserve cell ordering *within* circuits, of course.
[Do you know about TCP-over-DTLS and PCTCP?]
- Then, when two cells from different circuits are present in the incoming_queue, process the second cell before the first:

    channel_process_cells(...) {
      ...
      while (NULL != (q = TOR_SIMPLEQ_FIRST(&chan->incoming_queue))) {
        // Remove q from the incoming_queue.
        // If the queue is still not empty, get the next entry;
        // if the circuit ids do not match, swap the cells.
      }
      ...
    }
My problem is that number 2 above never occurs. I have not observed a case where the incoming_queue ever has two cells from different circuits; in fact, I don't think I've ever seen the incoming_queue hold more than one cell. I am hammering a small tor test network with 30+ curl requests at once. I have two proxies, each using the same entry node and the same exit node, and there is only one other node in the network, so every circuit that is built should follow the same path. Am I missing something? Will this function (channel_process_cells) ever process more than one cell at a time? I've checked the logs to verify that different circuits are actually set up for the different requests (rather than the proxies just reusing an existing circuit and giving each new request a new stream id).
I'm not very familiar with the new channel code, unfortunately (it's on our "todo" list to port TCP-over-DTLS and/or PCTCP to channels, though). But at least in the pre-channel code, the input queues would indeed almost always be empty. And why wouldn't they be? As soon as TLS reports a cell has arrived, you should be able to process it immediately. Where the HOL blocking occurs is in the Tor *output* queue on the sending side and/or the TCP receive buffer on the receiving side. If an IP packet containing (...TCP...TLS...) a cell gets dropped, all future data on that connection will get stalled in the TCP receive buffer (if the window is open) or the Tor output queue (if the window is closed). By the time a cell gets delivered to Tor, the stall has already been resolved.
- Ian
On Sat, May 04, 2013 at 12:13:09PM -0400, MF Nowlan wrote:
What you're saying about HOL blocking in the output queue for a relay makes sense if the receive window fills up, but I didn't explain how uTCP actually works. uTCP (paired with uTLS) is a kernel patch that exposes to the application (in this case, Tor) segments that have arrived even if they are NOT the next logical segments in the stream. uTCP breaks TCP's delivery semantics without affecting the on-the-wire protocol; thus, it prevents HOL blocking from occurring in the TCP receive buffer. Now it should be clear why the queue is important: I need to keep a cell in the queue if it is out of order within its circuit, but if it is the "next expected cell" for its circuit, I can process it immediately, regardless of whether TCP delivered it in or out of order.

I'm implementing this by changing the wire format to include a sequence number for each cell. The sequence number increments within a circuit, and the receiver advances its "expected" sequence number with each in-order arrival on that circuit.
Ah, that sounds similar to the code we wrote for Conflux (to appear at PETS 2013). In Conflux, streams can be split across multiple circuits that share an exit node. Cells on any given stream can then arrive in the wrong order at the endpoints (OP and exit), and must be reassembled. Conflux is end-to-end at the stream level, rather than hop-by-hop at the circuit level as yours is, and of course it addresses a different problem (what happens if there are lots of low-bandwidth bridges, because of Flash Proxies, for example).
It makes sense why the code is written as it is to process cells immediately, since, as you point out, the HOL issue has already been resolved by the time it gets to Tor. I was really just trying to manually inject some cell reordering to test Tor's ability to handle cells from different circuits out-of-order with respect to how TCP delivered them. But I'll skip this testing step and go to the next one, which is just running it atop uTCP/uTLS directly and using the incoming_queue to hold cells as needed. I'll have to change a bit more of the code, particularly the asserts that the incoming_queue is empty after a call to channel_process_cells(…).
You also have to make sure Tor can stay asleep (rather than busy-polling) even if there are cells in the input queue, I suppose.
- Ian
On Thu, May 02, 2013 at 10:58:54AM -0400, MF Nowlan wrote:
Hi all,
I am working on integrating uTCP and uTLS (http://arxiv.org/abs/1103.0463) into Tor to see if we can lower the latency due to head-of-line blocking across circuits. This is Tracker Item #7679 (https://trac.torproject.org/projects/tor/ticket/7679). I want to first test the Tor code's ability to process cells from different circuits out of order with respect to how TCP delivered them. To test this, I made some changes to src/or/channel.c.
- I forced queueing of cells into the channel's incoming_queue:

    channel_queue_cell(...) {
      ...
      need_to_queue = 1; // Force all cells into the queue, so they
                         // do NOT go directly to the handler.
      ...
    }
- Then, when two cells from different circuits are present in the incoming_queue, process the second cell before the first:

    channel_process_cells(...) {
      ...
      while (NULL != (q = TOR_SIMPLEQ_FIRST(&chan->incoming_queue))) {
        // Remove q from the incoming_queue.
        // If the queue is still not empty, get the next entry;
        // if the circuit ids do not match, swap the cells.
      }
      ...
    }
My problem is that number 2 above never occurs. I have not observed a case where the incoming_queue ever has two cells from different circuits; in fact, I don't think I've ever seen the incoming_queue hold more than one cell. I am hammering a small tor test network with 30+ curl requests at once. I have two proxies, each using the same entry node and the same exit node, and there is only one other node in the network, so every circuit that is built should follow the same path. Am I missing something? Will this function (channel_process_cells) ever process more than one cell at a time? I've checked the logs to verify that different circuits are actually set up for the different requests (rather than the proxies just reusing an existing circuit and giving each new request a new stream id).
[I wrote the channel code, so I'll explain it]
Under normal operation with the current channeltls.c implementation, those queues should rarely, if ever, have more than one cell. The queue exists because the channel abstraction includes possibilities like temporarily entering CHANNEL_STATE_MAINT and not processing new traffic, in which case the queue accumulates temporarily unprocessable cells. I believe some of the opening and closing cases might also cause it to fill. None of these normally occur with channeltls.c, which is currently the only channel implementation layer, and the CHANNEL_STATE_MAINT case in particular never occurs.
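Roughly, the direct-vs-queue decision in channel_queue_cell() looks like this (a paraphrase from memory, not a verbatim copy of channel.c):

    int need_to_queue = 0;
    if (!chan->cell_handler)
      need_to_queue = 1; // no handler registered yet: must hold the cell
    if (!TOR_SIMPLEQ_EMPTY(&chan->incoming_queue))
      need_to_queue = 1; // keep FIFO order behind already-queued cells
    // A state like CHANNEL_STATE_MAINT would also force queueing, but
    // channeltls.c never enters it, so in practice cells go straight
    // to the handler and the queue stays empty.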
Short version: you're seeing normal behavior, don't worry about it.