Hi all, We've finished building Tor on QUIC and everything works fine with chutney. However, we have an issue with node availability when we test on real networks. We need this to work in order to evaluate the effect of head-of-line (HOL) blocking with QuicTor vs. Tor.
Specifically:
1. When using our QuicTor and normal Tor *without path restriction*, warning messages like "[warn] Failed to find node for hop 1 of our path. Discarding this circuit." keep showing up, but the node actually reaches 100% bootstrap and all file-transfer functionality works fine.
2. However, our real issue is that when I *restrict path selection* to 3 pre-determined nodes for all exit circuits, the client no longer reaches 100% and keeps hanging at 85% (or sometimes 80%).
3. We've thoroughly tested path restriction with chutney, and it works there.
Our network has 11 nodes: 3 clients and 8 relays (2 of which are authorities). We already have AssumeReachable set, and I've turned up the voting frequencies, just like chutney's default config. Is there any other configuration flag that could help propagate router availability info?
Any idea at all? Li.
On 6 May 2016, at 09:34, Xiaofan Li xli2@andrew.cmu.edu wrote:
Hi all, We've finished building TOR on QUIC and everything works fine with chutney. However, we have an issue with the node availability when we test on real networks. We need this to work in order to evaluate the effect of HOL blocking with QuicTor vs. Tor.
What do you mean by a "real network"? Do you mean a test network with your own authorities on non-localhost IP addresses?
Specifically: • When using our QuicTor and normal Tor without path restriction, warning messages like "[warn] Failed to find node for hop 1 of our path. Discarding this circuit." keep showing up, but the node actually reaches 100% bootstrap and all file-transfer functionality works fine.
That's not a normal error message. Perhaps you have found a bug in an unstable version of Tor - it might be wise to rebase on 0.2.7.6. Perhaps your Tor configuration or network configuration is broken. It's hard to tell without more information.
Try telling your authorities to have your clients only use 1 guard:

ConsensusParams NumDirectoryGuards=1 NumEntryGuards=1

(Normally, a client won't re-use any of its 3 guards as a middle or exit. TestingTorNetwork disables this behaviour. These parameters make it so each client only selects 1 guard.)
What other log messages does Tor give just before or after that message? What log messages does Tor give that mention "guard"?
• However, our real issue is when I restrict the path selection to 3 pre-determined nodes for all exit circuits, the client will not reach 100% anymore and keeps hanging at 85% (or 80% sometimes).
Does this happen with both Tor and QuicTor?
Perhaps you are using options that restrict path selection in a way that breaks on non-test networks. Perhaps your restrictions conflict with options that change between your chutney and "real" networks. (Or are you implementing these path restrictions in code?) Perhaps you have found a bug in an unstable version of Tor - it might be wise to rebase on 0.2.7.6. It's hard to tell without more information.
What torrc options are you using on the client to restrict paths? EntryNodes, ExitNodes, StrictNodes? How do you restrict the middle node?
What do "80%" and "85%" mean to your client? What other log messages does Tor give just before or after those messages? What are the proportions of Guard, Middle, and Exit relay descriptors that Tor logs as it bootstraps? Does Tor warn you about your path restrictions?
I also wonder why you need to use path restrictions at all. Can't you just put 3 authorities and no relays in your network, and there will be only one possible path? (Of course, those relays could be selected in any order.) Or you could modify the functions that do path selection so they use the same 3 hard-coded relays every time.
• We've thoroughly tested path restriction with chutney and it works. Our network has 11 nodes: 3 clients and 8 relays (2 of which are authorities). We already have assumeReachable and I've tuned up voting frequencies, just like chutney's default config. Any other configuration flag that could help propagating router availability info?
If you give one node the Guard flag, and only one node exits anywhere, then that's your path. Try using the following options on your authorities:

TestingDirAuthVoteGuard <guard-fingerprint>
TestingDirAuthVoteGuardIsStrict 1
TestingDirAuthVoteExit <exit-fingerprint>
TestingDirAuthVoteExitIsStrict 1
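As a concrete sketch, the authorities' torrc could look like the following (the fingerprints below are placeholders; substitute your relays' actual identity fingerprints):

```
# In each directory authority's torrc (testing networks only):
# vote the Guard flag for exactly one relay, and the Exit flag
# for exactly one relay, and refuse to vote them for any other.
TestingDirAuthVoteGuard 1234567890ABCDEF1234567890ABCDEF12345678
TestingDirAuthVoteGuardIsStrict 1
TestingDirAuthVoteExit FEDCBA0987654321FEDCBA0987654321FEDCBA09
TestingDirAuthVoteExitIsStrict 1
```

With only one Guard and one Exit in the consensus, the client has very little freedom left when choosing a path.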
You could try comparing the options that chutney sets with the "real network" options you're using, and switching one chutney option to a real network option until the bug occurs.
Remember that "TestingTorNetwork" changes a lot of options at once. It's the one I'd try first.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
On Fri, May 06, 2016 at 07:13:04PM +1000, Tim Wilson-Brown - teor wrote:
On 6 May 2016, at 09:34, Xiaofan Li xli2@andrew.cmu.edu wrote:
• However, our real issue is when I restrict the path selection to 3 pre-determined nodes for all exit circuits, the client will not reach 100% anymore and keeps hanging at 85% (or 80% sometimes).
See Section 5.5 of control-spec.txt for what these bootstrap percentages mean: https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n2976
So for 85% or 80%, it looks like Tor is not connecting to, or not completing the handshake with, its guard relay.
Tim's guesses for what might be going wrong sound plausible. It sounds like your Tor client either doesn't have enough relays to choose from, or it doesn't have enough relays with the required properties.
Li: with respect to the name 'QuicTor', please see https://www.torproject.org/docs/trademark-faq#researchpapers and then https://www.torproject.org/docs/trademark-faq#combining That is, if you are planning to have a program that is like Tor but with different behavior, please think of a more original name than "QuicTor". Otherwise we'll be back in a situation like that guy who published "Advanced Tor" and didn't understand why we were worried that users would be confused. :)
(Normally, a client won't re-use any of its 3 guards as a middle or exit. TestingTorNetwork disables this behaviour.
Tim: I think this statement might be wrong? Tor picks its exit first, then picks a current guard that doesn't overlap with the exit, then picks a middle that doesn't overlap with either of them.
See e.g. choose_good_middle_server().
--Roger
On 6 May 2016, at 21:52, Roger Dingledine arma@mit.edu wrote:
(Normally, a client won't re-use any of its 3 guards as a middle or exit. TestingTorNetwork disables this behaviour.
Tim: I think this statement might be wrong? Tor picks its exit first, then picks a current guard that doesn't overlap with the exit, then picks a middle that doesn't overlap with either of them.
See e.g. choose_good_middle_server().
Apologies, I didn't remember or explain the details of path selection very well.
When tor is selecting an entry node for CIRCUIT_PURPOSE_TESTING in choose_good_entry_server(), it excludes all guard nodes, and excludes the exit. This can mean that 1 exit, 3 (or more) directory guards, and possibly another 3 (or more) entry guards are excluded. The entry guards are not necessarily the same as the directory guards (this can happen if the directory guards do not have descriptors, or the entry guards are not directories).
So if you're excluding 7 or more nodes in a small network, this can cause path selection failures. Of course, it could be that entry guard selection is failing for other reasons, like a lack of nodes in the consensus.
Without context from the logs, it's hard to tell whether it's testing circuits or standard circuits that are causing the path failures.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
Thanks for the replies!
1. About the name: Thanks for the heads-up! We'll definitely pay attention to the trademark rules when we publish our results. We are not planning to roll out our own version of Tor. I think our most important goal is to demonstrate that a UDP-based protocol can solve some of Tor's hard performance problems, and to hope that you would consider using it in future versions. This leads me to a question about licensing: I believe Tor and QUIC have different (conflicting) licenses. Would it even be a possibility that QUIC ever makes it into Tor?
2. About our network config and clarification:
What do you mean by a "real network"?
Do you mean a test network with your own authorities on non-localhost IP addresses?
Yes, our testing framework is using EmuLab with customized bandwidth, latency, queue size and drop rate.
From Tor's perspective, we are still using "TestingTorNetwork".
I also wonder why you need to use path restrictions at all.
3. For path restriction, we have our own implementation. We parse the config file and use the nicknames in the choose_good_*() functions to return the corresponding nodes. We have to use this because *restricting the middle node* is very important to us for testing the HOL blocking problem. (We have to manually create 1-hop-overlapping paths for two clients and test the interference.)
4. Regarding the issue, it's probably not an entry guard problem, because: 1) Shouldn't that give "failed to select hop 0" instead of hop 1? 2) I can see in our debug log that we failed on the extend info for the second node. The node returned by choose_good_middle_server() is not NULL, but its routerinfo_t pointer is NULL. Any idea why? My guess is that the consensus is a little short for some reason; how do I validate this guess? Does the global router list contain everything in the consensus?
5. One more observation on this issue: for both Tor and our Tor, when I decrease the size of the network (i.e. the number of relays), the hanging issue resolves itself.
I'll try rebasing back to an official release today.
Thanks, Li.
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
On 7 May 2016, at 05:10, Xiaofan Li xli2@andrew.cmu.edu wrote:
Thanks for the replies!
- About the name:
Thanks for the headsup! We'll definitely pay attention to the trademark rules when we publish our results. We are not planning to roll out our own version of Tor. I think our most important goal is probably: demonstrate a possibility of UDP-based protocol to solve some of TOR's hard performance problems. And hope that you guys would consider using it in future versions. This leads me to a question about licensing: I believe TOR and QUIC have different (conflicting) licenses. Would it even be a possibility that QUIC ever makes it into TOR?
I am not a lawyer or a software architect for Tor, so these are simply my opinions:
Here is the tor license: https://gitweb.torproject.org/tor.git/tree/LICENSE
Is QUIC under the Chromium license? I couldn't find a QUIC-specific one. https://chromium.googlesource.com/chromium/src/+/master/LICENSE
If so, the licenses look compatible to me: they're both BSD-style and almost identical.
What restrictions concern you?
On the architecture side:
I'm not sure if Tor is looking for alternative transport protocols like QUIC. One of the issues is that any modified client is easy to fingerprint. So, as with IPv6, we'd need relays to run QUIC and TCP in parallel for some time, then clients could optionally use QUIC when there were enough relays supporting it. Perhaps relays could open a QUIC UDP port on the same port as their TCP ORPort, and then advertise support in their descriptors. But TCP would remain the default for the foreseeable future.
For example, our IPv6 adoption is still at the stage where clients need to be explicitly configured to use it. (And parts of it are only coming out in 0.2.8.)
If your modifications don't work like this, then it would be very hard for us to adopt them. Even if they did, I don't know if they solve any pressing issues for us. (And we'd need both a theoretical security analysis, and a code review. And new features come with new risks and new bugs.)
... I also wonder why you need to use path restrictions at all. 3. For path restriction, we have our own implementation. We parse the config file and use the nickname in choose_good_*() functions to return the corresponding nodes. We have to use this because restricting the middle node is very important to us for testing HOL blocking problem. (We have to manually create 1-hop overlapping path for two clients and test the interference.)
- Regarding the issue, it's probably not entry guard problem, because: 1) Shouldn't that give "failed to select hop 0" instead of hop 1?
Yes, you're right, in that message, hop counts are 0-based.
But the code is inconsistent in onion_extend_cpath:
if (!info) {
  log_warn(LD_CIRC,"Failed to find node for hop %d of our path. Discarding "
           "this circuit.", cur_len);
  return -1;
}

log_debug(LD_CIRC,"Chose router %s for hop %d (exit is %s)",
          extend_info_describe(info), cur_len+1,
          build_state_get_exit_nickname(state));
The control spec says hops are 1-based, so we should fix the logging.
See: https://trac.torproject.org/projects/tor/ticket/18982
I've given you credit for reporting this issue, please feel free to provide your preferred name (or decline) on the ticket.
- I can see in our debugging log that we failed on the extending info with the second node. The node returned by choose_good_middle_server is not NULL but the routerinfo_t pointer is NULL. Any idea why?
Perhaps you looked it up the wrong way, or it's not in the consensus. What code are you using to look up the node?
Are you using extend_info_from_node()? If not, please note that different fields are present depending on whether you use descriptors (ri) or microdescriptors (md).
My guess is that consensus is a little short for some reasons, how do I validate this guess?
Read the plain-text cached-{microdesc-}consensus file in the tor client's data directory and check if the middle node is in it. Read the plain-text cached-{descriptors,microdescs} file in the tor client's data directory and check if the middle node is in it.
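For instance, a quick check could look like this (an illustrative shell sketch: the two-relay excerpt below stands in for the real cached-microdesc-consensus in the client's DataDirectory, and "relay3" is a hypothetical middle node):

```shell
# Sample stand-in for $DATADIR/cached-microdesc-consensus;
# each "r " line in the real file is one relay in the consensus.
cat > /tmp/sample-consensus <<'EOF'
r relay1 AAAAAAAAAAAAAAAAAAAAAAAAAAA 2016-05-06 12:00:00 10.0.0.1 5001 0
r relay2 BBBBBBBBBBBBBBBBBBBBBBBBBBB 2016-05-06 12:00:00 10.0.0.2 5002 0
EOF

# How many relays made it into the consensus?
grep -c '^r ' /tmp/sample-consensus

# Is the middle node of the restricted path present?
grep -q '^r relay3 ' /tmp/sample-consensus \
  && echo "relay3 in consensus" \
  || echo "relay3 missing from consensus"
```

On the real files, the same two greps tell you whether the node your path restriction needs ever reached the client.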
Does the global router list contain everything on the consensus?
I'm not sure exactly what you're referring to here, please provide a function or global variable name for this list.
- More observation on this issue:
For both tor and our tor, when I decrease the size of the network (i.e. number of relays in the network), the hanging issue resolves itself..
Hmm, then it's likely a configuration issue with your network.
I'll try rebase back to an official release today.
That might help, we are still fixing bugs in 0.2.8.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
Tim,
I'm not sure if Tor is looking for alternative transport protocols like
QUIC.
What if it's a lot faster than TCP on Tor?
One of the issues is that any modified client is easy to fingerprint. So, as with IPv6, we'd need relays to run QUIC and TCP in parallel for some time, then clients could optionally use QUIC when there were enough relays supporting it. Perhaps relays could open a QUIC UDP port on the same port as their TCP ORPort, and then advertise support in their descriptors. But TCP would remain the default for the foreseeable future. For example, our IPv6 adoption is still at the stage where clients need to be explicitly configured to use it. (And parts of it are only coming out in 0.2.8.) If your modifications don't work like this, then it would be very hard for us to adopt them.
It does work like this. Our testing version has a "parallel codepath" and supports both QUIC and TCP. And we devised our QUIC API to look almost exactly like the traditional UNIX socket API, so the required code changes are minimal.
Even if they did, I don't know if they solve any pressing issues for us.
What about the head-of-line blocking issue and the congestion control issue raised in the 2009 performance roadmap (https://svn.torproject.org/svn/projects/roadmaps/2009-03-11-performance.pdf)? Judging from this paper (https://eprint.iacr.org/2015/235.pdf), it seems they haven't been completely solved.
(And we'd need both a theoretical security analysis, and a code review. And new features come with new risks and new bugs.)
Of course! We don't expect Tor to suddenly start using QUIC because of a couple of emails. But I believe we do have something to argue for QUIC based on both theories and experimental results. We would probably make a formal, published argument soon.
I've given you credit for reporting this issue, please feel free to provide
your preferred name (or decline) on the ticket.
Thanks!
About the issue, I've checked out the 0.2.8 commit and tested on that. The problem is still there, so I looked deeper into it. I've run it many times, and it seems that once I start restricting paths, it becomes nondeterministic whether the bootstrap will succeed. I think it might have something to do with the cached-microdesc-consensus file fetched by that client. Just to recap, I'm running a network with 11 nodes (2 relays) and 2 clients that have path restriction. My observations are:
- Each client will have a cached-microdesc-consensus file with 4 relays in it. Relays 0, 1, and 2 will always be there; the last one changes each time I start the network.
- When all 3 nodes on the restricted path are in the cached-microdesc-consensus file, the bootstrap succeeds quickly. For example, if my path is restricted to R2->R3->R1, then since 0, 1, and 2 are always present in the consensus, whenever R3 is there, the bootstrap works.
- When one of the nodes is not in the consensus, the bootstrap gets stuck and never reaches 100%. Depending on which node of the path is missing from the consensus, the error message varies. In the above example, if R3 is not in the consensus, we fail to connect to hop 1 (assuming 0-based logging).
- I waited a long time (~30 min) and nothing improved: the consensus did not gain more nodes, and the bootstrap stayed stuck.
I think the root of the problem might be the consensus having too few nodes. Is it normal for a cached-microdesc-consensus file to only have 4 nodes in an 11-node network? Should I look into the code that generates the consensus?
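One way to test that guess is to compare how many relays the authorities vote on with how many survive into the client's consensus. The sketch below fakes both files; on a real network, the client's copy is DataDirectory/cached-microdesc-consensus, and (if I remember the dir-spec right) each authority's current vote can be fetched from its DirPort at /tor/status-vote/current/authority:

```shell
# Fake stand-ins: 8 relays voted on, only 4 make it into the consensus.
printf 'r relay%d x\n' 0 1 2 3 4 5 6 7 > /tmp/vote-excerpt
printf 'r relay%d x\n' 0 1 2 5       > /tmp/consensus-excerpt

voted=$(grep -c '^r ' /tmp/vote-excerpt)
listed=$(grep -c '^r ' /tmp/consensus-excerpt)
echo "voted=$voted listed=$listed"
# If listed < voted, relays are being dropped at consensus time --
# e.g. missing descriptors, or no Running flag because the authorities
# could not reach them (check AssumeReachable and the vote contents).
```

If the votes themselves only contain 4 relays, the problem is upstream: the other relays never uploaded descriptors to the authorities at all.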
The routerlist_t I mentioned is in routerlist.c, line 124:

/** Global list of all of the routers that we know about. */
static routerlist_t *routerlist = NULL;

But now I think this probably just stores the same info as the cached-microdesc-consensus file, right?
Hmm, then it's likely a configuration issue with your network.
Shouldn't chutney also fail if it were a configuration issue? Or are you saying it's a configuration issue with my underlying network topology? The only things different in the torrc files between the chutney run and the Emulab run are "Sandbox 1" and "RunAsDaemon 1", but I don't think they cause any issues.
Thanks! Li.
Sorry for the spam, but I made a critical typo in the previous message: instead of "11 nodes (2 relays) and 2 clients who have path restriction," I meant 11 nodes (2 authorities) and 2 clients.
On Sun, May 8, 2016 at 1:52 AM, Xiaofan Li xli2@andrew.cmu.edu wrote:
Tim,
I'm not sure if Tor is looking for alternative transport protocols like
QUIC.
What if it's a lot faster than TCP on Tor?
One of the issues is that any modified client is easy to fingerprint. So, as with IPv6, we'd need relays to run QUIC and TCP in parallel for some time, then clients could optionally use QUIC when there were enough relays supporting it. Perhaps relays could open a QUIC UDP port on the same port as their TCP ORPort, and then advertise support in their descriptors. But TCP would remain the default for the foreseeable future. For example, our IPv6 adoption is still at the stage where clients need to be explicitly configured to use it. (And parts of it are only coming out in 0.2.8.) If your modifications don't work like this, then it would be very hard for us to adopt them.
It does work like this. Our testing version has "parallel codepath" and supports both QUIC and TCP. And we devised our QUIC API to look almost exactly like the traditional UNIX socket API. So, code change is almost minimal.
Even if they did, I don't know if they solve any pressing issues for us.
What about the head-of-line blocking issue and the congestion control issue raised in 2009 https://svn.torproject.org/svn/projects/roadmaps/2009-03-11-performance.pdf? From this paper https://eprint.iacr.org/2015/235.pdf, it seems they haven't been completely solved.
(And we'd need both a theoretical security analysis, and a code review. And new features come with new risks and new bugs.)
Of course! We don't expect Tor to suddenly start using QUIC because of a couple of emails. But I believe we do have something to argue for QUIC based on both theories and experimental results. We would probably make a formal, published argument soon.
I've given you credit for reporting this issue, please feel free to
provide your preferred name (or decline) on the ticket.
Thanks!
About the issue, I've checkout the 0.2.8 commit and tested on that. The problem is still there so I looked deeper into it. I've run it many time and it seems like once I start restricting path, it becomes undeterministic whether the bootstrap will succeed. And I think it might have something to do with the cache-microdesc-consensus file fetched by that client. Just for recap, I'm running a network with 11 nodes (2 relays) and 2 clients who have path restriction. My observations are:
- Each client will have a cache-microdesc-consensus file with 4 relays
in it. relay 0, 1 and 2 will always be there and the last one changes each time I start the network.
- When the all 3 nodes on the restricted path are on the cache-microdesc-consensus
file, the bootstrap will succeed quickly. For example, if my path is restricted to R2->R3->R1, since 0, 1 and 2 are always present in the consensus, whenever R3 is there, the bootstrap will work.
- When one of the node is not on the consensus, the bootstrap will be
stuck and never reach 100%. Depending on which node of the path is not included in the consensus, the error message varies. In the above example, if R3 is not in the consensus, we will fail to connect to hop 1 (assume 0-based logging).
- I waited for a long time (~30min) and nothing would improve:
consensus does not contain more nodes and bootstrap would still be stuck.
I think the root of the problem might be the consensus having too few nodes.. Is it normal for a cache-microdesc-consensus file to only have 4 nodes in a 11-node network? Should I look into how the code that generate the consensus?
The routerlist_t I mentioned is in routerlist.c, line 124.
124 http://192.168.1.14:8080/source/xref/tor/src/or/routerlist.c#124/** Global list of all of the routers that we know about. */125 http://192.168.1.14:8080/source/xref/tor/src/or/routerlist.c#125*static* routerlist_t http://192.168.1.14:8080/source/s?defs=routerlist_t&project=tor *routerlist http://192.168.1.14:8080/source/s?refs=routerlist&project=tor = NULL http://192.168.1.14:8080/source/s?defs=NULL&project=tor;
But now I think this probably just stores the same info as the cache-microdesc-consensus file, right?
Hmm, then it's likely a configuration issue with your network.
Shouldn't chutney also fail if it is a configuration issue? Or are you saying it's a configuration issue with my underlying network topology? The only thing different in the torrc files for the chutney run and the Emulab run is "Sandbox 1" and "RunAsDaemon 1" but I don't think they cause any issue?
Thanks! Li.
On Fri, May 6, 2016 at 6:18 PM, Tim Wilson-Brown - teor < teor2345@gmail.com> wrote:
On 7 May 2016, at 05:10, Xiaofan Li xli2@andrew.cmu.edu wrote:
Thanks for the replies!
- About the name:
Thanks for the headsup! We'll definitely pay attention to the trademark
rules when we publish our results. We are not planning to roll out our own version of Tor. I think our most important goal is probably: demonstrate a possibility of UDP-based protocol to solve some of TOR's hard performance problems. And hope that you guys would consider using it in future versions.
This leads me to a question about licensing: I believe TOR and QUIC
have different (conflicting) licenses. Would it even be a possibility that QUIC ever makes it into TOR?
I am not a lawyer or a software architect for Tor, so these are simply my opinions:
Here is the tor license: https://gitweb.torproject.org/tor.git/tree/LICENSE
Is QUIC under the Chromium license? I couldn't find a QUIC-specific one. https://chromium.googlesource.com/chromium/src/+/master/LICENSE
If so, the licenses look compatible to me, they're both BSD-style, and are almost identical.
What restrictions concern you?
On the architecture side:
I'm not sure if Tor is looking for alternative transport protocols like QUIC. One of the issues is that any modified client is easy to fingerprint. So, as with IPv6, we'd need relays to run QUIC and TCP in parallel for some time, then clients could optionally use QUIC when there were enough relays supporting it. Perhaps relays could open a QUIC UDP port on the same port as their TCP ORPort, and then advertise support in their descriptors. But TCP would remain the default for the foreseeable future.
For example, our IPv6 adoption is still at the stage where clients need to be explicitly configured to use it. (And parts of it are only coming out in 0.2.8.)
If your modifications don't work like this, then it would be very hard for us to adopt them. Even if they did, I don't know if they solve any pressing issues for us. (And we'd need both a theoretical security analysis, and a code review. And new features come with new risks and new bugs.)
... I also wonder why you need to use path restrictions at all.
3. For path restriction, we have our own implementation. We parse the config file and use the nicknames in the choose_good_*() functions to return the corresponding nodes. We have to use this because restricting the middle node is very important to us for testing the HOL blocking problem. (We have to manually create a 1-hop overlapping path for two clients and test the interference.)
- Regarding the issue, it's probably not entry guard problem, because:
- Shouldn't that give "failed to select hop 0" instead of hop 1?
Yes, you're right, in that message, hop counts are 0-based.
But the code is inconsistent in onion_extend_cpath:
  if (!info) {
    log_warn(LD_CIRC, "Failed to find node for hop %d of our path. Discarding "
             "this circuit.", cur_len);
    return -1;
  }
  log_debug(LD_CIRC, "Chose router %s for hop %d (exit is %s)",
            extend_info_describe(info), cur_len+1,
            build_state_get_exit_nickname(state));
The control spec says hops are 1-based, so we should fix the logging.
See: https://trac.torproject.org/projects/tor/ticket/18982
I've given you credit for reporting this issue, please feel free to provide your preferred name (or decline) on the ticket.
- I can see in our debugging log that we failed on the extending info
with the second node. The node returned by choose_good_middle_server is not NULL but the routerinfo_t pointer is NULL. Any idea why?
Perhaps you looked it up the wrong way, or it's not in the consensus. What code are you using to look up the node?
Are you using extend_info_from_node()? If not, please note that different fields are present depending on whether you use descriptors (ri) or microdescriptors (md).
My guess is that the consensus is a little short for some reason. How do I
validate this guess?
Read the plain-text cached-{microdesc-}consensus file in the tor client's data directory and check if the middle node is in it. Read the plain-text cached-{descriptors,microdescs} file in the tor client's data directory and check if the middle node is in it.
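If it helps, that check can be scripted. This is a hedged sketch (my own helper, not part of Tor): relay entries in a consensus document start with "r <nickname> ...", so collecting those lines gives the set of relays the client currently knows about. The sample below is a toy fragment, not a real consensus:

```python
# Sketch: list the relay nicknames present in a consensus document.
# Relay entries in the consensus start with "r <nickname> <identity> ...".
def relays_in_consensus(text):
    names = set()
    for line in text.splitlines():
        if line.startswith("r "):
            parts = line.split()
            if len(parts) >= 2:
                names.add(parts[1])
    return names

# Toy fragment standing in for a cached-microdesc-consensus file:
sample = """\
r R1 AAAAAAAAAAAAAAAAAAAAAAAAAAA 2016-05-08 01:00:00 10.0.0.1 5001 0
s Fast Running Valid
r R3 BBBBBBBBBBBBBBBBBBBBBBBBBBB 2016-05-08 01:00:00 10.0.0.3 5003 0
s Fast Running Valid
"""
print(sorted(relays_in_consensus(sample)))  # ['R1', 'R3']
```

Running something like this over each client's cached file after bootstrap would show exactly which of your 8 relays made it into the consensus.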
Does the global router list contain everything on the consensus?
I'm not sure exactly what you're referring to here, please provide a function or global variable name for this list.
- More observation on this issue:
For both tor and our tor, when I decrease the size of the network (i.e.
the number of relays in the network), the hanging issue resolves itself.
Hmm, then it's likely a configuration issue with your network.
I'll try rebase back to an official release today.
That might help, we are still fixing bugs in 0.2.8.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
On 8 May 2016, at 01:52, Xiaofan Li xli2@andrew.cmu.edu wrote:
About the issue, I've checked out the 0.2.8 commit and tested on that. The problem is still there, so I looked deeper into it. I've run it many times, and it seems that once I start restricting the path, whether the bootstrap succeeds becomes nondeterministic. I think it might have something to do with the cached-microdesc-consensus file fetched by that client. Just to recap, I'm running a network with 11 nodes (2 relays) and 2 clients who have path restriction.
As you said in your next email, this is meant to be "11 nodes (2 authorities)".
Please use an odd number of authorities, such as 3. If you have an even number of authorities, they can't break ties, and this can cause a consensus not to form, or perhaps to lose nodes.
My observations are:
• Each client will have a cached-microdesc-consensus file with 4 relays in it. Relays 0, 1 and 2 will always be there, and the last one changes each time I start the network.
• When all 3 nodes on the restricted path are in the cached-microdesc-consensus file, the bootstrap succeeds quickly. For example, if my path is restricted to R2->R3->R1, since 0, 1 and 2 are always present in the consensus, whenever R3 is there, the bootstrap will work.
• When one of the nodes is not in the consensus, the bootstrap gets stuck and never reaches 100%. Depending on which node of the path is missing from the consensus, the error message varies. In the above example, if R3 is not in the consensus, we fail to connect to hop 1 (assuming 0-based logging).
• I waited for a long time (~30 min) and nothing improved: the consensus did not gain more nodes and the bootstrap was still stuck.
I think the root of the problem might be the consensus having too few nodes. Is it normal for a cached-microdesc-consensus file to only have 4 nodes in an 11-node network? Should I look into the code that generates the consensus?
If you can't get all 11 relays in your consensus, you have a network configuration issue between those relays and the authorities, not a Tor code issue.
The routerlist_t I mentioned is in routerlist.c, line 124:

  124 /** Global list of all of the routers that we know about. */
  125 static routerlist_t *routerlist = NULL;
But now I think this probably just stores the same info as the cache-microdesc-consensus file, right?
Yes.
Hmm, then it's likely a configuration issue with your network.
Shouldn't chutney also fail if it is a configuration issue? Or are you saying it's a configuration issue with my underlying network topology?
It's one or the other. I really can't tell based on the information you've given. I'm just guessing.
The only thing different in the torrc files for the chutney run and the Emulab run is "Sandbox 1" and "RunAsDaemon 1" but I don't think they cause any issue?
They could, if your configuration asks them to access files that are blocked by the sandbox.
But it's far more likely that some of the relays are configured with the wrong addresses and ports (either in the torrc or in the OS), or aren't actually connected to your network properly at lower layers, such as TCP or IP or ethernet.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
On Sun, May 08, 2016 at 02:04:23AM -0400, Tim Wilson-Brown - teor wrote:
• Each client will have a cached-microdesc-consensus file with 4 relays in it. Relays 0, 1 and 2 will always be there and the last one changes each time I start the network.
Are all your relays on just a few IP addresses? If so, see the AuthDirMaxServersPerAddr config option.
If that doesn't do it, I'd suggest looking at more detailed logs at the authorities. Do they receive the relay descriptors from all the relays? How do their reachability tests go for each relay?
• When one of the nodes is not in the consensus, the bootstrap gets stuck and never reaches 100%. Depending on which node of the path is missing from the consensus, the error message varies. In the above example, if R3 is not in the consensus, we fail to connect to hop 1 (assuming 0-based logging).
If you try to extend to a relay that isn't in the consensus, then it's not surprising that the circuit will fail.
Speaking of which, you might be happier using the stem library and the "extendcircuit" controller command, rather than hacking the Tor code yourself. Once you explained that you had modified your Tor code in unspecified ways, the number of possible explanations for what's going wrong for you has become very large. :)
But it's far more likely that some of the relays are configured with the wrong addresses and ports (either in the torrc or in the OS), or aren't actually connected to your network properly at lower layers, such as TCP or IP or ethernet.
Yep -- this is certainly worth exploring too.
--Roger
On 8 May 2016, at 02:46, Roger Dingledine arma@mit.edu wrote:
On Sun, May 08, 2016 at 02:04:23AM -0400, Tim Wilson-Brown - teor wrote:
• Each client will have a cached-microdesc-consensus file with 4 relays in it. Relays 0, 1 and 2 will always be there and the last one changes each time I start the network.
Are all your relays on just a few IP addresses? If so, see the AuthDirMaxServersPerAddr config option.
You might also need to change AuthDirMaxServersPerAuthAddr.
If you've set TestingTorNetwork, it should set AuthDirMaxServersPerAddr and AuthDirMaxServersPerAuthAddr to 0 (unlimited). If you think you've set TestingTorNetwork, but you still only have 2 servers per IP, that's a clue that your settings aren't being applied to the Tor instances you're running.
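For reference, a minimal authority-side torrc fragment along these lines (a sketch of the settings just described; TestingTorNetwork already implies the unlimited per-IP caps, so the explicit lines below only matter if it is not set):

```
TestingTorNetwork 1
# Equivalent explicit settings if TestingTorNetwork is not used
# (0 means unlimited relays per IP address):
AuthDirMaxServersPerAddr 0
AuthDirMaxServersPerAuthAddr 0
```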
If that doesn't do it, I'd suggest looking at more detailed logs at the authorities. Do they receive the relay descriptors from all the relays? How do their reachability tests go for each relay?
Similarly, TestingTorNetwork sets AssumeReachable 1, which skips reachability self-checks on relays, and relay reachability checks on authorities.
Otherwise, relays self-check their ORPort and DirPort before uploading a descriptor. In 0.2.7.6 and 0.2.8.1-alpha, there's a bug where relays sometimes submit a 0 DirPort. See https://trac.torproject.org/projects/tor/ticket/18050
Relays check the ORPort and DirPort ports, on the torrc option Address (or an automatically guessed address), and not on their listen addresses. However, the listen addresses must be listening on, or redirected from, the configured Address. This is a frequent source of relay operator confusion. See https://trac.torproject.org/projects/tor/ticket/13953
If they can't reach their own ORPort and DirPort when they check it, relays won't post a descriptor.
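A hypothetical torrc illustrating that split (the addresses are placeholders): the self-test targets the advertised Address:port, so that endpoint must actually reach the listener:

```
Address 203.0.113.5
# Advertise the public address but listen on the machine's private one
# (a common NAT setup); 203.0.113.5:5001 must be forwarded to
# 10.0.0.5:5001 or the self-reachability test fails.
ORPort 203.0.113.5:5001 NoListen
ORPort 10.0.0.5:5001 NoAdvertise
```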
These are just guesses. We could help you more with this if you posted the relevant lines from relays and authorities that say what happens to the missing descriptors.
• When one of the nodes is not in the consensus, the bootstrap gets stuck and never reaches 100%. Depending on which node of the path is missing from the consensus, the error message varies. In the above example, if R3 is not in the consensus, we fail to connect to hop 1 (assuming 0-based logging).
If you try to extend to a relay that isn't in the consensus, then it's not surprising that the circuit will fail.
Speaking of which, you might be happier using the stem library and the "extendcircuit" controller command, rather than hacking the Tor code yourself. Once you explained that you had modified your Tor code in unspecified ways, the number of possible explanations for what's going wrong for you has become very large. :)
I am also quite concerned about the extra hardcoded path code you're using. It is so easy to modify tor code in ways that are subtly wrong. I have done it many times.
While we could help by reviewing this extra code, I'd encourage you to follow Roger's suggestion that you use the tested stem / tor controller code for building set paths. You'll likely write fewer lines of code this way, too.
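To make that concrete, here is a hedged sketch (the relay nicknames are placeholders from the earlier example): the control protocol's EXTENDCIRCUIT command with circuit ID 0 asks tor to build a new circuit through an explicit path, and stem's Controller.new_circuit() is a thin wrapper around it.

```python
# Sketch: compose the EXTENDCIRCUIT control-port command for a fixed
# path, instead of patching choose_good_*() inside Tor itself.
# Per the control spec, circuit ID 0 means "build a new circuit".
def extendcircuit_cmd(path, circ_id=0):
    return "EXTENDCIRCUIT %d %s" % (circ_id, ",".join(path))

# The restricted R2->R3->R1 path from the earlier example:
print(extendcircuit_cmd(["R2", "R3", "R1"]))  # EXTENDCIRCUIT 0 R2,R3,R1
```

With stem, roughly `controller.new_circuit(["R2", "R3", "R1"], await_build=True)` against the client's ControlPort issues the equivalent command and waits for the circuit to finish building, with no Tor source changes.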
But it's far more likely that some of the relays are configured with the wrong addresses and ports (either in the torrc or in the OS), or aren't actually connected to your network properly at lower layers, such as TCP or IP or ethernet.
Yep -- this is certainly worth exploring too.
--Roger
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
Hi, We'll be looking into the configurations soon. Thanks for the advice!
Today we handed in our project report on QUIC + TOR, which concluded our semester-long project as well as our journey as undergrads here at CMU. I just want to thank everyone who helped us during this time, since I've asked many questions and not all of them made sense. Special thanks to Tim for his knowledge and patience! With your help, we were able to accomplish a lot in the short few months.
After graduating next week, Kevin (my project partner) and I will both move to California for our first job! I'll still keep an eye on Tor and contribute if opportunity allows. As for this project, we will release all the code in a few months after publication submissions. Meanwhile, I will still work, although less frequently, on the project for possible improvements. We hope you'll take a look at our ideas once it becomes available!
Thanks again! Li.
On Sun, May 8, 2016 at 8:42 AM, Tim Wilson-Brown - teor teor2345@gmail.com wrote: