(Sorry for cross-posting, but I think this is a topic for tor-dev@, not tor-talk@. If you agree, please reply on tor-dev@ only. tor-talk@ people can follow the thread here:
https://lists.torproject.org/pipermail/tor-dev/2013-June/thread.html)
On 6/6/13 7:32 PM, Norman Danner wrote:
I have two questions regarding a possible research project.
First, the research question: can one use machine-learning techniques to construct a model of Tor client behavior? Or in a more general form: can one use <fill-in-the-blank> to construct a model of Tor client behavior? A student of mine did some work on this over the last year, and the results are encouraging, though not strong enough to do anything with yet.
Second, the meta-question: is it worthwhile to answer the first question? It seems to me that if the answer to the first question is "yes," then the solution could be used to (at least) provide better simulations of Tor (e.g., via Shadow or ExperimenTor). This possibly naive thought would imply that the answer to the second question is "yes."
I'd be interested to hear responses to my second question, either validating my naive thought or explaining why the first question isn't worth answering. I'd accept responses to my first question, too, in case this has already been done.
Hi Norman,
yes, it's worthwhile to answer this question! I can imagine how at least Shadow and the Tor path generator would benefit from better client models. User number estimates on the metrics website might benefit from them, too.
I found two tickets where we asked similar questions before, and maybe there are more tickets like these:
https://trac.torproject.org/projects/tor/ticket/2963
https://trac.torproject.org/projects/tor/ticket/6295
Some very early thoughts:
- How do we make sure that we ask a representative set of people to instrument their clients and export data on their usage behavior? If we only ask people who read their favorite news site twice per day, our client model will be just that, but not representative for all Tor users. (Still, we would know more than we know now.)
- Can we somehow aggregate usage information enough to make it safe for people to send actual usage reports to us? I could imagine having a torrc flag that is disabled by default and that, when enabled, writes sanitized usage information to disk. For this we need a very good idea what we're planning to do with the data, and we'll need to specify the aggregation approach in a tech report and get it reviewed by the community.
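To make the aggregation idea a bit more concrete, here's a rough Python sketch (purely illustrative; the bucket boundaries and the suppression threshold are made up, not anything Tor actually does): only coarse bucket counts would ever be written to disk, and buckets seen too rarely would be suppressed so no single user's behavior stands out.

```python
# Illustrative sketch only: aggregate per-circuit cell counts into
# coarse buckets before anything is written to disk, so that no
# individual circuit's exact activity appears in the report.
from collections import Counter

# Hypothetical bucket boundaries (inclusive ranges).
BUCKETS = [(0, 0), (1, 100), (101, 10_000), (10_001, float("inf"))]

def bucket_label(cells):
    """Map an exact cell count to a coarse bucket label."""
    for lo, hi in BUCKETS:
        if lo <= cells <= hi:
            return f"{lo}-{hi}"
    raise ValueError(cells)

def sanitized_report(circuit_cell_counts, min_count=5):
    """Histogram of bucket labels; buckets seen fewer than min_count
    times are dropped to avoid singling out rare behavior."""
    hist = Counter(bucket_label(c) for c in circuit_cell_counts)
    return {label: n for label, n in hist.items() if n >= min_count}
```

The suppression threshold is the crude part; a real design would need the tech-report treatment mentioned above.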
Are your student's results available somewhere?
Best, Karsten
On 6/7/13 2:37 AM, Karsten Loesing wrote:
(Sorry for cross-posting, but I think this is a topic for tor-dev@, not tor-talk@. If you agree, please reply on tor-dev@ only. tor-talk@ people can follow the thread here:
OK, following up on tor-dev. I thought tor-dev might be the more appropriate list, but the description for tor-talk is "all discussion about theory, design, and development of Onion Routing," whereas that for tor-dev is "discussion regarding Tor development." Maybe since I think more about theory than code, the former description seemed more applicable...
[snip]
Are your student's results available somewhere?
The written portion of my student Julian Applebaum's Senior thesis is available at
http://wesscholar.wesleyan.edu/etd_hon_theses/1042/
Our focus in this project (which I left intentionally vague in my posting) was to try to model clients at a very high level. Specifically, we wanted to see if we could model something like the timing patterns of the Tor cells that clients send to the network. An intended application (not yet completely thought through...) would be to use such information to get a more accurate sense of how well timing attacks work, by deploying them (in simulation) against presumably realistic clients. Our strategy was roughly:
* Instrument our guard node to record cell arrival times from clients pseudonymously (i.e., we know when two different cells belong to the same circuit, but we only record circuits as A, B, etc.).
* Record such data for a short period of time.
* Represent each circuit as a time series.
* Cluster the collection of time series using Markov model clustering techniques.
The intent is that each cluster (represented by a single hidden Markov model) represents a "type" of client, even though we don't know for sure what that client type does. We can make some guesses about some: the "type" of steady high-volume cell counts is probably a bulk downloader; the "type" of steady zero cell counts is probably an unused circuit; etc. But in some sense, I'm thinking that what counts is the behavior of the client, not the reason for that behavior. We don't have to instrument clients for this. Of course, then one has to ask whether this kind of modeling is in fact useful. It is somewhat different than what you are envisioning, I think.
There are about a billion variations (at last count) on this theme. We chose one particular one as a test case to play with the methodology. I think the methodology is mostly OK, though I'm not completely satisfied with the results of the particular variation Julian worked on. So now I'm trying to figure out whether to push this forward and in particular what directions and end goals would be useful.
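To give a flavor of the pipeline, here is an illustrative Python sketch (not Julian's actual code; the three-state discretization, the thresholds, and the greedy clustering are simplifications of the Markov-model clustering described above): discretize each circuit's per-interval cell counts into states, fit a first-order transition matrix per circuit, and group circuits whose matrices are close.

```python
# Illustrative sketch of the general idea, not our actual code.
import math

STATES = 3  # 0 = idle, 1 = low traffic, 2 = high traffic (assumed)

def discretize(counts, low=10):
    """Map per-interval cell counts to coarse states."""
    return [0 if c == 0 else (1 if c <= low else 2) for c in counts]

def transition_matrix(states):
    """Row-normalized first-order Markov transition estimates."""
    m = [[0.0] * STATES for _ in range(STATES)]
    for a, b in zip(states, states[1:]):
        m[a][b] += 1.0
    for row in m:
        total = sum(row)
        if total:
            for j in range(STATES):
                row[j] /= total
    return m

def distance(m1, m2):
    """Euclidean distance between flattened transition matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))

def cluster(circuits, threshold=0.5):
    """Greedy clustering: assign each circuit's matrix to the first
    cluster whose representative matrix is within `threshold`."""
    reps, clusters = [], []
    for counts in circuits:
        m = transition_matrix(discretize(counts))
        for i, rep in enumerate(reps):
            if distance(m, rep) <= threshold:
                clusters[i].append(counts)
                break
        else:
            reps.append(m)
            clusters.append([counts])
    return clusters
```

The actual work used hidden Markov models and proper model-based clustering; this just shows why similarly-behaving circuits end up grouped together.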
- Norman
On 6/7/13 8:04 PM, Norman Danner wrote:
On 6/7/13 2:37 AM, Karsten Loesing wrote:
[snip]
OK, following up on tor-dev. I thought tor-dev might be the more appropriate list, but the description for tor-talk is "all discussion about theory, design, and development of Onion Routing," whereas that for tor-dev is "discussion regarding Tor development." Maybe since I think more about theory than code, the former description seemed more applicable...
I haven't looked at list descriptions for a while, but tor-dev@ has had discussions similar to this one, whereas tor-talk@ hasn't. Consider this decision for tor-dev@ the result of human-learning techniques, rather than an evaluation of list descriptions. ;)
[snip]
Our focus in this project (which I left intentionally vague in my posting) was to try to model clients at a very high level. Specifically, we wanted to see if we could model something like the timing patterns of the Tor cells that clients send to the network. An intended application (not yet completely thought through...) would be to use such information to get a more accurate sense of how well timing attacks work, by deploying them (in simulation) against presumably realistic clients. Our strategy was roughly:
- Instrument our guard node to record cell arrival times from clients pseudonymously (i.e., we know when two different cells belong to the same circuit, but we only record circuits as A, B, etc.).
- Record such data for a short period of time.
- Represent each circuit as a time series.
- Cluster the collection of time series using Markov model clustering techniques.
The intent is that each cluster (represented by a single hidden Markov model) represents a "type" of client, even though we don't know for sure what that client type does. We can make some guesses about some: the "type" of steady high-volume cell counts is probably a bulk downloader; the "type" of steady zero cell counts is probably an unused circuit; etc. But in some sense, I'm thinking that what counts is the behavior of the client, not the reason for that behavior. We don't have to instrument clients for this. Of course, then one has to ask whether this kind of modeling is in fact useful. It is somewhat different than what you are envisioning, I think.
There are about a billion variations (at last count) on this theme. We chose one particular one as a test case to play with the methodology. I think the methodology is mostly OK, though I'm not completely satisfied with the results of the particular variation Julian worked on. So now I'm trying to figure out whether to push this forward and in particular what directions and end goals would be useful.
Interesting stuff! You're indeed taking a different approach than I was envisioning by gathering data on a single guard rather than on a set of volunteering clients. Both approaches have their pros and cons, but I think your approach can lead to some interesting results and can be done in a privacy-preserving fashion.
Two thoughts:
- I could imagine that your results are quite valuable for modeling better Shadow/ExperimenTor clients or for deriving better client models for Tor path simulators. Maybe Julian's thesis already has some good data for that, or maybe we'll have to repeat the experiment in a slightly different setting. I'm cc'ing Rob (the Shadow author) and Aaron (working on a path simulator) to make sure they saw this thread. I can help by reviewing code changes to Tor to make sure data is gathered in a privacy-preserving way, and I'd appreciate it if those code changes were made public together with analysis results.
- It might be interesting to observe how Tor usage changes over time. Maybe the research experiment leads to a set of classifiers telling us when a circuit is most likely used for bulk downloads, used for web browsing, used for IRC, unused, or whatever. We could then extend circuit statistics to have all relays report aggregate data of how circuits can be classified. Requires a proposal and code, but I could help with those.
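To sketch what such a classifier-plus-aggregation scheme might look like (toy Python; the features and thresholds are invented for illustration, and a real classifier would come out of the research experiment): each relay would classify its circuits locally and publish only per-label totals.

```python
# Toy sketch: label each circuit from simple features of its
# cell-count time series, then report only aggregate label counts.
# The thresholds below are invented, not derived from real data.
from collections import Counter

def classify(counts):
    total = sum(counts)
    active = sum(1 for c in counts if c > 0)
    if total == 0:
        return "unused"
    if total / len(counts) > 100:      # sustained high volume
        return "bulk"
    if active / len(counts) < 0.3:     # short bursts, mostly idle
        return "web"
    return "interactive"               # e.g. IRC: low but steady

def relay_report(circuits):
    """What a relay would publish: only per-label totals."""
    return dict(Counter(classify(c) for c in circuits))
```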
Fun stuff!
Best, Karsten
On 6/10/13 4:40 AM, Karsten Loesing wrote:
[snip]
Two thoughts:
- I could imagine that your results are quite valuable for modeling better Shadow/ExperimenTor clients or for deriving better client models for Tor path simulators. Maybe Julian's thesis already has some good data for that, or maybe we'll have to repeat the experiment in a slightly different setting. I'm cc'ing Rob (the Shadow author) and Aaron (working on a path simulator) to make sure they saw this thread. I can help by reviewing code changes to Tor to make sure data is gathered in a privacy-preserving way, and I'd appreciate it if those code changes were made public together with analysis results.
I'm in the process of rewriting the data collection code, and will e-mail later with some of the details. But maybe off-list initially, as I think the first few passes will be very special-purpose and hence not of general interest (though I'm happy to discuss it more publicly if that's more appropriate).
Right now I'm considering focusing on trying to get a reasonable (partial) answer to the following question: how well do various timing-analysis attacks actually work? That is, how well do they work when the client model is "accurate"? I'm not even sure exactly how to define "accurate," though I can think of at least a few different ways. But I'm hoping that by focusing on a relatively narrow question, we can carve out manageable chunks of the larger questions: what kinds of data can reasonably be collected, and how can we use that data for other purposes?
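As a concrete example of the kind of attack I'd want to evaluate, here's a toy end-to-end correlation sketch in Python (illustrative only; real timing attacks are considerably more sophisticated): an adversary observing cell-count time series at both ends matches each entry-side series to the exit-side series it correlates best with, and the attack's accuracy against ground truth is one way to score it under a given client model.

```python
# Toy end-to-end timing correlation: match entry-side time series
# to exit-side time series by Pearson correlation.
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def match_flows(entry_series, exit_series):
    """For each entry-side series, pick the index of the
    best-correlated exit-side series."""
    return [max(range(len(exit_series)),
                key=lambda j: correlation(e, exit_series[j]))
            for e in entry_series]
```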
- It might be interesting to observe how Tor usage changes over time. Maybe the research experiment leads to a set of classifiers telling us when a circuit is most likely used for bulk downloads, used for web browsing, used for IRC, unused, or whatever. We could then extend circuit statistics to have all relays report aggregate data of how circuits can be classified. Requires a proposal and code, but I could help with those.
Yes, I can see a number of longer-range applications like this. I'm not sure I want to think about proposals and code just yet.
- Norman
Continuing this discussion of client behavior simulation...
I'm in the process of rewriting the data collection code...
One thing I need to do is make a reasonable guess as to whether a given connection is from a client. Is there a straightforward way to do that programmatically? As a first pass, I'd even take "isn't a known relay/authority/etc."
I've been poking through the source code, and I assume I'll find something appropriate eventually. But I wouldn't mind a shortcut...
- Norman
On 6/26/13 5:59 PM, Norman Danner wrote:
Continuing this discussion of client behavior simulation...
I'm in the process of rewriting the data collection code...
One thing I need to do is make a reasonable guess as to whether a given connection is from a client. Is there a straightforward way to do that programmatically? As a first pass, I'd even take "isn't a known relay/authority/etc."
I've been poking through the source code, and I assume I'll find something appropriate eventually. But I wouldn't mind a shortcut...
This code looks related:
  /* only report it to the geoip module if it's not a known router */
  if (!router_get_by_id_digest(chan->identity_digest)) {
    if (channel_get_addr_if_possible(chan, &remote_addr)) {
      geoip_note_client_seen(GEOIP_CLIENT_CONNECT, &remote_addr, now);
https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/channel.c#l2379
Best, Karsten