(Sorry for cross-posting, but I think this is a topic for tor-dev@, not tor-talk@. If you agree, please reply on tor-dev@ only. tor-talk@ people can follow the thread here:
https://lists.torproject.org/pipermail/tor-dev/2013-June/thread.html)
On 6/6/13 7:32 PM, Norman Danner wrote:
I have two questions regarding a possible research project.
First, the research question: can one use machine-learning techniques to construct a model of Tor client behavior? Or in a more general form: can one use <fill-in-the-blank> to construct a model of Tor client behavior? A student of mine did some work on this over the last year, and the results are encouraging, though not strong enough to do anything with yet.
Second, the meta-question: is it worthwhile to answer the first question? It seems to me that if the answer to the first question is "yes," then the solution could be used to (at least) provide better simulations of Tor (e.g., via Shadow or ExperimenTor). This possibly naive thought would imply that the answer to the second question is "yes."
I'd be interested to hear responses to my second question, either validating my naive thought or explaining why the first question isn't worth answering. I'd accept responses to my first question, too, in case this has already been done.
Hi Norman,
Yes, it's worthwhile to answer this question! I can imagine how at least Shadow and the Tor path generator would benefit from better client models. User number estimates on the metrics website might benefit from them, too.
I found two tickets where we asked similar questions before, and maybe there are more tickets like these:
https://trac.torproject.org/projects/tor/ticket/2963
https://trac.torproject.org/projects/tor/ticket/6295
Some very early thoughts:
- How do we make sure that we ask a representative set of people to instrument their clients and export data on their usage behavior? If we only ask people who read their favorite news site twice per day, our client model will be just that, and not representative of all Tor users. (Still, we would know more than we know now.)
- Can we somehow aggregate usage information enough to make it safe for people to send actual usage reports to us? I could imagine having a torrc flag that is disabled by default and that, when enabled, writes sanitized usage information to disk. For this we need a very good idea of what we're planning to do with the data, and we'll need to specify the aggregation approach in a tech report and get it reviewed by the community.
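To make the aggregation idea a bit more concrete, here is a minimal sketch of one possible approach, not a proposal for the actual format. It assumes the client would log only a timestamp and a destination port per stream, counts streams per hour and port, and rounds each count up to a multiple of a bin size. The names `sanitize` and `round_up`, the `BIN_SIZE` of 8, and the one-hour interval are all hypothetical choices for illustration; none of this exists in tor, and whether such binning is actually safe is exactly what the tech report would have to argue.

```python
from collections import Counter

# Hypothetical granularity: every reported count is a multiple of this.
BIN_SIZE = 8


def round_up(count, bin_size=BIN_SIZE):
    """Round a positive count up to the next multiple of bin_size."""
    return ((count + bin_size - 1) // bin_size) * bin_size


def sanitize(events, interval_seconds=3600):
    """Aggregate raw stream events into coarse, binned counts.

    events: iterable of (unix_timestamp, destination_port) tuples
            (an assumed input format; destinations are never recorded).
    Returns a dict mapping (interval_start, port) to a binned count.
    """
    counts = Counter()
    for timestamp, port in events:
        # Truncate the timestamp to the start of its interval.
        interval_start = int(timestamp) - int(timestamp) % interval_seconds
        counts[(interval_start, port)] += 1
    return {key: round_up(n) for key, n in counts.items()}


if __name__ == "__main__":
    example = [(0, 443), (10, 443), (3599, 80), (3600, 443)]
    for (start, port), n in sorted(sanitize(example).items()):
        print(start, port, n)
```

Rounding up hides small differences between users with similar activity, at the cost of overstating low counts; a real design would also have to consider how many intervals are reported and what an observer can infer across them.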
Are your student's results available somewhere?
Best, Karsten