GoSC - Website Fingerprinting project - tor-dev

List overview All Threads
Download

newer

GoSC - Website Fingerprinting project

older

Tor Browser Weekly IRC meetings...

Re: [tor-dev] Torbirdy

Marc Juarez

11 Mar 2014 11 Mar '14

4:46 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Lunar:

...

Have you read Mike Perry's long blog post on the topic? https://blog.torproject.org/blog/critique-website-traffic-fingerprinting-att...

It outlines future research work in evaluating the efficiency of fingerprinting attacks, and also mention a couple of promising defenses.

Yes, I am aware of it and I'm currently working on a study to evaluate the efficiency of these attacks.

As Mike Perry said in the post, most of the attacks give an unrealistic advantage to the adversary and probably countermeasures work much better than what has been shown so far.

However, some of the results of these articles suggest that there exist coarse-grained traffic features that are invariant to randomized pipelines (RP, SPDY) and thus can still identify web pages (Dyer et. al.). Also, edit-distance based classifiers broke some old versions of the RP implemented in Tor Browser.

It's an open problem to see if these features actually uniquely identify web pages in larger worlds than the ones considered in the literature. In any case, link-padding strategies are specially designed to conceal these features with the minimal amount of cover traffic and are becoming affordable in terms of bandwidth.

The project I propose would be directed to address this bug ticket:

https://trac.torproject.org/projects/tor/ticket/7028

For example, I would like to implement the common building blocks for link-padding countermeasures (such as a "traffic generator controller" in the onion proxy and the entry guard).

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTHz3kAAoJEGfJ5xfgazlxOwoH/ilP01mjxyyeuPbko/xkKGfw 5gMJGFa0m4StabJ1bhbqJWF2jsu6W2SBCVCqea0emvOvab0WzhuvQvzQb2obZpzr NReUhhiGRhj4xN+2yAL3Vh8zoyZigKtSQPoykwa/AFG7TvIGppHZb0S5ii7Hk/Yf UtwO3Kt3YCGJT3FoKEJQgN2wz82Skbo4Xiw0bknemPFmIkxlxyplkkD0D34OdEYx YPPxRBMuEnpa1H5H3bPMXK55kBP91qS5fg23hsN6Gksy45yhLeiOJZBi0n3nlHD2 xtp4sWdQhTUDFP85kSmbzW/eufuc4WAI6k5KeZtc5YrmIl9eOnAu+tG8Ghu4Vpc= =Rf6n -----END PGP SIGNATURE-----

Show replies by date

Mike Perry

19 Mar 19 Mar

2:30 a.m.

Marc Juarez:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Lunar:

...
Have you read Mike Perry's long blog post on the topic? https://blog.torproject.org/blog/critique-website-traffic-fingerprinting-att...

It outlines future research work in evaluating the efficiency of fingerprinting attacks, and also mention a couple of promising defenses.

Yes, I am aware of it and I'm currently working on a study to evaluate the efficiency of these attacks.

As Mike Perry said in the post, most of the attacks give an unrealistic advantage to the adversary and probably countermeasures work much better than what has been shown so far.

However, some of the results of these articles suggest that there exist coarse-grained traffic features that are invariant to randomized pipelines (RP, SPDY) and thus can still identify web pages (Dyer et. al.). Also, edit-distance based classifiers broke some old versions of the RP implemented in Tor Browser.

It's an open problem to see if these features actually uniquely identify web pages in larger worlds than the ones considered in the literature. In any case, link-padding strategies are specially designed to conceal these features with the minimal amount of cover traffic and are becoming affordable in terms of bandwidth.

The project I propose would be directed to address this bug ticket:

https://trac.torproject.org/projects/tor/ticket/7028

For example, I would like to implement the common building blocks for link-padding countermeasures (such as a "traffic generator controller" in the onion proxy and the entry guard).

This sounds like a good summer-sized amount of work. I think I am in agreement with George that pluggable transports are a good place to start for prototyping this work. That way, you can experiment with custom padding protocols easily, without needing to make invasive changes to tor-core for each revision, each time.

For example, it would be neat to be able to transmit a set of statistics to your bridge node during the connection handshake or with the circuit setup, so that you don't have to always request downstream padding cells with a upstream cell, and downstream padding can asynchronously arrive according to some probability or histrogram distribution you specify.

You could also obviously specify a number of cells to send in response to a padding cell request (from O..N, where N is some reasonable cap similar to a largeish web object size). The current Tor link padding protocol supports neither of these operations.

More advanced padding protocols are also possible, but may also be overkill. We can discuss those further if this sounds interesting. I'd also like to hear any ideas you might have on the design and/or implementation of such a protocol.

Related: Do you happen to have any existing classifier code working already, by any chance?

One of the ideas I've been considering is taking a closer look at the nearest-neighbor edit distances between page class labels, for the edit distance based classifiers. This distance provides us with an estimate of the ideal minimum cover traffic we will need to make testing instances jump from one nearest-neighbor label to another (causing a false positive). It will also decrease as the world size increases (more class labels in the same amount of N-dimensional space).

A successful defense should change of the distribution of edit distances of test instances around their class labels (it will increase the intra-class variance) and this in turn will increase the size of the threshold around class labels for a given accuracy rate, reducing accuracy or increasing false positives.

It may also be the case that low or no cost defenses (like a smarter use of SPDY) do this, too, but we'll be able to see it for sure with padding.

Does this make sense?

-- Mike Perry

Kevin P Dyer

3:25 a.m.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry mikeperry@torproject.orgwrote:

...

[snip] Related: Do you happen to have any existing classifier code working already, by any chance?

If It helps, the code [2] from our website fingerprinting paper [1] is public. It includes the edit-distance classifier [3] from [4], which wasn't reported on in [1], I believe.

-Kevin

[1] https://kpdyer.com/publications/oakland2012-peekaboo.pdf [2] https://github.com/kpdyer/website-fingerprinting [3] https://github.com/kpdyer/website-fingerprinting/blob/master/classifiers/ESO... [4] http://dl.acm.org/citation.cfm?id=1888898

Marc Juarez

11:01 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Hi everyone,

Thanks for the answers. Some inline comments below.

On 03/12/2014 09:25 PM, Roger Dingledine wrote:

...

It sounds like the implementation might be the easy part, compared to the design part. And since GSoC is mostly about implementation, and we would want to be confident you won't spend all summer distracted by design, it would be best if your proposal includes a lot of the design.

Yes, the project has an important research component that is still ongoing. I guess this proposal is not well suited for a GSoC project. Still, I am very interested in getting involved in the Tor dev community and specially in the implementation of website fingerprinting countermeasures. So, I would like to volunteer for this while I continue with the research.

On 03/12/2014 09:25 PM, Roger Dingledine wrote:

...

It seems like some of the approaches would best be done inside Tor (as modifications to the Tor program), and some of them would best be done in a separate pluggable transport? Or should they all be done in a PT? Can the bandwidth shaping in Scramblesuit (obfsproxy) be used as a building block here?

On Wed 12 Mar 2014 11:14:06 PM CET, George Kadianakis wrote:

...

If we ever wanted to deploy these anti-traffic-analysis PTs to the whole network, we would have to add PT support to all clients (HSes might also benefit from this) and to all relays.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...

This sounds like a good summer-sized amount of work. I think I am in agreement with George that pluggable transports are a good place to start for prototyping this work. That way, you can experiment with custom padding protocols easily, without needing to make invasive changes to tor-core for each revision, each time.

Yes, PTs look like a good starting point. However, I think that for a final countermeasure it is required to pad until the middle-node (to frustrate and adversary who controls the guard) and, as George said, I guess deeper modifications will be needed in this case.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...

For example, it would be neat to be able to transmit a set of statistics to your bridge node during the connection handshake or with the circuit setup, so that you don't have to always request downstream padding cells with a upstream cell, and downstream padding can asynchronously arrive according to some probability or histrogram distribution you specify.

You could also obviously specify a number of cells to send in response to a padding cell request (from O..N, where N is some reasonable cap similar to a largeish web object size). The current Tor link padding protocol supports neither of these operations.

Definitely. These are the type of building blocks I was referring to. Any padding-based countermeasure would benefit of the modifications you proposed. Something that I think it will be needed for any smart defense that tries to minimize the overhead is some regularly updated database of webpage traces. Since websites change constantly and we should assume that the adversary will train on the most recent data, the defense has to act accordingly.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...

Related: Do you happen to have any existing classifier code working already, by any chance?

Yes. We are using a modified version of Tao’s classifier (edit-distance based SVM) for the research project.

On 03/19/2014 04:25 AM, Kevin P Dyer wrote:

...

If It helps, the code [2] from our website fingerprinting paper [1] is public. It includes the edit-distance classifier [3] from [4], which wasn't reported on in [1], I believe.

Thanks, Kevin. In particular, we are using the Damerau-Levenshtein distance instead of the plain Levenshtein, which also takes into account transpositions. I think it describes better network traffic, specially when RP is used.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...

One of the ideas I've been considering is taking a closer look at the nearest-neighbor edit distances between page class labels, for the edit distance based classifiers. This distance provides us with an estimate of the ideal minimum cover traffic we will need to make testing instances jump from one nearest-neighbor label to another (causing a false positive). It will also decrease as the world size increases (more class labels in the same amount of N-dimensional space).

A successful defense should change of the distribution of edit distances of test instances around their class labels (it will increase the intra-class variance) and this in turn will increase the size of the threshold around class labels for a given accuracy rate, reducing accuracy or increasing false positives.

Yes, that sounds very interesting. I guess something relevant in such a scheme is the distribution used to pick the web page in the k-NN cluster that will be used to generate the cover traffic. It would be interesting to see what is the distribution that minimizes the overhead introduced by this cover traffic while preventing statistical attacks.

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTKXj1AAoJEGfJ5xfgazlxru4IAKPVJ2iCfEZCvV9g7ufTpQff hcaav5taSme2bvB8rdMYsr70sHzqJL/3XwJ05duf7G/bCYhuHrJvq6Ci6GAPhe0q rDUMzkQOXLoZ9GvhzEbb+N/3EXahsW0W/mUjSxBvfxQLqk4I/13x6LhkSRvs7Ibv 1dj4D32VVPZF5kWTvztKZoFXtqkt6LTZDkak1j1vN2h5Vdriu7NwERiGBYGBIl1l LW44vPGu5qCxg/VPUIki0Te6LLqYLEfcUBhVTOBThE2RmBu4iCiivBi9seWeqXL8 /sh5Fgr2NnndrwYG0FDDY8GTJJ/unKJCqN2cX6WgNfox/2N43xAq3es6732Igew= =07zn -----END PGP SIGNATURE-----

Ian Goldberg

12:13 p.m.

On Wed, Mar 19, 2014 at 12:01:09PM +0100, Marc Juarez wrote:

...

Hi everyone,

Thanks for the answers. Some inline comments below.

You may also be interested in our new tech report:

T. Wang, X. Cai, R. Nithyanand, R. Johnson and I. Goldberg Effective Attacks and Provable Defenses for Website Fingerprinting CACR 2014-05 http://cacr.uwaterloo.ca/techreports/2014/cacr2014-05.pdf

- Ian

Marc Juarez

20 Mar 20 Mar

3:20 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On 03/19/2014 01:13 PM, Ian Goldberg wrote:

...

You may also be interested in our new tech report:

T. Wang, X. Cai, R. Nithyanand, R. Johnson and I. Goldberg Effective Attacks and Provable Defenses for Website Fingerprinting CACR 2014-05 http://cacr.uwaterloo.ca/techreports/2014/cacr2014-05.pdf

Thank you, Ian. I'll read it.

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTKwdSAAoJEGfJ5xfgazlxUdkH/2MG9EzklnRrGVrAZfA9G1RA ZYtCsVErZw2d+iH+Vu/hFIwPPSjrk7Kwp/dk9pvI5oKigEmOWaRgwcx7hqXgbcZ2 ruf4QkqD064HT4b5UwmliWSpQut1e0dQ06EtGvvbzy1Mzy7cnBtACdzLeFM+NaI5 OiPpEKkb3pYJAIJIms63JfFjf0VWb5G+OSjpYXu+9OqWIci/cRpVMjwvrWz+bKi5 pxib7wWeI1V5c56WDa/g0aBJ47cHeGpY7zOgmkhZ9IqUBV+4uXrOeQIfWAA8iP6F Fw/s8/C4Il4iSO0TfooHrCMWI5yG8QX5B7Wos5c9+T+IZbwjYEe/Y4GlrUVV+0o= =8yhX -----END PGP SIGNATURE-----

Mike Perry

19 Mar 19 Mar

5:22 p.m.

Marc Juarez:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Hi everyone,

Thanks for the answers. Some inline comments below.

On 03/12/2014 09:25 PM, Roger Dingledine wrote:

...
It sounds like the implementation might be the easy part, compared to the design part. And since GSoC is mostly about implementation, and we would want to be confident you won't spend all summer distracted by design, it would be best if your proposal includes a lot of the design.

Yes, the project has an important research component that is still ongoing. I guess this proposal is not well suited for a GSoC project. Still, I am very interested in getting involved in the Tor dev community and specially in the implementation of website fingerprinting countermeasures. So, I would like to volunteer for this while I continue with the research.

I disagree.

I think prototyping a solution is a great GSoC project, assuming you can begin work on the prototype immediately and work on it during the summer. We have a fairly clear idea what we think a padding protocol should look like (at least for an initial prototype), and if you keep the scope narrow, you should be able to generate running code as a pluggable transport done during the summer project portion, at the very least.

If it ends up too easy, you can always start the work on proposing and transferring your padding protocol to tor-core itself, but I still think prototyping in PT land first is a wise move.

I would consider mentoring such an app (at the very least as a secondary mentor), and I suspect other people on the pluggable transport side would as well.

I still encourage you to at least submit an application and see what happens, but keep it focused on the coding side.

I look forward to seeing a GSoC proposal from you, and keep in mind that the deadline is Friday. We are also allowed to ask you for updates and clarifications to your proposal all the way up until April 18th.

...

On 03/12/2014 09:25 PM, Roger Dingledine wrote:

...
It seems like some of the approaches would best be done inside Tor (as modifications to the Tor program), and some of them would best be done in a separate pluggable transport? Or should they all be done in a PT? Can the bandwidth shaping in Scramblesuit (obfsproxy) be used as a building block here?

On Wed 12 Mar 2014 11:14:06 PM CET, George Kadianakis wrote:

...
If we ever wanted to deploy these anti-traffic-analysis PTs to the whole network, we would have to add PT support to all clients (HSes might also benefit from this) and to all relays.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...
This sounds like a good summer-sized amount of work. I think I am in agreement with George that pluggable transports are a good place to start for prototyping this work. That way, you can experiment with custom padding protocols easily, without needing to make invasive changes to tor-core for each revision, each time.

Yes, PTs look like a good starting point. However, I think that for a final countermeasure it is required to pad until the middle-node (to frustrate and adversary who controls the guard) and, as George said, I guess deeper modifications will be needed in this case.

Right. I suggested PTs because they will be easy to reign in scope and rapid prototype. I am also interested in padding to the middle node, but be aware that in both cases (Guard and middle), your defense should be tunable in terms of the capacity it consumes. It will be possible (and likely) that we'll have consensus parameters that specify how much traffic is available for padding to both Guard and Middle node, when we roll this out.

I actually missed George's concerns about having to roll out PTs to all nodes. I don't think that really matters or is a concern. If the PT design looks good and functions well (especially if you have a working classifier), it will be much easier to make a convincing argument to deploy the mechanisms in tor-core, and will also likely make that implementation work smoother and more focused on exactly what you discover you do need from the prototype.

If it turns out to be too easy, you can begin work on implementing (or at least proposing) your mechanisms in tor-core.

...

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...
For example, it would be neat to be able to transmit a set of statistics to your bridge node during the connection handshake or with the circuit setup, so that you don't have to always request downstream padding cells with a upstream cell, and downstream padding can asynchronously arrive according to some probability or histrogram distribution you specify.

You could also obviously specify a number of cells to send in response to a padding cell request (from O..N, where N is some reasonable cap similar to a largeish web object size). The current Tor link padding protocol supports neither of these operations.

Definitely. These are the type of building blocks I was referring to. Any padding-based countermeasure would benefit of the modifications you proposed. Something that I think it will be needed for any smart defense that tries to minimize the overhead is some regularly updated database of webpage traces. Since websites change constantly and we should assume that the adversary will train on the most recent data, the defense has to act accordingly.

Well maybe. It will be important to have the padding protocol support primitives such that we can deploy a wide variety of padding mechanisms.

It may be the case that some statistical padding models don't require frequent update, at the expense of decreased traffic efficiency (which depending on overall accuracy reduction, may be an acceptable tradeoff).

I am very optimistic about the efficacy of any defenses. One major dirty secret of all of this work is that in the "Open World", typically only one single web page (or a very small number) are recognized. When that classifier is strained with more labels, I suspect it will quickly collapse on its own.

...

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...
Related: Do you happen to have any existing classifier code working already, by any chance?

Yes. We are using a modified version of Tao’s classifier (edit-distance based SVM) for the research project.

On 03/19/2014 04:25 AM, Kevin P Dyer wrote:

...
If It helps, the code [2] from our website fingerprinting paper [1] is public. It includes the edit-distance classifier [3] from [4], which wasn't reported on in [1], I believe.

Thanks, Kevin. In particular, we are using the Damerau-Levenshtein distance instead of the plain Levenshtein, which also takes into account transpositions. I think it describes better network traffic, specially when RP is used.

On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry@torproject.org

...
One of the ideas I've been considering is taking a closer look at the nearest-neighbor edit distances between page class labels, for the edit distance based classifiers. This distance provides us with an estimate of the ideal minimum cover traffic we will need to make testing instances jump from one nearest-neighbor label to another (causing a false positive). It will also decrease as the world size increases (more class labels in the same amount of N-dimensional space).

A successful defense should change of the distribution of edit distances of test instances around their class labels (it will increase the intra-class variance) and this in turn will increase the size of the threshold around class labels for a given accuracy rate, reducing accuracy or increasing false positives.

Yes, that sounds very interesting. I guess something relevant in such a scheme is the distribution used to pick the web page in the k-NN cluster that will be used to generate the cover traffic. It would be interesting to see what is the distribution that minimizes the overhead introduced by this cover traffic while preventing statistical attacks.

Right. Or modeling the nature of the edit distances between many different labels. At scale, statistical patterns may emerge between clusters of labels. I suspect there are certain types of edit distances that are fairly common between many similar sites.

-- Mike Perry

Marc Juarez

20 Mar 20 Mar

3:33 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Mike, thanks for your comments. My main concern for applying to GSoC is the time commitment. As I said in my first message, I would like to work part-time in the project. Anyway, I will submit a proposal and specify this there.

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTKwo6AAoJEGfJ5xfgazlxaloIAI2e0ssAgX//xbBD/jnTP59s I7kJqRLhRPc2qE9DWGEsCU4NF309h66gboK4+CUhq1FGMIj08Q/VRmoumiSzOmzx ksC3eFaoxTSD0P/Nl0wDkjhrXTn1uJVqRP1OlMDvVEj5630I3ZvZ1cSxe7cTFEAd To6xOIH89e3pAv6kpn/rby292kXDyhZFteuIkj8cU5fxTuQ/lytcKFKxgdVvTxBT XZGG+sYfvgYtVoL01IYfHsSOtHBMv8Nr1huBOz1UvRoih66Hxq1NcClxNfW7DcdT 0Kvf4Gp3SeKRxnuux1PzJPwFOLFhGixCLjfiV2h+rnSezk0yM4EEMdpvZShFxmw= =Jovi -----END PGP SIGNATURE-----

3885

Age (days ago)

3894

Last active (days ago)

tor-dev@lists.torproject.org

7 comments

4 participants

tags (0)

participants (4)

Ian Goldberg
Kevin P Dyer
Marc Juarez
Mike Perry