Mike Perry mikeperry@torproject.org writes:
George Kadianakis:
Hello Mike,
I had a talk with Marc and Mohsen today about WTF-PAD. I now understand much more about WTF-PAD and how it works with regard to histograms. I think I might even understand enough to start some sort of conversation about it:
Here are some takeaways:
- Marc and Mohsen think that WTF-PAD might not be the way forward because of its various drawbacks and its complexity. Apparently there are various attacks on WTF-PAD that Roger has discovered (SENDME cell side-channels?) and also the deep learning crowd has done some pretty good damage to the WTF-PAD padding (90%-60% accuracy?). They also told me that achieving the needed precision on the timings might be a PITA.
Are there citations for any of this? Last I heard Matt Wright was working on a deep learning study but the results were mixed.
I think this is the best we have in terms of public results: https://arxiv.org/abs/1801.02265
- From what I understand you are also hoping to use WTF-PAD to protect against circuit fingerprinting and not just website fingerprinting. They told me that while this might be plausible, there is no current research on how well it can achieve that. Are we hoping to do that? And what research remains here? How can I help? Which parts of the Tor circuit protocol are we hoping to hide?
I am designing WTF-PAD to be a framework for deploying padding against arbitrary traffic analysis attacks. It is meant to allow us to define histograms on the fly (in the Tor consensus) as these are studied. The fact that they have not yet been studied is not super relevant to deploying the framework for it now.
ACK.
What other traffic analysis attacks are we looking at addressing here?
I'm thinking of stuff like "circuit fingerprinting of onion services", but I wonder if histograms and random sampling are too crude to actually help against sophisticated attacks. I don't have a suggestion for something better currently.
On that topic, is it decided whether the adaptive padding of WTF-PAD will also happen during circuit construction, or only after that?
Marc and Mohsen suggested using application-layer defences because the application layer has a much better view of the actual structures that are sent on the wire, instead of the black-box view that the network layer has.
In particular they were mainly concerned about onion services fingerprinting because they are part of a restricted closed world, whereas they were less concerned about the entire internet because of its vast size.
They suggested that we could investigate using the service-side "alpaca" library for onion services (e.g. as part of securedrop?) which should resolve the most pressing concern of HS identification.
I mean yeah application-layer defenses are useful for website traffic fingerprinting, but that is a very narrow slice of the traffic analysis problems that I want this framework to solve.
WTF-PAD also doesn't rule out hidden service operators using alpaca, either.
Agreed.
- They also told me of research by Tobias Pulls which eliminates the need for histograms in WTF-PAD and instead samples from the probability distribution directly. They think that this can simplify things somewhat. Any thoughts on this?
Yes this is actually exactly what I want to do with the next iteration of WTF-PAD! The question is what form/model to use for these probability distributions. Right now we're encoding inter-burst and inter-packet timings with some weird geometric distribution determining how long these bursts should go on for, when it might be more natural to encode and sample from length-based distributions/histograms.
(Histograms vs distributions is not the problem -- it's what they encode and how they encode it that matters).
I don't see this paper on Tobias's website. Is it up anywhere yet?
Hmm. Looking at the README of wtfpad (see the APE section), I think this blog post is the best resource we have on this: https://www.cs.kau.se/pulls/hot/thebasketcase-ape/
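
To make the histogram-vs-distribution point above a bit more concrete, here is a rough Python sketch of the two ways of picking a padding delay. Everything in it is an assumption made up for illustration: the bin edges, the counts, the log-normal choice and its parameters, and the function names. It is not the WTF-PAD algorithm and not what the APE post does (for instance, it skips any per-bin token bookkeeping); it only shows that both approaches reduce to "draw one delay", and that what differs is what gets encoded.

```python
import random

# Hypothetical inter-packet delay histogram (illustrative values only).
# BIN_EDGES_US has one more entry than BIN_COUNTS: bin i covers
# [BIN_EDGES_US[i], BIN_EDGES_US[i + 1]) microseconds.
BIN_EDGES_US = [0, 1_000, 5_000, 20_000, 100_000]
BIN_COUNTS = [5, 10, 3, 2]


def sample_delay_from_histogram(edges, counts, rng):
    """Pick a bin with probability proportional to its count,
    then draw a uniform delay inside that bin."""
    total = sum(counts)
    pick = rng.randrange(total)
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if pick < cumulative:
            return rng.uniform(edges[i], edges[i + 1])
    raise AssertionError("unreachable: pick is always below the total count")


def sample_delay_from_distribution(mu, sigma, max_us, rng):
    """Draw a delay directly from a parametric distribution.
    Log-normal is an arbitrary choice for this sketch; the result
    is clamped so that it never exceeds max_us."""
    return min(rng.lognormvariate(mu, sigma), max_us)


if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_delay_from_histogram(BIN_EDGES_US, BIN_COUNTS, rng))
    print(sample_delay_from_distribution(8.0, 1.0, 100_000, rng))
```

The histogram version needs the whole list of bins to be shipped around (e.g. in the consensus), while the distribution version only needs a couple of parameters, which is presumably where the simplification would come from.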