Hey Philipp!
Thanks for the interest! I'm one of the authors of the paper. My responses are inline below.
On Wednesday, August 19, 2015, Philipp Winter <phw@nymity.ch> wrote:
> They claim that they are able to detect obfs3, obfs4, FTE, and meek
> using entropy analysis and machine learning.
> I wonder if their dataset allows for such a conclusion. They use an
> (admittedly large) set of flow traces gathered at a college campus.
> One of the traces is from 2010. The Internet was a different place back
> then.
Correct. We used datasets collected in 2010, 2012, and 2014, which total 1 TB of data and 14M TCP flows.
We could have, say, just used the 2014 dataset. However, we wanted to show that the choice of dataset matters: even with millions of traces, the collection date and network-sensor location can impact results.
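For anyone who wants a feel for what the entropy analysis boils down to, here is a rough, simplified sketch (this is illustrative only, not our released pipeline; the function and the example thresholds in the comments are just for intuition): compute the byte-level Shannon entropy of a flow's early payload bytes and use it, alongside other flow features, as classifier input.

    import math
    from collections import Counter

    def byte_entropy(payload: bytes) -> float:
        """Shannon entropy, in bits per byte, of a packet or flow payload."""
        if not payload:
            return 0.0
        counts = Counter(payload)
        total = len(payload)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Rough intuition: obfuscated/encrypted-looking traffic (e.g., obfs4) comes out
    # near-random (close to 8 bits/byte), while plaintext protocols such as HTTP
    # tend to sit noticeably lower.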
> I would also expect college traces to be very different from
> country-level traces. For example, the latter should contain
> significantly more file sharing, and other traffic that is considered
> inappropriate in a college setting. Many countries also have popular
> web sites and applications that might be completely missing in their
> data sets.
That's probably accurate. I bet that even across different types of universities (e.g., technical vs. non-technical) one might see very different patterns. Certainly different countries (e.g., Iran vs. China) will see different patterns, too.
For that reason, we're going to release our code [1] prior to CCS. Liang Wang, a grad student at the University of Wisconsin–Madison, led a substantial engineering effort to make this possible. We undersold it in the paper, but it makes it easy to re-run all of these experiments on new datasets. We'd *love* it if others could rerun the experiments against new datasets and report their results.
> Considering the rate difference between normal and obfuscated traffic,
> the false positive rate in the analysis is significant. Trained
> classifiers also seem to do badly when classifying traces they weren't
> trained for.
We definitely encountered this. If you train on one dataset and test on a different one, accuracy plummets.
I think that raises a really interesting research question: what does it mean for two datasets to be different? For this type of classification problem, what level of granularity would a network operator have to train at to achieve high accuracy and a low false-positive rate? (e.g., do you need a classifier per country? state? city? neighborhood?) And how often would one need to retrain? Daily? Weekly?
I guess all we showed is that datasets collected from sensors at different network locations (and years apart) are different enough to impact classifier accuracy. Probably not surprising...
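To make the cross-dataset point concrete, here's a toy sketch of the kind of evaluation I mean (again, illustrative only, not our released code; the feature matrices are random placeholders standing in for per-flow features from two different collection years):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # X_* would be per-flow feature matrices (entropy, packet lengths, timing, ...)
    # and y_* the obfuscated / not-obfuscated labels; random data here just for shape.
    rng = np.random.default_rng(0)
    X_2012, y_2012 = rng.random((1000, 10)), rng.integers(0, 2, 1000)
    X_2014, y_2014 = rng.random((1000, 10)), rng.integers(0, 2, 1000)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_2012, y_2012)

    # In-dataset cross-validation tends to be optimistic; testing on a dataset
    # collected elsewhere (or years later) is the number that actually matters.
    print("train 2012 -> test 2014 accuracy:",
          accuracy_score(y_2014, clf.predict(X_2014)))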
> The authors suggest active probing to reduce false
> positives, but don't mention that this doesn't work against obfs4 and
> meek.
I don't want to get too far off track here, but do obfs4 and meek really hold up against active probing by motivated countries? Don't we still have the unsolved bridge/key distribution problem?
Finally, we'll be working on a full version of this paper with additional results. If anyone is interested in reviewing it and providing feedback, we'd love to hear from you. (Philipp - do you mind if I reach out to you directly?)
-Kevin