While brainstorming for a recent funding proposal, I wrote up this list of
"Things we should measure to track our impacts/success" in the context
of the Salmon bridge distribution strategy.
Or to put it another way, these are questions I'll be asking phw et
al to understand how things are going, once we deploy it. Some of the
questions are variations on others, i.e. a single data source can answer
several of them. So it's probably better framed as "questions to track"
rather than "data sources we should collect".
(1a) how many obfs4 / httpt bridges are running in total?
Bridges report their existence to the bridge authority, and bridgedb/rdsys
aggregate them and send them to the metrics datasets. So we should
already have these numbers.
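As a concrete starting point, here is a rough Python sketch of pulling
that count from the public Onionoo service; the parameters and fields
are as I understand Onionoo's current interface, so treat it as a
sketch rather than a finished tool:

    import requests

    # Count currently-running bridges by pluggable transport, via Onionoo.
    resp = requests.get(
        "https://onionoo.torproject.org/details",
        params={"type": "bridge", "running": "true", "fields": "transports"},
        timeout=60,
    )
    counts = {}
    for bridge in resp.json().get("bridges", []):
        for transport in bridge.get("transports", []):
            counts[transport] = counts.get(transport, 0) + 1
    print(counts)  # e.g. {'obfs4': ..., 'httpt': ...}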
(1b) out of those, how many are we distributing via salmon?
This is a parameter that we choose, and we will choose it based on how
successful Salmon is compared to our existing distribution strategies. If
we choose a larger fraction of our bridges to be used for Salmon, it's
a good indication that we're finding Salmon to be an effective option.
(2a) how often are we reachability-testing each bridge?
We'll probably start off doing daily testing, but we should aim to get
the frequency higher. We might also end up doing more targeted
just-in-time testing: whenever a user reports that a bridge is blocked,
we launch a test for it right then, so we can decide how to respond to
the user.
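Here is a minimal sketch of what that combined loop could look like.
The test_bridge and record functions are hypothetical placeholders for
the real prober and dataset, not anything that exists yet:

    import queue
    import time

    on_demand = queue.Queue()

    def report_blocked(bridge, region):
        # Called when a user reports this bridge as blocked in `region`.
        on_demand.put((bridge, region))

    def test_bridge(bridge, region="control"):
        # Placeholder: really we'd attempt an obfs4/httpt connection
        # from a vantage point in `region`.
        return True

    def record(bridge, ok):
        # Placeholder: feed the result into our reachability dataset.
        print(bridge, "reachable" if ok else "unreachable")

    def testing_loop(all_bridges, sweep_interval=24 * 3600):
        next_sweep = time.time()
        while True:
            # Scheduled sweep over the whole bridge pool.
            if time.time() >= next_sweep:
                for bridge in all_bridges:
                    record(bridge, test_bridge(bridge))
                next_sweep += sweep_interval
            # Between sweeps, service just-in-time requests on demand.
            try:
                bridge, region = on_demand.get(timeout=60)
                record(bridge, test_bridge(bridge, region=region))
            except queue.Empty:
                pass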
(2b) from how many vantage points are we doing this reachability testing?
We will want to start with n+1 vantage points, one for each target region
and one "control" outside the censored area. That might be sufficient
for a long time, or we might learn that we need to split up our testing
across more vantage points, or test more in particular regions.
(2c) how many of our bridges are currently reachable from each of these
vantage points?
Ideally the answer will be "all of them", but the reality is that some
blocking will occur. So the higher the fraction here the better: in
some sense it is a measure of the health of our plans, and/or of the
intensity of the censor's attention.
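Computing that health fraction per vantage point is simple once we have
the raw test results; here is a sketch, where the nested-dict layout of
the results is just an assumption for illustration:

    # results maps vantage point -> {bridge: reachable?}
    def reachable_fraction(results):
        return {
            vantage: sum(ok.values()) / len(ok)
            for vantage, ok in results.items()
            if ok
        }

    example = {
        "control": {"bridge1": True, "bridge2": True},
        "cn": {"bridge1": True, "bridge2": False},
    }
    print(reachable_fraction(example))  # {'control': 1.0, 'cn': 0.5}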
(2d) When a bridge stops working in China, what fraction of the time is
it because the bridge went down, i.e. normal churn?
One of the tradeoffs with a community of volunteer bridges is that
bridges naturally come and go over time. Understanding the dynamics of
our bridge population is key because it impacts the rate at which users
need fresh bridges even when there is no censorship.
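The control vantage point from 2b is what lets us tell these cases
apart. The decision rule below is my assumption about how we'd
classify, not a settled design:

    def classify_outage(up_from_control, up_from_region):
        if up_from_region:
            return "working"
        if up_from_control:
            return "blocked"  # alive elsewhere, unreachable in-region
        return "churn"        # down everywhere: the bridge went away

The 2d fraction is then the number of "churn" outages divided by all
outages observed in that region.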
(3a) how many total users have registered with the Salmon system?
This long-term measure of how many people have tried to use our system
lets us see how well our outreach is working.
(3b) how many of those users do we think are recently active?
Tracking how many people are still using it is a key indicator for both
usability and blocking-resistance: "do people actually find it useful?"
(3c) what is the rate of new users registering with the system?
First of all this helps us understand growth in interest, for example
from our outreach efforts, but it also helps us understand how many
fresh bridge addresses we need to support this growth. It is tied into
the next item, which is the other side of the question:
(3d) how many bridges do we have in reserve (not yet filled with users)?
The trouble comes when this number reaches 0. So the target is that we
always have some bridges in reserve, which means we are keeping up with
the rate of new user registrations. If this number hits zero, it means
we need to activate more of our partners to get fresh bridges.
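To tie 3c and 3d together, here is a back-of-the-envelope check; the
users-per-bridge capacity and headroom window are made-up numbers for
illustration:

    USERS_PER_BRIDGE = 8  # assumed capacity we assign per bridge

    def bridges_needed(new_users_per_day, reserve, days_of_headroom=14):
        demand = (new_users_per_day * days_of_headroom) / USERS_PER_BRIDGE
        shortfall = max(0, demand - reserve)
        if shortfall:
            print(f"ask partners for ~{shortfall:.0f} fresh bridges")
        return shortfall

    # 40 users/day * 14 days / 8 users per bridge = 70 bridges needed;
    # with 30 in reserve, we're short 40.
    bridges_needed(new_users_per_day=40, reserve=30)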
(4a) how many high-reputation users are currently assigned at least one
bridge that we think works?
This number summarizes Salmon's success for our high-value or established
users. That is, if this number remains high, then Salmon is succeeding
at providing availability to its core set of users, even if the other
numbers aren't doing well.
(4b) how many low-reputation users are currently assigned at least one
bridge that we think works?
This number measures the other side: how healthy is the Salmon system
at adding new users, compared to the effort the censor is putting into
registering fake users in order to find and block our bridges?
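A sketch covering both 4a and 4b: split the users by reputation band
and ask what fraction hold at least one bridge we believe works. The
record format and the threshold of 3 are arbitrary assumptions here;
Salmon's actual trust levels are more fine-grained:

    def coverage_by_reputation(users, working_bridges):
        tallies = {"high": [0, 0], "low": [0, 0]}  # [covered, total]
        for user in users:
            band = "high" if user["reputation"] >= 3 else "low"
            covered = any(b in working_bridges for b in user["bridges"])
            tallies[band][0] += covered
            tallies[band][1] += 1
        return {band: c / t for band, (c, t) in tallies.items() if t}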
(5a) at what rate are we filling bridges with new Salmon users?
This one is closely related to 3c above, but from the bridge
availability side: the best rate here is the highest possible rate such
that 3d doesn't go to 0.
(5b) at what rate are we filling bridges with existing Salmon users?
By looking at the rate at which established users need new bridges, we
can understand how much stability we have in the system.
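Both 5a and 5b fall out of the same assignment log, if we tag each
assignment by whether it is the user's first; the log format here is an
assumption:

    from collections import Counter

    def fill_rates(assignments, period_days=7):
        # assignments: list of (user_id, is_first_assignment) pairs
        counts = Counter("new" if first else "existing"
                         for _, first in assignments)
        return {kind: n / period_days for kind, n in counts.items()}

    log = [("u1", True), ("u2", True), ("u1", False), ("u3", False)]
    print(fill_rates(log))  # ~0.29 new and ~0.29 existing per day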
(5c) at what rate are users reporting failed bridges but we think the
bridges are working and reachable for them?
This point aims to measure our false positives, which could stem from
logic errors inside Tor and Tor Browser (e.g. reporting bridges down
when actually the user's own network connection is down), or from
non-uniform blocking within countries, or probably many other reasons.
If this rate gets much above 0, it's effectively a bug report that we
need to track down and understand.
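One way to quantify it, as a sketch (the report format is an
assumption): of all "bridge failed" reports, count the ones where our
own tests say the bridge is reachable from that user's region:

    def false_positive_rate(reports, reachable):
        # reports: list of (bridge, region) complaints;
        # reachable: set of (bridge, region) pairs our tests confirm.
        if not reports:
            return 0.0
        return sum(1 for r in reports if r in reachable) / len(reports)

    reports = [("bridge1", "cn"), ("bridge2", "cn")]
    reachable = {("bridge1", "cn")}
    print(false_positive_rate(reports, reachable))  # 0.5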
--Roger