Hello all,
What follows is a summary of the primitives that Marc Juarez aims to implement for his Google Summer of Code project on prototyping defenses for Website Traffic Fingerprinting and follow-on research.
After a review of Tamaraw[1], Adaptive Padding[2], CS-BuFLO[3], and the Supersequence[4] work, as well as some discussion at the Tor dev meeting, we came up with this list of padding message primitives.
These functions are meant to be sent as command cells from one endpoint to the other. Some are bidirectional commands, others can only be sent in one direction (client to relay, or relay to client).
For now, they will be implemented as ad-hoc messages between two modified obfsproxy (called wfpadtools) instances. The source code for this fork lives here for the moment: https://bitbucket.org/mjuarezm/obfsproxy-wfpadtools/
Long term, I am thinking that all of these should be specified as if they were their own RELAY_* cell type from tor-spec.txt: https://gitweb.torproject.org/torspec.git/blob/HEAD:/tor-spec.txt#l1215
This means that padding would be a circuit-level property, and it would be possible to send it to and from any hop in the circuit, due to leaky-pipe topology (using the "recognized" field). In a world of very cheap and excessive middle and Guard node bandwidth, we would run this padding to the middle node. For the wfpadtools prototype, it will obviously only cover the first hop.
Here's the list of primitives, broken down by research paper:
Generic Messages (common to all defenses): RELAY_IGNORE() Description: Simple fixed-length (CELL_SIZE) padding cell. Direction: Bidirectional (client to relay, relay to client) Parameters: None
RELAY_SEND_PADDING(N,t) Description: Send the requested number of padding cells in response. Direction: Client to relay Parameters: N: Number of padding cells to send in response to this cell t: microseconds delay before sending
RELAY_APP_HINT(session_id, status) Description: A hint from the application layer for session start/stop Direction: Client to relay Parameters: session_id: A string identifying the session (ie keyed hash of url bar domain) status: "Start" or "Stop", indicating session start and end.
Adaptive Padding Messages: RELAY_BURST_HISTOGRAM(histogram[], labels_ms[], remove_toks, fuzz, when) Description: Specifies a histogram that encodes a delay distribution representing the probability of sending a single padding packet after a given delay in response to either an upstream cell, or a client-originating cell. Direction: Client to relay Parameters: histogram[]: Contains delay distribution of sending an IGNORE packet after sending a real packet labels_ms[]: millisecond labels for the bins (with "Infinity"/"Ignore" bin to allow encoding the probability of not sending any padding packet in response to this packet). remove_toks: If true, follow Adaptive Padding token removal rules. If false, histograms are immutable. fuzz: If true, randomize the delay uniformly between bin labels. If false, use bin lables as exact delay values. when: If set to "receive", this histogram governs the probably of sending a padding packet after some delay in response to a packet originating from. If set to "send", this histogram governs padding packets that are transmitted after a packet arrives from upstream (the middle node). In both cases, the padding packet is sent in the direction of the client.
RELAY_GAP_HISTOGRAM(histogram[], labels_ms[], remove_toks, fuzz, when) Description: Specifies a histogram that encodes a delay distribution representing the probability of sending a single additional padding packet after a given delay following a padding packet that originated at this hop. Direction: Client to relay Parameters: histogram[]: Contains delay distribution of sending an IGNORE packet after sending an IGNORE packet labels_ms[]: millisecond labels for the bins (with "Infinity"/"Ignore" bin to allow encoding the probability of not sending any padding packet in response to this packet). remove_toks: If true, follow Adaptive Padding token removal rules. If false, histograms are immutable. fuzz: If true, randomize the delay uniformly between bin labels. If false, use bin lables as exact delay values. when: If "receive", this histogram applies to locally-inserted padding packets that were initially sent in response to client-originated data. If "send", this histogram applies to packets sent in response to locally-inserted padding packets sent in response to upstream data. Note that this means that implementations must maintain this metadata as internal state as the system transitions from BURST_HISTOGRAM initiated padding into GAP_HISTOGRAM initiated padding. In both cases, the padding packet is sent in the direction of the client.
CS-BuFLO Messages: RELAY_TOTAL_PAD(session_id, t): Description: Requests that endpoint pad all batches to nearest 2^K cells total or until APP_HINT(session_id, stop) Direction: Client to relay Parameters: session_id: The session ID from APP_HINT() t: The number of microseconds to wait between cells to consider them part of the same batch.
RELAY_PAYLOAD_PAD(session_id, t, R): XXX: The paper was not clear enough for me to understand this primitive
Tamaraw: RELAY_BATCH_PAD(session_id, L, t) Description: Requests that endpoint pad all batches of cells to L cells total or until APP_HINT(session_id, stop) Direction: Client to relay Parameters: session_id: The session ID from APP_HINT() L: The multiple of cells to pad to. t: The number of microseconds to wait between cells to consider them part of the same batch.
Comments, questions, and suggestions are welcome! In particular, I'm not sure I got either the CS-Buflo or the tamaraw primitives completely correct.
1. http://cacr.uwaterloo.ca/techreports/2013/cacr2013-30.pdf 2. http://freehaven.net/anonbib/cache/ShWa-Timing06.pdf 3. http://arxiv.org/abs/1401.6022 4. http://cacr.uwaterloo.ca/techreports/2014/cacr2014-05.pdf
I've inserted a couple of corrections and clarifications below. I'm leaving original text fully quoted for the archive.
Mike Perry:
Hello all,
What follows is a summary of the primitives that Marc Juarez aims to implement for his Google Summer of Code project on prototyping defenses for Website Traffic Fingerprinting and follow-on research.
After a review of Tamaraw[1], Adaptive Padding[2], CS-BuFLO[3], and the Supersequence[4] work, as well as some discussion at the Tor dev meeting, we came up with this list of padding message primitives.
These functions are meant to be sent as command cells from one endpoint to the other. Some are bidirectional commands, others can only be sent in one direction (client to relay, or relay to client).
For now, they will be implemented as ad-hoc messages between two modified obfsproxy (called wfpadtools) instances. The source code for this fork lives here for the moment: https://bitbucket.org/mjuarezm/obfsproxy-wfpadtools/
Long term, I am thinking that all of these should be specified as if they were their own RELAY_* cell type from tor-spec.txt: https://gitweb.torproject.org/torspec.git/blob/HEAD:/tor-spec.txt#l1215
This means that padding would be a circuit-level property, and it would be possible to send it to and from any hop in the circuit, due to leaky-pipe topology (using the "recognized" field). In a world of very cheap and excessive middle and Guard node bandwidth, we would run this padding to the middle node. For the wfpadtools prototype, it will obviously only cover the first hop.
Here's the list of primitives, broken down by research paper:
Generic Messages (common to all defenses): RELAY_IGNORE() Description: Simple fixed-length (CELL_SIZE) padding cell. Direction: Bidirectional (client to relay, relay to client) Parameters: None
RELAY_SEND_PADDING(N,t) Description: Send the requested number of padding cells in response. Direction: Client to relay Parameters: N: Number of padding cells to send in response to this cell t: microseconds delay before sending
RELAY_APP_HINT(session_id, status) Description: A hint from the application layer for session start/stop Direction: Client to relay Parameters: session_id: A string identifying the session (ie keyed hash of url bar domain) status: "Start" or "Stop", indicating session start and end.
Adaptive Padding Messages: RELAY_BURST_HISTOGRAM(histogram[], labels_ms[], remove_toks, fuzz, when) Description: Specifies a histogram that encodes a delay distribution representing the probability of sending a single padding packet after a given delay in response to either an upstream cell, or a client-originating cell. Direction: Client to relay Parameters: histogram[]: Contains delay distribution of sending an IGNORE packet after sending a real packet labels_ms[]: millisecond labels for the bins (with "Infinity"/"Ignore" bin to allow encoding the probability of not sending any padding packet in response to this packet).
In both of the Adaptive Padding primitives, a considerably more compact form is possible for bin labeling. The original Adaptive Padding literature used a 2^K exponential spacing between these bins, but didn't rigorously analyze the choice. If experimentation finds this is indeed reasonable/optimal, we can send a single parameter W instead of a labels_ms. W would be bin width in milliseconds, and L is the length of the histogram[] array. The labels would then be computed as W*2^K with K ranging from 0 to L-2, with the addition of a zero-delay bin label.
For the prototype though, I think we should allow the labeling to remain a free-form array, for maximum flexibility in research. We may decide we want some form of polynomial spacing instead of exponential here, for example.
remove_toks: If true, follow Adaptive Padding token removal rules. If false, histograms are immutable. fuzz: If true, randomize the delay uniformly between bin labels. If false, use bin labels as exact delay values.
"fuzz" here might be better called "interpolate" in both this and the below primitive, since linear interpolation is what I mean here.
I suppose we could imagine smoother types of interpolation, like quadratic, cubic, etc, but I don't expect the actual interpolation mechanism to matter a whole lot. I expect it to be most useful for preventing the adversary from subtracting the padding due to its arrival at a fixed delay quantity after another packet.
when: If set to "receive", this histogram governs the probably of sending a padding packet after some delay in response to a packet originating from. If set to "send", this histogram governs padding
^ | insert "the client" after that "from"
packets that are transmitted after a packet arrives from upstream (the middle node). In both cases, the padding packet is sent in the direction of the client.
RELAY_GAP_HISTOGRAM(histogram[], labels_ms[], remove_toks, fuzz, when) Description: Specifies a histogram that encodes a delay distribution representing the probability of sending a single additional padding packet after a given delay following a padding packet that originated at this hop. Direction: Client to relay Parameters: histogram[]: Contains delay distribution of sending an IGNORE packet after sending an IGNORE packet labels_ms[]: millisecond labels for the bins (with "Infinity"/"Ignore" bin to allow encoding the probability of not sending any padding packet in response to this packet). remove_toks: If true, follow Adaptive Padding token removal rules. If false, histograms are immutable. fuzz: If true, randomize the delay uniformly between bin labels. If false, use bin labels as exact delay values.
when: If "receive", this histogram applies to locally-inserted padding packets that were initially sent in response to client-originated data. If "send", this histogram applies to packets sent in response to locally-inserted padding packets sent in response to upstream data. Note that this means that implementations must maintain this metadata as internal state as the system transitions from BURST_HISTOGRAM initiated padding into GAP_HISTOGRAM initiated padding. In both cases, the padding packet is sent in the direction of the client.
CS-BuFLO Messages: RELAY_TOTAL_PAD(session_id, t): Description: Requests that endpoint pad all batches to nearest 2^K cells total or until APP_HINT(session_id, stop) Direction: Client to relay Parameters: session_id: The session ID from APP_HINT() t: The number of microseconds to wait between cells to consider them part of the same batch.
RELAY_PAYLOAD_PAD(session_id, t, R): XXX: The paper was not clear enough for me to understand this primitive
Tamaraw: RELAY_BATCH_PAD(session_id, L, t) Description: Requests that endpoint pad all batches of cells to L cells total or until APP_HINT(session_id, stop) Direction: Client to relay Parameters: session_id: The session ID from APP_HINT() L: The multiple of cells to pad to. t: The number of microseconds to wait between cells to consider them part of the same batch.
Comments, questions, and suggestions are welcome! In particular, I'm not sure I got either the CS-Buflo or the tamaraw primitives completely correct.
- http://cacr.uwaterloo.ca/techreports/2013/cacr2013-30.pdf
- http://freehaven.net/anonbib/cache/ShWa-Timing06.pdf
- http://arxiv.org/abs/1401.6022
- http://cacr.uwaterloo.ca/techreports/2014/cacr2014-05.pdf
-- Mike Perry
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Mike Perry:
I've inserted a couple of corrections and clarifications below. I'm leaving original text fully quoted for the archive.
[nah. *snip*]
I've converted this text and a few other clarifications and additions into something that more closely resembles a Tor proposal.
It now lives at: https://gitweb.torproject.org/user/mikeperry/torspec.git/blob/refs/heads/mul...
I'm really rather excited about the Adaptive Padding primitives in particular, especially with our extensions to the original idea. I think it has a lot of promise to be extremely effective, tunable, and flexible as a traffic fingerprinting defense.
Those of you who know just how crazy^Wambitious I am also know that I think it will actually impede end-to-end correlation as well (at least, for short-lived/bursty web-like traffic). But we'll cross that bridge when it starts burning. First, let's put this traffic fingerprinting problem out of its misery :)