On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom rransom.8774@gmail.com wrote: [...]
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message.
Right; I meant to say the output of a stream cipher used as a PRF, but what I meant isn't what I said. Should've proofed more carefully
Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
I read everything I cited. If there is something I didn't understand, or something I missed, or something I got wrong, or something I said wrong, that doesn't mean I didn't read the paper.
I am not going to be able to draw all the right inferences from every paper on my own, though. And I am *definitely*, *frequently*, going to read papers, come up with questions, and post those questions here sometimes even when the paper, properly understood, would answer my questions. If I were writing for publication, I'd want to keep all my ideas secret until I could answer all my questions and make sure all my answers were right, but I'm not writing for publication -- I am writing to get feedback from other people and learn things. Thank you for helping me learn things!
Also thank you for reminding me to read http://cr.yp.to/antiforgery/securitywcs-20050227.pdf again. I'll check it out.
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
Are there any non-obvious reasons why? Does it lead to any better attacks than the obvious ones?
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
That's correct. Why would it be an attack for the exit to send a covert signal to the client? The exit already has valid means to send overt signals to the client, and the client Tor is presumed not to want to break its own anonymity.
II. WIDE-BLOCK DESIGNS
Turn we now to designs where we encrypt each block of the cipher using a 509-byte SPRP, such that each block's SPRP is keyed dependently on the original key and on the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Ah, so I think you are claiming that LION or BEAR on its own is fine if they are only used once with each key, and that any sane cell chaining construction we use will involve re-keying them with each block, so we can ignore any attack on them that would require multiple invocations of the encryption/decryption function with the same key.
Do I understand you correctly there?
[...]
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
Interesting; I had stopped considering CubeHash when got dropped from the SHA-3 competition. I'll have to find out who's been working on analyzing it since then.
Can you be more explicit about where the chaining output is taken from when, and how exactly it gets used, in this design?
Also, why LION and not BEAR?
(Does the XCB patent cover LION-like constructions with the ‘hash function’ in the middle replaced with a polynomial-evaluation function? I'm not convinced that a suitable approximately-256-bit polynomial-evaluation function will be much faster than ChaCha, especially in 32-bit code.)
I haven't read the XCB patent, and won't, pending legal advice from Wendy telling me that it's okay to read it.
[...]
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
- Compute a table of c*X^i for i = 0, ..., 127. (This table takes
128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.)
- For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in
GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
Hm. Did you figure out cycles-per-byte on that one?
Back-of-the-envelope:
Assuming 64-bit words, we're looking at a 128 instances of "get a bit from y," 128*2 instances of "Load the corresponding word from the table, then constant-time-conditionally-xor that word to the accumulator."
The fast portable implementation of the conditional-xor I know, for a single bit in 'b', and a value in 'x' to be conditionally xored into 'a' is: "a ^= x & ~(b-1)".
I *think* that's roughly 128 (and, sub, not) to get a mask for each bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256 xors. (Some of these operations are unnecessary; I am committing fencepost errors here. I'm also assuming all loads are one-cycle.)
So that's about 1280 operations for 16 bytes of input to be multiplied by c, so that seems like something like 80 instructions per byte for a naive implementation. There are still probably more optimizations to be done, especially if we have wacky SIMD features to play with, or we can combine some of the operations above into single ones. Dynamic multiple issue and such CPU architectural tricks might get it down even more. Nevertheless, it still looks like it'd be expensive enough getting GF(2^128) right to make GF(2^128) unattractive.
Still, maybe somebody should hack this up for the public good, whether we turn out to need GF(2^128) or not.
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
I won't believe claims that AES hardware will be widely available until it actually is present in every new processor from a major manufacturer.
Agreed.