I should share with the list an update of where I am with a design for an improved relay crypto protocol. For background and motivation, please see the last thread on the topic [Prop202].
There are three main questions remaining for me in choosing among new relay crypto protocols. Basically, they are: "Am I comfortable with this system?", "Among systems I'm comfortable with, how good is this system?", and "Do I know how to implement this system?"
Unfortunately, the stuff I am currently comfortable with and know that I could implement is not nearly as good as the stuff that I'm _nearly_ comfortable enough to use and I don't know how to implement.
Let's talk about some designs in detail, using the same terminology as proposal 202.
Note:
As usual, this is probably largely wrong. If you find it on a mailing list archive years from now, please don't try to learn cryptography from it.
What I'm looking for most right now is places where my reasoning is wrong, places where I'm mistaken about what tradeoffs I need to make, and places where there's research that I should have read or remembered.
I. MAC-AND-PAD DESIGNS
These are the least involved to implement: all you need is an AEAD construction and a PRNG. Those are both pretty well-understood: you can build a good PRNG out of any stream cipher, and you can build the AEAD mode out of a stream cipher and a MAC, or out of various other constructions.
As a wrinkle: it would be good to have a system that generates "two MACs for the price of one." That's because in our topology, we need the ability to address relay cells to any node in the circuit, and one easy way to do that here is to have a construction that generates two MACs, and uses one for "cells that should arrive here" and another for "cells that we should relay". We also want shorter MACs: 16 bytes per hop is excessive for our needs.
Authenticators like GCM whose output is the raw result of a polynomial evaluation aren't safe to truncate: Ferguson [Ferguson] has a result showing that it's easier than it should be to perform forgery against truncated-MAC GCM constructions.
On the other hand, polynomial authenticators seem safe to truncate (generally) if the encryption step comes _after_ the polynomial evaluation. See for example [Poly1305-Trunc]'s discussion of truncated Poly1305 -- I hope it's right.
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output of AES(nonce) to each GCM output, in hopes that it will make GCM safe to truncate. (This is probably crazy somehow.)
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
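To make that concrete, here's a rough sketch of the per-cell tag computation I have in mind, with a hypothetical mac() standing in for whichever MAC we pick and tags truncated to 8 bytes; the names and sizes are only for illustration, not a concrete proposal:

#include <stdint.h>
#include <string.h>

#define TAG_LEN 8              /* truncated per-hop tag */
#define CELL_PAYLOAD_LEN 509

/* Hypothetical MAC primitive standing in for whatever we pick
 * (HMAC-SHA512/256, Poly1305 with fresh per-cell keys, ...). */
void mac(const unsigned char key[32],
         const unsigned char *msg, size_t msglen,
         unsigned char out[32]);

struct relay_hop_state {
    unsigned char mac_key[32];
    unsigned char prev_tag[TAG_LEN];  /* authenticator of the previous cell */
    uint64_t cell_count;              /* position of this cell in the stream */
};

/* payload_len is assumed to be at most CELL_PAYLOAD_LEN. */
static void
compute_cell_tag(struct relay_hop_state *st,
                 const unsigned char *payload, size_t payload_len,
                 unsigned char tag_out[TAG_LEN])
{
    unsigned char input[8 + TAG_LEN + CELL_PAYLOAD_LEN];
    unsigned char full_mac[32];
    size_t n = 0;

    /* MAC input = cell position || previous cell's tag || payload. */
    for (int i = 0; i < 8; ++i)
        input[n++] = (unsigned char)(st->cell_count >> (8 * i));
    memcpy(input + n, st->prev_tag, TAG_LEN);
    n += TAG_LEN;
    memcpy(input + n, payload, payload_len);
    n += payload_len;

    mac(st->mac_key, input, n, full_mac);
    memcpy(tag_out, full_mac, TAG_LEN);       /* truncate */

    memcpy(st->prev_tag, tag_out, TAG_LEN);   /* chain into the next cell */
    st->cell_count++;
}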
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
II. WIDE-BLOCK DESIGNS
We turn now to designs where we encrypt each cell's body as a single 509-byte block using an SPRP, such that each block's SPRP is keyed based on both the original key and the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
Can we get faster than LIONESS? Indeed we can! There are a pile of constructions. If I understand correctly, and I'm not missing anything, the ones we might want to use fall into two broad categories: those that (like LIONESS) split the message into a short bit and a long bit, then treat them differently; and those that apply a transformation on the message, encrypt the message, and transform it again.
The first category (split, then frob each part based on the other once or twice) would appear to be more of a patent minefield, thanks to the XCB patent. The ones to look at here are [HCTR] and [HCH], which require two applications of a universal hash function (which can be polynomial-based) and approximately one encryption pass. HCH is a little further removed from XCB than HCTR is. Neither is exactly trivial. We could build either one out of whatever field and stream cipher we wanted, as above, though bitsliced AES-CTR wouldn't be efficient in this case: we aren't generating enough stream at once for bitslicing to pay off.
The second category (frob, encrypt, frob) is pretty elegant IMO. The best-explained of these I've seen so far are in a paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET construction [TET] might also be cool. For these, you need an invertible block-wise (Almost) (Xor-)Universal hash function, typically implemented with GF(2^128). I'm not sure if you could use a different field. The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
Above, I haven't taken one of our requirements into account: that any change to a single cell must make all future cells unrecoverable.
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
As a blunt-force approach to the problem, we could rekey between each block, using new keys based on a MAC of the last block. This would be pretty inefficient for primitives that need any serious amount of per-key computation.
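A rough sketch of that rekeying step, with a hypothetical keyed_hash() standing in for whatever PRF or MAC we would actually use (names and sizes are illustrative only):

#include <string.h>

/* Hypothetical 64-byte-output keyed hash. */
void keyed_hash(const unsigned char key[32],
                const unsigned char *msg, size_t msglen,
                unsigned char out[64]);

/* Blunt-force chaining: derive the next block's keys from a keyed hash of
 * the previous ciphertext block, so a corrupted or missing block leaves us
 * unable to derive the keys for everything that follows it. */
static void
rekey_from_block(unsigned char block_key[32],   /* keys the wide-block SPRP */
                 unsigned char chain_key[32],   /* keys the next rekeying step */
                 const unsigned char *prev_ciphertext, size_t len)
{
    unsigned char digest[64];
    keyed_hash(chain_key, prev_ciphertext, len, digest);
    memcpy(block_key, digest, 32);
    memcpy(chain_key, digest + 32, 32);
}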
Finally, we could look into constructions that produce an extra secret output incidental to their regular operation, and which are easily rekeyed for each operation.
In this whole field, we need to keep an eye out for the patents on CMC, EME, and XCB. "Joy."
III. CHOICE OF CIPHERS AND OTHER PRIMITIVES
I am hearing exactly two recommendations for encryption primitives nowadays: AES and Salsa20 (and its family). Nobody's recommending anything else as far as I can tell.
AES is still the IBM of the block cipher world, which "Nobody Ever Got Fired For Using." It's a bit of a pain to use it in software, though. Its most obvious implementations have timing attacks due to table lookups [DJB-Timing]. OpenSSL has three fast x86 implementations: one of them uses vector permutations, one uses bit-slicing, and one uses the aesni instructions to invoke the chip's built-in AES capabilities. On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles per byte respectively. The bitsliced one really needs to encrypt 4KiB at a time (yes, kibibytes), which makes it fine for counter mode but not so good for CBC. (Those numbers are from the OpenSSL source.) Key setup times are cheap for all of these (I think!), but the bitsliced implementation gets expensive if you change the key (or the stream position).
Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for people who don't want to use AES. It is faster than AES on every platform except those with hardware support for AES; 4-7 cycles per byte seems typical for high-end Intel stuff. You would probably have to go out of your way to implement it in a way that was vulnerable to timing side-channel attacks. The only argument for its insecurity that I'm aware of is that, although it has fewer known flaws than AES, it has received less attention than AES and is therefore likelier to have _unsuspected_ problems. The counterargument is that there _are_ several dozen cryptographers who have tried to attack it (or attack reduced-round variants of it), several of whom have also published successful results against AES. [DJB-Comm] Per-key setup time is basically nonexistent.
There are a lot of constructions out there that want to do multiplication in GF(2^128) as a basic operation. That turns out to be less than totally straightforward, though. On newer Intel chips, you've got a "multiply without carry" instruction that can supposedly let you get to around 2 cycles per byte. On other platforms, you're reduced to worse trickery to try to get good performance without timing side-channels, with the amount of trickery depending on exactly what operation you're trying to do.
The easiest GF(2^128) operation to implement safely and quickly is multiplication by a known compile-time value. (It's easy enough that even I could do it.) Next easiest is multiply a large number of values by the same run-time-fixed value -- you do precomputation on the value in order to generate some tables. (The scary, fast variants use big tables; the still-a-little-scary variants use 256-byte tables, and get performance on high-end Intel boxes around 7-10 cycles per byte when doing GCM.) Full-on multiplication of two arbitrary GF(2^128) values is slowest still.
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
IV. HOW MUCH DOES PERFORMANCE MATTER HERE ANYWAY?
A fine question. The right place to look would IMO be at profiles on a busy non-exit, non-AESNI server, to see how much time is spent in AES there. Make sure you look at all the AES variants, since it's not uncommon to see more than one used at once.
A ridiculously large amount of that time will be in bn_sqr4x_mont and its unindicted co-conspirators, but expect this to get cut a great deal when we move to ECC for TLS DHE and for circuit extension.
So if we're willing to make a lot of assumptions, we can figure out a worst-case performance impact like this. Take the fraction of the time spent in bn_and_friends and remove it from the profile. (Assume that circuit creation dominates over TLS renegotiation, because hey, why not.) Then assume that the AES calls used for relay crypto's counter mode turn into whatever new relay encryption thing we're proposing. See how much that change costs Tor's performance overall.
V. ACKNOWLEDGMENTS
Thanks to everybody who's schooled me about this stuff, especially Daniel J. Bernstein, Ian Goldberg, and Robert Ransom. Assume that the good ideas are here due to their patience and that the bad ideas are here due to my slowness on the uptake. Thanks to Susan Born for a much-needed copy-edit. Assume that every correctly structured sentence is one she fixed, and that the bad ones are ones I messed up after her editing.
VI. References
[Prop202] https://lists.torproject.org/pipermail/tor-dev/2012-June/003649.html
[Ferguson] Niels Ferguson, "Authentication weaknesses in GCM", 2005. http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/CWC-GCM/Fergus...
[DJB-Timing] Daniel J. Bernstein, "Cache-timing attacks on AES", 2004. http://cr.yp.to/papers.html#cachetiming
[DJB-Comm] Daniel J. Bernstein, personal communication
[Poly1305-Trunc] http://osdir.com/ml/encryption.poly1305/2005-09/msg00007.html
[ARMv8] "ARMv8 Instruction Set Overview", 2011, http://board.flatassembler.net/download.php?id=5698
[Bear-Lion] Ross Anderson and Eli Biham. "Two Practical and Provably Secure Block Ciphers: BEAR and LION". http://www.cl.cam.ac.uk/~rja14/Papers/bear-lion.pdf
[Efficient-Tweakable] Palash Sarkar, "Efficient Tweakable Enciphering Schemes from (Block-Wise) Universal Hash Functions". 2008. http://eprint.iacr.org/2008/004.pdf
[IGE] Gligor and Donescu, "On Message Integrity in Symmetric Encryption" has the good IGE analysis: http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ige/ige-s... The original IGE proposal was by Carl Campbell in an old NIST publication that I can't find online; the paper above has a reference for it if you want to chase it more.
[ABC] Lars R. Knudsen, "Block Cipher Chaining Modes of Operation". http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/abc/abc-s...
[HCTR] Peng Wang, Dengguo Feng, and Wenling Wu. "HCTR: A variable-input-length enciphering mode." 2005. http://delta.cs.cinvestav.mx/~debrup/hctr.pdf
[HCH] Debrup Chakraborty and Palash Sarkar. "HCH: A new tweakable enciphering scheme using the hash-encrypt-hash approach." http://biblioteca.cinvestav.mx/indicadores/texto_completo/cinvestav/2006/136...
[TET] Shai Halevi. "Invertible universal hashing and the TET encryption mode." 2007. http://www.iacr.org/archive/crypto2007/46220405/46220405.pdf
On 10/8/12, Nick Mathewson nickm@torproject.org wrote:
I should share with the list an update of where I am with a design for an improved relay crypto protocol. For background and motivation, please see the last thread on the topic [Prop202].
There are three main questions remaining for me in choosing among new relay crypto protocols. Basically, they are: "Am I comfortable with this system?", "Among systems I'm comfortable with, how good is this system?", and "Do I know how to implement this system?"
Unfortunately, the stuff I am currently comfortable with and know that I could implement is not nearly as good as the stuff that I'm _nearly_ comfortable enough to use and I don't know how to implement.
Let's talk about some designs in detail, using the same terminology as proposal 202.
Note:
As usual, this is probably largely wrong. If you find it on a mailing list archive years from now, please don't try to learn cryptography from it.
What I'm looking for most right now is places where my reasoning is wrong, places where I'm mistaken about what tradeoffs I need to make, and places where there's research that I should have read or remembered.
I. MAC-AND-PAD DESIGNS
These are the least involved to implement: all you need is an AEAD construction and a PRNG. Those are both pretty well-understood: you can build a good PRNG out of any stream cipher, and you can build the AEAD mode out of a stream cipher and a MAC, or out of various other constructions.
As a wrinkle: it would be good to have a system that generates "two MACs for the price of one." That's because in our topology, we need the ability to address relay cells to any node in the circuit, and one easy way to do that here is to have a construction that generates two MACs, and uses one for "cells that should arrive here" and another for "cells that we should relay". We also want shorter MACs: 16 bytes per hop is excessive for our needs.
Authenticators like GCM whose output is the raw result of a polynomial evaluation aren't safe to truncate: Ferguson [Ferguson] has a result showing that it's easier than it should be to perform forgery against truncated-MAC GCM constructions.
On the other hand, polynomial authenticators seem safe to truncate (generally) if the encryption step comes _after_ the polynomial evaluation. See for example [Poly1305-Trunc]'s discussion of truncated Poly1305 -- I hope it's right.
That reference contains a rather large amount of bullshit; if any of the results stated there are correct, it's purely by accident. Read the security proofs in http://cr.yp.to/papers.html#securitywcs and http://cr.yp.to/papers.html#poly1305 (I have told you to read them multiple times now) if you want to use a truncated polynomial-evaluation MAC.
Or, you can pass the polynomial-evaluation MAC output through a yucky messy one-way function, then truncate the result of *that*. (But note that the only suitable one-way functions that you can compute efficiently are Salsa20 or ChaCha; AES is not suitable and cannot be computed efficiently.)
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output of AES(nonce) to each GCM output, in hopes that it will make GCM safe to truncate. (This is probably crazy somehow.)
This is utter crap. AES-GCM already adds (in GF(2^128)) the output of AES(nonce) (roughly) to each GCM output; adding the same value in a different field/group (a) will not help and (b) will almost certainly totally break the construction.
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message. Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
II. WIDE-BLOCK DESIGNS
We turn now to designs where we encrypt each cell's body as a single 509-byte block using an SPRP, such that each block's SPRP is keyed based on both the original key and the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Can we get faster than LIONESS? Indeed we can! There are a pile of constructions. If I understand correctly, and I'm not missing anything, the ones we might want to use fall into two broad categories: those that (like LIONESS) split the message into a short bit and a long bit, then treat them differently; and those that apply a transformation on the message, encrypt the message, and transform it again.
The first category (split, then frob each part based on the other once or twice) would appear to be more of a patent minefield, thanks to the XCB patent. The ones to look at here are [HCTR] and [HCH], which require two applications of a universal hash function (which can be polynomial-based) and approximately one encryption pass. HCH is a little further removed from XCB than HCTR is. Neither is exactly trivial. We could build either one out of whatever field and stream cipher we wanted, as above, though bitsliced AES-CTR wouldn't be efficient in this case: we aren't generating enough stream at once for bitslicing to pay off.
The second category (frob, encrypt, frob) is pretty elegant IMO. The best-explained of these I've seen so far are in a paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET construction [TET] might also be cool. For these, you need an invertible block-wise (Almost) (Xor-)Universal hash function, typically implemented with GF(2^128). I'm not sure if you could use a different field.
Please actually *read* http://cr.yp.to/papers.html#securitywcs this time (read the appendix first). If you use polynomial evaluation over a different field, your ‘hash function’ will have small differential properties with respect to addition *in that field*. The Poly1305 paper then proves that the polynomial-evaluation part of Poly1305 also has small differential properties with respect to addition in Z/(2^128)Z .
In short, you can use a different field for polynomial evaluation *if* you also use a different addition operation.
(If you're going to pass the result of the polynomial-evaluation function through a one-way function so that you can tee off some bits for a chaining output, you can use whatever addition operation you want after the OWF.)
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
(Does the XCB patent cover LION-like constructions with the ‘hash function’ in the middle replaced with a polynomial-evaluation function? I'm not convinced that a suitable approximately-256-bit polynomial-evaluation function will be much faster than ChaCha, especially in 32-bit code.)
Above, I haven't taken one of our requirements into account: that any change to a single cell must make all future cells unrecoverable.
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
As a blunt-force approach to the problem, we could rekey between each block, using new keys based on a MAC of the last block. This would be pretty inefficient for primitives that need any serious amount of per-key computation.
Use a ‘PRF’, not a MAC. (And don't use primitives that need per-key precomputation.)
Finally, we could look into constructions that produce an extra secret output incidental to their regular operation, and which are easily rekeyed for each operation.
This is the only option that you actually have a design for.
You'll still need to process the chaining outputs from LION with a PRF (i.e. keyed hash function).
In this whole field, we need to keep an eye out for the patents on CMC, EME, and XCB. "Joy."
III. CHOICE OF CIPHERS AND OTHER PRIMITIVES
I am hearing exactly two recommendations for encryption primitives nowadays: AES and Salsa20 (and its family). Nobody's recommending anything else as far as I can tell.
AES is still the IBM of the block cipher world, which "Nobody Ever Got Fired For Using." It's a bit of a pain to use it in software, though. Its most obvious implementations have timing attacks due to table lookups [DJB-Timing]. OpenSSL has three fast x86 implementations: one of them uses vector permutations, one uses bit-slicing, and one uses the aesni instructions to invoke the chip's built-in AES capabilities. On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles per byte respectively. The bitsliced one really needs to encrypt 4KiB at a time (yes, kibibytes), which makes it fine for counter mode but not so good for CBC. (Those numbers are from the OpenSSL source.) Key setup times are cheap for all of these (I think!), but the bitsliced implementation gets expensive if you change the key (or the stream position).
Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for people who don't want to use AES. It is faster than AES on every platform except those with hardware support for AES; 4-7 cycles per byte seems typical for high-end Intel stuff. You would probably have to go out of your way to implement it in a way that was vulnerable to timing side-channel attacks. The only argument for its insecurity that I'm aware of is that, although it has fewer known flaws than AES, it has received less attention than AES and is therefore likelier to have _unsuspected_ problems. The counterargument is that there _are_ several dozen cryptographers who have tried to attack it (or attack reduced-round variants of it), several of whom have also published successful results against AES. [DJB-Comm] Per-key setup time is basically nonexistent.
Per-key setup time consists of one vector addition (of the first half of the key into the constants). (DJB's timing reports in http://cr.yp.to/papers.html#chacha repeat this operation for each block, but I would rather do it once per key if at all possible.)
There are a lot of constructions out there that want to do multiplication in GF(2^128) as a basic operation. That turns out to be less than totally straightforward, though. On newer Intel chips, you've got a "multiply without carry" instruction that can supposedly let you get to around 2 cycles per byte. On other platforms, you're reduced to worse trickery to try to get good performance without timing side-channels, with the amount of trickery depending on exactly what operation you're trying to do.
The easiest GF(2^128) operation to implement safely and quickly is multiplication by a known compile-time value. (It's easy enough that even I could do it.) Next easiest is multiply a large number of values by the same run-time-fixed value -- you do precomputation on the value in order to generate some tables. (The scary, fast variants use big tables; the still-a-little-scary variants use 256-byte tables, and get performance on high-end Intel boxes around 7-10 cycles per byte when doing GCM.) Full-on multiplication of two arbitrary GF(2^128) values is slowest still.
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
* Compute a table of c*X^i for i = 0, ..., 127. (This table takes 128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.)
* For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
I won't believe claims that AES hardware will be widely available until it actually is present in every new processor from a major manufacturer.
Robert Ransom
On 10/9/12, Robert Ransom rransom.8774@gmail.com wrote:
On 10/8/12, Nick Mathewson nickm@torproject.org wrote:
The second category (frob, encrypt, frob) is pretty elegant IMO. The best-explained of these I've seen so far are in a paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET construction [TET] might also be cool. For these, you need an invertible block-wise (Almost) (Xor-)Universal hash function, typically implemented with GF(2^128). I'm not sure if you could use a different field.
Please actually *read* http://cr.yp.to/papers.html#securitywcs this time (read the appendix first). If you use polynomial evaluation over a different field, your ‘hash function’ will have small differential properties with respect to addition *in that field*. The Poly1305 paper then proves that the polynomial-evaluation part of Poly1305 also has small differential properties with respect to addition in Z/(2^128)Z .
In short, you can use a different field for polynomial evaluation *if* you also use a different addition operation.
Sorry -- that paper does require polynomials over a field of the same size as a block cipher's block size (for AES, that means GF(2^128)), and does not work with general almost-(xor-)universal hash functions.
(If you're going to pass the result of the polynomial-evaluation function through a one-way function so that you can tee off some bits for a chaining output, you can use whatever addition operation you want after the OWF.)
I don't see a way to obtain a chaining output from iHCH or HOH.
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
iHCH and HOH use a block cipher, not just a stream cipher.
Robert Ransom
On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom rransom.8774@gmail.com wrote: [...]
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message.
Right; I meant to say the output of a stream cipher used as a PRF, but what I meant isn't what I said. Should've proofread more carefully.
Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
I read everything I cited. If there is something I didn't understand, or something I missed, or something I got wrong, or something I said wrong, that doesn't mean I didn't read the paper.
I am not going to be able to draw all the right inferences from every paper on my own, though. And I am *definitely*, *frequently*, going to read papers, come up with questions, and post those questions here sometimes even when the paper, properly understood, would answer my questions. If I were writing for publication, I'd want to keep all my ideas secret until I could answer all my questions and make sure all my answers were right, but I'm not writing for publication -- I am writing to get feedback from other people and learn things. Thank you for helping me learn things!
Also thank you for reminding me to read http://cr.yp.to/antiforgery/securitywcs-20050227.pdf again. I'll check it out.
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
Are there any non-obvious reasons why? Does it lead to any better attacks than the obvious ones?
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
That's correct. Why would it be an attack for the exit to send a covert signal to the client? The exit already has valid means to send overt signals to the client, and the client Tor is presumed not to want to break its own anonymity.
II. WIDE-BLOCK DESIGNS
We turn now to designs where we encrypt each cell's body as a single 509-byte block using an SPRP, such that each block's SPRP is keyed based on both the original key and the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Ah, so I think you are claiming that LION or BEAR on its own is fine if they are only used once with each key, and that any sane cell chaining construction we use will involve re-keying them with each block, so we can ignore any attack on them that would require multiple invocations of the encryption/decryption function with the same key.
Do I understand you correctly there?
[...]
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
Interesting; I had stopped considering CubeHash when it got dropped from the SHA-3 competition. I'll have to find out who's been working on analyzing it since then.
Can you be more explicit about where the chaining output is taken from when, and how exactly it gets used, in this design?
Also, why LION and not BEAR?
(Does the XCB patent cover LION-like constructions with the ‘hash function’ in the middle replaced with a polynomial-evaluation function? I'm not convinced that a suitable approximately-256-bit polynomial-evaluation function will be much faster than ChaCha, especially in 32-bit code.)
I haven't read the XCB patent, and won't, pending legal advice from Wendy telling me that it's okay to read it.
[...]
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
- Compute a table of c*X^i for i = 0, ..., 127. (This table takes 128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.)
- For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
Hm. Did you figure out cycles-per-byte on that one?
Back-of-the-envelope:
Assuming 64-bit words, we're looking at 128 instances of "get a bit from y," and 128*2 instances of "load the corresponding word from the table, then constant-time-conditionally XOR that word into the accumulator."
The fast portable implementation of the conditional-xor I know, for a single bit in 'b', and a value in 'x' to be conditionally xored into 'a' is: "a ^= x & ~(b-1)".
I *think* that's roughly 128 (and, sub, not) to get a mask for each bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256 xors. (Some of these operations are unnecessary; I am committing fencepost errors here. I'm also assuming all loads are one-cycle.)
So that's about 1280 operations for 16 bytes of input to be multiplied by c, so that seems like something like 80 instructions per byte for a naive implementation. There are still probably more optimizations to be done, especially if we have wacky SIMD features to play with, or we can combine some of the operations above into single ones. Dynamic multiple issue and such CPU architectural tricks might get it down even more. Nevertheless, it still looks like it'd be expensive enough getting GF(2^128) right to make GF(2^128) unattractive.
Still, maybe somebody should hack this up for the public good, whether we turn out to need GF(2^128) or not.
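For what it's worth, here's roughly what the naive constant-time version could look like in portable C. This is only a sketch: the bit ordering and reduction polynomial are illustrative rather than GCM's exact conventions, and nothing here has been optimized or reviewed:

#include <stdint.h>

/* One GF(2^128) element: bit i of lo is the coefficient of X^i, bit i of
 * hi is the coefficient of X^(64+i). */
typedef struct { uint64_t hi, lo; } gf128;

/* Multiply by X, reducing modulo x^128 + x^7 + x^2 + x + 1, in constant time. */
static gf128
gf128_mul_x(gf128 a)
{
    uint64_t carry = a.hi >> 63;              /* coefficient of X^127 */
    gf128 r;
    r.hi = (a.hi << 1) | (a.lo >> 63);
    r.lo = (a.lo << 1) ^ (0x87ULL & (0ULL - carry));
    return r;
}

typedef struct { gf128 row[128]; } gf128_table;   /* c*X^i; 2048 bytes */

static void
gf128_precompute(gf128_table *t, gf128 c)
{
    for (int i = 0; i < 128; ++i) {
        t->row[i] = c;
        c = gf128_mul_x(c);
    }
}

/* Multiply y by the precomputed c.  Every row of the table is loaded and
 * masked no matter what y is, so there are no secret-dependent branches
 * or secret-dependent memory addresses. */
static gf128
gf128_mul_precomp(const gf128_table *t, gf128 y)
{
    gf128 acc = { 0, 0 };
    for (int i = 0; i < 64; ++i) {
        uint64_t mlo = 0ULL - ((y.lo >> i) & 1);   /* all-ones iff bit is set */
        uint64_t mhi = 0ULL - ((y.hi >> i) & 1);
        acc.lo ^= t->row[i].lo & mlo;
        acc.hi ^= t->row[i].hi & mlo;
        acc.lo ^= t->row[64 + i].lo & mhi;
        acc.hi ^= t->row[64 + i].hi & mhi;
    }
    return acc;
}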
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
I won't believe claims that AES hardware will be widely available until it actually is present in every new processor from a major manufacturer.
Agreed.
On 10/9/12, Nick Mathewson nickm@alum.mit.edu wrote:
On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom rransom.8774@gmail.com wrote: [...]
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message.
Right; I meant to say the output of a stream cipher used as a PRF, but what I meant isn't what I said. Should've proofread more carefully.
Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
I read everything I cited. If there is something I didn't understand, or something I missed, or something I got wrong, or something I said wrong, that doesn't mean I didn't read the paper.
I am not going to be able to draw all the right inferences from every paper on my own, though. And I am *definitely*, *frequently*, going to read papers, come up with questions, and post those questions here sometimes even when the paper, properly understood, would answer my questions. If I were writing for publication, I'd want to keep all my ideas secret until I could answer all my questions and make sure all my answers were right, but I'm not writing for publication -- I am writing to get feedback from other people and learn things.
You have been talking about using Poly1305 truncated to 64 bits for weeks. It is truly not difficult to find Theorem 3.3 in the Poly1305 paper and figure out that the polynomial-evaluation part of Poly1305 truncated to its first (least significant) 64 bits has differential probabilities of at most 2^(-34) for Tor-cell-size messages (and thus Poly1305 has a probability of forgery of at most 2^(-34) when truncated to its first 64 bits).
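(Spelling out the arithmetic, in case it helps anyone else following along: if I'm reading Theorem 3.3 right, it bounds the differential probability of the polynomial-evaluation hash by 8*ceil(L/16)/2^106 for L-byte messages; for L = 512 that is 2^8/2^106 = 2^-98 for the full 128-bit output, and truncating to 64 of those bits costs at most a factor of 2^64, which is where the 2^-34 comes from.)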
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
Are there any non-obvious reasons why? Does it lead to any better attacks than the obvious ones?
* Cannibalized circuits are one hop longer than non-cannibalized circuits; knowing that a particular circuit was cannibalized leaks information about the client's previous exit-selection behaviour.
* Some users might want to use alternate path-selection algorithms (e.g. http://freehaven.net/anonbib/#ccs2011-trust ); they might not want to leak the fact that they are using such algorithms to the non-first relays in their entry-guard chains.
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
That's correct. Why would it be an attack for the exit to send a covert signal to the client? The exit already has valid means to send overt signals to the client, and the client Tor is presumed not to want to break its own anonymity.
* Mallory's malicious entry and exit relays suspect (based on timing correlation) that they control both ends of a circuit, and want to confirm that. The exit tags a cell it sends to the client by XORing it with a secret random constant; the entry XORs (what it believes is) that cell with the same constant. If the circuit survives, Mallory's relays know that they control both ends.
* Mallory doesn't want to pay the bandwidth costs for non-compromised traffic, so his/her/its entry and exit relays tag *every* circuit's first exit-to-client cell with the same secret random constant. (You added a crapload of bugs to 0.2.3.x based on Mike Perry's claim that someone is likely to actually perform an ‘attack’ of this form.)
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Ah, so I think you are claiming that LION or BEAR on its own is fine if they are only used once with each key, and that any sane cell chaining construction we use will involve re-keying them with each block, so we can ignore any attack on them that would require multiple invocations of the encryption/decryption function with the same key.
Do I understand you correctly there?
Yes.
[...]
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
Interesting; I had stopped considering CubeHash when it got dropped from the SHA-3 competition. I'll have to find out who's been working on analyzing it since then.
Can you be more explicit about where the chaining output is taken from when, and how exactly it gets used, in this design?
For encryption:
* Split the 512-byte plaintext cell into 480-byte R and 32-byte L. (The names are reversed to match the original description of LION.)
* Let header be the first 3 bytes of R; replace the first 3 bytes of R with zeros.
* Let C_1 be the first 32 bytes of ChaCha(L XOR K_1, “LION p.1”); let T be the next 480 bytes.
* XOR T into R. Replace the first 3 bytes of R with zeros.
* Let T be the first 32 bytes of CubeHash512(R); let C_2 be the other 32 bytes.
* XOR T into L.
* Let C_3 be the first 32 bytes of ChaCha(L XOR K_2, “LION p.2”); let T be the next 480 bytes.
* XOR T into R. Replace the first 3 bytes of R with header.
* The ciphertext is concat(R, L).
(Decryption is easy, and produces the same C_i.)
To compute keys for the next block:
* Compute CubeHash512(concat(K_3, C_1, C_2, C_3)); let K_1 be its first 32 bytes; let K_2 be the next 32 bytes.
The chaining step absolutely requires 512 bits of output. That could be obtained by computing one ChaCha block using the output of BLAKE-256 as a key, but CubeHash is simpler, at least as secure, and probably as fast.
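In case it helps, here is the construction above written out in C. chacha_stream() and cubehash512() are hypothetical wrappers (a ChaCha keystream under a 32-byte key and an 8-byte nonce string, and a CubeHash512 digest); this is only the steps transcribed, not a vetted implementation:

#include <string.h>

#define CELL_LEN 512
#define R_LEN 480
#define L_LEN 32

/* Hypothetical primitives. */
void chacha_stream(const unsigned char key[32], const char *nonce8,
                   unsigned char *out, size_t outlen);
void cubehash512(const unsigned char *in, size_t inlen,
                 unsigned char out[64]);

struct lion_chain {
    unsigned char k1[32], k2[32], k3[32];
};

static void
xor_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] ^= src[i];
}

static void
lion_encrypt_cell(struct lion_chain *st, unsigned char cell[CELL_LEN])
{
    unsigned char *r = cell, *l = cell + R_LEN;   /* R then L, in place */
    unsigned char header[3], key[32];
    unsigned char stream[32 + R_LEN], digest[64];
    unsigned char c1[32], c2[32], c3[32], chain[32 + 3*32];

    memcpy(header, r, 3);
    memset(r, 0, 3);

    /* Pass 1: ChaCha(L XOR K_1, "LION p.1") gives C_1 then T; XOR T into R. */
    memcpy(key, l, L_LEN);
    xor_bytes(key, st->k1, 32);
    chacha_stream(key, "LION p.1", stream, sizeof(stream));
    memcpy(c1, stream, 32);
    xor_bytes(r, stream + 32, R_LEN);
    memset(r, 0, 3);

    /* Middle: CubeHash512(R) gives T then C_2; XOR T into L. */
    cubehash512(r, R_LEN, digest);
    xor_bytes(l, digest, L_LEN);
    memcpy(c2, digest + 32, 32);

    /* Pass 2: ChaCha(L XOR K_2, "LION p.2") gives C_3 then T; XOR T into R. */
    memcpy(key, l, L_LEN);
    xor_bytes(key, st->k2, 32);
    chacha_stream(key, "LION p.2", stream, sizeof(stream));
    memcpy(c3, stream, 32);
    xor_bytes(r, stream + 32, R_LEN);
    memcpy(r, header, 3);

    /* The ciphertext is concat(R, L), already in place in cell. */

    /* Chaining: next K_1, K_2 come from CubeHash512(K_3 || C_1 || C_2 || C_3). */
    memcpy(chain, st->k3, 32);
    memcpy(chain + 32, c1, 32);
    memcpy(chain + 64, c2, 32);
    memcpy(chain + 96, c3, 32);
    cubehash512(chain, sizeof(chain), digest);
    memcpy(st->k1, digest, 32);
    memcpy(st->k2, digest + 32, 32);
}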
Also, why LION and not BEAR?
ChaCha12 is definitely faster than BLAKE-256 or CubeHash, and you aren't going to use a reduced-round variant of a hash function.
[...]
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
- Compute a table of c*X^i for i = 0, ..., 127. (This table takes 128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.)
- For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
Hm. Did you figure out cycles-per-byte on that one?
Nope. This was just a proof-of-concept. (It also provides an upper bound on the amount of table that a secure implementation can reasonably use; anything with a larger table will need to do too much I/O.)
Back-of-the-envelope:
Assuming 64-bit words, we're looking at 128 instances of "get a bit from y," and 128*2 instances of "load the corresponding word from the table, then constant-time-conditionally XOR that word into the accumulator."
All 64-bit processors have SIMD instructions. In general, consider 32-bit words and 4-element (128-bit) vectors of 32-bit words; those are efficient enough everywhere.
The fast portable implementation of the conditional-xor I know, for a single bit in 'b', and a value in 'x' to be conditionally xored into 'a' is: "a ^= x & ~(b-1)".
Invert y (bitwise) first (in one vector instruction or four word instructions); then you can skip inverting each b-1 . (Someone may need to patch your Pynchon Gate code.)
I *think* that's roughly 128 (and, sub, not) to get a mask for each bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256 xors. (Some of these operations are unnecessary; I am committing fencepost errors here. I'm also assuming all loads are one-cycle.)
So that's about 1280 operations for 16 bytes of input to be multiplied by c, so that seems like something like 80 instructions per byte for a naive implementation. There are still probably more optimizations to be done, especially if we have wacky SIMD features to play with, or we can combine some of the operations above into single ones. Dynamic multiple issue and such CPU architectural tricks might get it down even more. Nevertheless, it still looks like it'd be expensive enough getting GF(2^128) right to make GF(2^128) unattractive.
With the not operation for each bit removed, the 64-bit-word operation count drops to around 1024.
With 32-bit words, that's 512 each of load, and, xor ; the total number of instructions is around 1920 on a plain 32-bit processor.
With SIMD, that's 128 each of load, and, xor , but the ‘compute a mask for each bit’ step is annoying and processor-specific, so it's not much better in terms of instruction count than 64-bit words. However, the table can be computed in a slightly different format to make it easier to compute masks in the SIMD registers; that should get down to not much more than 768 instructions.
Still, maybe somebody should hack this up for the public good, whether we turn out to need GF(2^128) or not.
Look for a better algorithm first -- there should be one.
Robert Ransom
On Tue, 9 Oct 2012 00:28:38 -0400 Nick Mathewson nickm@torproject.org wrote:
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
I think (IMHO) Keccak will make many (most?) symmetric encryption modes obsolete in the near future.
Keccak is now the SHA-3 winner, but it is not only a hash. Keccak is universal: it can be used for authenticated stream encryption in a single pass, absorbing any amount of additional input and producing any amount of additional MAC output from that one pass (the so-called duplexing mode).
http://sponge.noekeon.org/SpongeDuplex.pdf
"Duplexing the sponge: single-pass authenticated encryption and other applications" Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche.
This year Keccak will officially receive only hash status. Later we can expect many other modes of using Keccak as a universal RO-indistinguishable PRF, with good security proofs and tons of analysis already published. Some parts of protocols can be done more simply with Keccak: new padding modes for RSA to replace OAEP are one example.
Cite: " In a sponge function, the input is like a white page: It does not impose any specific structure to it. Additional optional inputs (e.g., key, nonce, personalization data) can be appended or prepended to the input message according to a well-defined convention, possibly under the hood of diversification as proposed in [6, Section “Domain separation”]. K supports all the possible applications of sponge functions and duplex objects described in [6, Chapters “Sponge applications” and “Duplex applications”]. These include hash function, randomized hash function, hash function instance differentiation, slow one-way function, parallel and tree hashing, mask generating function, key derivation function, deterministic random bit generator, reseedable pseudo random bit sequence generator, message authentication code (MAC) function, stream cipher, random-access stream cipher and authenticated encryption. "
http://keccak.noekeon.org/Keccak-submission-3.pdf
"The Keccak SHA-3 submission"
Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche
Keccak is fast in hardware and can be realized on GPUs first.
"Keccak Tree hashing on GPU, using Nvidia Cuda API" https://sites.google.com/site/keccaktreegpu/
If NIST adopts Keccak for many uses as standards, then most of the crypto infrastructure will migrate to it. In the future, Keccak will be more than AES is today, and it will make many uses of AES (and of any other block cipher) unnecessary (excluding PRP modes for disk encryption, though PRF-to-PRP transformation modes are potentially possible too).
On Thu, 11 Oct 2012 19:17:22 +0000 unknown unknown@pgpru.com wrote:
On Tue, 9 Oct 2012 00:28:38 -0400 Nick Mathewson nickm@torproject.org wrote:
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
I think (IMHO) Keccak will make many (most?) symmetric encryption modes obsolete in the near future.
What I wrote about Keccak one year ago:
https://lists.torproject.org/pipermail/tor-dev/2011-November/003020.html
Now Keccak is SHA-3. I hope Keccak will soon replace everything: from the old, messy entropy accumulation and distillation code in /dev/random to AES-CTR-HMAC encryption in SSL, and from the complex RSA-OAEP padding mode in signatures to exotic tweaked block ciphers.
Keccak is simpler to implement without mistakes, and it is easier to design protocols with solid security proofs around it. It is universal thanks to its generic inner structure, without needing any special additional modules or switches.
Thus spake Nick Mathewson (nickm@torproject.org):
I should share with the list an update of where I am with a design for an improved relay crypto protocol. For background and motivation, please see the last thread on the topic [Prop202].
There are three main questions remaining for me in choosing among new relay crypto protocols. Basically, they are: "Am I comfortable with this system?", "Among systems I'm comfortable with, how good is this system?", and "Do I know how to implement this system?"
Unfortunately, the stuff I am currently comfortable with and know that I could implement is not nearly as good as the stuff that I'm _nearly_ comfortable enough to use and I don't know how to implement.
Let's talk about some designs in detail, using the same terminology as proposal 202.
[*snip*]
II. WIDE-BLOCK DESIGNS
[*snip*]
I am a little intimidated by the idea of trying to implement one of these ciphers.
I too am worried that trying to code and deploy relatively new constructions is likely risky, or at least very tricky.
Above, I haven't taken one of our requirements into account: that any change to a single cell must make all future cells unrecoverable.
Will UDP transports make this even more tricky?
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
Am I crazy to think we might try to stop the bleeding of tagging attacks by figuring out a way to use ABC or IGE mode as a stopgap until people can code and evaluate new constructions for performance and timing side-channels? ABC/IGE would "only" involve a mode change, rather than an entire relay protocol upgrade and new cipher coding..
IGE might also actually exist in OpenSSL: http://www.links.org/?p=137
It also sounds like IGE is only broken if we try to use it for authentication.. We don't really need that property, do we? What we really want is the plaintext corruption property at the middle node upon ciphertext modification..
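For reference, the IGE recurrence is c_i = E_K(p_i xor c_{i-1}) xor p_{i-1}; here's a minimal sketch (block cipher left as a placeholder, pt and ct assumed not to overlap) just to show where the forward corruption-propagation comes from:

  #include <stdint.h>
  #include <string.h>

  #define BLOCK 16

  /* Assumed single-block encryption with a 128-bit key (e.g. AES-128). */
  void block_encrypt(const uint8_t key[16], const uint8_t in[BLOCK], uint8_t out[BLOCK]);

  /* IGE: c_i = E_K(p_i ^ c_{i-1}) ^ p_{i-1}, with c_0 and p_0 as the two IV halves.
   * On decryption, a flipped ciphertext bit garbles that block and every later one,
   * unless an attacker with enough known plaintext repairs the hole (the known break). */
  void
  ige_encrypt(const uint8_t key[16],
              const uint8_t iv_c0[BLOCK], const uint8_t iv_p0[BLOCK],
              const uint8_t *pt, uint8_t *ct, size_t nblocks)
  {
    uint8_t prev_c[BLOCK], prev_p[BLOCK], cur_p[BLOCK], tmp[BLOCK];
    size_t i, j;

    memcpy(prev_c, iv_c0, BLOCK);
    memcpy(prev_p, iv_p0, BLOCK);
    for (i = 0; i < nblocks; i++) {
      memcpy(cur_p, &pt[i*BLOCK], BLOCK);
      for (j = 0; j < BLOCK; j++)
        tmp[j] = cur_p[j] ^ prev_c[j];
      block_encrypt(key, tmp, &ct[i*BLOCK]);
      for (j = 0; j < BLOCK; j++)
        ct[i*BLOCK + j] ^= prev_p[j];
      memcpy(prev_c, &ct[i*BLOCK], BLOCK);
      memcpy(prev_p, cur_p, BLOCK);
    }
  }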
We could also remove a lot of known plaintext by replacing zero-fill with random fill in RELAY_RESOLVE, RELAY_BEGIN, and other short relay cells. That should only be expensive at the client...
VI. References
[Prop202] https://lists.torproject.org/pipermail/tor-dev/2012-June/003649.html
[Ferguson] Niels Ferguson, "Authentication weaknesses in GCM", 2005. http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/CWC-GCM/Fergus...
[DJB-Timing] Daniel J. Bernstein, "Cache-timing attacks on AES", 2004. http://cr.yp.to/papers.html#cachetiming
[DJB-Comm] Daniel J. Bernstein, personal communication
[Poly1305-Trunc] http://osdir.com/ml/encryption.poly1305/2005-09/msg00007.html
[ARMv8] "ARMv8 Instruction Set Overview", 2011, http://board.flatassembler.net/download.php?id=5698
[Bear-Lion] Ross Anderson and Eli Biham. "Two Practical and Provably Secure Block Ciphers: BEAR and LION". http://www.cl.cam.ac.uk/~rja14/Papers/bear-lion.pdf
[Efficient-Tweakable] Palash Sarkar, "Efficient Tweakable Enciphering Schemes from (Block-Wise) Universal Hash Functions". 2008. http://eprint.iacr.org/2008/004.pdf
[IGE] Gligor and Donescu, "On Message Integrity in Symmetric Encryption" has the good IGE analysis: http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ige/ige-s... The original IGE proposal was by Carl Campbell in an old NIST publication that I can't find online; the paper above has a reference for it if you want to chase it more.
[ABC] Lars R. Knudsen, "Block Cipher Chaining Modes of Operation". http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/abc/abc-s...
[HCTR] Peng Wang, Dengguo Feng, and Wenling Wu. "HCTR: A variable-input-length enciphering mode." 2005. http://delta.cs.cinvestav.mx/~debrup/hctr.pdf
[HCH] Debrup Chakraborty and Palash Sarkar. "HCH: A new tweakable enciphering scheme using the hash-encrypt-hash approach." http://biblioteca.cinvestav.mx/indicadores/texto_completo/cinvestav/2006/136...
[TET] Shai Halevi. "Invertible universal hashing and the TET encryption mode." 2007. http://www.iacr.org/archive/crypto2007/46220405/46220405.pdf
On Thu, Oct 18, 2012 at 6:10 PM, Mike Perry mikeperry@torproject.org wrote: [...]
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
Am I crazy to think we might try to stop the bleeding of tagging attacks by figuring out a way to use ABC or IGE mode as a stopgap until people can code and evaluate new constructions for performance and timing side-channels? ABC/IGE would "only" involve a mode change, rather than an entire relay protocol upgrade and new cipher coding..
ABC or IGE wouldn't help us much on their own without a wide-block cipher, and IGE is just plain broken. (See explanation below.)
Remember, in the document I originally sent, I was talking about using ABC or some other corruption-propagation mode at a block level. That requires a wide-block cipher, though. And it turns out we can do better if the corruption-propagation is part of the wide-block idea.
We'd also burn our performance on platforms without AES acceleration, I think.
IGE might also actually exist in OpenSSL: http://www.links.org/?p=137
It also sounds like IGE is only broken if we try to use it for authentication.. We don't really need that property, do we? What we really want is the plaintext corruption property at the middle node upon ciphertext modification..
That _is_ a kind of authentication, or an analogue to it. And the point is that an adversary can repair a hole in the stream, and *stop* the plaintext corruption. So IGE does not deliver the property we would want for it, even if we could use it.
Check out this thread, and the stuff it references: http://www.mail-archive.com/cryptography@metzdowd.com/msg06599.html
We could also remove a lot of known plaintext by replacing zero-fill with random fill in RELAY_RESOLVE, RELAY_BEGIN, and other short relay cells. That should only be expensive at the client...
So long as there is a block's worth of known or guessed plaintext, IGE fails to ensure that changes propagate forward. Like, 16 bytes worth of guessable HTTP in a payload (if you're thinking about this in a non-wide-block scenario).
Two general process thoughts:
* I may be saying this from an overabundance of caution, but: I don't think we should use cryptographic primitives and constructions with known flaws, even if we can't see a way for them to hurt us right now, and even if we can come up with a solid-seeming argument for how those flaws can't hurt us.. That's how we got into our AES-CTR mess in the first place.
* I know everybody wants our crypto problems to get solved, but it's critical to get this stuff right. I think that the way to do right by our users is by taking the time we will need to design the right thing properly, rather than jumping into something halfcocked. We all acknowledge that it's easy for people and organizations to screw this stuff up: so let's take our time and actually come up with something solid. Against the current pain and badness of our current system, we must weigh the potential harm of jumping precipitously into something that turns out to be broken because we didn't think about it hard enough.
And a more general observation:
* I think the only thing we could get designed & analyzed enough to include in Tor 0.2.4 would be hmac-and-pad, which nobody actually likes much. I think that by the time we get anything good here well designed and discussed enough to reach a sensible consensus about it, it will be 2013, and time to start adding features to 0.2.5 (at the earliest).
So here's to getting it right!
yrs,
Thus spake Nick Mathewson (nickm@alum.mit.edu):
On Thu, Oct 18, 2012 at 6:10 PM, Mike Perry mikeperry@torproject.org wrote: [...]
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
Am I crazy to think we might try to stop the bleeding of tagging attacks by figuring out a way to use ABC or IGE mode as a stopgap until people can code and evaluate new constructions for performance and timing side-channels? ABC/IGE would "only" involve a mode change, rather than an entire relay protocol upgrade and new cipher coding..
ABC or IGE wouldn't help us much on their own without a wide-block cipher, and IGE is just plain broken. (See explanation below.)
Remember, in the document I originally sent, I was talking about using ABC or some other corruption-propagation mode at a block level. That requires a wide-block cipher, though. And it turns out we can do better if the corruption-propagation is part of the wide-block idea.
We'd also burn our performance on platforms without AES acceleration, I think.
IGE might also actually exist in OpenSSL: http://www.links.org/?p=137
It also sounds like IGE is only broken if we try to use it for authentication.. We don't really need that property, do we? What we really want is the plaintext corruption property at the middle node upon ciphertext modification..
That _is_ a kind of authentication, or an analogue to it. And the point is that an adversary can repair a hole in the stream, and *stop* the plaintext corruption. So IGE does not deliver the property we would want for it, even if we could use it.
I am still wondering if it is possible to eliminate enough consecutive regions of known plaintext to make this acceptable for the short term, until we figure out the wide-block thing for real. From the attack here: http://www.mail-archive.com/cryptography@metzdowd.com/msg06599.html it looks as though as long as we can avoid 32 consecutive bytes of known plaintext (two consecutive 128-bit cipher blocks), we can prevent hole-closing.
If you want to know why I'm crazy enough to still be wondering this, see subsequent paragraphs.
Check out this thread, and the stuff it references: http://www.mail-archive.com/cryptography@metzdowd.com/msg06599.html
We could also remove a lot of known plaintext by replacing zero-fill with random fill in RELAY_RESOLVE, RELAY_BEGIN, and other short relay cells. That should only be expensive at the client...
So long as there is a block's worth of known or guessed plaintext, IGE fails to ensure that changes propagate forward. Like, 16 bytes worth of guessable HTTP in a payload (if you're thinking about this in a non-wide-block scenario).
Hrmm.. I think that failures after the stream is established are way less dangerous than ways you can tag and cause failures *before* the stream is established. In the pre-established case, Tor keeps retrying transparently behind the user's back until it gets a compromised exit. In the post-established case, the user is completely unable to use Tor 80% or 90% of the time, because the circuit is torn down *after* their user agent has begun sending data.. In other words, at least we would fail closed.
This reminds me of something I also wanted to ask about. Technically for the tagging attack, all we need to authenticate is circuit construction and RELAY_RESOLVE and RELAY_BEGIN. Might there be ways to get this without the expense and complications of either truncated MAC's or wide-block ciphers? Or at least remove known-plaintext from *those* cells?
Two general process thoughts:
- I may be saying this from an overabundance of caution, but: I don't
think we should use cryptographic primitives and constructions with known flaws, even if we can't see a way for them to hurt us right now, and even if we can come up with a solid-seeming argument for how those flaws can't hurt us.. That's how we got into our AES-CTR mess in the first place.
I would argue that where we *really* need an overabundance of caution is to ensure we provide the agility to change the cipher mode/construction for this scheme in a very short period of time. I don't think our *real* woes are because we didn't think hard enough about cryptography or the security properties of AES_CTR. They're because we fixed the cipher and mode at "AES_CTR", and now we're going to be stuck with vulnerability to a very dangerous attack for years.. "If you're typing the letters AES into your code, you're doing it wrong."
Based on this idea, I'm wondering if we should spend more of our time thinking hard about making the relay protocol be able to support changing the construction/primitive so we can support a readily available but non-ideal mode for 0.2.4.x, but then upgrade to something stronger for 0.2.5.x. (And when *that* construction/implementation turns out to be flawed or have side-channels, we can switch again in 0.2.6.x).
If we spend time on ensuring this agility instead of pondering the deep magic of wide-block ciphers, we might be able to roll out AES_IGE + eliminate consecutive regions of pre-established relay cell known plaintext for 0.2.4.x, and then save the deep magic for 0.2.5.x or beyond.
I looked through Proposal 202, and I don't see any mechanism for switching constructions/cipher choices in there?
- I know everybody wants our crypto problems to get solved, but it's
critical to get this stuff right. I think that the way to do right by our users is by taking the time we will need to design the right thing properly, rather than jumping into something halfcocked. We all acknowledge that it's easy for people and organizations to screw this stuff up: so let's take our time and actually come up with something solid. Against the current pain and badness of our current system, we must weigh the potential harm of jumping precipitously into something that turns out to be broken because we didn't think about it hard enough.
Will I ever be able to convince you of the value of "jumping early and often?" ;)
Related, have you seen this drunken rant (I'm linking to a summary because the original rant is very TL;DR): http://cryptochaos.com/liberal-vs-conservative-software-engineering
I think one of the main reasons most Tor folks find my development style so obnoxious is because I'm definitely from the extreme-liberal school of thought :).
And a more general observation:
- I think the only thing we could get designed & analyzed enough to
include in Tor 0.2.4 would be hmac-and-pad, which nobody actually likes much. I think that by the time we get anything good here well designed and discussed enough to reach a sensible consensus about it, it will be 2013, and time to start adding features to 0.2.5 (at the earliest).
So here's to getting it right!
I agree.. But if we don't leave room to get the crypto wrong and still swap it out quickly, we're going to be here again in another year or three, being even more sad than this time :/.
On Thu, Oct 18, 2012 at 11:18 PM, Mike Perry mikeperry@torproject.org wrote:
Thus spake Nick Mathewson (nickm@alum.mit.edu):
On Thu, Oct 18, 2012 at 6:10 PM, Mike Perry mikeperry@torproject.org wrote: [...]
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
Am I crazy to think we might try to stop the bleeding of tagging attacks by figuring out a way to use ABC or IGE mode as a stopgap until people can code and evaluate new constructions for performance and timing side-channels? ABC/IGE would "only" involve a mode change, rather than an entire relay protocol upgrade and new cipher coding..
ABC or IGE wouldn't help us much on their own without a wide-block cipher, and IGE is just plain broken. (See explanation below.)
Remember, in the document I originally sent, I was talking about using ABC or some other corruption-propagation mode at a block level. That requires a wide-block cipher, though. And it turns out we can do better if the corruption-propagation is part of the wide-block idea.
We'd also burn our performance on platforms without AES acceleration, I think.
IGE might also actually exist in OpenSSL: http://www.links.org/?p=137
It also sounds like IGE is only broken if we try to use it for authentication.. We don't really need that property, do we? What we really want is the plaintext corruption property at the middle node upon ciphertext modification..
That _is_ a kind of authentication, or an analogue to it. And the point is that an adversary can repair a hole in the stream, and *stop* the plaintext corruption. So IGE does not deliver the property we would want for it, even if we could use it.
I am still wondering if it is possible to eliminate enough consecutive regions of known plaintext to make this acceptable for the short term, until we figure out the wide-block thing for real. From the attack here: http://www.mail-archive.com/cryptography@metzdowd.com/msg06599.html it looks as though as long as we can avoid 32 consecutive bytes of known plaintext (two consecutive 128-bit cipher blocks), we can prevent hole-closing.
But 32 consecutive bytes of known plaintext are pretty much inevitable, right? That's how protocols work.
Also, that is *one* attack on IGE. That one needs a given amount of known plaintext. I have no idea which others exist. Generally once there's one known attack on a system, it gets less attention from cryptographers, since who'd bother going around enumerating all the other circumstances under which a broken thing is broken. So when you read that the system isn't secure because it breaks if you do X, you can't usually infer that it's okay so long as you _don't_ do X -- you need to look for actual proofs or whatever that it is safe under the circumstance you propose.
If you want to know why I'm crazy enough to still be wondering this, see subsequent paragraphs.
Check out this thread, and the stuff it references: http://www.mail-archive.com/cryptography@metzdowd.com/msg06599.html
We could also remove a lot of known plaintext by replacing zero-fill with random fill in RELAY_RESOLVE, RELAY_BEGIN, and other short relay cells. That should only be expensive at the client...
So long as there is a block's worth of known or guessed plaintext, IGE fails to ensure that changes propagate forward. Like, 16 bytes worth of guessable HTTP in a payload (if you're thinking about this in a non-wide-block scenario).
Hrmm.. I think that failures after the stream is established are way less dangerous than ways you can tag and cause failures *before* the stream is established. In the pre-established case, Tor keeps retrying transparently behind the user's back until it gets a compromised exit. In the post-established case, the user is completely unable to use Tor 80% or 90% of the time, because the circuit is torn down *after* their user agent has begun sending data.. In other words, at least we would fail closed.
So could one workaround (if not the right answer) be to time out after trying fewer exits, and/or to notice differential stream failure rates between different guards? That would be a pretty neat thing to do; I wonder if it would work.
Of course we'd need to figure out the details and get it implemented solidly.
This reminds me of something I also wanted to ask about. Technically for the tagging attack, all we need to authenticate is circuit construction and RELAY_RESOLVE and RELAY_BEGIN. Might there be ways to get this without the expense and complications of either truncated MAC's or wide-block ciphers? Or at least remove known-plaintext from *those* cells?
I don't think those are the only attack opportunities here. Again, they're ones that have been explained and proposed, but there's surely more stuff too.
Two general process thoughts:
- I may be saying this from an overabundance of caution, but: I don't
think we should use cryptographic primitives and constructions with known flaws, even if we can't see a way for them to hurt us right now, and even if we can come up with a solid-seeming argument for how those flaws can't hurt us.. That's how we got into our AES-CTR mess in the first place.
I would argue that where we *really* need an overabundance of caution is to ensure we provide the agility to change the cipher mode/construction for this scheme in a very short period of time. I don't think our *real* woes are because we didn't think hard enough about cryptography or the security properties of AES_CTR. They're because we fixed the cipher and mode at "AES_CTR", and now we're going to be stuck with vulnerability to a very dangerous attack for years.. "If you're typing the letters AES into your code, you're doing it wrong."
Well, keep in mind that we didn't, and still don't, have a drop-in replacement that's any good. Our design right now has a place where you plug in a stream cipher. Sure, we could have made it so you can drop in RC4 or 3DES-OFB or whatever craziness we would have come up with in 2004. But dropping in something *good* instead of AES_CTR requires that it not be a stream cipher. And we don't have a non-stream-cipher mode that works here.
Based on this idea, I'm wondering if we should spend more of our time thinking hard about making the relay protocol be able to support changing the construction/primitive so we can support a readily available but non-ideal mode for 0.2.4.x, but then upgrade to something stronger for 0.2.5.x. (And when *that* construction/implementation turns out to be flawed or have side-channels, we can switch again in 0.2.6.x).
If we spend time on ensuring this agility instead of pondering the deep magic of wide-block ciphers, we might be able to roll out AES_IGE + eliminate consecutive regions of pre-established relay cell known plaintext for 0.2.4.x, and then save the deep magic for 0.2.5.x or beyond.
I looked through Proposal 202, and I don't see any mechanism for switching constructions/cipher choices in there?
That'd be semi-implicit in proposal 200, where you use the create cell type to select the crypto you want.
Migrating to a new algorithm here will be kind of fun -- or rather, disabling the old one will be -- since the only way to turn off the old one entirely is to stop allowing servers who don't support the new one on the network. That could use a better writeup and thought process than we've given it yet.
- I know everybody wants our crypto problems to get solved, but it's
critical to get this stuff right. I think that the way to do right by our users is by taking the time we will need to design the right thing properly, rather than jumping into something halfcocked. We all acknowledge that it's easy for people and organizations to screw this stuff up: so let's take our time and actually come up with something solid. Against the current pain and badness of our current system, we must weigh the potential harm of jumping precipitously into something that turns out to be broken because we didn't think about it hard enough.
Will I ever be able to convince you of the value of "jumping early and often?" ;)
Only by having it pay off. :)
yrs,