Re: [tor-dev] Even more notes on relay-crypto constructions

9 Oct 2012

      On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom rransom.8774@gmail.com wrote:
 [...]
...
...
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305.  Poly1305 requires nonces, but we can use a
  counter for those.
Poly1305AES requires nonces.  Poly1305 itself requires
(computationally-indistinguishable-from-) independent keys for each
message.
Right; I meant to say the output of a stream cipher used as a PRF, but
what I meant isn't what I said.  Should've proofed more carefully
...
Please actually read the paper (see also
http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305
now).
I read everything I cited.  If there is something I didn't understand,
or something I missed, or something I got wrong, or something I said
wrong, that doesn't mean I didn't read the paper.
I am not going to be able to draw all the right inferences from every
paper on my own, though.  And I am *definitely*, *frequently*, going
to read papers, come up with questions, and post those questions here
sometimes even when the paper, properly understood, would answer my
questions.  If I were writing for publication, I'd want to keep all my
ideas secret until I could answer all my questions and make sure all
my answers were right, but I'm not writing for publication -- I am
writing to get feedback from other people and learn things.  Thank you
for helping me learn things!
Also thank you for reminding me to read
http://cr.yp.to/antiforgery/securitywcs-20050227.pdf again.  I'll
check it out.
...
...
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the
cell, but also the previous cell's authenticator and the position of
the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support
for AES and GCM, but not on other architectures until/unless more
chips grow instructions specialized for AES and GF(2^128).
Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had
before: it leaks the length of the circuit to the last node in the
circuit, and consumes a fairly large portion of each cell (for one
truncated mac per hop).  Further, it's going to be a bit fiddly
(though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
Are there any non-obvious reasons why?  Does it lead to any better
attacks than the obvious ones?
...
The MAC-based designs do not mention how to prevent end-to-end tagging
in the exit-to-client direction.  I suspect they won't try to prevent
it at all.
That's correct.  Why would it be an attack for the exit to send a
covert signal to the client?  The exit already has valid means to send
overt signals to the client, and the client Tor is presumed not to
want to break its own anonymity.
...
...
II. WIDE-BLOCK DESIGNS
Turn we now to designs where we encrypt each block of the cipher using
a 509-byte SPRP, such that each block's SPRP is keyed dependently on
the original key and on the encrypted value of the previous block.
There be dragons here.  Let's talk about some ways we could build
that.  I'll start by talking about wide-block encryption, then talk
about getting the "unrecoverability" property where any missing or
corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion].  It
requires two keyed hash operations and two stream cipher operations,
so it's going to be at best half the speed of the mac-and-pad designs
above: a little worse, maybe, since it forces us to rekey with every
block.
Might that be acceptable?  Right now, AES is not a huge fraction of
our runtime, and a relay does AES on each cell three times (TLS,relay
crypto,TLS) and SHA1 on each cell twice (TLS,TLS).  An exit does AES
on each cell twice (TLS,relay) and SHA1 on each cell twice
(TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be
increasing the stream-cipher operations by 25% and the digests by
100%.  On an exit, we'd be increasing the stream-cipher operations by
33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster,
better keyed-hash algorithms than SHA1.   I don't know whether one of
the polynomial MACs above would be good enough, or whether we need
other cryptographic digest properties that those don't give us and
we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION
instead, but I'm leery of the idea.  They are faster, requiring one
fewer application of their primitives, but they are PRPs, not SPRPs --
meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section
6].  I don't know a way to exploit this in the Tor network, but
deviating from an ideal block cipher seems to me like one of those
ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems;
it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual
definition, and it is a much stronger assumption than Tor's relay
crypto needs.  (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and
http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially
useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if
the circuit-extension handshake ever produces the same session key
twice, and some parts of Tor's protocols will still require that
assumption even if the relay crypto doesn't.  Therefore, you really
don't need to worry about attacks that require more than one call to
the permutation in either direction.
Ah, so I think you are claiming that LION or BEAR on its own is fine
if they are only used once with each key, and that any sane cell
chaining construction we use will involve re-keying them with each
block, so we can ignore any attack on them that would require multiple
invocations of the encryption/decryption function with the same key.
Do I understand you correctly there?
[...]
...
...
The multiplication operations here appear to be
multiplication by a primitive element, and multiplication by a per-key
element.  The encryption step can be realized with a somewhat
unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB
approach.  I don't know what you'd need to do to substitute in an
orthodox stream cipher for the one used in iHCH.  Sarkar seems to see
iHCH as a successor to HCH, which is a little worrisome given that HCH
is a spiritual descendant of the patented XCB, but to me the two
constructions (HCH, iHCH) look practically nothing alike except for
their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of
these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits
of chaining output from each step) sounds like the safe-and-sane
approach for now; changing the protocol again if the U.S. patent
system expires won't be nearly as hard as changing the protocol the
first time.
(BLAKE-512 might be a faster hash function on some processors, but
requires 64-bit operations.  BLAKE-256 would be faster, but then (a)
you don't get a chaining output from any function computed directly
over R, and (b) you still need to produce 512 bits of
(computationally-indistinguishable-from-)independent key material from
some hash function in the chaining step.  CubeHash512 will never be
broken as an entropy extractor in less time than the Curve25519-based
circuit-extension handshake.)
Interesting; I had stopped considering CubeHash when got dropped from
the SHA-3 competition.  I'll have to find out who's been working on
analyzing it since then.
Can you be more explicit about where the chaining output is taken from
when, and how exactly it gets used, in this design?
Also, why LION and not BEAR?
...
(Does the XCB patent cover LION-like constructions with the ‘hash
function’ in the middle replaced with a polynomial-evaluation
function?  I'm not convinced that a suitable approximately-256-bit
polynomial-evaluation function will be much faster than ChaCha,
especially in 32-bit code.)
I haven't read the XCB patent, and won't, pending legal advice from
Wendy telling me that it's okay to read it.
[...]
...
The obvious way to implement GF(2^128) multiplication of a large
number of secret inputs y by one learned-at-runtime secret c is:

Compute a table of c*X^i for i = 0, ..., 127.  (This table takes

128*128 bits, or 2048 bytes.  Multiplication by X is easy and
tolerably fast.)

For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in

GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127).  (You've done
this step for Pynchon Gate already; hopefully that runs in constant
time.)
Hm. Did you figure out cycles-per-byte on that one?
Back-of-the-envelope:
Assuming 64-bit words, we're looking at a 128 instances of "get a bit
from y," 128*2 instances of "Load the corresponding word from the
table, then constant-time-conditionally-xor that word to the
accumulator."
The fast portable implementation of the conditional-xor I know, for a
single bit in 'b', and a value in 'x' to be conditionally xored into
'a' is:  "a ^=  x & ~(b-1)".
I *think* that's roughly 128 (and, sub, not) to get a mask for each
bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256
xors.  (Some of these operations are unnecessary; I am committing
fencepost errors here. I'm also assuming all loads are one-cycle.)
So that's about 1280 operations for 16 bytes of input to be multiplied
by c, so that seems like something like 80 instructions per byte for a
naive implementation. There are still probably more optimizations to
be done, especially if we have wacky SIMD features to play with, or we
can combine some of the operations above into single ones.  Dynamic
multiple issue and such CPU architectural tricks might get it down
even more.  Nevertheless, it still looks like it'd be expensive enough
getting GF(2^128) right to make GF(2^128) unattractive.
Still, maybe somebody should hack this up for the public good, whether
we turn out to need GF(2^128) or not.
...
...
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually
expose its "multiply in GF(2^128)" functions as far as I can tell: we
would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES,
SHA256, and GF(2^128) multiplication [ARMv8].  If it looks like most
of the ARMs we care about are going to get it, that could factor into
our planning.
I won't believe claims that AES hardware will be widely available
until it actually is present in every new processor from a major
manufacturer.
Agreed.
-- 
Nick

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [tor-dev] Even more notes on relay-crypto constructions