Hello,
Peter (in CC) and I have recently composed a draft proposal for a new Tor handshake. It's a hybrid handshake combining Tor's current X25519-based NTor handshake with the NewHope lattice-based key exchange, in order to protect the secrecy of Tor connections today from an attacker with a quantum computer in the future.
I have not given the proposal a number. It is available in my `drafts/newhope` branch of my torspec repository:
https://gitweb.torproject.org/user/isis/torspec.git/tree/proposals/XXX-newho...
For purposes of facilitating discussion, it is also included here inline. Review and comments are very appreciated.
______________________________________________________________________________
Filename: XXX-newhope-hybrid-handshake.txt
Title: Post-Quantum Secure Hybrid Handshake Based on NewHope
Author: Isis Lovecruft, Peter Schwabe
Created: 16 Apr 2016
Updated: 4 May 2016
Status: Draft
Depends: prop#220 prop#249 prop#264
§0. Introduction
NewHope is a post-quantum-secure lattice-based key-exchange protocol based on the ring-learning-with-errors (Ring-LWE) problem. We propose a hybrid handshake for Tor, based on a combination of Tor's current NTor handshake and a shared key derived through a NewHope ephemeral key exchange.
For further details on the NewHope key exchange, the reader is referred to "Post-quantum key exchange - a new hope" by Alkim, Ducas, Pöppelmann, and Schwabe [0][1].
For brevity, we treat NTor as the only current handshake protocol in Tor; the older TAP protocol is ignored completely, since it is deprecated and almost entirely unused.
§1. Motivation
An attacker currently monitoring and storing circuit-layer NTor handshakes who later has the ability to run Shor's algorithm on a quantum computer will be able to break Tor's current handshake protocol and decrypt previous communications.
It is unclear if and when attackers equipped with large quantum computers will exist, but researchers in quantum physics and quantum engineering estimate a time span of only one to two decades. Clearly, the security requirements of many Tor users include secrecy of their messages beyond this time span, which means that Tor needs to update its key exchange to protect against such attackers as soon as possible.
§2. Design
An initiator and responder, in parallel, conduct two handshakes:
  - An X25519 key exchange, as described in the description of the NTor
    handshake in Tor proposal #216.
  - A NewHope key exchange.
The shared keys derived from these two handshakes are then concatenated and used as input to the SHAKE-256 extendable output function (XOF), as described in FIPS-PUB-202 [2], in order to produce a shared key of the desired length. The test vectors in §C assume that this key has a length of 32 bytes, but the use of an XOF allows arbitrary lengths, to easily support future updates of the symmetric primitives using the key. See also §3.3.1.
§3. Specification
§3.1. Notation
Let `a || b` be the concatenation of a with b.
Let `a^b` denote the exponentiation of a to the bth power.
Let `a == b` denote the equality of a and b.
Let `a := b` be the assignment of the value of b to the variable a.
Let `H(x)` be 32 bytes of output of the SHAKE-256 XOF (as described in FIPS-PUB-202) applied to message x.
Let X25519 refer to the curve25519-based key agreement protocol described in RFC7748 §6.1. [3]
Let `EXP(a, b) == X25519(., b, a)` with `g == 9`. Let X25519_KEYGEN() do the appropriate manipulations when generating the secret key (clearing the low bits, twiddling the high bits).
[XXX match RFC7748 notation more. --isis]
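As a non-normative illustration, the secret-key manipulations above are the standard RFC 7748 scalar "clamping"; a rough Python sketch (the helper name is ours, not part of this specification):

    import os

    def x25519_clamp(sk: bytearray) -> bytearray:
        # RFC 7748 §5: clear the three low bits, clear bit 255, set bit 254.
        sk[0]  &= 248
        sk[31] &= 127
        sk[31] |= 64
        return sk

    x25519_sk = x25519_clamp(bytearray(os.urandom(32)))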
Let `X25519_KEYID(B) == B` where B is a valid X25519 public key.
When representing an element of the Curve25519 subgroup as a byte string, use the standard (32-byte, little-endian, x-coordinate-only) representation for Curve25519 points.
Let `ID` be a router's identity key taken from the router microdescriptor. For relays possessing Ed25519 identity keys (c.f. Tor proposal #220), this is a 32-byte string representing the public Ed25519 identity key. For backwards and forwards compatibility with routers which do not possess Ed25519 identity keys, this is the 32-byte output of H(ID).
We refer to the router as the handshake "responder", and the client (which may be an OR or an OP) as the "initiator".
  ID_LENGTH  [32 bytes]
  H_LENGTH   [32 bytes]
  G_LENGTH   [32 bytes]
PROTOID := "pqtor-x25519-newhope-shake256-1" T_MAC := PROTOID || ":mac" T_KEY := PROTOID || ":key_extract" T_VERIFY := PROTOID || ":verify"
(X25519_SK, X25519_PK) := X25519_KEYGEN()
§3.2. Protocol
  ========================================================================================

   Fig. 1: The NewHope-X25519 Hybrid Handshake.

   Before the handshake, the Initiator is assumed to know Z, a public X25519 key
   for the Responder, as well as the Responder's ID.

  ----------------------------------------------------------------------------------------

   Initiator                                 Responder

   SEED := H(randombytes(32))
   x, X := X25519_KEYGEN()
   a, A := NEWHOPE_KEYGEN(SEED)
   CLIENT_HDATA := ID || Z || X || A

                    --- CLIENT_HDATA --->

                                             y, Y := X25519_KEYGEN()
                                             NTOR_KEY, AUTH := NTOR_SHAREDB(X, y, Y, z, Z, ID, B)
                                             M, NEWHOPE_KEY := NEWHOPE_SHAREDB(A)
                                             SERVER_HDATA := Y || AUTH || M
                                             sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)

                    <-- SERVER_HDATA ----

   NTOR_KEY := NTOR_SHAREDA(x, X, Y, Z, ID, AUTH)
   NEWHOPE_KEY := NEWHOPE_SHAREDA(M, a)
   sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)

  ========================================================================================
§3.2.1. The NTor Handshake
§3.2.1.1. Prologue
Take a router with identity ID. As setup, the router generates a secret key z, and a public onion key Z with:
z, Z := X25519_KEYGEN()
The router publishes Z in its server descriptor in the "ntor-onion-key" entry. Henceforward, we refer to this router as the "responder".
§3.2.1.2. Initiator
To send a create cell, the initiator generates a keypair:
x, X := X25519_KEYGEN()
and creates the NTor portion of a CREATE2V cell's HDATA section:
CLIENT_NTOR := ID || Z || X [96 bytes]
The initiator includes the responder's ID and Z in the CLIENT_NTOR so that, in the event the responder OR has recently rotated keys, the responder can determine which keypair to use.
The initiator then concatenates CLIENT_NTOR with CLIENT_NEWHOPE (see §3.2.2), to create CLIENT_HDATA, and creates and sends a CREATE2V cell (see §A.1) to the responder.
  CLIENT_NEWHOPE  [1824 bytes]  (see §3.2.2)

  CLIENT_HDATA := CLIENT_NTOR || CLIENT_NEWHOPE  [1920 bytes]
If the responder does not respond with a CREATED2V cell, the initiator SHOULD NOT attempt to extend the circuit through the responder by sending fragmented EXTEND2 cells, since the responder's lack of support for CREATE2V cells is assumed to imply that the responder also lacks support for fragmented EXTEND2 cells. Alternatively, an initiator with a sufficiently recent consensus method MUST check that the "proto" line in the responder's descriptor (c.f. Tor proposal #264) advertises support for the "Relay" subprotocol version 3 (see §5).
§3.2.1.3. Responder
The responder generates a keypair y, Y := X25519_KEYGEN(), and computes NTOR_SHAREDB() as follows:
  (NTOR_KEY, AUTH) ← NTOR_SHAREDB(X, y, Y, z, Z, ID, B):
      secret_input := EXP(X, y) || EXP(X, z) || ID || B || Z || Y || PROTOID
      NTOR_KEY     := H(secret_input, T_KEY)
      verify       := H(secret_input, T_VERIFY)
      auth_input   := verify || ID || Z || Y || X || PROTOID || "Server"
      AUTH         := H(auth_input, T_MAC)
The responder sends a CREATED2V cell containing:
  SERVER_NTOR := Y || AUTH                       [64 bytes]
  SERVER_NEWHOPE                                 [2048 bytes]  (see §3.2.2)

  SERVER_HDATA := SERVER_NTOR || SERVER_NEWHOPE  [2112 bytes]
and sends this to the initiator.
§3.2.1.4. Finalisation
The initiator then checks Y is in G^* [see NOTE below], and does NTOR_SHAREDA() as follows:
  (NTOR_KEY) ← NTOR_SHAREDA(x, X, Y, Z, ID, AUTH):
      secret_input := EXP(Y, x) || EXP(Z, x) || ID || Z || X || Y || PROTOID
      NTOR_KEY     := H(secret_input, T_KEY)
      verify       := H(secret_input, T_VERIFY)
      auth_input   := verify || ID || Z || Y || X || PROTOID || "Server"
      if AUTH == H(auth_input, T_MAC):
          return NTOR_KEY
Both parties check that none of the EXP() operations produced the point at infinity. [NOTE: This is an adequate replacement for checking Y for group membership, if the group is Curve25519.]
[XXX: This doesn't sound exactly right. You need the scalar tweaking of X25519 for this to work and also, the point at infinity is obviously an element of the group --isis, peter]
Both parties now have a shared value for NTOR_KEY. They expand this into the keys needed for the Tor relay protocol.
[XXX We think we want to omit the final hashing in the production of NTOR_KEY here, and instead put all the inputs through SHAKE-256. --isis, peter]
[XXX We probably want to remove ID and B from the input to the shared key material, since they serve for authentication but, as pre-established "prologue" material to the handshake, they should not be used in attempts to strengthen the cryptographic suitability of the shared key. Also, their inclusion is implicit in the DH exponentiations. I should probably ask Ian about the reasoning for the original design choice. --isis]
§3.2.2. The NewHope Handshake
§3.2.2.1. Parameters & Mathematical Structures
Let ℤ be the ring of rational integers. Let ℤq, for q ≥ 1, denote the quotient ring ℤ/qℤ. We define R = ℤ[X]/((X^n)+1) as the ring of integer polynomials modulo ((X^n)+1), and Rq = ℤq[X]/((X^n)+1) as the ring of integer polynomials modulo ((X^n)+1) where each coefficient is reduced modulo q. When we refer to a polynomial, we mean an element of Rq.
  n := 1024
  q := 12289
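As a non-normative illustration of arithmetic in Rq, the following naive Python sketch multiplies two polynomials by reducing X^n to -1. It is far too slow for the real parameters and exists only to make the quotient ring concrete; real implementations multiply via the NTT (see §B):

    def mul_rq(f, g, n=1024, q=12289):
        # h = f*g in Z_q[X]/(X^n + 1); X^n wraps around with a sign flip.
        h = [0] * n
        for i in range(n):
            for j in range(n):
                k = (i + j) % n
                sign = -1 if i + j >= n else 1   # X^n == -1
                h[k] = (h[k] + sign * f[i] * g[j]) % q
        return h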
  SEED          [32 bytes]
  NEWHOPE_POLY  [1792 bytes]
  NEWHOPE_REC   [256 bytes]
  NEWHOPE_KEY   [32 bytes]
  NEWHOPE_MSGA := (NEWHOPE_POLY || SEED)
  NEWHOPE_MSGB := (NEWHOPE_POLY || NEWHOPE_REC)
§3.2.2.2. High-level Description of NewHope API Functions
For a description of internal functions, see §B.
  (NEWHOPE_POLY, NEWHOPE_MSGA) ← NEWHOPE_KEYGEN(SEED):
      â  := gen_a(seed)
      s  := poly_getnoise()
      e  := poly_getnoise()
      ŝ  := poly_ntt(s)
      ê  := poly_ntt(e)
      b̂  := poly_pointwise(â, ŝ) + ê
      sp := poly_tobytes(ŝ)
      bp := poly_tobytes(b̂)
      return (sp, (bp || seed))
  (NEWHOPE_MSGB, NEWHOPE_KEY) ← NEWHOPE_SHAREDB(NEWHOPE_MSGA):
      s' := poly_getnoise()
      e' := poly_getnoise()
      e" := poly_getnoise()
      b̂  := poly_frombytes(bp)
      â  := gen_a(seed)
      ŝ' := poly_ntt(s')
      ê' := poly_ntt(e')
      û  := poly_pointwise(â, ŝ') + ê'
      v  := poly_invntt(poly_pointwise(b̂, ŝ')) + e"
      r  := helprec(v)
      up := poly_tobytes(û)
      k  := rec(v, r)
      return ((up || r), k)
  NEWHOPE_KEY ← NEWHOPE_SHAREDA(NEWHOPE_MSGB, NEWHOPE_POLY):
      û  := poly_frombytes(up)
      ŝ  := poly_frombytes(sp)
      v' := poly_invntt(poly_pointwise(û, ŝ))
      k  := rec(v', r)
      return k
When a client uses a SEED within a CREATE2V cell, the client SHOULD NOT use that SEED in any other CREATE2V or EXTEND2 cells. See §4 for further discussion.
§3.3. Key Expansion
The client and server derive a shared key, SHARED, by:
HKDFID := "THESE ARENT THE DROIDS YOURE LOOKING FOR" SHARED := SHAKE_256(HKDFID || NTorKey || NewHopeKey)
§3.3.1. Note on the Design Choice
The reader may wonder why one would use SHAKE-256 to produce a 256-bit output, since the security strength in bits of SHAKE-256 is min(d/2, 256) for collision resistance and min(d, 256) for first- and second-preimage resistance, where d is the output length in bits.
The reasoning is that we should be aiming for 256-bit security for all of our symmetric cryptography. One could then argue that we should just use SHA3-256 for the KDF. We choose SHAKE-256 instead in order to provide an easy way to derive longer shared secrets in the future without requiring a new handshake. The construction is odd, but the future is bright. As we are already using SHAKE-256 for the 32-byte output hash, we are also using it for all other 32-byte hashes involved in the protocol. Note that the only difference between SHA3-256 and SHAKE-256 with 32-byte output is one domain-separation byte.
[XXX why would you want 256-bit security for the symmetric side? Are you talking pre- or post-quantum security? --peter]
§4. Security & Anonymity Implications
This handshake protocol is one-way authenticated. That is, the server is authenticated, while the client remains anonymous.
The client MUST NOT cache and reuse SEED. Doing so gives non-trivial adversarial advantages w.r.t. all-for-the-price-of-one attacks during the caching period. More importantly, if the SEED used to generate NEWHOPE_MSGA is reused for handshakes along the same circuit or multiple different circuits, an adversary conducting a Sybil attack somewhere along the path(s) will be able to correlate the identity of the client across circuits or hops.
§5. Compatibility
Because our proposal requires both the client and server to send more than the 505 bytes possible within a CREATE2 cell's HDATA section, it depends upon the implementation of a mechanism for allowing larger CREATE cells (c.f. Tor proposal #249).
We reserve the following handshake type for use in CREATE2V/CREATED2V and EXTEND2V/EXTENDED2V cells:
0x0003 [NEWHOPE + X25519 HYBRID HANDSHAKE]
We introduce a new sub-protocol number, "Relay=3", (c.f. Tor proposal #264 §5.3) to signify support for this handshake, and hence for the CREATE2V and fragmented EXTEND2 cells which it requires.
There are no additional entries or changes required within either router descriptors or microdescriptors to support this handshake method, due to the NewHope keys being ephemeral and derived on-the-fly, and due to the NTor X25519 public keys already being included within the "ntor-onion-key" entry.
Add a "UseNewHopeKEX" configuration option and a corresponding consensus parameter to control whether clients prefer using this NewHope hybrid handshake or some previous handshake protocol. If the configuration option is "auto", clients SHOULD obey the consensus parameter. The default configuration SHOULD be "auto" and the consensus value SHOULD initially be "0".
§6. Implementation
The paper by Alkim, Ducas, Pöppelmann and Schwabe describes two software implementations of NewHope, one C reference implementation and an optimized implementation using AVX2 vector instructions. Those implementations are available at [1].
Additionally, there are implementations in Go by Yawning Angel, available from [4], and in Rust by Isis Lovecruft, available from [5].
The software used to generate the test vectors in §C is based on the C reference implementation and available from:
  https://code.ciph.re/isis/newhope-tor-testvectors
  https://github.com/isislovecruft/newhope-tor-testvectors
§7. Performance & Scalability
The computationally expensive part in the current NTor handshake is the X25519 key-pair generation and the X25519 shared-key computation. The current implementation in Tor is a wrapper to support various highly optimized implementations on different architectures. On Intel Haswell processors, the fastest implementation of X25519, as reported by the eBACS benchmarking project [6], takes 169920 cycles for key-pair generation and 161648 cycles for shared-key computation; these add up to a total of 331568 cycles on each side (initiator and responder).
The C reference implementation of NewHope, also benchmarked on Intel Haswell, takes 358234 cycles for the initiator and 402058 cycles for the responder. The core computation of the proposed combination of NewHope and X25519 will thus mean a slowdown of about a factor of 2.1 for the initiator and a factor of 2.2 for the responder, compared to the current NTor handshake. These numbers assume a fully optimized implementation of the NTor handshake and a C reference implementation of NewHope. With optimized implementations of NewHope, such as the one for Intel Haswell described in [0], the computational slowdown will be considerably smaller than a factor of 2.
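For reference, the slowdown factors above follow directly from the quoted cycle counts; a quick non-normative back-of-the-envelope check in Python:

    x25519 = 169920 + 161648   # X25519 keygen + shared-key: 331568 cycles per side
    initiator_slowdown = (x25519 + 358234) / x25519   # ~2.08
    responder_slowdown = (x25519 + 402058) / x25519   # ~2.21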
§8. References
  [0]: https://cryptojedi.org/papers/newhope-20160328.pdf
  [1]: https://cryptojedi.org/crypto/#newhope
  [2]: http://www.nist.gov/customcf/get_pdf.cfm?pub_id=919061
  [3]: https://tools.ietf.org/html/rfc7748#section-6.1
  [4]: https://github.com/Yawning/newhope
  [5]: https://code.ciph.re/isis/newhopers
  [6]: http://bench.cr.yp.to
§A. Cell Formats
§A.1. CREATE2V Cells
The client portion of the handshake should send CLIENT_HDATA, formatted into a CREATE2V cell as follows:
  CREATE2V {                     [2114 bytes]
      HTYPE   := 0x0003          [2 bytes]
      HLEN    := 0x0780          [2 bytes]
      HDATA   := CLIENT_HDATA    [1920 bytes]
      IGNORED := 0x00            [194 bytes]
  }
[XXX do we really want to pad with IGNORED to make CLIENT_HDATA the same number of bytes as SERVER_HDATA? --isis]
§A.2. CREATED2V Cells
The server responds to the client's CREATE2V cell with SERVER_HDATA, formatted into a CREATED2V cell as follows:
  CREATED2V {                    [2114 bytes]
      HLEN    := 0x0840          [2 bytes]
      HDATA   := SERVER_HDATA    [2112 bytes]
      IGNORED := 0x00            [0 bytes]
  }
§A.3. Fragmented EXTEND2 Cells
When the client wishes to extend a circuit, the client should fragment CLIENT_HDATA across five EXTEND2 cells (the last of which carries only padding):
  EXTEND2 {
      NSPEC := 0x02 {                           [1 byte]
          LINK_ID_SERVER                        [22 bytes] XXX
          LINK_ADDRESS_SERVER                   [8 bytes]  XXX
      }
      HTYPE := 0x0003                           [2 bytes]
      HLEN  := 0x0780                           [2 bytes]
      HDATA := CLIENT_HDATA[0,461]              [462 bytes]
  }
  EXTEND2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := CLIENT_HDATA[462,954]            [492 bytes]
  }
  EXTEND2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := CLIENT_HDATA[955,1447]           [492 bytes]
  }
  EXTEND2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := CLIENT_HDATA[1448,1919] || 0x00[20]   [492 bytes]
  }
  EXTEND2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := 0x00[172]                        [172 bytes]
  }
The client sends these cells to the server from which it wishes to extend the circuit, and that server should reassemble the fragmented EXTEND2 cells into a CREATE2V cell, as described in §A.1.
§A.4. Fragmented EXTENDED2 Cells
  EXTENDED2 {
      NSPEC := 0x02 {                           [1 byte]
          LINK_ID_SERVER                        [22 bytes] XXX
          LINK_ADDRESS_SERVER                   [8 bytes]  XXX
      }
      HTYPE := 0x0003                           [2 bytes]
      HLEN  := 0x0840                           [2 bytes]
      HDATA := SERVER_HDATA[0,461]              [462 bytes]
  }
  EXTENDED2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := SERVER_HDATA[462,954]            [492 bytes]
  }
  EXTENDED2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := SERVER_HDATA[955,1447]           [492 bytes]
  }
  EXTENDED2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := SERVER_HDATA[1448,1939]          [492 bytes]
  }
  EXTENDED2 {
      NSPEC := 0x00                             [1 byte]
      HTYPE := 0xFFFF                           [2 bytes]
      HLEN  := 0x0000                           [2 bytes]
      HDATA := SERVER_HDATA[1940,2112]          [172 bytes]
  }
§B. NewHope Internal Functions
  gen_a(SEED):                  returns a uniformly random poly
  poly_getnoise():              returns a poly sampled from a centered binomial
  poly_ntt(poly):               number-theoretic transform; returns a poly
  poly_invntt(poly):            inverse number-theoretic transform; returns a poly
  poly_pointwise(poly, poly):   pointwise multiplication; returns a poly
  poly_tobytes(poly):           packs a poly to a NEWHOPE_POLY byte array
  poly_frombytes(NEWHOPE_POLY): unpacks a NEWHOPE_POLY byte array to a poly
  helprec(poly):                returns a NEWHOPE_REC byte array
  rec(poly, NEWHOPE_REC):       returns a NEWHOPE_KEY
--- Description of the NewHope internal functions ---
gen_a(SEED seed) receives as input a 32-byte (public) seed. It expands this seed through SHAKE-128 from the FIPS202 standard. The output of SHAKE-128 is considered a sequence of 16-bit little-endian integers. This sequence is used to initialize the coefficients of the returned polynomial from the least significant (coefficient of X^0) to the most significant (coefficient of X^1023) coefficient. For each of the 16-bit integers first eliminate the highest two bits (to make it a 14-bit integer) and then use it as the next coefficient if it is smaller than q=12289. Note that the amount of output required from SHAKE to initialize all 1024 coefficients of the polynomial varies depending on the input seed. Note further that this function does not process any secret data and thus does not need any timing-attack protection.
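As a non-normative illustration, a rough Python sketch of this rejection sampling (names are ours; the reference implementation squeezes SHAKE-128 incrementally, whereas this sketch simply requests a larger prefix of the XOF output if it runs out, which yields the same sample stream):

    import hashlib

    N, Q = 1024, 12289

    def gen_a(seed: bytes) -> list:
        assert len(seed) == 32
        nwords = 2 * N                    # 16-bit words of SHAKE-128 output to try first
        while True:
            buf = hashlib.shake_128(seed).digest(2 * nwords)
            coeffs = []
            for i in range(nwords):
                # 16-bit little-endian integer with the top two bits dropped
                val = int.from_bytes(buf[2*i:2*i+2], "little") & 0x3FFF
                if val < Q:               # rejection sampling
                    coeffs.append(val)
                    if len(coeffs) == N:
                        return coeffs
            nwords *= 2                   # extremely rare: squeeze more output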
poly_getnoise() first generates 4096 bytes of uniformly random data. This can be done by reading these bytes from the system's RNG; efficient implementations will typically only read a 32-byte seed from the system's RNG and expand it through some fast PRNG (for example, ChaCha20 or AES-256 in CTR mode). The output of the PRNG is considered an array of 2048 16-bit integers r[0],...,r[2047]. The coefficients of the output polynomial are computed as HW(r[0])-HW(r[1]), HW(r[2])-HW(r[3]),...,HW(r[2046])-HW(r[2047]), where HW stands for Hamming weight. Note that the choice of RNG is a local decision; different implementations are free to use different RNGs. Note further that the output of this function is secret; the PRNG (and the computation of HW) need to be protected against timing attacks.
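Again as a non-normative sketch (Python; not constant-time, so it pins down only the math, not a safe implementation):

    import os

    def hamming_weight(x: int) -> int:
        return bin(x).count("1")

    def poly_getnoise() -> list:
        buf = os.urandom(4096)   # real code: 32-byte system seed + fast PRNG (e.g. ChaCha20)
        r = [int.from_bytes(buf[2*i:2*i+2], "little") for i in range(2048)]
        # Each coefficient HW(r[2i]) - HW(r[2i+1]) lies in [-16, 16] (centered binomial).
        return [hamming_weight(r[2*i]) - hamming_weight(r[2*i+1]) for i in range(1024)]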
poly_ntt(poly f): For a mathematical description of poly_ntt see the [0]; a pseudocode description of a very naive inplace transformation of an input polynomial f = f[0] + f[1]*X + f[2]*X^2 + ... + f[1023]*X^1023 is the following code (all arithmetic on coefficients performed modulo q):
  psi = 7
  omega = 49
  for i in range(0,n):
      t[i] = f[i] * psi^i
  for i in range(0,n):
      f[i] = 0
      for j in range(0,n):
          f[i] += t[j] * omega^((i*j)%n)
Note that this is not how poly_ntt should be implemented if performance is an issue; in particular, efficient algorithms for the number-theoretic transform take time O(n*log(n)) and not O(n^2). Note further that all arithmetic in poly_ntt has to be protected against timing attacks.
poly_invntt(poly f): For a mathematical description of poly_invntt see the [0]; a pseudocode description of a very naive inplace transformation of an input polynomial f = f[0] + f[1]*X + f[2]*X^2 + ... + f[1023]*X^1023 is the following code (all arithmetic on coefficients performed modulo q):
  invpsi = 8778
  invomega = 1254
  invn = 12277
  for i in range(0,n):
      t[i] = f[i]
  for i in range(0,n):
      f[i] = 0
      for j in range(0,n):
          f[i] += t[j] * invomega^((i*j)%n)
      f[i] *= invpsi^i
      f[i] *= invn
Note that this is not how poly_invntt should be implemented if performance is an issue; in particular, efficient algorithms for the inverse number-theoretic transform take time O(n*log(n)) and not O(n^2). Note further that all arithmetic in poly_invntt has to be protected against timing attacks.
poly_pointwise(poly f, poly g) performs pointwise multiplication of the two polynomials. This means that for f = (f0 + f1*X + f2*X^2 + ... + f1023*X^1023) and g = (g0 + g1*X + g2*X^2 + ... + g1023*X^1023) it computes and returns h = (h0 + h1*X + h2*X^2 + ... + h1023*X^1023) with h0 = f0*g0, h1 = f1*g1,..., h1023 = f1023*g1023.
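Equivalently, a one-line non-normative Python sketch:

    def poly_pointwise(f, g, q=12289):
        # Coefficient-wise product h[i] = f[i]*g[i] mod q.
        return [(fi * gi) % q for fi, gi in zip(f, g)]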
poly_tobytes(poly f) first reduces all coefficients of f modulo q, i.e., brings them to the interval [0,q-1]. Denote these reduced coefficients as f[0],...,f[1023]; note that they all fit into 14 bits. The function then packs those coefficients into an array of 1792 bytes r[0],...,r[1791] in "packed little-endian representation", i.e.:

    r[0]    = f[0] & 0xff
    r[1]    = (f[0] >> 8) | ((f[1] & 0x03) << 6)
    r[2]    = (f[1] >> 2) & 0xff
    r[3]    = (f[1] >> 10) | ((f[2] & 0x0f) << 4)
    .
    .
    .
    r[1790] = (f[1022] >> 12) | ((f[1023] & 0x3f) << 2)
    r[1791] = f[1023] >> 6

Note that this function needs to be protected against timing attacks. In particular, avoid non-constant-time conditional subtractions (or other non-constant-time expressions) in the reduction modulo q of the coefficients.
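A non-normative Python sketch of this packing (the `% q` reduction here is not constant-time and is shown only to pin down the byte layout):

    def poly_tobytes(f, q=12289):
        r = bytearray(1792)
        for i in range(0, 1024, 4):           # 4 coefficients -> 7 bytes
            f0, f1, f2, f3 = (c % q for c in f[i:i+4])
            j = 7 * i // 4
            r[j]   = f0 & 0xff
            r[j+1] = ((f0 >> 8) | ((f1 & 0x03) << 6)) & 0xff
            r[j+2] = (f1 >> 2) & 0xff
            r[j+3] = ((f1 >> 10) | ((f2 & 0x0f) << 4)) & 0xff
            r[j+4] = (f2 >> 4) & 0xff
            r[j+5] = ((f2 >> 12) | ((f3 & 0x3f) << 2)) & 0xff
            r[j+6] = (f3 >> 6) & 0xff
        return bytes(r)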
poly_frombytes(NEWHOPE_POLY b) is the inverse of poly_tobytes; it receives as input an array of 1792 bytes and converts it into the internal representation of a poly. Note that poly_frombytes does not need to check whether the coefficients are reduced modulo q, or reduce coefficients modulo q. Note further that the function must not leak any information about its inputs through timing information, as it is also applied to the secret key of the initiator.
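The corresponding non-normative unpacking sketch in Python:

    def poly_frombytes(r):
        f = []
        for j in range(0, 1792, 7):           # 7 bytes -> 4 coefficients
            b = r[j:j+7]
            f.append(b[0] | ((b[1] & 0x3f) << 8))
            f.append((b[1] >> 6) | (b[2] << 2) | ((b[3] & 0x0f) << 10))
            f.append((b[3] >> 4) | (b[4] << 4) | ((b[5] & 0x03) << 12))
            f.append((b[5] >> 2) | (b[6] << 6))
        return f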
helprec(poly f) computes 256 bytes of reconciliation information from the input poly f. Internally, one byte of reconciliation information is computed from four coefficients of f by a function helprec4. Let the input polynomial f = (f0 + f1*X + f2*X^2 + ... + f1023*X^1023); let the output byte array be r[0],...,r[255]. This output byte array is computed as

    r[0]   = helprec4(f0, f256, f512, f768)
    r[1]   = helprec4(f1, f257, f513, f769)
    r[2]   = helprec4(f2, f258, f514, f770)
    .
    .
    .
    r[255] = helprec4(f255, f511, f767, f1023),

where helprec4 does the following:
  helprec4(x0,x1,x2,x3):
      b = randombit()
      r0,r1,r2,r3 = CVPD4(8*x0+4*b, 8*x1+4*b, 8*x2+4*b, 8*x3+4*b)
      r = (r0 & 0x03) | ((r1 & 0x03) << 2) | ((r2 & 0x03) << 4) | ((r3 & 0x03) << 6)
      return r
The function CVPD4 does the following:
  CVPD4(y0,y1,y2,y3):
      v00 = round(y0/2q)
      v01 = round(y1/2q)
      v02 = round(y2/2q)
      v03 = round(y3/2q)
      v10 = round((y0-1)/2q)
      v11 = round((y1-1)/2q)
      v12 = round((y2-1)/2q)
      v13 = round((y3-1)/2q)
      t  = abs(y0 - 2q*v00)
      t += abs(y1 - 2q*v01)
      t += abs(y2 - 2q*v02)
      t += abs(y3 - 2q*v03)
      if (t < 2q):
          v0 = v00
          v1 = v01
          v2 = v02
          v3 = v03
          k = 0
      else:
          v0 = v10
          v1 = v11
          v2 = v12
          v3 = v13
          k = 1
      return (v0-v3, v1-v3, v2-v3, k+2*v3)
In this description, round() returns the closest integer and abs() returns the absolute value. Note that all computations involved in helprec operate on secret data and must be protected against timing attacks.
rec(poly f, NEWHOPE_REC r) computes the pre-hash (see paper) NewHope key from f and r. Specifically, it computes one bit of key from 4 coefficients of f and one byte of r. Let f = f0 + f1*X + f2*X^2 + ... + f1023*X^1023 and let r = r[0],r[1],...,r[255]. Let the bytes of the output be k[0],...,k[31] and let the bits of the output be k0,...,k255, where

    k0 = k[0] & 0x01
    k1 = (k[0] >> 1) & 0x01
    k2 = (k[0] >> 2) & 0x01
    .
    .
    .
    k8 = k[1] & 0x01
    k9 = (k[1] >> 1) & 0x01
    .
    .
    .
    k255 = (k[31] >> 7)

The function rec computes k0,...,k255 as

    k0 = rec4(f0, f256, f512, f768, r[0])
    k1 = rec4(f1, f257, f513, f769, r[1])
    .
    .
    .
    k255 = rec4(f255, f511, f767, f1023, r[255])
The function rec4 does the following:
  rec4(y0,y1,y2,y3,r):
      r0 = r & 0x03
      r1 = (r >> 2) & 0x03
      r2 = (r >> 4) & 0x03
      r3 = (r >> 6) & 0x03
      return Decode(8*y0-2q*r0, 8*y1-2q*r1, 8*y2-2q*r2, 8*y3-2q*r3)
The function Decode does the following:
  Decode(v0,v1,v2,v3):
      t0 = round(v0/8q)
      t1 = round(v1/8q)
      t2 = round(v2/8q)
      t3 = round(v3/8q)
      t  = abs(v0 - 8q*t0)
      t += abs(v1 - 8q*t1)
      t += abs(v2 - 8q*t2)
      t += abs(v3 - 8q*t3)
      if (t > 1):
          return 1
      else:
          return 0
§C. Test Vectors
______________________________________________________________________________
Best Regards,
On Fri, 6 May 2016 19:17:11 +0000 isis isis@torproject.org wrote:
Both parties check that none of the EXP() operations produced the point at infinity. [NOTE: This is an adequate replacement for checking Y for group membership, if the group is Curve25519.]
[XXX: This doesn't sound exactly right. You need the scalar tweaking of X25519 for this to work and also, the point at infinity is obviously an element of the group --isis, peter]
Maybe reword this to specify that EXP() MUST include the check for all zero output as specified in RFC 7748. It's what our current ntor implementation does here.
Regards,
Yawning Angel transcribed 2.2K bytes:
On Fri, 6 May 2016 19:17:11 +0000 isis isis@torproject.org wrote:
Both parties check that none of the EXP() operations produced the point at infinity. [NOTE: This is an adequate replacement for checking Y for group membership, if the group is Curve25519.]
[XXX: This doesn't sound exactly right. You need the scalar tweaking of X25519 for this to work and also, the point at infinity is obviously an element of the group --isis, peter]
Maybe reword this to specify that EXP() MUST include the check for all zero output as specified in RFC 7748. It's what our current ntor implementation does here.
Thanks, good suggestion. I've added it here: https://gitweb.torproject.org/user/isis/torspec.git/commit/?h=draft/newhope&...
And removed the odd description w.r.t. "the Curve25519 group" here: https://gitweb.torproject.org/user/isis/torspec.git/commit/?h=draft/newhope&...
FWIW, the original "Both parties check that none of the EXP() […] group is Curve25519" sentence comes directly from the original NTor specification in proposal #216, so we might consider making this change there: https://gitweb.torproject.org/torspec.git/tree/proposals/216-ntor-handshake....
On Fri, 6 May 2016 19:17:11 +0000 isis isis@torproject.org wrote:
[XXX We think we want to omit the final hashing in the production of NTOR_KEY here, and instead put all the inputs through SHAKE-256. --isis, peter]
[XXX We probably want to remove ID and B from the input to the shared key material, since they serve for authentication but, as pre-established "prologue" material to the handshake, they should not be used in attempts to strengthen the cryptographic suitability of the shared key. Also, their inclusion is implicit in the DH exponentiations. I should probably ask Ian about the reasoning for the original design choice. --isis]
Oh I missed this. B at a minimum needs to be part of `auth_input`, though probably does not need to be part of `secret_input`.
Per RFC 7748:
Designers using these curves should be aware that for each public key, there are several publicly computable public keys that are equivalent to it, i.e., they produce the same shared secrets. Thus using a public key as an identifier and knowledge of a shared secret as proof of ownership (without including the public keys in the key derivation) might lead to subtle vulnerabilities.
Regards,
On Fri, 2016-05-06 at 19:17 +0000, isis wrote:
--- Description of the Newhope internal functions ---
gen_a(SEED seed) receives as input a 32-byte (public) seed. It expands this seed through SHAKE-128 from the FIPS202 standard. The output of SHAKE-128 is considered a sequence of 16-bit little-endian integers. This sequence is used to initialize the coefficients of the returned polynomial from the least significant (coefficient of X^0) to the most significant (coefficient of X^1023) coefficient. For each of the 16-bit integers first eliminate the highest two bits (to make it a 14-bit integer) and then use it as the next coefficient if it is smaller than q=12289. Note that the amount of output required from SHAKE to initialize all 1024 coefficients of the polynomial varies depending on the input seed. Note further that this function does not process any secret data and thus does not need any timing-attack protection.
Aren't the seed and polynomial a actually secret for negotiation with any node after your guard?
An adversary who can do a timing attack on a user's tor process would gain some deanonymizing information from knowing which a elements get skipped. I suppose this adversary has already nailed the user via correlation attack, but maybe worth rewording at least.
And maybe an adversary could employ different attack infrastructure if they can learn some skipped elements of a.
Best, Jeff
Jeff Burdges <burdges@...> writes:
On Fri, 2016-05-06 at 19:17 +0000, isis wrote:
--- Description of the Newhope internal functions ---
gen_a(SEED seed) receives as input a 32-byte (public) seed. It expands this seed through SHAKE-128 from the FIPS202 standard. The output of SHAKE-128 is considered a sequence of 16-bit little-endian integers. This sequence is used to initialize the coefficients of the returned polynomial from the least significant (coefficient of X^0) to the most significant (coefficient of X^1023) coefficient. For each of the 16-bit integers first eliminate the highest two bits (to make it a 14-bit integer) and then use it as the next coefficient if it is smaller than q=12289. Note that the amount of output required from SHAKE to initialize all 1024 coefficients of the polynomial varies depending on the input seed. Note further that this function does not process any secret data and thus does not need any timing-attack protection.
Aren't the seed and polynomial a actually secret for negotiation with any node after your guard?
An adversary who can do a timing attack on a user's tor process would gain some deanonymizing information from knowing which a elements get skipped. I suppose this adversary has already nailed the user via correlation attack, but maybe worth rewording at least.
And maybe an adversary could employ different attack infrastructure if they can learn some skipped elements of a.
I agree with Jeff: usually in Ring-LWE, a isn't a secret value, so timing attacks don't matter. But when the value of a is shared only between Alice and Bob, then it is identifying information which could be used in a deanonymization attack, so it should be viewed as secret. So its generation should be secure against timing attacks - the same amount of SHAKE output should be consumed every time.
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a. So maybe an arithmetic-coding-like scheme could be used? That would get much closer to log_2(12289) bits being used for each coefficient. Or let a be a system-wide parameter changing say on a daily basis? (Cuts down marginally on bandwidth and computation time at the expense of a many-for-the-price-of-one attack, so, as others have observed, probably not a good idea.)
rec(poly f, NEWHOPE_REC r) / Decode(v0,v1,v2,v3): This function operates on secret data - the values of t0,t1,t2,t3 must be kept secret, so must also be timing-attack-resistant. Also the calculation of t is incorrect as stated (copy-and-paste error; needs to be a function of v1,v2,v3,t1,t2,t3 etc.)
If I ever get a chance to play with the code I'd like to have a go at producing some test vectors.
Thanks isis for this, it looks really good, I look forward to seeing a similar protocol for SIDH! (and X25519+NEWHOPE+SIDH !)
-- lukep
On Sat, May 7, 2016 at 12:41 PM, lukep lukep@tutanota.com wrote:
Jeff Burdges <burdges@...> writes:
On Fri, 2016-05-06 at 19:17 +0000, isis wrote:
--- Description of the Newhope internal functions ---
gen_a(SEED seed) receives as input a 32-byte (public) seed. It expands this seed through SHAKE-128 from the FIPS202 standard. The output of SHAKE-128 is considered a sequence of 16-bit little-endian integers. This sequence is used to initialize the coefficients of the returned polynomial from the least significant (coefficient of X^0) to the most significant (coefficient of X^1023) coefficient. For each of the 16-bit integers first eliminate the highest two bits (to make it a 14-bit integer) and then use it as the next coefficient if it is smaller than q=12289. Note that the amount of output required from SHAKE to initialize all 1024 coefficients of the polynomial varies depending on the input seed. Note further that this function does not process any secret data and thus does not need any timing-attack protection.
Aren't the seed and polynomial a actually secret for negotiation with any node after your guard?
An adversary who can do a timing attack on a user's tor process would gain some deanonymizing information from knowing which a elements get skipped. I suppose this adversary has already nailed the user via correlation attack, but maybe worth rewording at least.
And maybe an adversary could employ different attack infrastructure if they can learn some skipped elements of a.
I agree with Jeff: usually in Ring-LWE, a isn't a secret value, so timing attacks don't matter. But when the value of a is shared only between Alice and Bob, then it is identifying information which could be used in a deanonymyzation attack, so it should be viewed as secret. So it's generation should be secure against timing attacks - the same amount of SHAKE output should be consumed every time.
I'm not sure I understand the concern here. An attacker sees that we got unlucky: that doesn't help them with recovering SEED under mild assumptions we need anyway about SHAKE indistinguishability.
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a. So maybe an arithmetic coding like scheme could be used? That would get much closer to log_2(12289) bits being used for each coefficient. Or let a be a system-wide parameter changing say on a daily basis? (Cuts down marginally on bandwidth and computation time at the expense of a many-for-the-price-of-one attack so as others have observed probably not a good idea).
rec(poly f, NEWHOPE_REC r) / Decode(v0,v1,v2,v3): This function operates on secret data - the values of t0,t1,t2,t3 must be kept secret, so must also be timing-attack-resistant. Also the calculation of t is incorrect as stated (copy-and-paste error; needs to be a function of v1,v2,v3,t1,t2,t3 etc.)
If I ever get chance to play with the code I'd like to have a go a producing some test vectors.
Thanks isis for this, it looks really good, I look forward to seeing a similar protocol for SIDH! (and X25519+NEWHOPE+SIDH !)
-- lukep
On Sat, 2016-05-07 at 13:14 -0700, Watson Ladd wrote:
I'm not sure I understand the concern here. An attacker sees that we got unlucky: that doesn't help them with recovering SEED under mild assumptions we need anyway about SHAKE indistinguishability.
We're assuming the adversary controls a node in your circuit and hence sees your seed later. You get unlucky like over 400 times, so, if they can record enough of the failure pattern, then their node can recognize you from your seed.
Jeff
On Sat, 07 May 2016 23:46:28 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Sat, 2016-05-07 at 13:14 -0700, Watson Ladd wrote:
I'm not sure I understand the concern here. An attacker sees that we got unlucky: that doesn't help them with recovering SEED under mild assumptions we need anyway about SHAKE indistinguishability.
We're assuming the adversary controls a node in your circuit and hence sees your seed later. You get unlucky like over 400 times, so, if they can record enough of the failure pattern, then their node can recognize you from your seed.
Hmm? The timing information that's available to a local attacker (how an adversary will be limited to just this information, and not things that enable a strong attack on its own like packet timing, escapes me) would be the total time taken for `a` generation.
So, the evil observer on Alice's side gets:
* The total number of samples (N).
Bob (or Eve) gets:
* The seed, which may correspond to something that required N samples.
I don't think there's much pattern information available to the attacker on Alice's side, but I may be missing something...
Regards,
On Sat, 2016-05-07 at 22:01 +0000, Yawning Angel wrote:
how an adversary will be limited to just this information, and not things that enable a strong attack on it's own like packet timing escapes me
Yes, it's clear that an adversary who can get CPU timing can get packet timing.
It's not clear if some adversary might prefer information about the seed to simplify their larger infrastructure, like say by not needing to worry about clock skew on their exit nodes, or even choosing to compromise exit nodes soon after the fact.
Hmm? The timing information that's available to a local attacker would be the total time taken for `a` generation.
Really? I know nothing about the limits of timing attacks. I just naively imagined they learn from the timing of CPU work vs memory writes or something.
Jeff
On Sun, 08 May 2016 02:00:51 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Sat, 2016-05-07 at 22:01 +0000, Yawning Angel wrote:
how an adversary will be limited to just this information, and not things that enable a strong attack on it's own like packet timing escapes me
Yes, it's clear that an adversary who can get CPU timing can get packet timing.
It's not clear if some adversary might prefer information about the seed to simplify their larger infrastructure, like say by not needing to worry about clock skew on their exit nodes, or even choosing to compromise exit nodes soon after the fact.
The ultimate simplification would be "snoop the loopback interface and see all the cleartext instead of messing with this timing attack nonsense". :P
Hmm? The timing information that's available to a local attacker would be the total time taken for `a` generation.
Really? I know nothing about the limits of timing attacks. I just naively imagined they learn from the timing of CPU work vs memory writes or something.
What's there to derive key generation timing information from that isn't "observe traffic to the SocksPort, along with traffic upstream and compare times". There might be better ways, that somehow don't involve totally pwning the box, but it's not immediately obvious to me.
If we're going to talk about improving `a` generation overall, rejecting samples only if they are > 5 * q (and using `% q` per sample) is probably a net win since it lowers the rejection rate by a factor of ~4x...
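Concretely, something like this (rough Python sketch, untested):

    Q = 12289

    def sample_5q(word16):
        # 5*Q = 61445 < 2^16, so a full 16-bit sample is rejected only ~6.2%
        # of the time (vs ~25% for 14-bit masking), then reduced mod Q.
        return word16 % Q if word16 < 5 * Q else None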
Regards,
On Sat, 2016-05-07 at 19:41 +0000, lukep wrote:
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a.
Isn't some small multiple usually enough? I think 1024 is large enough to tend towards the expected 42%ish failures.
Also, can't one simply start the sampling over from the beginning if one runs out?
I've no idea whether an arithmetic coding scheme would be more efficient.
Or let a be a system-wide parameter changing say on a daily basis?
I mentioned using the Tor collaborative random number generator for a in my other message, but only as a feint to get to the meat of my argument that Isis and Peter's proposal sounds optimal. I think rotating a network-wide a would get messy and dangerous in practice.
If bandwidth is an issue, then a could be derived from the ECDH handshake, thereby making it zero cost.
Jeff
On Sat, 7 May 2016 19:41:59 +0000 (UTC) lukep lukep@tutanota.com wrote:
Thanks isis for this, it looks really good, I look forward to seeing a similar protocol for SIDH! (and X25519+NEWHOPE+SIDH !)
When there is a sufficiently fast SIDH implementation, it might be worth considering (MS Research's is less slow than prior attempts at this, but misses the mark).
On Sat, 07 May 2016 22:51:27 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Sat, 2016-05-07 at 19:41 +0000, lukep wrote:
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a.
I think that being in the position to gather the timing information required on the client side means the adversary has won already, so I'm not seeing the issue here apart from an abstract theoretical sense.
Isn't some small multiple usually enough? I think 1024 is large enough to tend towards the expected 42%ish failures.
Also, can't one simply start the sampling over from the beginning if one runs out?
For clarity regarding what the code does now:
The reference code generates excess output from the initial SHAKE call and samples from that, incrementally squeezing out additional blocks one at a time as required (enough for 1344 elements initially, with each additional squeeze providing 21 elements).
(The rejection sampling algorithm is somewhat wasteful since it discards 2 bits of input per sample as well, but what's done now is easier to implement.)
I've no idea if an maybe an arithmetic coding scheme would be more efficient.
Or let a be a system-wide parameter changing say on a daily basis?
I mentioned using the Tor collaborative random number generator for a in my other message, but only as feint to get to the meat of my argument that Isis and Peter's proposal sounds optimal. I think rotating a network wide a would get messy and dangerous in practice.
If bandwidth is an issue, then a could be derived from the ECDH handshake, thereby making it zero cost.
Err. I'm not sure how that will work without rejection sampling that exposes timing information, maybe I'm missing something.
I had a brief discussion with Dr. Schwabe regarding using deterministic `a` generation for unrelated reasons, with something time based being tossed around, but requiring a global clock isn't that great, and leaks clock skew information (Though I would use something like H(tweak | unixTime / 3600), which is rather coarse...), and as a peace of mind thing, I do prefer randomizing `a` on a per-connection basis.
But anyway like I said, I don't think this is an issue.
Regards,
Yawning Angel transcribed 4.3K bytes:
On Sat, 7 May 2016 19:41:59 +0000 (UTC) lukep lukep@tutanota.com wrote:
Thanks isis for this, it looks really good, I look forward to seeing a similar protocol for SIDH! (and X25519+NEWHOPE+SIDH !)
When there is a sufficiently fast SIDH implementation, it might be worth considering (MS Research's is less slow than prior attempts at this, but misses the mark).
When there is also sufficient cryptanalysis of curve-isogeny-based cryptosystems…
Also, I'm not understanding what SIDH would provide in an X25519+NewHope+SIDH construction which is not already provided in the X25519+NewHope construction, unless someday there are SIDH-based signatures.
On Sat, 07 May 2016 22:51:27 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Sat, 2016-05-07 at 19:41 +0000, lukep wrote:
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a.
I think that being in the position to gather the timing information required on the client side means the adversary has won already, so I'm not seeing the issue here apart from an abstract theoretical sense.
Your point that an adversary with native-code execution on the client's machine is capable of doing worse makes sense. However, given that there are browser-based cache timing attacks (e.g. RowhammerJS) which give cache eviction rates only negligibly different from those of native-code exploits, I would argue that Jeff's original concern, that the timing attack presents a distinguisher, is actually valid.
I've no idea if an maybe an arithmetic coding scheme would be more efficient.
Or let a be a system-wide parameter changing say on a daily basis?
I mentioned using the Tor collaborative random number generator for a in my other message, but only as feint to get to the meat of my argument that Isis and Peter's proposal sounds optimal. I think rotating a network wide a would get messy and dangerous in practice.
If bandwidth is an issue, then a could be derived from the ECDH handshake, thereby making it zero cost.
Err. I'm not sure how that will work without rejection sampling that exposes timing information, maybe I'm missing something.
Nope, it would still not work to fix the timing attack. Although, luckily, we already wrote some constant time code for my sorting-network idea, and then, with some coffee, Peter made it faster. (Give us something stronger to drink, and we'll probably come up with a way to get it even faster.)
Best,
isis isis@torproject.org wrote:
Hi all,
Nope, it would still not work to fix the timing attack. Although, luckily, we already wrote some constant time code for my sorting-network idea, and then, with some coffee, Peter made it faster. (Give us something stronger to drink, and we'll probably come up with a way to get it even faster.)
Still on coffee and with a size-84 Batcher sort and Yawning's 5q trick I now have an AVX2 implementation of NewHope that is faster than the original and does sampling of the polynomial a in constant time. Now I'm up for some stronger drinks...
Cheers,
Peter
Peter Schwabe <peter@...> writes:
isis <isis@...> wrote:
Hi all,
Nope, it would still not work to fix the timing attack. Although, luckily, we already wrote some constant time code for my sorting-network idea, and then, with some coffee, Peter made it faster. (Give us something stronger to drink, and we'll probably come up with a way to get it even faster.)
Still on coffee and with a size-84 Batcher sort and Yawning's 5q trick I now have an AVX2 implementation of NewHope that is faster than the original and does sampling of the polynomial a in constant time. Now I'm up for some stronger drinks...
You may want to get more drinks in - there's a new eprint on the IACR archive claiming faster generation of 'a' using the 5q / 16-bit trick: https://eprint.iacr.org/2016/467.pdf
The other change they make is to replace SHAKE by SHA-256 and AES-256 on the grounds that finding a backdoored 'a' would be implausible, although they no longer have the security reduction. Also they're not considering timing attacks.
They get a speed-up of about 1.5x on AVX2. Personally I don't really care about not having the security reduction for 'a' (there's still the proof in the newhope paper for the rest of the key exchange).
The bandwidth can be halved for n=512 vice n=1024 - this is the JarJar of NewHope - maybe dubious by itself but could be strong enough in hybrid with X25519 given that bandwidth is at a premium in Tor?
-- lukep
lukep lukep@tutanota.com wrote:
Hi lukep, hi all,
You may want to get more drinks in - there's a new eprint on IACR archive that's claiming a faster generation of 'a' using the 5q / 16 bit trick: https://eprint.iacr.org/2016/467.pdf
I know, I have been in contact with the authors and I think that we will want to include some of their ideas in the final version of the NewHope paper (which has been accepted at USENIX! :-)). I also told them that they do not have to do the 4 conditional subtractions of q; apparently that comment hasn't made it into the eprint version.
The other change they make is to replace SHAKE by SHA-256 and AES-256 on the grounds that finding a backdoored 'a' would be implausible, although they no longer have the security reduction. Also they're not considering timing attacks.
They get a speed-up of about 1.5x on AVX2. Personally I don't really care about not having the security reduction for 'a' (there's still the proof in the newhope paper for the rest of the key exchange).
An early version of our software was using something very similar and similarly fast (ChaCha20) until we agreed that the right thing to use is a XOF and pay for this with some performance penalty. As far as I know there is only one properly specified and analyzed XOF out there at the moment and that's SHAKE. We also considered a tree mode of SHAKE (which is faster), but decided against it to prioritize simplicity over some performance gain.
I would be very happy to see more research into design and analysis of faster XOFs, and then I would totally support using them in NewHope, but my current impression is that there's more work to be done.
The bandwidth can be halved for n=512 vice n=1024 - this is the JarJar of NewHope - maybe dubious by itself but could be strong enough in hybrid with X25519 given that bandwidth is at a premium in Tor?
There is a reason that it's called JarJar.
Best regards,
Peter
Jeff Burdges transcribed 2.6K bytes:
On Sat, 2016-05-07 at 19:41 +0000, lukep wrote:
It's hard to guarantee that any fixed, finite amount of SHAKE output will be sufficient for any rejection sampling method like gen_a.
Isn't some small multiple usually enough? I think 1024 is large enough to tend towards the expected 42%ish failures.
Also, can't one simply start the sampling over from the beginning if one runs out?
Yes, you can safely start the sampling over from the beginning without giving anything away, other than "the seed was bad".
Or let a be a system-wide parameter changing say on a daily basis?
I mentioned using the Tor collaborative random number generator for a in my other message, but only as feint to get to the meat of my argument that Isis and Peter's proposal sounds optimal. I think rotating a network wide a would get messy and dangerous in practice.
Peter and I also discussed generating `a` from the Tor shared randomness, but ultimately I feel squeamish about the potential anonymity-set segregations.
If bandwidth is an issue, then a could be derived from the ECDH handshake, thereby making it zero cost.
That would add an extra RT to the handshake, since the handshakes could no longer happen in parallel (in my construction, they're actually literally side-by-side, in the same CREATE2V cell). Separating the handshake would also mean we'd need some new cell types to handle the fact that the handshake would take 2 RTs, since Tor's design now assumes ---CREATE*--> then <---CREATED*---.
Also, deriving `a` "somehow" from the shared X25519 secret is a bit scary (c.f. the §3 "Backdoors" part of the NewHope paper, or Yawning's PoC of a backdoored NewHope handshake [0]).
[0]: https://git.schwanenlied.me/yawning/newhope/src/nobus/newhope_nobus.go
Best,
On Sun, 2016-05-08 at 13:15 +0000, isis wrote:
Also, deriving `a` "somehow" from the shared X25519 secret is a bit scary (c.f. the §3 "Backdoors" part of the NewHope paper,
Oh wow. That one is nasty.
or Yawning's PoC of a backdoored NewHope handshake [0]).
I see. The point is that being ambiguous about the security requirements of the seed for a lets you sneak in a bad usage of it elsewhere.
In some cases, I suppose both sides contributing to a might help them know the other side is not backdoored, but that's not so relevant for Tor.
Jeff
lukep transcribed 3.1K bytes:
I look forward to seeing a similar protocol for SIDH! (and X25519+NEWHOPE+SIDH !)
What benefit would SIDH be providing in an X25519+NewHope+SIDH construction which is not already part of the X25519+NewHope construction? (Other than putting us pretty solidly aboard the Slow Crypto Train…)
Best,
Jeff Burdges transcribed 3.1K bytes:
On Fri, 2016-05-06 at 19:17 +0000, isis wrote:
--- Description of the Newhope internal functions ---
gen_a(SEED seed) receives as input a 32-byte (public) seed. It expands this seed through SHAKE-128 from the FIPS202 standard. The output of SHAKE-128 is considered a sequence of 16-bit little-endian integers. This sequence is used to initialize the coefficients of the returned polynomial from the least significant (coefficient of X^0) to the most significant (coefficient of X^1023) coefficient. For each of the 16-bit integers first eliminate the highest two bits (to make it a 14-bit integer) and then use it as the next coefficient if it is smaller than q=12289. Note that the amount of output required from SHAKE to initialize all 1024 coefficients of the polynomial varies depending on the input seed. Note further that this function does not process any secret data and thus does not need any timing-attack protection.
Aren't the seed and polynomial a actually secret for negotiation with any node after your guard?
An adversary who can do a timing attack on a user's tor process would gain some deanonymizing information from knowing which a elements get skipped. I suppose this adversary has already nailed the user via correlation attack, but maybe worth rewording at least.
And maybe an adversary could employ different attack infrastructure if they can learn some skipped elements of a.
Hey Jeff,
At first, I wasn't convinced that guarding against an adversary who is both capable of running a spy process on a client machine, as well as being the middle/exit relay in the client's circuit, was within Tor's threat model. However, if it isn't within our threat model, it should be, because e.g. who knows what crap JS a browser is executing. Let's just assume this is within the model.
I haven't given it much thought yet, but the performance cost to making polynomial initialisation constant time may not actually be so bad. My naïve, I'm-still-finishing-my-breakfast solution would be to oversample (say 2048 uint16_ts) from SHAKE-128, shove them into a stable sorting network with a comparator which says "not equal" for x[i]>q and "equal" otherwise.
isis isis@torproject.org wrote:
Hi all,
I haven't given it much thought yet, but the performance cost to making polynomial initialisation constant time may not actually be so bad. My naïve, I'm-still-finishing-my-breakfast solution would be to oversample (say 2048 uint16_ts) from SHAKE-128, shove them into a stable sorting network with a comparator which says "not equal" for x[i]>q and "equal" otherwise.
My (hopefully) slightly improved variant of this one is to take three "squeezes" from SHAKE-128 to obtain 252 16-bit integers, pad to 256, use Batcher's odd-even merge sort to move all the entries >= 5q (using Yawning's idea to lower the rejection probability) to the end of the array with the comparison that Isis described, and then take the low 205, which are all smaller than 5q with overwhelmingly large (still need to do the math here) probability. Do this 5 times and you have enough coefficients for the polynomial. Should you run into one of the rare cases where you don't have 205 good entries, simply pick a new seed.
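To make the constant-time part concrete, here is a rough, unoptimized Python sketch; it uses an odd-even transposition network rather than the size-84 Batcher sort from the actual code (quadratically many comparators, but obviously data-independent), and the arithmetic select stands in for the vectorized min/max:

    Q = 12289

    def ct_compare_exchange(a, i, j):
        x, y = a[i], a[j]
        swap = int(x > y)                 # real code derives this mask branchlessly
        a[i] = swap * y + (1 - swap) * x  # min
        a[j] = swap * x + (1 - swap) * y  # max

    def move_rejects_to_end(samples):
        # Key = (reject flag, original index): samples >= 5q sink to the end,
        # accepted samples keep their order, and the comparator schedule is fixed.
        n = len(samples)
        keys = [(int(s >= 5 * Q) << 16) | i for i, s in enumerate(samples)]
        for rnd in range(n):
            for i in range(rnd % 2, n - 1, 2):
                ct_compare_exchange(keys, i, i + 1)
        return [samples[k & 0xFFFF] for k in keys]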
The nice thing is that you can easily vectorize this; a 16x vectorized version (producing more than enough output for 3 handshakes) takes <53000 cycles on Haswell; the code is checked into the discard subdirectory of the newhope-tor-testvectors repository. This does not include reduction mod q, but that's also not required inside the computation of NewHope.
Cheers,
Peter
Just a brief aside about post-quantum handshake approaches that seemingly do not work.
I suppose Tor folk remember the article Using Sphinx to Improve Onion Routing Circuit Construction by Aniket Kate and Ian Goldberg. As key sizes are a main obstacle to a post-quantum key exchange, one might explore using a Sphinx-like key mutation trick to save bandwidth.
I doubt SIDH could support anything like the key mutation trick in Sphinx because the public key is much too far removed from the private key.
There are key mutation tricks for Ring-LWE key exchanges though. As an example, the article Lattice Based Mix Network for Location Privacy in Mobile System by Kunwar Singh, Pandu Rangan, and A. K. Banerjee describes a primitive similar to universal reencryption.
It's likely that key mutation requires fixing the polynomial a in advance. If a must change, then maybe it could be seeded by Tor's collaborative random number generator, so that's actually okay.
Now, a Sphinx-like route building packet could consist of : (1) a polynomial u_i = s_i a + e_i, along with an onion encrypted packet that gives each server (2) maybe their reconciliation data r_i, and (3) a transformation x_i : u_i -> u_{i+1} = s_{i+1} a + e_{i+1}, where i is the node's index along the path.
Any proposal for this transformation x_i needs a proof of security. About the best you're going to do here is reducing its security to existing Ring-LWE assumptions. If say x_i means add s' a + e' so that s_{i+1} = s_i + s' and e_{i+1} = e_i + e', then you're depending upon the Ring-LWE assumptions to know that s' a + e' looks like a random polynomial.
As a result, your hybrid protocol is unlikely to provably offer stronger _anonymity_ properties than a pure Ring-LWE key exchange, even if its _cryptography_ is as strong as the stronger of Ring-LWE and ECDH.
I could say more about why, say, the choice of s' and e' might leak information about s_i and e_i, but I wanted to keep this brief. The essence of the observation is that any sort of Sphinx-like key mutation trick requires assumptions that the group-based construction does not.
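(A toy numpy illustration of the additive mutation described above, with a tiny dimension purely for exposition; it checks that the mutated u keeps the Ring-LWE form, which is exactly why the anonymity argument ends up resting on the Ring-LWE assumption:)

    import numpy as np

    q, n = 12289, 8            # toy dimension; NewHope uses n = 1024

    def ringmul(f, g):
        # Schoolbook multiplication in Z_q[X]/(X^n + 1).
        r = np.zeros(n, dtype=np.int64)
        for i in range(n):
            for j in range(n):
                if i + j < n:
                    r[i + j] += f[i] * g[j]
                else:
                    r[i + j - n] -= f[i] * g[j]
        return r % q

    rng = np.random.default_rng(1)
    a = rng.integers(0, q, n)
    s, e = rng.integers(-4, 5, n), rng.integers(-4, 5, n)    # small noise polys
    s2, e2 = rng.integers(-4, 5, n), rng.integers(-4, 5, n)

    u = (ringmul(s, a) + e) % q                 # u_i = s_i*a + e_i
    u_next = (u + ringmul(s2, a) + e2) % q      # x_i: add s'*a + e'

    # Same Ring-LWE shape with s_{i+1} = s_i + s' and e_{i+1} = e_i + e':
    assert (u_next == (ringmul(s + s2, a) + e + e2) % q).all()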
I found this an interesting apparent limit on making hybrids more efficient than what Isis and Peter have proposed.
Best, Jeff
On 7 May 2016, at 05:17, isis isis@torproject.org wrote:
...
Let `ID` be a router's identity key taken from the router microdescriptor. In the case for relays possessing Ed25519 identity keys (c.f. Tor proposal #220), this is a 32-byte string representing the public Ed25519 identity key. For backwards and forwards compatibility with routers which do not possess Ed25519 identity keys, this is a 32-byte string created via the output of H(ID).
I don't understand why we do this backwards and forwards compatibility for ID, when the proposal only works for relays with an ed25519 key in their descriptor.
I'm sure I'm missing something basic - I'm still learning how to read crypto papers and specifications.
... The function CVPD4 does the following:
    CVPD4(y0,y1,y2,y3):
        v00 = round(y0/2q)
        v01 = round(y1/2q)
        v02 = round(y2/2q)
        v03 = round(y3/2q)
        v10 = round((y0-1)/2q)
        v11 = round((y1-1)/2q)
        v12 = round((y2-1)/2q)
        v13 = round((y3-1)/2q)
        t  = abs(y0 - 2q*v00)
        t += abs(y1 - 2q*v01)
        t += abs(y2 - 2q*v02)
        t += abs(y3 - 2q*v03)
        if (t < 2q):
            v0 = v00
            v1 = v01
            v2 = v02
            v3 = v03
            k = 0
        else:
            v0 = v10
            v1 = v11
            v2 = v12
            v3 = v13
            k = 1
        return (v0-v3, v1-v3, v2-v3, k+2*v3)
In this description, round() returns the closest integer and abs() returns the absolute value. Note that all computations involved in helprec operate on secret data and must be protected against timing attacks.
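(A direct Python transcription, for illustration; note that it must pin down the rounding convention, which the pseudocode above leaves open. Round-half-up is an arbitrary choice here, not necessarily the spec's:)

    import math

    Q = 12289

    def cvpd4(y0, y1, y2, y3, q=Q):
        def rnd(x):                      # round-half-up: an assumption
            return math.floor(x + 0.5)
        ys = (y0, y1, y2, y3)
        v0 = [rnd(y / (2 * q)) for y in ys]
        v1 = [rnd((y - 1) / (2 * q)) for y in ys]
        t = sum(abs(y - 2 * q * v) for y, v in zip(ys, v0))
        if t < 2 * q:                    # real code needs a constant-time select
            v, k = v0, 0
        else:
            v, k = v1, 1
        return (v[0] - v[3], v[1] - v[3], v[2] - v[3], k + 2 * v[3])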
round() is underspecified here: does 0.5 round to 0 or 1? Or is it not possible to get answers that are exactly halfway between two integers?
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B ricochet:ekmygaiu4rzgsk6n
Tim Wilson-Brown - teor transcribed 3.7K bytes:
On 7 May 2016, at 05:17, isis isis@torproject.org wrote:
...
Let `ID` be a router's identity key taken from the router microdescriptor. In the case for relays possessing Ed25519 identity keys (c.f. Tor proposal #220), this is a 32-byte string representing the public Ed25519 identity key. For backwards and forwards compatibility with routers which do not possess Ed25519 identity keys, this is a 32-byte string created via the output of H(ID).
I don't understand why we do this backwards and forwards compatibility for ID, when the proposal only works for relays with an ed25519 key in their descriptor.
Relays have a Curve25519 "ntor-onion-key" in their microdescriptors, is that what you meant?
By "router's identity key from the microdescriptor", I was referring to either the RSA-1024 identity key which is in the "id rsa1024" lines, [0] or (whenever prop#220 features are fully available) the "id ed25519" lines (see prop#220 §3.2). [1] I was mostly concerned about the backwards-/forwards- compatibility because otherwise there would be an annoying length difference that would make things messy.
round() is underspecified here: does 0.5 round to 0 or 1? Or is it not possible to get answers that are exactly halfway between two integers?
Yes, that is under-specified. Peter fixed it in this commit. [2]
[0]: https://collector.torproject.org/recent/relay-descriptors/microdescs/micro/2... [1]: https://gitweb.torproject.org/torspec.git/tree/proposals/220-ecc-id-keys.txt... [2]: https://gitweb.torproject.org/user/isis/torspec.git/commit/?h=draft/newhope&...
Hello,
My tinfoil hat went crinkle in the night[0], and I had an additional thought here. Should we encrypt the `CLIENT_NEWHOPE` and `SERVER_NEWHOPE` values using <AE construct of your choice> and something derived from `EXP(Z,x)`/`EXP(X,z)`?
It doesn't have perfect forward secrecy (compromise of `z` would allow the adversary to decrypt all previous ciphertexts), but it's better than nothing.
CPU-wise it's 1 additional KDF call (assuming you squeeze out the forward and return symmetric keys at once), 1 extra CSPRNG call (for the IV), and 2 AE calls. Network-wise it's `len(IV) + len(Tag)` bytes of extra traffic in each direction. Both of these I think are relatively cheap.
Regards,
Yawning Angel transcribed 2.5K bytes:
Hello,
My tinfoil hat went crinkle in the night[0], and I had an additional thought here. Should we encrypt the `CLIENT_NEWHOPE` and `SERVER_NEWHOPE` values using <AE construct of your choice> and something derived from `EXP(Z,x)`/`EXP(X,z)`?
It doesn't have perfect forward secrecy (compromise of `z` would allow the adversary to decrypt all previous ciphertexts), but it's better than nothing.
CPU-wise it's 1 additional KDF call (assuming you squeeze out the forward and return symmetric keys at once), 1 extra CSPRNG call (for the IV), and 2 AE calls. Network-wise it's `len(IV) + len(Tag)` bytes of extra traffic in each direction. Both of these I think are relatively cheap.
This is an interesting idea. Let me just make sure I've understood correctly. In your idea, it goes like this?
    Initiator                                 Responder

    SEED := H(randombytes(32))
    x, X := X25519_KEYGEN()
    a, A := NEWHOPE_KEYGEN(SEED)
    CLIENT_HDATA := ID || Z || EXP(Z, X || A)

            --- CLIENT_HDATA --->

                                              y, Y := X25519_KEYGEN()
                                              X, A := EXP(z, X || A)
                                              NTOR_KEY, AUTH := NTOR_SHAREDB(X,y,Y,z,Z,ID,B)
                                              M, NEWHOPE_KEY := NEWHOPE_SHAREDB(A)
                                              SERVER_HDATA := Y || AUTH || M
                                              sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)

            <-- SERVER_HDATA ----

    NTOR_KEY := NTOR_SHAREDA(x, X, Y, Z, ID, AUTH)
    NEWHOPE_KEY := NEWHOPE_SHAREDA(M, a)
    sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)
I think I see what you're saying.
In the current context, it doesn't make much sense: an adversary with a fully operational quantum computer can, in polynomial time, decrypt the Curve25519 encryption to the ntor-onion-key ("Z" here) and then have trivial access to the rest of the handshake.
However, moving forward, since we intend to switch to AE(AD) ciphers at the circuit level, this makes absolute sense. It costs us two extra modular exponentiations now, but, in the future, with an AEAD cipher, we should be able to put `X` into the associated data and have the handshake's authentication be implicit from the first round. For such a minor expense as two exp() on either side, we impose exponential extra computational effort on a classical adversary, and still polynomial effort on a quantum adversary, which makes this worthwhile.
Did I understand the idea correctly?
Best Regards,
On Thu, 12 May 2016 03:48:21 +0000 isis isis@torproject.org wrote: [snippity]
This is an interesting idea. Let me just make sure I've understood correctly. In your idea, it goes like this?
Almost.
Let Seal(key, nonce, plaintext) be the forward GCM-AES128/XChaCha20Poly1305/whatever encrypt operation, with the specified key, nonce, and plaintext.
Let Unseal(key, nonce, ciphertext) be the reverse GCM-AES128/XChaCha20Poly1305/whatever decrypt operation with the specified key, nonce, and ciphertext.
    Initiator                                 Responder

    SEED := H(randombytes(32))
    x, X := X25519_KEYGEN()
    a, A := NEWHOPE_KEYGEN(SEED)

    xZ := EXP(Z,x)                  // X25519 scalar multiply (raw, later used in NTOR_SHAREDA)
    kTmp := H(tweak || xZ)
    nonce := randombytes(nonceSize) // Random nonce.

    CLIENT_HDATA := X || nonce || Z || Seal(kTmp, nonce, ID || A)
                                    // If Seal does AEAD, also send Z as the AD.

            --- CLIENT_HDATA --->

                                              zX := EXP(X,z)          // X25519 scalar multiply (raw)
                                              kTmp := H(tweak || zX)
                                              ID, A := Unseal(kTmp, nonce, CLIENT_HDATA[ciphertextOffset:])
                                                                      // Abort if Unseal fails.

                                              y, Y := X25519_KEYGEN()
                                              yX := EXP(X, y)         // X25519 scalar multiply (raw)
                                              NTOR_KEY, AUTH := NTOR_SHAREDB(zX,yX,Z,Y,ID,B)
                                                                      // No X25519 calls in here anymore.
                                              M, NEWHOPE_KEY := NEWHOPE_SHAREDB(A)

                                              kTmp' := H(tweak || yX) // Return AE key; could reuse kTmp,
                                                                      // but this has better forward secrecy.
                                              nonce' := randombytes(nonceSize)
                                              SERVER_HDATA := Y || nonce' || Seal(kTmp', nonce', AUTH || M)

                                              sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)

            <-- SERVER_HDATA ----

    xY := EXP(Y,x)                  // X25519 scalar multiply (raw)
    kTmp' := H(tweak || xY)
    AUTH, M := Unseal(kTmp', nonce', SERVER_HDATA[ciphertextOffset':])
                                    // Abort if Unseal fails.

    NTOR_KEY := NTOR_SHAREDA(xZ, xY, Y, Z, ID, AUTH)  // No X25519 calls in here anymore either.
    NEWHOPE_KEY := NEWHOPE_SHAREDA(M, a)
    sk := SHAKE-256(NTOR_KEY || NEWHOPE_KEY)
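(A minimal Python sketch of the client's Seal step, assuming ChaCha20Poly1305 as the AE(AD) construct and SHAKE-256 as H; the tweak constant and names are hypothetical:)

    import os, hashlib
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    TWEAK = b"hybrid-hs:client-ae"   # hypothetical domain-separation tweak

    def seal_client_hdata(X, Z, xZ, node_id, A):
        # CLIENT_HDATA := X || nonce || Z || Seal(kTmp, nonce, ID || A),
        # with Z passed as the associated data.
        k_tmp = hashlib.shake_256(TWEAK + xZ).digest(32)
        nonce = os.urandom(12)
        ct = ChaCha20Poly1305(k_tmp).encrypt(nonce, node_id + A, Z)
        return X + nonce + Z + ct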
In the current context, it doesn't make much sense: an adversary with a fully operational quantum computer can, in polynomial time, decrypt the Curve25519 encryption to the ntor-onion-key ("Z" here) and then have trivial access to the rest of the handshake.
Correct. In a post quantum world, this is totally pointless, especially since `Z` is publicly available from the microdescriptors, but in the mean time it's extra authenticated, and extra sekrit.
Did I understand the idea correctly?
Mostly. There are no extra EXP() calls at all. NTOR_SHAREDA/SHAREDB requires that we calculate:

    Client: EXP(Z,x), EXP(Y,x)
    Server: EXP(X,z), EXP(X,y)
The client uses EXP(Z,x), which we have to calculate anyway for ntor, to encrypt/authenticate `ID || A` of CLIENT_HDATA (and optionally Z). (There isn't any point in including X in the AD, because if someone messes with X, Unseal() will fail).
The server uses EXP(X,y), which we have to calculate anyway for ntor, to encrypt/authenticate `AUTH || M`.
The only overhead is 2 calls to the AE(AD) construct, 2x random nonce generation, and 2 calls into H(). If we want to be extra sneaky, we could also do the NTRU handshake this way (and move the handshake identifier into the encrypted envelope) so that only the recipient can see which algorithm we're using as well (So: Bad guys must have a quantum computer and calculate `z` to figure out which post quantum algorithm we are using).
Regards,
On Thu, 2016-05-12 at 05:29 +0000, Yawning Angel wrote:
and move the handshake identifier into the encrypted envelope) so that only the recipient can see which algorithm we're using as well (So: Bad guys must have a quantum computer and calculate `z` to figure out which post quantum algorithm we are using).
This sounds like a win.
We still do not know if/when quantum computers will become practical. It was only just last year that 15 was finally factored "without cheating" : http://www.scottaaronson.com/blog/?p=2673
We do know that advancements against public key crypto systems will occur, so wrapping up the more unknown system more tightly sounds wise.
In the shorter term, SIDH would take only one extra cell, maybe none if tweaked downward, compared to New Hope's four, and whatever NTRU needs. This variation might be good or bad for anonymity, but it sounds better if fewer nodes can compare the number of packets sent with the algorithms used.
Jeff
On Thu, 12 May 2016 11:58:56 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Thu, 2016-05-12 at 05:29 +0000, Yawning Angel wrote:
and move the handshake identifier into the encrypted envelope) so that only the recipient can see which algorithm we're using as well (So: Bad guys must have a quantum computer and calculate `z` to figure out which post quantum algorithm we are using).
This sounds like a win.
We still do not know if/when quantum computers will become practical. It was only just last year that 15 was finally factored "without cheating" : http://www.scottaaronson.com/blog/?p=2673
We do know that advancements against public key crypto systems will occur, so wrapping up the more unknown system more tightly sounds wise.
In the shorter term, SIDH would take only one extra cell, maybe none if tweaked downward, compared to New Hope's four, and whatever NTRU needs. This variation might be good or bad for anonymity, but it sounds better if fewer nodes can compare the number of packets sent with the algorithms used.
Well, if we move the handshake identifier inside the AE(AD) envelope, we can also add padding to normalize the handshake length at minimal extra CPU cost by adding a length field and some padding inside as well.
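(As a sketch, something like the following; the 2-byte length prefix is an arbitrary choice:)

    def pad_envelope(body, target_len):
        # Length field plus zero padding inside the sealed envelope, so
        # every handshake variant comes out at the same length on the wire.
        assert len(body) + 2 <= target_len
        pad = target_len - 2 - len(body)
        return len(body).to_bytes(2, "big") + body + b"\x00" * pad

    def unpad_envelope(padded):
        n = int.from_bytes(padded[:2], "big")
        return padded[2:2 + n]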
It would remove some of the advantages of using algorithms with shorter keys (since it would result in more traffic on the wire than otherwise would have been), but handshakes will be indistinguishable to anyone but space aliens and the final destinations...
Regards,
On Thu, 2016-05-12 at 11:17 +0000, Yawning Angel wrote:
Well, if we move the handshake identifier inside the AE(AD) envelope, we can also add padding to normalize the handshake length at minimal extra CPU cost by adding a length field and some padding inside as well.
It would remove some of the advantages of using algorithms with shorter keys (since it would result in more traffic on the wire than otherwise would have been), but handshakes will be indistinguishable to anyone but space aliens and the final destinations...
Is that even beneficial though?
If we choose our post-quantum algorithm randomly from New Hope and SIDH, and add random delays, then maybe an adversary has less information about when a circuit build is progressing to the next hop, or when it's actually being used?
Is there some long delay between circuit build and first use that makes anything done to obscure build useless?
Jeff
On Thu, 12 May 2016 14:18:41 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Thu, 2016-05-12 at 11:17 +0000, Yawning Angel wrote:
Well, if we move the handshake identifier inside the AE(AD) envelope, we can also add padding to normalize the handshake length at minimal extra CPU cost by adding a length field and some padding inside as well.
It would remove some of the advantages of using algorithms with shorter keys (since it would result in more traffic on the wire than otherwise would have been), but handshakes will be indistinguishable to anyone but space aliens and the final destinations...
Is that even beneficial though?
Padding the handshake out for the number of cells? Maybe? I'm not entirely convinced here either way, just highlighting it as an option.
If we choose our post-quantum algorithm randomly from New Hope and SIDH, and add random delays, then maybe an adversary has less information about when a circuit build is progressing to the next hop, or when it's actually being used?
Dunno. I need to re-benchmark product-form NTRU, but I think it's in the same ballpark as newhope (it may be faster, but the difference isn't totally ridiculous, the way anything compared to SIDH is).
I don't think SIDH is really something to worry about now anyway... having both NTRUEncrypt and newhope as options for the first pass should be sufficient, and the bulk of the code required to do this is the infrastructure work for modular/large handshakes anyway (Once we support at least one PQ algorithm, adding others is relatively easy).
Is there some long delay between circuit build and first use that makes anything done to obscure build useless?
We pre-build circuits, but the telescoping extension process and opportunistic data both mean that circuits see "traffic" near-immediately in most cases (everyone but the exit will see the traffic of handshaking to further hops, the exit sees opportunistic data in some cases).
Regards,
Just a couple questions :
Is SIDH costing 100 times the CPU such a big deal, assuming it's running on another thread? Can it be abused for DOS attacks for example? Is that CPU time needed for symmetric crypto? etc. If so, is it worth restricting to your guard node?
Is New Hope's 3+ times the bandwidth a big deal? I suppose circuit building does not occupy much bandwidth, so no.
On Thu, 2016-05-12 at 12:33 +0000, Yawning Angel wrote:
We pre-build circuits, but the telescoping extension process and opportunistic data both mean that circuits see "traffic" near-immediately in most cases (everyone but the exit will see the traffic of handshaking to further hops, the exit sees opportunistic data in some cases).
Ok. I suppose that leaks a node's position in the circuit regardless, but perhaps that's not a concern. And I donno anything about opportunistic data.
I don't think SIDH is really something to worry about now anyway...
If you like, I could ask Luca de Feo if he imagines it getting much faster, but I suspect his answer would be only a smallish factor, like another doubling or so.
Assuming we stick to schemes with truly hybrid anonymity, then I suspect the anonymity cost of early adoption is that later parameter tweaks leak info about a user's tor version. We can always ask the MS SIDH folk, Luca, etc. what parameters could be tweaked in SIDH to get some idea.
Jeff
p.s. If taken outside Tor's context, I would disagree with your statement on SIDH :
I donno NTRU well enough to comment on even how different the underlying reconciliation is from New Hope, but there might be an argument that most advances big enough to actually break New Hope would break NTRU and NTRU' too, so maybe one Ring-LWE scheme suffices. SIDH is an entirely different beast though.
I've warm fuzzy feelings about the "evaluate on two points trick" used by Luca de Feo, et al., and by this SIDH, to fix previous attempts. It could always go down in mathematical flames, but it makes the scheme obnoxiously rigid, leaving jack for homomorphic properties, and could prove remarkably robust as a trapdoor.
By comparison, there are going to be more papers on Ring-LWE because academic cryptographers will enjoy playing with its homomorphic properties. Yet one could imagine the link between Ring-LWE and dihedral HSP becoming more dangerous "looking", not necessarily producing a viable quantum attack, but maybe prompting deeper disagreements about parameter choices.
In other words, I'd expect our future trust in Ring-LWE and SIDH to evolve in different ways. And counting papers will not be informative.
Imho, almost anyone protecting user-to-user communications should hybrid ECDH, Ring-LWE, and SIDH all together, as users have CPU cycles to burn. Tor is user-to-volunteer-server though, so the economics are different.
On Mon, 16 May 2016 20:54:42 +0200 Jeff Burdges burdges@gnunet.org wrote:
Just a a couple questions :
Is SIDH costing 100 times the CPU such a big deal, assuming it's running on another thread? Can it be abused for DOS attacks for example? Is that CPU time needed for symmetric crypto? etc. If so, is it worth restricting to your guard node?
Yes. Even if it's multithreaded (and the client-side circuit-build crypto currently is not, though it will be shortly), this obviously "depends" on what each node is doing.
Eg:
* Client-side performance is totally irrelevant up until the moment it becomes a busy HS (e.g. facebookcorewwwi.onion), in which case it matters a huge deal (the X25519 scalar-mult Montgomery ladder has been a prominent feature in all of our profiling attempts for quite a while).

* Relay-side performance is totally irrelevant up until the moment it offers a lot of bandwidth, because, the way our path selection weighs things right now, circuit builds are weighted by consensus fraction.
We used to have a slower handshake, and it was causing issues (TAP and that stupid botnet). Maybe I'm being over-cautious here. Maybe even the hybrid construct is too slow. Maybe bandwidth matters more.
Implementing the infrastructure to allow runtime selection of a PQ handshake, and actually doing any PQ handshake are required to really investigate these sort of issues instead of arguing over algorithms...
Is New Hope's 3+ times the bandwidth a big deal? I suppose circuit building does not occupy much bandwidth, so no.
It might. Sending bytes isn't free in terms of CPU either. The NTRUEncrypt-based construct is fast and has shorter keys.
I don't think SIDH is really something to worry about now anyway...
If you like, I could ask Luca de Feo if he imagines it getting much faster, but I suspect his answer would be only a smallish factor, like another doubling or so.
Assuming we stick to schemes with truly hybrid anonymity, then I suspect the anonymity cost of early adoption is that later parameter tweaks leak info about a user's tor version. We can always ask the MS SIDH folk, Luca, etc. what parameters could be tweaked in SIDH to get some idea.
Agree here, though there is an auto-updater for a reason, and if we end up being overly concerned about that, we won't even be using X25519....
[snip]
In other words, I'd expect our future trust in Ring-LWE and SIDH to evolve in different ways. And counting papers will not be informative.
Yeah probably. I can envision having no choice but to use SIDH sometime in the future (or vice versa). It's an evolving field, and my current mindset is "pick one or two that probably won't kill the network (CPU/network/whatever)", integrate it in a way that is easy to switch at a later point, and deploy it.
Imho, almost anyone protecting user-to-user communications should hybrid ECDH, Ring-LWE, and SIDH all together, as users have CPU cycles to burn. Tor is user-to-volunteer-server though, so the economics are different.
Dunno, I think this depends entirely on "how much money does the person operating the server have to throw at adding more CPU" vs "how many connections/sec does it need to sustain".
I think doing something like SIDH on a massively popular website (say, hypothetically it gets added to TLS) would get expensive really quickly, but then again, it's VC monopoly money that's paying for it anyway, and it's not like they ever need to turn a profit right? (*cough* Twitter *cough*).
Regards,
Yawning Angel <yawning@...> writes:
Implementing the infrastructure to allow runtime selection of a PQ handshake, and actually doing any PQ handshake are required to really investigate these sort of issues instead of arguing over algorithms...
Right. It's probably premature to make a final, definitive choice of PQ algorithm, since any of them could be broken mathematically. The possibility of backdoors in newhope (at least two different ways to do it) makes me nervous. SIDH is (perhaps?) too slow now, but computers will get faster. NTRU is patented for the moment, but that will expire soon. It's all a tradeoff between security / CPU cycles / bandwidth, and that tradeoff is bound to change over time.
So we should have a protocol that allows a hybrid of X25519 for classical security plus one or more (or even zero, until we want to "switch it on") of the PQ algorithms. Use EXP(Z,x) to protect the choice of PQ algorithms and the actual PQ public key(s). Combine the X25519 shared secret and (all) the PQ shared secret(s) in one SHAKE-256 calculation to give sk.
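(Sketched in Python with illustrative names; the point is a single XOF call over the concatenated secrets, however many there are:)

    import hashlib

    def derive_sk(x25519_secret, pq_secrets, out_len=32):
        # sk := SHAKE-256(classical shared secret || all PQ shared secrets).
        # With an empty pq_secrets list this degenerates to X25519-only.
        xof = hashlib.shake_256()
        xof.update(x25519_secret)
        for secret in pq_secrets:
            xof.update(secret)
        return xof.digest(out_len)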
We'd need to think about what choice of PQ algorithms the client is allowed to make and what the server would accept. The risk is that we go down a rabbit-hole of TLS-style algorithm negotiation to agree on algorithm choices that are acceptable to both - we want it to "just work" without argument. But if we don't allow some choice, then we may find that the protocol is too slow, or the bandwidth too cumbersome, or we panic when one of the PQ algorithms is broken.
[snip]
In other words, I'd expect our future trust in Ring-LWE and SIDH to evolve in different ways. And counting papers will not be informative.
Yeah probably. I can envision having no choice but to use SIDH sometime in the future (or vice versa). It's an evolving field, and my current mindset is "pick one or two that probably won't kill the network (CPU/network/whatever)", integrate it in a way that is easy to switch at a later point, and deploy it.
The important thing now is surely to get the protocol right so that we can slot algorithms in or out (and then pick one or two that we actually want to integrate).
-- lukep
On Tue, 17 May 2016 17:49:46 +0000 (UTC) lukep lukep@tutanota.com wrote:
[snip]
In other words, I'd expect our future trust in Ring-LWE and SIDH to evolve in different ways. And counting papers will not be informative.
Yeah probably. I can envision having no choice but to use SIDH sometime in the future (or vice versa). It's an evolving field, and my current mindset is "pick one or two that probably won't kill the network (CPU/network/whatever)", integrate it in a way that is easy to switch at a later point, and deploy it.
The important thing now is surely to get the protocol right so that we can slot algorithms in or out (then pick one or two that we actually want to integrate)
The relevant proposals here would be:
https://gitweb.torproject.org/torspec.git/tree/proposals/264-subprotocol-ver... https://gitweb.torproject.org/torspec.git/tree/proposals/249-large-create-ce...
With emphasis on the 264, since that's probably how link handshake crypto support will be signified.
Regards,
Not sure if this has been noted before on this thread, but the BoringSSL team is working on something very similar:
https://boringssl-review.googlesource.com/#/c/7962/
On Thu, May 19, 2016 at 1:01 PM Yawning Angel yawning@schwanenlied.me wrote:
[snip]
On Thu, 19 May 2016 17:21:47 +0000 Deirdre Connolly durumcrustulum@gmail.com wrote:
Not sure if this has been noted before on this thread, but the BoringSSL team is working on something very similar:
Skimming the code:
* The protocol level stuff is not useful at all, because the sort of problems that need to be solved (or the changes that need to be made) in the Tor wire protocol for any sort of PQ handshake are rather different from "just adding another TLS key exchange mechanism".
* Their hybrid construct is unauthenticated (handled separately by TLS, with a signature), and is `X25519SharedSecret | NHSharedSecret`, passed into a KDF.
* They have their own special snowflake newhope variant (The code is based on the `ref` version, with Google copyrights bolted on top), functional changes are:
* CTR-AES128 instead of SHAKE is used to sample `a` (same algorithm, doesn't have the sampling optimization or attempt to hide the rejection sampling timing variation).
* SHA256 is used instead of SHA3-256 to generate `mu` from `nu`.
* RAND_bytes() is called for noise sampling instead of using ChaCha20 or CTR-AES256.
I don't find these changes to be particularly interesting. Any system where using AES-CTR like this makes sense will benefit more from using a vectorized NTT/reconciliation.
Regards,
Granted, this is an experimental implementation (as acknowledged by the Boring devs) in a very different protocol with different tradeoffs.
On Thu, May 19, 2016 at 2:42 PM Yawning Angel yawning@schwanenlied.me wrote:
[snip]
Yawning Angel yawning@schwanenlied.me wrote:
Hi Yawning,
Thanks for the more detailed description; I think I understand now what you're saying. I also agree that the cost is small (only some extra symmetric stuff happening).
I don't like the use of AES-GCM as an authenticated-encryption algorithm, but as far as I understand, AEAD is a completely separate discussion within Tor and this would be replaced by whatever that discussion's outcome is?
Correct. In a post quantum world, this is totally pointless, especially since `Z` is publicly available from the microdescriptors, but in the mean time it's extra authenticated, and extra sekrit.
Can you describe a pre-quantum attacker who breaks the non-modified key exchange and does not, with essentially the same resources, break the modified key exchange? I'm not opposed to your idea, but it adds a bit of complexity and I would like to understand what precisely the benefit is.
Best regards,
Peter
On Thu, 2016-05-12 at 15:54 +0200, Peter Schwabe wrote:
Can you describe a pre-quantum attacker who breaks the non-modified key exchange and does not, with essentially the same resources, break the modified key exchange? I'm not opposed to your idea, but it adds a bit of complexity and I would like to understand what precisely the benefit is.
Assuming I understand what Yawning wrote :
It's about metadata leakage, not actual breaks.
If Tor were randomly selecting amongst multiple post-quantum algorithms, then a malicious node potentially learns more information about the user's tor by observing the type of the subsequent node's handshake.
In particular, if there is a proliferation of post-quantum choices, then it sounds very slightly more dangerous to allow users to configure what post-quantum algorithms they use without Yawning's change.
Jeff
p.s. At the extreme example, there is my up thread comment refuting the idea of using Sphinx-like packets with Ring-LWE.
I asked : Why can't we send two polynomials (a,A) and mutate them together with a second Ring-LWE like operation for each hop? It's linear bandwidth in the number of hops as opposed to quadratic bandwidth, which saves 2-4k up in Tor's case and maybe keeps node from knowing quite as much about their position.
Answer : If you do that, it forces the whole protocol's anonymity to rest on the Ring-LWE assumption, so it's no longer a hybrid protocol for anonymity, even though cryptographically it remains hybrid.
On Thu, 12 May 2016 20:31:56 +0200 Jeff Burdges burdges@gnunet.org wrote:
On Thu, 2016-05-12 at 15:54 +0200, Peter Schwabe wrote:
Can you describe a pre-quantum attacker who breaks the non-modified key exchange and does not, with essentially the same resources, break the modified key exchange? I'm not opposed to your idea, but it adds a bit of complexity and I would like to understand what precisely the benefit is.
Assuming I understand what Yawning wrote :
It's about metadata leakage, not actual breaks.
If Tor were randomly selecting amongst multiple post-quantum algorithms, then a malicious node potentially learns more information about the user's tor by observing the type of the subsequent node's handshake.
In particular, if there is a proliferation of post-quantum choices, then it sounds very slightly more dangerous to allow users to configure what post-quantum algorithms they use without Yawning's change.
Indeed, nailed it in one.
My tinfoil hat crinkles less with the idea that people need to drill through X25519/an AEAD construct before they can start trying to break the PQ handshake (serializing the process somewhat, instead of being able to work on breaking each component of the hybrid construct in parallel)[0].
Most of my thoughts in this area stem from writing an obfuscated transport recently where I do use early encryption + padding to hide the algorithms used for the handshake.
As a side note, if `Z` weren't a value that the bad guys could pull out of the microdesc consensus, we could avoid sending it on the wire (and use the ephemeral/static derived keys for both directions) and really win: with only `X` and, say, `SHA3-256(Z)` (for disambiguation) available to the attacker, we win, period, regardless of space aliens. But alas, we need to distribute `Z` somehow, so this is somewhat moot (hence ephemeral/static in the forward direction and ephemeral/ephemeral in the reverse is better for forward-secrecy reasons).
Regards,
Some great developments in lattice-based crypto. DJB just released a paper on NTRU Prime:
1. Competitively fast compared to the leading lattice-based cryptosystems including New Hope.
2. Safer implementation of NTRU that avoids vulnerable ring structures and runs in constant-time.
3. The only implementation that mitigates decryption failures completely, killing information leaks to adversaries.
4. Includes some handy advice for "transitional cryptography" - mixing and matching classical signature schemes with PQ public-keys.
Hi all,
- The only implementation that mitigates decryption failures completely,
killing information leaks to adversaries.
This is clearly a nice-to-have feature, but it comes with a tradeoff. To remove decryption failures you need to increase the parameter q, but this affects size (and so performance) in two ways: first, the key and ciphertext are arrays of integers mod q, so obviously increasing log_2(q) increases key and ciphertext size; but second, increasing q makes lattice reduction attacks more effective, so it means that you need to increase the dimension parameter N as well to get the same level of lattice security. Conversely, it's not difficult to calculate upper bounds on decryption failure probabilities, so it's straightforward to find a q that gives less than 2^-k chance of a decryption failure. There's no particular need for a decryption failure probability that's less than the security of the other parts of the cryptosystem.
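(A back-of-envelope illustration of the first effect: the packed size of an array of N integers mod q, which is what keys and ciphertexts are made of:)

    import math

    def ring_element_bytes(N, q):
        # N coefficients, each packed into ceil(log2(q)) bits.
        return N * math.ceil(math.log2(q)) // 8

    ring_element_bytes(1024, 12289)      # 14-bit coefficients -> 1792 bytes
    ring_element_bytes(1024, 2 * 12289)  # one more bit each    -> 1920 bytes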
Just wanted to explain why the standardized NTRUEncrypt parameter sets (from https://github.com/NTRUOpenSourceProject/ntru-crypto) are chosen the way they are, i.e. to have nonzero decryption failure probability. We could have chosen larger q and N, but didn't think the tradeoff was worth it. Obviously the other point of view is legitimate too.
Cheers,
William
On Thu, May 12, 2016 at 9:51 PM, bancfc@openmailbox.org wrote:
[snip]
https://ntruprime.cr.yp.to/ntruprime-20160511.pdf
bancfc@openmailbox.org transcribed 0.7K bytes:
Some great developments in lattice-based crypto. DJB just released a paper on NTRU Prime:
- Competitively fast compared to the leading lattice-based cryptosystems
including New Hope.
- Safer implementation of NTRU that avoids vulnerable ring structures and
runs in constant-time.
- The only implementation that mitigates decryption failures completely,
killing information leaks to adversaries.
- Includes some handy advice for "transitional cryptography" - mixing and
matching classical signature schemes with PQ public-keys.
Hello,
I am similarly excited to see a more comprehensive write up on the NTRU Prime idea from Dan's blog post several years ago on the idea for a subfield-logarithm attack on ideal-lattice-based cryptography. [0] The idea to remove some of the ideal structure from the lattice, while still aiming to keep a similarly high, estimated minimum, post-quantum security strength as Newhope (~2^128 bits post-quantum security and ~2^215 classical for NTRU Prime, versus ~2^206 bits post-quantum and ~2^229 classical for Newhope) and speed efficiencies competitive with NewHope, [1] by altering the original NTRU parameters is very exciting, and I'm looking forward to more research w.r.t. to the ideas of the original blog post (particularly the exploitation of the ideal structure). Additionally, the Toom6 decomposition of the "medium-sized" 768-degree polynomial in NTRU Prime in order to apply Karatsuba is quite elegant. Also, I'm glad to see that my earlier idea [2] to apply a stable sorting network in order to generate valid polynomial coefficients in constant-time is also suggested within the NTRU Prime paper.
However, from the original NTRU Prime blog post, Dan mentioned towards the end: "I don't recommend actually using NTRU Prime unless and until it survives years of serious cryptanalytic attention, including quantitative evaluation of specific parameter choices." Léo Ducas, one of the NewHope authors, has responded to the NTRU Prime paper with a casual cryptanalysis of its security claims, [3] mentioning that "A quick counter-analysis suggests the security of the proposal is overestimated by about 75 bits" bringing NTRU Prime down to ~2^140 classical security.
Current estimates on a hybrid BKZ+sieving attack combined with Dan's subfield-logarithm attack, *if* it proves successful someday (and it's uncertain whether it will), would (being quite generous towards the attacker) roughly halve the pre-quantum security bits for n=1024 (since the embedded subfield tricks are probably not viable), bringing NewHope down to 103/114 bits. For the case of the hybrid handshake in this proposal, it still doesn't matter, because the attacker would still also need to break X25519, which still keeps its 2^128 bits of security. (Not to mention that 103 bits of post-quantum security is not terrible, considering that the attacker still needs to do 2^103 computations for each and every Tor handshake she wants to break, because keys are not reused.)
Please feel free to correct me if I've misunderstood something. IANAC and all that.
Further, there are other problems in the NTRU Prime paper:
1. Defining PFS as "the server erases its key every 60 seconds" seems arbitrary and a bit strange. It also makes the claims hard to analyse in comparison with the NTRU literature (where, as far as I know, it's never stated whether or not keys may be reused, and what downsides might come with that) as well as with NewHope (where it's explicitly stated that keys should not be reused).
2. In Fig. 1.2, the number of bytes sent by a NewHope client is 1792, not 2048. (To be fair, it was 2048 bytes in an earlier version.)
3. The bandwidth estimates for NTRU Prime do not take into account that, due to providing a key-encapsulation mechanism rather than a key exchange, the client must already know the server's long-term public encryption key, in order that the client may encrypt its public key to the server in the first round of the handshake.
Further, and more specifically in the context of Tor handshakes, this requires that Tor either add a second round to the handshake (which we currently have no mechanism, nor proposed mechanism, for), or else that we add the server's NTRU Prime key to the server's (micro)descriptor, increasing descriptor size by 1232 bytes per relay. For the case of adding these keys to microdescriptors, where the average microdescriptor size is currently 416 bytes [4] and ~1 Gb/s is spent on requests to the directories for descriptors, [5] increasing the average microdescriptor size to 1648 bytes would roughly quadruple the amount of network traffic used by directories to respond to fetches (and then double that number again, because additional bandwidth is used through the client's guard relay, since most directory fetches are tunneled). This means that any relay providing < ~1.142 Mb/s bandwidth would be using up more network bandwidth to tell everything else in the network about its existence and keys than the relay itself actually provides. By that measure, 3840 of the current ~7100 relays would become a burden on the network. This might be doable, [6] but I just wanted to point out how much it matters for us to not need to add giant keys to descriptors.
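(Reproducing the back-of-envelope directory math, using the figures cited above:)

    avg_md    = 416          # bytes: current average microdescriptor
    ntrup_key = 1232         # bytes: NTRU Prime public key
    dir_gbps  = 1.0          # Gb/s currently spent on descriptor fetches

    new_md   = avg_md + ntrup_key        # 1648 bytes
    growth   = new_md / avg_md           # ~3.96x, i.e. "roughly quadruple"
    new_dir  = dir_gbps * growth         # ~3.96 Gb/s for directory fetches
    tunneled = 2 * new_dir               # ~7.9 Gb/s once guard-tunneled
                                         # fetches double it again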
[0]: https://blog.cr.yp.to/20140213-ideal.html [1]: https://cryptojedi.org/papers/newhope-20160328.pdf [2]: https://lists.torproject.org/pipermail/tor-dev/2016-May/010903.html [3]: https://groups.google.com/forum/?_escaped_fragment_=topic/cryptanalytic-algo... [4]: https://collector.torproject.org/recent/relay-descriptors/microdescs/micro/2... [5]: https://metrics.torproject.org/dirbytes.html?start=2016-02-16&end=2016-0... [6]: As an aside, I'd be super happy to kick the crap relays out of the network: https://lists.torproject.org/pipermail/tor-dev/2014-September/007558.html
Best Regards,
On 2016-05-16 18:53, isis agora lovecruft wrote:
Hello,
I am similarly excited to see a more comprehensive write up on the NTRU Prime idea from Dan's blog post several years ago on the idea for a subfield-logarithm attack on ideal-lattice-based cryptography. [0] The idea to remove some of the ideal structure from the lattice, while still aiming to keep a similarly high, estimated minimum, post-quantum security strength as Newhope (~2^128 bits post-quantum security and ~2^215 classical for NTRU Prime, versus ~2^206 bits post-quantum and ~2^229 classical for Newhope) and speed efficiencies competitive with NewHope, [1] by altering the original NTRU parameters is very exciting, and I'm looking forward to more research w.r.t. to the ideas of the original blog post (particularly the exploitation of the ideal structure). Additionally, the Toom6 decomposition of the "medium-sized" 768-degree polynomial in NTRU Prime in order to apply Karatsuba is quite elegant. Also, I'm glad to see that my earlier idea [2] to apply a stable sorting network in order to generate valid polynomial coefficients in constant-time is also suggested within the NTRU Prime paper.
However, from the original NTRU Prime blog post, Dan mentioned towards the end: "I don't recommend actually using NTRU Prime unless and until it survives years of serious cryptanalytic attention, including quantitative evaluation of specific parameter choices." Léo Ducas, one of the NewHope authors, has responded to the NTRU Prime paper with a casual cryptanalysis of its security claims, [3] mentioning that "A quick counter-analysis suggests the security of the proposal is overestimated by about 75 bits" bringing NTRU Prime down to ~2^140 classical security.
As you say, I think the security reduction is a bit steep but not catastrophic. However, when I saw the NTRU Prime blog post before, I interpreted it to mean "it's very likely that the powerful attack against the Smart–Vercauteren system can be extended to lattice-based cryptosystems in general, which would completely break them". [0] This brings up another point that digresses from the discussion:
Dan and Tanja support more conservative systems like McEliece because it has survived decades of attacks. In the event that cryptanalysis eliminates lattice crypto, McEliece will remain the only viable and well-studied alternative. Are McEliece key sizes so prohibitive that they can never make it into Tor? Can the size problem be balanced against longer re-keying times for PFS, say once every 6 hours or more instead of every connection (there are probably many other changes needed to accommodate it)? They've worked on making tradeoffs of longer decryption times to get smaller keys in their McBits implementation, [1] but they are still nowhere near lattice ones (McEliece has very fast encoding/decoding, so it works out). With the average webpage being 2 MB in size, larger keys may not be that bad? Another interesting strategy for performance/efficiency is slicing public keys and communicating the slices in parallel. [2]
[snip]
[0] https://blog.cr.yp.to/20140213-ideal.html
[1] https://binary.cr.yp.to/mcbits-20130616.pdf
[2] https://cr.yp.to/talks/2016.02.24/slides-djb-20160224-a4.pdf
bancfc@openmailbox.org transcribed 7.3K bytes:
This brings up another point that digresses from the discussion:
Dan and Tanja support more conservative systems like McEliece because it survived decades of attacks. In the event that cryptanalysis eliminates Lattice crypto, McEliece will remain the only viable and well studied alternative.
First, it's not viable (for Tor's use case). I'll show that in a second. Second, there are other bases for construction of post-quantum secure cryptosystems, not just some lattice problems or problems from coding theory.
How prohibitive are McEliece key sizes that they can never make it into Tor?
Extremely prohibitive. McEliece (using the original scheme proposed by McEliece in 1978 [0] but with the recommended post-quantum secure parameters of n=6960 k=5413 t=119) keys are 1 MB in size. [1]
Plugging this number into my previous email [2] in this thread:
- average microdescriptor size would be ~1048992 bytes (252161% larger!)
- the network would use 5043 Gb/s for directory fetches (this is roughly 33 times the current total estimated capacity of the network)
Result: no more Tor Network.
Can the size problem be balanced against longer re-keying times for PFS - say once every 6 hours or more instead of every connection (there are probably many other changes needed to accomodate it).
No.
Further, there are known attacks on McEliece resulting from ciphertext malleability, i.e. adding codewords to a valid ciphertext yields another valid ciphertext. [3] This results in a trivial CCA2 attack where the adversary can add a second message m' to a ciphertext c with c' = c ⊕ m'Gpub, where Gpub is the product of the matrix G, a generator matrix of the code, and P, the permutation matrix. One consequence of this ciphertext malleability is that an attacker may use the relation between two different messages encrypted to the same McEliece key to recover error bits, leading to the attacker being able to recover the plaintext. [4] Were we to use Shoup's KEM+DEM approach for transforming a public-key encryption scheme into a mechanism for deriving a shared secret (as is done in the NTRU Prime paper), this plaintext-recovery attack would result in the attacker learning the shared secret, meaning that all authentication and secrecy in the handshake are lost completely. There are possible workarounds to the CCA2 attacks (e.g. the Kobara-Imai Gamma conversion), which generally increase both the implementational complexity of the scheme and the number of bytes required to be sent in each direction (by introducing redundancy into the codewords and uniformly-distributed randomness into ciphertexts); however, these are inelegant, kludgey fixes for a system not worth saving because ITS KEYS TAKE UP AN ENTIRE MEGABYTE.
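(A toy numpy illustration of this malleability over GF(2), using a Hamming(7,4) generator matrix in place of a real, disguised Goppa code:)

    import numpy as np

    # Toy 4x7 generator matrix; real McEliece hides a Goppa code.
    G = np.array([[1, 0, 0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

    m1 = np.array([1, 0, 1, 1], dtype=np.uint8)
    m2 = np.array([0, 1, 1, 0], dtype=np.uint8)
    e  = np.array([0, 0, 0, 1, 0, 0, 0], dtype=np.uint8)  # weight-t error, t=1

    c1 = (m1 @ G + e) % 2            # valid ciphertext for m1
    c2 = (c1 + m2 @ G) % 2           # adversary adds m2*Gpub: still valid,
                                     # same error vector, decrypts to m1 xor m2
    assert ((c2 + (m1 ^ m2) @ G) % 2 == e).all()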
They've worked on making tradeoffs of longer decryption times to get smaller keys in their McBits implementation [1] but they still are nowhere near lattice ones (McEliece has very fast encoding/decoding so it works out).
Yes, I'm aware. Also, Peter (in CC and working with me on this proposal) is the other author of the McBits paper. If Peter thought McBits was more suitable than NewHope for a Tor handshake, then I hope he'd have mentioned that by now. :)
Also, for a minimum security of 128 bits, the smallest McBits keysize available is 65 KB; that's still not doable for Tor. (In fact, that would result in 320 Gb/s being used for directory fetches — more than double the current estimated total bandwidth capacity of the network — so thus again there would be no Tor Network.)
With the average webpage being 2 MB in size, larger keys may not be that bad?
Hopefully, everyone is now convinced with the arguments above that, yes, larger keys are that bad.
[0]: http://ipnpr.jpl.nasa.gov/progress_report2/42-44/44N.PDF [1]: http://pqcrypto.eu.org/docs/initial-recommendations.pdf [2]: https://lists.torproject.org/pipermail/tor-dev/2016-May/010952.html [3]: Overbeck, R., Sendrier, N. (2009). "Code-Based Cryptography". in Berstein, D.J., Buchmann, J., Dahmen, E. (Eds.), Post-Quantum Cryptography (pp. 134-136). Berlin: Springer Verlag. https://www.springer.com/us/book/9783540887010 [4]: http://link.springer.com/content/pdf/10.1007%2FBFb0052237.pdf
Best Regards,
On 2016-05-19 15:28, isis agora lovecruft wrote:
bancfc@openmailbox.org transcribed 7.3K bytes:
This brings up another point that digresses from the discussion:
Dan and Tanja support more conservative systems like McEliece because it survived decades of attacks. In the event that cryptanalysis eliminates Lattice crypto, McEliece will remain the only viable and well studied alternative.
First, it's not viable (for Tor's use case). I'll show that in a second. Second, there are other bases for contruction of post-quantum secure cryptosystems — not just some lattice problems or problems from coding theory.
How prohibitive are McEliece key sizes that they can never make it into Tor?
Extremely prohibitive. McEliece (using the original scheme proposed by McEliece in 1978 [0] but with the recommended post-quantum secure parameters of n=6960 k=5413 t=119) keys are 1 MB in size. [1]
Plugging this number into my previous email [2] in this thread:
- average microdescriptor size would be ~1048992 bytes (252161%
larger!)
- the network would use 5043 Gb/s for directory fetches (this is
roughly 33 times the current total estimated capacity of the network)
Result: no more Tor Network.
Can the size problem be balanced against longer re-keying times for PFS - say once every 6 hours or more instead of every connection (there are probably many other changes needed to accomodate it).
No.
Further, there are known attacks on McEliece resulting from ciphertext malleability, i.e. adding codewords to a valid ciphertext yields another valid ciphertext. [3] This results in a trivial CCA2 attack where the adversary can add a second message m' to a ciphertext c with c' = c ⊕ m'Gpub, where Gpub is dot product of matrices G, the generating set of vectors, and P, the permutation matrix. One consequence of this ciphertext malleability is that an attacker may use the relation between two different messages encrypted to the same McEliece key to recover error bits, leading to the attacker being able to recover the plaintext. [4] Were we to use Shoup's KEM+DEM approach for transforming a public-key encryption scheme into a mechanism for deriving a shared secret (as is done in the NTRU Prime paper), this plaintext recovery attack would result in the attacker learning the shared secret, meaning that all authentication and secrecy in the handshake are lost completely. There are possible workarounds to the CCA2 attacks (e.g. Kobara-Imai Gamma Conversion) which generally increase both the implementational complexity of the scheme and increase the number of bytes required to be sent in each direction (by introducing redundancy into the codewords and uniformly-disbursed randomness to ciphertexts), however these are inelegant, kludgey fixes for a system not worth saving because ITS KEYS TAKE UP AN ENTIRE MEGABYTE.
They've worked on making tradeoffs of longer decryption times to get smaller keys in their McBits implementation [1], but they are still nowhere near lattice ones (McEliece has very fast encoding/decoding so it works out).
Yes, I'm aware. Also, Peter (in CC and working with me on this proposal) is the other author of the McBits paper. If Peter thought McBits was more suitable than NewHope for a Tor handshake, then I hope he'd have mentioned that by now. :)
Thanks for explaining, Isis, and hats off to you, Yawning, and Peter for leading the PQ transition.
bancfc@openmailbox.org wrote:
Hi all,
Some great developments in lattice-based crypto. DJB just released a paper on NTRU Prime:
Let me just also throw in my 2 cents:
As far as I can tell, there are now 5 approaches to post-quantum key exchange that are discussed (or at least have surfaced) in this thread:

 - NTRU
 - NewHope
 - NTRUprime
 - McEliece
 - SIDH
In some of the posts I got the impression that there are considerations for some sort of "algorithm agility" in the key exchange. This can mean two things:

 - runtime algorithm agility, such that a client can choose what algorithm to use according to some strategy; or
 - upgrading algorithm agility, which means that for each version number of a client, it is very clear which algorithm it will use, but the key exchange supports (easy) upgrading to better algorithms with increasing version numbers.
In my opinion, the second approach is extremely important, in particular because at the moment we certainly want some sort of PQ key exchange but are quite uncertain about which approach will prove best in the long run. I'm not sure whether anybody actually suggested the first approach, but just in case: I think it's a terrible idea. If the decision of which algorithms to use is left to users, they will certainly end up making worse choices than experts; if it's left to randomness, users inevitably get the worst out of the set from time to time. I simply cannot think of any strategy that lets the user win here.
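As a concrete (hypothetical) sketch of the second flavor, per-version algorithm pinning could look like the Python below; the version numbers and handshake names are invented for illustration and do not correspond to any Tor specification:

    # Hypothetical sketch of "upgrading algorithm agility": every
    # protocol version pins exactly one handshake, so there is no
    # runtime negotiation that a user (or attacker) can influence.
    # Versions and names are invented for illustration only.

    HANDSHAKE_BY_VERSION = {
        1: "ntor",                # X25519 only
        2: "ntor+newhope",        # the hybrid proposed in this thread
        # 3: "ntor+<better-kex>"  # a future upgrade slots in here
    }

    def handshake_for(version):
        """Map a protocol version to its single, fixed handshake."""
        if version not in HANDSHAKE_BY_VERSION:
            raise ValueError("unknown protocol version %d" % version)
        return HANDSHAKE_BY_VERSION[version]

    # A version-2 client always runs the hybrid handshake; there is no
    # user- or attacker-facing choice to get wrong.
    assert handshake_for(2) == "ntor+newhope"

Upgrading then means shipping a new client version with a new table entry, rather than negotiating at runtime.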
Now, out of the 5 proposals, my impression is that McEliece with binary Goppa codes has been killed as an idea for key exchange in Tor and that whenever somebody mentions it again, Isis will just shout "KEYSIZE" to make sure that it remains dead.
I can see good reasons for each of the first three (lattice-based) proposals:
- NTRU has been around for the longest time and has, even with high-security parameters, fairly short messages. However, existing software implementations (at least the ones in SUPERCOP) are highly vulnerable to timing attacks. Key generation (with protection against timing attacks) is, as far as I can tell, fairly expensive.

- NewHope is explicitly designed for ephemeral key exchange (not public-key encryption), i.e., for exactly the operation that is required. It offers the best performance and, as Isis pointed out, the parameters we propose have the largest security margin against all known attacks. Disadvantage (the price to pay for that security margin): somewhat longer messages.

- NTRUprime is much less conservative than NewHope in its parameter choices against known attacks, but offers "defenses" against attacks that may or may not work against NewHope and NTRU. The advertisement of those defenses is a pretty good job by the marketing department: the wording suggests that NTRUprime is, at the same bit-security level, at least as secure as NTRU or NewHope, and then has those defenses as additional safeguards. This is not quite true: NTRUprime operates in a different mathematical structure, which may in the long run prove more secure, but it may also prove less secure, and the "defenses" against hypothetical attacks could turn out to be weaknesses against other attacks. Having said that, I really like the ideas in the NTRUprime paper, and more often than not Dan and Tanja have been right with their intuition for cryptographic design.
What I'm missing in the discussion are benchmarks of NTRU and NTRUprime (in the context of key exchange). For NTRUprime there is not even a proposal for ephemeral key exchange as needed in Tor; however, I assume that it's not too hard to replace NTRU operations (from the NTRU proposal) with NTRUprime operations. Also (please correct me if I'm wrong), for NTRU it's not clear what parameters to use, and there is no optimized and timing-attack-protected implementation. Would it help the discussion at this point to fix NTRU parameters, produce an optimized implementation of NTRU (not production-quality code for Tor, but something to obtain benchmarks), and compare the performance of NewHope, NTRU, and NTRUprime in an ephemeral key exchange? This would be a nice project for a Ph.D. student.
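To make that concrete, a minimal sketch of such a harness follows; keygen, encaps, and decaps are hypothetical placeholders for each scheme's actual operations, since the three implementations share no common API today:

    import time
    import statistics

    # Minimal sketch of an ephemeral key-exchange benchmark harness.
    # Each candidate (NewHope, NTRU, NTRUprime) would be wrapped behind
    # the hypothetical keygen/encaps/decaps interface before comparing.

    def bench(name, keygen, encaps, decaps, iterations=1000):
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            pk, sk = keygen()        # initiator's ephemeral keypair
            ct, key_b = encaps(pk)   # responder derives key, sends ct
            key_a = decaps(sk, ct)   # initiator derives the same key
            samples.append(time.perf_counter() - start)
            assert key_a == key_b
        print("%s: median %.1f us per full exchange over %d runs"
              % (name, statistics.median(samples) * 1e6, iterations))

Wall-clock medians are crude next to SUPERCOP's cycle counts, but measuring all three behind one interface on identical hardware would already be enough to rank them.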
Now, then there is SIDH. I really like SIDH, not so much for the shorter public keys (that's nice, but it's not like a difference of an order of magnitude), but for two other reasons:

 1.) It is a completely different approach, and even if all non-standard lattice crypto falls, SIDH may survive.
 2.) It offers the same functionality as (EC)DH right now: both parties can compute their public keys without having received the public key of the other party. This is a great feature for protocol design and for migrating existing DH-based protocols to a post-quantum setting.

However, I don't think that SIDH is ready for use at the moment. The reason is not only that it's about 300x slower than ECC; the reason is that its security is really not understood at all. While probably everybody in the crypto community expects some progress in lattice attacks over the next few decades, I don't think that anybody seriously expects a polynomial-time algorithm that breaks, for example, Ring-LWE with suitably chosen parameters. A polynomial-time algorithm to break SIDH would clearly be a major breakthrough with an immediate accept at any of the big crypto venues, but I would not want to bet anything important (like my life in freedom) against this happening. I talked to Michael Naehrig (co-author of the Crypto paper this year implementing SIDH) two days ago and mentioned that SIDH popped up in this discussion as a possible approach for Tor. His reaction was roughly along the lines of "We say in our paper that you shouldn't use this, and very certainly not in Tor".
So, I hope that SIDH will survive in the long run, after serious cryptanalytic effort trying to break it. I hope that it will become faster by, say, a factor of 100 (that's not totally inconceivable; pairings took minutes to compute at the beginning, and now we're below a millisecond), and if this happens, then Tor should migrate to SIDH. Not now.
As I said, just my 2 cents.
Cheers,
Peter
isis transcribed 35K bytes:
[...]
BoringSSL now includes a similar hybrid handshake under the name CECPQ1, [0] which is a really horribly crappy name. No offense to the Boring developers, but we're not that boring.
We're calling our handshake RebelAlliance, since it's an alliance between X25519 and NewHope. The proposal has been updated to reflect this.
[0]: https://boringssl-review.googlesource.com/#/c/7962/
Best,