isis <isis@torproject.org> wrote:
> Hi all,
>
> I haven't given it much thought yet, but the performance cost of making
> polynomial initialisation constant time may not actually be so bad. My
> naïve, I'm-still-finishing-my-breakfast solution would be to oversample
> (say, 2048 uint16_ts) from SHAKE-128 and shove them into a stable sorting
> network with a comparator which says "not equal" for x[i] > q and "equal"
> otherwise.
My (hopefully) slightly improved variant of this is the following: take three "squeezes" from SHAKE-128 to obtain 252 16-bit integers and pad to 256. Then use Batcher's odd-even mergesort, with the comparator that Isis described, to move all entries >= 5q (using Yawning's idea to lower the rejection probability) to the end of the array, and take the low 205 entries, which are all smaller than 5q with overwhelming (I still need to do the math here) probability. Do this 5 times and you have enough coefficients for the polynomial (5 * 205 = 1025 >= 1024). Should you run into one of the rare cases where you don't have 205 good entries, simply pick a new seed.
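For concreteness, here is a rough illustrative sketch of the idea in Python (not the actual code, which is vectorized C; the names sample_block and batcher_sort are mine, and the Python "if" in the compare-exchange stands in for what would be a branchless masked swap in real constant-time code):

```python
# Illustrative sketch, assuming NewHope's q = 12289 (so 5q = 61445 < 2^16).
import hashlib

Q = 12289

def compare_exchange(a, i, j):
    # Order by "badness": entries >= 5q sort after entries < 5q.
    # Real constant-time code would do this with masks, not a branch.
    bad_i = a[i] >= 5 * Q
    bad_j = a[j] >= 5 * Q
    if bad_i and not bad_j:
        a[i], a[j] = a[j], a[i]

def batcher_sort(a):
    # Batcher's odd-even mergesort network; len(a) must be a power of two
    # (256 here). The sequence of compare-exchanges is data-independent,
    # which is what makes the approach constant time.
    n = len(a)
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        compare_exchange(a, i + j, i + j + k)
            k //= 2
        p *= 2

def sample_block(seed, block):
    # Squeeze 512 bytes from SHAKE-128 -> 256 16-bit integers (the mail
    # pads 252 from three squeezes to 256; for simplicity this sketch just
    # takes 256 directly), push entries >= 5q to the back, keep the low 205.
    raw = hashlib.shake_128(seed + bytes([block])).digest(512)
    a = [raw[2 * i] | (raw[2 * i + 1] << 8) for i in range(256)]
    batcher_sort(a)
    good = a[:205]
    if any(x >= 5 * Q for x in good):
        return None  # rare failure: caller picks a new seed
    return good
```

Running sample_block five times (with block = 0..4) yields 1025 candidate coefficients, enough for the degree-1024 polynomial.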
The nice thing is that this approach vectorizes easily: a 16x-vectorized version (producing more than enough output for 3 handshakes) takes fewer than 53000 cycles on Haswell. The code is checked into the discard subdirectory of the newhope-tor-testvectors directory. This does not include reduction mod q, but that is also not required inside the computation of NewHope.
Cheers,
Peter