Wide block cipher experiment. - tor-dev

19 Mar 2015


      Hello,
Nickm mentioned to me that he was curious as to how LIONESS performs
these days (See #5460) with modern cryptographic primitives.  I've
conveyed the results to several people, but I'm also sending them here
for posterity.
Code used: https://github.com/yawning/lioness (May be incorrect; don't
use it for anything other than benchmarking.  The numbers were taken
with a previous version of the code, without the initial memcpy, which
was added later so that the code in git could be used by the extremely
brave for other things.)
All measurements taken on an i5-4250U, so the usual caveats about
turboboost and hyperthreading apply.
Baseline (from tests/bench, AES-NI enabled):
  ===== cell_ops =====
   Inbound cells: 231.33 ns per cell. (0.45 ns per byte of payload)
  Outbound cells: 224.39 ns per cell. (0.44 ns per byte of payload)
(Note: Outbound with AES-NI disabled is ~3.0 ns per byte)
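The per-byte figures above are just the per-cell timings divided by the payload length, and the same arithmetic turns a ns/byte figure into a throughput in MiB/s. A minimal sketch of that conversion follows; the function names are made up for illustration, and the 509-byte payload length is an assumption that is consistent with the per-byte numbers quoted above.

```python
# Assumed payload length per cell; 231.33 ns / 509 bytes matches the
# quoted 0.45 ns per byte of payload.
CELL_PAYLOAD_LEN = 509


def ns_per_byte(ns_per_cell, payload_len=CELL_PAYLOAD_LEN):
    """Convert a per-cell timing into nanoseconds per payload byte."""
    return ns_per_cell / payload_len


def mib_per_second(ns_per_byte_value):
    """Convert nanoseconds per byte into throughput in MiB/s."""
    bytes_per_second = 1e9 / ns_per_byte_value
    return bytes_per_second / (1024 * 1024)


if __name__ == "__main__":
    for label, ns_cell in (("Inbound", 231.33), ("Outbound", 224.39)):
        per_byte = ns_per_byte(ns_cell)
        print(f"{label}: {per_byte:.2f} ns/byte, "
              f"{mib_per_second(per_byte):.0f} MiB/s")
```

Quoted figures are rounded, so recomputing a throughput from a rounded ns/byte value can differ by a few MiB/s from a figure derived from the raw timing.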
LIONESS (BLAKE2b/ChaCha, 509 byte block size):
 * ChaCha20:
   * Ted Krovetz's AVX2-ed ChaCha20/Ref AVX BLAKE2b: ~6.6 ns/byte
     (~143 MiB/s)
   * AVX2-ed ChaCha20, Andrew Moon's AVX2-ed Blake2b: ~5.0 ns/byte
     (~190 MiB/s)
 * ChaCha12:
   * Ted Krovetz's AVX2-ed ChaCha12/Ref AVX BLAKE2b: ~6.1 ns/byte
     (~156 MiB/s)
   * AVX2-ed ChaCha12, Andrew Moon's AVX2-ed Blake2b: ~4.4 ns/byte
     (~213 MiB/s)
 * ChaCha8: (Yolo swag 420 blaze it)
   * Ted Krovetz's AVX2-ed ChaCha8/Ref AVX BLAKE2b: ~5.8 ns/byte
     (~164 MiB/s)
   * AVX2-ed ChaCha8, Andrew Moon's AVX2-ed Blake2b: ~4.1 ns/byte
     (~232 MiB/s)
NB: The variant using Andrew Moon's Blake2b isn't in git, because the
way I tested it was kind of kludgey.
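For readers unfamiliar with the construction being benchmarked, LIONESS is a four-round unbalanced Feistel network: the block is split into a short left part and a long right part, and each block costs two stream cipher passes and two keyed hash passes over the right part. The following is a minimal sketch of that structure only, not the code from the repository above. It uses keyed BLAKE2b from Python's hashlib as the hash, but swaps in SHAKE-256 as a stand-in keystream generator because the standard library has no ChaCha. The 32-byte left part, the key layout, and all function names are assumptions of this sketch.

```python
import hashlib

KEY_LEN = 32  # length of the left part; an assumption of this sketch


def _xor(a, b):
    """XOR two equal-length byte strings."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))


def _stream(key, n):
    # Stand-in keystream generator (SHAKE-256), NOT ChaCha.
    return hashlib.shake_256(key).digest(n)


def _keyed_hash(key, data):
    # Keyed BLAKE2b truncated to the size of the left part.
    return hashlib.blake2b(data, key=key, digest_size=KEY_LEN).digest()


def lioness_encrypt(keys, block):
    """Four rounds: stream, hash, stream, hash."""
    k1, k2, k3, k4 = keys
    assert len(block) > KEY_LEN
    left, right = block[:KEY_LEN], block[KEY_LEN:]
    right = _xor(right, _stream(_xor(left, k1), len(right)))
    left = _xor(left, _keyed_hash(k2, right))
    right = _xor(right, _stream(_xor(left, k3), len(right)))
    left = _xor(left, _keyed_hash(k4, right))
    return left + right


def lioness_decrypt(keys, block):
    """Apply the same four rounds in reverse order."""
    k1, k2, k3, k4 = keys
    assert len(block) > KEY_LEN
    left, right = block[:KEY_LEN], block[KEY_LEN:]
    left = _xor(left, _keyed_hash(k4, right))
    right = _xor(right, _stream(_xor(left, k3), len(right)))
    left = _xor(left, _keyed_hash(k2, right))
    right = _xor(right, _stream(_xor(left, k1), len(right)))
    return left + right


if __name__ == "__main__":
    # Fixed demo keys; real keys would come from a key derivation step.
    demo_keys = [bytes([i]) * KEY_LEN for i in range(1, 5)]
    cell = bytes(509)
    sealed = lioness_encrypt(demo_keys, cell)
    assert lioness_decrypt(demo_keys, sealed) == cell
    print(len(sealed), "byte block round-trips")
```

Because every round depends on the output of the previous one, changing any byte of the input affects the whole output block, which is the wide-block property of interest here. It also means the hash and stream passes cannot overlap within a single block.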
Profiler output:
  64.04%  lioness_test_av  lioness_test_avx2  [.] blake2b_compress
  22.43%  lioness_test_av  lioness_test_avx2  [.] chacha_stream_xor
   6.60%  lioness_test_av  lioness_test_avx2  [.] blake2b_init_key
   2.72%  lioness_test_av  lioness_test_avx2  [.] blake2b
   2.41%  lioness_test_av  libc-2.21.so       [.] __memcpy_avx_unaligned
   1.07%  lioness_test_av  lioness_test_avx2  [.] lioness_encrypt_block
Ted Krovetz's ChaCha implementation isn't quite the fastest out there,
but it doesn't lag massively behind Andrew Moon's.  Benchmarks on the
same hardware from Andrew Moon's chacha-opt/blake2b-opt are:
BLAKE2b:
 576 byte(s):
          avx2,  1468.00 cycles per call,   2.5486 cycles/byte
           avx,  1674.00 cycles per call,   2.9062 cycles/byte
           x86,  2020.00 cycles per call,   3.5069 cycles/byte
    generic/64,  2638.00 cycles per call,   4.5799 cycles/byte
ChaCha20:
 576 byte(s):
          avx2,   694.00 cycles per call,   1.2049 cycles/byte
           avx,  1104.00 cycles per call,   1.9167 cycles/byte
         ssse3,  1112.00 cycles per call,   1.9306 cycles/byte
          sse2,  1376.00 cycles per call,   2.3889 cycles/byte
           x86,  2528.00 cycles per call,   4.3889 cycles/byte
       generic,  3200.00 cycles per call,   5.5556 cycles/byte
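The cycles/byte columns above are the cycles per call divided by the 576-byte message length. As a rough way to relate these microbenchmarks to a LIONESS block, the sketch below adds up two hash passes and two stream passes over the right part of the block. This is back-of-the-envelope arithmetic only: it assumes cost scales linearly with length, ignores per-call overhead such as key setup, and assumes a 32-byte left part, none of which is confirmed by the measurements above. The function names are hypothetical.

```python
def cycles_per_byte(cycles_per_call, message_len):
    """Cycles per byte for one call over message_len bytes."""
    return cycles_per_call / message_len


def lioness_cycle_estimate(hash_cpb, stream_cpb, block_len=509, left_len=32):
    """Rough cycles per block: two hash and two stream passes over the
    right part, assuming linear scaling and no per-call overhead.
    left_len=32 is an assumption of this sketch."""
    right_len = block_len - left_len
    return 2 * (hash_cpb + stream_cpb) * right_len


if __name__ == "__main__":
    blake2b_avx2 = cycles_per_byte(1468.00, 576)
    chacha20_avx2 = cycles_per_byte(694.00, 576)
    print(f"BLAKE2b  avx2: {blake2b_avx2:.4f} cycles/byte")
    print(f"ChaCha20 avx2: {chacha20_avx2:.4f} cycles/byte")
    total = lioness_cycle_estimate(blake2b_avx2, chacha20_avx2)
    print(f"Rough estimate: {total:.0f} cycles per 509-byte block")
```

Treat the result as a floor on the primitive cost rather than a prediction; the profiler output shows time going to key setup and copying as well.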
I don't think using CTR-AES (with AES-NI) in a LIONESS construct is
going to be that big of a win, at least on my hardware, and the
numbers I'm seeing feel like too much of a performance hit to me.
Regards,
-- 
Yawning Angel