Hello,
Nickm mentioned to me that he was curious as to how LIONESS performs these days (See #5460) with modern cryptographic primitives. I've conveyed the results to several people, but I'm also sending them here for posterity.
Code used: https://github.com/yawning/lioness (May be incorrect, don't use for anything other than benchmarking. Numbers taken with a previous version of the code without the initial memcpy, that was added later so that the code in git could be used by the extremely brave for other things.)
All measurements taken on an i5-4250U, so the usual caveats about turboboost and hyperthreading apply.
Baseline (from tests/bench, AES-NI enabled):

  ===== cell_ops =====
  Inbound cells: 231.33 ns per cell. (0.45 ns per byte of payload)
  Outbound cells: 224.39 ns per cell. (0.44 ns per byte of payload)
(Note: Outbound with AES-NI disabled is ~3.0 ns per byte)
LIONESS (BLAKE2b/ChaCha, 509 byte block size):

 * ChaCha20:
   * Ted Krovetz's AVX2-ed ChaCha20/Ref AVX BLAKE2b: ~6.6 ns/byte (~143 MiB/s)
   * AVX2-ed ChaCha20, Andrew Moon's AVX2-ed BLAKE2b: ~5.0 ns/byte (~190 MiB/s)
 * ChaCha12:
   * Ted Krovetz's AVX2-ed ChaCha12/Ref AVX BLAKE2b: ~6.1 ns/byte (~156 MiB/s)
   * AVX2-ed ChaCha12, Andrew Moon's AVX2-ed BLAKE2b: ~4.4 ns/byte (~213 MiB/s)
 * ChaCha8: (Yolo swag 420 blaze it)
   * Ted Krovetz's AVX2-ed ChaCha8/Ref AVX BLAKE2b: ~5.8 ns/byte (~164 MiB/s)
   * AVX2-ed ChaCha8, Andrew Moon's AVX2-ed BLAKE2b: ~4.1 ns/byte (~232 MiB/s)
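For anyone wanting to check the throughput figures, here is a minimal Python sketch of the ns/byte to MiB/s conversion. The function name is made up for illustration and is not part of the lioness repository. Since the ns/byte inputs are themselves rounded, the output only agrees with the MiB/s figures listed above to within a few MiB/s.

```python
def ns_per_byte_to_mib_per_s(ns_per_byte):
    """Convert a per-byte cost in nanoseconds to throughput in MiB/s."""
    bytes_per_second = 1e9 / ns_per_byte
    return bytes_per_second / (1024 * 1024)


# ns/byte figures copied from the LIONESS results above.
results = {
    "ChaCha20, Ref AVX BLAKE2b": 6.6,
    "ChaCha20, AVX2 BLAKE2b": 5.0,
    "ChaCha12, Ref AVX BLAKE2b": 6.1,
    "ChaCha12, AVX2 BLAKE2b": 4.4,
    "ChaCha8, Ref AVX BLAKE2b": 5.8,
    "ChaCha8, AVX2 BLAKE2b": 4.1,
}

for name, ns in results.items():
    mib = ns_per_byte_to_mib_per_s(ns)
    print(f"{name}: ~{ns} ns/byte (~{mib:.0f} MiB/s)")
```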
NB: The variant using Andrew Moon's BLAKE2b isn't in git, because the way I tested it was kind of kludgey.
Profiler output:

  64.04%  lioness_test_av  lioness_test_avx2  [.] blake2b_compress
  22.43%  lioness_test_av  lioness_test_avx2  [.] chacha_stream_xor
   6.60%  lioness_test_av  lioness_test_avx2  [.] blake2b_init_key
   2.72%  lioness_test_av  lioness_test_avx2  [.] blake2b
   2.41%  lioness_test_av  libc-2.21.so       [.] __memcpy_avx_unaligned
   1.07%  lioness_test_av  lioness_test_avx2  [.] lioness_encrypt_block
Ted Krovetz's ChaCha implementation isn't quite the fastest out there, but it doesn't lag massively behind Andrew Moon's. Benchmarks on the same hardware from Andrew Moon's chacha-opt/blake2b-opt are:
BLAKE2b:
  576 byte(s):
    avx2,       1468.00 cycles per call, 2.5486 cycles/byte
    avx,        1674.00 cycles per call, 2.9062 cycles/byte
    x86,        2020.00 cycles per call, 3.5069 cycles/byte
    generic/64, 2638.00 cycles per call, 4.5799 cycles/byte
ChaCha20:
  576 byte(s):
    avx2,     694.00 cycles per call, 1.2049 cycles/byte
    avx,     1104.00 cycles per call, 1.9167 cycles/byte
    ssse3,   1112.00 cycles per call, 1.9306 cycles/byte
    sse2,    1376.00 cycles per call, 2.3889 cycles/byte
    x86,     2528.00 cycles per call, 4.3889 cycles/byte
    generic, 3200.00 cycles per call, 5.5556 cycles/byte
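As a rough sanity check on those numbers, the sketch below recomputes cycles/byte from cycles per call and adds up a crude per-byte cost for LIONESS. It assumes the usual LIONESS structure of two stream cipher passes and two keyed hash passes over the large half of the block, and it ignores key setup (blake2b_init_key in the profile), the memcpy, and the small half, so treat the result as a back-of-the-envelope floor rather than a prediction. The helper names are hypothetical.

```python
BLOCK = 576  # bytes per call in the benchmark output above

# Cycles per call, copied from the avx2 rows above.
BLAKE2B_AVX2 = 1468.00
CHACHA20_AVX2 = 694.00


def cycles_per_byte(cycles_per_call, nbytes=BLOCK):
    """Per-byte cost of one primitive call over nbytes of input."""
    return cycles_per_call / nbytes


def lioness_estimate(hash_cpc, stream_cpc, nbytes=BLOCK):
    """Crude estimate: 2 hash passes plus 2 stream cipher passes."""
    per_round_pair = cycles_per_byte(hash_cpc, nbytes) + cycles_per_byte(stream_cpc, nbytes)
    return 2 * per_round_pair


print(f"BLAKE2b avx2:  {cycles_per_byte(BLAKE2B_AVX2):.4f} cycles/byte")
print(f"ChaCha20 avx2: {cycles_per_byte(CHACHA20_AVX2):.4f} cycles/byte")
print(f"LIONESS floor: {lioness_estimate(BLAKE2B_AVX2, CHACHA20_AVX2):.2f} cycles/byte")
```

The hash passes account for roughly two thirds of that estimate, which is at least in the same direction as the profiler output, where blake2b_compress dominates.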
I don't think using CTR-AES (with AES-NI) in a LIONESS construct is going to be that big of a win, at least on my hardware, and the sort of numbers I'm seeing feel like too much of a performance hit to me.
Regards,