Hi all,
I was asked by George to submit my comments about the proposal and to suggest suitable RandomX parameters for this PoW scheme.
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, and returns the number of leading zero bits in it.
This definition makes the effort exponential, i.e. the computational resources required to reach one notch higher effort increase by a factor of 2 each time.
I suggest using a linear effort, defined as the quotient of dividing a bitstring of 1s by the hash value:
== Example A:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort.
With the exponential definition, we could have an equivalent linear effort of either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear definition provides smoother classification of PoW results.
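For illustration, the linear effort function could look like this in C (a minimal sketch; the 8-byte token width is my assumption here, matching the 8-byte hash output discussed later in this thread):

#include <stdint.h>

/* Linear effort: divide the all-ones value by the hash token,
   interpreted as a little-endian unsigned integer. The result is
   the expected number of hashes needed to reach this effort. */
uint64_t effort(const uint8_t *token) {
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)token[i] << (8 * i);
    if (value == 0)
        return UINT64_MAX; /* guard against division by zero */
    return UINT64_MAX / value;
}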
The EXT_FIELD content format is:
POW_VERSION [1 byte]
POW_NONCE [32 bytes]
I suggest using a 16-byte nonce value, which is more than sufficient given the target security level and has the benefit of halving the replay cache size.
Since we expect the seed to be valid for around 3 hours (as proposed), then even if the service receives 1 million proofs per second and each proof has an effort of 1 million, then the number of submitted nonces from clients will only reach about 10^10. With a 108-bit solution space (subtracting 20 bits as the search space per client), the probability that two clients accidentally submit the same nonce is roughly 10^-13 (see [REF_BIRTHDAY]).
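For reference, this follows from the standard birthday approximation p ~ n^2 / (2 * N); a quick check under the stated assumptions:

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 1e10;        /* nonces submitted over one seed lifetime */
    double N = pow(2, 108); /* 128-bit nonce minus 20 bits per client */
    /* birthday approximation for the collision probability */
    printf("p ~ %.1e\n", n * n / (2 * N)); /* prints ~1.5e-13 */
    return 0;
}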
Additionally, I suggest adding the client's effort for convenience:
The updated EXT_FIELD content format would be:
POW_VERSION [1 byte]
POW_EFFORT [4 bytes]
POW_NONCE [16 bytes]
Including the effort has 2 benefits:
1. In case the Intro Priority Queue is full, the service doesn't need to waste time verifying PoW solutions that have effort lower than the last element in the queue. While an attacker can always lie about the actual effort of their nonce value, I think this field can still save some CPU cycles when the service is under heavy load.
2. The service can conveniently verify the reported effort with the following inequality:
POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT
where MAX_RESULT is the highest possible output from the pow_function. In the case of Example A, that would be:
165 * 6317 = 1042305 <= 1048575
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
I suggest using a different selection method rather than always taking the first 20 requests.
Selecting cells from the front of the queue actually minimizes the effort that an attacker needs to expend to completely block access to the service. The attacker can always submit the minimum effort required to occupy the first 20 slots in each 100 ms window. This can be done by submitting just 200 requests per second, observing how many circuits are successfully opened, and adjusting the effort until no other request can go through. This is the "Total overwhelm strategy" described in § 5.1.1.
See the following examples to show how selecting the first N cells from the queue is unfair. I will use N = 4 for clarity. E denotes some value of effort.
== Example B:
               ATTACKER                         LEGITIMATE CLIENTS
 ___________________________________   ______________________________________
/                                   \ /                                      \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^        ^        ^        ^
 selected selected selected selected
Here both the attacker and the legitimate clients have expended a combined effort of 8E. All of the attacker's cells get selected, while none of the other clients get through.
Instead, I suggest using stochastic universal sampling [REF_SUS], where the probability that a cell gets selected is proportional to its effort.
== Example C:
Here the total effort in the queue is 16E, so we first select a random value in the interval [0, 4E) and then select 4 evenly spaced cells:
               ATTACKER                         LEGITIMATE CLIENTS
 ___________________________________   ______________________________________
/                                   \ /                                      \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^                 ^                    ^                   ^
selected          selected           selected            selected
In this case, 2 cells are selected from each group, which is fairer considering that each group expended the same effort.
== Example D:
Now if the attacker wanted to block access to all legitimate clients, they would need to at least quadruple their total PoW effort (and there would still be a chance that a legitimate client gets selected from the end of the queue):
                          ATTACKER                          LEGITIMATE CLIENTS
 ___________________________________________________________    ______________
/                                                           \  /              \
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
|      8E      |      8E      |      8E      |      8E      |E|E|E|E|E|E|E|E|
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
       ^              ^              ^              ^
   selected       selected       selected       selected
Note: With linear effort, the original selection method becomes even worse because the attacker can occupy the front of the queue with an effort of just E+1 per cell rather than 2E.
In some cases, stochastic universal sampling can select a single element multiple times, which is not a problem for genetic algorithms, but we want to avoid it. I suggest restarting the selection algorithm from the beginning of the next cell in those cases and shortening the sampling interval according to the remaining portion of the queue. See the following example.
== Example E:
+----------------+--------+------+-+-+-+-+-+-+-+-+-+-+
|      16E       |   8E   |  6E  |E|E|E|E|E|E|E|E|E|E|
+----------------+--------+------+-+-+-+-+-+-+-+-+-+-+
        ^            ^       ^         ^
    selected     selected selected  selected
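As a concrete illustration, here is a minimal C sketch of this sampling with the restart rule (a hypothetical helper, not from the proposal; rand() stands in for a proper RNG):

#include <stdint.h>
#include <stdlib.h>

/* Select n_select cells with probability proportional to effort,
   restarting with a shortened interval whenever the next pointer
   would land in the cell that was just selected. */
void sus_select(const uint64_t *efforts, size_t n_cells,
                size_t *selected, size_t n_select) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_cells; i++)
        total += efforts[i];

    uint64_t interval = total / n_select;
    uint64_t point = (uint64_t)rand() % interval; /* start in [0, interval) */
    size_t cell = 0;
    uint64_t cell_end = efforts[0]; /* cumulative effort through current cell */

    for (size_t k = 0; k < n_select; k++) {
        while (cell_end <= point) { /* advance to the cell holding 'point' */
            cell++;
            cell_end += efforts[cell];
        }
        selected[k] = cell;
        point += interval;
        if (k + 1 < n_select && point < cell_end) {
            /* restart from the start of the next cell and shorten the
               interval to cover only the remaining portion of the queue */
            interval = (total - cell_end) / (n_select - (k + 1));
            if (interval == 0)
                break; /* remaining effort too small to subdivide */
            point = cell_end + (uint64_t)rand() % interval;
        }
    }
}

Running this over Example E (efforts 16E, 8E, 6E and ten cells of E) selects the 16E, 8E and 6E cells once each plus one of the E cells, as shown above.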
In particular, the service starts with a default suggested-effort value of 15.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service compares the request's effort to the current suggested-effort value. If the new request's effort is lower than the suggested-effort, set the suggested-effort equal to the effort of the new request.
Every time the service trims the priority queue in [HANDLE_QUEUE], the service compares the request at the trim point against the current suggested-effort value. If the trimmed request's effort is higher than the suggested-effort, set the suggested-effort equal to the effort of the trimmed request.
I think the default suggested-effort is a bit too high. Assuming each attempt takes around 1 ms to calculate, the client would need, on average, 2^15 ms = 33 seconds to find a solution using 1 CPU core. I suggest specifying a minimum effort instead. Requests with effort < MIN_EFFORT and requests without the PROOF_OF_WORK extension would be treated as having effort = 1 for the purposes of the sampling algorithm.
Secondly, the proposed method of calculating the suggested-effort is susceptible to gaming by attackers. Since the service can only update the value in the directory once every 'hs-pow-desc-upload-rate-limit' seconds, an attacker could stop the attack for a little while just before the directory gets updated, which would cause the suggested-effort value to be too low despite an ongoing attack.
I suggest taking the median effort of the selected cells during each 100 ms window. For the examples above that would be:

Example C: 1.5E
Example D: 8E
Example E: 7E
Then I would take the median of these values over the directory update period.
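A minimal sketch of the per-window step (the estimate over the directory update period would then apply the same function to the collected window medians):

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median effort of the cells selected in one 100 ms window; with an
   even count the two middle values are averaged (integer division,
   so the 1.5E of Example C would round down). */
uint64_t median_effort(uint64_t *efforts, size_t n) {
    qsort(efforts, n, sizeof(uint64_t), cmp_u64);
    if (n % 2)
        return efforts[n / 2];
    return (efforts[n / 2 - 1] + efforts[n / 2]) / 2;
}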
- Attacker strategies
Additional attacks:
5.1.2 PoW Spam Attack
The attacker may try to spam many requests with random values of POW_NONCE, requiring the service to waste cycles verifying the invalid proofs. No such request would make it into the Intro Queue, but it may still be a viable DoS strategy depending on proof verification time and the number of intro requests that can be practically delivered through the network.
5.1.3 Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer.
4.2. Seed expiration issues
As mentioned in [DESC_POW], the expiration timestamp on the PoW seed can cause issues with clock skewed clients. Furthermore, even clients without clock skew can encounter TOCTOU-style race conditions here.
The client descriptor refetch logic of [CLIENT_TIMEOUT] should take care of such seed-expiration issues, since the client will refetch the descriptor.
I suggest using two concurrent seeds, i.e. accepting PoW both from the current and the last seed epoch. We use this approach in Monero. It would however require adding the seed value into the proof of work extension field and also double the memory requirements for verification with RandomX.
The proposal suggests argon2, and Mike has been looking at RandomX. However, after further consideration and speaking with some people (props to Alex Biryukov), it seems like those two functions are not well suited for this purpose, since they are memory-hard both for the client and the service. Since we are trying to minimize the verification overhead, so that the service can do hundreds of verifications per second, they don't seem like good fits.
Asymmetric functions like Equihash, Cuckoo Cycle [REF_CUCKOO], and MTP have the advantage of being very fast to verify, but they run much faster on GPUs and specialized hardware, so I don't think they are particularly suitable for this purpose.
When we designed RandomX to be used as the PoW algorithm by Monero, we selected the parameters very conservatively to maximize the ASIC resistance of the algorithm. That's because it is very difficult to change the PoW algorithm once it's deployed.
Tor is not limited by this, so we can be much more aggressive when configuring RandomX. I suggest using a configuration that gives the fastest possible verification time without completely breaking the algorithm.
In particular, the following parameters should be set differently from Monero:
RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW input as specified in § 3.2.
RANDOMX_ARGON_ITERATIONS = 1
RANDOMX_CACHE_ACCESSES = 4
RANDOMX_DATASET_BASE_SIZE = 1073741824
RANDOMX_DATASET_EXTRA_SIZE = 16777216
These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows the number of iterations to be reduced from 8 to 4. The combined effect of this is that Dataset initialization becomes 4 times faster, which is needed due to more frequent updates of the seed (Monero updates once per ~3 days).
RANDOMX_PROGRAM_COUNT = 2
RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash calculation about 4 times faster, while still providing resistance against program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are 4 times fewer writes, we also have to reduce the scratchpad size. I suggest to use a 1 MiB scratchpad size as a compromise between scratchpad write density and memory hardness. Most x86 CPUs will perform roughly the same with a 512 KiB and 1024 KiB scratchpad, while the larger size provides higher resistance against specialized hardware, at the cost of possible time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it allows the result to be treated as a little-endian encoded unsigned integer for easy effort calculation.
RandomX would be used as follows:
The service will select a 32-byte POW_SEED and initialize the cache and the dataset:
randomx_cache *myCache = randomx_alloc_cache(flags);
randomx_init_cache(myCache, POW_SEED, 32);

randomx_dataset *myDataset = randomx_alloc_dataset(flags);
randomx_init_dataset(myDataset, myCache, 0, randomx_dataset_item_count());

randomx_vm *myMachine = randomx_create_vm(flags, NULL, myDataset);
Then in order to validate a PoW token, we could use something like this:
int validateProof(uint32_t pow_effort, void* pow_nonce) {
    uint64_t result;
    randomx_calculate_hash(myMachine, pow_nonce, 16, &result);
    /* mulh() is the high 64 bits of the 128-bit product; a zero high
       word means pow_effort * result <= MAX_RESULT (= UINT64_MAX) */
    if (mulh(pow_effort, result) == 0) {
        return 1;
    }
    return 0;
}
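Note that mulh() is not standard C; a portable sketch using the 128-bit integer extension available in GCC and Clang on 64-bit targets:

#include <stdint.h>

/* High 64 bits of the 128-bit product a * b. A zero high word is
   equivalent to a * b <= UINT64_MAX, i.e. the effort claim
   POW_EFFORT * result <= MAX_RESULT holds. */
static inline uint64_t mulh(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}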
I suggest setting MIN_EFFORT = 10000, which takes about 1 second on my laptop. Requests with pow_effort < MIN_EFFORT would not be validated.
I have collected some performance figures for the fast mode with the above RandomX configuration (~1 GiB of memory is required):
H/s = hashes per second
== CPUs:
Intel Core i3-3220
  - 1 thread    700 H/s
  - 3 threads   1400 H/s

Intel Xeon (dual core VPS, Sandy Bridge, unknown model)
  - 1 thread    2000 H/s
  - 2 threads   4000 H/s

Intel Core i5-2500K (stock)
  - 1 thread    2200 H/s
  - 4 threads   8200 H/s

Intel Core i7-8550U (laptop)
  - 1 thread    2700 H/s
  - 8 threads   10000 H/s

Intel Core i7-9850H (laptop)
  - 1 thread    3100 H/s
  - 12 threads  16000 H/s

AMD Ryzen 1700 @ 3300MHz
  - 1 thread    2300 H/s
  - 16 threads  23800 H/s

AMD Ryzen 3700X @ 3300MHz
  - 1 thread    2500 H/s
  - 16 threads  27500 H/s

AMD Epyc 7702P
  - 1 thread    2100 H/s
  - 128 threads 139000 H/s
== GPUs:
NVIDIA GeForce GTX 1660 Ti (credits to SChernykh, see [REF_RANDOMX_CUDA])
  - 3072 intensity  2600 H/s
According to the above results, the time to verify a single hash is around 400-500 μs. A mid-range GPU has performance similar to a single CPU core. Most CPUs made since 2011 have similar per-core performance, except for low-end CPUs without hardware AES support.
References:
[REF_BIRTHDAY]: https://en.wikipedia.org/wiki/Birthday_attack#Mathematics
[REF_SUS]: https://en.wikipedia.org/wiki/Stochastic_universal_sampling
[REF_CUCKOO]: https://github.com/tromp/cuckoo
[REF_RANDOMX_PROGRAMS]: https://github.com/tevador/RandomX/blob/master/doc/design.md#12-the-easy-pro...
[REF_RANDOMX_TMTO]: https://github.com/tevador/RandomX/issues/65
[REF_RANDOMX_CUDA]: https://github.com/SChernykh/RandomX_CUDA
On 08 May, 21:53, tevador tevador@gmail.com wrote:
In particular, the following parameters should be set differently from Monero:
RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW input as specified in § 3.2.
RANDOMX_ARGON_ITERATIONS = 1
RANDOMX_CACHE_ACCESSES = 4
RANDOMX_DATASET_BASE_SIZE = 1073741824
RANDOMX_DATASET_EXTRA_SIZE = 16777216

These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows the number of iterations to be reduced from 8 to 4. The combined effect of this is that Dataset initialization becomes 4 times faster, which is needed due to more frequent updates of the seed (Monero updates once per ~3 days).

RANDOMX_PROGRAM_COUNT = 2
RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash calculation about 4 times faster, while still providing resistance against program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are 4 times fewer writes, we also have to reduce the scratchpad size. I suggest to use a 1 MiB scratchpad size as a compromise between scratchpad write density and memory hardness. Most x86 CPUs will perform roughly the same with a 512 KiB and 1024 KiB scratchpad, while the larger size provides higher resistance against specialized hardware, at the cost of possible time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it allows the result to be treated as a little-endian encoded unsigned integer for easy effort calculation.
I have implemented this in the tor-pow branch of the RandomX repository:
https://github.com/tevador/RandomX/tree/tor-pow
Namely, I have changed the API to return the hash value as a uint64_t and made corresponding changes in the benchmark.
Benchmark example:
./randomx-benchmark --mine --avx2 --jit --largePages \
    --nonces 10000 --seed 1234 --init 1 --threads 1 --batch

RandomX-TOR-v1 benchmark
 - Argon2 implementation: AVX2
 - full memory mode (1040 MiB)
 - JIT compiled mode
 - hardware AES mode
 - large pages mode
 - batch mode
Initializing (1 thread) ...
Memory initialized in 5.32855 s
Initializing 1 virtual machine(s) ...
Running benchmark (10000 nonces) ...
Performance: 2535.43 hashes per second
Best result:
  Nonce: 8bc3ded34d2dcdeed9000000f95cd20c
  Result: d947ceff08750300
  Effort: 18956
  Valid: 1
At the end, it prints out the nonce that gives the highest effort value and validates it.
For the actual implementation in Tor, the RandomX validator should run in a separate thread that doesn't do anything else apart from validation and moving valid requests into the Intro Queue. This way we can reach the maximum performance of ~2000 processed requests per second.
Finally, here are some disadvantages of RandomX-TOR:
1) Fast verification requires ~1 GiB of memory. If we decide to use two overlapping seed epochs, each service will need to allocate >2 GiB of RAM just to verify the PoW. Alternatively, it is possible to use the slow mode, which requires only 256 MiB per seed, but runs 4x slower.

2) The fast mode needs about 5 seconds to initialize every time the seed is changed (can be reduced to under 1 second using multiple threads). The slow mode needs about 0.1 seconds to initialize.

3) RandomX includes a JIT compiler for maximum performance. The iOS operating system doesn't support JIT compilation, so RandomX runs about 10x slower there.

4) The JIT compiler in RandomX is currently implemented only for x86-64 and ARM64 CPU architectures. Other architectures will run very slowly (especially 32-bit systems). However, the two supported architectures cover the vast majority of devices, so this should not be an issue.
I've been working on a custom PoW algorithm specifically for this proposal. It is 10x faster to verify than RandomX-TOR and doesn't require any memory for verification.
Full write-up is here: https://github.com/tevador/equix/blob/master/devlog.md
Especially the comparison table in the Appendix may be of interest to this discussion.
T.
On Sat, May 9, 2020 at 9:38 PM tevador tevador@gmail.com wrote:
I have implemented this in the tor-pow branch of the RandomX repository:
https://github.com/tevador/RandomX/tree/tor-pow
Hi tevador,
On 9 May 2020, at 06:43, tevador tevador@gmail.com wrote:
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, and returns the number of leading zero bits in it.
This definition makes the effort exponential, i.e. the computational resources required to reach one notch higher effort increase by a factor of 2 each time.
I suggest using a linear effort, defined as the quotient of dividing a bitstring of 1s by the hash value:
== Example A:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort.
With the exponential definition, we could have an equivalent linear effort of either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear definition provides smoother classification of PoW results.
There are two possible issues with this design:
Division is expensive on some platforms, including ARM-based devices. But there might be a way to calculate an approximate value without division. (For example, bit shifts, or multiplying by an inverse.) Or we could calculate the maximum value once, and then re-use it.
Is it still possible to express the full range of difficulties? Is that expression reasonably compact?
Some advantages of this exponential distribution are:
- spurious results can be filtered using a single instruction (a bit mask),
- the expected effort is quick and easy to calculate,
- the effort can be expressed in a compact way.
Maybe we don't need some of these properties, and a linear design would be fine.
But if we do, we could change the exponent to the square or cube root of two. There would be a smoother distribution, but a wider range, and the checks would still be reasonably fast.
T
Hi teor,
On Sun, May 10, 2020 at 6:36 AM teor teor@riseup.net wrote:
There are two possible issues with this design:
Division is expensive on some platforms, including ARM-based devices. But there might be a way to calculate an approximate value without division. (For example, bit shifts, or multiplying by an inverse.) Or we could calculate the maximum value once, and then re-use it.
Is it still possible to express the full range of difficulties? Is that expression reasonably compact?
Some advantages of this exponential distribution are:
- spurious results can be filtered using a single instruction (a bit mask),
- the expected effort is quick and easy to calculate,
- the effort can be expressed in a compact way.
Maybe we don't need some of these properties, and a linear design would be fine.
But if we do, we could change the exponent to the square or cube root of two. There would be a smoother distribution, but a wider range, and the checks would still be reasonably fast.
T
You only need 1 or 2 divisions per introduction request, so it doesn't matter even if division is expensive, because it will take a minuscule amount of time compared to the actual hashing effort.
There are 2 scenarios:
1) User wants to wait for X seconds and then submit the best result they could find.
2) User wants to wait as long as it takes to submit a result with an effort of at least E.
In case 1), the client will simply take the smallest result R found during the X seconds and calculate ACTUAL_EFFORT = MAX_RESULT / R at the end.
In case 2), the client will calculate TARGET = MAX_RESULT / E at the beginning and keep hashing until they find a result R <= TARGET. Then they can calculate ACTUAL_EFFORT = MAX_RESULT / R at the end, which implies ACTUAL_EFFORT >= E.
Case 1) takes 1 division instruction, case 2) takes 2 division instructions. When hashing, the client can filter results with a single instruction (comparison).
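A minimal sketch of both cases in C, using the 20-bit MAX_RESULT from Example A (with the 8-byte hash proposed earlier, MAX_RESULT would be UINT64_MAX):

#include <stdint.h>

#define MAX_RESULT 1048575ULL /* 20-bit example value */

/* Case 1): one division at the end, applied to the smallest result
   R found within the time budget. */
uint64_t effort_from_result(uint64_t smallest_result) {
    return MAX_RESULT / smallest_result;
}

/* Case 2): one division up front; the per-hash check R <= TARGET is
   then a single comparison. */
uint64_t target_from_effort(uint64_t min_effort) {
    return MAX_RESULT / min_effort;
}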
Examples:
1) After X seconds, the client finds results 660445, 6317, 599102 ... 111847. The smallest result is 6317, so:
ACTUAL_EFFORT = 1048575 / 6317 = 165
2) The client wants to find a result with an effort of at least E = 165, so they calculate TARGET = 1048575 / 165 = 6355. Then they can keep hashing until they find R <= 6355, e.g. 6317. The actual effort is calculated as above.
So the only advantage of the exponential notation is:
- the effort can be expressed in a compact way.
This can save a few characters in the HS descriptor, at the cost of a coarse effort classification, e.g. clients who spent 60 seconds hashing will be, on average, classified the same as those who spent 100 seconds.
Hi tevador,
On 5/8/20 2:53 PM, tevador wrote:
Including the effort has 2 benefits:
1. In case the Intro Priority Queue is full, the service doesn't need to waste time verifying PoW solutions that have effort lower than the last element in the queue. While an attacker can always lie about the actual effort of their nonce value, I think this field can still save some CPU cycles when the service is under heavy load.

2. The service can conveniently verify the reported effort with the following inequality:

POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT

where MAX_RESULT is the highest possible output from the pow_function. In the case of Example A, that would be:

165 * 6317 = 1042305 <= 1048575
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
I suggest using a different selection method rather than always taking the first 20 requests.

Selecting cells from the front of the queue actually minimizes the effort that an attacker needs to expend to completely block access to the service. The attacker can always submit the minimum effort required to occupy the first 20 slots in each 100 ms window. This can be done by submitting just 200 requests per second, observing how many circuits are successfully opened, and adjusting the effort until no other request can go through. This is the "Total overwhelm strategy" described in § 5.1.1.
Hrm, you seem to have read the original proposal and missed some of the follow-on threads.
We moved away from a 100ms tick-based system into a top-half and bottom-half handler design, which updates the difficulty target as well. I tried to describe this in the top-half and bottom-half handler steps in my reply: https://lists.torproject.org/pipermail/tor-dev/2020-April/014219.html
In short, we let the queue grow at a faster rate than we serve, and we trim it occasionally. Those steps set the descriptor difficulty based on what the service can actually serve from the queue and on the queue trim point. We allow clients to pick a higher difficulty arbitrarily to jump to the head of the queue, if they notice they are still not getting service based on the descriptor difficulty.
This also eliminates the need for a "default" difficulty.
So in order for the attacker to "total overwhelm" that system, don't they have to submit not just 200 requests per second, but *continuously* send requests with higher difficulty than anyone else in the queue, in order to fully deny service?
Are there other reasons to do stochastic sampling over a priority queue, given this top-half and bottom-half design?
Hi Mike,
My apologies. I thought this later email from 14th April had the latest version of the proposal: https://lists.torproject.org/pipermail/tor-dev/2020-April/014225.html
In short, we let the queue grow at a faster rate than we serve, and we trim it occasionally.
What is the benefit of this approach rather than discarding low priority requests right away in the top-half handler?
Note that a priority queue is typically implemented as a heap, which does not support efficient trimming.
However, it still has the following potential issues:
  A) AES will bottleneck us at ~100Mbit-300Mbit at #2 in top-half above
  B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)
I don't think AES is the main concern here. Each introduction request is 512 bytes (AFAIK), so with a PoW verification performance of 2000 requests per second, the top-half handler will bottleneck at ~1 MB/s.
Are there other reasons to do stochastic sampling over a priority queue, given this top-half and bottom-half design?
After thinking about it more, I would recommend starting with a simple priority queue as proposed. More complex solutions can be implemented later if field testing finds issues.
T.
What is the benefit of this approach rather than discarding low priority requests right away in the top-half handler?
Note that a priority queue is typically implemented as a heap, which does not support efficient trimming.
Correct me if I'm wrong.
When a cell with a small effort in the queue has any chance of getting selected, the optimal strategy for a legitimate client would be to compute nonces and send as many nonces as possible until it causes congestion on his network. Instead, when only the cell with the highest effort is processed, sending more than one nonce per connection does no good for a client. We want each legitimate client to send only one nonce per connection.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.
~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1
That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
On Mon, May 18, 2020 at 1:01 PM yoehoduv@protonmail.com wrote:
When a cell with a small effort in the queue has any chance of getting selected, the optimal strategy for a legitimate client would be to compute nonces and send as many nonces as possible until it causes congestion on his network. Instead, when only the cell with the highest effort is processed, sending more than one nonce per connection does no good for a client. We want each legitimate client to send only one nonce per connection.
Sending many requests is not the optimal strategy. One high-effort request would have exactly the same chance of being selected as many low-effort requests with the same total effort. The difference is that with many requests, the client wouldn't know which rendezvous would be selected, so he'd have to waste resources on opening many circuits.
Anyways, I suggest using a priority queue first and seeing how it works in practice. To allow efficient insertion and trimming, the queue can be implemented as a red-black tree.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.

~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1

That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
What you are describing is called a hashtable. The question is: what happens when one of the arrays gets filled up? You would have to discard all additional requests coming into that bucket. With your construction, it's very likely that most requests would end up in just a few buckets and the rest would remain empty (e.g. all buckets for more than 40 leading zeroes).
T.
Sending many requests is not the optimal strategy. One high-effort request would have exactly the same chance of being selected as many low-effort requests with the same total effort. The difference is that with many requests, the client wouldn't know which rendezvous would be selected, so he'd have to waste resources on opening many circuits.
Is solving for a high-effort solution different from solving for a lower-effort solution? If not, many lower-effort solutions will be found while working for a high-effort solution. Then nothing stops a client from making requests with those lower-effort solutions as well. I can expect to find 2 solutions of effort E/2 for every solution of effort E. If I make 3 requests with those solutions, my chance of succeeding doubles, at the cost of 3 times the server's verification effort.
One way is to include the target effort in requests, and include both the server-provided nonce and the target effort as the x in H(x). Then only check that the real effort comes out no less than the target effort, but use the target effort for everything else.
Anyways, I suggest using a priority queue first and seeing how it works in practice. To allow efficient insertion and trimming, the queue can be implemented as a red-black tree.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.

~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1

That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
What you are describing is called a hashtable. The question is: what happens when one of the arrays gets filled up? You would have to discard all additional requests coming into that bucket. With your construction, it's very likely that most requests would end up in just a few buckets and the rest would remain empty (e.g. all buckets for more than 40 leading zeroes).
I was thinking of dynamically sized arrays. Anyway, my point was we don't need to compare 128-bit solutions.
On Sun, Jun 7, 2020 at 8:42 AM yoehoduv@protonmail.com wrote:
One way is to include the target effort in requests, and include both the server-provided nonce and the target effort as the x in H(x). Then only check that the real effort comes out no less than the target effort, but use the target effort for everything else.
That's a very good idea. It would also prevent "lucky" high-effort solutions (unless the client wants to play the lottery).
With the Equi-X puzzle, it would work like this:
C  ... server challenge (32 bytes)
N  ... client nonce (16 bytes)
E  ... client target effort (4 bytes, little endian)
S  ... Equi-X solution (16 bytes)
R  ... hash result (4 bytes, little endian)
|| ... concatenation operator
The client's algorithm:

Input: C
1) Select N, E
2) Calculate S = equix_solve(C || N || E)
3) Calculate R = blake2b(C || N || E || S)
4) If R * E > UINT32_MAX, go back to step 1)
5) Submit C, N, E, S (68 bytes total)
The server's algorithm:

Input: C, N, E, S
1) Check that C is a valid challenge (there may be multiple challenges active at a time).
2) Check that E is above the minimum effort.
3) Check that N hasn't been used before with C.
4) Check equix_verify(C || N || E, S) == EQUIX_OK.
5) Calculate R = blake2b(C || N || E || S).
6) Check R * E <= UINT32_MAX.
7) Put the request in the queue with weight E.
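Since R and E are both 32-bit values, the effort checks in client step 4) and server step 6) need no division; a minimal sketch:

#include <stdint.h>

/* Accept iff R * E <= UINT32_MAX; the 64-bit product of two 32-bit
   values cannot overflow. */
int effort_ok(uint32_t R, uint32_t E) {
    return (uint64_t)R * E <= UINT32_MAX;
}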
Optionally, C could be omitted from the extension field if there is only one global challenge. That would reduce the payload size to 36 bytes.
Note: 32-bit effort should be enough for more than a week of solving with an 8-core CPU.
T.
The client's algorithm: Input: C
Select N, E
Calculate S = equix_solve(C || N || E)
Calculate R = blake2b(C || N || E || S)
if R * E > UINT32_MAX, go back to step 1)
Submit C, N, E, S (68 bytes total)
It looks like all the 40320 permutations of the 16-bit words in S are equix solutions. Are steps 3 to 5 supposed to be repeated for all the permutations?
On Mon, Jun 8, 2020 at 3:09 AM yoehoduv@protonmail.com wrote:
It looks like all the 40320 permutations of the 16-bit words in S are equix solutions. Are steps 3 to 5 supposed to be repeated for all the permutations?
Each unique S provides only one attempt. The indices in S must be ordered in a certain way, otherwise the solution is invalid. See: https://github.com/tevador/equix/blob/master/src/equix.c#L14-L23
Hello,
after reading all the excellent feedback on this thread, I did another revision on this proposal: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:
- Improve attack vector section
- Shrink nonce size on cells to 16 bytes
- Change effort definition to linear
Here are a few things I did not do and might need some help with:
- I did not decide on the PoW function. I think to do this we are missing the scheduler number crunching from dgoulet, and I also need to understand the possible options a bit more. I removed most references to argon2 and replaced them with XXX_POW.
Tevador, thanks a lot for your tailored work on equix. This is fantastic. I have a question that I don't see addressed in your very well written README. In your initial email, you discuss how Equihash does not have good GPU resistance: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
Since equix is using Equihash, isn't this gonna be a problem here too? I'm not too worried about ASIC resistance, since I doubt someone is gonna build ASICs for this problem just yet, but script kiddies with their CS:GO graphics cards attacking equix is something I'm concerned about. I bet you have thought of this, so I'm wondering what your take is here.
Right now I think the possible options are equix or the reduced RandomX (again thanks tevador) or yespower. In theory we could do all three of them and just support different versions; but that means more engineering.
In any case, we are also waiting for some Tor-specific numbers from dgoulet, so we need those before we proceed here.
- In their initial mail, tevador points out an attack where the adversary games the effort estimation logic, by pausing an attack a minute before descriptor upload, so that the final descriptor has a very small target effort. They suggest using the median effort over a long period of time to fix this. Mike, can you check that out and see how we can adapt our logic to fix this?
- In tevador's initial mail, they point out how the cell should include POW_EFFORT and that we should specify a "minimum effort" value instead of just inserting any effort in the pqueue. I can understand how this can have benefits (like the June discussion between tevador and yoehoduv), but I'm also concerned that this can make us more vulnerable to [ATTACK_BOTTOM_HALF] types of attacks, by completely dropping introduction requests instead of queueing them for an abstract future. I wouldn't be surprised if my concerns are invalid and harmful here. Does anyone have intuition?
- tevador suggests we use two seeds, and always accept introductions with the previous seed. I agree this is a good idea, and it's not as complex as I originally thought (I have trauma from the v3 design where we try to support multiple time periods at the same time). However, because this doubles the verification time, I decided to wait for dgoulet's scheduler numbers and until the PoW function is finalized to understand if we can afford the verification overhead.
- Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances of [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast hash with a really low difficulty, and having the verifier check that fast hash first. This means that an attacker trying to flood us with invalid PoW would need to do some work for every proof instead of it being free. This is a decision we should take at the end, after we do some number crunching and see where we are at in terms of verification time and attack models.
Thanks a lot! :)
---
Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft
0. Abstract
This proposal aims to thwart introduction flooding DoS attacks by introducing a dynamic Proof-Of-Work protocol that occurs over introduction circuits.
1. Motivation
So far our attempts at limiting the impact of introduction flooding DoS attacks on onion services have focused on horizontal scaling with Onionbalance, optimizing the CPU usage of Tor, and applying congestion control using rate limiting. While these measures move the goalposts forward, a core problem with onion service DoS is that building rendezvous circuits is a costly procedure both for the service and for the network. For more information on the limitations of rate-limiting when defending against DDoS, see [REF_TLS_1].
If we ever hope to have truly reachable global onion services, we need to make it harder for attackers to overload the service with introduction requests. This proposal achieves this by allowing onion services to specify an optional dynamic proof-of-work scheme that its clients need to participate in if they want to get served.
With the right parameters, this proof-of-work scheme acts as a gatekeeper to block amplification attacks by attackers while letting legitimate clients through.
1.1. Related work
For a similar concept, see the three internet drafts that have been proposed for defending against TLS-based DDoS attacks using client puzzles [REF_TLS].
1.2. Threat model [THREAT_MODEL]
1.2.1. Attacker profiles [ATTACKER_MODEL]
This proposal is written to thwart specific attackers. A simple PoW proposal cannot defend against each and every DoS attack on the Internet, but there are adversary models we can defend against.
Let's start with some adversary profiles:
"The script-kiddie"
The script-kiddie has a single computer and pushes it to its limits. Perhaps it also has a VPS and a pwned server. We are talking about an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We consider the total cost for this attacker to be $0.
"The small botnet"
The small botnet is a bunch of computers lined up to do an introduction flooding attack. Assuming 500 medium-range computers, we are talking about an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider the upfront cost for this attacker to be about $400.
"The large botnet"
The large botnet is a serious operation with many thousands of computers organized to do this attack. Assuming 100k medium-range computers, we are talking about an attacker with total access to 200 THz of CPU and 200 TB of RAM. The upfront cost for this attacker is about $36k.
We hope that this proposal can help us defend against the script-kiddie attacker and small botnets. To defend against a large botnet we would need more tools at our disposal (see [FUTURE_DESIGNS]).
{XXX: Do the above make sense? What other attackers do we care about? What other metrics do we care about? Network speed? I got the botnet costs from here [REF_BOTNET] Back up our claims of defence.}
1.2.2. User profiles [USER_MODEL]
We have attackers and we have users. Here are a few user profiles:
"The standard web user"
This is a standard laptop/desktop user who is trying to browse the web. They don't know how these defences work and they don't care to configure or tweak them. They are gonna use the default values and if the site doesn't load, they are gonna close their browser and be sad at Tor. They run a 2 GHz computer with 4 GB of RAM.
"The motivated user"
This is a user that really wants to reach their destination. They don't care about the journey; they just want to get there. They know what's going on; they are willing to tweak the default values and make their computer do expensive multi-minute PoW computations to get where they want to be.
"The mobile user"
This is a motivated user on a mobile phone. Even though they want to read the news article, they don't have much leeway on stressing their machine to do more computation.
We hope that this proposal will allow the motivated user to always connect where they want to connect to, and also give more chances to the other user groups to reach the destination.
1.2.3. The DoS Catch-22 [CATCH22]
This proposal is not perfect and it does not cover all the use cases. Still, we think that by covering some use cases and giving reachability to the people who really need it, we will severely demotivate the attackers from continuing the DoS attacks and hence stop the DoS threat altogether. Furthermore, by increasing the cost to launch a DoS attack, a big class of DoS attackers will disappear from the map, since the expected ROI will decrease.
2. System Overview
2.1. Tor protocol overview
                                  +----------------------------------+
                                  |                                  |
+-------+ INTRO1  +-----------+ INTRO2  +--------+                   |
|Client |-------->|Intro Point|------->|  PoW   |-----------+        |
+-------+         +-----------+        |Verifier|           |        |
                                  |    +--------+           |        |
                                  |                         |        |
                                  |                         |        |
                                  |              +----------v---------+
                                  |              |Intro Priority Queue|
                                  +---------+--------------------+---+
                                            |        |       |
                                 Rendezvous |        |       |
                                 circuits   |        |       |
                                            v        v       v
The proof-of-work scheme specified in this proposal takes place during the introduction phase of the onion service protocol.
The system described in this proposal is not meant to be on all the time, and should only be enabled by services when under duress. The percentage of clients receiving puzzles can also be configured based on the load of the service.
In summary, the following steps are taken for the protocol to complete:
1) Service encodes PoW parameters in descriptor [DESC_POW]
2) Client fetches descriptor and computes PoW [CLIENT_POW]
3) Client completes PoW and sends results in INTRO1 cell [INTRO1_POW]
4) Service verifies PoW and queues introduction based on PoW effort [SERVICE_VERIFY]
2.2. Proof-of-work overview
2.2.1. Primitives
For our proof-of-work scheme we want to minimize the gap in resources between a motivated attacker and legitimate clients. This means that we are looking to minimize any benefits that GPUs or ASICs can offer to an attacker.
For this reason we chose XXX_POW
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. Instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the more effort they put into their proof-of-work. This is similar to how proof-of-stake works, but instead of staking coins, you stake work.
The benefit here is that legitimate clients who really care about getting access can spend a big amount of effort into their PoW computation, which should guarantee access to the service given reasonable adversary models. See [PARAM_TUNING] for more details about these guarantees and tradeoffs.
As a way to improve reachability and UX, the service tries to estimate the effort needed for clients to get access at any given time and places it in the descriptor. See [EFFORT_ESTIMATION] for more details.
2.2.3. PoW effort
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, interprets it as a bitstring, and returns the quotient of dividing a bitstring of 1s by it.
So for example:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or the same in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort. This is in contrast to the (cheaper) exponential effort definition of taking the number of leading zero bits.
3. Protocol specification
3.1. Service encodes PoW parameters in descriptor [DESC_POW]
This whole protocol starts with the service encoding the PoW parameters in the 'encrypted' (inner) part of the v3 descriptor, as follows:
"pow-params" SP type SP seed-b64 SP expiration-time NL
[At most once]
type: The type of PoW system used. We call the one specified here "v1"
seed-b64: A random seed that should be used as the input to the PoW hash function. Should be 32 random bytes encoded in base64 without trailing padding.
suggested-effort: An unsigned integer specifying an effort value that clients should aim for when contacting the service. See [EFFORT_ESTIMATION] for more details here.
expiration-time: A timestamp in "YYYY-MM-DD SP HH:MM:SS" format after which the above seed expires and is no longer valid as the input for PoW. It's needed so that the size of our replay cache does not grow infinitely. It should be set to three hours in the future (+- some randomness). {TODO: PARAM_TUNING}
{XXX: Expiration time makes us even more susceptible to clock skews, but it's needed so that our replay cache refreshes. How to fix this? See [CLIENT_BEHAVIOR] for more details.}
3.2. Client fetches descriptor and computes PoW [CLIENT_POW]
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
The client parses the descriptor and extracts the PoW parameters. It makes sure that the <expiration-time> has not passed; if it has, the client needs to fetch a new descriptor.
The client should then extract the <suggested-effort> field to configure its PoW 'target' (see [REF_TARGET]). The client SHOULD NOT accept 'target' values that will cause an infinite PoW computation. {XXX: How to enforce this?}
To complete the PoW the client follows the following logic:
a) Client generates 'nonce' as 16 random bytes.
b) Client derives 'seed' by decoding 'seed-b64'.
c) Client derives 'labeled_seed = seed + "TorV1PoW"'.
d) Client computes hash_output = XXX_POW(labeled_seed, nonce).
e) Client checks if effort(hash_output) >= target.
   e1) If yes, success! The client uses 'hash_output' as the puzzle solution and 'nonce' and 'seed' as its inputs.
   e2) If no, fail! The client interprets 'nonce' as a big-endian integer, increments it by one, and goes back to step (d).
At the end of the above procedure, the client should have a triplet (hash_output, seed, nonce) that can be used as the answer to the PoW puzzle. How quickly this happens depends solely on the 'target' parameter.
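A sketch of this loop in C, under the proposal's placeholders (xxx_pow() stands for the undecided XXX_POW function; random_bytes() and effort() are assumed helpers):

#include <stdint.h>
#include <string.h>

extern uint64_t xxx_pow(const uint8_t *labeled_seed, size_t seed_len,
                        const uint8_t *nonce, size_t nonce_len);
extern uint64_t effort(uint64_t hash_output);
extern void random_bytes(uint8_t *buf, size_t len);

void solve_pow(const uint8_t seed[32], uint64_t target,
               uint8_t nonce_out[16], uint64_t *hash_out) {
    uint8_t labeled[40];
    memcpy(labeled, seed, 32);                /* step (b) already done */
    memcpy(labeled + 32, "TorV1PoW", 8);      /* step (c) */

    uint8_t nonce[16];
    random_bytes(nonce, 16);                  /* step (a) */
    for (;;) {
        uint64_t h = xxx_pow(labeled, 40, nonce, 16); /* step (d) */
        if (effort(h) >= target) {            /* step (e1) */
            memcpy(nonce_out, nonce, 16);
            *hash_out = h;
            return;
        }
        /* step (e2): increment nonce as a big-endian integer */
        for (int i = 15; i >= 0 && ++nonce[i] == 0; i--)
            ;
    }
}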
3.3. Client sends PoW in INTRO1 cell [INTRO1_POW]
Now that the client has an answer to the puzzle it's time to encode it into an INTRODUCE1 cell. To do so the client adds an extension to the encrypted portion of the INTRODUCE1 cell by using the EXTENSIONS field (see [PROCESS_INTRO2] section in rend-spec-v3.txt). The encrypted portion of the INTRODUCE1 cell only gets read by the onion service and is ignored by the introduction point.
We propose a new EXT_FIELD_TYPE value:
[01] -- PROOF_OF_WORK
The EXT_FIELD content format is:
POW_VERSION [1 byte]
POW_NONCE [16 bytes]
where:
POW_VERSION is 1 for the protocol specified in this proposal
POW_NONCE is 'nonce' from the section above
This will increase the INTRODUCE1 payload size by 19 bytes, since the extension type and length are 2 extra bytes, the N_EXTENSIONS field is always present and currently set to 0, and the EXT_FIELD is 17 bytes. According to ticket #33650, INTRODUCE1 cells currently have more than 200 bytes available.
3.4. Service verifies PoW and handles the introduction [SERVICE_VERIFY]
When a service receives an INTRODUCE1 with the PROOF_OF_WORK extension, it should check its configuration on whether proof-of-work is required to complete the introduction. If it's not required, the extension SHOULD be ignored. If it is required, the service follows the procedure detailed in this section.
If the service requires the PROOF_OF_WORK extension but received an INTRODUCE1 cell without any embedded proof-of-work, the service SHOULD consider this cell as a zero-effort introduction for the purposes of the priority queue (see section [INTRO_QUEUE]).
3.4.1. PoW verification [POW_VERIFY]
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
1) Make sure that the client's seed is identical to the active seed.
2) Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
3) Verify that 'hash_output =?= XXX_POW(seed, nonce)'.
If any of these steps fail the service MUST ignore this introduction request and abort the protocol.
In this proposal we call the above steps the "top half" of introduction handling. If all the steps of the "top half" have passed, then the circuit is added to the introduction queue as detailed in section [INTRO_QUEUE].
3.4.1.1. Replay protection [REPLAY_PROTECTION]
The service MUST NOT accept introduction requests with the same (seed, nonce) tuple. For this reason a replay protection mechanism must be employed.
The simplest way is to use a simple hash table to check whether a (seed, nonce) tuple has been used before during the active duration of a seed. Depending on how long a seed stays active, this might be a viable solution with reasonable memory/time overhead.
If there is a worry that we might get too many introductions during the lifetime of a seed, we can use a Bloom filter as our replay cache mechanism. The probabilistic nature of Bloom filters means that sometimes we will flag some connections as replays even if they are not; with this false positive probability increasing as the number of entries increases. However, with the right parameter tuning this probability should be negligible and well handled by clients. {TODO: PARAM_TUNING}
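For that parameter tuning, the standard false-positive estimate for a Bloom filter with k hash functions, n inserted entries and m bits is a useful starting point (a sketch, not part of the proposal):

#include <math.h>

/* p ~= (1 - e^(-k*n/m))^k */
double bloom_fp_rate(double k, double n, double m) {
    return pow(1.0 - exp(-k * n / m), k);
}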
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue [ADD_QUEUE]
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into the rendezvous, queues it and prioritizes it based on how much effort the client devoted to the PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service uses the effort() function with hash_output as its input, and uses the output to place the request into the right position of the priority queue: the bigger the effort, the higher the priority in the queue. If two elements have the same effort, the older one has priority over the newer one.
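As an illustration, the ordering could be expressed as a comparator like the following (a sketch; the struct and field names are hypothetical):

#include <stdint.h>

typedef struct {
    uint64_t effort;   /* output of effort() for this request */
    uint64_t arrival;  /* monotonically increasing arrival counter */
    /* ... rest of the introduction request ... */
} intro_req_t;

/* Higher effort first; on equal effort, the older request wins. */
int intro_req_cmp(const intro_req_t *a, const intro_req_t *b) {
    if (a->effort != b->effort)
        return a->effort > b->effort ? -1 : 1;
    return a->arrival < b->arrival ? -1 : 1;
}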
{TODO: PARAM_TUNING: If the priority queue is only ordered based on the effort what attacks can happen in various scenarios? Do we want to order on time+effort? Which scenarios and attackers should we examine here?}
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue. We call this part of introduction handling the "bottom half" because most of the computation happens in this stage.
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
With this tempo, we can process 200 introduction cells per second. {XXX: Is this good?}
After the introduction request is handled from the queue, the service trims the priority queue if the queue is too big. {TODO: PARAM_TUNING: What's the max size of the queue? How do we trim it? Can we use WRED usefully?}
{TODO: PARAM_TUNING: STRAWMAN: This needs hella tuning. Processing 20 cells per 100ms is probably unmaintainable, since each cell is quite expensive: doing so involves path selection, crypto, and building circuits. We will need to profile this procedure and see how we can do this scheduling better.}
3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
During its operation the service continuously keeps track of the received PoW cell efforts to inform its clients of the effort they should put in their introduction to get service. The service informs the clients by using the <suggested-effort> field in the descriptor.
In particular, the service starts with a default suggested-effort value of 5000.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service compares the request's effort to the current suggested-effort value. If the new request's effort is lower than the suggested-effort, set the suggested-effort equal to the effort of the new request.
{XXX tevador attack: see their email https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html where it says "Secondly, the proposed method of calculating..." They suggest using the median here and their "pause-before-desc-publish" attack seems legit.}
Every time the service trims the priority queue in [HANDLE_QUEUE], the service compares the request at the trim point against the current suggested-effort value. If the trimmed request's effort is higher than the suggested-effort, set the suggested-effort equal to the effort of the trimmed request.
The above two operations are meant to balance the suggested effort based on the requests currently waiting in the priority queue. If the priority queue is filled with high-effort requests, make the suggested effort higher. And when all the high-effort requests get handled and the priority queue is back to normal operation, relax the suggested effort to lower levels.
The suggested-effort is not a hard limit to the efforts that are accepted by the service, and it's only meant to serve as a guideline for clients to reduce the number of unsuccessful requests that get to the service. The service still adds requests with lower effort than suggested-effort to the priority queue in [ADD_QUEUE].
{XXX: What attacks are possible here?}
3.4.3.1. Updating descriptor with new suggested effort
When a service changes its suggested-effort value, it SHOULD upload a new descriptor with the new value.
The service should avoid uploading descriptors too often to avoid overwhelming the HSDirs. The service SHOULD NOT upload descriptors more often than once every 'hs-pow-desc-upload-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 300 seconds).
{XXX: Is this too often? Or too rare? Perhaps we can set different limits for when the difficulty goes up and different for when it goes down. It's more important to update the descriptor when the difficulty goes up.}
{XXX: What attacks are possible here? Can the attacker intentionally hit this rate-limit and then influence the suggested effort so that clients do not learn about the new effort?}
4. Client behavior [CLIENT_BEHAVIOR]
This proposal introduces a bunch of new ways in which a legitimate client can fail to reach the onion service.
Furthermore, there is currently no end-to-end way for the onion service to inform the client that the introduction failed. The INTRO_ACK cell is not end-to-end (it's from the introduction point to the client) and hence it does not allow the service to inform the client that the rendezvous is never gonna occur.
For this reason we need to define some client behaviors to work around these issues.
4.1. Clients handling timeouts [CLIENT_TIMEOUT]
Alice can fail to reach the onion service if her introduction request gets trimmed off the priority queue in [HANDLE_QUEUE], or if the service does not get through its priority queue in time and the connection times out.
{XXX: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
This section presents a heuristic method for the client to get service even in such scenarios.
If the rendezvous request times out, the client SHOULD fetch a new descriptor for the service to make sure that it's using the right suggested-effort for the PoW and the right PoW seed. The client SHOULD NOT fetch service descriptors more often than every 'hs-pow-desc-fetch-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 600 seconds).
{XXX: Is this too rare? Too often?}
When the client fetches a new descriptor, it should try connecting to the service with the new suggested-effort and PoW seed. If that doesn't work, it should double the effort and retry. The client should keep on doubling-and-retrying until it manages to get service, or it's able to fetch a new descriptor again.
{XXX: This means that the client will keep on spinning and doubling-and-retrying for a service under this situation. There will never be a "Client connection timed out" page for the user. Is this good? Is this bad? Should we stop doubling-and-retrying after some iterations? Or should we throw a custom error page to the user, and ask the user to stop spinning whenever they want?}
4.2. Seed expiration issues
As mentioned in [DESC_POW], the expiration timestamp on the PoW seed can cause issues for clock-skewed clients. Furthermore, even clients without clock skew can encounter TOCTOU-style race conditions here.
The client descriptor refetch logic of [CLIENT_TIMEOUT] should take care of such seed-expiration issues, since the client will refetch the descriptor.
{XXX: Is this sufficient? Should we have multiple active seeds at the same time similar to how we have overlapping descriptors and time periods in v3? This would solve the problem but it grows the complexity of the system substantially.}
4.3. Other descriptor issues
Another race condition here is if the service enables PoW, while a client has a cached descriptor. How will the client notice that PoW is needed? Does it need to fetch a new descriptor? Should there be another feedback mechanism? {XXX}
5. Attacker strategies [ATTACK_META]
Now that we defined our protocol we need to start tweaking the various knobs. But before we can do that, we first need to understand a few high-level attacker strategies to see what we are fighting against.
5.1.1. Overwhelm PoW verification (aka "Overwhelm top half") [ATTACK_TOP_HALF]
A basic attack here is the adversary spamming with bogus INTRO cells so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the [POW_VERIFY] section.
That's why we need the PoW algorithm to have extremely cheap verification time so that this attack is not possible.
5.1.2. Overwhelm rendezvous capacity (aka "Overwhelm bottom half") [ATTACK_BOTTOM_HALF]
Given the way the introduction queue works (see [HANDLE_QUEUE]), a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the procedure in the [HANDLE_QUEUE] section.
To do so, the attacker would have to send at least 20 high-effort introduction cells every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user" (see [USER_MODEL]).
An easier attack for the adversary is the same strategy but with introduction cells that are all above the comfortable level of "the standard user" (see [USER_MODEL]). This would block out all standard users and only allow motivated users to pass.
5.1.3. Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. An attacker with this attack might be aiming to DoS the service for a limited amount of time, or confuse the difficulty estimation algorithm.
6. Parameter tuning [PARAM_TUNING]
There are various parameters in this system that need to be tuned.
We will first start by tuning the default difficulty of our PoW system. That's gonna define an expected time for attackers and clients to succeed.
We are then gonna tune the parameters of our proof-of-work function. That will define the resources that an attacker needs to spend to overwhelm the onion service, the resources that the service needs to spend to verify introduction requests, and the resources that legitimate clients need to spend to get to the onion service.
6.1. PoW Difficulty settings
The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.
Even tho our system is dynamic, we still need default difficulty settings that will define the metagame. The client and attacker can still aim higher or lower, but for UX purposes and for analysis purposes we do need to define some difficulties.
We hence created the table (see [REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success. The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has at her disposal, ranging from just 5 to 1000.
 ==========================================================================
|           Expected Time (in seconds) Per Success For One Machine        |
 ==========================================================================
|                                                                          |
|   Attacker successes        1       5      10      20      30      50   |
|      per second                                                          |
|                                                                          |
|                    5        5       1       0       0       0       0   |
|                   50       50      10       5       2       1       1   |
|                  100      100      20      10       5       3       2   |
|   Attacker       200      200      40      20      10       6       4   |
|     Boxes        300      300      60      30      15      10       6   |
|                  400      400      80      40      20      13       8   |
|                  500      500     100      50      25      16      10   |
|                 1000     1000     200     100      50      33      20   |
|                                                                          |
 ==========================================================================
Here is how you can read the table above:
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
- If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
- If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
- If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 second getting a single success.
With the above table we can create some profiles for default values of our PoW difficulty. So for example, we can use the last case as the default parameter for Tor Browser, and then create three more profiles for more expensive cases, scaling up to the first case, which would be the hardest since the client is expected to spend 15 minutes for a single introduction.
{TODO: PARAM_TUNING You can see that this section is completely CPU/memory agnostic, and it does not take into account potential optimizations that can come from GPU/ASICs. This is intentional so that we don't put more variables into this equation right now, but as this proposal moves forward we will need to put more concrete values here.}
6.2. XXX_POW parameters [ARGON_PARAMS]
We now need to define the secondary argon2 parameters as defined in [REF_ARGON2]. This includes the number of lanes 'p', the memory size 'm', and the number of iterations 't'. Section 9 of [REF_ARGON2] recommends an approach for tuning these parameters.
To tune these parameters we are looking to *minimize* the verification time of an onion service, while *maximizing* the scarce resources spent by an adversary trying to overwhelm the service using [ATTACK_META].
When it comes to verification speed, verifying a single introduction cell takes a single argon2 call, so the service will need to do hundreds of those per second as INTRODUCE2 cells arrive. The service will have to do this verification step even for very cheap, zero-effort PoWs, so verification has to be cheap enough not to become a DoS vector of its own. Hence each individual argon2 call must be cheap enough to be done comfortably and plentifully by an onion service on a single host (or horizontally scaled with Onionbalance).
At the same time, the adversary will have to do thousands of these calls if she wants to produce a high-effort PoW, so it's this asymmetry that we are looking to exploit here. Right now, the most expensive resource for adversaries is RAM size, and that's why we chose argon2, which is memory-hard.
To minmax this game we will need concrete benchmarks:
{TODO: PARAM_TUNING: I've had a hard time minmaxing this game for argon2. Even argon2 invocations with a small memory parameter will take multiple milliseconds to run on my machine, and the parameters recommended in section 8 of the paper all take many hundreds of milliseconds. This is just not practical for our use case, since we want to process hundreds of such PoW per second... I also did not manage to find a benchmark of argon2 calls for different CPU/GPU/FPGA configurations.}
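For reference, here is a rough way to reproduce such per-call timings using the argon2-cffi Python binding (the parameter values below are purely illustrative, not recommendations):

  import time
  from argon2.low_level import hash_secret_raw, Type  # pip install argon2-cffi

  def ms_per_call(time_cost=1, memory_cost=256, parallelism=1, rounds=100):
      # Times 'rounds' raw argon2id calls and returns the mean cost in
      # milliseconds. memory_cost is in KiB.
      start = time.perf_counter()
      for i in range(rounds):
          hash_secret_raw(secret=i.to_bytes(4, 'big'), salt=b'\x00' * 16,
                          time_cost=time_cost, memory_cost=memory_cost,
                          parallelism=parallelism, hash_len=32, type=Type.ID)
      return (time.perf_counter() - start) * 1000 / rounds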
7. Discussion
7.1. UX
This proposal has user facing UX consequences.
Here are some UX improvements that don't need user-input:
- Primarily, there should be a way for Tor Browser to display to users that additional time (and resources) will be needed to access a service that is under attack. Depending on the design of the system, it might even be possible to estimate how much time it will take.
And here are a few UX approaches that will need user-input and come with increasing engineering difficulty. Ideally this proposal will not need user-input and the default behavior should work for almost all cases.
a) Tor Browser needs a "range field" which the user can use to specify how much effort they want to spend in PoW if this ever occurs while they are browsing. The ranges could be from "Easy" to "Difficult", or we could try to estimate time using an average computer. This setting is in the Tor Browser settings and users need to find it.
b) We start with a default effort setting, and then we use the new onion errors (see #19251) to estimate when an onion service connection has failed because of DoS, and only then we present the user a "range field" which they can set dynamically. Detecting when an onion service connection has failed because of DoS can be hard because of the lack of feedback (see [CLIENT_BEHAVIOR])
c) We start with a default effort setting, and if things fail we automatically try to figure out an effort setting that will work for the user by doing some trial-and-error connections with different effort values. Until the connection succeeds we present a "Service is overwhelmed, please wait" message to the user.
7.2. Future work [FUTURE_WORK]
7.2.1. Incremental improvements to this proposal
There are various improvements that can be done in this proposal, and while we are trying to keep this v1 version simple, we need to keep the design extensible so that we can build more features into it. In particular:
- End-to-end introduction ACKs
This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs many of the problems from [CLIENT_BEHAVIOR] would be alleviated. The problem here is that end-to-end ACKs require modifications on the introduction point code and a network update, which is a lengthy process.
- Multithreading scheduler
Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support we could handle a much greater amount of introduction requests per second.
7.2.2. Future designs [FUTURE_DESIGNS]
This is just the beginning in DoS defences for Tor and there are various future designs and schemes that we can investigate. Here is a brief summary of these:
"More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 cells. See #31223 for more details.
"Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See [REF_CREDS] for more details.
"PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.
All of the above approaches are much more complicated than this proposal, and hence we want to start easy before we get into more serious projects.
7.3. Environment
We love the environment! We are concerned about how PoW schemes can waste energy by doing useless hash iterations. Here are a few reasons we still decided to pursue a PoW approach here:
"We are not making things worse" -- DoS attacks are already happening and attackers are already burning energy to carry them out both on the attacker side, on the service side and on the network side. We think that asking legitimate clients to carry out PoW computations is not gonna affect the equation too much, since an attacker right now can very quickly cause the same damage that hundreds of legitimate clients do a whole day.
"We hope to make things better" -- The hope is that proposals like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see [CATCH22] for a similar argument.
8. Acknowledgements
Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.
Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.
9. References
[REF_ARGON2]: https://github.com/P-H-C/phc-winner-argon2/blob/master/argon2-specs.pdf
              https://password-hashing.net/#argon2
[REF_TABLE]: The table is based on the script below plus some manual editing for readability:
             https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04
[REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01...
[REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html
[REF_TARGET]: https://en.bitcoin.it/wiki/Target
[REF_TLS]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
           https://tools.ietf.org/id/draft-nir-tls-puzzles-00.html
           https://tools.ietf.org/html/draft-ietf-ipsecme-ddos-protection-10
[REF_TLS_1]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
Hi George,
Thanks for the update.
On Wed, Jun 10, 2020 at 2:05 PM George Kadianakis desnacked@riseup.net wrote:
Tevador, thanks a lot for your tailored work on equix. This is fantastic. I have a question that I don't see addressed in your very well written README. In your initial email, we discuss how Equihash does not have good GPU resistance: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
Since equix is using Equihash isn't this gonna be a problem here too? I'm not too worried about ASIC resistance since I doubt someone is gonna build ASICs for this problem just yet, but script kiddies with their CS:GO graphics cards attacking equix is something I'm concerned about. I bet you have thought of this, so I'm wondering what's your take here.
Equihash runs much faster on GPUs only if the memory requirements exceed the size of the CPU cache. This is the case for most Equihash variants that are in use by cryptocurrencies (e.g. 200,9 and 144,5), but doesn't apply to Equi-X, which uses only 2 MB.
The GPU resistance of Equi-X is based on 2 facts:

1) Each solution attempt uses a different hash function. GPUs cannot compile new kernels fast enough (it would require >1000 kernels per second), so they have to run in emulator mode, which is much slower. GPUs are also impacted by thread divergence.

2) The entire sorting phase fits into CPU cache, so CPUs can benefit from memory bandwidth comparable to GPUs (~500 GB/s).
In tevador's initial mail, they point out how the cell should include POW_EFFORT and that we should specify a "minimum effort" value instead of just inserting any effort in the pqueue. I can understand how this can have benefits (like the June discussion between tevador and yoehoduv) but I'm also concerned that this can make us more vulnerable to [ATTACK_BOTTOM_HALF] types of attacks, by completely dropping introduction requests instead of queueing them for an abstract future. I wouldn't be surprised if my concerns are invalid and harmful here. Does anyone have intuition?
I don't see any downsides to including the PoW effort in the cell and specifying the minimum effort. The minimum effort is needed to reduce the verification overhead and to ensure that the queue doesn't grow indefinitely. If the effort of the introduction request is lower than the minimum, the service can simply treat it like a request without any PoW (saving a verification call). The exact outcome would depend on the circumstances; normally the request would go to the back of the queue.
I suggest a minimum effort equivalent to ~1 second of CPU time and adjust it upward if the queue is full. This is more efficient than trimming.
The size of the queue should be enough to handle short attack bursts without dropping any requests. I'd say that a reasonable maximum queue size is {bottom half throughput} * {number of seconds the client will wait before retrying}. There is no point in queueing up more requests than this because the client will have already given up on the request by the earliest time it could be serviced.
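As a purely illustrative application of that formula: with the 200 introductions per second assumed elsewhere in this proposal (20 per 100ms tick) and clients that wait 'hs-pow-desc-fetch-rate-limit' = 600 seconds before refetching and retrying, the maximum useful queue size would be about 200 * 600 = 120000 requests.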
- tevador suggests we use two seeds, and always accept introductions with the previous seed. I agree this is a good idea, and it's not as complex as I originally thought (I have trauma from the v3 design where we try to support multiple time periods at the same time). However, because this doubles the verification time, I decided to wait for dgoulet's scheduler numbers and until the PoW function is finalized to understand if we can afford the verification overhead.
There is no verification overhead if the seed is included in the request. If additional 32 bytes are too much, the request can include e.g. the first 4 bytes of the seed. This is enough for the service to select the correct seed from the two active ones. The chance that two subsequent seeds have the same first 32 bits is negligible (and can be even avoided completely).
- Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances of [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast hash with a really low difficulty, and having the verifier check that fast hash PoW first. This means that an attacker trying to flood us with invalid PoW would need to do some work for every PoW instead of it being free. This is a decision we should take at the end, after we do some number crunching and see where we are at in terms of verification time and attack models.
This trick is mostly relevant to slow-to-verify algorithms, but can be also applied to Equi-X by reordering the server algorithm steps:
Input: C, N, E, S

1) Check that C is a valid seed (there may be multiple seeds active at a time).
2) Check that E is above the minimum effort.
3) Check that N hasn't been used before with C.
4) Calculate R = blake2b(C || N || E || S).
5) Check R * E <= UINT32_MAX.
6) Check equix_verify(C || N || E, S) == EQUIX_OK.
7) Put the request in the queue with weight E.
Simple spam attacks will fail at step 5, avoiding the call to equix_verify.
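Here is a minimal Python sketch of that ordering. Note the assumptions: equix_verify() stands in for a hypothetical binding to the Equi-X library, E is encoded as 4 big-endian bytes like the POW_EFFORT field, and R is taken to be the first 4 bytes of the blake2b digest (the exact truncation is not specified above):

  import hashlib

  UINT32_MAX = 2**32 - 1

  def handle_request(C, N, E, S, active_seeds, replay_cache, pqueue,
                     min_effort, equix_verify):
      # Cheap checks run first; the expensive equix_verify() call comes last.
      if C not in active_seeds:              # 1) valid seed?
          return False
      if E < min_effort:                     # 2) declared effort high enough?
          return False
      if (C, N) in replay_cache:             # 3) nonce replay?
          return False
      challenge = C + N + E.to_bytes(4, 'big')
      # 4) One fast blake2b hash gates the expensive verification.
      R = int.from_bytes(hashlib.blake2b(challenge + S).digest()[:4], 'big')
      if R * E > UINT32_MAX:                 # 5) spam fails here, cheaply
          return False
      if not equix_verify(challenge, S):     # 6) full Equi-X verification
          return False
      replay_cache.add((C, N))
      pqueue.append((E, (C, N, S)))          # 7) queue weighted by effort E
      return True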
However, I would not overly rely on this feature because even a single GPU can provide enough fast hashes (5-10 GH/s) for [ATTACK_TOP_HALF] on schemes like Yespower, Argon2 or RandomX. It works for Ethash because the required effort is determined by the computing power of the entire Ethereum network, so the attacker would need to compute billions of fast hashes even to pass the first check.
Hello there,
here is another round of PoW revisions: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:

- Actually used tevador's EquiX scheme as our PoW scheme for now. This is still tentative, but I needed some ingredients to cook with so I went for it.
- Fold in David's performance measurements and use them to get some guesstimates on the default PoW difficulty etc.
- Enable overlapping seed system.
- Enrich the attack section of the proposal some more.
- Attempt to fix an effort estimation attack pointed out by tevador.
- Added a bunch of "BLOCKER" tags around the proposal for things that we need to figure out or at least have some good intuition about if we want to have guarantees that the proposal can work before we start implementing.
Here is what needs to happen next:
- David's performance measurements have been really useful, but they open a bunch of questions on auxiliary overheads. We are now performing more experiments to confirm the performance numbers we got and make sure we are not overshooting. I noted these issues down as BLOCKER in the proposal. While doing so we also found a pretty serious bug with our scheduler that we are trying to fix: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006
- Did not have time to think about the priority queue's max size. I added a BLOCKER about this in the [HANDLE_QUEUE] section.
- Did not have time to think about a minimum effort feature on the queue. I guess this also depends on the scheduler.
- Need to think more about the effort estimation logic and make sure that it can't backfire big time.
- Need to kill all the XXXs, TODOs and BLOCKERs.
Also, tevador let me know if you'd like me to add you as a co-author on the proposal based on all your great feedback so far.
This is looking more and more plausible but let's wait for more data before we seal the deal.
Thanks for all the feedback and looking forward to more!
---
Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft
0. Abstract
This proposal aims to thwart introduction flooding DoS attacks by introducing a dynamic Proof-Of-Work protocol that occurs over introduction circuits.
1. Motivation
So far our attempts at limiting the impact of introduction flooding DoS attacks on onion services have been focused on horizontal scaling with Onionbalance, optimizing the CPU usage of Tor and applying congestion control using rate limiting. While these measures move the goalpost forward, a core problem with onion service DoS is that building rendezvous circuits is a costly procedure both for the service and for the network. For more information on the limitations of rate-limiting when defending against DDoS, see [REF_TLS_1].
If we ever hope to have truly reachable global onion services, we need to make it harder for attackers to overload the service with introduction requests. This proposal achieves this by allowing onion services to specify an optional dynamic proof-of-work scheme that its clients need to participate in if they want to get served.
With the right parameters, this proof-of-work scheme acts as a gatekeeper to block amplification attacks by attackers while letting legitimate clients through.
1.1. Related work
For a similar concept, see the three internet drafts that have been proposed for defending against TLS-based DDoS attacks using client puzzles [REF_TLS].
1.2. Threat model [THREAT_MODEL]
1.2.1. Attacker profiles [ATTACKER_MODEL]
This proposal is written to thwart specific attackers. A simple PoW proposal cannot defend against each and every DoS attack on the Internet, but there are adversary models we can defend against.
Let's start with some adversary profiles:
"The script-kiddie"
The script-kiddie has a single computer and pushes it to its limits. Perhaps it also has a VPS and a pwned server. We are talking about an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We consider the total cost for this attacker to be zero dollars.
"The small botnet"
The small botnet is a bunch of computers lined up to do an introduction flooding attack. Assuming 500 medium-range computers, we are talking about an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider the upfront cost for this attacker to be about $400.
"The large botnet"
The large botnet is a serious operation with many thousands of computers organized to do this attack. Assuming 100k medium-range computers, we are talking about an attacker with total access to 200 THz of CPU and 200 TB of RAM. The upfront cost for this attacker is about $36k.
We hope that this proposal can help us defend against the script-kiddie attacker and small botnets. To defend against a large botnet we would need more tools at our disposal (see [FUTURE_DESIGNS]).
1.2.2. User profiles [USER_MODEL]
We have attackers and we have users. Here are a few user profiles:
"The standard web user"
This is a standard laptop/desktop user who is trying to browse the web. They don't know how these defences work and they don't care to configure or tweak them. They are gonna use the default values and if the site doesn't load, they are gonna close their browser and be sad at Tor. They run a 2 GHz computer with 4 GB of RAM.
"The motivated user"
This is a user that really wants to reach their destination. They don't care about the journey; they just want to get there. They know what's going on; they are willing to tweak the default values and make their computer do expensive multi-minute PoW computations to get where they want to be.
"The mobile user"
This is a motivated user on a mobile phone. Even tho they want to read the news article, they don't have much leeway on stressing their machine to do more computation.
We hope that this proposal will allow the motivated user to always connect where they want to connect to, and also give more chances to the other user groups to reach the destination.
1.2.3. The DoS Catch-22 [CATCH22]
This proposal is not perfect and it does not cover all the use cases. Still, we think that by covering some use cases and giving reachability to the people who really need it, we will severely demotivate the attackers from continuing the DoS attacks and hence stop the DoS threat all together. Furthermore, by increasing the cost to launch a DoS attack, a big class of DoS attackers will disappear from the map, since the expected ROI will decrease.
2. System Overview
2.1. Tor protocol overview
                                       +----------------------------------+
                                       |                                  |
+-------+ INTRO1  +-----------+ INTRO2 +--------+                         |
|Client |-------->|Intro Point|------->|  PoW   |-----------+             |
+-------+         +-----------+        |Verifier|           |             |
                                       +--------+           |             |
                                       |                    |             |
                                       |                    |             |
                                       |         +----------v---------+   |
                                       |         |Intro Priority Queue|   |
                                       +---------+--------------------+---+
                                                    |      |      |
                                        Rendezvous  |      |      |
                                        circuits    |      |      |
                                                    v      v      v
The proof-of-work scheme specified in this proposal takes place during the introduction phase of the onion service protocol.
The system described in this proposal is not meant to be on all the time, and should only be enabled by services when under duress. The percentage of clients receiving puzzles can also be configured based on the load of the service.
In summary, the following steps are taken for the protocol to complete:
1) Service encodes PoW parameters in descriptor [DESC_POW]
2) Client fetches descriptor and computes PoW [CLIENT_POW]
3) Client completes PoW and sends results in INTRO1 cell [INTRO1_POW]
4) Service verifies PoW and queues introduction based on PoW effort [SERVICE_VERIFY]
2.2. Proof-of-work overview
2.2.1. Primitives
For our proof-of-work function we will use the 'equix' scheme by tevador [REF_EQUIX]. Equix is an asymmetric PoW function based on Equihash<60,3>. It features lightning-fast verification speed, and also aims to minimize the asymmetry between CPU and GPU performance. Furthermore, it's designed for this particular use-case and hence cryptocurrency miners are not incentivized to make optimized ASICs for it.
{TODO: define verification/proof interface.}
We tune equix in section [EQUIX_TUNING].
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. Hence, instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the higher effort they put into their proof-of-work. This is similar to how proof-of-stake works but instead of staking coins, you stake work.
The benefit here is that legitimate clients who really care about getting access can spend a big amount of effort into their PoW computation, which should guarantee access to the service given reasonable adversary models. See [PARAM_TUNING] for more details about these guarantees and tradeoffs.
As a way to improve reachability and UX, the service tries to estimate the effort needed for clients to get access at any given time and places it in the descriptor. See [EFFORT_ESTIMATION] for more details.
2.2.3. PoW effort
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, interprets it as a bitstring, and returns the quotient of dividing a bitstring of 1s by it.
So for example:

  effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101

or the same in decimal:

  effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort. This is in contrast to the (cheaper) exponential effort definition of taking the number of leading zero bits.
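As a sketch, assuming the hash output is interpreted as a big-endian integer:

  def effort(token: bytes) -> int:
      # All-ones bitstring of the same length divided by the hash value,
      # i.e. the expected number of hash attempts behind this output.
      max_result = (1 << (8 * len(token))) - 1
      return max_result // max(int.from_bytes(token, 'big'), 1)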
3. Protocol specification
3.1. Service encodes PoW parameters in descriptor [DESC_POW]
This whole protocol starts with the service encoding the PoW parameters in the 'encrypted' (inner) part of the v3 descriptor. As follows:
"pow-params" SP type SP seed-b64 SP expiration-time NL
[At most once]
type: The type of PoW system used. We call the one specified here "v1"
seed-b64: A random seed that should be used as the input to the PoW hash function. Should be 32 random bytes encoded in base64 without trailing padding.
suggested-effort: An unsigned integer specifying an effort value that clients should aim for when contacting the service. See [EFFORT_ESTIMATION] for more details here.
expiration-time: A timestamp in "YYYY-MM-DD SP HH:MM:SS" format after which the above seed expires and is no longer valid as the input for PoW. It's needed so that the size of our replay cache does not grow infinitely. It should be set to RAND_TIME(now+7200, 900) seconds.
The service should refresh its seed when expiration-time passes. The service SHOULD keep its previous seed in memory and accept PoWs using it to avoid race conditions with clients that have an old seed. The service SHOULD avoid generating two consecutive seeds that share a common 4-byte prefix. See [INTRO1_POW] for more info.
By RAND_TIME(ts, interval) we mean a time between ts-interval and ts, chosen uniformly at random.
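Or, as a one-line sketch:

  import random

  def rand_time(ts, interval):
      # A time between ts-interval and ts, chosen uniformly at random.
      return ts - random.randrange(interval + 1)

  # e.g. expiration_time = rand_time(now + 7200, 900)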
3.2. Client fetches descriptor and computes PoW [CLIENT_POW]
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
The client parses the descriptor and extracts the PoW parameters. It makes sure that the <expiration-time> has not passed; if it has, the client needs to fetch a new descriptor.
The client should then extract the <suggested-effort> field to configure its PoW 'target' (see [REF_TARGET]). The client SHOULD NOT accept 'target' values that will cause an infinite PoW computation. {XXX: How to enforce this?}
To complete the PoW the client follows the following logic:
a) Client generates 'nonce' as 16 random bytes.
b) Client derives 'seed' by decoding 'seed-b64'.
c) Client derives 'labeled_seed = seed + "TorV1PoW"'.
d) Client computes hash_output = XXX_POW(labeled_seed, nonce).
e) Client checks if effort(hash_output) >= target.
   e1) If yes, success! The client uses 'hash_output' as the puzzle solution and 'nonce' and 'seed' as its inputs.
   e2) If no, fail! The client interprets 'nonce' as a big-endian integer, increments it by one, and goes back to step (d).
At the end of the above procedure, the client should have a triplet (hash_output, seed, nonce) that can be used as the answer to the PoW puzzle. How quickly this happens depends solely on the 'target' parameter.
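The loop above, sketched in Python (pow_function stands in for the not-yet-finalized XXX_POW primitive, and effort() is the function from section 2.2.3):

  import os

  def solve(seed: bytes, target: int, pow_function):
      labeled_seed = seed + b"TorV1PoW"                         # step (c)
      nonce = int.from_bytes(os.urandom(16), 'big')             # step (a)
      while True:
          nonce_bytes = nonce.to_bytes(16, 'big')
          hash_output = pow_function(labeled_seed, nonce_bytes) # step (d)
          if effort(hash_output) >= target:                     # step (e)
              return hash_output, seed, nonce_bytes             # (e1) success
          nonce = (nonce + 1) % (1 << 128)                      # (e2) retry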
3.3. Client sends PoW in INTRO1 cell [INTRO1_POW]
Now that the client has an answer to the puzzle it's time to encode it into an INTRODUCE1 cell. To do so the client adds an extension to the encrypted portion of the INTRODUCE1 cell by using the EXTENSIONS field (see [PROCESS_INTRO2] section in rend-spec-v3.txt). The encrypted portion of the INTRODUCE1 cell only gets read by the onion service and is ignored by the introduction point.
We propose a new EXT_FIELD_TYPE value:
[01] -- PROOF_OF_WORK
The EXT_FIELD content format is:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_SEED [4 bytes]
where:
POW_VERSION is 1 for the protocol specified in this proposal
POW_NONCE is 'nonce' from the section above
POW_SEED is the first 4 bytes of the seed used
This will increase the INTRODUCE1 payload size by 23 bytes since the extension type and length take 2 extra bytes, the N_EXTENSIONS field is always present and currently set to 0, and the EXT_FIELD is 21 bytes (1 + 16 + 4). According to ticket #33650, INTRODUCE1 cells currently have more than 200 bytes available.
3.4. Service verifies PoW and handles the introduction [SERVICE_VERIFY]
When a service receives an INTRODUCE1 with the PROOF_OF_WORK extension, it should check its configuration on whether proof-of-work is required to complete the introduction. If it's not required, the extension SHOULD be ignored. If it is required, the service follows the procedure detailed in this section.
If the service requires the PROOF_OF_WORK extension but received an INTRODUCE1 cell without any embedded proof-of-work, the service SHOULD consider this cell as a zero-effort introduction for the purposes of the priority queue (see section [INTRO_QUEUE]).
3.4.1. PoW verification [POW_VERIFY]
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
1) Use POW_SEED to figure out whether the client is using the current or the previous seed.
2) Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
3) Verify that hash_output =?= XXX_POW(seed, nonce).
If any of these steps fail the service MUST ignore this introduction request and abort the protocol.
In this proposal we call the above steps the "top half" of introduction handling. If all the steps of the "top half" have passed, then the circuit is added to the introduction queue as detailed in section [INTRO_QUEUE].
3.4.1.1. Replay protection [REPLAY_PROTECTION]
The service MUST NOT accept introduction requests with the same (seed, nonce) tuple. For this reason a replay protection mechanism must be employed.
The simplest way is to use a simple hash table to check whether a (seed, nonce) tuple has been used before for the active duration of a seed. Depending on how long a seed stays active this might be a viable solution with reasonable memory/time overhead.
If there is a worry that we might get too many introductions during the lifetime of a seed, we can use a Bloom filter as our replay cache mechanism. The probabilistic nature of Bloom filters means that sometimes we will flag some connections as replays even if they are not, with this false positive probability increasing as the number of entries increases. However, with the right parameter tuning this probability should be negligible and well handled by clients. {TODO: Figure out bloom filter parameters}
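For illustration, a toy Bloom filter along those lines (the sizes below are arbitrary placeholders, not tuned parameters):

  import hashlib

  class BloomFilter:
      def __init__(self, num_bits=2**23, num_hashes=4):
          self.num_bits = num_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(num_bits // 8)

      def _positions(self, item: bytes):
          # Derive k independent bit positions with salted blake2b.
          for i in range(self.num_hashes):
              digest = hashlib.blake2b(item, salt=bytes([i])).digest()
              yield int.from_bytes(digest[:8], 'big') % self.num_bits

      def add(self, item: bytes):
          for pos in self._positions(item):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, item: bytes):
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(item))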
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue [ADD_QUEUE]
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into rendezvous, queues it and prioritizes it based on how much effort was devoted by the client to PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service uses the effort() function with hash_output as its input, and uses the output to place requests into the right position of the priority_queue: The bigger the effort, the more priority it gets in the queue. If two elements have the same effort, the older one has priority over the newer one.
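A minimal sketch of this ordering, using a heap keyed on (-effort, arrival index):

  import heapq
  import itertools

  arrival = itertools.count()   # monotonically increasing insertion index

  def queue_add(pqueue, effort, request):
      # heapq is a min-heap, so store -effort: the highest effort pops first.
      # Ties on effort break on the arrival index, so older requests win.
      heapq.heappush(pqueue, (-effort, next(arrival), request))

  def queue_pop(pqueue):
      neg_effort, index, request = heapq.heappop(pqueue)
      return request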
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue. We call this part of introduction handling the "bottom half" because most of the computation happens in this stage. For a description of how we expect such a system to work in Tor, see [TOR_SCHEDULER] section.
{TODO: BLOCKER: What's the max size of the queue? Do we trim it, or do we just stop adding new requests when it reaches max size? Can we use WRED? Trimming is currently used in [EFFORT_ESTIMATION], so if we don't do it we need to find a different way to estimate effort. See tevador's [REF_TEVADOR_2] email.}
3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
The service starts with a default suggested-effort value of 5000 (see [EQUIX_DIFFICULTY] section for more info).
Then during its operation the service continuously keeps track of the received PoW cell efforts to inform its clients of the effort they should put in their introduction to get service. The service informs the clients by using the <suggested-effort> field in the descriptor.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service adds the request's effort to a sorted 'handled-requests-efforts' list. Every time the service trims its priority queue it adds the median of the trimmed requests' efforts to a sorted 'trimmed-requests-median-efforts' list.
Then every 'hs-pow-desc-upload-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 300 seconds) and while the DoS feature is enabled, the service updates its <suggested-effort> value as follows:

- If the service's current <suggested-effort> value is lower than the median of the 'trimmed-requests-median-efforts' list, then set <suggested-effort> to that median (i.e. increase suggested-effort).
- *Else* if the service's current <suggested-effort> value is higher than the median of the 'handled-requests-efforts' list, then set <suggested-effort> to that median (i.e. lower suggested-effort).
- Either way, clear 'handled-requests-efforts' and 'trimmed-requests-median-efforts'.
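A sketch of one update round (the list bookkeeping from the paragraph above is assumed to happen elsewhere):

  import statistics

  def update_suggested_effort(suggested, handled_efforts, trimmed_medians):
      # Raising the effort takes precedence over lowering it.
      if trimmed_medians and suggested < statistics.median(trimmed_medians):
          suggested = statistics.median(trimmed_medians)
      elif handled_efforts and suggested > statistics.median(handled_efforts):
          suggested = statistics.median(handled_efforts)
      handled_efforts.clear()        # either way, start fresh
      trimmed_medians.clear()
      return suggested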
The above two operations are meant to balance the suggested effort based on the requests residing in the priority queue. If the priority queue is filled with high-effort requests, make the suggested effort higher. And when all the high-effort requests get handled and the priority queue is back to normal operation, relax the suggested effort to lower levels.
Given the way the algorithm works above, priority is given to the operation that increases the suggested-effort. Also the values are taken as medians over a period of time to avoid [ATTACK_EFFORT] attacks where the attacker changes her behavior right before the descriptor upload to influence the <suggested-effort>.
{XXX: BLOCKER: Figure out if this system makes sense}
The suggested-effort is not a hard limit to the efforts that are accepted by the service, and it's only meant to serve as a guideline for clients to reduce the number of unsuccessful requests that get to the service. The service still adds requests with lower effort than suggested-effort to the priority queue in [ADD_QUEUE].
{XXX: Another approach would be to use the maximum value instead of the median here. This would give a more surefire effort estimation, but it could also cause attacks where an adversary spends 1 hour to make a single introduction with a huge PoW and then denies access to all clients for 5 minutes.}
{XXX: Does this mean that this system can auto-enable and auto-disable the DoS subsystem with reasonable accuracy?}
3.4.3.1. Updating descriptor with new suggested effort
Every 'hs-pow-desc-upload-rate-limit' seconds the service should upload a new descriptor with a new suggested-effort value.
The service should avoid uploading descriptors too often to avoid overwhelming the HSDirs. The service SHOULD NOT upload descriptors more often than once every 'hs-pow-desc-upload-rate-limit' seconds. The service SHOULD NOT upload a new descriptor if the suggested-effort value changes by less than 15%.
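The 15% rule is just a relative-change check, e.g.:

  def should_upload(old_effort, new_effort):
      # Skip the upload when the suggested effort moved by less than 15%.
      return abs(new_effort - old_effort) >= 0.15 * old_effort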
{XXX: Is this too often? Perhaps we can set different limits for when the difficulty goes up and different for when it goes down. It's more important to update the descriptor when the difficulty goes up.}
{XXX: What attacks are possible here? Can the attacker intentionally hit this rate-limit and then influence the suggested effort so that clients do not learn about the new effort?}
4. Client behavior [CLIENT_BEHAVIOR]
This proposal introduces a bunch of new ways in which a legitimate client can fail to reach the onion service.
Furthermore, there is currently no end-to-end way for the onion service to inform the client that the introduction failed. The INTRO_ACK cell is not end-to-end (it's from the introduction point to the client) and hence it does not allow the service to inform the client that the rendezvous is never gonna occur.
For this reason we need to define some client behaviors to work around these issues.
4.1. Clients handling timeouts [CLIENT_TIMEOUT]
Alice can fail to reach the onion service if her introduction request gets trimmed off the priority queue in [HANDLE_QUEUE], or if the service does not get through its priority queue in time and the connection times out.
{XXX: BLOCKER: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
This section presents a heuristic method for the client to get service even in such scenarios.
If the rendezvous request times out, the client SHOULD fetch a new descriptor for the service to make sure that it's using the right suggested-effort for the PoW and the right PoW seed. The client SHOULD NOT fetch service descriptors more often than every 'hs-pow-desc-fetch-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 600 seconds).
{XXX: Is this too rare? Too often?}
When the client fetches a new descriptor, it should try connecting to the service with the new suggested-effort and PoW seed. If that doesn't work, it should double the effort and retry. The client should keep on doubling-and-retrying until it manages to get service, or it's able to fetch a new descriptor again.
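A sketch of this retry logic (all three callables are placeholders for tor-internal machinery; refetch_allowed() enforces 'hs-pow-desc-fetch-rate-limit'):

  def reach_service(fetch_descriptor, attempt_rendezvous, refetch_allowed):
      desc = fetch_descriptor()
      effort = desc.suggested_effort
      while True:
          if attempt_rendezvous(desc.seed, effort):
              return True                    # got service
          if refetch_allowed():
              desc = fetch_descriptor()      # pick up new seed and effort
              effort = desc.suggested_effort
          else:
              effort *= 2                    # double-and-retry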
{XXX: This means that the client will keep on spinning and doubling-and-retrying for a service under this situation. There will never be a "Client connection timed out" page for the user. Is this good? Is this bad? Should we stop doubling-and-retrying after some iterations? Or should we throw a custom error page to the user, and ask the user to stop spinning whenever they want?}
4.2. Other descriptor issues
Another race condition here is if the service enables PoW, while a client has a cached descriptor. How will the client notice that PoW is needed? Does it need to fetch a new descriptor? Should there be another feedback mechanism?
5. Attacker strategies [ATTACK_META]
Now that we defined our protocol we need to start tweaking the various knobs. But before we can do that, we first need to understand a few high-level attacker strategies to see what we are fighting against.
5.1.1. Overwhelm PoW verification (aka "Overwhelm top half") [ATTACK_TOP_HALF]
A basic attack here is the adversary spamming with bogus INTRO cells so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the [POW_VERIFY] section.
That's why we need the PoW algorithm to have a cheap verification time so that this attack is not possible: we tune this PoW parameter in section [POW_TUNING_VERIFICATION].
5.1.2. Overwhelm rendezvous capacity (aka "Overwhelm bottom half") [ATTACK_BOTTOM_HALF]
Given the way the introduction queue works (see [HANDLE_QUEUE]), a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the procedure in the [HANDLE_QUEUE] section.
To do so, the attacker would have to send at least 20 high-effort introduction cells every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user" (see [USER_MODEL]).
An easier attack for the adversary is the same strategy but with introduction cells that are all above the comfortable level of "the standard user" (see [USER_MODEL]). This would block out all standard users and only allow motivated users to pass.
5.1.3. Gaming the effort estimation logic [ATTACK_EFFORT]
Another way to beat this system is for the attacker to game the effort estimation logic (see [EFFORT_ESTIMATION]). Essentially, there are two attacks that we are trying to avoid:
- Attacker sets descriptor suggested-effort to a very high value, effectively making it impossible for most clients to produce a PoW token in a reasonable timeframe.
- Attacker sets descriptor suggested-effort to a very small value so that most clients aim for a small value while the attacker comfortably launches an [ATTACK_BOTTOM_HALF] using medium effort PoW (see [REF_TEVADOR_1]).
5.1.4. Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. The current scheme gives the attackers 4 hours to launch this attack since each seed lasts 2 hours and the service caches two seeds.
An attacker with this attack might be aiming to DoS the service for a limited amount of time, or to cause an [ATTACK_EFFORT] attack.
6. Parameter tuning [POW_TUNING]
There are various parameters in this PoW system that need to be tuned:
We first start by tuning the time it takes to verify a PoW token. We do this first because it's fundamental to the performance of onion services and can turn into a DoS vector of its own. We will do this tuning in a way that's agnostic to the chosen PoW function.
We will then move towards analyzing the default difficulty setting for our PoW system. That defines the expected time for clients to succeed in our system, and the expected time for attackers to overwhelm our system. Same as above we will do this in a way that's agnostic to the chosen PoW function.
Finally, using those two pieces we will tune our PoW function and pick the right default difficulty setting. At the end of this section we will know the resources that an attacker needs to overwhelm the onion service, the resources that the service needs to verify introduction requests, and the resources that legitimate clients need to get to the onion service.
6.1. PoW verification [POW_TUNING_VERIFICATION]
Verifying a PoW token is the first thing that a service does when it receives an INTRODUCE2 cell and it's detailed in section [POW_VERIFY]. This verification happens during the "top half" part of the process. Every millisecond spent verifying PoW adds overhead to the already existing "top half" part of handling an introduction cell. Hence we should be careful to add minimal overhead here so that we don't enable attacks like [ATTACK_TOP_HALF].
During our performance measurements in [TOR_MEASUREMENTS] we learned that the "top half" takes about 0.26 msecs on average, without doing any sort of PoW verification. Using that value we compute the following table, which describes the number of cells we can queue per second (aka times we can perform the "top half" process) for different values of PoW verification time:
+---------------------+-----------------------+--------------+
|PoW Verification Time| Total "top half" time | Cells Queued |
|                     |                       |  per second  |
|---------------------|-----------------------|--------------|
|        0 msec       |       0.26 msec       |     3846     |
|        1 msec       |       1.26 msec       |      793     |
|        2 msec       |       2.26 msec       |      442     |
|        3 msec       |       3.26 msec       |      306     |
|        4 msec       |       4.26 msec       |      234     |
|        5 msec       |       5.26 msec       |      190     |
|        6 msec       |       6.26 msec       |      159     |
|        7 msec       |       7.26 msec       |      137     |
|        8 msec       |       8.26 msec       |      121     |
|        9 msec       |       9.26 msec       |      107     |
|       10 msec       |      10.26 msec       |       97     |
+---------------------+-----------------------+--------------+
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 793 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
- For a PoW function with a 2ms verification time, an attacker needs to send 442 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 97 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
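The rows above are just a naive division, i.e.:

  def cells_queued_per_second(pow_verify_ms, top_half_ms=0.26):
      # Naive extrapolation: a core doing nothing but "top half" processing.
      return int(1000 / (top_half_ms + pow_verify_ms))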
Whether an attacker can succeed at that depends on the attacker's resources, but also on the network's capacity. {TODO: BLOCKER: Need to investigate this and see if it's possible}
Our purpose here is to have the smallest PoW verification overhead possible that also allows us to achieve all our other goals.
[Note that the table above is simply the result of a naive multiplication and does not take into account all the auxiliary overheads that happen every second like the time to invoke the mainloop, the bottom-half processes, or pretty much anything other than the "top-half" processing.
During our measurements the time to handle INTRODUCE2 cells dominates any other action time: There might be events that require a long processing time, but these are pretty infrequent (like uploading a new HS descriptor) and hence over a long time they smooth out. Hence extrapolating the total cells queued per second based on a single "top half" time seems good enough to get some initial intuition. That said, the values of "Cells Queued per second" from the table above are likely much smaller than displayed because of all the auxiliary overheads.]
{TODO: BLOCKER: Figure out auxiliary overheads in real scenario}
6.2. PoW difficulty analysis [POW_DIFFICULTY_ANALYSIS]
The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. An attacker who can get many successes per second can pull off a successful [ATTACK_BOTTOM_HALF] attack against our system.
In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.
Our system is dynamic but we still need a default difficulty setting that will define the metagame and be used for bootstrapping the system. The client and attacker can still aim higher or lower, but for UX purposes and for analysis purposes we do need to define a default difficulty.
6.2.1. Analysis based on adversary power
In this section we will try to do an analysis of PoW difficulty without using any sort of Tor-related or PoW-related benchmark numbers.
We created the table (see [REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success. The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has at her disposal, ranging from just 5 to 1000.
 ==========================================================================
|           Expected Time (in seconds) Per Success For One Machine        |
 ==========================================================================
|                                                                          |
|   Attacker successes        1       5      10      20      30      50   |
|      per second                                                          |
|                                                                          |
|                    5        5       1       0       0       0       0   |
|                   50       50      10       5       2       1       1   |
|                  100      100      20      10       5       3       2   |
|   Attacker       200      200      40      20      10       6       4   |
|     Boxes        300      300      60      30      15      10       6   |
|                  400      400      80      40      20      13       8   |
|                  500      500     100      50      25      16      10   |
|                 1000     1000     200     100      50      33      20   |
|                                                                          |
 ==========================================================================
Here is how you can read the table above:
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
- If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
- If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
- If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 second getting a single success.
With the above table we can create some profiles for default values of our PoW difficulty. So for example, we can use the last case as the default parameter for Tor Browser, and then create three more profiles for more expensive cases, scaling up to the first case, which would be the hardest since the client is expected to spend 15 minutes for a single introduction.
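For reference, the table cells follow from a simple division, consistent with the script in [REF_TABLE]:

  def expected_seconds_per_success(attacker_boxes, successes_per_second):
      # The whole botnet needs 1/successes_per_second seconds per success,
      # so a single legitimate box needs 'attacker_boxes' times that long.
      return attacker_boxes // successes_per_second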
6.2.2. Analysis based on Tor's performance [POW_DIFFICULTY_TOR]
To go deeper here, we can use the performance measurements from [TOR_MEASUREMENTS] to get a more specific intuition on the default difficulty. In particular, we learned that completely handling an introduction cell takes 5.55 msecs on average. Using that value, we can compute the following table, which describes the number of introduction cells we can handle per second for different PoW verification times:
+---------------------+-----------------------+--------------+
|PoW Verification Time| Total time to handle  | Cells handled|
|                     | introduction cell     | per second   |
|---------------------|-----------------------|--------------|
|       0 msec        |      5.55 msec        |    180.18    |
|       1 msec        |      6.55 msec        |    152.67    |
|       2 msec        |      7.55 msec        |    132.45    |
|       3 msec        |      8.55 msec        |    116.96    |
|       4 msec        |      9.55 msec        |    104.71    |
|       5 msec        |     10.55 msec        |     94.79    |
|       6 msec        |     11.55 msec        |     86.58    |
|       7 msec        |     12.55 msec        |     79.68    |
|       8 msec        |     13.55 msec        |     73.80    |
|       9 msec        |     14.55 msec        |     68.73    |
|      10 msec        |     15.55 msec        |     64.31    |
+---------------------+-----------------------+--------------+
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 152 high-effort introduction cells per second to succeed in a [ATTACK_BOTTOM_HALF] attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 64 high-effort introduction cells per second to succeed in a [ATTACK_BOTTOM_HALF] attack.
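The table is a direct consequence of the measured 5.55 msec baseline: v extra milliseconds of verification give 1000 / (5.55 + v) cells handled per second. A quick sketch:

    BASE_MS = 5.55  # measured mean time to fully handle an INTRODUCE2 cell
    for verify_ms in range(11):
        total_ms = BASE_MS + verify_ms
        print(f"{verify_ms:2d} msec verify -> {1000 / total_ms:6.2f} cells/sec")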
We can use this table to specify a default difficulty that won't allow our target adversary to succeed in an [ATTACK_BOTTOM_HALF] attack.
Of course, when it comes to this table, the same disclaimer as in section [POW_TUNING_VERIFICATION] applies. That is, the above table is just a theoretical extrapolation and we expect the real values to be much lower, since they depend on auxiliary processing overheads and on the network's capacity. {TODO: BLOCKER: Figure out auxiliary overheads here too}
6.3. Tuning equix difficulty [EQUIX_DIFFICULTY]
The above two sections did not depend on a particular PoW scheme. They gave us an intuition on the values we are aiming for in terms of verification speed and PoW difficulty. Now we need to make things concrete:
As described in section [EFFORT_ESTIMATION] we start the service with a default suggested-effort value of 5000. Given the benchmarks of EquiX [REF_EQUIX] this should take about 2 to 3 seconds on a modern CPU.
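As a rough sanity check of ours (using the ~2300 solutions/sec figure from the 16-thread Equi-X benchmarks reported later in this thread):

    SUGGESTED_EFFORT = 5000      # expected number of solution attempts
    SOLUTIONS_PER_SEC = 2300.0   # ~16-thread Equi-X rate on a modern CPU
    print(f"{SUGGESTED_EFFORT / SOLUTIONS_PER_SEC:.1f} seconds")  # ~2.2 s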
With this default difficulty setting and given the table in [POW_DIFFICULTY_ANALYSIS] this means that an attacker with 50 boxes will be able to get about 20 successful PoWs per second, and an attacker with 100 boxes about 40 successful PoWs per second.
Then using the table in [POW_DIFFICULTY_TOR] we can see that the number of attacker's successes is not enough to overwhelm the service through an [ATTACK_BOTTOM_HALF] attack. That is, an attacker would need to do about 152 introductions per second to overwhelm the service, whereas they can only do 40 with 100 boxes.
{TODO: BLOCKER: Still, the [POW_DIFFICULTY_TOR] disclaimer about auxiliary overhead very much applies here too and we need to figure it out. For now this section remains just to show the methodology, not to be firm on the numbers}
7. Discussion
7.1. UX
This proposal has user-facing UX consequences.
Here are some UX improvements that don't need user input:
- Primarily, there should be a way for Tor Browser to display to users that additional time (and resources) will be needed to access a service that is under attack. Depending on the design of the system, it might even be possible to estimate how much time it will take.
And here are a few UX approaches that need user input, in increasing order of engineering difficulty. Ideally this proposal will not need user input and the default behavior should work for almost all cases.
a) Tor Browser gets a "range field" which the user can use to specify how much effort they want to spend on PoW if this ever occurs while they are browsing. The range could go from "Easy" to "Difficult", or we could try to estimate time using an average computer. This setting lives in the Tor Browser settings, so users need to find it.
b) We start with a default effort setting, and then we use the new onion errors (see #19251) to estimate when an onion service connection has failed because of DoS, and only then we present the user a "range field" which they can set dynamically. Detecting when an onion service connection has failed because of DoS can be hard because of the lack of feedback (see [CLIENT_BEHAVIOR]).
c) We start with a default effort setting, and if things fail we automatically try to figure out an effort setting that will work for the user by doing some trial-and-error connections with different effort values. Until the connection succeeds we present a "Service is overwhelmed, please wait" message to the user.
7.2. Future work [FUTURE_WORK]
7.2.1. Incremental improvements to this proposal
There are various improvements that can be made to this proposal, and while we are trying to keep this v1 version simple, we need to keep the design extensible so that we can build more features into it. In particular:
- End-to-end introduction ACKs
This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs, many of the problems from [CLIENT_BEHAVIOR] would be alleviated. The problem here is that end-to-end ACKs require modifications to the introduction point code and a network update, which is a lengthy process.
- Multithreading scheduler
Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support, we could handle a much greater number of introduction requests per second.
7.2.2. Future designs [FUTURE_DESIGNS]
This is just the beginning of DoS defences for Tor, and there are various future designs and schemes that we can investigate. Here is a brief summary of these:
"More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 cells. See #31223 for more details.
"Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See [REF_CREDS] for more details.
"PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.
All of the above approaches are much more complicated than this proposal, and hence we want to start easy before we get into more serious projects.
7.3. Environment
We love the environment! We are concerned about how PoW schemes can waste energy by doing useless hash iterations. Here are a few reasons we still decided to pursue a PoW approach here:
"We are not making things worse" -- DoS attacks are already happening and attackers are already burning energy to carry them out both on the attacker side, on the service side and on the network side. We think that asking legitimate clients to carry out PoW computations is not gonna affect the equation too much, since an attacker right now can very quickly cause the same damage that hundreds of legitimate clients do a whole day.
"We hope to make things better" -- The hope is that proposals like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see [CATCH22] for a similar argument.
8. Acknowledgements
Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.
Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.
Appendix A. Little-t tor introduction scheduler
This section describes how we will implement this proposal in the "tor" software (little-t tor).
The following should be read as if tor is an onion service and thus the end point of all inbound data.
A.1. The Main Loop [MAIN_LOOP]
Tor uses libevent for its mainloop. For network I/O operations, a mainloop event is used to inform tor if it can read on a certain socket, or a connection object in tor.
From there, this event will empty the connection input buffer (inbuf) by extracting and processing a cell at a time. The mainloop is single threaded and thus each cell is handled sequentially.
Processing an INTRODUCE2 cell at the onion service means a series of operations (in order):
1) Unpack cell from inbuf to local buffer.
2) Decrypt cell (AES operations).
3) Parse cell header and process it depending on its RELAY_COMMAND.
4) INTRODUCE2 cell handling which means building a rendezvous circuit: i) Path selection ii) Launch circuit to first hop.
5) Return to mainloop event which essentially means back to step (1).
Tor will read at most 32 cells out of the inbuf per mainloop round.
A.2. Requirements for PoW
With this proposal, in order to prioritize cells by the amount of PoW work they have done, cells can _not_ be processed sequentially as described above.
Thus, we need a way to queue a certain number of cells, prioritize them and then process some cell(s) from the top of the queue (that is, the cells that have done the most PoW effort).
We thus require a new cell processing flow that is _not_ compatible with current tor design. The elements are:
- Validate PoW and place cells in a priority queue of INTRODUCE2 cells (as described in section [INTRO_QUEUE]).
- Defer "bottom half" INTRO2 cell processing for after cells have been queued into the priority queue.
A.3. Proposed scheduler [TOR_SCHEDULER]
The intuitive way to address the requirements of A.2 would be this simple and naive approach:
1) Mainloop: Empty inbuf INTRODUCE2 cells into priority queue
2) Process all cells in pqueue
3) Goto (1)
However, we are worried that handling all those cells before returning to the mainloop opens up possibilities of attack, since the priority queue is not gonna be kept up to date while we process all those cells. This means that we might spend lots of time dealing with introductions that don't deserve it. See [BOTTOM_HALF_SCHEDULER] for more details.
We thus propose to split the INTRODUCE2 handling into two different steps: "top half" and "bottom half" process, as also mentioned in [POW_VERIFY] section above.
A.3.1. Top half and bottom half scheduler
The top half process is responsible for queuing introductions into the priority queue as follows:
a) Unpack cell from inbuf to local buffer.
b) Decrypt cell (AES operations).
c) Parse INTRODUCE2 cell header and validate PoW.
d) Return to mainloop event, which essentially means back to step (a).
The top-half basically does all operations of section [MAIN_LOOP] except (4).
And then the bottom-half process is responsible for handling introductions and doing the rendezvous. To achieve this, we introduce a new mainloop event to process the priority queue _after_ the top-half event has completed. This new event does these operations sequentially:
a) Pop INTRODUCE2 cell from priority queue.
b) Parse and process INTRODUCE2 cell.
c) End event and yield back to mainloop.
A.3.2. Scheduling the bottom half process [BOTTOM_HALF_SCHEDULER]
The question now becomes: when should the "bottom half" event get triggered from the mainloop?
We propose that this event is scheduled when the network I/O event queues at least one cell into the priority queue. Then, as long as there is a cell in the queue, it re-schedules itself for immediate execution, meaning that it executes again at the next mainloop round.
The idea is to try to empty the queue as fast as possible, in order to provide a fast response time to an introduction request, but to always leave a chance for more cells to appear between cell processing by yielding back to the mainloop. With this we aim to always have the most up-to-date version of the priority queue when completing introductions: this way we prioritize clients that spent a lot of time and effort completing their PoW.
If the size of the queue drops to 0, it stops scheduling itself in order to not create a busy loop. The network I/O event will re-schedule it in time.
Notice that the proposed solution makes the service handle a single introduction request per mainloop event. However, when we do performance measurements we might learn that it's preferable to bump the number of cells from 1 to N, where N <= 32.
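For illustration, here is the shape of that self-rescheduling event in Python-flavored pseudocode (little-t tor is C; all names below are ours, not actual tor APIs):

    import heapq

    def bottom_half_event(pqueue, mainloop):
        if not pqueue:
            return  # queue drained: stop; the network I/O event reschedules us
        neg_effort, cell = heapq.heappop(pqueue)  # entries are (-effort, cell),
                                                  # so the max-effort cell pops first
        handle_introduce2(cell)   # hypothetical: parse cell, do the rendezvous
        mainloop.schedule(bottom_half_event)  # yield, run again next mainloop round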
A.4 Performance measurements
This section details the performance measurements we've done on tor.git for handling an INTRODUCE2 cell, and then discusses how much more CPU time we can add (for PoW validation) before it badly degrades our performance.
A.4.1 Tor measurements [TOR_MEASUREMENTS]
In this section we will derive measurement numbers for the "top half" and "bottom half" parts of handling an introduction cell.
These measurements have been done on tor.git at commit 80031db32abebaf4d0a91c01db258fcdbd54a471.
We've measured several sets of actions of the INTRODUCE2 cell handling process on an Intel(R) Xeon(R) CPU E5-2650 v4. Our service was accessed by an array of clients that sent introduction requests for a period of 60 seconds.
1. Full Mainloop Event
We start by measuring the full time it takes for a mainloop event to process an inbuf containing INTRODUCE2 cells. The mainloop event processed 2.42 cells per invocation on average during our measurements.
Total measurements: 3279
Min: 0.30 msec - 1st Q.: 5.47 msec - Median: 5.91 msec
Mean: 13.43 msec - 3rd Q.: 16.20 msec - Max: 257.95 msec
2. INTRODUCE2 cell processing (bottom-half)
We also measured how much time the "bottom half" part of the process takes. That's the heavy part of processing an introduction request as seen in step (4) of the [MAIN_LOOP] section:
Total measurements: 7931
Min: 0.28 msec - 1st Q.: 5.06 msec - Median: 5.33 msec
Mean: 5.29 msec - 3rd Q.: 5.57 msec - Max: 14.64 msec
3. Connection data read (top half)
Now that we have the above pieces, we can use them to measure just the "top half" part of the procedure. That's when bytes are taken from the connection inbound buffer and parsed into an INTRODUCE2 cell where basic validation is done.
There is an average of 2.42 INTRODUCE2 cells per mainloop event, so we divide the full mainloop event mean time by that to get the time for one cell. From that we subtract the "bottom half" mean time to get how much the "top half" takes:
=> 13.43 / (7931 / 3279) = 5.55 => 5.55 - 5.29 = 0.26
Mean: 0.26 msec
To summarize, during our measurements the average number of INTRODUCE2 cells a mainloop event processed is ~2.42 cells (7931 cells for 3279 mainloop invocations).
This means that, taking the mean of mainloop event times, it takes ~5.55 msec (13.43 / 2.42) to completely process an INTRODUCE2 cell. If we look deeper, we see that the "top half" of INTRODUCE2 cell processing takes 0.26 msec on average, whereas the "bottom half" takes around 5.29 msec.
The heaviness of the "bottom half" is to be expected, since that's where 95% of the total work takes place: in particular, the rendezvous path selection and circuit launch.
{TODO: BLOCKER: While gathering these measurements we found an issue with our scheduler which was limiting the amount of cells we were reading per mainloop invocation. We are currently analyzing it and will do another confirmation round after we fix it: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006}
A.5. References
[REF_EQUIX]: https://github.com/tevador/equix
             https://github.com/tevador/equix/blob/master/devlog.md
[REF_TABLE]: The table is based on the script below plus some manual editing for readability:
             https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04
[REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01...
[REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html
[REF_TARGET]: https://en.bitcoin.it/wiki/Target
[REF_TLS]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
           https://tools.ietf.org/id/draft-nir-tls-puzzles-00.html
           https://tools.ietf.org/html/draft-ietf-ipsecme-ddos-protection-10
[REF_TLS_1]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
[REF_TEVADOR_1]: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
[REF_TEVADOR_2]: https://lists.torproject.org/pipermail/tor-dev/2020-June/014358.html
Hi all,
On Mon, Jun 22, 2020 at 4:52 PM George Kadianakis desnacked@riseup.net wrote:
Also, tevador let me know if you'd like me to add you as a co-author on the proposal based on all your great feedback so far.
Thanks for the update. Yes, you can add me as a co-author.
During our performance measurements in [TOR_MEASUREMENTS] we learned that the "top half" takes about 0.26 msecs on average, without doing any sort of PoW verification.
Interesting. This confirms the need for fast PoW verification. Equi-X takes about 0.05 ms to verify, so the top-half throughput should still be over 3000 requests per second.
Our scheduler is pretty limited by the fact that Tor has a single-threaded design.
If both the top and bottom halves are processed by the same thread, this opens up the possibility of a "hybrid" attack. Given the performance figures for the top half (0.31 ms/req., i.e. the measured 0.26 ms plus the 0.05 ms PoW verification) and the bottom half (5.5 ms/req.), the attacker can optimally deny service by submitting 91 high-effort requests and 1520 invalid requests per second. This will completely saturate the main loop because:
0.31 * (1520 + 91) ~ 0.5 sec
5.5 * 91 ~ 0.5 sec
This attack only has half the bandwidth requirement of [ATTACK_TOP_HALF] and half the compute requirement of [ATTACK_BOTTOM_HALF].
Alternatively, the attacker can adjust the ratio between invalid and high-effort requests depending on their bandwidth and compute capabilities.
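To make that trade-off concrete, here is a small sketch of ours (not from the thread) of the saturation arithmetic: high-effort requests cost the loop both halves, invalid ones only the top half, and a single-threaded loop has 1000 ms of CPU to spend per second.

    TOP_MS = 0.31    # top half per request, incl. ~0.05 ms Equi-X verification
    BOTTOM_MS = 5.5  # bottom half per valid high-effort request

    def saturating_invalid_rate(high_effort_per_sec):
        # CPU left over after the high-effort requests pay both halves,
        # spent on invalid requests that only pay the top half.
        budget_ms = 1000.0 - (TOP_MS + BOTTOM_MS) * high_effort_per_sec
        return max(0.0, budget_ms / TOP_MS)

    print(saturating_invalid_rate(91))  # ~1520 invalid requests/sec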
{TODO: BLOCKER: What's the max size of the queue? Do we trim it, or do we just stop adding new requests when it reaches max size? Can we use WRED? Trimming is currently used in [EFFORT_ESTIMATION], so if we don't do it we need to find a different way to estimate effort. See tevador's [REF_TEVADOR_2] email.}
The simplest approach is to have a "soft" maximum size of the queue. All requests with valid PoW can be added to the queue, but once per second, the queue is trimmed to the "soft" max size by removing timed-out and low-effort requests. I used this approach in my simulations (see below).
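In code, that once-per-second trim could look like this (a sketch of ours; the entry layout is an assumption):

    def trim_queue(entries, soft_max, now, timeout):
        # entries: list of (effort, arrival_time, cell) tuples
        alive = [e for e in entries if now - e[1] < timeout]  # drop timed-out
        alive.sort(key=lambda e: e[0], reverse=True)          # highest effort first
        return alive[:soft_max]                               # drop low-effort tail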
The EXT_FIELD content format is:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_SEED [4 bytes]
If we want to use Equi-X, we also need to add the solution (16 bytes), and I also recommend adding the client's target effort:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_EFFORT [4 bytes] POW_SEED [4 bytes] POW_SOLUTION [16 bytes]
43 bytes total including the extension type and length.
The client's algorithm in section 3.2 should be modified to:
a) Client selects a target effort E.
b) Client generates a random 16-byte nonce N.
c) Client derives seed C by decoding 'seed-b64'.
d) Client calculates S = equix_solve(C || N || E)
e) Client calculates R = blake2b(C || N || E || S)
f) Client checks if R * E <= UINT32_MAX.
   f1) If yes, success! The client can submit N, E, the first 4 bytes of C, and S.
   f2) If no, fail! The client interprets N as a 16-byte little-endian integer, increments it by 1, and goes back to step d).
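For concreteness, the loop in Python (a sketch of ours; equix_solve stands in for a binding to the Equi-X library, and the exact byte encodings of E and R are assumptions pending the final spec):

    import hashlib
    import os
    import struct

    UINT32_MAX = 2**32 - 1

    def solve(seed, effort):
        nonce = int.from_bytes(os.urandom(16), "little")  # random 16-byte N
        e_bytes = struct.pack("<I", effort)               # assumed 4-byte LE E
        while True:
            n_bytes = nonce.to_bytes(16, "little")
            challenge = seed + n_bytes + e_bytes          # C || N || E
            solution = equix_solve(challenge)             # S (16 bytes)
            digest = hashlib.blake2b(challenge + solution).digest()
            r = int.from_bytes(digest[:4], "big")         # assumed: R = first 4 bytes
            if r * effort <= UINT32_MAX:
                return n_bytes, effort, seed[:4], solution  # N, E, POW_SEED, S
            nonce = (nonce + 1) % (1 << 128)              # increment N, retry from d)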
We could also add the server algorithm in 3.4:
a) Find a valid seed C that starts with POW_SEED. Fail if no such seed exists.
b) Fail if E = POW_EFFORT is lower than the minimum effort.
c) Fail if N = POW_NONCE is present in the replay cache.
d) Calculate R = blake2b(C || N || E || S)
e) Fail if R * E > UINT32_MAX
f) Fail if equix_verify(C || N || E, S) != EQUIX_OK
g) Put the request in the queue with a priority of E
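Mirroring the client sketch above, the service side would be (same caveats; equix_verify and EQUIX_OK stand in for the Equi-X library):

    def verify(pow_seed, effort, nonce, solution, valid_seeds,
               replay_cache, min_effort):
        seed = next((c for c in valid_seeds if c[:4] == pow_seed), None)
        if seed is None:
            return False                       # a) no matching seed
        if effort < min_effort:
            return False                       # b) effort too low
        if nonce in replay_cache:
            return False                       # c) replayed nonce
        challenge = seed + nonce + struct.pack("<I", effort)
        digest = hashlib.blake2b(challenge + solution).digest()
        if int.from_bytes(digest[:4], "big") * effort > UINT32_MAX:
            return False                       # e) declared effort not met
        if equix_verify(challenge, solution) != EQUIX_OK:
            return False                       # f) invalid Equi-X solution
        replay_cache.add(nonce)
        return True                            # g) caller enqueues with priority E

Note that the cheap blake2b check (steps d-e) runs before the more expensive equix_verify call (step f), so requests that overstate their effort can be rejected at minimal cost.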
3.4.3. PoW effort estimation [EFFORT_ESTIMATION] {XXX: BLOCKER: Figure out if this system makes sense}
I wrote a simple simulation in Python to test different ways of adjusting the suggested effort. The results are here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md
In summary, I suggest to use MIN_EFFORT = 1000 and the following algorithm to calculate the suggested effort:
1. Sum the effort of all valid requests that have been received since the last HS descriptor update. This includes all handled requests, trimmed requests, and requests still in the queue.
2. Divide the sum by the maximum number of requests that the service could have handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
3. Suggested effort = max(MIN_EFFORT, result)
This algorithm can both increase and reduce the suggested effort.
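As a sketch (ours; the names are placeholders):

    MIN_EFFORT = 1000

    def suggested_effort(total_effort, svc_bottom_capacity, hs_update_period):
        # total_effort: summed effort of all valid requests seen since the
        # last HS descriptor update (handled + trimmed + still queued)
        could_have_handled = svc_bottom_capacity * hs_update_period
        return max(MIN_EFFORT, int(total_effort / could_have_handled))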
My simulations also show that bottom-half attacks are not feasible (even "The large botnet" cannot completely deny access to the service). I think further research and testing should focus on top-half attacks (or hybrid attacks).
{XXX: Does this mean that this system can auto-enable and auto-disable the DoS subsystem with reasonable accuracy?}
I tried to make the effort estimation algorithm disable PoW automatically (by setting suggested effort to 0), but it led to oscillating behavior in the case of a sustained attack (i.e. once the suggested effort is set to 0, clients cannot connect anymore because the attacker can easily saturate the service). Having an acceptably low minimum effort could work better. MIN_EFFORT = 1000 as I suggested will take around 1 second on a quad core CPU. Mobile clients can simply ignore the suggested effort and always try to connect without PoW.
The algorithm can be easily modified to auto-enable PoW in case of an attack (but will not auto-disable it). This can be done by keeping suggested effort zero as long as no requests have been trimmed from the queue. The PoW subsystem could be disabled manually by the HS operator if needed.
{XXX: BLOCKER: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
I think the timeout should stay the same. Attackers don't care about timeouts and longer timeout values would increase users' time-to-connect when the service is under attack and the suggested effort hasn't been updated yet.
tevador tevador@gmail.com writes:
Hi all,
Hello tevador,
thanks so much for your work here and for the great simulation. Also for the hybrid attack which was definitely missing from the puzzle.
I've been working on a further revision of the proposal based on your comments. I have just one small question I would like your feedback on.
3.4.3. PoW effort estimation [EFFORT_ESTIMATION] {XXX: BLOCKER: Figure out if this system makes sense}
I wrote a simple simulation in Python to test different ways of adjusting the suggested effort. The results are here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md
In summary, I suggest to use MIN_EFFORT = 1000 and the following algorithm to calculate the suggested effort:
- Sum the effort of all valid requests that have been received since the last HS descriptor update. This includes all handled requests, trimmed requests and requests still in the queue.
- Divide the sum by the max. number of requests that the service could have handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
- Suggested effort = max(MIN_EFFORT, result)
This algorithm can both increase and reduce the suggested effort.
I like the above logic, but I'm wondering how we can get the real SVC_BOTTOM_CAPACITY for every scenario. In particular, the SVC_BOTTOM_CAPACITY=180 value from 6.2.2 might have been true for David's testing, but it will not be true for every computer and every network.
I wonder if we can adapt the above effort estimation algorithm to use an initial SVC_BOTTOM_CAPACITY magic value for the first run (let's say 180), but then derive the real SVC_BOTTOM_CAPACITY of the host at runtime and use that for subsequent runs of the algorithm.
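For illustration, one rough shape this could take (a sketch with hypothetical names, not something already in the proposal): start from the magic value and blend in the observed bottom-half rate, sampled only while the queue is non-empty so it measures capacity rather than demand.

    class CapacityEstimator:
        def __init__(self, initial=180.0, alpha=0.1):
            self.capacity = initial  # bottom-half requests/sec, magic first run
            self.alpha = alpha       # smoothing factor

        def observe(self, handled, seconds):
            # Call only for intervals where the priority queue was non-empty,
            # so the handled rate reflects what the host can actually do.
            rate = handled / seconds
            self.capacity = (1 - self.alpha) * self.capacity + self.alpha * rate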
Do you think this is possible?
George Kadianakis desnacked@riseup.net writes:
tevador tevador@gmail.com writes:
Hi all,
Hello,
I have pushed another update to the PoW proposal here: https://github.com/asn-d6/torspec/tree/pow-over-intro I also (finally) merged it upstream to torspec as proposal #327: https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-int...
The most important improvements are:
- Add tevador as an author.
- Update PoW algorithms based on tevador's Equix feedback.
- Update effort estimation algorithm based on tevador's simulation.
- Include hybrid attack section.
- Remove a bunch of blocker tags.
Two things I'd like to work more on:
- I'd like people to take tevador's Equix PoW function and run it on their boxes and post back benchmarks of how it performed. Particularly so if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well. That will help us tune the proposal even more.
For my laptop (with an Intel i7-8550U CPU @ 1.80GHz) I got pretty accurate benchmarks (compared to https://github.com/tevador/equix#performance):

$ ./equix-bench
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
283.829505 solutions/sec. (1 thread)
22810.327943 verifications/sec. (1 thread)

$ ./equix-bench --threads 16
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ...
1.910000 solutions/nonce
2296.585708 solutions/sec. (16 threads)
20223.196324 verifications/sec. (1 thread)
See how to do this here: https://github.com/tevador/equix#build
- I'd like to improve the effort estimation algorithm by dynamically adjusting SVC_BOTTOM_CAPACITY instead of having it as a static value. Otherwise, I would like to reduce the currently suggested SVC_BOTTOM_CAPACITY because I feel that 180 is too big. I would like to put it to 100 which is much more conservative. I tried to do so while updating tevador's simulation accordingly, but I found out that the simulation code does not do the graphs itself, so I didn't make much progress here.
tevador do you have the graphing code somewhere so that I can run the experiments again and see how the graphs are influenced?
Apart from that, I think the proposal is really solid. I have hence merged it as proposal #327 to torspec and further revisions can be done on top of that from now on.
Thanks for all the work here and I'm looking forward to further feedback!
On 9/22/20 07:10, George Kadianakis wrote:
George Kadianakis desnacked@riseup.net writes:
tevador tevador@gmail.com writes:
Hi all,
Hello,
I have pushed another update to the PoW proposal here: https://github.com/asn-d6/torspec/tree/pow-over-intro I also (finally) merged it upstream to torspec as proposal #327: https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-int...
The most important improvements are:
- Add tevador as an author.
- Update PoW algorithms based on tevador's Equix feedback.
- Update effort estimation algorithm based on tevador's simulation.
- Include hybrid attack section.
- Remove a bunch of blocker tags.
Two things I'd like to work more on:
- I'd like people to take tevador's Equix PoW function and run it on their boxes and post back benchmarks of how it performed.
I shared some results privately with George and he suggested including the list. Results below.
Particularly so if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well. That will help us tune the proposal even more.
For anyone else following along or also contributing benchmarks, George clarified for me that the equix benchmark isn't capable of utilizing the GPU.
My results:
First results are on my w530, i7, 4 core (hyperthreaded to 8) laptop (with moderate activity in the background).
I stumbled across some weird artifacts when using more threads than processors: the benchmark reports solutions/sec continuing to increase linearly with #threads. The wall-clock time for the benchmark itself (measured with `time`) shows the expected trend though: linear scaling only up to 4 threads (the number of physical cores), a little bump at 8 (using the hyperthreaded virtual cores), and no improvement past that.
Further below are results on my pinephone.

$ time ./equix-bench --threads 1
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
227.714446 solutions/sec. (1 thread)
20301.439170 verifications/sec. (1 thread)

real    0m4.242s
user    0m4.230s
sys     0m0.012s

$ time ./equix-bench --threads 2
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 2) ...
1.910000 solutions/nonce
450.100153 solutions/sec. (2 threads)
17925.519934 verifications/sec. (1 thread)

real    0m2.184s
user    0m4.294s
sys     0m0.004s

$ time ./equix-bench --threads 4
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 4) ...
1.910000 solutions/nonce
876.343564 solutions/sec. (4 threads)
18863.079719 verifications/sec. (1 thread)

real    0m1.154s
user    0m4.400s
sys     0m0.012s

$ time ./equix-bench --threads 8
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 8) ...
1.910000 solutions/nonce
1089.198671 solutions/sec. (8 threads)
17808.857809 verifications/sec. (1 thread)

real    0m0.981s
user    0m7.019s
sys     0m0.052s

$ time ./equix-bench --threads 16
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ...
1.910000 solutions/nonce
2183.232035 solutions/sec. (16 threads)
18936.014118 verifications/sec. (1 thread)

real    0m1.025s
user    0m7.021s
sys     0m0.032s

$ time ./equix-bench --threads 32
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 32) ...
1.910000 solutions/nonce
4397.259598 solutions/sec. (32 threads)
17754.229411 verifications/sec. (1 thread)

real    0m1.026s
user    0m6.961s
sys     0m0.049s
$ cat /proc/cpuinfo
<snip>
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
stepping        : 9
microcode       : 0x21
cpu MHz         : 1856.366
cache size      : 6144 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips        : 5387.48
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
Similar behavior on the (4-core aarch64) pinephone:

$ time ./equix-bench --threads 1
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
23.920219 solutions/sec. (1 thread)
4477.199102 verifications/sec. (1 thread)

real    0m 40.35s
user    0m 40.12s
sys     0m 0.01s

$ time ./equix-bench --threads 2
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 2) ...
1.910000 solutions/nonce
47.683428 solutions/sec. (2 threads)
4384.937853 verifications/sec. (1 thread)

real    0m 20.45s
user    0m 40.20s
sys     0m 0.06s

$ time ./equix-bench --threads 4
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 4) ...
1.910000 solutions/nonce
94.149494 solutions/sec. (4 threads)
4359.695415 verifications/sec. (1 thread)

real    0m 10.47s
user    0m 40.71s
sys     0m 0.08s

$ time ./equix-bench --threads 8
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 8) ...
1.910000 solutions/nonce
188.808873 solutions/sec. (8 threads)
4348.479398 verifications/sec. (1 thread)

real    0m 10.50s
user    0m 40.61s
sys     0m 0.07s
$ cat /proc/cpuinfo
<snip>
processor        : 3
BogoMIPS         : 48.00
Features         : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer  : 0x41
CPU architecture : 8
CPU variant      : 0x0
CPU part         : 0xd03
CPU revision     : 4
On Tue, Sep 22, 2020 at 2:10 PM George Kadianakis desnacked@riseup.net wrote:
if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well
Someone would have to write a GPU benchmark for that. My code is CPU-only.
tevador do you have the graphing code somewhere so that I can run the experiments again and see how the graphs are influenced?
I've uploaded the gnuplot script I used to generate the graphs here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.plt
You will need to modify the script with your path to the simulation output file.
On Thu, Sep 24, 2020 at 6:54 PM Jim Newsome jnewsome@torproject.org wrote:
I stumbled across some weird artifacts when using more threads than processors: the benchmark reports solutions/sec continuing to increase linearly with #threads. The wall-clock time for the benchmark itself (measured with `time`) shows the expected trend though: linear scaling only up to 4 threads (the number of physical cores), a little bump at 8 (using the hyperthreaded virtual cores), and no improvement past that.
Good catch. There was a bug in the time measurement code in the benchmark. Should be fixed now in the master branch.
On 22 Jun (17:52:44), George Kadianakis wrote:
Hello there,
here is another round of PoW revisions: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:
- Actually used tevador's EquiX scheme as our PoW scheme for now. This is still tentative, but I needed some ingredients to cook with so I went for it.
- Fold in David's performance measurements and use them to get some guesstimates on the default PoW difficulty etc.
- Enable overlapping seed system.
- Enrich the attack section of the proposal some more.
- Attempt to fix an effort estimation attack pointed out by tevador.
- Added a bunch of "BLOCKER" tags around the proposal for things that we need to figure out, or at least develop good intuition about, if we want to have guarantees that the proposal can work before we start implementing.
Here is what needs to happen next:
- David's performance measurements have been really useful, but they open a bunch of questions on auxiliary overheads. We are now performing more experiments to confirm the performance numbers we got and make sure we are not overshooting. I noted these issues down as BLOCKER in the proposal. While doing so we also found a pretty serious bug with our scheduler that we trying to fix: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006
[snip]
(For the record)
Ok, now that this bug has been fixed, here are the new numbers. The time per INTRO2 cell, on average, is the same as in the proposal.
The big difference is that Tor is now handling on average ~15 cells per mainloop round during heavy DDoS. It is 15 and not 32 (the theoretical limit) because the service also handles a lot of DESTROY cells due to rendezvous circuits failing, and also because there are some seconds where no cells are processed since tor is busy doing other things.
We've also confirmed that the theoretical value of 180 requests per second in the proposal actually is valid. During high DDoS time, we've observed on average 165 cells per second (by removing few outliers since tor has other events that prevents cell processing for 1-3 seconds sometimes.
We've observed rate of 185cells/second so the 180 numbers holds here imo.
Cheers! David