Hi all,
I was asked by George to submit my comments about the proposal and to suggest suitable RandomX parameters for this PoW scheme.
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, and returns the number of leading zero bits in it.
This definition makes the effort exponential, i.e. the computational resources required to reach one notch higher effort increase by a factor of 2 each time.
I suggest using a linear effort, defined as the quotient of dividing a bitstring of 1s by the hash value:
== Example A:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort.
With the exponential definition, we could have an equivalent linear effort of either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear definition provides smoother classification of PoW results.
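For illustration, the linear effort function could look like this in C (a minimal sketch; the 8-byte token width is my assumption here, matching the 8-byte hash output discussed later in this thread):

#include <stdint.h>

/* Linear effort: divide the all-ones value by the hash token,
   interpreted as a little-endian unsigned integer. The result is
   the expected number of hashes needed to reach this effort. */
uint64_t effort(const uint8_t *token) {
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)token[i] << (8 * i);
    if (value == 0)
        return UINT64_MAX; /* guard against division by zero */
    return UINT64_MAX / value;
}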
The EXT_FIELD content format is:
POW_VERSION [1 byte]
POW_NONCE [32 bytes]
I suggest using a 16-byte nonce value, which is more than sufficient given the target security level and has the benefit of halving the replay cache size.
Since we expect the seed to be valid for around 3 hours (as proposed), then even if the service receives 1 million proofs per second and each proof has an effort of 1 million, then the number of submitted nonces from clients will only reach about 10^10. With a 108-bit solution space (subtracting 20 bits as the search space per client), the probability that two clients accidentally submit the same nonce is roughly 10^-13 (see [REF_BIRTHDAY]).
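For reference, this follows from the standard birthday approximation p ~ n^2 / (2 * N); a quick check under the stated assumptions:

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 1e10;        /* nonces submitted over one seed lifetime */
    double N = pow(2, 108); /* 128-bit nonce minus 20 bits per client */
    /* birthday approximation for the collision probability */
    printf("p ~ %.1e\n", n * n / (2 * N)); /* prints ~1.5e-13 */
    return 0;
}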
Additionally, I suggest adding the client's effort for convenience:
The updated EXT_FIELD content format would be:
POW_VERSION [1 byte]
POW_EFFORT [4 bytes]
POW_NONCE [16 bytes]
Including the effort has 2 benefits:
1. In case the Intro Priority Queue is full, the service doesn't need to waste time verifying PoW solutions that have effort lower than the last element in the queue. While an attacker can always lie about the actual effort of their nonce value, I think this field can still save some CPU cycles when the service is under heavy load.
2. The service can conveniently verify the reported effort with the following inequality:
POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT
where MAX_RESULT is the highest possible output from the pow_function. In the case of Example A, that would be:
165 * 6317 = 1042305 <= 1048575
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
I suggest using a different selection method rather than always taking the first 20 requests.
Selecting cells from the front of the queue actually minimizes the effort that an attacker needs to expend to completely block access to the service. The attacker can always submit the minimum effort required to occupy the first 20 slots in each 100 ms window. This can be done by submitting just 200 requests per second, observing how many circuits are successfully opened, and adjusting the effort until no other request can go through. This is the "Total overwhelm strategy" described in § 5.1.1.
See the following examples to show how selecting the first N cells from the queue is unfair. I will use N = 4 for clarity. E denotes some value of effort.
== Example B:
               ATTACKER                         LEGITIMATE CLIENTS
 ___________________________________   ______________________________________
/                                   \ /                                      \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^        ^        ^        ^
 selected selected selected selected
Here both the attacker and the legitimate clients have expended a combined effort of 8E. All of the attacker's cells get selected, while none of the other clients get through.
Instead, I suggest using stochastic universal sampling [REF_SUS], where the probability that a cell gets selected is proportional to its effort.
== Example C:
Here the total effort in the queue is 16E, so we first select a random value in the interval [0, 4E) and then select 4 evenly spaced cells:
               ATTACKER                         LEGITIMATE CLIENTS
 ___________________________________   ______________________________________
/                                   \ /                                      \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^                 ^                    ^                   ^
selected          selected           selected            selected
In this case, 2 cells are selected from each group, which is fairer considering that each group expended the same effort.
== Example D:
Now if the attacker wanted to block access to all legitimate clients, they would need to at least quadruple their total PoW effort (and there would still be a chance that a legitimate client gets selected from the end of the queue):
                          ATTACKER                          LEGITIMATE CLIENTS
 ___________________________________________________________    ______________
/                                                           \  /              \
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
|      8E      |      8E      |      8E      |      8E      |E|E|E|E|E|E|E|E|
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
       ^              ^              ^              ^
   selected       selected       selected       selected
Note: With linear effort, the original selection method becomes even worse because the attacker can occupy the front of the queue with an effort of just E+1 per cell rather than 2E.
In some cases, stochastic universal sampling can select a single element multiple times, which is not a problem for genetic algorithms, but we want to avoid it. I suggest restarting the selection algorithm from the beginning of the next cell in those cases and shortening the sampling interval according to the remaining portion of the queue. See the following example.
== Example E:
+----------------+--------+------+-+-+-+-+-+-+-+-+-+-+
|      16E       |   8E   |  6E  |E|E|E|E|E|E|E|E|E|E|
+----------------+--------+------+-+-+-+-+-+-+-+-+-+-+
        ^            ^       ^         ^
    selected     selected selected  selected
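As a concrete illustration, here is a minimal C sketch of this sampling with the restart rule (a hypothetical helper, not from the proposal; rand() stands in for a proper RNG):

#include <stdint.h>
#include <stdlib.h>

/* Select n_select cells with probability proportional to effort,
   restarting with a shortened interval whenever the next pointer
   would land in the cell that was just selected. */
void sus_select(const uint64_t *efforts, size_t n_cells,
                size_t *selected, size_t n_select) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_cells; i++)
        total += efforts[i];

    uint64_t interval = total / n_select;
    uint64_t point = (uint64_t)rand() % interval; /* start in [0, interval) */
    size_t cell = 0;
    uint64_t cell_end = efforts[0]; /* cumulative effort through current cell */

    for (size_t k = 0; k < n_select; k++) {
        while (cell_end <= point) { /* advance to the cell holding 'point' */
            cell++;
            cell_end += efforts[cell];
        }
        selected[k] = cell;
        point += interval;
        if (k + 1 < n_select && point < cell_end) {
            /* restart from the start of the next cell and shorten the
               interval to cover only the remaining portion of the queue */
            interval = (total - cell_end) / (n_select - (k + 1));
            if (interval == 0)
                break; /* remaining effort too small to subdivide */
            point = cell_end + (uint64_t)rand() % interval;
        }
    }
}

Running this over Example E (efforts 16E, 8E, 6E and ten cells of E) selects the 16E, 8E and 6E cells once each plus one of the E cells, as shown above.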
In particular, the service starts with a default suggested-effort value of 15.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service compares the request's effort to the current suggested-effort value. If the new request's effort is lower than the suggested-effort, set the suggested-effort equal to the effort of the new request.
Every time the service trims the priority queue in [HANDLE_QUEUE], the service compares the request at the trim point against the current suggested-effort value. If the trimmed request's effort is higher than the suggested-effort, set the suggested-effort equal to the effort of the trimmed request.
I think the default suggested-effort is a bit too high. Assuming each attempt takes around 1 ms to calculate, the client would need, on average, 2^15 ms = 33 seconds to find a solution using 1 CPU core. I suggest specifying a minimum effort instead. Requests with effort < MIN_EFFORT and requests without the PROOF_OF_WORK extension would be treated as having effort = 1 for the purposes of the sampling algorithm.
Secondly, the proposed method of calculating the suggested-effort is susceptible to gaming by attackers. Since the service can only update the value in the directory once every 'hs-pow-desc-upload-rate-limit' seconds, an attacker could stop the attack for a little while just before the directory gets updated, which would cause the suggested-effort value to be too low despite an ongoing attack.
I suggest taking the median effort of the selected cells during each 100 ms window. For the examples above that would be:

Example C: 1.5E
Example D: 8E
Example E: 7E
Then I would take the median of these values over the directory update period.
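A minimal sketch of the per-window step (the estimate over the directory update period would then apply the same function to the collected window medians):

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median effort of the cells selected in one 100 ms window; with an
   even count the two middle values are averaged (integer division,
   so the 1.5E of Example C would round down). */
uint64_t median_effort(uint64_t *efforts, size_t n) {
    qsort(efforts, n, sizeof(uint64_t), cmp_u64);
    if (n % 2)
        return efforts[n / 2];
    return (efforts[n / 2 - 1] + efforts[n / 2]) / 2;
}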
- Attacker strategies
Additional attacks:
5.1.2 PoW Spam Attack
The attacker may try to spam many requests with random values of POW_NONCE, requiring the service to waste cycles verifying the invalid proofs. No such request would make it into the Intro Queue, but it may still be a viable DoS strategy depending on proof verification time and the number of intro requests that can be practically delivered through the network.
5.1.3 Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer.
4.2. Seed expiration issues
As mentioned in [DESC_POW], the expiration timestamp on the PoW seed can cause issues with clock skewed clients. Furthermore, even clients without clock skew can encounter TOCTOU-style race conditions here.
The client descriptor refetch logic of [CLIENT_TIMEOUT] should take care of such seed-expiration issues, since the client will refetch the descriptor.
I suggest using two concurrent seeds, i.e. accepting PoW both from the current and the last seed epoch. We use this approach in Monero. It would however require adding the seed value into the proof of work extension field and also double the memory requirements for verification with RandomX.
The proposal suggests argon2, and Mike has been looking at RandomX. However, after further consideration and speaking with some people (props to Alex Biryukov), it seems like those two functions are not well suited for this purpose, since they are memory-hard both for the client and the service. Since we are trying to minimize the verification overhead, so that the service can do hundreds of verifications per second, they don't seem like good fits.
Asymmetric functions like Equihash, Cuckoo Cycle [REF_CUCKOO], and MTP have the advantage of being very fast to verify, but they run much faster on GPUs and specialized hardware, so I don't think they are particularly suitable for this purpose.
When we designed RandomX to be used as the PoW algorithm by Monero, we selected the parameters very conservatively to maximize the ASIC resistance of the algorithm. That's because it is very difficult to change the PoW algorithm once it's deployed.
Tor is not limited by this, so we can be much more aggressive when configuring RandomX. I suggest using a configuration that gives the fastest possible verification time without completely breaking the algorithm.
In particular, the following parameters should be set differently from Monero:
RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW input as specified in § 3.2.
RANDOMX_ARGON_ITERATIONS = 1
RANDOMX_CACHE_ACCESSES = 4
RANDOMX_DATASET_BASE_SIZE = 1073741824
RANDOMX_DATASET_EXTRA_SIZE = 16777216
These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows the number of iterations to be reduced from 8 to 4. The combined effect of this is that Dataset initialization becomes 4 times faster, which is needed due to more frequent updates of the seed (Monero updates once per ~3 days).
RANDOMX_PROGRAM_COUNT = 2
RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash calculation about 4 times faster, while still providing resistance against program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are 4 times fewer writes, we also have to reduce the scratchpad size. I suggest to use a 1 MiB scratchpad size as a compromise between scratchpad write density and memory hardness. Most x86 CPUs will perform roughly the same with a 512 KiB and 1024 KiB scratchpad, while the larger size provides higher resistance against specialized hardware, at the cost of possible time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it allows the result to be treated as a little-endian encoded unsigned integer for easy effort calculation.
RandomX would be used as follows:
The service will select a 32-byte POW_SEED and initialize the cache and the dataset:
randomx_cache *myCache = randomx_alloc_cache(flags);
randomx_init_cache(myCache, POW_SEED, 32);

randomx_dataset *myDataset = randomx_alloc_dataset(flags);
randomx_init_dataset(myDataset, myCache, 0, randomx_dataset_item_count());

randomx_vm *myMachine = randomx_create_vm(flags, NULL, myDataset);
Then in order to validate a PoW token, we could use something like this:
int validateProof(uint32_t pow_effort, void* pow_nonce) {
    uint64_t result;
    randomx_calculate_hash(myMachine, pow_nonce, 16, &result);
    /* mulh() is the high 64 bits of the 128-bit product; a zero high
       word means pow_effort * result <= MAX_RESULT (= UINT64_MAX) */
    if (mulh(pow_effort, result) == 0) {
        return 1;
    }
    return 0;
}
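Note that mulh() is not standard C; a portable sketch using the 128-bit integer extension available in GCC and Clang on 64-bit targets:

#include <stdint.h>

/* High 64 bits of the 128-bit product a * b. A zero high word is
   equivalent to a * b <= UINT64_MAX, i.e. the effort claim
   POW_EFFORT * result <= MAX_RESULT holds. */
static inline uint64_t mulh(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}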
I suggest setting MIN_EFFORT = 10000, which takes about 1 second on my laptop. Requests with pow_effort < MIN_EFFORT would not be validated.
I have collected some performance figures for the fast mode with the above RandomX configuration (~1 GiB of memory is required):
H/s = hashes per second
== CPUs:
Intel Core i3-3220
  - 1 thread    700 H/s
  - 3 threads   1400 H/s

Intel Xeon (dual core VPS, Sandy Bridge, unknown model)
  - 1 thread    2000 H/s
  - 2 threads   4000 H/s

Intel Core i5-2500K (stock)
  - 1 thread    2200 H/s
  - 4 threads   8200 H/s

Intel Core i7-8550U (laptop)
  - 1 thread    2700 H/s
  - 8 threads   10000 H/s

Intel Core i7-9850H (laptop)
  - 1 thread    3100 H/s
  - 12 threads  16000 H/s

AMD Ryzen 1700 @ 3300MHz
  - 1 thread    2300 H/s
  - 16 threads  23800 H/s

AMD Ryzen 3700X @ 3300MHz
  - 1 thread    2500 H/s
  - 16 threads  27500 H/s

AMD Epyc 7702P
  - 1 thread    2100 H/s
  - 128 threads 139000 H/s
== GPUs:
NVIDIA GeForce GTX 1660 Ti (credits to SChernykh, see [REF_RANDOMX_CUDA])
  - 3072 intensity  2600 H/s
According to the above results, the time to verify a single hash is around 400-500 μs. A mid-range GPU has performance similar to a single CPU core. Most CPUs made since 2011 have similar per-core performance, except for low-end CPUs without hardware AES support.
References:
[REF_BIRTHDAY]: https://en.wikipedia.org/wiki/Birthday_attack#Mathematics
[REF_SUS]: https://en.wikipedia.org/wiki/Stochastic_universal_sampling
[REF_CUCKOO]: https://github.com/tromp/cuckoo
[REF_RANDOMX_PROGRAMS]: https://github.com/tevador/RandomX/blob/master/doc/design.md#12-the-easy-pro...
[REF_RANDOMX_TMTO]: https://github.com/tevador/RandomX/issues/65
[REF_RANDOMX_CUDA]: https://github.com/SChernykh/RandomX_CUDA
On 08 May, 21:53, tevador tevador@gmail.com wrote:
In particular, the following parameters should be set differently from Monero:
RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW input as specified in § 3.2.
RANDOMX_ARGON_ITERATIONS = 1
RANDOMX_CACHE_ACCESSES = 4
RANDOMX_DATASET_BASE_SIZE = 1073741824
RANDOMX_DATASET_EXTRA_SIZE = 16777216

These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows the number of iterations to be reduced from 8 to 4. The combined effect of this is that Dataset initialization becomes 4 times faster, which is needed due to more frequent updates of the seed (Monero updates once per ~3 days).

RANDOMX_PROGRAM_COUNT = 2
RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash calculation about 4 times faster, while still providing resistance against program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are 4 times fewer writes, we also have to reduce the scratchpad size. I suggest to use a 1 MiB scratchpad size as a compromise between scratchpad write density and memory hardness. Most x86 CPUs will perform roughly the same with a 512 KiB and 1024 KiB scratchpad, while the larger size provides higher resistance against specialized hardware, at the cost of possible time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it allows the result to be treated as a little-endian encoded unsigned integer for easy effort calculation.
I have implemented this in the tor-pow branch of the RandomX repository:
https://github.com/tevador/RandomX/tree/tor-pow
Namely, I have changed the API to return the hash value as a uint64_t and made corresponding changes in the benchmark.
Benchmark example:
./randomx-benchmark --mine --avx2 --jit --largePages \
    --nonces 10000 --seed 1234 --init 1 --threads 1 --batch

RandomX-TOR-v1 benchmark
 - Argon2 implementation: AVX2
 - full memory mode (1040 MiB)
 - JIT compiled mode
 - hardware AES mode
 - large pages mode
 - batch mode
Initializing (1 thread) ...
Memory initialized in 5.32855 s
Initializing 1 virtual machine(s) ...
Running benchmark (10000 nonces) ...
Performance: 2535.43 hashes per second
Best result:
  Nonce: 8bc3ded34d2dcdeed9000000f95cd20c
  Result: d947ceff08750300
  Effort: 18956
  Valid: 1
At the end, it prints out the nonce that gives the highest effort value and validates it.
For the actual implementation in Tor, the RandomX validator should run in a separate thread that doesn't do anything else apart from validation and moving valid requests into the Intro Queue. This way we can reach the maximum performance of ~2000 processed requests per second.
Finally, here are some disadvantages of RandomX-TOR:
1) Fast verification requires ~1 GiB of memory. If we decide to use two overlapping seed epochs, each service will need to allocate >2 GiB of RAM just to verify the PoW. Alternatively, it is possible to use the slow mode, which requires only 256 MiB per seed, but runs 4x slower.

2) The fast mode needs about 5 seconds to initialize every time the seed is changed (can be reduced to under 1 second using multiple threads). The slow mode needs about 0.1 seconds to initialize.

3) RandomX includes a JIT compiler for maximum performance. The iOS operating system doesn't support JIT compilation, so RandomX runs about 10x slower there.

4) The JIT compiler in RandomX is currently implemented only for x86-64 and ARM64 CPU architectures. Other architectures will run very slowly (especially 32-bit systems). However, the two supported architectures cover the vast majority of devices, so this should not be an issue.
I've been working on a custom PoW algorithm specifically for this proposal. It is 10x faster to verify than RandomX-TOR and doesn't require any memory for verification.
Full write-up is here: https://github.com/tevador/equix/blob/master/devlog.md
Especially the comparison table in the Appendix may be of interest to this discussion.
T.
On Sat, May 9, 2020 at 9:38 PM tevador tevador@gmail.com wrote:
I have implemented this in the tor-pow branch of the RandomX repository:
https://github.com/tevador/RandomX/tree/tor-pow
Hi tevador,
On 9 May 2020, at 06:43, tevador tevador@gmail.com wrote:
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, and returns the number of leading zero bits in it.
This definition makes the effort exponential, i.e. the computational resources required to reach one notch higher effort increase by a factor of 2 each time.
I suggest using a linear effort, defined as the quotient of dividing a bitstring of 1s by the hash value:
== Example A:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort.
With the exponential definition, we could have an equivalent linear effort of either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear definition provides smoother classification of PoW results.
There are two possible issues with this design:
Division is expensive on some platforms, including ARM-based devices. But there might be a way to calculate an approximate value without division. (For example, bit shifts, or multiplying by an inverse.) Or we could calculate the maximum value once, and then re-use it.
Is it still possible to express the full range of difficulties? Is that expression reasonably compact?
Some advantages of this exponential distribution are:
- spurious results can be filtered using a single instruction (a bit mask),
- the expected effort is quick and easy to calculate,
- the effort can be expressed in a compact way.
Maybe we don't need some of these properties, and a linear design would be fine.
But if we do, we could change the exponent to the square or cube root of two. There would be a smoother distribution, but a wider range, and the checks would still be reasonably fast.
T
Hi teor,
On Sun, May 10, 2020 at 6:36 AM teor teor@riseup.net wrote:
There are two possible issues with this design:
Division is expensive on some platforms, including ARM-based devices. But there might be a way to calculate an approximate value without division. (For example, bit shifts, or multiplying by an inverse.) Or we could calculate the maximum value once, and then re-use it.
Is it still possible to express the full range of difficulties? Is that expression reasonably compact?
Some advantages of this exponential distribution are:
- spurious results can be filtered using a single instruction (a bit mask),
- the expected effort is quick and easy to calculate,
- the effort can be expressed in a compact way.
Maybe we don't need some of these properties, and a linear design would be fine.
But if we do, we could change the exponent to the square or cube root of two. There would be a smoother distribution, but a wider range, and the checks would still be reasonably fast.
T
You only need 1 or 2 divisions per introduction request, so it doesn't matter even if division is expensive, because it will take a minuscule amount of time compared to the actual hashing effort.
There are 2 scenarios:
1) User wants to wait for X seconds and then submit the best result they could find.
2) User wants to wait as long as it takes to submit a result with an effort of at least E.
In case 1), the client will simply take the smallest result R found during the X seconds and calculate ACTUAL_EFFORT = MAX_RESULT / R at the end.
In case 2), the client will calculate TARGET = MAX_RESULT / E at the beginning and keep hashing until they find a result R <= TARGET. Then they can calculate ACTUAL_EFFORT = MAX_RESULT / R at the end, which implies ACTUAL_EFFORT >= E.
Case 1) takes 1 division instruction, case 2) takes 2 division instructions. When hashing, the client can filter results with a single instruction (comparison).
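A minimal sketch of both cases in C, using the 20-bit MAX_RESULT from Example A (with the 8-byte hash proposed earlier, MAX_RESULT would be UINT64_MAX):

#include <stdint.h>

#define MAX_RESULT 1048575ULL /* 20-bit example value */

/* Case 1): one division at the end, applied to the smallest result
   R found within the time budget. */
uint64_t effort_from_result(uint64_t smallest_result) {
    return MAX_RESULT / smallest_result;
}

/* Case 2): one division up front; the per-hash check R <= TARGET is
   then a single comparison. */
uint64_t target_from_effort(uint64_t min_effort) {
    return MAX_RESULT / min_effort;
}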
Examples:
1) After X seconds, the client finds results 660445, 6317, 599102 ... 111847. The smallest result is 6317, so:
ACTUAL_EFFORT = 1048575 / 6317 = 165
2) The client wants to find a result with an effort of at least E = 165, so they calculate TARGET = 1048575 / 165 = 6355. Then they can keep hashing until they find R <= 6355, e.g. 6317. The actual effort is calculated as above.
So the only advantage of the exponential notation is:
- the effort can be expressed in a compact way.
This can save a few characters in the HS descriptor, at the cost of a coarse effort classification, e.g. clients who spent 60 seconds hashing will be, on average, classified the same as those who spent 100 seconds.
Hi tevador,
On 5/8/20 2:53 PM, tevador wrote:
Including the effort has 2 benefits:
1. In case the Intro Priority Queue is full, the service doesn't need to waste time verifying PoW solutions that have effort lower than the last element in the queue. While an attacker can always lie about the actual effort of their nonce value, I think this field can still save some CPU cycles when the service is under heavy load.

2. The service can conveniently verify the reported effort with the following inequality:

POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT

where MAX_RESULT is the highest possible output from the pow_function. In the case of Example A, that would be:

165 * 6317 = 1042305 <= 1048575
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
I suggest using a different selection method rather than always taking the first 20 requests.

Selecting cells from the front of the queue actually minimizes the effort that an attacker needs to expend to completely block access to the service. The attacker can always submit the minimum effort required to occupy the first 20 slots in each 100 ms window. This can be done by submitting just 200 requests per second, observing how many circuits are successfully opened, and adjusting the effort until no other request can go through. This is the "Total overwhelm strategy" described in § 5.1.1.
Hrm, you seem to have read the original proposal and missed some of the follow-on threads.
We moved away from a 100ms tick-based system into a top-half and bottom-half handler design, which updates the difficulty target as well. I tried to describe this in the top-half and bottom-half handler steps in my reply: https://lists.torproject.org/pipermail/tor-dev/2020-April/014219.html
In short, we let the queue grow at a faster rate than we serve, and we trim it occasionally. Those steps set the descriptor difficulty based on what the service can actually serve from the queue and on the queue trim point. We allow clients to pick a higher difficulty arbitrarily to jump to the head of the queue, if they notice they are still not getting service based on the descriptor difficulty.
This also eliminates the need for a "default" difficulty.
So in order for the attacker to "total overwhelm" that system, don't they have to submit not just 200 requests per second, but *continuously* send requests with higher difficulty than anyone else in the queue, in order to fully deny service?
Are there other reasons to do stochastic sampling over a priority queue, given this top-half and bottom-half design?
Hi Mike,
My apologies. I thought this later email from 14th April had the latest version of the proposal: https://lists.torproject.org/pipermail/tor-dev/2020-April/014225.html
In short, we let the queue grow at a faster rate than we serve, and we trim it occasionally.
What is the benefit of this approach rather than discarding low priority requests right away in the top-half handler?
Note that a priority queue is typically implemented as a heap, which does not support efficient trimming.
However, it still has the following potential issues:
  A) AES will bottleneck us at ~100Mbit-300Mbit at #2 in top-half above
  B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)
I don't think AES is the main concern here. Each introduction request is 512 bytes (AFAIK), so with a PoW verification performance of 2000 requests per second, the top-half handler will bottleneck at ~1 MB/s.
Are there other reasons to do stochastic sampling over a priority queue, given this top-half and bottom-half design?
After thinking about it more, I would recommend starting with a simple priority queue as proposed. More complex solutions can be implemented later if field testing finds issues.
T.
What is the benefit of this approach rather than discarding low priority requests right away in the top-half handler?
Note that a priority queue is typically implemented as a heap, which does not support efficient trimming.
Correct me if I'm wrong.
When a cell with a small effort in the queue has any chance of getting selected, the optimal strategy for a legitimate client would be to compute nonces and send as many nonces as possible until it causes congestion on his network. Instead, when only the cell with the highest effort is processed, sending more than one nonce per connection does no good for a client. We want each legitimate client to send only one nonce per connection.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.
~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1
That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
On Mon, May 18, 2020 at 1:01 PM yoehoduv@protonmail.com wrote:
When a cell with a small effort in the queue has any chance of getting selected, the optimal strategy for a legitimate client would be to compute nonces and send as many nonces as possible until it causes congestion on his network. Instead, when only the cell with the highest effort is processed, sending more than one nonce per connection does no good for a client. We want each legitimate client to send only one nonce per connection.
Sending many requests is not the optimal strategy. One high-effort request would have exactly the same chance of being selected as many low-effort requests with the same total effort. The difference is that with many requests, the client wouldn't know which rendezvous would be selected, so he'd have to waste resources on opening many circuits.
Anyways, I suggest using a priority queue first and seeing how it works in practice. To allow efficient insertion and trimming, the queue can be implemented as a red-black tree.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.

~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1

That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
What you are describing is called a hashtable. The question is: what happens when one of the arrays gets filled up? You would have to discard all additional requests coming into that bucket. With your construction, it's very likely that most requests would end up in just a few buckets and the rest would remain empty (e.g. all buckets for more than 40 leading zeroes).
T.
Sending many requests is not the optimal strategy. One high-effort request would have exactly the same chance of being selected as many low-effort requests with the same total effort. The difference is that with many requests, the client wouldn't know which rendezvous would be selected, so he'd have to waste resources on opening many circuits.
Is solving for a high-effort solution different from solving for a lower-effort solution? If not, many lower-effort solutions will be found while working for a high-effort solution. Then nothing stops a client from making requests with those lower-effort solutions as well. I can expect to find 2 solutions of effort E/2 for every solution of effort E. If I make 3 requests with those solutions, my chance of succeeding doubles, at the cost of 3 times the server's verification effort.
One way is to include the target effort in requests, and include both the server-provided nonce and the target effort as the x in H(x). Then only check that the real effort comes out no less than the target effort, but use the target effort for everything else.
Anyways, I suggest using a priority queue first and seeing how it works in practice. To allow efficient insertion and trimming, the queue can be implemented as a red-black tree.
As for trimming the priority queue, we don't have to use a heap. We can compress the effort into maybe 7 bits, and then store the requests in 128 arrays. Then trimming is just freeing an array. The compression can be something like floating point.

~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1

That is, take the number of leading zeros and the one bit to the right of the leftmost 1 bit, then complement the first part to preserve order. We can expect the number of leading zeros to be less than 64, so this will take 7 bits. A decrement of this value means about 1.3 - 1.5 times more work, which should be finely grained enough.
What you are describing is called a hashtable. The question is: what happens when one of the arrays gets filled up? You would have to discard all additional requests coming into that bucket. With your construction, it's very likely that most requests would end up in just a few buckets and the rest would remain empty (e.g. all buckets for more than 40 leading zeroes).
I was thinking of dynamically sized arrays. Anyway, my point was we don't need to compare 128-bit solutions.
On Sun, Jun 7, 2020 at 8:42 AM yoehoduv@protonmail.com wrote:
One way is to include the target effort in requests, and include both the server-provided nonce and the target effort as the x in H(x). Then only check that the real effort comes out no less than the target effort, but use the target effort for everything else.
That's a very good idea. It would also prevent "lucky" high-effort solutions (unless the client wants to play the lottery).
With the Equi-X puzzle, it would work like this:
C  ... server challenge (32 bytes)
N  ... client nonce (16 bytes)
E  ... client target effort (4 bytes, little endian)
S  ... Equi-X solution (16 bytes)
R  ... hash result (4 bytes, little endian)
|| ... concatenation operator
The client's algorithm:

Input: C
1) Select N, E
2) Calculate S = equix_solve(C || N || E)
3) Calculate R = blake2b(C || N || E || S)
4) If R * E > UINT32_MAX, go back to step 1)
5) Submit C, N, E, S (68 bytes total)
The server's algorithm:

Input: C, N, E, S
1) Check that C is a valid challenge (there may be multiple challenges active at a time).
2) Check that E is above the minimum effort.
3) Check that N hasn't been used before with C.
4) Check equix_verify(C || N || E, S) == EQUIX_OK.
5) Calculate R = blake2b(C || N || E || S).
6) Check R * E <= UINT32_MAX.
7) Put the request in the queue with weight E.
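Since R and E are both 32-bit values, the effort checks in client step 4) and server step 6) need no division; a minimal sketch:

#include <stdint.h>

/* Accept iff R * E <= UINT32_MAX; the 64-bit product of two 32-bit
   values cannot overflow. */
int effort_ok(uint32_t R, uint32_t E) {
    return (uint64_t)R * E <= UINT32_MAX;
}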
Optionally, C could be omitted from the extension field if there is only one global challenge. That would reduce the payload size to 36 bytes.
Note: 32-bit effort should be enough for more than a week of solving with an 8-core CPU.
T.
The client's algorithm: Input: C
Select N, E
Calculate S = equix_solve(C || N || E)
Calculate R = blake2b(C || N || E || S)
if R * E > UINT32_MAX, go back to step 1)
Submit C, N, E, S (68 bytes total)
It looks like all the 40320 permutations of the 16-bit words in S are equix solutions. Are steps 3 to 5 supposed to be repeated for all the permutations?
On Mon, Jun 8, 2020 at 3:09 AM yoehoduv@protonmail.com wrote:
It looks like all the 40320 permutations of the 16-bit words in S are equix solutions. Are steps 3 to 5 supposed to be repeated for all the permutations?
Each unique S provides only one attempt. The indices in S must be ordered in a certain way, otherwise the solution is invalid. See: https://github.com/tevador/equix/blob/master/src/equix.c#L14-L23
Hello,
after reading all the excellent feedback on this thread, I did another revision on this proposal: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:
- Improve attack vector section
- Shrink nonce size on cells to 16 bytes
- Change effort definition to linear
Here are a few things I did not do and might need some help with:
- I did not decide on the PoW function. I think to do this we are missing the scheduler number crunching from dgoulet, and I also need to understand the possible options a bit more. I removed most references to argon2 and replaced them with XXX_POW.
Tevador, thanks a lot for your tailored work on equix. This is fantastic. I have a question that I don't see addressed in your very well written README. In your initial email, you discuss how Equihash does not have good GPU resistance: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
Since equix is using Equihash, isn't this gonna be a problem here too? I'm not too worried about ASIC resistance, since I doubt someone is gonna build ASICs for this problem just yet, but script kiddies with their CS:GO graphics cards attacking equix is something I'm concerned about. I bet you have thought of this, so I'm wondering what your take is here.
Right now I think the possible options are equix or the reduced RandomX (again thanks tevador) or yespower. In theory we could do all three of them and just support different versions; but that means more engineering.
In any case, we are also waiting for some Tor-specific numbers from dgoulet, so we need those before we proceed here.
- In their initial mail, tevador points out an attack where the adversary games the effort estimation logic, by pausing an attack a minute before descriptor upload, so that the final descriptor has a very small target effort. They suggest using the median effort over a long period of time to fix this. Mike, can you check that out and see how we can adapt our logic to fix this?
- In tevador's initial mail, they point out how the cell should include POW_EFFORT and that we should specify a "minimum effort" value instead of just inserting any effort in the pqueue. I can understand how this can have benefits (like the June discussion between tevador and yoehoduv), but I'm also concerned that this can make us more vulnerable to [ATTACK_BOTTOM_HALF] types of attacks, by completely dropping introduction requests instead of queueing them for an abstract future. I wouldn't be surprised if my concerns are invalid and harmful here. Does anyone have intuition?
- tevador suggests we use two seeds, and always accept introductions with the previous seed. I agree this is a good idea, and it's not as complex as I originally thought (I have trauma from the v3 design where we try to support multiple time periods at the same time). However, because this doubles the verification time, I decided to wait for dgoulet's scheduler numbers and until the PoW function is finalized to understand if we can afford the verification overhead.
- Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances of [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast hash with a really low difficulty, and having the verifier check that fast hash first. This means that an attacker trying to flood us with invalid PoW would need to do some work for every proof instead of it being free. This is a decision we should take at the end, after we do some number crunching and see where we are at in terms of verification time and attack models.
Thanks a lot! :)
---
Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft
0. Abstract
This proposal aims to thwart introduction flooding DoS attacks by introducing a dynamic Proof-Of-Work protocol that occurs over introduction circuits.
1. Motivation
So far our attempts at limiting the impact of introduction flooding DoS attacks on onion services have focused on horizontal scaling with Onionbalance, optimizing the CPU usage of Tor, and applying congestion control using rate limiting. While these measures move the goalposts forward, a core problem with onion service DoS is that building rendezvous circuits is a costly procedure both for the service and for the network. For more information on the limitations of rate-limiting when defending against DDoS, see [REF_TLS_1].
If we ever hope to have truly reachable global onion services, we need to make it harder for attackers to overload the service with introduction requests. This proposal achieves this by allowing onion services to specify an optional dynamic proof-of-work scheme that its clients need to participate in if they want to get served.
With the right parameters, this proof-of-work scheme acts as a gatekeeper to block amplification attacks by attackers while letting legitimate clients through.
1.1. Related work
For a similar concept, see the three internet drafts that have been proposed for defending against TLS-based DDoS attacks using client puzzles [REF_TLS].
1.2. Threat model [THREAT_MODEL]
1.2.1. Attacker profiles [ATTACKER_MODEL]
This proposal is written to thwart specific attackers. A simple PoW proposal cannot defend against each and every DoS attack on the Internet, but there are adversary models we can defend against.
Let's start with some adversary profiles:
"The script-kiddie"
The script-kiddie has a single computer and pushes it to its limits. Perhaps it also has a VPS and a pwned server. We are talking about an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We consider the total cost for this attacker to be $0.
"The small botnet"
The small botnet is a bunch of computers lined up to do an introduction flooding attack. Assuming 500 medium-range computers, we are talking about an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider the upfront cost for this attacker to be about $400.
"The large botnet"
The large botnet is a serious operation with many thousands of computers organized to do this attack. Assuming 100k medium-range computers, we are talking about an attacker with total access to 200 THz of CPU and 200 TB of RAM. The upfront cost for this attacker is about $36k.
We hope that this proposal can help us defend against the script-kiddie attacker and small botnets. To defend against a large botnet we would need more tools at our disposal (see [FUTURE_DESIGNS]).
{XXX: Do the above make sense? What other attackers do we care about? What other metrics do we care about? Network speed? I got the botnet costs from here [REF_BOTNET] Back up our claims of defence.}
1.2.2. User profiles [USER_MODEL]
We have attackers and we have users. Here are a few user profiles:
"The standard web user"
This is a standard laptop/desktop user who is trying to browse the web. They don't know how these defences work and they don't care to configure or tweak them. They are gonna use the default values and if the site doesn't load, they are gonna close their browser and be sad at Tor. They run a 2 GHz computer with 4 GB of RAM.
"The motivated user"
This is a user that really wants to reach their destination. They don't care about the journey; they just want to get there. They know what's going on; they are willing to tweak the default values and make their computer do expensive multi-minute PoW computations to get where they want to be.
"The mobile user"
This is a motivated user on a mobile phone. Even though they want to read the news article, they don't have much leeway on stressing their machine to do more computation.
We hope that this proposal will allow the motivated user to always connect where they want to connect to, and also give more chances to the other user groups to reach the destination.
1.2.3. The DoS Catch-22 [CATCH22]
This proposal is not perfect and it does not cover all the use cases. Still, we think that by covering some use cases and giving reachability to the people who really need it, we will severely demotivate the attackers from continuing the DoS attacks and hence stop the DoS threat altogether. Furthermore, by increasing the cost to launch a DoS attack, a big class of DoS attackers will disappear from the map, since the expected ROI will decrease.
2. System Overview
2.1. Tor protocol overview
                                  +----------------------------------+
                                  |                                  |
+-------+ INTRO1  +-----------+ INTRO2  +--------+                   |
|Client |-------->|Intro Point|------->|  PoW   |-----------+        |
+-------+         +-----------+        |Verifier|           |        |
                                  |    +--------+           |        |
                                  |                         |        |
                                  |                         |        |
                                  |              +----------v---------+
                                  |              |Intro Priority Queue|
                                  +---------+--------------------+---+
                                            |        |       |
                                 Rendezvous |        |       |
                                 circuits   |        |       |
                                            v        v       v
The proof-of-work scheme specified in this proposal takes place during the introduction phase of the onion service protocol.
The system described in this proposal is not meant to be on all the time, and should only be enabled by services when under duress. The percentage of clients receiving puzzles can also be configured based on the load of the service.
In summary, the following steps are taken for the protocol to complete:
1) Service encodes PoW parameters in descriptor [DESC_POW]
2) Client fetches descriptor and computes PoW [CLIENT_POW]
3) Client completes PoW and sends results in INTRO1 cell [INTRO1_POW]
4) Service verifies PoW and queues introduction based on PoW effort [SERVICE_VERIFY]
2.2. Proof-of-work overview
2.2.1. Primitives
For our proof-of-work scheme we want to minimize the gap in resources between a motivated attacker and legitimate clients. This means that we are looking to minimize any benefits that GPUs or ASICs can offer to an attacker.
For this reason we chose XXX_POW
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. Instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the more effort they put into their proof-of-work. This is similar to how proof-of-stake works, but instead of staking coins, you stake work.
The benefit here is that legitimate clients who really care about getting access can spend a big amount of effort into their PoW computation, which should guarantee access to the service given reasonable adversary models. See [PARAM_TUNING] for more details about these guarantees and tradeoffs.
As a way to improve reachability and UX, the service tries to estimate the effort needed for clients to get access at any given time and places it in the descriptor. See [EFFORT_ESTIMATION] for more details.
2.2.3. PoW effort
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, interprets it as a bitstring, and returns the quotient of dividing a bitstring of 1s by it.
So for example:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or the same in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort. This is in contrast to the (cheaper) exponential effort definition of taking the number of leading zero bits.
3. Protocol specification
3.1. Service encodes PoW parameters in descriptor [DESC_POW]
This whole protocol starts with the service encoding the PoW parameters in the 'encrypted' (inner) part of the v3 descriptor, as follows:
"pow-params" SP type SP seed-b64 SP expiration-time NL
[At most once]
type: The type of PoW system used. We call the one specified here "v1"
seed-b64: A random seed that should be used as the input to the PoW hash function. Should be 32 random bytes encoded in base64 without trailing padding.
suggested-effort: An unsigned integer specifying an effort value that clients should aim for when contacting the service. See [EFFORT_ESTIMATION] for more details here.
expiration-time: A timestamp in "YYYY-MM-DD SP HH:MM:SS" format after which the above seed expires and is no longer valid as the input for PoW. It's needed so that the size of our replay cache does not grow infinitely. It should be set to three hours in the future (+- some randomness). {TODO: PARAM_TUNING}
{XXX: Expiration time makes us even more susceptible to clock skews, but it's needed so that our replay cache refreshes. How to fix this? See [CLIENT_BEHAVIOR] for more details.}
3.2. Client fetches descriptor and computes PoW [CLIENT_POW]
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
The client parses the descriptor and extracts the PoW parameters. It makes sure that the <expiration-time> has not passed; if it has, the client needs to fetch a new descriptor.
The client should then extract the <suggested-effort> field to configure its PoW 'target' (see [REF_TARGET]). The client SHOULD NOT accept 'target' values that will cause an infinite PoW computation. {XXX: How to enforce this?}
To complete the PoW the client follows the following logic:
a) Client generates 'nonce' as 16 random bytes.
b) Client derives 'seed' by decoding 'seed-b64'.
c) Client derives 'labeled_seed = seed + "TorV1PoW"'.
d) Client computes hash_output = XXX_POW(labeled_seed, nonce).
e) Client checks if effort(hash_output) >= target.
   e1) If yes, success! The client uses 'hash_output' as the puzzle solution and 'nonce' and 'seed' as its inputs.
   e2) If no, fail! The client interprets 'nonce' as a big-endian integer, increments it by one, and goes back to step (d).
At the end of the above procedure, the client should have a triplet (hash_output, seed, nonce) that can be used as the answer to the PoW puzzle. How quickly this happens depends solely on the 'target' parameter.
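A sketch of this loop in C, under the proposal's placeholders (xxx_pow() stands for the undecided XXX_POW function; random_bytes() and effort() are assumed helpers):

#include <stdint.h>
#include <string.h>

extern uint64_t xxx_pow(const uint8_t *labeled_seed, size_t seed_len,
                        const uint8_t *nonce, size_t nonce_len);
extern uint64_t effort(uint64_t hash_output);
extern void random_bytes(uint8_t *buf, size_t len);

void solve_pow(const uint8_t seed[32], uint64_t target,
               uint8_t nonce_out[16], uint64_t *hash_out) {
    uint8_t labeled[40];
    memcpy(labeled, seed, 32);                /* step (b) already done */
    memcpy(labeled + 32, "TorV1PoW", 8);      /* step (c) */

    uint8_t nonce[16];
    random_bytes(nonce, 16);                  /* step (a) */
    for (;;) {
        uint64_t h = xxx_pow(labeled, 40, nonce, 16); /* step (d) */
        if (effort(h) >= target) {            /* step (e1) */
            memcpy(nonce_out, nonce, 16);
            *hash_out = h;
            return;
        }
        /* step (e2): increment nonce as a big-endian integer */
        for (int i = 15; i >= 0 && ++nonce[i] == 0; i--)
            ;
    }
}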
3.3. Client sends PoW in INTRO1 cell [INTRO1_POW]
Now that the client has an answer to the puzzle it's time to encode it into an INTRODUCE1 cell. To do so the client adds an extension to the encrypted portion of the INTRODUCE1 cell by using the EXTENSIONS field (see [PROCESS_INTRO2] section in rend-spec-v3.txt). The encrypted portion of the INTRODUCE1 cell only gets read by the onion service and is ignored by the introduction point.
We propose a new EXT_FIELD_TYPE value:
[01] -- PROOF_OF_WORK
The EXT_FIELD content format is:
POW_VERSION [1 byte]
POW_NONCE [16 bytes]
where:
POW_VERSION is 1 for the protocol specified in this proposal
POW_NONCE is 'nonce' from the section above
This will increase the INTRODUCE1 payload size by 19 bytes, since the extension type and length are 2 extra bytes, the N_EXTENSIONS field is always present and currently set to 0, and the EXT_FIELD is 17 bytes. According to ticket #33650, INTRODUCE1 cells currently have more than 200 bytes available.
3.4. Service verifies PoW and handles the introduction [SERVICE_VERIFY]
When a service receives an INTRODUCE1 with the PROOF_OF_WORK extension, it should check its configuration on whether proof-of-work is required to complete the introduction. If it's not required, the extension SHOULD be ignored. If it is required, the service follows the procedure detailed in this section.
If the service requires the PROOF_OF_WORK extension but received an INTRODUCE1 cell without any embedded proof-of-work, the service SHOULD consider this cell as a zero-effort introduction for the purposes of the priority queue (see section [INTRO_QUEUE]).
3.4.1. PoW verification [POW_VERIFY]
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
1) Make sure that the client's seed is identical to the active seed.
2) Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
3) Verify that 'hash_output =?= XXX_POW(seed, nonce)'.
If any of these steps fail the service MUST ignore this introduction request and abort the protocol.
In this proposal we call the above steps the "top half" of introduction handling. If all the steps of the "top half" have passed, then the circuit is added to the introduction queue as detailed in section [INTRO_QUEUE].
3.4.1.1. Replay protection [REPLAY_PROTECTION]
The service MUST NOT accept introduction requests with the same (seed, nonce) tuple. For this reason a replay protection mechanism must be employed.
The simplest way is to use a simple hash table to check whether a (seed, nonce) tuple has been used before during the active duration of a seed. Depending on how long a seed stays active, this might be a viable solution with reasonable memory/time overhead.
If there is a worry that we might get too many introductions during the lifetime of a seed, we can use a Bloom filter as our replay cache mechanism. The probabilistic nature of Bloom filters means that sometimes we will flag some connections as replays even if they are not; with this false positive probability increasing as the number of entries increases. However, with the right parameter tuning this probability should be negligible and well handled by clients. {TODO: PARAM_TUNING}
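For that parameter tuning, the standard false-positive estimate for a Bloom filter with k hash functions, n inserted entries and m bits is a useful starting point (a sketch, not part of the proposal):

#include <math.h>

/* p ~= (1 - e^(-k*n/m))^k */
double bloom_fp_rate(double k, double n, double m) {
    return pow(1.0 - exp(-k * n / m), k);
}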
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue [ADD_QUEUE]
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into the rendezvous, queues it and prioritizes it based on how much effort the client devoted to the PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service uses the effort() function with hash_output as its input, and uses the output to place the request into the right position of the priority queue: the bigger the effort, the higher the priority in the queue. If two elements have the same effort, the older one has priority over the newer one.
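As an illustration, the ordering could be expressed as a comparator like the following (a sketch; the struct and field names are hypothetical):

#include <stdint.h>

typedef struct {
    uint64_t effort;   /* output of effort() for this request */
    uint64_t arrival;  /* monotonically increasing arrival counter */
    /* ... rest of the introduction request ... */
} intro_req_t;

/* Higher effort first; on equal effort, the older request wins. */
int intro_req_cmp(const intro_req_t *a, const intro_req_t *b) {
    if (a->effort != b->effort)
        return a->effort > b->effort ? -1 : 1;
    return a->arrival < b->arrival ? -1 : 1;
}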
{TODO: PARAM_TUNING: If the priority queue is only ordered based on the effort what attacks can happen in various scenarios? Do we want to order on time+effort? Which scenarios and attackers should we examine here?}
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue. We call this part of introduction handling the "bottom half" because most of the computation happens in this stage.
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
With this tempo, we can process 200 introduction cells per second. {XXX: Is this good?}
After the introduction request is handled from the queue, the service trims the priority queue if the queue is too big. {TODO: PARAM_TUNING: What's the max size of the queue? How do we trim it? Can we use WRED usefully?}
{TODO: PARAM_TUNING: STRAWMAN: This needs hella tuning. Processing 20 cells per 100ms is probably unmaintainable, since each cell is quite expensive: doing so involves path selection, crypto, and building circuits. We will need to profile this procedure and see how we can do this scheduling better.}
3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
During its operation the service continuously keeps track of the received PoW cell efforts to inform its clients of the effort they should put in their introduction to get service. The service informs the clients by using the <suggested-effort> field in the descriptor.
In particular, the service starts with a default suggested-effort value of 5000.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service compares the request's effort to the current suggested-effort value. If the new request's effort is lower than the suggested-effort, set the suggested-effort equal to the effort of the new request.
{XXX tevador attack: see their email https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html where it says "Secondly, the proposed method of calculating..." They suggest using the median here and their "pause-before-desc-publish" attack seems legit.}
Every time the service trims the priority queue in [HANDLE_QUEUE], the service compares the request at the trim point against the current suggested-effort value. If the trimmed request's effort is higher than the suggested-effort, set the suggested-effort equal to the effort of the trimmed request.
The above two operations are meant to balance the suggested effort based on the requests currently waiting in the priority queue. If the priority queue is filled with high-effort requests, make the suggested effort higher. And when all the high-effort requests get handled and the priority queue is back to normal operation, relax the suggested effort to lower levels.
The suggested-effort is not a hard limit to the efforts that are accepted by the service, and it's only meant to serve as a guideline for clients to reduce the number of unsuccessful requests that get to the service. The service still adds requests with lower effort than suggested-effort to the priority queue in [ADD_QUEUE].
{XXX: What attacks are possible here?}
3.4.3.1. Updating descriptor with new suggested effort
When a service changes its suggested-effort value, it SHOULD upload a new descriptor with the new value.
The service should avoid uploading descriptors too often to avoid overwhelming the HSDirs. The service SHOULD NOT upload descriptors more often than once every 'hs-pow-desc-upload-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 300 seconds).
{XXX: Is this too often? Or too rare? Perhaps we can set different limits for when the difficulty goes up and different for when it goes down. It's more important to update the descriptor when the difficulty goes up.}
{XXX: What attacks are possible here? Can the attacker intentionally hit this rate-limit and then influence the suggested effort so that clients do not learn about the new effort?}
4. Client behavior [CLIENT_BEHAVIOR]
This proposal introduces a bunch of new ways in which a legitimate client can fail to reach the onion service.
Furthermore, there is currently no end-to-end way for the onion service to inform the client that the introduction failed. The INTRO_ACK cell is not end-to-end (it's from the introduction point to the client) and hence it does not allow the service to inform the client that the rendezvous is never gonna occur.
For this reason we need to define some client behaviors to work around these issues.
4.1. Clients handling timeouts [CLIENT_TIMEOUT]
Alice can fail to reach the onion service if her introduction request gets trimmed off the priority queue in [HANDLE_QUEUE], or if the service does not get through its priority queue in time and the connection times out.
{XXX: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
This section presents a heuristic method for the client to get service even in such scenarios.
If the rendezvous request times out, the client SHOULD fetch a new descriptor for the service to make sure that it's using the right suggested-effort for the PoW and the right PoW seed. The client SHOULD NOT fetch service descriptors more often than every 'hs-pow-desc-fetch-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 600 seconds).
{XXX: Is this too rare? Too often?}
When the client fetches a new descriptor, it should try connecting to the service with the new suggested-effort and PoW seed. If that doesn't work, it should double the effort and retry. The client should keep on doubling-and-retrying until it manages to get service, or it's able to fetch a new descriptor again.
{XXX: This means that the client will keep on spinning and doubling-and-retrying for a service under this situation. There will never be a "Client connection timed out" page for the user. Is this good? Is this bad? Should we stop doubling-and-retrying after some iterations? Or should we throw a custom error page to the user, and ask the user to stop spinning whenever they want?}
4.2. Seed expiration issues
As mentioned in [DESC_POW], the expiration timestamp on the PoW seed can cause issues for clock-skewed clients. Furthermore, even clients without clock skew can encounter TOCTOU-style race conditions here.
The client descriptor refetch logic of [CLIENT_TIMEOUT] should take care of such seed-expiration issues, since the client will refetch the descriptor.
{XXX: Is this sufficient? Should we have multiple active seeds at the same time similar to how we have overlapping descriptors and time periods in v3? This would solve the problem but it grows the complexity of the system substantially.}
4.3. Other descriptor issues
Another race condition here is if the service enables PoW, while a client has a cached descriptor. How will the client notice that PoW is needed? Does it need to fetch a new descriptor? Should there be another feedback mechanism? {XXX}
5. Attacker strategies [ATTACK_META]
Now that we defined our protocol we need to start tweaking the various knobs. But before we can do that, we first need to understand a few high-level attacker strategies to see what we are fighting against.
5.1.1. Overwhelm PoW verification (aka "Overwhelm top half") [ATTACK_TOP_HALF]
A basic attack here is the adversary spamming with bogus INTRO cells so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the [POW_VERIFY] section.
That's why we need the PoW algorithm to have extremely cheap verification time so that this attack is not possible.
5.1.2. Overwhelm rendezvous capacity (aka "Overwhelm bottom half") [ATTACK_BOTTOM_HALF]
Given the way the introduction queue works (see [HANDLE_QUEUE]), a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the procedure in the [HANDLE_QUEUE] section.
To do so, the attacker would have to send at least 20 high-effort introduction cells every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user" (see [USER_MODEL]).
An easier attack for the adversary is the same strategy but with introduction cells that are all above the comfortable level of "the standard user" (see [USER_MODEL]). This would block out all standard users and only allow motivated users to pass.
5.1.3. Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. An attacker with this attack might be aiming to DoS the service for a limited amount of time, or confuse the difficulty estimation algorithm.
6. Parameter tuning [PARAM_TUNING]
There are various parameters in this system that need to be tuned.
We will first start by tuning the default difficulty of our PoW system. That's gonna define an expected time for attackers and clients to succeed.
We are then gonna tune the parameters of our proof-of-work function. That will define the resources that an attacker needs to spend to overwhelm the onion service, the resources that the service needs to spend to verify introduction requests, and the resources that legitimate clients need to spend to get to the onion service.
6.1. PoW Difficulty settings
The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.
Even tho our system is dynamic, we still need default difficulty settings that will define the metagame. The client and attacker can still aim higher or lower, but for UX purposes and for analysis purposes we do need to define some difficulties.
We hence created the table (see [REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success. The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has at her disposal, ranging from just 5 to 1000.
 ==========================================================================
|           Expected Time (in seconds) Per Success For One Machine        |
 ==========================================================================
|                                                                          |
|   Attacker successes        1       5      10      20      30      50   |
|      per second                                                          |
|                                                                          |
|                    5        5       1       0       0       0       0   |
|                   50       50      10       5       2       1       1   |
|                  100      100      20      10       5       3       2   |
|   Attacker       200      200      40      20      10       6       4   |
|     Boxes        300      300      60      30      15      10       6   |
|                  400      400      80      40      20      13       8   |
|                  500      500     100      50      25      16      10   |
|                 1000     1000     200     100      50      33      20   |
|                                                                          |
 ==========================================================================
Here is how you can read the table above:
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
- If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
- If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
- If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 second getting a single success.
With the above table we can create some profiles for default values of our PoW difficulty. So for example, we can use the last case as the default parameter for Tor Browser, and then create three more profiles for more expensive cases, scaling up to the first case, which would be the hardest since the client is expected to spend 15 minutes for a single introduction.
{TODO: PARAM_TUNING You can see that this section is completely CPU/memory agnostic, and it does not take into account potential optimizations that can come from GPU/ASICs. This is intentional so that we don't put more variables into this equation right now, but as this proposal moves forward we will need to put more concrete values here.}
6.2. XXX_POW parameters [ARGON_PARAMS]
We now need to define the secondary argon2 parameters as defined in [REF_ARGON2]. This includes the number of lanes 'p', the memory size 'm', and the number of iterations 't'. Section 9 of [REF_ARGON2] recommends an approach for tuning these parameters.
To tune these parameters we are looking to *minimize* the verification time of an onion service, while *maximizing* the scarce resources spent by an adversary trying to overwhelm the service using [ATTACK_META].
When it comes to verification speed, verifying a single introduction cell takes a single argon2 call, so the service will need to do hundreds of those per second as INTRODUCE2 cells arrive. The service will have to do this verification step even for very cheap, zero-effort PoWs, so verification has to be cheap enough not to become a DoS vector of its own. Hence each individual argon2 call must be cheap enough to be done comfortably and plentifully by an onion service on a single host (or horizontally scaled with Onionbalance).
At the same time, the adversary will have to do thousands of these calls if she wants to produce a high-effort PoW, so it's this asymmetry that we are looking to exploit here. Right now, the most expensive resource for adversaries is RAM size, and that's why we chose argon2, which is memory-hard.
To minmax this game we will need concrete benchmarks:
{TODO: PARAM_TUNING: I've had a hard time minmaxing this game for argon2. Even argon2 invocations with a small memory parameter will take multiple milliseconds to run on my machine, and the parameters recommended in section 8 of the paper all take many hundreds of milliseconds. This is just not practical for our use case, since we want to process hundreds of such PoW per second... I also did not manage to find a benchmark of argon2 calls for different CPU/GPU/FPGA configurations.}
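For reference, here is a rough way to reproduce such per-call timings using the argon2-cffi Python binding (the parameter values below are purely illustrative, not recommendations):

  import time
  from argon2.low_level import hash_secret_raw, Type  # pip install argon2-cffi

  def ms_per_call(time_cost=1, memory_cost=256, parallelism=1, rounds=100):
      # Times 'rounds' raw argon2id calls and returns the mean cost in
      # milliseconds. memory_cost is in KiB.
      start = time.perf_counter()
      for i in range(rounds):
          hash_secret_raw(secret=i.to_bytes(4, 'big'), salt=b'\x00' * 16,
                          time_cost=time_cost, memory_cost=memory_cost,
                          parallelism=parallelism, hash_len=32, type=Type.ID)
      return (time.perf_counter() - start) * 1000 / rounds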
7. Discussion
7.1. UX
This proposal has user facing UX consequences.
Here are some UX improvements that don't need user-input:
- Primarily, there should be a way for Tor Browser to display to users that additional time (and resources) will be needed to access a service that is under attack. Depending on the design of the system, it might even be possible to estimate how much time it will take.
And here are a few UX approaches that will need user-input and come with increasing engineering difficulty. Ideally this proposal will not need user-input and the default behavior should work for almost all cases.
a) Tor Browser needs a "range field" which the user can use to specify how much effort they want to spend in PoW if this ever occurs while they are browsing. The ranges could be from "Easy" to "Difficult", or we could try to estimate time using an average computer. This setting is in the Tor Browser settings and users need to find it.
b) We start with a default effort setting, and then we use the new onion errors (see #19251) to estimate when an onion service connection has failed because of DoS, and only then we present the user a "range field" which they can set dynamically. Detecting when an onion service connection has failed because of DoS can be hard because of the lack of feedback (see [CLIENT_BEHAVIOR])
c) We start with a default effort setting, and if things fail we automatically try to figure out an effort setting that will work for the user by doing some trial-and-error connections with different effort values. Until the connection succeeds we present a "Service is overwhelmed, please wait" message to the user.
7.2. Future work [FUTURE_WORK]
7.2.1. Incremental improvements to this proposal
There are various improvements that can be done in this proposal, and while we are trying to keep this v1 version simple, we need to keep the design extensible so that we can build more features into it. In particular:
- End-to-end introduction ACKs
This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs many of the problems from [CLIENT_BEHAVIOR] would be alleviated. The problem here is that end-to-end ACKs require modifications on the introduction point code and a network update, which is a lengthy process.
- Multithreading scheduler
Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support we could handle a much greater amount of introduction requests per second.
7.2.2. Future designs [FUTURE_DESIGNS]
This is just the beginning in DoS defences for Tor and there are various future designs and schemes that we can investigate. Here is a brief summary of these:
"More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 cells. See #31223 for more details.
"Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See [REF_CREDS] for more details.
"PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.
All of the above approaches are much more complicated than this proposal, and hence we want to start easy before we get into more serious projects.
7.3. Environment
We love the environment! We are concerned about how PoW schemes can waste energy by doing useless hash iterations. Here are a few reasons we still decided to pursue a PoW approach here:
"We are not making things worse" -- DoS attacks are already happening and attackers are already burning energy to carry them out both on the attacker side, on the service side and on the network side. We think that asking legitimate clients to carry out PoW computations is not gonna affect the equation too much, since an attacker right now can very quickly cause the same damage that hundreds of legitimate clients do a whole day.
"We hope to make things better" -- The hope is that proposals like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see [CATCH22] for a similar argument.
8. Acknowledgements
Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.
Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.
9. References
[REF_ARGON2]: https://github.com/P-H-C/phc-winner-argon2/blob/master/argon2-specs.pdf
              https://password-hashing.net/#argon2
[REF_TABLE]: The table is based on the script below plus some manual editing for readability:
             https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04
[REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01...
[REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html
[REF_TARGET]: https://en.bitcoin.it/wiki/Target
[REF_TLS]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
           https://tools.ietf.org/id/draft-nir-tls-puzzles-00.html
           https://tools.ietf.org/html/draft-ietf-ipsecme-ddos-protection-10
[REF_TLS_1]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
Hi George,
Thanks for the update.
On Wed, Jun 10, 2020 at 2:05 PM George Kadianakis desnacked@riseup.net wrote:
Tevador, thanks a lot for your tailored work on equix. This is fantastic. I have a question that I don't see addressed in your very well written README. In your initial email, we discuss how Equihash does not have good GPU resistance: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
Since equix is using Equihash isn't this gonna be a problem here too? I'm not too worried about ASIC resistance since I doubt someone is gonna build ASICs for this problem just yet, but script kiddies with their CS:GO graphics cards attacking equix is something I'm concerned about. I bet you have thought of this, so I'm wondering what's your take here.
Equihash runs much faster on GPUs only if the memory requirements exceed the size of the CPU cache. This is the case for most Equihash variants that are in use by cryptocurrencies (e.g. 200,9 and 144,5), but doesn't apply to Equi-X, which uses only 2 MB.
The GPU resistance of Equi-X is based on 2 facts:

1) Each solution attempt uses a different hash function. GPUs cannot compile new kernels fast enough (it would require >1000 kernels per second), so they have to run in emulator mode, which is much slower. GPUs are also impacted by thread divergence.

2) The entire sorting phase fits into CPU cache, so CPUs can benefit from memory bandwidth comparable to GPUs (~500 GB/s).
In tevador's initial mail, they point out how the cell should include POW_EFFORT and that we should specify a "minimum effort" value instead of just inserting any effort in the pqueue. I can understand how this can have benefits (like the June discussion between tevador and yoehoduv) but I'm also concerned that this can make us more vulnerable to [ATTACK_BOTTOM_HALF] types of attacks, by completely dropping introduction requests instead of queueing them for an abstract future. I wouldn't be surprised if my concerns are invalid and harmful here. Does anyone have intuition?
I don't see any downsides to including the PoW effort in the cell and specifying the minimum effort. The minimum effort is needed to reduce the verification overhead and to ensure that the queue doesn't grow indefinitely. If the effort of the introduction request is lower than the minimum, the service can simply treat it like a request without any PoW (saving a verification call). The exact outcome would depend on the circumstances; normally the request would go to the back of the queue.
I suggest a minimum effort equivalent to ~1 second of CPU time and adjust it upward if the queue is full. This is more efficient than trimming.
The size of the queue should be enough to handle short attack bursts without dropping any requests. I'd say that a reasonable maximum queue size is {bottom half throughput} * {number of seconds the client will wait before retrying}. There is no point in queueing up more requests than this because the client will have already given up on the request by the earliest time it could be serviced.
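As a purely illustrative application of that formula: with the 200 introductions per second assumed elsewhere in this proposal (20 per 100ms tick) and clients that wait 'hs-pow-desc-fetch-rate-limit' = 600 seconds before refetching and retrying, the maximum useful queue size would be about 200 * 600 = 120000 requests.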
- tevador suggests we use two seeds, and always accept introductions with the previous seed. I agree this is a good idea, and it's not as complex as I originally thought (I have trauma from the v3 design where we try to support multiple time periods at the same time). However, because this doubles the verification time, I decided to wait for dgoulet's scheduler numbers and until the PoW function is finalized to understand if we can afford the verification overhead.
There is no verification overhead if the seed is included in the request. If additional 32 bytes are too much, the request can include e.g. the first 4 bytes of the seed. This is enough for the service to select the correct seed from the two active ones. The chance that two subsequent seeds have the same first 32 bits is negligible (and can be even avoided completely).
- Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances of [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast hash with a really low difficulty, and having the verifier check that fast hash PoW first. This means that an attacker trying to flood us with invalid PoW would need to do some work for every PoW instead of it being free. This is a decision we should take at the end, after we do some number crunching and see where we are at in terms of verification time and attack models.
This trick is mostly relevant to slow-to-verify algorithms, but can be also applied to Equi-X by reordering the server algorithm steps:
Input: C, N, E, S

1) Check that C is a valid seed (there may be multiple seeds active at a time).
2) Check that E is above the minimum effort.
3) Check that N hasn't been used before with C.
4) Calculate R = blake2b(C || N || E || S).
5) Check R * E <= UINT32_MAX.
6) Check equix_verify(C || N || E, S) == EQUIX_OK.
7) Put the request in the queue with weight E.
Simple spam attacks will fail at step 5, avoiding the call to equix_verify.
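Here is a minimal Python sketch of that ordering. Note the assumptions: equix_verify() stands in for a hypothetical binding to the Equi-X library, E is encoded as 4 big-endian bytes like the POW_EFFORT field, and R is taken to be the first 4 bytes of the blake2b digest (the exact truncation is not specified above):

  import hashlib

  UINT32_MAX = 2**32 - 1

  def handle_request(C, N, E, S, active_seeds, replay_cache, pqueue,
                     min_effort, equix_verify):
      # Cheap checks run first; the expensive equix_verify() call comes last.
      if C not in active_seeds:              # 1) valid seed?
          return False
      if E < min_effort:                     # 2) declared effort high enough?
          return False
      if (C, N) in replay_cache:             # 3) nonce replay?
          return False
      challenge = C + N + E.to_bytes(4, 'big')
      # 4) One fast blake2b hash gates the expensive verification.
      R = int.from_bytes(hashlib.blake2b(challenge + S).digest()[:4], 'big')
      if R * E > UINT32_MAX:                 # 5) spam fails here, cheaply
          return False
      if not equix_verify(challenge, S):     # 6) full Equi-X verification
          return False
      replay_cache.add((C, N))
      pqueue.append((E, (C, N, S)))          # 7) queue weighted by effort E
      return True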
However, I would not overly rely on this feature because even a single GPU can provide enough fast hashes (5-10 GH/s) for [ATTACK_TOP_HALF] on schemes like Yespower, Argon2 or RandomX. It works for Ethash because the required effort is determined by the computing power of the entire Ethereum network, so the attacker would need to compute billions of fast hashes even to pass the first check.
Hello there,
here is another round of PoW revisions: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:

- Actually used tevador's EquiX scheme as our PoW scheme for now. This is still tentative, but I needed some ingredients to cook with so I went for it.
- Fold in David's performance measurements and use them to get some guesstimates on the default PoW difficulty etc.
- Enable overlapping seed system.
- Enrich the attack section of the proposal some more.
- Attempt to fix an effort estimation attack pointed out by tevador.
- Added a bunch of "BLOCKER" tags around the proposal for things that we need to figure out or at least have some good intuition about if we want to have guarantees that the proposal can work before we start implementing.
Here is what needs to happen next:
- David's performance measurements have been really useful, but they open a bunch of questions on auxiliary overheads. We are now performing more experiments to confirm the performance numbers we got and make sure we are not overshooting. I noted these issues down as BLOCKER in the proposal. While doing so we also found a pretty serious bug with our scheduler that we are trying to fix: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006
- Did not have time to think about the priority queue's max size. I added a BLOCKER about this in the [HANDLE_QUEUE] section.
- Did not have time to think about a minimum effort feature on the queue. I guess this also depends on the scheduler.
- Need to think more about the effort estimation logic and make sure that it can't backfire big time.
- Need to kill all the XXXs, TODOs and BLOCKERs.
Also, tevador let me know if you'd like me to add you as a co-author on the proposal based on all your great feedback so far.
This is looking more and more plausible but let's wait for more data before we seal the deal.
Thanks for all the feedback and looking forward to more!
---
Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft
0. Abstract
This proposal aims to thwart introduction flooding DoS attacks by introducing a dynamic Proof-Of-Work protocol that occurs over introduction circuits.
1. Motivation
So far our attempts at limiting the impact of introduction flooding DoS attacks on onion services have been focused on horizontal scaling with Onionbalance, optimizing the CPU usage of Tor and applying congestion control using rate limiting. While these measures move the goalpost forward, a core problem with onion service DoS is that building rendezvous circuits is a costly procedure both for the service and for the network. For more information on the limitations of rate-limiting when defending against DDoS, see [REF_TLS_1].
If we ever hope to have truly reachable global onion services, we need to make it harder for attackers to overload the service with introduction requests. This proposal achieves this by allowing onion services to specify an optional dynamic proof-of-work scheme that its clients need to participate in if they want to get served.
With the right parameters, this proof-of-work scheme acts as a gatekeeper to block amplification attacks by attackers while letting legitimate clients through.
1.1. Related work
For a similar concept, see the three internet drafts that have been proposed for defending against TLS-based DDoS attacks using client puzzles [REF_TLS].
1.2. Threat model [THREAT_MODEL]
1.2.1. Attacker profiles [ATTACKER_MODEL]
This proposal is written to thwart specific attackers. A simple PoW proposal cannot defend against each and every DoS attack on the Internet, but there are adversary models we can defend against.
Let's start with some adversary profiles:
"The script-kiddie"
The script-kiddie has a single computer and pushes it to its limits. Perhaps it also has a VPS and a pwned server. We are talking about an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We consider the total cost for this attacker to be zero dollars.
"The small botnet"
The small botnet is a bunch of computers lined up to do an introduction flooding attack. Assuming 500 medium-range computers, we are talking about an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider the upfront cost for this attacker to be about $400.
"The large botnet"
The large botnet is a serious operation with many thousands of computers organized to do this attack. Assuming 100k medium-range computers, we are talking about an attacker with total access to 200 THz of CPU and 200 TB of RAM. The upfront cost for this attacker is about $36k.
We hope that this proposal can help us defend against the script-kiddie attacker and small botnets. To defend against a large botnet we would need more tools at our disposal (see [FUTURE_DESIGNS]).
1.2.2. User profiles [USER_MODEL]
We have attackers and we have users. Here are a few user profiles:
"The standard web user"
This is a standard laptop/desktop user who is trying to browse the web. They don't know how these defences work and they don't care to configure or tweak them. They are gonna use the default values and if the site doesn't load, they are gonna close their browser and be sad at Tor. They run a 2 GHz computer with 4 GB of RAM.
"The motivated user"
This is a user that really wants to reach their destination. They don't care about the journey; they just want to get there. They know what's going on; they are willing to tweak the default values and make their computer do expensive multi-minute PoW computations to get where they want to be.
"The mobile user"
This is a motivated user on a mobile phone. Even tho they want to read the news article, they don't have much leeway on stressing their machine to do more computation.
We hope that this proposal will allow the motivated user to always connect where they want to connect to, and also give more chances to the other user groups to reach the destination.
1.2.3. The DoS Catch-22 [CATCH22]
This proposal is not perfect and it does not cover all the use cases. Still, we think that by covering some use cases and giving reachability to the people who really need it, we will severely demotivate the attackers from continuing the DoS attacks and hence stop the DoS threat all together. Furthermore, by increasing the cost to launch a DoS attack, a big class of DoS attackers will disappear from the map, since the expected ROI will decrease.
2. System Overview
2.1. Tor protocol overview
                                       +----------------------------------+
                                       |                                  |
+-------+ INTRO1  +-----------+ INTRO2 +--------+                         |
|Client |-------->|Intro Point|------->|  PoW   |-----------+             |
+-------+         +-----------+        |Verifier|           |             |
                                       +--------+           |             |
                                       |                    |             |
                                       |                    |             |
                                       |         +----------v---------+   |
                                       |         |Intro Priority Queue|   |
                                       +---------+--------------------+---+
                                                    |      |      |
                                        Rendezvous  |      |      |
                                        circuits    |      |      |
                                                    v      v      v
The proof-of-work scheme specified in this proposal takes place during the introduction phase of the onion service protocol.
The system described in this proposal is not meant to be on all the time, and should only be enabled by services when under duress. The percentage of clients receiving puzzles can also be configured based on the load of the service.
In summary, the following steps are taken for the protocol to complete:
1) Service encodes PoW parameters in descriptor [DESC_POW]
2) Client fetches descriptor and computes PoW [CLIENT_POW]
3) Client completes PoW and sends results in INTRO1 cell [INTRO1_POW]
4) Service verifies PoW and queues introduction based on PoW effort [SERVICE_VERIFY]
2.2. Proof-of-work overview
2.2.1. Primitives
For our proof-of-work function we will use the 'equix' scheme by tevador [REF_EQUIX]. Equix is an asymmetric PoW function based on Equihash<60,3>. It features lightning-fast verification speed, and also aims to minimize the asymmetry between CPU and GPU performance. Furthermore, it's designed for this particular use-case and hence cryptocurrency miners are not incentivized to make optimized ASICs for it.
{TODO: define verification/proof interface.}
We tune equix in section [EQUIX_TUNING].
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. Hence, instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the higher effort they put into their proof-of-work. This is similar to how proof-of-stake works but instead of staking coins, you stake work.
The benefit here is that legitimate clients who really care about getting access can spend a big amount of effort into their PoW computation, which should guarantee access to the service given reasonable adversary models. See [PARAM_TUNING] for more details about these guarantees and tradeoffs.
As a way to improve reachability and UX, the service tries to estimate the effort needed for clients to get access at any given time and places it in the descriptor. See [EFFORT_ESTIMATION] for more details.
2.2.3. PoW effort
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so we define a function: unsigned effort(uint8_t *token) which takes as its argument a hash output token, interprets it as a bitstring, and returns the quotient of dividing a bitstring of 1s by it.
So for example:

  effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101

or the same in decimal:

  effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort. This is in contrast to the (cheaper) exponential effort definition of taking the number of leading zero bits.
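As a sketch, assuming the hash output is interpreted as a big-endian integer:

  def effort(token: bytes) -> int:
      # All-ones bitstring of the same length divided by the hash value,
      # i.e. the expected number of hash attempts behind this output.
      max_result = (1 << (8 * len(token))) - 1
      return max_result // max(int.from_bytes(token, 'big'), 1)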
3. Protocol specification
3.1. Service encodes PoW parameters in descriptor [DESC_POW]
This whole protocol starts with the service encoding the PoW parameters in the 'encrypted' (inner) part of the v3 descriptor. As follows:
"pow-params" SP type SP seed-b64 SP expiration-time NL
[At most once]
type: The type of PoW system used. We call the one specified here "v1"
seed-b64: A random seed that should be used as the input to the PoW hash function. Should be 32 random bytes encoded in base64 without trailing padding.
suggested-effort: An unsigned integer specifying an effort value that clients should aim for when contacting the service. See [EFFORT_ESTIMATION] for more details here.
expiration-time: A timestamp in "YYYY-MM-DD SP HH:MM:SS" format after which the above seed expires and is no longer valid as the input for PoW. It's needed so that the size of our replay cache does not grow infinitely. It should be set to RAND_TIME(now+7200, 900) seconds.
The service should refresh its seed when expiration-time passes. The service SHOULD keep its previous seed in memory and accept PoWs using it to avoid race conditions with clients that have an old seed. The service SHOULD avoid generating two consecutive seeds that share a common 4-byte prefix. See [INTRO1_POW] for more info.
By RAND_TIME(ts, interval) we mean a time between ts-interval and ts, chosen uniformly at random.
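Or, as a one-line sketch:

  import random

  def rand_time(ts, interval):
      # A time between ts-interval and ts, chosen uniformly at random.
      return ts - random.randrange(interval + 1)

  # e.g. expiration_time = rand_time(now + 7200, 900)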
3.2. Client fetches descriptor and computes PoW [CLIENT_POW]
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
The client parses the descriptor and extracts the PoW parameters. It makes sure that the <expiration-time> has not passed; if it has, the client needs to fetch a new descriptor.
The client should then extract the <suggested-effort> field to configure its PoW 'target' (see [REF_TARGET]). The client SHOULD NOT accept 'target' values that will cause an infinite PoW computation. {XXX: How to enforce this?}
To complete the PoW the client follows the following logic:
a) Client generates 'nonce' as 16 random bytes.
b) Client derives 'seed' by decoding 'seed-b64'.
c) Client derives 'labeled_seed = seed + "TorV1PoW"'.
d) Client computes hash_output = XXX_POW(labeled_seed, nonce).
e) Client checks if effort(hash_output) >= target.
   e1) If yes, success! The client uses 'hash_output' as the puzzle solution and 'nonce' and 'seed' as its inputs.
   e2) If no, fail! The client interprets 'nonce' as a big-endian integer, increments it by one, and goes back to step (d).
At the end of the above procedure, the client should have a triplet (hash_output, seed, nonce) that can be used as the answer to the PoW puzzle. How quickly this happens depends solely on the 'target' parameter.
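The loop above, sketched in Python (pow_function stands in for the not-yet-finalized XXX_POW primitive, and effort() is the function from section 2.2.3):

  import os

  def solve(seed: bytes, target: int, pow_function):
      labeled_seed = seed + b"TorV1PoW"                         # step (c)
      nonce = int.from_bytes(os.urandom(16), 'big')             # step (a)
      while True:
          nonce_bytes = nonce.to_bytes(16, 'big')
          hash_output = pow_function(labeled_seed, nonce_bytes) # step (d)
          if effort(hash_output) >= target:                     # step (e)
              return hash_output, seed, nonce_bytes             # (e1) success
          nonce = (nonce + 1) % (1 << 128)                      # (e2) retry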
3.3. Client sends PoW in INTRO1 cell [INTRO1_POW]
Now that the client has an answer to the puzzle it's time to encode it into an INTRODUCE1 cell. To do so the client adds an extension to the encrypted portion of the INTRODUCE1 cell by using the EXTENSIONS field (see [PROCESS_INTRO2] section in rend-spec-v3.txt). The encrypted portion of the INTRODUCE1 cell only gets read by the onion service and is ignored by the introduction point.
We propose a new EXT_FIELD_TYPE value:
[01] -- PROOF_OF_WORK
The EXT_FIELD content format is:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_SEED [4 bytes]
where:
POW_VERSION is 1 for the protocol specified in this proposal
POW_NONCE is 'nonce' from the section above
POW_SEED is the first 4 bytes of the seed used
This will increase the INTRODUCE1 payload size by 23 bytes since the extension type and length take 2 extra bytes, the N_EXTENSIONS field is always present and currently set to 0, and the EXT_FIELD is 21 bytes (1 + 16 + 4). According to ticket #33650, INTRODUCE1 cells currently have more than 200 bytes available.
3.4. Service verifies PoW and handles the introduction [SERVICE_VERIFY]
When a service receives an INTRODUCE1 with the PROOF_OF_WORK extension, it should check its configuration on whether proof-of-work is required to complete the introduction. If it's not required, the extension SHOULD be ignored. If it is required, the service follows the procedure detailed in this section.
If the service requires the PROOF_OF_WORK extension but received an INTRODUCE1 cell without any embedded proof-of-work, the service SHOULD consider this cell as a zero-effort introduction for the purposes of the priority queue (see section [INTRO_QUEUE]).
3.4.1. PoW verification [POW_VERIFY]
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
1) Use POW_SEED to figure out whether the client is using the current or the previous seed.
2) Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
3) Verify that hash_output =?= XXX_POW(seed, nonce).
If any of these steps fail the service MUST ignore this introduction request and abort the protocol.
In this proposal we call the above steps the "top half" of introduction handling. If all the steps of the "top half" have passed, then the circuit is added to the introduction queue as detailed in section [INTRO_QUEUE].
3.4.1.1. Replay protection [REPLAY_PROTECTION]
The service MUST NOT accept introduction requests with the same (seed, nonce) tuple. For this reason a replay protection mechanism must be employed.
The simplest way is to use a simple hash table to check whether a (seed, nonce) tuple has been used before for the active duration of a seed. Depending on how long a seed stays active this might be a viable solution with reasonable memory/time overhead.
If there is a worry that we might get too many introductions during the lifetime of a seed, we can use a Bloom filter as our replay cache mechanism. The probabilistic nature of Bloom filters means that sometimes we will flag some connections as replays even if they are not, with this false positive probability increasing as the number of entries increases. However, with the right parameter tuning this probability should be negligible and well handled by clients. {TODO: Figure out bloom filter parameters}
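For illustration, a toy Bloom filter along those lines (the sizes below are arbitrary placeholders, not tuned parameters):

  import hashlib

  class BloomFilter:
      def __init__(self, num_bits=2**23, num_hashes=4):
          self.num_bits = num_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(num_bits // 8)

      def _positions(self, item: bytes):
          # Derive k independent bit positions with salted blake2b.
          for i in range(self.num_hashes):
              digest = hashlib.blake2b(item, salt=bytes([i])).digest()
              yield int.from_bytes(digest[:8], 'big') % self.num_bits

      def add(self, item: bytes):
          for pos in self._positions(item):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, item: bytes):
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(item))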
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue [ADD_QUEUE]
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into rendezvous, queues it and prioritizes it based on how much effort was devoted by the client to PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service uses the effort() function with hash_output as its input, and uses the output to place requests into the right position of the priority_queue: The bigger the effort, the more priority it gets in the queue. If two elements have the same effort, the older one has priority over the newer one.
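A minimal sketch of this ordering, using a heap keyed on (-effort, arrival index):

  import heapq
  import itertools

  arrival = itertools.count()   # monotonically increasing insertion index

  def queue_add(pqueue, effort, request):
      # heapq is a min-heap, so store -effort: the highest effort pops first.
      # Ties on effort break on the arrival index, so older requests win.
      heapq.heappush(pqueue, (-effort, next(arrival), request))

  def queue_pop(pqueue):
      neg_effort, index, request = heapq.heappop(pqueue)
      return request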
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue. We call this part of introduction handling the "bottom half" because most of the computation happens in this stage. For a description of how we expect such a system to work in Tor, see [TOR_SCHEDULER] section.
{TODO: BLOCKER: What's the max size of the queue? Do we trim it, or do we just stop adding new requests when it reaches max size? Can we use WRED? Trimming is currently used in [EFFORT_ESTIMATION], so if we don't do it we need to find a different way to estimate effort. See tevador's [REF_TEVADOR_2] email.}
3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
The service starts with a default suggested-effort value of 5000 (see [EQUIX_DIFFICULTY] section for more info).
Then during its operation the service continuously keeps track of the received PoW cell efforts to inform its clients of the effort they should put in their introduction to get service. The service informs the clients by using the <suggested-effort> field in the descriptor.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service adds the request's effort to a sorted 'handled-requests-efforts' list. Every time the service trims its priority queue it adds the median of the trimmed requests' efforts to a sorted 'trimmed-requests-median-efforts' list.
Then every 'hs-pow-desc-upload-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 300 seconds) and while the DoS feature is enabled, the service updates its <suggested-effort> value as follows:

- If the service's current <suggested-effort> value is lower than the median of the 'trimmed-requests-median-efforts' list, then set <suggested-effort> to that median (i.e. increase suggested-effort).
- *Else* if the service's current <suggested-effort> value is higher than the median of the 'handled-requests-efforts' list, then set <suggested-effort> to that median (i.e. lower suggested-effort).
- Either way, clear 'handled-requests-efforts' and 'trimmed-requests-median-efforts'.
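A sketch of one update round (the list bookkeeping from the paragraph above is assumed to happen elsewhere):

  import statistics

  def update_suggested_effort(suggested, handled_efforts, trimmed_medians):
      # Raising the effort takes precedence over lowering it.
      if trimmed_medians and suggested < statistics.median(trimmed_medians):
          suggested = statistics.median(trimmed_medians)
      elif handled_efforts and suggested > statistics.median(handled_efforts):
          suggested = statistics.median(handled_efforts)
      handled_efforts.clear()        # either way, start fresh
      trimmed_medians.clear()
      return suggested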
The above two operations are meant to balance the suggested effort based on the requests residing in the priority queue. If the priority queue is filled with high-effort requests, make the suggested effort higher. And when all the high-effort requests get handled and the priority queue is back to normal operation, relax the suggested effort to lower levels.
Given the way the algorithm works above, priority is given to the operation that increases the suggested-effort. Also the values are taken as medians over a period of time to avoid [ATTACK_EFFORT] attacks where the attacker changes her behavior right before the descriptor upload to influence the <suggested-effort>.
{XXX: BLOCKER: Figure out if this system makes sense}
The suggested-effort is not a hard limit to the efforts that are accepted by the service, and it's only meant to serve as a guideline for clients to reduce the number of unsuccessful requests that get to the service. The service still adds requests with lower effort than suggested-effort to the priority queue in [ADD_QUEUE].
{XXX: Another approach would be to use the maximum value instead of the median here. This would give a more surefire effort estimation, but it could also cause attacks where an adversary spends 1 hour to make a single introduction with a huge PoW and then denies access to all clients for 5 minutes.}
{XXX: Does this mean that this system can auto-enable and auto-disable the DoS subsystem with reasonable accuracy?}
3.4.3.1. Updating descriptor with new suggested effort
Every 'hs-pow-desc-upload-rate-limit' seconds the service should upload a new descriptor with a new suggested-effort value.
The service should avoid uploading descriptors too often to avoid overwhelming the HSDirs. The service SHOULD NOT upload descriptors more often than once every 'hs-pow-desc-upload-rate-limit' seconds. The service SHOULD NOT upload a new descriptor if the suggested-effort value changes by less than 15%.
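The 15% rule is just a relative-change check, e.g.:

  def should_upload(old_effort, new_effort):
      # Skip the upload when the suggested effort moved by less than 15%.
      return abs(new_effort - old_effort) >= 0.15 * old_effort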
{XXX: Is this too often? Perhaps we can set different limits for when the difficulty goes up and different for when it goes down. It's more important to update the descriptor when the difficulty goes up.}
{XXX: What attacks are possible here? Can the attacker intentionally hit this rate-limit and then influence the suggested effort so that clients do not learn about the new effort?}
4. Client behavior [CLIENT_BEHAVIOR]
This proposal introduces a bunch of new ways in which a legitimate client can fail to reach the onion service.
Furthermore, there is currently no end-to-end way for the onion service to inform the client that the introduction failed. The INTRO_ACK cell is not end-to-end (it's from the introduction point to the client) and hence it does not allow the service to inform the client that the rendezvous is never gonna occur.
For this reason we need to define some client behaviors to work around these issues.
4.1. Clients handling timeouts [CLIENT_TIMEOUT]
Alice can fail to reach the onion service if her introduction request gets trimmed off the priority queue in [HANDLE_QUEUE], or if the service does not get through its priority queue in time and the connection times out.
{XXX: BLOCKER: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
This section presents a heuristic method for the client to get service even in such scenarios.
If the rendezvous request times out, the client SHOULD fetch a new descriptor for the service to make sure that it's using the right suggested-effort for the PoW and the right PoW seed. The client SHOULD NOT fetch service descriptors more often than every 'hs-pow-desc-fetch-rate-limit' seconds (which is controlled through a consensus parameter and has a default value of 600 seconds).
{XXX: Is this too rare? Too often?}
When the client fetches a new descriptor, it should try connecting to the service with the new suggested-effort and PoW seed. If that doesn't work, it should double the effort and retry. The client should keep on doubling-and-retrying until it manages to get service, or it's able to fetch a new descriptor again.
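A sketch of this retry logic (all three callables are placeholders for tor-internal machinery; refetch_allowed() enforces 'hs-pow-desc-fetch-rate-limit'):

  def reach_service(fetch_descriptor, attempt_rendezvous, refetch_allowed):
      desc = fetch_descriptor()
      effort = desc.suggested_effort
      while True:
          if attempt_rendezvous(desc.seed, effort):
              return True                    # got service
          if refetch_allowed():
              desc = fetch_descriptor()      # pick up new seed and effort
              effort = desc.suggested_effort
          else:
              effort *= 2                    # double-and-retry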
{XXX: This means that the client will keep on spinning and doubling-and-retrying for a service under this situation. There will never be a "Client connection timed out" page for the user. Is this good? Is this bad? Should we stop doubling-and-retrying after some iterations? Or should we throw a custom error page to the user, and ask the user to stop spinning whenever they want?}
4.2. Other descriptor issues
Another race condition here is if the service enables PoW, while a client has a cached descriptor. How will the client notice that PoW is needed? Does it need to fetch a new descriptor? Should there be another feedback mechanism?
5. Attacker strategies [ATTACK_META]
Now that we defined our protocol we need to start tweaking the various knobs. But before we can do that, we first need to understand a few high-level attacker strategies to see what we are fighting against.
5.1.1. Overwhelm PoW verification (aka "Overwhelm top half") [ATTACK_TOP_HALF]
A basic attack here is the adversary spamming with bogus INTRO cells so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the [POW_VERIFY] section.
That's why we need the PoW algorithm to have a cheap verification time so that this attack is not possible: we tune this PoW parameter in section [POW_TUNING_VERIFICATION].
5.1.2. Overwhelm rendezvous capacity (aka "Overwhelm bottom half") [ATTACK_BOTTOM_HALF]
Given the way the introduction queue works (see [HANDLE_QUEUE]), a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the procedure in the [HANDLE_QUEUE] section.
To do so, the attacker would have to send at least 20 high-effort introduction cells every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user" (see [USER_MODEL]).
An easier attack for the adversary is the same strategy but with introduction cells that are all above the comfortable level of "the standard user" (see [USER_MODEL]). This would block out all standard users and only allow motivated users to pass.
5.1.3. Gaming the effort estimation logic [ATTACK_EFFORT]
Another way to beat this system is for the attacker to game the effort estimation logic (see [EFFORT_ESTIMATION]). Essentially, there are two attacks that we are trying to avoid:
- Attacker sets descriptor suggested-effort to a very high value, effectively making it impossible for most clients to produce a PoW token in a reasonable timeframe.
- Attacker sets descriptor suggested-effort to a very small value so that most clients aim for a small value while the attacker comfortably launches an [ATTACK_BOTTOM_HALF] using medium effort PoW (see [REF_TEVADOR_1]).
5.1.4. Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. The current scheme gives the attackers 4 hours to launch this attack since each seed lasts 2 hours and the service caches two seeds.
An attacker with this attack might be aiming to DoS the service for a limited amount of time, or to cause an [ATTACK_EFFORT] attack.
6. Parameter tuning [POW_TUNING]
There are various parameters in this PoW system that need to be tuned:
We first start by tuning the time it takes to verify a PoW token. We do this first because it's fundamental to the performance of onion services and can turn into a DoS vector of its own. We will do this tuning in a way that's agnostic to the chosen PoW function.
We will then move towards analyzing the default difficulty setting for our PoW system. That defines the expected time for clients to succeed in our system, and the expected time for attackers to overwhelm our system. Same as above we will do this in a way that's agnostic to the chosen PoW function.
Finally, using those two pieces we will tune our PoW function and pick the right default difficulty setting. At the end of this section we will know the resources that an attacker needs to overwhelm the onion service, the resources that the service needs to verify introduction requests, and the resources that legitimate clients need to get to the onion service.
6.1. PoW verification [POW_TUNING_VERIFICATION]
Verifying a PoW token is the first thing that a service does when it receives an INTRODUCE2 cell and it's detailed in section [POW_VERIFY]. This verification happens during the "top half" part of the process. Every millisecond spent verifying PoW adds overhead to the already existing "top half" part of handling an introduction cell. Hence we should be careful to add minimal overhead here so that we don't enable attacks like [ATTACK_TOP_HALF].
During our performance measurements in [TOR_MEASUREMENTS] we learned that the "top half" takes about 0.26 msecs on average, without doing any sort of PoW verification. Using that value we compute the following table, which describes the number of cells we can queue per second (aka times we can perform the "top half" process) for different values of PoW verification time:
+---------------------+-----------------------+--------------+
|PoW Verification Time| Total "top half" time | Cells Queued |
|                     |                       |  per second  |
|---------------------|-----------------------|--------------|
|        0 msec       |       0.26 msec       |     3846     |
|        1 msec       |       1.26 msec       |      793     |
|        2 msec       |       2.26 msec       |      442     |
|        3 msec       |       3.26 msec       |      306     |
|        4 msec       |       4.26 msec       |      234     |
|        5 msec       |       5.26 msec       |      190     |
|        6 msec       |       6.26 msec       |      159     |
|        7 msec       |       7.26 msec       |      137     |
|        8 msec       |       8.26 msec       |      121     |
|        9 msec       |       9.26 msec       |      107     |
|       10 msec       |      10.26 msec       |       97     |
+---------------------+-----------------------+--------------+
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 793 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
- For a PoW function with a 2ms verification time, an attacker needs to send 442 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 97 dummy introduction cells per second to succeed in a [ATTACK_TOP_HALF] attack.
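The rows above are just a naive division, i.e.:

  def cells_queued_per_second(pow_verify_ms, top_half_ms=0.26):
      # Naive extrapolation: a core doing nothing but "top half" processing.
      return int(1000 / (top_half_ms + pow_verify_ms))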
Whether an attacker can succeed at that depends on the attacker's resources, but also on the network's capacity. {TODO: BLOCKER: Need to investigate this and see if it's possible}
Our purpose here is to have the smallest PoW verification overhead possible that also allows us to achieve all our other goals.
[Note that the table above is simply the result of a naive multiplication and does not take into account all the auxiliary overheads that happen every second like the time to invoke the mainloop, the bottom-half processes, or pretty much anything other than the "top-half" processing.
During our measurements the time to handle INTRODUCE2 cells dominates any other action time: There might be events that require a long processing time, but these are pretty infrequent (like uploading a new HS descriptor) and hence over a long time they smooth out. Hence extrapolating the total cells queued per second based on a single "top half" time seems good enough to get some initial intuition. That said, the values of "Cells Queued per second" from the table above are likely much smaller than displayed because of all the auxiliary overheads.]
{TODO: BLOCKER: Figure out auxiliary overheads in real scenario}
6.2. PoW difficulty analysis [POW_DIFFICULTY_ANALYSIS]
The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. An attacker who can get many successes per second can pull off a successful [ATTACK_BOTTOM_HALF] attack against our system.
In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.
Our system is dynamic but we still need a default difficulty setting that will define the metagame and be used for bootstrapping the system. The client and attacker can still aim higher or lower, but for UX purposes and for analysis purposes we do need to define a default difficulty.
6.2.1. Analysis based on adversary power
In this section we will try to do an analysis of PoW difficulty without using any sort of Tor-related or PoW-related benchmark numbers.
We created the table (see [REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success. The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has at her disposal, ranging from just 5 to 1000.
 ==========================================================================
|           Expected Time (in seconds) Per Success For One Machine        |
 ==========================================================================
|                                                                          |
|   Attacker successes        1       5      10      20      30      50   |
|      per second                                                          |
|                                                                          |
|                    5        5       1       0       0       0       0   |
|                   50       50      10       5       2       1       1   |
|                  100      100      20      10       5       3       2   |
|   Attacker       200      200      40      20      10       6       4   |
|     Boxes        300      300      60      30      15      10       6   |
|                  400      400      80      40      20      13       8   |
|                  500      500     100      50      25      16      10   |
|                 1000     1000     200     100      50      33      20   |
|                                                                          |
 ==========================================================================
Here is how you can read the table above:
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
- If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
- If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
- If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 second getting a single success.
With the above table we can create some profiles for default values of our PoW difficulty. So for example, we can use the last case as the default parameter for Tor Browser, and then create three more profiles for more expensive cases, scaling up to the first case, which would be the hardest since the client is expected to spend 15 minutes for a single introduction.
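For reference, the table cells follow from a simple division, consistent with the script in [REF_TABLE]:

  def expected_seconds_per_success(attacker_boxes, successes_per_second):
      # The whole botnet needs 1/successes_per_second seconds per success,
      # so a single legitimate box needs 'attacker_boxes' times that long.
      return attacker_boxes // successes_per_second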
6.2.2. Analysis based on Tor's performance [POW_DIFFICULTY_TOR]
To go deeper here, we can use the performance measurements from [TOR_MEASUREMENTS] to get a more specific intuition on the default difficulty. In particular, we learned that completely handling an introduction cell takes 5.55 msecs on average. Using that value, we can compute the following table, which describes the number of introduction cells we can handle per second for different PoW verification times:
+---------------------+-----------------------+--------------+
|PoW Verification Time| Total time to handle  | Cells handled|
|                     | introduction cell     | per second   |
|---------------------|-----------------------|--------------|
|       0 msec        |      5.55 msec        |    180.18    |
|       1 msec        |      6.55 msec        |    152.67    |
|       2 msec        |      7.55 msec        |    132.45    |
|       3 msec        |      8.55 msec        |    116.96    |
|       4 msec        |      9.55 msec        |    104.71    |
|       5 msec        |     10.55 msec        |     94.79    |
|       6 msec        |     11.55 msec        |     86.58    |
|       7 msec        |     12.55 msec        |     79.68    |
|       8 msec        |     13.55 msec        |     73.80    |
|       9 msec        |     14.55 msec        |     68.73    |
|      10 msec        |     15.55 msec        |     64.31    |
+---------------------+-----------------------+--------------+
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 152 high-effort introduction cells per second to succeed in a [ATTACK_BOTTOM_HALF] attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 64 high-effort introduction cells per second to succeed in a [ATTACK_BOTTOM_HALF] attack.
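The table is a direct consequence of the measured 5.55 msec baseline: v extra milliseconds of verification give 1000 / (5.55 + v) cells handled per second. A quick sketch:

    BASE_MS = 5.55  # measured mean time to fully handle an INTRODUCE2 cell
    for verify_ms in range(11):
        total_ms = BASE_MS + verify_ms
        print(f"{verify_ms:2d} msec verify -> {1000 / total_ms:6.2f} cells/sec")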
We can use this table to specify a default difficulty that won't allow our target adversary to succeed in an [ATTACK_BOTTOM_HALF] attack.
Of course, when it comes to this table, the same disclaimer as in section [POW_TUNING_VERIFICATION] applies. That is, the above table is just a theoretical extrapolation and we expect the real values to be much lower, since they depend on auxiliary processing overheads and on the network's capacity. {TODO: BLOCKER: Figure out auxiliary overheads here too}
6.3. Tuning equix difficulty [EQUIX_DIFFICULTY]
The above two sections did not depend on a particular PoW scheme. They gave us an intuition on the values we are aiming for in terms of verification speed and PoW difficulty. Now we need to make things concrete:
As described in section [EFFORT_ESTIMATION] we start the service with a default suggested-effort value of 5000. Given the benchmarks of EquiX [REF_EQUIX] this should take about 2 to 3 seconds on a modern CPU.
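As a rough sanity check of ours (using the ~2300 solutions/sec figure from the 16-thread Equi-X benchmarks reported later in this thread):

    SUGGESTED_EFFORT = 5000      # expected number of solution attempts
    SOLUTIONS_PER_SEC = 2300.0   # ~16-thread Equi-X rate on a modern CPU
    print(f"{SUGGESTED_EFFORT / SOLUTIONS_PER_SEC:.1f} seconds")  # ~2.2 s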
With this default difficulty setting and given the table in [POW_DIFFICULTY_ANALYSIS] this means that an attacker with 50 boxes will be able to get about 20 successful PoWs per second, and an attacker with 100 boxes about 40 successful PoWs per second.
Then using the table in [POW_DIFFICULTY_TOR] we can see that the number of attacker's successes is not enough to overwhelm the service through an [ATTACK_BOTTOM_HALF] attack. That is, an attacker would need to do about 152 introductions per second to overwhelm the service, whereas they can only do 40 with 100 boxes.
{TODO: BLOCKER: Still, the [POW_DIFFICULTY_TOR] disclaimer about auxiliary overhead very much applies here too and we need to figure it out. For now this section remains just to show the methodology, not to be firm on the numbers}
7. Discussion
7.1. UX
This proposal has user-facing UX consequences.
Here are some UX improvements that don't need user input:
- Primarily, there should be a way for Tor Browser to display to users that additional time (and resources) will be needed to access a service that is under attack. Depending on the design of the system, it might even be possible to estimate how much time it will take.
And here are a few UX approaches that need user input, in increasing order of engineering difficulty. Ideally this proposal will not need user input and the default behavior should work for almost all cases.
a) Tor Browser gets a "range field" which the user can use to specify how much effort they want to spend on PoW if this ever occurs while they are browsing. The range could go from "Easy" to "Difficult", or we could try to estimate time using an average computer. This setting lives in the Tor Browser settings, so users need to find it.
b) We start with a default effort setting, and then we use the new onion errors (see #19251) to estimate when an onion service connection has failed because of DoS, and only then we present the user a "range field" which they can set dynamically. Detecting when an onion service connection has failed because of DoS can be hard because of the lack of feedback (see [CLIENT_BEHAVIOR]).
c) We start with a default effort setting, and if things fail we automatically try to figure out an effort setting that will work for the user by doing some trial-and-error connections with different effort values. Until the connection succeeds we present a "Service is overwhelmed, please wait" message to the user.
7.2. Future work [FUTURE_WORK]
7.2.1. Incremental improvements to this proposal
There are various improvements that can be made to this proposal, and while we are trying to keep this v1 version simple, we need to keep the design extensible so that we can build more features into it. In particular:
- End-to-end introduction ACKs
This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs, many of the problems from [CLIENT_BEHAVIOR] would be alleviated. The problem here is that end-to-end ACKs require modifications to the introduction point code and a network update, which is a lengthy process.
- Multithreading scheduler
Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support, we could handle a much greater number of introduction requests per second.
7.2.2. Future designs [FUTURE_DESIGNS]
This is just the beginning of DoS defences for Tor, and there are various future designs and schemes that we can investigate. Here is a brief summary of these:
"More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 cells. See #31223 for more details.
"Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See [REF_CREDS] for more details.
"PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.
All of the above approaches are much more complicated than this proposal, and hence we want to start easy before we get into more serious projects.
7.3. Environment
We love the environment! We are concerned about how PoW schemes can waste energy by doing useless hash iterations. Here are a few reasons we still decided to pursue a PoW approach here:
"We are not making things worse" -- DoS attacks are already happening and attackers are already burning energy to carry them out both on the attacker side, on the service side and on the network side. We think that asking legitimate clients to carry out PoW computations is not gonna affect the equation too much, since an attacker right now can very quickly cause the same damage that hundreds of legitimate clients do a whole day.
"We hope to make things better" -- The hope is that proposals like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see [CATCH22] for a similar argument.
8. Acknowledgements
Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.
Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.
Appendix A. Little-t tor introduction scheduler
This section describes how we will implement this proposal in the "tor" software (little-t tor).
The following should be read as if tor is an onion service and thus the end point of all inbound data.
A.1. The Main Loop [MAIN_LOOP]
Tor uses libevent for its mainloop. For network I/O operations, a mainloop event is used to inform tor if it can read on a certain socket, or a connection object in tor.
From there, this event will empty the connection input buffer (inbuf) by extracting and processing a cell at a time. The mainloop is single threaded and thus each cell is handled sequentially.
Processing an INTRODUCE2 cell at the onion service means a series of operations (in order):
1) Unpack cell from inbuf to local buffer.
2) Decrypt cell (AES operations).
3) Parse cell header and process it depending on its RELAY_COMMAND.
4) INTRODUCE2 cell handling which means building a rendezvous circuit: i) Path selection ii) Launch circuit to first hop.
5) Return to mainloop event which essentially means back to step (1).
Tor will read at most 32 cells out of the inbuf per mainloop round.
A.2. Requirements for PoW
With this proposal, in order to prioritize cells by the amount of PoW work they have done, cells can _not_ be processed sequentially as described above.
Thus, we need a way to queue a certain number of cells, prioritize them and then process some cell(s) from the top of the queue (that is, the cells that have done the most PoW effort).
We thus require a new cell processing flow that is _not_ compatible with current tor design. The elements are:
- Validate PoW and place cells in a priority queue of INTRODUCE2 cells (as described in section [INTRO_QUEUE]).
- Defer "bottom half" INTRO2 cell processing for after cells have been queued into the priority queue.
A.3. Proposed scheduler [TOR_SCHEDULER]
The intuitive way to address the requirements of A.2 would be this simple and naive approach:
1) Mainloop: Empty inbuf INTRODUCE2 cells into priority queue
2) Process all cells in pqueue
3) Goto (1)
However, we are worried that handling all those cells before returning to the mainloop opens up possibilities of attack, since the priority queue is not gonna be kept up to date while we process all those cells. This means that we might spend lots of time dealing with introductions that don't deserve it. See [BOTTOM_HALF_SCHEDULER] for more details.
We thus propose to split the INTRODUCE2 handling into two different steps: "top half" and "bottom half" process, as also mentioned in [POW_VERIFY] section above.
A.3.1. Top half and bottom half scheduler
The top half process is responsible for queuing introductions into the priority queue as follows:
a) Unpack cell from inbuf to local buffer.
b) Decrypt cell (AES operations).
c) Parse INTRODUCE2 cell header and validate PoW.
d) Return to mainloop event, which essentially means back to step (a).
The top-half basically does all operations of section [MAIN_LOOP] except (4).
And then the bottom-half process is responsible for handling introductions and doing the rendezvous. To achieve this, we introduce a new mainloop event to process the priority queue _after_ the top-half event has completed. This new event does these operations sequentially:
a) Pop INTRODUCE2 cell from priority queue.
b) Parse and process INTRODUCE2 cell.
c) End event and yield back to mainloop.
A.3.2. Scheduling the bottom half process [BOTTOM_HALF_SCHEDULER]
The question now becomes: when should the "bottom half" event get triggered from the mainloop?
We propose that this event is scheduled when the network I/O event queues at least one cell into the priority queue. Then, as long as there is a cell in the queue, it re-schedules itself for immediate execution, meaning that it executes again at the next mainloop round.
The idea is to try to empty the queue as fast as possible, in order to provide a fast response time to an introduction request, but to always leave a chance for more cells to appear between cell processing by yielding back to the mainloop. With this we aim to always have the most up-to-date version of the priority queue when completing introductions: this way we prioritize clients that spent a lot of time and effort completing their PoW.
If the size of the queue drops to 0, it stops scheduling itself in order to not create a busy loop. The network I/O event will re-schedule it in time.
Notice that the proposed solution makes the service handle a single introduction request per mainloop event. However, when we do performance measurements we might learn that it's preferable to bump the number of cells from 1 to N, where N <= 32.
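For illustration, here is the shape of that self-rescheduling event in Python-flavored pseudocode (little-t tor is C; all names below are ours, not actual tor APIs):

    import heapq

    def bottom_half_event(pqueue, mainloop):
        if not pqueue:
            return  # queue drained: stop; the network I/O event reschedules us
        neg_effort, cell = heapq.heappop(pqueue)  # entries are (-effort, cell),
                                                  # so the max-effort cell pops first
        handle_introduce2(cell)   # hypothetical: parse cell, do the rendezvous
        mainloop.schedule(bottom_half_event)  # yield, run again next mainloop round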
A.4 Performance measurements
This section details the performance measurements we've done on tor.git for handling an INTRODUCE2 cell, and then discusses how much more CPU time we can add (for PoW validation) before it badly degrades our performance.
A.4.1 Tor measurements [TOR_MEASUREMENTS]
In this section we will derive measurement numbers for the "top half" and "bottom half" parts of handling an introduction cell.
These measurements have been done on tor.git at commit 80031db32abebaf4d0a91c01db258fcdbd54a471.
We've measured several sets of actions of the INTRODUCE2 cell handling process on an Intel(R) Xeon(R) CPU E5-2650 v4. Our service was accessed by an array of clients that sent introduction requests for a period of 60 seconds.
1. Full Mainloop Event
We start by measuring the full time it takes for a mainloop event to process an inbuf containing INTRODUCE2 cells. The mainloop event processed 2.42 cells per invocation on average during our measurements.
Total measurements: 3279
Min: 0.30 msec - 1st Q.: 5.47 msec - Median: 5.91 msec
Mean: 13.43 msec - 3rd Q.: 16.20 msec - Max: 257.95 msec
2. INTRODUCE2 cell processing (bottom-half)
We also measured how much time the "bottom half" part of the process takes. That's the heavy part of processing an introduction request as seen in step (4) of the [MAIN_LOOP] section:
Total measurements: 7931
Min: 0.28 msec - 1st Q.: 5.06 msec - Median: 5.33 msec
Mean: 5.29 msec - 3rd Q.: 5.57 msec - Max: 14.64 msec
3. Connection data read (top half)
Now that we have the above pieces, we can use them to measure just the "top half" part of the procedure. That's when bytes are taken from the connection inbound buffer and parsed into an INTRODUCE2 cell where basic validation is done.
There is an average of 2.42 INTRODUCE2 cells per mainloop event, so we divide the full mainloop event mean time by that to get the time for one cell. From that we subtract the "bottom half" mean time to get how much the "top half" takes:
=> 13.43 / (7931 / 3279) = 5.55 => 5.55 - 5.29 = 0.26
Mean: 0.26 msec
To summarize, during our measurements the average number of INTRODUCE2 cells a mainloop event processed is ~2.42 cells (7931 cells for 3279 mainloop invocations).
This means that, taking the mean of mainloop event times, it takes ~5.55 msec (13.43 / 2.42) to completely process an INTRODUCE2 cell. If we look deeper, we see that the "top half" of INTRODUCE2 cell processing takes 0.26 msec on average, whereas the "bottom half" takes around 5.29 msec.
The heaviness of the "bottom half" is to be expected, since that's where 95% of the total work takes place: in particular, the rendezvous path selection and circuit launch.
{TODO: BLOCKER: While gathering these measurements we found an issue with our scheduler which was limiting the amount of cells we were reading per mainloop invocation. We are currently analyzing it and will do another confirmation round after we fix it: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006}
A.5. References
[REF_EQUIX]: https://github.com/tevador/equix
             https://github.com/tevador/equix/blob/master/devlog.md
[REF_TABLE]: The table is based on the script below plus some manual editing for readability:
             https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04
[REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01...
[REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html
[REF_TARGET]: https://en.bitcoin.it/wiki/Target
[REF_TLS]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
           https://tools.ietf.org/id/draft-nir-tls-puzzles-00.html
           https://tools.ietf.org/html/draft-ietf-ipsecme-ddos-protection-10
[REF_TLS_1]: https://www.ietf.org/archive/id/draft-nygren-tls-client-puzzles-02.txt
[REF_TEVADOR_1]: https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
[REF_TEVADOR_2]: https://lists.torproject.org/pipermail/tor-dev/2020-June/014358.html
Hi all,
On Mon, Jun 22, 2020 at 4:52 PM George Kadianakis desnacked@riseup.net wrote:
Also, tevador let me know if you'd like me to add you as a co-author on the proposal based on all your great feedback so far.
Thanks for the update. Yes, you can add me as a co-author.
During our performance measurements in [TOR_MEASUREMENTS] we learned that the "top half" takes about 0.26 msecs on average, without doing any sort of PoW verification.
Interesting. This confirms the need for fast PoW verification. Equi-X takes about 0.05 ms to verify, so the top-half throughput should still be over 3000 requests per second.
Our scheduler is pretty limited by the fact that Tor has a single-threaded design.
If both the top and bottom halves are processed by the same thread, this opens up the possibility of a "hybrid" attack. Given the performance figures for the top half (0.31 ms/req., i.e. the measured 0.26 ms plus the 0.05 ms PoW verification) and the bottom half (5.5 ms/req.), the attacker can optimally deny service by submitting 91 high-effort requests and 1520 invalid requests per second. This will completely saturate the main loop because:
0.31 * (1520 + 91) ~ 0.5 sec
5.5 * 91 ~ 0.5 sec
This attack only has half the bandwidth requirement of [ATTACK_TOP_HALF] and half the compute requirement of [ATTACK_BOTTOM_HALF].
Alternatively, the attacker can adjust the ratio between invalid and high-effort requests depending on their bandwidth and compute capabilities.
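To make that trade-off concrete, here is a small sketch of ours (not from the thread) of the saturation arithmetic: high-effort requests cost the loop both halves, invalid ones only the top half, and a single-threaded loop has 1000 ms of CPU to spend per second.

    TOP_MS = 0.31    # top half per request, incl. ~0.05 ms Equi-X verification
    BOTTOM_MS = 5.5  # bottom half per valid high-effort request

    def saturating_invalid_rate(high_effort_per_sec):
        # CPU left over after the high-effort requests pay both halves,
        # spent on invalid requests that only pay the top half.
        budget_ms = 1000.0 - (TOP_MS + BOTTOM_MS) * high_effort_per_sec
        return max(0.0, budget_ms / TOP_MS)

    print(saturating_invalid_rate(91))  # ~1520 invalid requests/sec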
{TODO: BLOCKER: What's the max size of the queue? Do we trim it, or do we just stop adding new requests when it reaches max size? Can we use WRED? Trimming is currently used in [EFFORT_ESTIMATION], so if we don't do it we need to find a different way to estimate effort. See tevador's [REF_TEVADOR_2] email.}
The simplest approach is to have a "soft" maximum size of the queue. All requests with valid PoW can be added to the queue, but once per second, the queue is trimmed to the "soft" max size by removing timed-out and low-effort requests. I used this approach in my simulations (see below).
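In code, that once-per-second trim could look like this (a sketch of ours; the entry layout is an assumption):

    def trim_queue(entries, soft_max, now, timeout):
        # entries: list of (effort, arrival_time, cell) tuples
        alive = [e for e in entries if now - e[1] < timeout]  # drop timed-out
        alive.sort(key=lambda e: e[0], reverse=True)          # highest effort first
        return alive[:soft_max]                               # drop low-effort tail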
The EXT_FIELD content format is:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_SEED [4 bytes]
If we want to use Equi-X, we also need to add the solution (16 bytes), and I also recommend adding the client's target effort:
POW_VERSION [1 byte] POW_NONCE [16 bytes] POW_EFFORT [4 bytes] POW_SEED [4 bytes] POW_SOLUTION [16 bytes]
43 bytes total including the extension type and length.
The client's algorithm in section 3.2 should be modified to:
a) Client selects a target effort E.
b) Client generates a random 16-byte nonce N.
c) Client derives seed C by decoding 'seed-b64'.
d) Client calculates S = equix_solve(C || N || E)
e) Client calculates R = blake2b(C || N || E || S)
f) Client checks if R * E <= UINT32_MAX.
   f1) If yes, success! The client can submit N, E, the first 4 bytes of C, and S.
   f2) If no, fail! The client interprets N as a 16-byte little-endian integer, increments it by 1, and goes back to step d).
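For concreteness, the loop in Python (a sketch of ours; equix_solve stands in for a binding to the Equi-X library, and the exact byte encodings of E and R are assumptions pending the final spec):

    import hashlib
    import os
    import struct

    UINT32_MAX = 2**32 - 1

    def solve(seed, effort):
        nonce = int.from_bytes(os.urandom(16), "little")  # random 16-byte N
        e_bytes = struct.pack("<I", effort)               # assumed 4-byte LE E
        while True:
            n_bytes = nonce.to_bytes(16, "little")
            challenge = seed + n_bytes + e_bytes          # C || N || E
            solution = equix_solve(challenge)             # S (16 bytes)
            digest = hashlib.blake2b(challenge + solution).digest()
            r = int.from_bytes(digest[:4], "big")         # assumed: R = first 4 bytes
            if r * effort <= UINT32_MAX:
                return n_bytes, effort, seed[:4], solution  # N, E, POW_SEED, S
            nonce = (nonce + 1) % (1 << 128)              # increment N, retry from d)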
We could also add the server algorithm in 3.4:
a) Find a valid seed C that starts with POW_SEED. Fail if no such seed exists.
b) Fail if E = POW_EFFORT is lower than the minimum effort.
c) Fail if N = POW_NONCE is present in the replay cache.
d) Calculate R = blake2b(C || N || E || S)
e) Fail if R * E > UINT32_MAX
f) Fail if equix_verify(C || N || E, S) != EQUIX_OK
g) Put the request in the queue with a priority of E
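Mirroring the client sketch above, the service side would be (same caveats; equix_verify and EQUIX_OK stand in for the Equi-X library):

    def verify(pow_seed, effort, nonce, solution, valid_seeds,
               replay_cache, min_effort):
        seed = next((c for c in valid_seeds if c[:4] == pow_seed), None)
        if seed is None:
            return False                       # a) no matching seed
        if effort < min_effort:
            return False                       # b) effort too low
        if nonce in replay_cache:
            return False                       # c) replayed nonce
        challenge = seed + nonce + struct.pack("<I", effort)
        digest = hashlib.blake2b(challenge + solution).digest()
        if int.from_bytes(digest[:4], "big") * effort > UINT32_MAX:
            return False                       # e) declared effort not met
        if equix_verify(challenge, solution) != EQUIX_OK:
            return False                       # f) invalid Equi-X solution
        replay_cache.add(nonce)
        return True                            # g) caller enqueues with priority E

Note that the cheap blake2b check (steps d-e) runs before the more expensive equix_verify call (step f), so requests that overstate their effort can be rejected at minimal cost.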
3.4.3. PoW effort estimation [EFFORT_ESTIMATION] {XXX: BLOCKER: Figure out if this system makes sense}
I wrote a simple simulation in Python to test different ways of adjusting the suggested effort. The results are here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md
In summary, I suggest to use MIN_EFFORT = 1000 and the following algorithm to calculate the suggested effort:
1. Sum the effort of all valid requests that have been received since the last HS descriptor update. This includes all handled requests, trimmed requests, and requests still in the queue.
2. Divide the sum by the maximum number of requests that the service could have handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
3. Suggested effort = max(MIN_EFFORT, result)
This algorithm can both increase and reduce the suggested effort.
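As a sketch (ours; the names are placeholders):

    MIN_EFFORT = 1000

    def suggested_effort(total_effort, svc_bottom_capacity, hs_update_period):
        # total_effort: summed effort of all valid requests seen since the
        # last HS descriptor update (handled + trimmed + still queued)
        could_have_handled = svc_bottom_capacity * hs_update_period
        return max(MIN_EFFORT, int(total_effort / could_have_handled))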
My simulations also show that bottom-half attacks are not feasible (even "The large botnet" cannot completely deny access to the service). I think further research and testing should focus on top-half attacks (or hybrid attacks).
{XXX: Does this mean that this system can auto-enable and auto-disable the DoS subsystem with reasonable accuracy?}
I tried to make the effort estimation algorithm disable PoW automatically (by setting suggested effort to 0), but it led to oscillating behavior in the case of a sustained attack (i.e. once the suggested effort is set to 0, clients cannot connect anymore because the attacker can easily saturate the service). Having an acceptably low minimum effort could work better. MIN_EFFORT = 1000 as I suggested will take around 1 second on a quad core CPU. Mobile clients can simply ignore the suggested effort and always try to connect without PoW.
The algorithm can be easily modified to auto-enable PoW in case of an attack (but will not auto-disable it). This can be done by keeping suggested effort zero as long as no requests have been trimmed from the queue. The PoW subsystem could be disabled manually by the HS operator if needed.
{XXX: BLOCKER: How should timeout values change here since the priority queue will cause bigger delays than usual to rendezvous?}
I think the timeout should stay the same. Attackers don't care about timeouts and longer timeout values would increase users' time-to-connect when the service is under attack and the suggested effort hasn't been updated yet.
tevador tevador@gmail.com writes:
Hi all,
Hello tevador,
thanks so much for your work here and for the great simulation. Also for the hybrid attack which was definitely missing from the puzzle.
I've been working on a further revision of the proposal based on your comments. I have just one small question I would like your feedback on.
3.4.3. PoW effort estimation [EFFORT_ESTIMATION] {XXX: BLOCKER: Figure out if this system makes sense}
I wrote a simple simulation in Python to test different ways of adjusting the suggested effort. The results are here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md
In summary, I suggest to use MIN_EFFORT = 1000 and the following algorithm to calculate the suggested effort:
- Sum the effort of all valid requests that have been received since the last HS descriptor update. This includes all handled requests, trimmed requests and requests still in the queue.
- Divide the sum by the max. number of requests that the service could have handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
- Suggested effort = max(MIN_EFFORT, result)
This algorithm can both increase and reduce the suggested effort.
I like the above logic, but I'm wondering how we can get the real SVC_BOTTOM_CAPACITY for every scenario. In particular, the SVC_BOTTOM_CAPACITY=180 value from 6.2.2 might have been true for David's testing, but it will not be true for every computer and every network.
I wonder if we can adapt the above effort estimation algorithm to use an initial SVC_BOTTOM_CAPACITY magic value for the first run (let's say 180), but then derive the real SVC_BOTTOM_CAPACITY of the host at runtime and use that for subsequent runs of the algorithm.
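For illustration, one rough shape this could take (a sketch with hypothetical names, not something already in the proposal): start from the magic value and blend in the observed bottom-half rate, sampled only while the queue is non-empty so it measures capacity rather than demand.

    class CapacityEstimator:
        def __init__(self, initial=180.0, alpha=0.1):
            self.capacity = initial  # bottom-half requests/sec, magic first run
            self.alpha = alpha       # smoothing factor

        def observe(self, handled, seconds):
            # Call only for intervals where the priority queue was non-empty,
            # so the handled rate reflects what the host can actually do.
            rate = handled / seconds
            self.capacity = (1 - self.alpha) * self.capacity + self.alpha * rate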
Do you think this is possible?
George Kadianakis desnacked@riseup.net writes:
tevador tevador@gmail.com writes:
Hi all,
Hello,
I have pushed another update to the PoW proposal here: https://github.com/asn-d6/torspec/tree/pow-over-intro I also (finally) merged it upstream to torspec as proposal #327: https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-int...
The most important improvements are:
- Add tevador as an author.
- Update PoW algorithms based on tevador's Equix feedback.
- Update effort estimation algorithm based on tevador's simulation.
- Include hybrid attack section.
- Remove a bunch of blocker tags.
Two things I'd like to work more on:
- I'd like people to take tevador's Equix PoW function and run it on their boxes and post back benchmarks of how it performed. Particularly so if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well. That will help us tune the proposal even more.
For my laptop (with an Intel i7-8550U CPU @ 1.80GHz) I got pretty accurate benchmarks (compared to https://github.com/tevador/equix#performance):

$ ./equix-bench
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
283.829505 solutions/sec. (1 thread)
22810.327943 verifications/sec. (1 thread)

$ ./equix-bench --threads 16
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ...
1.910000 solutions/nonce
2296.585708 solutions/sec. (16 threads)
20223.196324 verifications/sec. (1 thread)
See how to do this here: https://github.com/tevador/equix#build
- I'd like to improve the effort estimation algorithm by dynamically adjusting SVC_BOTTOM_CAPACITY instead of having it as a static value. Otherwise, I would like to reduce the currently suggested SVC_BOTTOM_CAPACITY because I feel that 180 is too big. I would like to put it to 100 which is much more conservative. I tried to do so while updating tevador's simulation accordingly, but I found out that the simulation code does not do the graphs itself, so I didn't make much progress here.
tevador do you have the graphing code somewhere so that I can run the experiments again and see how the graphs are influenced?
Apart from that, I think the proposal is really solid. I have hence merged it as proposal #327 to torspec and further revisions can be done on top of that from now on.
Thanks for all the work here and I'm looking forward to further feedback!
On 9/22/20 07:10, George Kadianakis wrote:
George Kadianakis desnacked@riseup.net writes:
tevador tevador@gmail.com writes:
Hi all,
Hello,
I have pushed another update to the PoW proposal here: https://github.com/asn-d6/torspec/tree/pow-over-intro I also (finally) merged it upstream to torspec as proposal #327: https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-int...
The most important improvements are:
- Add tevador as an author.
- Update PoW algorithms based on tevador's Equix feedback.
- Update effort estimation algorithm based on tevador's simulation.
- Include hybrid attack section.
- Remove a bunch of blocker tags.
Two things I'd like to work more on:
- I'd like people to take tevador's Equix PoW function and run it on their boxes and post back benchmarks of how it performed.
I shared some results privately with George and he suggested including the list. Results below.
Particularly so if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well. That will help us tune the proposal even more.
For anyone else following along or also contributing benchmarks, George clarified for me that the equix benchmark isn't capable of utilizing the GPU.
My results:
First results are on my w530, i7, 4 core (hyperthreaded to 8) laptop (with moderate activity in the background).
I stumbled across some weird artifacts when using more threads than processors: the benchmark reports solutions/sec continuing to increase linearly with #threads. The wall-clock time for the benchmark itself (measured with `time`) shows the expected trend though: linear scaling only up to 4 threads (the number of physical cores), a little bump at 8 (using the hyperthreaded virtual cores), and no improvement past that.
Further below are results on my pinephone.

$ time ./equix-bench --threads 1
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
227.714446 solutions/sec. (1 thread)
20301.439170 verifications/sec. (1 thread)

real    0m4.242s
user    0m4.230s
sys     0m0.012s

$ time ./equix-bench --threads 2
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 2) ...
1.910000 solutions/nonce
450.100153 solutions/sec. (2 threads)
17925.519934 verifications/sec. (1 thread)

real    0m2.184s
user    0m4.294s
sys     0m0.004s

$ time ./equix-bench --threads 4
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 4) ...
1.910000 solutions/nonce
876.343564 solutions/sec. (4 threads)
18863.079719 verifications/sec. (1 thread)

real    0m1.154s
user    0m4.400s
sys     0m0.012s

$ time ./equix-bench --threads 8
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 8) ...
1.910000 solutions/nonce
1089.198671 solutions/sec. (8 threads)
17808.857809 verifications/sec. (1 thread)

real    0m0.981s
user    0m7.019s
sys     0m0.052s

$ time ./equix-bench --threads 16
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ...
1.910000 solutions/nonce
2183.232035 solutions/sec. (16 threads)
18936.014118 verifications/sec. (1 thread)

real    0m1.025s
user    0m7.021s
sys     0m0.032s

$ time ./equix-bench --threads 32
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 32) ...
1.910000 solutions/nonce
4397.259598 solutions/sec. (32 threads)
17754.229411 verifications/sec. (1 thread)

real    0m1.026s
user    0m6.961s
sys     0m0.049s
$ cat /proc/cpuinfo
<snip>
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
stepping        : 9
microcode       : 0x21
cpu MHz         : 1856.366
cache size      : 6144 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips        : 5387.48
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
Similar behavior on the (4-core aarch64) pinephone:

$ time ./equix-bench --threads 1
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
1.910000 solutions/nonce
23.920219 solutions/sec. (1 thread)
4477.199102 verifications/sec. (1 thread)

real    0m 40.35s
user    0m 40.12s
sys     0m 0.01s

$ time ./equix-bench --threads 2
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 2) ...
1.910000 solutions/nonce
47.683428 solutions/sec. (2 threads)
4384.937853 verifications/sec. (1 thread)

real    0m 20.45s
user    0m 40.20s
sys     0m 0.06s

$ time ./equix-bench --threads 4
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 4) ...
1.910000 solutions/nonce
94.149494 solutions/sec. (4 threads)
4359.695415 verifications/sec. (1 thread)

real    0m 10.47s
user    0m 40.71s
sys     0m 0.08s

$ time ./equix-bench --threads 8
Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 8) ...
1.910000 solutions/nonce
188.808873 solutions/sec. (8 threads)
4348.479398 verifications/sec. (1 thread)

real    0m 10.50s
user    0m 40.61s
sys     0m 0.07s
$ cat /proc/cpuinfo
<snip>
processor        : 3
BogoMIPS         : 48.00
Features         : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer  : 0x41
CPU architecture : 8
CPU variant      : 0x0
CPU part         : 0xd03
CPU revision     : 4
On Tue, Sep 22, 2020 at 2:10 PM George Kadianakis desnacked@riseup.net wrote:
if you have a GPU-enabled box, so that we can get some benchmarks from GPUs as well
Someone would have to write a GPU benchmark for that. My code is CPU-only.
tevador do you have the graphing code somewhere so that I can run the experiments again and see how the graphs are influenced?
I've uploaded the gnuplot script I used to generate the graphs here: https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.plt
You will need to modify the script with your path to the simulation output file.
On Thu, Sep 24, 2020 at 6:54 PM Jim Newsome jnewsome@torproject.org wrote:
I stumbled across some weird artifacts when using more threads than processors: the benchmark reports solutions/sec continuing to increase linearly with #threads. The wall-clock time for the benchmark itself (measured with `time`) shows the expected trend though: linear scaling only up to 4 threads (the number of physical cores), a little bump at 8 (using the hyperthreaded virtual cores), and no improvement past that.
Good catch. There was a bug in the time measurement code in the benchmark. Should be fixed now in the master branch.
On 22 Jun (17:52:44), George Kadianakis wrote:
Hello there,
here is another round of PoW revisions: https://github.com/asn-d6/torspec/tree/pow-over-intro I'm inlining the full proposal at the end of this email.
Here is a changelog:
- Actually used tevador's EquiX scheme as our PoW scheme for now. This is still tentative, but I needed some ingredients to cook with so I went for it.
- Fold in David's performance measurements and use them to get some guesstimates on the default PoW difficulty etc.
- Enable overlapping seed system.
- Enrich the attack section of the proposal some more.
- Attempt to fix an effort estimation attack pointed out by tevador.
- Added a bunch of "BLOCKER" tags around the proposal for things that we need to figure out, or at least develop good intuition about, if we want to have guarantees that the proposal can work before we start implementing.
Here is what needs to happen next:
- David's performance measurements have been really useful, but they open a bunch of questions on auxiliary overheads. We are now performing more experiments to confirm the performance numbers we got and make sure we are not overshooting. I noted these issues down as BLOCKER in the proposal. While doing so we also found a pretty serious bug with our scheduler that we trying to fix: https://gitlab.torproject.org/tpo/core/tor/-/issues/40006
[snip]
(For the record)
Ok, now that this bug has been fixed, here are the new numbers. The time per INTRO2 cell, on average, is the same as in the proposal.
The big difference is that Tor is now handling on average ~15 cells per mainloop round during heavy DDoS. It is 15 and not 32 (the theoretical limit) because the service also handles a lot of DESTROY cells due to rendezvous circuits failing, and also because there are some seconds where no cells are processed since tor is busy doing other things.
We've also confirmed that the theoretical value of 180 requests per second in the proposal actually is valid. During high DDoS time, we've observed on average 165 cells per second (by removing few outliers since tor has other events that prevents cell processing for 1-3 seconds sometimes.
We've observed rate of 185cells/second so the 180 numbers holds here imo.
Cheers! David