Tim Niemeyer:
Maybe it is a load problem, because this machine has 100% cpu load? :(
Generally speaking, running a relay at 100% of its hardware resources all the time will not make for happy users. We should optimize for a smooth Tor Browser experience rather than for high bandwidth or hardware utilization.
I don't think we have to worry about an exit failing 10% of DNS queries for a single day.
Single operators running a significant exit share (>0.5% exit probability) whose exits fail at a high rate (>10%) consistently over multiple days are more relevant.
Since I don't see your exits showing up as failing currently, the remainder of this email is not necessarily directed at you but is more for the general record.
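To make the two thresholds above concrete, here is a minimal sketch of how one could flag an operator from per-day measurements. The DailyResult type, the is_relevant function and the min_days=2 default are hypothetical names and values chosen for illustration; "multiple days" is not defined more precisely above, and this is not how any existing scanner works.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DailyResult:
    """One day of measurements for a single operator (hypothetical structure)."""
    exit_probability: float   # fraction of total exit probability, 0.005 == 0.5%
    dns_failure_rate: float   # fraction of failed DNS queries, 0.10 == 10%


def is_relevant(days: Iterable[DailyResult],
                min_exit_prob: float = 0.005,
                min_fail_rate: float = 0.10,
                min_days: int = 2) -> bool:
    """Return True if both thresholds are exceeded on min_days consecutive days.

    min_days=2 is an assumption standing in for "multiple days".
    """
    streak = 0
    for day in days:
        if day.exit_probability > min_exit_prob and day.dns_failure_rate > min_fail_rate:
            streak += 1
            if streak >= min_days:
                return True
        else:
            # A single good day resets the streak: one bad day alone is not a concern.
            streak = 0
    return False
```

With these defaults, an exit that fails 10% of its queries for a single day is never flagged, which matches the point made above.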
A dedicated machine for dns may be good, but currently we have only this one machine.
I actually believe in running DNS resolvers locally to keep paths short. The resources required for the resolver must be taken into account when planning the capacity of the entire server. The resolver can also require a decent amount of CPU time on fast exits.
In very constrained environments it might still make sense to run DNS resolvers non-locally (while not using a resolver that is too far away), since DNS resolvers for exits can also run where exits might not be welcome.
Using a non-local resolver is obviously still better than a local resolver that cannot keep up with the load.
Another way could be to reduce exit capacity, but I don't know if it's a good idea to throttle it?
With the goal to have happy users (low latency reliable exits):
On a single server with multiple cores and a >1Gbit/s connectivity (server not limited by uplink bw and memory limits) I'd suggest:
1) determine your CPU's single thread performance: measure the peak bandwidth of tor traffic it can manage at a given exit policy, running a single instance with no bw limits. Take some ramp-up time into account, which also exists for exits. (Use measured data, not advertised bandwidth; they can be far apart.)
2) determine how many DNS QPS that single tor exit instance generates and how much resolver CPU load that causes (peak value after 1-2 weeks of operation)
3) run as many instances as you have cores minus one, and set the bw limit (RelayBandwidthRate) in your torrc to ~80% of the peak value from (1), while ensuring that there is enough spare capacity for the resolver and the OS itself
4) optimize your resolver's performance and cache hit rate by playing with the cache size and the number of threads. Example for unbound: https://nlnetlabs.nl/documentation/unbound/howto-optimise/
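As a rough illustration of the arithmetic in the steps above, here is a small sketch. The function name, its parameters and the example numbers are all made up; you would plug in your own measurements from (1) and (2). It only checks the resolver against the one core left free and ignores what the OS itself needs, so treat the result as a starting point, not a guarantee.

```python
def plan_exit_capacity(cores: int,
                       peak_mbit_per_instance: float,
                       resolver_cores_per_instance: float,
                       headroom: float = 0.80) -> dict:
    """Sketch of the capacity plan: cores-1 instances at ~80% of measured peak.

    peak_mbit_per_instance: measured peak of a single unlimited instance, step (1)
    resolver_cores_per_instance: peak resolver CPU load caused by that instance,
        as a fraction of one core, step (2)
    """
    instances = max(cores - 1, 0)
    rate_mbit = round(peak_mbit_per_instance * headroom)
    resolver_cores = instances * resolver_cores_per_instance
    return {
        "instances": instances,
        "rate_mbit_per_instance": rate_mbit,
        "total_mbit": instances * rate_mbit,
        "resolver_cores_needed": resolver_cores,
        # Simplification: the resolver must fit into the single core left free.
        "resolver_fits": resolver_cores <= 1.0,
        # Line to put into each instance's torrc.
        "torrc_line": f"RelayBandwidthRate {rate_mbit} MBits",
    }


if __name__ == "__main__":
    # Example numbers are invented, not measurements.
    plan = plan_exit_capacity(cores=8,
                              peak_mbit_per_instance=250,
                              resolver_cores_per_instance=0.1)
    for key, value in plan.items():
        print(f"{key}: {value}")
```

If resolver_fits comes out False, that is the signal to either run fewer instances or move the resolver to another (nearby) machine, as discussed above.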
Btw, in the meantime we got more upstream transit and now we are looking to get better / second hardware. But money is a limiting factor. :(
Maybe it helps if you clearly communicate that you could easily do X Gbit/s of exit capacity if you only had the necessary hardware, and tell people where to enter their credit card details if they want to see that happen ;)