At 20:56 11/5/2015 -0600, you wrote:
On 5 November 2015 at 16:37, starlight.2015q3@binnacle.cx wrote:
By having a single thread handle consensus retrieval and sub-division, issues of "lost" relays should go away entirely.
So I'm coming around to this idea, after spending an hour trying to explain why it was bad. . .
:-D happy to hear that
It can be expressed and implemented various ways, but the core idea of centralizing this function is hard to escape.
So the main problem as I see it is that it's easy to move relays between slices that haven't happened yet - but how do you do this when some slices are completed and some aren't?
. . .if you used a work queue instead of a scanner. . .
I'm thinking of it more as a set of circular work lists, since scanning never ends.
It's then not a question of "haven't happened yet" for new relays, since regardless of where they're inserted they will be measured. It's more a question of giving priority to new relays, and to relays that have exhibited a large change in bandwidth, particularly in the downward direction. A rough sketch of what I mean is below.
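To make that concrete, here is a minimal Python sketch of one such circular work list. Everything here is illustrative assumption, not existing code: the class name, the single consensus-handling thread calling sync_with_consensus(), and the 50% drop threshold are all made up for the example.

    import collections

    class RelayWorkList:
        """Sketch of one circular work list: relays rotate forever, but
        new relays and relays whose bandwidth dropped sharply jump the
        queue."""

        DROP_THRESHOLD = 0.5   # assumed: >50% downward change triggers re-scan

        def __init__(self):
            self.queue = collections.deque()   # fingerprints in scan order
            self.last_bw = {}                  # fingerprint -> last measurement
            self.current = set()               # fingerprints in latest consensus

        def sync_with_consensus(self, fingerprints):
            """Called by the single consensus-handling thread after each
            fetch; new relays go to the front so they're measured promptly."""
            known = set(self.queue)
            for fp in fingerprints:
                if fp not in known:
                    self.queue.appendleft(fp)
            self.current = set(fingerprints)   # departed relays skipped lazily

        def record_measurement(self, fp, bw):
            prev = self.last_bw.get(fp)
            self.last_bw[fp] = bw
            if prev and bw < prev * (1 - self.DROP_THRESHOLD):
                # Large downward change: pull the relay forward for re-scan.
                try:
                    self.queue.remove(fp)
                except ValueError:
                    pass
                self.queue.appendleft(fp)

        def next_relay(self):
            """Rotate: the front of the queue is scanned next, then
            re-appended at the back, since scanning never ends."""
            for _ in range(len(self.queue)):
                fp = self.queue.popleft()
                if fp in self.current:
                    self.queue.append(fp)
                    return fp
            return None

Because the queue never drains, "lost" relays can't happen: a relay either rotates back around or drops out when it leaves the consensus.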
If radical restructuring is on the menu (as the remainder of your comments imply) let me put out an idea I've come up with:
I'm not enthusiastic about the current design where nine scanners blindly compete with each other for the local Tor client-router and link bandwidth.
The goal should be to figure out what percent of the BWauth link's bandwidth should be consumed to minimize scanning time while ensuring quality measurement.
Nine scanners could be replaced with one scanner that runs the entire list of relays sorted from fastest to slowest. This scanner would keep the slice approach of firing up scans until a quota is filled. But instead of that quota being a somewhat arbitrary fixed value, the quota would be a minimum bandwidth target per the output of 'SETEVENTS BW'.
A rough guess is that whatever bandwidth is available should be 35% subscribed, though the actual setting would require empirical determination. On a 100 Mbps link (about 12.5 million bytes/sec) the target might then be a continuous 4.4 million bytes/sec of scanning.
This might result in the 100 or so fastest relays being measured one at a time, but of course that would happen rather quickly. As the scanner works down the list, the number of concurrent scans would rise rapidly as the bandwidth of the targets declines, so perhaps several hundred of the slowest relays would be under measurement near the end of each pass.
My belief is the end result would be dramatically faster than the current approach, have a low circuit-fail rate, and that the measurements would be far more reliable and consistent.
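For what it's worth, here is a minimal sketch of that pacing loop using stem, assuming a 100 Mbps link and the 35% guess above. start_scan() and the relay-list plumbing are hypothetical placeholders; only the BW-event subscription is real stem API.

    from stem.control import Controller, EventType

    LINK_BYTES_PER_SEC = 12_500_000            # assumed 100 Mbps link
    TARGET = int(LINK_BYTES_PER_SEC * 0.35)    # ~4.4 MB/s at 35% subscription

    def start_scan(fingerprint):
        """Hypothetical helper: build a two-hop circuit through the relay
        and start a timed fetch; its traffic shows up in later BW events."""
        raise NotImplementedError

    def run_pass(relays_fast_to_slow):
        pending = iter(relays_fast_to_slow)
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()

            def on_bw(event):
                # Tor emits one BW event per second; event.read and
                # event.written are the bytes the local client moved in
                # that second. While under the target, add one more
                # concurrent scan: the fastest relays fill the quota
                # nearly alone, slower ones stack up in parallel.
                if event.read + event.written < TARGET:
                    relay = next(pending, None)
                    if relay is not None:
                        start_scan(relay)

            controller.add_event_listener(on_bw, EventType.BW)
            # ... block until the pass completes, then restart with a
            # fresh, re-sorted relay list.

Launching at most one new scan per one-second BW sample keeps the ramp-up simple; a real implementation would presumably want some smoothing and a hard cap on concurrent circuits.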
FWIW, looking at https://bwauth.ritter.vg/bwauth/AA_scanner_loop_times.txt , it seems like (for whatever weird reason) scanner1 took way longer than the others. (Scanner 9 is very different, so ignore that one.)
Scanner 1 5 days, 11:07:27
Yes, I noticed this a few days ago; examining ticket 17482 more closely, I saw it was 6.5 days for scanner-one just before the 10/30 restart.
The scanner-one log shows huge numbers of circuit failures. That observation partly inspired the restructuring idea above.