At 20:56 11/5/2015 -0600, you wrote:
On 5 November 2015 at 16:37, starlight.2015q3@binnacle.cx wrote:
By having a single thread handle consensus retrieval and sub-division, issues of "lost" relays should go away entirely.
So I'm coming around to this idea, after spending an hour trying to explain why it was bad. . .
:-D happy to hear that
It can be expressed and implemented various ways, but the core idea of centralizing this function is hard to escape.
So the main problem as I see it is that it's easy to move relays between slices that haven't happened yet - but how do you do this when some slices are completed and some aren't?
. . .if you used a work queue instead of a scanner. . .
I'm thinking of it more as a set of circular work lists, since scanning never ends.
It's then not a question of "haven't happened yet" for new relays, since regardless of where they're inserted they will be measured. It's more a question of giving priority to new relays, and to relays that have exhibited a large change in bandwidth, particularly in the downward direction. A rough sketch of what I mean is below.
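To make that concrete, here is a minimal Python sketch of one such circular work list. Everything here is illustrative assumption, not existing code: the class name, the single consensus-handling thread calling sync_with_consensus(), and the 50% drop threshold are all made up for the example.

    import collections

    class RelayWorkList:
        """Sketch of one circular work list: relays rotate forever, but
        new relays and relays whose bandwidth dropped sharply jump the
        queue."""

        DROP_THRESHOLD = 0.5   # assumed: >50% downward change triggers re-scan

        def __init__(self):
            self.queue = collections.deque()   # fingerprints in scan order
            self.last_bw = {}                  # fingerprint -> last measurement
            self.current = set()               # fingerprints in latest consensus

        def sync_with_consensus(self, fingerprints):
            """Called by the single consensus-handling thread after each
            fetch; new relays go to the front so they're measured promptly."""
            known = set(self.queue)
            for fp in fingerprints:
                if fp not in known:
                    self.queue.appendleft(fp)
            self.current = set(fingerprints)   # departed relays skipped lazily

        def record_measurement(self, fp, bw):
            prev = self.last_bw.get(fp)
            self.last_bw[fp] = bw
            if prev and bw < prev * (1 - self.DROP_THRESHOLD):
                # Large downward change: pull the relay forward for re-scan.
                try:
                    self.queue.remove(fp)
                except ValueError:
                    pass
                self.queue.appendleft(fp)

        def next_relay(self):
            """Rotate: the front of the queue is scanned next, then
            re-appended at the back, since scanning never ends."""
            for _ in range(len(self.queue)):
                fp = self.queue.popleft()
                if fp in self.current:
                    self.queue.append(fp)
                    return fp
            return None

Because the queue never drains, "lost" relays can't happen: a relay either rotates back around or drops out when it leaves the consensus.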
If radical restructuring is on the menu (as the remainder of your comments imply) let me put out an idea I've come up with:
I'm not enthusiastic about the current design where nine scanners blindly compete with each other for the local Tor client-router and link bandwidth.
The goal should be to figure out what percent of the BWauth link's bandwidth should be consumed to minimize scanning time while ensuring quality measurement.
Nine scanners could be replaced with one scanner that runs the entire list of relays sorted from fastest to slowest. This scanner would keep the slice approach of firing up scans until a quota is filled. But instead of that quota being a somewhat arbitrary fixed value, the quota would be a minimum bandwidth target per the output of 'SETEVENTS BW'.
A rough guess is that whatever bandwidth is available should be 35% subscribed, though the actual setting would require empirical determination. On a 100 Mbps link (about 12.5 million bytes/sec) the target might then be a continuous 4.4 million bytes/sec of scanning.
This might result in the 100 or so fastest relays being measured one at a time, but of course that would happen rather quickly. As the scanner works down the list, the number of concurrent scans would rise rapidly as the bandwidth of the targets declines, so perhaps several hundred of the slowest relays would be under measurement near the end of each pass.
My belief is the end result would be dramatically faster than the current approach, have a low circuit-fail rate, and that the measurements would be far more reliable and consistent.
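For what it's worth, here is a minimal sketch of that pacing loop using stem, assuming a 100 Mbps link and the 35% guess above. start_scan() and the relay-list plumbing are hypothetical placeholders; only the BW-event subscription is real stem API.

    from stem.control import Controller, EventType

    LINK_BYTES_PER_SEC = 12_500_000            # assumed 100 Mbps link
    TARGET = int(LINK_BYTES_PER_SEC * 0.35)    # ~4.4 MB/s at 35% subscription

    def start_scan(fingerprint):
        """Hypothetical helper: build a two-hop circuit through the relay
        and start a timed fetch; its traffic shows up in later BW events."""
        raise NotImplementedError

    def run_pass(relays_fast_to_slow):
        pending = iter(relays_fast_to_slow)
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()

            def on_bw(event):
                # Tor emits one BW event per second; event.read and
                # event.written are the bytes the local client moved in
                # that second. While under the target, add one more
                # concurrent scan: the fastest relays fill the quota
                # nearly alone, slower ones stack up in parallel.
                if event.read + event.written < TARGET:
                    relay = next(pending, None)
                    if relay is not None:
                        start_scan(relay)

            controller.add_event_listener(on_bw, EventType.BW)
            # ... block until the pass completes, then restart with a
            # fresh, re-sorted relay list.

Launching at most one new scan per one-second BW sample keeps the ramp-up simple; a real implementation would presumably want some smoothing and a hard cap on concurrent circuits.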
FWIW, looking at https://bwauth.ritter.vg/bwauth/AA_scanner_loop_times.txt , it seems like (for whatever weird reason) scanner1 took way longer than the others. (Scanner 9 is very different, so ignore that one.)
Scanner 1 5 days, 11:07:27
Yes, I noticed this a few days ago; examining ticket 17482 more closely, I saw it was 6.5 days for scanner-one just before the 10/30 restart.
The scanner-one log shows huge numbers of circuit failures. That observation partly inspired the restructuring idea above.