On 2015-06-29 17:51, Speak Freely wrote:
> Hello,
>
> First of all, I love Tor. I love Tor Browser, and I love running
> relays.
>
> When the problems are solved, I will most likely spin up more relays.
>
> I'm leaving my fastest relay running, as a method of checking the status
> for myself. The rest have already started to expire, and within the next
> week or so most of the other ones will have expired as well.
>
> I'm going to try tor-dev-alpha 2.7.1 and change fingerprints, as per a
> suggestion from s7r, seeing as how I have nothing to lose.
>
> I just wish the bwauths could scan relays based off previous relative
> consensus weights... If this particular relay was at 27000, it should be
> higher on the list to check compared to another one I have that is at
> 487. My one relay was blazing fast with thousands of connections, my
Well, relays are ranked by capacity and split over several scanner
processes, so they do get measured against their relative peers. But it
seems that when they fall out of the measurement process ('Unmeasured')
they must start again from the beginning. That is expected for new
relays: all relays start Unmeasured and gradually increase their position
in the consensus (per relative capacity). This dampens sudden changes and
limits sybil attacks, because requiring relays to stick around for a
while increases the cost to an adversary. However, historically
long-running relays probably should not have to start at the bottom after
being unmeasured for only a short period of time.
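
To make the ranking and splitting concrete, here is a rough Python sketch.
It is not the bandwidth authority scanner code: the function names, the
equal-count split, and the capacity numbers are all assumptions made up
for illustration. It only shows how equal-sized slices of a capacity
ranking can carry very unequal shares of total capacity.

```python
from typing import Dict, List, Tuple

Segment = List[Tuple[str, int]]


def split_by_rank(capacities: Dict[str, int], n_scanners: int) -> List[Segment]:
    """Rank relays fastest-first and cut the ranking into contiguous segments.

    Assumption: each scanner gets an (almost) equal number of relays.
    """
    if n_scanners < 1:
        raise ValueError("n_scanners must be at least 1")
    ranked = sorted(capacities.items(), key=lambda kv: kv[1], reverse=True)
    base, extra = divmod(len(ranked), n_scanners)
    segments: List[Segment] = []
    start = 0
    for i in range(n_scanners):
        # The first `extra` segments absorb the remainder, one relay each.
        size = base + (1 if i < extra else 0)
        segments.append(ranked[start:start + size])
        start += size
    return segments


def capacity_share(segments: List[Segment]) -> List[float]:
    """Fraction of total capacity that falls into each segment."""
    total = sum(cap for seg in segments for _, cap in seg)
    if total == 0:
        return [0.0] * len(segments)
    return [sum(cap for _, cap in seg) / total for seg in segments]


# Hypothetical relays and capacities, not real measurements.
relays = {"a": 27000, "b": 9000, "c": 3000, "d": 487, "e": 300, "f": 20}
segments = split_by_rank(relays, 3)
shares = capacity_share(segments)
```

With these made-up numbers, the first of the three segments holds over
90% of the total capacity even though every segment has two relays.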
We are in the process of testing increasing the number of scanner and
accompanying tor instances from 4 to 9 (double, plus one for currently
unmeasured relays) in order to decrease the amount of time each fraction
of the network takes to measure and ensure that new relays or unmeasured
relays are measured often. There are additional patches that introduce
extra exits into the slice of relays, if there are no suitable exits to
measure with. This likely won't address the above behavior, but we hope
it will reduce the number of relays that go missing. Currently we seem
to have mixed results, with one Bandwidth Authority operator claiming
minimal (50) unmeasured relays, and another claiming ~600 mixed relays.
These numbers are not directly comparable because they were not sampled
at the same time, and may not be representative of typical behavior -
it's a little too soon to tell.
It's a bit tricky to test these changes on the live Tor network,
demonstrate that they produce sane results, and convince the directory
authority operators and partner bandwidth authority operators to upgrade.
Nor do we want to do that all at once; gradual change is better. So
the goal is to produce results that will convince operators they should
update, improve the situation for relay operators, and then start
looking at longer-term solutions for the measurement problem that are
more maintainable and scalable.
> other is painfully useless with dozens, but my fastest one lost its
> consensus while the slowest one kept its consensus. It just seems
> silly. That being said, I don't know how/if the bwauths scan in any
> order or just willy-nilly, (that's not entirely true, I know it's
> segmented to some degree as I recall reading a blog post about how it's
> chopped up) but... I'd be much less upset if my best relays worked and
> my worst relays didn't. More complaining... bleh.
>
I hope to have a testable hypothesis as to why your faster relays
suffer(ed) more than the slower relays - it could be that the fraction
of network by capacity allocated to a particular scanner is not well
balanced, and that fraction is taking significantly longer to measure.
In order to evaluate that statement I need to understand the common
characteristics of relays that become unmeasured/lose rank and see if
they are from a similar segment of the network, and whether or not that
segment of the network takes longer to measure than other segments.
Another hypothesis is that your relays are on the boundary between two
segments, and that a transition between scanner instances causes enough
missed measurements to drop your relays. It would be helpful to know
what rank your relays (or other affected relays) held at their last good
measurement before becoming unmeasured.
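
As a rough illustration of the boundary hypothesis, the following Python
sketch flags ranks that sit close to the edge between two segments. The
equal-count segmentation, the function names, and the numbers (6000
relays, 4 scanners, a margin of 25 ranks) are assumptions for the
example, not values taken from the deployed scanners.

```python
def segment_of(rank: int, total: int, n_scanners: int) -> int:
    """Index of the equal-count segment that holds a 0-based rank."""
    if not 0 <= rank < total:
        raise ValueError("rank must be in [0, total)")
    return rank * n_scanners // total


def near_boundary(rank: int, total: int, n_scanners: int, margin: int) -> bool:
    """True if moving up or down by `margin` ranks would change segment."""
    lo = max(rank - margin, 0)
    hi = min(rank + margin, total - 1)
    return segment_of(lo, total, n_scanners) != segment_of(hi, total, n_scanners)


# Hypothetical network: 6000 relays, 4 scanners, so segments of 1500 ranks.
print(near_boundary(1490, 6000, 4, 25))  # True: rank 1500 starts segment 1
print(near_boundary(700, 6000, 4, 25))   # False: well inside segment 0
```

If the last good ranks of the dropped relays cluster where this kind of
check returns True, that would support the boundary hypothesis; if they
are spread evenly, the slow-segment hypothesis looks more likely.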
It will require some cooperation with the existing deployed Bandwidth
Authorities in order to learn what their current scan times are. I
will be writing some simple scripts to scrape these results so that we
can collect and publish some useful heuristics about the scanner
processes to help debug this problem.
>
> One thing I would like to point out though... it appears... These
> problems have at least a casual relationship with MyFamily.
>
> One group of MyFamily is completely done - all of them stuck at 20.
> Another group of MyFamily is working happily.
> I've been doing some tests over the past few months trying to understand
> why I keep having problems, and one thing has consistently popped up...
> MyFamily.
That is very interesting, because MyFamily should have nothing to do
with the scanner process at all - I'll need to think about this some
more.
>
> As one of MyFamily lost consensus, another family gained consensus back
> on or around the same time.
>
> Yes, especially nusenu, I know I'm supposed to have it all configured to
> be under 1 MyFamily... But in a way I'm glad I didn't, as the casual
> relationship I see really could only be seen having done what I did.
>
> I say casual because I have no proof of causation. But... it is
> interesting. If no one else has experienced similar problems, then I'd
> chalk it up to a completely unexpected unrelated set of mysterious
> circumstances that should not have happened for which there is no
> explanation.
>
> Aaron, if there is anything I can do to help you please let me know.
If anything that I said above sparks a thought, please let me know :)
>
>
> So in conclusion, I'm not done, I'm just not happy.
>
> This was supposed to be a short email, oops.
>
>
> Matt
> Speak Freely