"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
George and I have been working on a small proposal to add two hidden-service related statistics: number of hidden services and total hidden-service traffic.
Great, I’m starting to focus more on this project now. Well, actually I’m going on a trip for a week today, but *then* I’m focusing more on this project :-)
Sounds great! We're meeting every Tuesday at 16:00 UTC in #tor-dev. Feel free to drop by.
Excellent. I won’t be there this coming Tuesday, but I’ll be there the next Tuesday.
Replicas mean that each descriptor is stored under two identifiers, so that's two places. Further, descriptor identifiers change once per day, so during a 24-hour period, there are up to four descriptor identifiers for a hidden service.
That makes sense. It would be nice if the statistics would allow you to identify how long (i.e. how many hour periods) each descriptor was observed being published. That would allow us to figure out if there are lots of short-lived services or fewer long-lived services. Publishing statistics every hour would pretty much take care of this. If you are really set on 24 hours, then perhaps you could add the total number of published descriptors in addition to the number of *unique* published descriptors.
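To make that bookkeeping concrete, here is a small Python sketch (my own illustration; the event log and data layout are hypothetical, not from the proposal) of counting total publish events, unique descriptor IDs, and the number of hour-long periods in which each descriptor was observed:

```python
# Rough sketch (hypothetical event log, not Tor's actual data structures) of the
# bookkeeping suggested above: per reporting period, track the total number of
# publish events and, per descriptor ID, the set of hour periods it was seen in.
from collections import defaultdict

# (hour index within the 24-hour period, descriptor identifier)
publish_events = [(0, "d1"), (1, "d1"), (1, "d2"), (23, "d1")]

hours_seen = defaultdict(set)   # descriptor ID -> hour periods it was observed in
for hour, desc_id in publish_events:
    hours_seen[desc_id].add(hour)

print("total publishes:   ", len(publish_events))                         # 4
print("unique descriptors:", len(hours_seen))                             # 2
print("hour periods seen: ", {d: len(h) for d, h in hours_seen.items()})  # d1: 3, d2: 1
```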
Also, my suggestion about using additive noise applies equally well to the descriptor statistics. And multiplicative noise is a *bad idea* if you don’t have some adjustment for small values (e.g. 10% noise of a 0 value is 0, and 10% of 1 is only 0.1).
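As a quick illustration of that point (a toy example of my own, with made-up parameters), relative noise barely perturbs small counts, whereas additive Laplace noise gives the same absolute uncertainty at every count:

```python
# Toy comparison of multiplicative vs. additive noise for small counts.
# The parameters (10% relative noise, Laplace scale 5) are invented.
import numpy as np

np.random.seed(0)

def multiplicative(count, rel_noise=0.10):
    # +/-10% relative noise: a count of 0 stays exactly 0, a count of 1 moves by at most 0.1.
    return count * (1.0 + np.random.uniform(-rel_noise, rel_noise))

def additive(count, scale=5.0):
    # Additive Laplace noise: the same absolute uncertainty regardless of the count.
    return count + np.random.laplace(0.0, scale)

for true_count in (0, 1, 1000):
    print(true_count, round(multiplicative(true_count), 2), round(additive(true_count), 2))
```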
We have been thinking about many more hidden-service related statistics in a separate document. We're currently discussing whether we should turn it into a tech report, because we'll probably not want to implement most of those statistics. If you have remarks or more ideas, please feel free to edit the document. We're going to have a public review round for this, too, but that might not happen in the next week or two.
Great! I think we should go for at least a little more data in the current proposal (what is the timeline for this, btw?). I think we should come up with a list of statistics we might imagine gathering and identify the subset of those that we're comfortable gathering at this point. For example, I think failure statistics are much more innocuous than other data, and they would be very useful: they would help us understand where the protocol is failing and how to improve it, and they might help us identify misuse of hidden services (e.g. by botnet clients stupidly looking for non-existent descriptors or by malicious crawlers attempting to brute-force descriptors). So here are some ideas:
- Number of fetch requests for descriptors that don’t exist (number of fetch requests that do succeed would of course be very useful as well)
- Number of descriptor publishes to the wrong HSDir (actually I suspect that the HSDir doesn’t check this and wants to be accepting of any publish)
- Number of rendezvous circuits that never connect (from the RP perspective)
- Number of rendezvous circuits on which no data cells are ever sent
(CC'ed [tor-dev])
Thanks for the input Aaron!
The timeline here is that we are hoping the proposal _and_ the implementation to be ready by mid-December. Then we are hoping that we can deploy the code to a few relays so that we have some data by January.
So, time is tight.
I'm currently OK with the two statistics in: https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt
I feel that any other statistics will need to be carefully analyzed. We should add the ideas you mentioned in the etherpad, and get them included in the tech report (which we are also hoping to have ready in some form by mid-January).
The tech report is supposed to contain and analyze most of the HS statistics we can think of. It will likely contain many stats that we will never do, but also some stats that might be a good idea. The good ones we should eventually integrate to the Tor proposal and write code for.
Thanks for the very valuable input! Let me know if the following draft looks okay, and I'll start another thread on tor-dev@.
https://people.torproject.org/~karsten/volatile/238-hs-relay-stats-2014-11-2...
"Lab(\epsilon/C)” -> "Lap(\epsilon/C)” (that was my mistake. I think having the added noise both parameterized and included in the reported statistics is an idea worth thinking about. Making it a parameter allows you to easily change it without upgrading. Including it in the statistics would allow us to correct better for noise if different relays might be adding different amounts of noise due to inconsistent opinions of the noise parameter (if this should never happen, then I guess this wouldn’t be necessary).
So again, sorry that I’m not going to be very responsive on this for the next week. I’m really happy that you’re working on it!
Best, Aaron
On 20/11/14 13:42, George Kadianakis wrote:
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
George and I have been working on a small proposal to add two hidden-service related statistics: number of hidden services and total hidden-service traffic.
Great, I’m starting to focus more on this project now. Well, actually I’m going on a trip for a week today, but *then* I’m focusing more on this project :-)
Sounds great! We're meeting every Tuesday at 16:00 UTC in #tor-dev. Feel free to drop by.
Excellent. I won’t be there this coming Tuesday, but I’ll be there the next Tuesday.
Replicas mean that each descriptor is stored under two identifiers, so that's two places. Further, descriptor identifiers change once per day, so during a 24-hour period, there are up to four descriptor identifiers for a hidden service.
That makes sense. It would be nice if the statistics would allow you to identify how long (i.e. how many hour periods) each descriptor was observed being published. That would allow us to figure out if there are lots of short-lived services or fewer long-lived services. Publishing statistics every hour would pretty much take care of this. If you are really set on 24 hours, then perhaps you could add the total number of published descriptors in addition to the number of *unique* published descriptors.
Also, my suggestion about using additive noise applies equally well to the descriptor statistics. And multiplicative noise is a *bad idea* if you don’t have some adjustment for small values (e.g. 10% noise of a 0 value is 0, and 10% of 1 is only 0.1).
We have been thinking about many more hidden-service related statistics in a separate document. We're currently discussing whether we should turn it into a tech report, because we'll probably not want to implement most of those statistics. If you have remarks or more ideas, please feel free to edit the document. We're going to have a public review round for this, too, but that might not happen in the next week or two.
Great! I think we should go for at least a little more data in the current proposal (what is the timeline for this, btw?). I think we should come up with a list of statistics we might imagine gathering and identify the subset of those that we’re comfortable gathering at this point. For example, I think failure statistics is much more innocuous than other data, and those would be very useful. For example, they would help us understand how to improve the protocol is failing, and it might help us identify misuse of hidden services (e.g. by botnets clients stupidly looking for non-existent descriptors or by malicious crawlers attempting to brute force descriptors). So here are some ideas:
- Number of fetch requests for descriptors that don’t exist (number of fetch requests that do succeed would of course be very useful as well)
- Number of descriptor publishes to the wrong HSDir (actually I suspect that the HSDir doesn’t check this and wants to be accepting of any publish)
- Number of rendezvous circuits that never connect (from the RP perspective)
- Number of rendezvous circuits on which no data cells are ever sent
(CC'ed [tor-dev])
Thanks, George, for moving the discussion here.
Here's the latest proposal draft where I incorporated Aaron's suggestions:
https://gitweb.torproject.org/user/karsten/torspec.git/blob/refs/heads/hs_st...
If people on this list have more feedback, please reply here. Thanks!
All the best, Karsten
A response to George’s comment: "The timeline here is that we are hoping the proposal _and_ the implementation to be ready by mid-December… I'm currently OK with the two statistics in: https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt… I feel that any other statistics will need to be carefully analyzed.”
I believe Roger created a branch implementing these two statistics as well as the number of HS descriptor requests at an HSDir, and I believe that he ran those on some relays (at least Moritz's). Are you just recreating this work? Did those relays stop collecting those statistics? What happened to that data? It won't be terribly interesting if all we do is report *fewer* statistics collected at a later date than at the kickoff meeting.
I also think that we should identify some questions we hope to investigate for January, such as:
1. How much HS traffic is there? Already semi-answered by Roger this summer, as I just mentioned.
2. How many descriptors are there? Ditto.
3. How many descriptors are never requested? Now we're getting somewhere.
4. What is the median or maximum number of requests? This would be incredibly informative about the skew of HS popularity.
5. How many failures/anomalies do we observe? This would help us figure out how well HSes are working and how broken/abusive client behavior is.
And is there a reason this process has to be so slow? Is it the security review? Roger managed to pump out a branch for stats collection and get it generating data within a week. It’s pretty pathetic if we can’t do better ;-)
Cheers, Aaron
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
A response to George’s comment: "The timeline here is that we are hoping the proposal _and_ the implementation to be ready by mid-December… I'm currently OK with the two statistics in: https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt… I feel that any other statistics will need to be carefully analyzed.”
I believe Roger created a branch implementing these two statistics as well as the number of HS descriptor requests at an HSDir, and I believe that he ran those on some relays (at least Moritz's). Are you just recreating this work? Did those relays stop collecting those statistics? What happened to that data? It won't be terribly interesting if all we do is report *fewer* statistics collected at a later date than at the kickoff meeting.
Hello,
Roger's branch was a PoC that wrote stats on the log file. I don't think we have newer data than what is in #13192. It's unclear whether the relays stopped collecting statistics, or they just haven't updated the trac ticket.
Instead, we are planning to write the stats to the extra-info descriptors, so that relays publish these stats every day.
Also, Roger's stats were counting cells from both RP and IP circuits. It's unclear whether we will take the same approach; atm I find it more reasonable to only count RP cells/circuits.
BTW, did Roger do the "How many HSes are there?" HSDir stats? Is there a ticket for that?
In any case, Roger told us that answering the questions "Approx. how many HSes are there?" and "How much bw is HS bw?" are the important parts of what we need to have by January. Our plan was to have that, plus a document with various future statistics we might or might not do. Do you think that's not sufficient?
I also think that we should identify some questions we hope to investigate for January, such as:
1. How much HS traffic is there? Already semi-answered by Roger this summer, as I just mentioned.
2. How many descriptors are there? Ditto.
3. How many descriptors are never requested? Now we're getting somewhere.
4. What is the median or maximum number of requests? This would be incredibly informative about the skew of HS popularity.
5. How many failures/anomalies do we observe? This would help us figure out how well HSes are working and how broken/abusive client behavior is.
And is there a reason this process has to be so slow? Is it the security review? Roger managed to pump out a branch for stats collection and get it generating data within a week. It’s pretty pathetic if we can’t do better ;-)
Hm.
Security review is indeed a big part. I'm not persuaded that just collecting all kinds of statistics from the Tor network is always good or helpful [0]. I personally prefer to do this methodically and with sufficient time for thinking and feedback, instead of starting to collect various statistics in a short time. I feel that getting pressured about *moar statistics* is a slippery slope that leads to badness.
I also believe that some of these extra stats (e.g. "How many failures/anomalies do we observe?") should first be done on a privnet instead of the real network. That can give us some preliminary results, and then we can consider doing them on the real network. Maybe we can also have some privnet stats by January.
Also, I'm the main person who will be doing stats on the real network, both the proposal and the implementation, and I'm not full-time on this. Other SponsorR people are doing different things, like HS privnet setup and collecting statistics/benchmarks on privnets. Karsten recently started helping with the proposal which is a huge help!
Also also, we hope to finish everything by mid-December so that we can also have time to deploy the stats in a few relays. This is a month from now, not too far away.
But to be a bit more constructive, if you want more stats to happen faster, I invite you to help with the security analysis. If you can show that the stats you want to see don't reveal information about specific HSes or their clients and that they are useful to have, maybe we will have time to integrate them before January. No promises here.
And to be a bit more technical, at first glance I don't think we should do "4. What is the median or maximum number of requests?". This would allow an attacker to learn the popularity of *specific* HSes if they have their onion address. Why do we want that?
Finally, if you know that the funder will be unhappy with just those two stats and we should *definitely* do more, then please tell us and we can think of something. Roger didn't give me this impression.
Cheers, Aaron
[0]: Did you know that relays (and bridges) report bandwidth statistics every *15* minutes? I have no idea if this is a good idea to do, especially for relays that see very few clients.
On Nov 21, 2014, at 9:39 AM, George Kadianakis desnacked@riseup.net wrote:
I also believe that some of these extra stats (e.g. "How many failures/anomalies do we observe?") should first be done on a privnet instead of the real network. That can give us some preliminary results, and then we can consider doing them on the real network. Maybe we can also have some privnet stats by January.
I think this is a great idea. While we might not learn (at first) the absolute number of failures that occur in the *real* network, we will at least be able to say things about the *fraction* of requests that fail. That can be collected in a large ShadowTor network.
In fact, I would advocate implementing the collection of all of the stats that Aaron has requested; for those stats that still need some security analysis before people are convinced, we can enable those in TestingTorNetwork mode for now. If they are blessed, collecting them in “normal” mode becomes easier.
In fact, might we also consider doing this for even more of the statistics from the etherpad (the ones that make sense for privnets)? I suppose at some point there will exist an implementation bottleneck, but being able to say as much as possible in January - even if it is only from privnet stats - is a win. We can then hope to be able to say more about the real network by the following deadline.
-Rob
Roger's branch was a PoC that wrote stats on the log file. I don't think we have newer data than what is in #13192. It's unclear whether the relays stopped collecting statistics, or they just haven't updated the trac ticket.
If we could check on that and get that data, that would be really helpful. Then we could do analysis in parallel with the better extra-info implementation.
Also, Roger's stats were counting cells from both RP and IP circuits. It's unclear whether we will take the same approach; atm I find it more reasonable to only count RP cells/circuits.
IP stats are also interesting, but, I agree, less so than RP stats alone.
BTW, did Roger do the "How many HSes are there?" HSDir stats? Is there a ticket for that?
I am fairly sure he at least counted descriptor updates at an HSDir. I have a slide bullet saying "We estimate about 30 to 50K hidden services are updating their descriptors each day" from the kickoff meeting, and I recall Roger talking about that. The question then is what the "dark matter" of Hidden Services consists of, that is, the 30-50K HSes less the ~1500 that are publicly available and were responding at that time.
In any case, Roger told us that answering the questions "Approx. how many HSes are there?" and "How much bw is HS bw?" are the important parts of what we need to have by January. Our plan was to have that, plus a document with various future statistics we might or might not do. Do you think that's not sufficient?
I’m not sure about “not sufficient”, but as I said, Roger already reported estimates for those last time. But I’d go with his opinion on this - it is Tor’s part of the project.
Security review is indeed a big part. I'm not persuaded that just collecting all kinds of statistics from the Tor network is always good or helpful [0]. I personally prefer to do this methodically and with sufficient time for thinking and feedback, instead of starting to collect various statistics in a short time. I feel that getting pressured about *moar statistics* is a slippery slope that leads to badness.
OK, makes sense. So let's start tackling the hard question: what exactly do hidden services want to protect? Some questions:
1. Should HSes be able to hide that they even exist at all in the system? If so, counting the number of hidden services reduces this somewhat (up to the added noise/inaccuracy). And by the way, random noise doesn't necessarily hide this, because over time, if you choose new noise every measurement period and the number of HSes is constant, then the average will eventually reveal the exact number (see the small simulation after this list). Ideas to handle this: reuse randomness (except now that reveals exactly when HSes are added or removed), or round to the nearest multiple of some bucket size (although what about the one HS that puts you into the next bucket..). Doing this against an active adversary (not one who just looks at your reported stats) is much harder, of course, because you need to prevent HSDirs from knowing how many real descriptors they have.
2. Should HSes be able to hide their (pseudonymous) popularity (i.e. number of users, connections per user)? If so, collecting RP cell counts already leaks averages and puts a lower bound on the max.
3. Should client HS lookups be hidden so that nobody knows what's being queried or how often? If so, collecting descriptor requests could reveal a very active set of clients.
These are hard questions because HSes are only designed to hide location, but there also appears to be a strong desire to make it hard to learn anything else about them. But there are good reasons (e.g. designing protocol improvements, troubleshooting problems, watching for malicious behavior) to learn *something* about HSes.
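A small simulation of the averaging problem from question 1 (entirely my own, with made-up numbers): fresh Laplace noise every period averages away when the true count stays constant, whereas deterministic bucketing never gets more precise than the bucket size:

```python
# Toy simulation of the averaging attack on per-period noise. All numbers are invented.
import numpy as np

np.random.seed(0)
TRUE_COUNT = 137     # hypothetical constant number of hidden services
SCALE = 8.0          # hypothetical Laplace noise scale
BUCKET = 8           # hypothetical rounding bucket size
PERIODS = 365        # one report per day for a year

# Fresh noise each period: the mean of the reports drifts back toward the true count.
noisy_reports = TRUE_COUNT + np.random.laplace(0.0, SCALE, size=PERIODS)

# Deterministic bucketing: every report is identical, so averaging gains nothing,
# but precision is capped at the bucket size.
bucketed_report = round(TRUE_COUNT / BUCKET) * BUCKET

print("true count:           ", TRUE_COUNT)
print("mean of noisy reports:", round(noisy_reports.mean(), 1))
print("bucketed report:      ", bucketed_report)
```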
I also believe that some of these extra stats (e.g. "How many failures/anomalies do we observe?") should first be done on a privnet instead of the real network. That can give us some preliminary results, and then we can consider doing them on the real network. Maybe we can also have some privnet stats by January.
Testing any code changes makes sense to be confident they work as intended. And I agree that failure stats might give us useful information just from the test network. Getting those for January seems like a great idea.
Also, I'm the main person who will be doing stats on the real network, both the proposal and the implementation, and I'm not full-time on this. Other SponsorR people are doing different things, like HS privnet setup and collecting statistics/benchmarks on privnets. Karsten recently started helping with the proposal which is a huge help!
Understood :-)
But to be a bit more constructive, if you want more stats to happen faster, I invite you to help with the security analysis. If you can show that the stats you want to see don't reveal information about specific HSes or their clients and that they are useful to have, maybe we will have time to integrate them before January. No promises here.
Sounds like a plan. Let me know what you think about the security issues I brought up earlier (Karsten put them in the security section of the proposal), as well as the questions I raise above.
And to be a bit more technical, at first glance I don't think we should do "4. What is the median or maximum number of requests?". This would allow an attacker to learn the popularity of *specific* HSes if they have their onion address. Why do we want that?
How would knowing the median reveal the popularity of a specific HS? And as I said earlier, there is an issue with learning an HS's popularity over time, if the service goes up and down, by correlating that with changes in the count of total connections.
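To spell out that uptime-correlation concern (a toy model of my own, with invented numbers): an observer who knows when one particular service was online can compare the reported totals during its uptime and downtime, and the difference estimates that service's contribution:

```python
# Toy model of inferring one service's popularity from noisy totals plus
# knowledge of when that service was up. All numbers are invented.
import numpy as np

np.random.seed(1)
DAYS = 200
background = np.random.normal(50000, 500, DAYS)   # connections from all other HSes
target_up = np.random.rand(DAYS) < 0.5            # days the observer knows the target was online
target_traffic = 800                              # the value the statistics should hide

reported = background + target_up * target_traffic + np.random.laplace(0, 200, DAYS)

# Difference of means between "up" days and "down" days approximates target_traffic.
estimate = reported[target_up].mean() - reported[~target_up].mean()
print("estimated target popularity:", round(estimate))   # close to 800
```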
Finally, if you know that the funder will be unhappy with just those two stats and we should *definitely* do more, then please tell us and we can think of something. Roger didn't give me this impression.
Just repeating what I said above for clarity: go with Roger’s opinion on this. That is just my opinion.
[0]: Did you know that relays (and bridges) report bandwidth statistics every *15* minutes? I have no idea if this is a good idea to do, especially for relays that see very few clients.
I did know this. It does seem potentially revealing of, say, the guard used by a hidden service, because you can easily modulate the HS's traffic in 15-minute intervals. Somebody should think about what statistics gathering might reveal and if that's cool.
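Here is a toy sketch of that modulation idea (my own, with invented numbers): push traffic to the hidden service in a known on/off schedule and look for the relay whose 15-minute byte counts correlate with it, which would point at the guard:

```python
# Toy sketch of correlating an attacker-chosen on/off traffic pattern with
# per-15-minute bandwidth reports from candidate relays. Numbers are invented.
import numpy as np

np.random.seed(2)
INTERVALS = 96                                   # one day of 15-minute bins
pattern = np.tile([1, 0], INTERVALS // 2)        # attacker's on/off schedule
injected = 5e7                                   # bytes pushed per "on" interval

relays = {name: np.random.normal(2e8, 2e7, INTERVALS) for name in ("A", "B", "C")}
relays["B"] = relays["B"] + pattern * injected   # "B" plays the guard in this toy

for name, counts in relays.items():
    corr = np.corrcoef(pattern, counts)[0, 1]
    print(name, round(corr, 2))                  # the guard's correlation stands out
```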
Cheers, Aaron
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
Roger's branch was a PoC that wrote stats on the log file. I don't think we have newer data than what is in #13192. It's unclear whether the relays stopped collecting statistics, or they just haven't updated the trac ticket.
If we could check on that and get that data, that would be really helpful. Then we could do analysis in parallel with the better extra-info implementation.
I asked Moritz but he told me that he stopped collecting those stats...
Also, Roger's stats were counting cells from both RP and IP circuits. It's unclear whether we will take the same approach; atm I find it more reasonable to only count RP cells/circuits.
IP stats are also interesting, but, I agree, less so than RP stats alone.
Yes.
Also, IP stats can be linked to specific HSes, whereas RP stats shouldn't be linkable to specific HSes.
BTW, did Roger do the "How many HSes are there?" HSDir stats? Is there a ticket for that?
I am fairly sure he at least counted descriptor updates at an HSDir. I have a slide bullet saying "We estimate about 30 to 50K hidden services are updating their descriptors each day" from the kickoff meeting, and I recall Roger talking about that. The question then is what the "dark matter" of Hidden Services consists of, that is, the 30-50K HSes less the ~1500 that are publicly available and were responding at that time.
Yep, found it at #13195.
In any case, Roger told us that answering the questions "Approx. how many HSes are there?" and "How much bw is HS bw?" are the important parts of what we need to have by January. Our plan was to have that, plus a document with various future statistics we might or might not do. Do you think that's not sufficient?
I’m not sure about “not sufficient”, but as I said, Roger already reported estimates for those last time. But I’d go with his opinion on this - it is Tor’s part of the project.
Security review is indeed a big part. I'm not persuaded that just collecting all kinds of statistics from the Tor network is always good or helpful [0]. I personally prefer to do this methodically and with sufficient time for thinking and feedback, instead of starting to collect various statistics in a short time. I feel that getting pressured about *moar statistics* is a slippery slope that leads to badness.
OK, makes sense. So let’s start tackling the hard question: what exactly do hidden services want to protect? Some questions:
To all the questions below, my answer would probably be: "Yes, to the extent possible".
Even though those properties are not really related to hiding the location of the HS, I believe that the less info the adversary knows about a specific HS, the harder it is to deanonymize it.
This philosophy can be seen in various places of the HS spec, for example by the fact that ephemeral keys are used for introduction points so that they don't know which HS they are serving, or by the fact that HS descriptors will soon become encrypted so that the HSDirs cannot read them.
- Should HSes be able to hide that they even exist at all in the system? If so, counting the number of hidden services reduces this somewhat (up to the added noise/inaccuracy). And by the way, random noise doesn't necessarily hide this, because over time, if you choose new noise every measurement period and the number of HSes is constant, then the average will eventually reveal the exact number. Ideas to handle this: reuse randomness (except now that reveals exactly when HSes are added or removed), or round to the nearest multiple of some bucket size (although what about the one HS that puts you into the next bucket..). Doing this against an active adversary (not one who just looks at your reported stats) is much harder, of course, because you need to prevent HSDirs from knowing how many real descriptors they have.
The fact that noise is not very effective here is true, but I also acknowledge that this could be a useful stat, so we need to find the right balance.
I'm hoping that the noise and the fact that the number of HSes is not really constant, will be able to obfuscate the exact number of HSes. So that if an adversary wanted to enumerate all of them in the current network, he would be off by a hundred or so.
- Should HSes be able to hide their (pseudonymous) popularity (i.e. number of users, connections per user)? If so, collecting RP cell counts already leaks averages and puts a lower bound on the max.
That's also true. Hopefully RP cell counts won't be able to reveal the popularity of a specific HS though, which is the important part IMO.
- Should client HS lookups be hidden so that nobody knows what's being queried or how often? If so, collecting descriptor requests could reveal a very active set of clients.
Personally, I don't think we should count HSDir descriptor requests. HSDir descriptor requests can be linked back to specific HSes, which I think is bad.
These are hard questions because HSes are only designed to hide location, but there also appears to be a strong desire to make it hard to learn anything else about them. But there are good reasons (e.g. designing protocol improvements, troubleshooting problems, watching for malicious behavior) to learn *something* about HSes.
<snip>
[0]: Did you know that relays (and bridges) report bandwidth statistics every *15* minutes? I have no idea if this is a good idea to do, especially for relays that see very few clients.
I did know this. It does seem potentially revealing of, say, the guard used by a hidden service, because you can easily modulate the HS's traffic in 15-minute intervals. Somebody should think about what statistics gathering might reveal and if that's cool.
Hm. I made #13838 for this.