teor wrote:
On 3 Nov. 2016, at 10:37, s7r s7r@sky-ip.org wrote:
I am very happy with the torspec patch.
Not quoting entirely, only want to add something wrt randomizing the value for fake clients based on David's and teor's comments:
David Goulet wrote: [SNIP]
- I think "superencrypted" -> "super-encrypted" would be nicer, as everything in the descriptor has that separation of words. Or even "client-encrypted" if we want to add extra semantics. No strong opinion apart from the "-" :).
I prefer super-encrypted vs. client-encrypted.
- [XXX consider randomization of the value 16]
If it's fixed, we basically create buckets, so a client can know that there are 0-16 clients or 16-32 clients and so on.
If we randomize that value and let's say it's 7, then we have buckets of 7. If that value is randomized for _every_ new descriptor, we create multiple sizes of buckets, but over time someone could (maybe) deduce the low bound of clients by observing all the random values and thus assume there are 0-<low bound>.
I'm uncertain here what's best, but it seems that in any case bucketing is happening as we pad with fake "auth-client". So I would assume here, off the top of my head and to be safe, that we might want _all_ services to kind of look the same, thus a fixed value would make sense following that train of thought.
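To make the bucketing concrete, here is a minimal sketch of what a fixed value implies. The function names and the rule "always publish at least one full bucket" are assumptions for illustration and are not taken from the torspec patch. Note that the ranges come out as 0-16, 17-32 and so on, slightly tighter than the informal "0-16, 16-32" above.

```python
BUCKET = 16  # the fixed value under discussion; 7 would give buckets of 7


def padded_entries(real_clients: int, bucket: int = BUCKET) -> int:
    """Total "auth-client" entries published (real plus fake).

    Assumption: the count is rounded up to a multiple of `bucket`,
    and at least one full bucket is always published.
    """
    buckets = max(1, -(-real_clients // bucket))  # ceiling division
    return buckets * bucket


def observer_range(observed_entries: int, bucket: int = BUCKET) -> tuple:
    """Range of real client counts consistent with what an observer sees."""
    low = observed_entries - bucket + 1 if observed_entries > bucket else 0
    return (low, observed_entries)


if __name__ == "__main__":
    for real in (0, 5, 16, 17, 40):
        total = padded_entries(real)
        print(real, "real ->", total, "entries; observer infers", observer_range(total))
```

With a fixed bucket, a single descriptor is enough for the observer to learn the range, which is the point being debated here.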
I'm liking the rest here! We'll have to think also on some padding in the INTRODUCE1 cell to avoid leaking client auth is being used.
This is true, we create buckets no matter what, but I think it's better if one has to watch a hidden service for a lot more time to determine the probable number rather than being able to tell from the first descriptor that there are 0-16 clients, 16-32 clients and so on.
I fully agree that randomizing _every_ new descriptor does not help and probably in short time someone could deduce a possible number, but I am slightly uncomfortable with a global fixed value for this. One more idea, if it's not helpful we can just go ahead with a fixed value of 16.
I think it's better if we pick a random number between 8 and 32 fake clients and remember the picked value so it will be used for every new descriptor until something in our setup changes or enough time has passed. In order to know when to reset it, we save it (in our state) along with:
- The number of real authorized clients when the random value was picked.
- Timestamp when the random value was picked + an end of life for the random value.
We reset the random value of fake authorized clients and also its end of life when:
a) number of real authorized clients in torrc changes from what we have in our state.
b) end of life for the random value is reached. End of life will be timestamp + a random period between 30 and 90 days.
c) obvious case when Tor is re-installed and old state is lost.
We call this function on every HUP and (re)start. We can tune the numbers 8 - 32 and period 30 - 90 days as you like.
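The scheme above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeClientState` and `refresh_fake_clients` are hypothetical names, not Tor code, the state is kept in memory rather than written to Tor's state file, and "between 8 and 32" is read as the number of fake entries, inclusive at both ends.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional

FAKE_MIN, FAKE_MAX = 8, 32              # tunable range for fake clients
LIFE_MIN_DAYS, LIFE_MAX_DAYS = 30, 90   # tunable end-of-life range
DAY = 86400                             # seconds per day


@dataclass
class FakeClientState:
    fake_clients: int          # the remembered random value
    real_clients_at_pick: int  # real authorized clients when it was picked
    picked_at: float           # timestamp of the pick
    expires_at: float          # end of life for the random value


def refresh_fake_clients(state: Optional[FakeClientState],
                         real_clients: int,
                         now: Optional[float] = None,
                         rng: Optional[random.Random] = None) -> FakeClientState:
    """Meant to be called on every HUP and (re)start.

    Keeps the remembered value unless (a) the number of real clients
    changed, (b) the end of life was reached, or (c) no state exists.
    """
    now = time.time() if now is None else now
    rng = rng or random.SystemRandom()
    if (state is not None
            and state.real_clients_at_pick == real_clients
            and now < state.expires_at):
        return state  # nothing changed: reuse the same value
    lifetime = rng.randint(LIFE_MIN_DAYS * DAY, LIFE_MAX_DAYS * DAY)
    return FakeClientState(
        fake_clients=rng.randint(FAKE_MIN, FAKE_MAX),
        real_clients_at_pick=real_clients,
        picked_at=now,
        expires_at=now + lifetime,
    )


if __name__ == "__main__":
    state = refresh_fake_clients(None, real_clients=3)     # case (c)
    same = refresh_fake_clients(state, real_clients=3)     # reused
    changed = refresh_fake_clients(state, real_clients=4)  # case (a)
    print(state.fake_clients, same is state, changed is not state)
```

Whether these ranges hide anything useful is exactly the open question in this thread; the sketch only shows that the bookkeeping is small.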
This way there are a lot of buckets and significantly more time needed for an observer to deduce a probable number. It is quite possible one can never deduce a "probable enough" number.
We combine this with padding the encrypted portion, if needed, up to the next multiple of 10k bytes.
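As a rough sketch of that padding step: the helper name is hypothetical, "10k" is taken here as 10,000 bytes, and NUL bytes are assumed as the filler applied to the plaintext before encryption. The spec text should be checked for the exact rule.

```python
PAD_MULTIPLE = 10_000  # assumption: "10k bytes" means 10,000 bytes


def pad_to_multiple(plaintext: bytes, multiple: int = PAD_MULTIPLE) -> bytes:
    """Pad with NUL bytes up to the next multiple of `multiple`.

    Input that is already a non-empty exact multiple is returned unchanged.
    """
    remainder = len(plaintext) % multiple
    if remainder == 0 and plaintext:
        return plaintext
    return plaintext + b"\x00" * (multiple - remainder)


if __name__ == "__main__":
    for size in (1, 9_999, 10_000, 10_001):
        print(size, "->", len(pad_to_multiple(b"a" * size)))
```

The padded length then reveals only which 10k bucket the descriptor falls into, independent of how many fake clients were added.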
It's true that it won't help if the hidden service operator changes the number of authorized clients every hour for a long period but in practice this doesn't happen - number of authorized clients changes rarely. And even in this scenario it still makes things a lot more confusing.
Compared to other parts of prop 224, this is easy to code and should be worth the effort. What do you think?
If you want to do it this way, with noise and buckets, ask someone who is good at differential privacy to do the numbers for you, rather than guessing.
You'll need to know the level of activity you want to hide.
T
As I said, the numbers can be changed - I was illustrating an example. I guessed some numbers that seemed reasonable to me so I could give an example, and also because it's not a critical part. We only try to hide the number of real authorized clients, or make it as hard as possible for an observer to deduce a number close to the realistic number of authorized clients, that's all.
Simply using the numbers that were guessed without deep knowledge in differential privacy is a lot better than using a global fixed value of 16, but as I said this doesn't need to be a debate because I am not against the fixed value, only saying it's better to randomize, if the solution exists.