The next problem is how to find the proper parameters for the Laplace distribution. I guess the mean μ needs to be 0, but the hard part is 'b'. In a few papers I read, they set 'b' to (Δf/ε).
In the above, Δf is the "largest change a single participant could have on the output" of the query (its sensitivity). Trying to fit this database paradigm to our use case, the largest change a single HS could cause to the HSDir HS counting stats is to change the result by 1. So Δf is 1, and I think that ε is some kind of privacy parameter, let's set that to 0.3 or something.
Yes, you’re right about how to set these parameters. ε should probably be less than one, but if it is too small (say, below 0.1), the accuracy is horrible.
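To make the parameter choice concrete, here is a minimal sketch of adding Laplace noise with μ = 0 and b = Δf/ε to a count. The function names (`laplace_scale`, `sample_laplace`, `obfuscate_count`) are hypothetical and not taken from any Tor code. The sampler relies on the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with mean 0 and scale b.

```python
import random


def laplace_scale(delta_f, epsilon):
    """Return the Laplace scale b = delta_f / epsilon."""
    if delta_f <= 0 or epsilon <= 0:
        raise ValueError("delta_f and epsilon must be positive")
    return delta_f / epsilon


def sample_laplace(b, rng=random):
    """Draw one sample from Laplace(mu=0, scale=b).

    The difference of two i.i.d. exponentials with mean b is
    Laplace-distributed with mean 0 and scale b.
    """
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def obfuscate_count(true_count, delta_f=1.0, epsilon=0.3, rng=random):
    """Return true_count plus Laplace noise of scale delta_f / epsilon.

    The defaults (delta_f=1, epsilon=0.3) are the values floated in
    this thread, not settled parameters.
    """
    b = laplace_scale(delta_f, epsilon)
    return true_count + sample_laplace(b, rng)
```

With Δf = 1, ε = 0.3 gives b ≈ 3.33, while ε = 0.1 gives b = 10. The standard deviation of the noise is b·√2, so a smaller ε directly means a noisier reported count.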
Now, I'm wondering how to do the same thing for the RP cell statistics. In this case, Δf would have to be the largest number of cells we hope to obfuscate in an RP circuit. This is a chicken-and-egg situation, since we don't really know how many cells we usually get without doing these stats first.
Maybe we can use the preliminary stats from #13192, which contain both RP and IP cells (but IP cells will probably be a minority). Or maybe we can fit the distribution dynamically based on the number of cells we receive every day (does this even make sense)? Or what?
There is a problem here in that RPs are distributed over all relays, and so each relay must contribute a number (and therefore noise) while probably carrying relatively little traffic. Ignoring that, the number of cells to obfuscate depends on our privacy goal. If it is to hide whether or not a “typical” rendezvous circuit passed through a given RP, then Δf should be a number of cells that would cover most typical circuits. We can look at this another way as well, and say that we will hide, say, 10 MiB of traffic, and beyond that the RP you used might start to become revealed in the stats.
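As a rough illustration of the "hide 10 MiB" framing, the sketch below converts a traffic bound into a cell count and then into a Laplace scale. It assumes about 498 bytes of application payload per relay data cell and reuses ε = 0.3 from the HSDir discussion. Both are assumptions for the example, and the names (`RELAY_PAYLOAD_BYTES`, `cells_for_bytes`) are hypothetical.

```python
# Assumption: roughly 498 bytes of application data fit in one relay cell.
RELAY_PAYLOAD_BYTES = 498


def cells_for_bytes(n_bytes, payload_bytes=RELAY_PAYLOAD_BYTES):
    """Number of cells needed to carry n_bytes (integer ceiling division)."""
    if n_bytes < 0 or payload_bytes <= 0:
        raise ValueError("n_bytes must be >= 0 and payload_bytes > 0")
    return -(-n_bytes // payload_bytes)


# Traffic bound we want to hide: 10 MiB.
delta_f = cells_for_bytes(10 * 1024 * 1024)

# Assumption: same epsilon as suggested for the HSDir stats.
epsilon = 0.3
b = delta_f / epsilon

print(f"delta_f = {delta_f} cells, Laplace scale b = {b:.0f} cells")
```

Under these assumptions the noise scale comes out in the tens of thousands of cells, which shows why a relay that carries little rendezvous traffic would report mostly noise.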
Aaron