Hello there,
I inline a copy of a proposal we've been working on lately. Discussion can be found in the "Feedback on obfuscating hidden-service statistics" thread.
The proposal suggests that Tor relays add some stats about hidden service usage. We believe that these stats are not dangerous and can be useful to Tor developers and to people who want to understand hidden services and the onionspace better.
Any feedback is appreciated :)
======
Filename: 238-hs-relay-stats.txt
Title: Better hidden service stats from Tor relays
Author: George Kadianakis, David Goulet, Karsten Loesing, Aaron Johnson
Created: 2014-11-17
Status: Draft
0. Motivation
Hidden services are one of the least understood parts of the Tor network. We don't really know how many hidden services there are and how much they are used.
This proposal suggests that Tor relays include some hidden-service-related stats in their extra-info descriptors. No stats are collected from Tor hidden services or clients.
While uncertainty might be a good thing in a hidden network, learning more information about the usage of hidden services can be helpful.
For example, learning how many cells are sent for hidden service purposes tells us whether hidden service traffic is 2% of the Tor network traffic or 90% of the Tor network traffic. This info can also help us during load balancing, for example if we change the path building of hidden services to mitigate guard discovery attacks [0].
Also, learning the number of hidden services can give us an understanding of how widespread hidden services are. It will also help us understand approximately how much load hidden service logistics, like introduction point circuits, put on the network.
1. Design
Tor relays will add some fields related to hidden service statistics to their extra-info descriptors.
Tor relays collect these statistics by keeping track of their hidden service directory or rendezvous point activities, slightly obfuscating the numbers and posting them to the directory authorities. Extra-info descriptors are posted to directory authorities every 24 hours.
2. Implementation
2.1. Hidden service statistics interval
We want relays to report hidden-service statistics over a long-enough time period to not put users at risk. Similar to other statistics, we suggest a 24-hour statistics interval. All related statistics are collected at the end of that interval and included in the next extra-info descriptors published by the relay.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL [At most once.]
YYYY-MM-DD HH:MM:SS defines the end of the included measurement interval of length NSEC seconds (86400 seconds by default).
A "hidserv-stats-end" line, as well as any other "hidserv-*" line, is first added after the relay has been running for at least 24 hours.
2.2. Hidden service traffic statistics
We want to learn how much of the total Tor network traffic is caused by hidden service usage. There are three phases in the rendezvous protocol where traffic is generated: (1) when hidden services make themselves available in the network, (2) when clients open connections to hidden services, and (3) when clients exchange application data with hidden services. We expect (3) to consume most bytes here, so we're focusing on this only. More precisely, we measure hidden service traffic by counting RELAY cells seen on a rendezvous point after receiving a RENDEZVOUS1 cell. These RELAY cells include commands to open or close application streams, and they include application data.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-rend-relayed-cells" SP num NL [At most once.]
Approximate number of RELAY cells seen in either direction on a circuit after receiving and successfully processing a RENDEZVOUS1 cell. The actual number observed by the relay is multiplied by a random number in [0.9, 1.1] and then rounded down before being reported.
The keyword indicates that this line is part of hidden-service statistics ("hidserv") and contains aggregate data from the relay acting as rendezvous point ("rend").
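To make the multiply-and-floor step concrete, here is a minimal sketch in Python (the function name is illustrative; little-t-tor would implement this in C):

    import math
    import random

    def obfuscate_count(actual_count):
        # Multiply the observed count by a uniform random factor in
        # [0.9, 1.1], then floor the result before reporting it.
        return int(math.floor(actual_count * random.uniform(0.9, 1.1)))

    # e.g. obfuscate_count(20000) reports a value between 18000 and 22000

The same obfuscation is applied to the hidden-service counting statistic in the next section.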
2.3. HSDir hidden service counting
We also want to learn how many hidden services exist in the network. The best place to learn this is at hidden service directories where hidden services publish their descriptors.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-dir-published-ids" SP num NL [At most once.]
Approximate number of unique hidden-service identities seen in descriptors published to and accepted by this hidden-service directory. The actual number observed by the directory is multiplied by a random number in [0.9, 1.1] and then rounded down before being reported.
This statistic requires keeping a separate data structure with unique identities seen during the current statistics interval. We could, in theory, have relays iterate over their descriptor caches when producing the daily hidden-service statistics blurb. But it's unclear how caching would affect results from such an approach, because descriptors published at the start of the current statistics interval could already have been removed, and descriptors published in the last statistics interval could still be present. Keeping a separate data structure, possibly even a probabilistic one, seems like the more accurate approach.
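As a minimal sketch of such a separate data structure, assuming a plain set keyed by hidden-service identity (a probabilistic counter could be substituted to save memory; names are illustrative, not little-t-tor's):

    class HsDirStats:
        """Unique hidden-service identities seen during the current
        statistics interval, tracked independently of the descriptor
        cache."""

        def __init__(self):
            self._identities = set()

        def note_descriptor_accepted(self, identity):
            # Called whenever a descriptor is published to and accepted
            # by this hidden-service directory.
            self._identities.add(identity)

        def end_of_interval(self):
            # Report the count and start a fresh interval.
            count = len(self._identities)
            self._identities = set()
            return count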
3. Security
The main security consideration that needs discussion is what an adversary could do with reported statistics that they couldn't do without them. In the following, we're going through things the adversary could learn, how plausible that is, and how much we care. (All these things refer to hidden-service traffic, not to hidden-service counting. We should think about the latter, too.)
3.1. Identify rendezvous point of high-volume and long-lived connection
The adversary could identify the rendezvous point of a very large and very long-lived HS connection by observing a relay with an unexpectedly large relay cell count.
3.2. Identify number of users of a hidden service
The adversary may be able to identify the number of users of an HS if he knows the amount of traffic on a connection to that HS (which he can potentially determine himself) and knows when that service goes up or down. He can look at the change in the total reported RP traffic to determine approximately how many fewer HS users there are when that HS is down.
4. Discussion
4.1. Why count only RP cells? Why not also count IP cells?
As discussed on IRC, counting only RP cells should be fine for now. Everything else is protocol overhead: HSDir traffic, introduction point traffic, rendezvous point traffic before the first RELAY cell, etc.
Furthermore, introduction points correspond to specific HSes, so publishing IP cell stats could reveal the popularity of specific HSes.
4.2. How to use these stats?
4.2.1. How to use RP Cell statistics
We plan to extrapolate reported values to network totals by dividing values by the probability of clients picking relays as rendezvous point. This approach should become more precise for faster relays and as more relays report these statistics.
We also plan to compare reported values with "cell-*" statistics to learn what fraction of traffic can be attributed to hidden services.
Ideally, we'd be able to compare values to "write-history" and "read-history" lines to compute similar fractions of traffic used for hidden services. The goal would be to avoid enabling "cell-*" statistics by default. In order for this to work we'll have to multiply reported cell numbers by the default cell size of 512 bytes (we cannot infer the actual number of bytes, because cells are end-to-end encrypted between client and service).
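As a rough sketch of both steps, under assumed inputs ('rend_probability' standing for the probability that clients pick this relay as rendezvous point, and 'total_bytes' for the byte total from the relay's "write-history"/"read-history" lines; both names are illustrative):

    CELL_SIZE = 512  # assumed bytes per cell for the conversion below

    def extrapolate_rend_cells(reported_cells, rend_probability):
        # Divide the reported (obfuscated) cell count by the probability
        # that clients pick this relay as rendezvous point.
        return reported_cells / rend_probability

    def hidserv_traffic_fraction(extrapolated_cells, total_bytes):
        # Convert cells to bytes and compare against the history totals.
        return (extrapolated_cells * CELL_SIZE) / float(total_bytes)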
4.2.2. How to use HSDir HS statistics
We plan to extrapolate this value to network totals by calculating what fraction of hidden-service identities this relay was supposed to see. This extrapolation will be very rough, because each hidden-service directory is only responsible for a tiny share of hidden-service descriptors, and there is no way to increase that share significantly.
Here are some numbers: there are about 3000 directories, and each descriptor is stored on three directories. So, each directory is responsible for roughly 1/1000 of descriptor identifiers. There are two replicas for each descriptor (that is, each descriptor is stored under two descriptor identifiers), and descriptor identifiers change once per day (which means that, during a 24-hour period, there are two opportunities for each directory to see a descriptor). Hence, each descriptor is stored at four places in identifier space throughout a 24-hour period. The probability that any given directory sees a given hidden-service identity is 1-(1-1/1000)^4 = 0.00399, or roughly 1/250. This approximation constitutes an upper bound, because it assumes that services are running all day. An extrapolation based on this formula will lead to undercounting the total number of hidden services.
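The same calculation, written out with the approximations from the paragraph above (a sketch, not part of the proposed implementation):

    def prob_dir_sees_identity(n_dirs=3000, spread=3, replicas=2, rotations=2):
        # Each descriptor goes to 'spread' of the ~n_dirs directories,
        # under 'replicas' descriptor IDs, and IDs rotate once per day,
        # so it lands at replicas * rotations places per 24 hours.
        p_single = spread / float(n_dirs)         # ~1/1000
        placements = replicas * rotations         # 4
        return 1 - (1 - p_single) ** placements   # ~0.00399, roughly 1/250

    def extrapolate_total_onions(reported_unique_identities):
        # Dividing by this upper-bound probability undercounts the total.
        return reported_unique_identities / prob_dir_sees_identity()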
A possible inaccuracy in the estimation algorithm comes from the fact that a relay may not be acting as hidden-service directory during the full statistics interval. We'll have to look at consensuses to determine when the relay first received the "HSDir" flag, and only consider the part of the statistics interval following the valid-after time of that consensus.
4.3. Multiplicative or additive noise?
A possible alternative to multiplying the number of cells with a random factor is to introduce additive noise. Let's suppose that we would like to obscure any individual connection that contains C cells or fewer (obscuring extremely and unusually large connections seems hopeless but unnecessary). That is, we don't want the (distribution of the) cell count from any relay to change by much whether or not C cells are removed. The standard differential privacy approach would be to *add* noise from the Laplace distribution Lap(\epsilon/C), where \epsilon controls how much the statistics *distribution* can multiplicatively differ. This is not to say that we need to add noise exactly from that distribution (maybe we weaken the guarantee slightly to get better accuracy), but the same idea applies. This would apply in the same way to both large and small relays. We *want* to learn roughly how much hidden-service traffic each relay has - we just want to obscure the exact number within some tolerance. We'll probably want to include the algorithm and parameters used for adding noise in the "hidserv-rend-relayed-cells" line, as in, "lap=x" with x being \epsilon/C.
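A minimal sketch of that idea, reading Lap(\epsilon/C) as the rate parametrization, i.e. a Laplace distribution with scale C/\epsilon (matching the b = delta_f/epsilon adopted in later revisions of the proposal); the sampler uses the inverse CDF since Python's standard library has no Laplace sampler, and the default parameters are only illustrative:

    import math
    import random

    def laplace_noise(scale):
        # Inverse-transform sample from Laplace(0, scale).
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def obfuscate_additive(actual_cells, C=2048, epsilon=0.3):
        # Adding noise with scale C/epsilon obscures the presence or
        # absence of any single connection of up to C cells.
        return actual_cells + laplace_noise(C / epsilon)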
[0]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
George Kadianakis desnacked@riseup.net writes:
Hello there,
I inline a copy of a proposal we've been working on lately. Discussion can be found in the "Feedback on obfuscating hidden-service statistics" thread.
The proposal suggests that Tor relays add some stats about hidden service usage. We believe that these stats are not dangerous and can be useful to Tor developers and to people who want to understand hidden services and the onionspace better.
Any feedback is appreciated :)
Hello,
I inline a better version of that proposal.
This version has more information about the obfuscation methods deployed. I also moved some paragraphs around to make the implementation parts faster to skim through.
I currently like the proposal. Maybe an area that could be improved is a better format than:

  "hidserv-rend-relayed-cells" SP num SP key=val SP key=val ... NL

However, this format is quite easy to generate using little-t-tor, and should not be too hard to parse either.
If people like this proposal please feel free to merge it to torspec.git.
BTW, implementation of this has already started at: https://trac.torproject.org/projects/tor/ticket/13192
You can also find this proposal at: https://gitweb.torproject.org/user/asn/torspec.git/log/?h=hs_stats
Feedback is welcome!
--------------------
Filename: 238-hs-relay-stats.txt
Title: Better hidden service stats from Tor relays
Author: George Kadianakis, David Goulet, Karsten Loesing, Aaron Johnson
Created: 2014-11-17
Status: Draft
0. Motivation
Hidden services are one of the least understood parts of the Tor network. We don't really know how many hidden services there are and how much they are used.
This proposal suggests that Tor relays include some hidden-service-related stats in their extra-info descriptors. No stats are collected from Tor hidden services or clients.
While uncertainty might be a good thing in a hidden network, learning more information about the usage of hidden services can be helpful.
For example, learning how many cells are sent for hidden service purposes tells us whether hidden service traffic is 2% of the Tor network traffic or 90% of the Tor network traffic. This info can also help us during load balancing, for example if we change the path building of hidden services to mitigate guard discovery attacks [GUARD-DISCOVERY].
Also, learning the number of hidden services can give us an understanding of how widespread hidden services are. It will also help us understand approximately how much load hidden service logistics, like introduction point circuits, put on the network.
1. Design
Tor relays shall add some fields related to hidden service statistics to their extra-info descriptors.
Tor relays collect these statistics by keeping track of their hidden service directory or rendezvous point activities, slightly obfuscating the numbers and posting them to the directory authorities. Extra-info descriptors are posted to directory authorities every 24 hours.
2. Implementation
2.1. Hidden service statistics interval
We want relays to report hidden-service statistics over a long-enough time period to not put users at risk. Similar to other statistics, we suggest a 24-hour statistics interval. All related statistics are collected at the end of that interval and included in the next extra-info descriptors published by the relay.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL [At most once.]
YYYY-MM-DD HH:MM:SS defines the end of the included measurement interval of length NSEC seconds (86400 seconds by default).
A "hidserv-stats-end" line, as well as any other "hidserv-*" line, is first added after the relay has been running for at least 24 hours.
2.2. Hidden service traffic statistics
We want to learn how much of the total Tor network traffic is caused by hidden service usage. More precisely, we measure hidden service traffic by counting RELAY cells seen on a rendezvous point after receiving a RENDEZVOUS1 cell. These RELAY cells include commands to open or close application streams, and they include application data.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-rend-relayed-cells" SP num SP key=val SP key=val ... NL [At most once.]
Where 'num' is the number of RELAY cells seen in either direction on a circuit after receiving and successfully processing a RENDEZVOUS1 cell.
The actual number is obfuscated as detailed in section "2.4. Statistics obfuscation". The parameters of the obfuscation are included in the key=val part of the line.
The obfuscatory parameters for this statistic are:

  * delta_f = 2048
  * epsilon = 0.3
  * bin_size = 1024
So, an example line could be:

  hidserv-rend-relayed-cells 19456 delta_f=2048 epsilon=0.30 binsize=1024
2.3. HSDir hidden service counting
We also want to learn how many hidden services exist in the network. The best place to learn this is at hidden service directories where hidden services publish their descriptors.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-dir-onions-seen" SP num SP key=val SP key=val ... NL [At most once.]
Approximate number of unique hidden-service identities seen in descriptors published to and accepted by this hidden-service directory.
The actual number is obfuscated as detailed in section "2.4. Statistics obfuscation". The parameters of the obfuscation are included in the key=val part of the line.
The obfuscatory parameters for these statistics are:

  * delta_f = 1
  * epsilon = 0.3
  * bin_size = 8
So, an example line could be:

  hidserv-dir-onions-seen 112 delta_f=1 epsilon=0.30 binsize=8
2.4. Statistics obfuscation
We believe that publishing the actual measurement values in such a system might have unpredictable effects, so we obfuscate these statistics before publishing:
                +--------------+    +--------------------+
actual value -> |additive noise| -> |round-up obfuscation| -> public statistic
                +--------------+    +--------------------+
We are using two obfuscation methods to better hide the actual numbers even if they remain the same over multiple measurement periods.
Specifically, given the actual measurement value, we first deploy additive noise in a fashion similar to basic differential privacy. Then, we round up this obfuscated result to the nearest multiple of an integer (which is a security parameter), to derive a final result which can be published safely.
More information about the obfuscation methods follows:
2.4.1. Additive noise
We apply additive noise to the actual measurement by adding to it a random value sampled from a Laplace distribution. Following the differential privacy methodology [DIFF-PRIVACY], our obfuscatory Laplace distribution has \mu = 0 and b = (delta_f / epsilon).
The precise values of delta_f and epsilon are different for each statistic and are defined in the respective statistics sections.
2.4.2. Round-up obfuscation
To further hide any patterns, before publishing statistics, we round up the result to the nearest multiple of 'bin_size'. 'bin_size' is an integer security parameter and can be found in the respective statistics sections.
This is similar to how Tor keeps bridge user statistics. As an example, if the measurement value is 9 and bin_size is 8, then the final value will be rounded up to 16. This also works for negative values, so for example, if the measurement value is -9 and bin_size is 8, the value will be rounded up to -8.
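A one-line version of this rounding rule, as a sketch; it reproduces both examples above:

    import math

    def round_up_to_bin(value, bin_size):
        # Round up to the nearest multiple of bin_size; this also works
        # for negative values.
        return bin_size * int(math.ceil(value / float(bin_size)))

    round_up_to_bin(9, 8)    # -> 16
    round_up_to_bin(-9, 8)   # -> -8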
3. Security
The main security consideration that needs discussion is what an adversary could do with reported statistics that they couldn't do without them. In the following, we're going through things the adversary could learn, how plausible that is, and how much we care. (All these things refer to hidden-service traffic, not to hidden-service counting. We should think about the latter, too.)
3.1. Identify rendezvous point of high-volume and long-lived connection
The adversary could identify the rendezvous point of a very large and very long-lived HS connection by observing a relay with an unexpectedly large relay cell count.
3.2. Identify number of users of a hidden service
The adversary may be able to identify the number of users of an HS if he knows the amount of traffic on a connection to that HS (which he can potentially determine himself) and knows when that service goes up or down. He can look at the change in the total reported RP traffic to determine approximately how many fewer HS users there are when that HS is down.
4. Discussion
4.1. Why count only RP cells? Why not also count IP cells?
There are three phases in the rendezvous protocol where traffic is generated: (1) when hidden services make themselves available in the network, (2) when clients open connections to hidden services, and (3) when clients exchange application data with hidden services. We expect (3), that is the RP cells, to consume most bytes here, so we're focusing on this only.
Furthermore, introduction points correspond to specific HSes, so publishing IP cell stats could reveal the popularity of specific HSes.
4.2. How to use these stats?
4.2.1. How to use RP Cell statistics
We plan to extrapolate reported values to network totals by dividing values by the probability of clients picking relays as rendezvous point. This approach should become more precise for faster relays and as more relays report these statistics.
We also plan to compare reported values with "cell-*" statistics to learn what fraction of traffic can be attributed to hidden services.
Ideally, we'd be able to compare values to "write-history" and "read-history" lines to compute similar fractions of traffic used for hidden services. The goal would be to avoid enabling "cell-*" statistics by default. In order for this to work we'll have to multiply reported cell numbers by the default cell size of 512 bytes (we cannot infer the actual number of bytes, because cells are end-to-end encrypted between client and service).
4.2.2. How to use HSDir HS statistics
We plan to extrapolate this value to network totals by calculating what fraction of hidden-service identities this relay was supposed to see. This extrapolation will be very rough, because each hidden-service directory is only responsible for a tiny share of hidden-service descriptors, and there is no way to increase that share significantly.
Here are some numbers: there are about 3000 directories, and each descriptor is stored on three directories. So, each directory is responsible for roughly 1/1000 of descriptor identifiers. There are two replicas for each descriptor (that is, each descriptor is stored under two descriptor identifiers), and descriptor identifiers change once per day (which means that, during a 24-hour period, there are two opportunities for each directory to see a descriptor). Hence, each descriptor is stored at four places in identifier space throughout a 24-hour period. The probability that any given directory sees a given hidden-service identity is 1-(1-1/1000)^4 = 0.00399, or roughly 1/250. This approximation constitutes an upper bound, because it assumes that services are running all day. An extrapolation based on this formula will lead to undercounting the total number of hidden services.
A possible inaccuracy in the estimation algorithm comes from the fact that a relay may not be acting as hidden-service directory during the full statistics interval. We'll have to look at consensuses to determine when the relay first received the "HSDir" flag, and only consider the part of the statistics interval following the valid-after time of that consensus.
5. References
[GUARD-DISCOVERY]: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
[DIFF-PRIVACY]: http://research.microsoft.com/en-us/projects/databaseprivacy/dwork.pdf
Hi George,
I recommend a change to the way that these statistics are obfuscated. The problem is that new noise is used every day, and from the distribution of the reported bins, the exact location within the bin (assuming the stat stays constant) can be inferred.
So instead of this:

                +--------------+    +--------------------+
actual value -> |additive noise| -> |round-up obfuscation| -> public statistic
                +--------------+    +--------------------+

I recommend that you flip the order, so that it is like this:

                +--------------------+    +--------------+
actual value -> |round-up obfuscation| -> |additive noise| -> public statistic
                +--------------------+    +--------------+
“Additive noise” in the context of bins is actually just a distribution over bins. You can think of it in two ways:

  1. Add Laplace noise to the bin center, and then report the bin of the resulting number.
  2. Choose a bin using a two-sided geometric distribution centered at the correct bin.

I believe that these are equivalent.
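Sketched, the first view looks something like this (using the rounded-up value as the bin's representative; whether the bin midpoint or its upper edge is used is a detail left open here, and names are illustrative):

    import math
    import random

    def laplace_noise(scale):
        # Inverse-transform sample from Laplace(0, scale).
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def report_noisy_bin(actual_value, bin_size, scale):
        # Bin the true value, perturb the bin's representative with
        # Laplace noise, then report the bin of the perturbed number.
        binned = bin_size * int(math.ceil(actual_value / float(bin_size)))
        noisy = binned + laplace_noise(scale)
        return bin_size * int(math.ceil(noisy / float(bin_size)))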
Best, Aaron
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
Hi George,
Hello!
I recommend a change to the way that these statistics are obfuscated. The problem is that new noise is used every day, and from the distribution of the reported bins, the exact location within the bin (assuming the stat stays constant) can be inferred.
Assuming that the underlying value is constant and since our Laplace distribution is public, the adversary can observe which bin is reported each time and get a probability distribution for the underlying value.
This indeed seems plausible under the powerful assumption that the underlying stat is constant.
So instead of this:

                +--------------+    +--------------------+
actual value -> |additive noise| -> |round-up obfuscation| -> public statistic
                +--------------+    +--------------------+

I recommend that you flip the order, so that it is like this:

                +--------------------+    +--------------+
actual value -> |round-up obfuscation| -> |additive noise| -> public statistic
                +--------------------+    +--------------+
“Additive noise” in the context of bins is actually just a distribution over bins. You can think of it in two ways:
- Add Laplace noise to the bin center, and then report the bin of the resulting number.
Hm, you mean something like this, right?
                +-----------+    +----------------+    +-----------+
actual value -> |  binning  | -> | additive noise | -> |  binning  | -> public statistic
                +-----------+    +----------------+    +-----------+
where the additive noise is applied to the center of the first bin?
I can see how this is better, since the underlying value gets immediately smoothed by binning. However, it does give me a weird hacky feeling...
Is this construction something that has been used before?
Thanks for the feedback!
This indeed seems plausible under the powerful assumption that the underlying stat is constant.
Actually it applies to any known relative pattern, for example, that the number increases by 1 each time.
where the additive noise is applied to the center of the first bin?
Yes, you can look at it like that.
I can see how this is better, since the underlying value gets immediately smoothed by binning. However, it does give me a weird hacky feeling...
Is this construction something that has been used before?
Well, the output here is a bin, not a number, and the “exponential mechanism” is the generalization of the Laplace mechanism to handle arbitrary output spaces (kunaltalwar.org/papers/expmech.pdf). In this case, I believe that adding Laplace noise to a bin center and then re-binning is a way to select according to the distribution that the exponential mechanism would prescribe.
Aaron
On 09/12/14 20:20, A. Johnson wrote:
This indeed seems plausible under the powerful assumption that the underlying stat is constant.
Actually it applies to any known relative pattern, for example, that the number increases by 1 each time.
Still a very powerful assumption given the stats we're gathering.
where the additive noise is applied to the center of the first bin?
Yes, you can look at it like that.
I can see how this is better, since the underlying value gets immediately smoothed by binning. However, it does give me a weird hacky feeling...
Is this construction something that has been used before?
Well, the output here is a bin, not a number, and the “exponential mechanism” is the generalization of the Laplace mechanism to handle arbitrary output spaces (kunaltalwar.org/papers/expmech.pdf). In this case, I believe that adding Laplace noise to a bin center and then re-binning is a way to select according to the distribution that the exponential mechanism would prescribe.
I see the value in switching algorithms and doing the binning step before adding noise.
But I don't see the value of binning the result once more. In a sense, we're already binning signal + noise by cutting off the float part. I don't see what we gain by reducing resolution even more. It seems just unnecessary.
Example:
- We observe 123 things.
- We round up to the next multiple of 8, so to 128.
- We add Laplace noise -17.3456, obtain 110.6544, round down to 110.
- Why would we round up to the next multiple of 8 again, to 112?
It should be equally hard to infer the 128 value from 110 with bin size 1 or 112 with bin size 8. Unless I'm wrong. Please prove me wrong? :)
For now, I'm going to switch algorithm order in the code and not add a second binning step.
All the best, Karsten
But I don't see the value of binning the result once more. In a sense, we're already binning signal + noise by cutting off the float part. I don't see what we gain by reducing resolution even more. It seems just unnecessary.
In principle releasing the number could result in different differential-privacy guarantees than releasing the bin. However, with the way I had in mind to set the Laplace parameters, this wouldn’t be an issue, because the Laplace distributions themselves would satisfy the desired differential privacy guarantee (and not just the resulting distribution on bins).
So I guess this could be viewed as a post-processing step that is useful for clarity rather than privacy: namely, that the output should be interpreted as a range. But we could leave this to the data consumer to apply without a privacy issue.
Also, I believe that the parameters we had discussed should change. To see why, observe that the Laplace distributions for two adjacent values that cross a bin barrier are now very far apart after being recentered within the appropriate bins. Thus, \delta_f should increase if it is smaller than the maximum number of bins that can be crossed within that \delta_f multiplied by the bin size. With our previous numbers, the new \delta_f for rendezvous cell counts doesn’t change (still 2048), but the new \delta_f for HS descriptor counts is 8.
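One way to read that argument as a calculation (just a sketch): the sensitivity of the binned value is the old \delta_f rounded up to a whole number of bins.

    import math

    def binned_sensitivity(delta_f, bin_size):
        # A change of delta_f in the raw count moves the rounded-up value
        # by at most ceil(delta_f / bin_size) bins of width bin_size.
        return bin_size * int(math.ceil(delta_f / float(bin_size)))

    binned_sensitivity(2048, 1024)  # -> 2048 (rendezvous cells: unchanged)
    binned_sensitivity(1, 8)        # -> 8    (HS descriptor counts: 1 -> 8)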
Aaron
On 10/12/14 15:37, A. Johnson wrote:
But I don't see the value of binning the result once more. In a sense, we're already binning signal + noise by cutting off the float part. I don't see what we gain by reducing resolution even more. It seems just unnecessary.
In principle releasing the number could result in different differential-privacy guarantees than releasing the bin. However, with the way I had in mind to set the Laplace parameters, this wouldn’t be an issue, because the Laplace distributions themselves would satisfy the desired differential privacy guarantee (and not just the resulting distribution on bins).
So I guess this could be viewed as a post-processing step that is useful for clarity rather than privacy: namely, that the output should be interpreted as a range. But we could leave this to the data consumer to apply without a privacy issue.
Can you be more explicit with regard to the privacy guarantees of the obfuscation scheme that is currently implemented: 1) binning, 2) add Laplace noise, 3) no second binning.
If you think 3) should be changed, can you explain why that leads to better privacy guarantees?
Also, I believe that the parameters we had discussed should change. To see why, observe that the Laplace distributions for two adjacent values that cross a bin barrier are now very far apart after being recentered within the appropriate bins. Thus, \delta_f should increase if it is smaller than the maximum number of bins that can be crossed within that \delta_f multiplied by the bin size. With our previous numbers, the new \delta_f for rendezvous cell counts doesn’t change (still 2048), but the new \delta_f for HS descriptor counts is 8.
The current parameters are:
  * delta_f = 2048
  * epsilon = 0.3
  * bin_size = 1024
and
  * delta_f = 1
  * epsilon = 0.3
  * bin_size = 8
I can see how the Laplace distribution doesn't add much noise to the second case. And your suggestion is to change the second delta_f to 8?
Thanks!
All the best, Karsten
Can you be more explicit with regard to the privacy guarantees of the obfuscation scheme that is currently implemented: 1) binning, 2) add Laplace noise, 3) no second binning.
I’ll discuss this in terms of attacks on the stats of the number of HS descriptors.
Binning: Suppose an adversary knows that the number of HS descriptors stays constant over a week. He knows when all descriptors are being published except for one. By binning he won’t know when that one is published unless the number of other descriptors exactly fills a bin.
Laplace noise: To provide cover in the case that all other descriptors exactly fill a bin, we add some noise so that sometimes an adjacent bin is reported instead, or (less likely) a bin two distant, etc. Then the adversary can’t immediately know whether an unknown descriptor is indeed published in any given period. However, he can eventually figure this out by making enough observations and looking at the resulting empirical distribution. But it’s better than not protecting it at all.
If you think 3) should be changed, can you explain why that leads to better privacy guarantees?
I don’t think that 3 should be changed, but if you removed it, it wouldn't affect the privacy argument.
I can see how the Laplace distribution doesn't add much noise to the second case. And your suggestion is to change the second delta_f to 8?
Yes.
Best, Aaron
On 11/12/14 14:31, A. Johnson wrote:
Can you be more explicit with regard to the privacy guarantees of the obfuscation scheme that is currently implemented: 1) binning, 2) add Laplace noise, 3) no second binning.
I’ll discuss this in terms of attacks on the stats of the number of HS descriptors.
Binning: Suppose an adversary knows that the number of HS descriptors stays constant over a week. He knows when all descriptors are being published except for one. By binning he won’t know when that one is published unless the number of other descriptors exactly fills a bin.
Laplace noise: To provide cover in the case that all other descriptors exactly fill a bin, we add some noise so that sometimes an adjacent bin is reported instead, or (less likely) a bin two distant, etc. Then the adversary can’t immediately know whether an unknown descriptor is indeed published in any given period. However, he can eventually figure this out by making enough observations and looking at the resulting empirical distribution. But it’s better than not protecting it at all.
Sounds good. George, maybe these explanations should go into the proposal, too.
If you think 3) should be changed, can you explain why that leads to better privacy guarantees?
I don’t think that 3 should be changed, but if you removed it, it wouldn't affect the privacy argument.
I can see how the Laplace distribution doesn't add much noise to the second case. And your suggestion is to change the second delta_f to 8?
Yes.
Great. Changed the second delta_f to 8 in the code, and I think George will change it in the proposal.
Thanks!
All the best, Karsten
George Kadianakis desnacked@riseup.net writes:
George Kadianakis desnacked@riseup.net writes:
Hello there,
I inline a copy of a proposal we've been working on lately. Discussion can be found in the "Feedback on obfuscating hidden-service statistics" thread.
The proposal suggests that Tor relays add some stats about hidden service usage. We believe that these stats are not dangerous and can be useful to Tor developers and to people who want to understand hidden services and the onionspace better.
Any feedback is appreciated :)
Hello,
I inline a better version of that proposal.
And now I inline an even better version of the proposal.
Changes include:
- Inverted the order of the obfuscation methods as discussed in the mailing list.
- Changed delta_f of the HSDir stats to be 8 instead of 1.
- Switched from 'round-up obfuscation' to 'data binning'.
- Added links to pfm's graphs.
You can also find this proposal (and a diff from its previous state) at: https://gitweb.torproject.org/user/asn/torspec.git/log/?h=hs_stats
Thanks!
------
Filename: 238-hs-relay-stats.txt
Title: Better hidden service stats from Tor relays
Author: George Kadianakis, David Goulet, Karsten Loesing, Aaron Johnson
Created: 2014-11-17
Status: Draft
0. Motivation
Hidden services are one of the least understood parts of the Tor network. We don't really know how many hidden services there are and how much they are used.
This proposal suggests that Tor relays include some hidden-service-related stats in their extra-info descriptors. No stats are collected from Tor hidden services or clients.
While uncertainty might be a good thing in a hidden network, learning more information about the usage of hidden services can be helpful.
For example, learning how many cells are sent for hidden service purposes tells us whether hidden service traffic is 2% of the Tor network traffic or 90% of the Tor network traffic. This info can also help us during load balancing, for example if we change the path building of hidden services to mitigate guard discovery attacks [GUARD-DISCOVERY].
Also, learning the number of hidden services can give us an understanding of how widespread hidden services are. It will also help us understand approximately how much load hidden service logistics, like introduction point circuits, put on the network.
1. Design
Tor relays shall add some fields related to hidden service statistics to their extra-info descriptors.
Tor relays collect these statistics by keeping track of their hidden service directory or rendezvous point activities, slightly obfuscating the numbers and posting them to the directory authorities. Extra-info descriptors are posted to directory authorities every 24 hours.
2. Implementation
2.1. Hidden service statistics interval
We want relays to report hidden-service statistics over a long-enough time period to not put users at risk. Similar to other statistics, we suggest a 24-hour statistics interval. All related statistics are collected at the end of that interval and included in the next extra-info descriptors published by the relay.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL [At most once.]
YYYY-MM-DD HH:MM:SS defines the end of the included measurement interval of length NSEC seconds (86400 seconds by default).
A "hidserv-stats-end" line, as well as any other "hidserv-*" line, is first added after the relay has been running for at least 24 hours.
2.2. Hidden service traffic statistics
We want to learn how much of the total Tor network traffic is caused by hidden service usage. More precisely, we measure hidden service traffic by counting RELAY cells seen on a rendezvous point after receiving a RENDEZVOUS1 cell. These RELAY cells include commands to open or close application streams, and they include application data.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-rend-relayed-cells" SP num SP key=val SP key=val ... NL [At most once.]
Where 'num' is the number of RELAY cells seen in either direction on a circuit after receiving and successfully processing a RENDEZVOUS1 cell.
The actual number is obfuscated as detailed in [STAT-OBFUSCATION]. The parameters of the obfuscation are included in the key=val part of the line.
The obfuscatory parameters for this statistic are:

  * delta_f = 2048
  * epsilon = 0.3
  * bin_size = 1024

(Also see [CELL-LAPLACE-GRAPH] for a graph of the Laplace distribution.)
So, an example line could be:

  hidserv-rend-relayed-cells 19456 delta_f=2048 epsilon=0.30 binsize=1024
2.3. HSDir hidden service counting
We also want to learn how many hidden services exist in the network. The best place to learn this is at hidden service directories where hidden services publish their descriptors.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-dir-onions-seen" SP num SP key=val SP key=val ... NL [At most once.]
Approximate number of unique hidden-service identities seen in descriptors published to and accepted by this hidden-service directory.
The actual number is obfuscated as detailed in [STAT-OBFUSCATION]. The parameters of the obfuscation are included in the key=val part of the line.
The obfuscatory parameters for this statistic are:

  * delta_f = 8
  * epsilon = 0.3
  * bin_size = 8

(Also see [ONIONS-LAPLACE-GRAPH] for a graph of the Laplace distribution.)
So, an example line could be:

  hidserv-dir-onions-seen 112 delta_f=8 epsilon=0.30 binsize=8
2.4. Statistics obfuscation [STAT-OBFUSCATION]
We believe that publishing the actual measurement values in such a system might have unpredictable effects, so we obfuscate these statistics before publishing:
                +-----------+    +--------------+
actual value -> |  binning  | -> |additive noise| -> public statistic
                +-----------+    +--------------+
We are using two obfuscation methods to better hide the actual numbers even if they remain the same over multiple measurement periods.
Specifically, given the actual measurement value, we first apply data binning to it (basically we round it up to the nearest multiple of an integer, see [DATA-BINNING]). And then we apply additive noise to the binned value in a fashion similar to differential privacy.
More information about the obfuscation methods follows:
2.4.1. Data binning
The first thing we do to the original measurement value is to round it up to the nearest multiple of 'bin_size'. 'bin_size' is an integer security parameter and can be found in the respective statistics sections.
This is similar to how Tor keeps bridge user statistics. As an example, if the measurement value is 9 and bin_size is 8, then the final value will be rounded up to 16. This also works for negative values, so for example, if the measurement value is -9 and bin_size is 8, the value will be rounded up to -8.
2.4.2. Additive noise
Then, before publishing the statistics, we apply additive noise to the binned value by adding to it a random value sampled from a Laplace distribution. Following the differential privacy methodology [DIFF-PRIVACY], our obfuscatory Laplace distribution has mu = 0 and b = (delta_f / epsilon).
The precise values of delta_f and epsilon are different for each statistic and are defined in the respective statistics sections.
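As a minimal end-to-end sketch of this section (function names are illustrative; little-t-tor implements this in C), with the fractional part dropped before publishing, as discussed earlier in the thread:

    import math
    import random

    def laplace_noise(mu, b):
        # Inverse-transform sample from Laplace(mu, b).
        u = random.random() - 0.5
        return mu - b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def obfuscate_stat(actual_value, delta_f, epsilon, bin_size):
        # 2.4.1: round up to the nearest multiple of bin_size.
        binned = bin_size * int(math.ceil(actual_value / float(bin_size)))
        # 2.4.2: add Laplace noise with mu = 0 and b = delta_f / epsilon,
        # then drop the fractional part before publishing.
        return int(math.floor(binned + laplace_noise(0.0, delta_f / epsilon)))

    # e.g. for "hidserv-dir-onions-seen": obfuscate_stat(n, 8, 0.3, 8)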
3. Security
The main security consideration that needs discussion is what an adversary could do with reported statistics that they couldn't do without them. In the following, we're going through things the adversary could learn, how plausible that is, and how much we care. (All these things refer to hidden-service traffic, not to hidden-service counting. We should think about the latter, too.)
3.1. Identify rendezvous point of high-volume and long-lived connection
The adversary could identify the rendezvous point of a very large and very long-lived HS connection by observing a relay with an unexpectedly large relay cell count.
3.2. Identify number of users of a hidden service
The adversary may be able to identify the number of users of an HS if he knows the amount of traffic on a connection to that HS (which he can potentially determine himself) and knows when that service goes up or down. He can look at the change in the total reported RP traffic to determine approximately how many fewer HS users there are when that HS is down.
4. Discussion
4.1. Why count only RP cells? Why not count IP cells too?
There are three phases in the rendezvous protocol where traffic is generated: (1) when hidden services make themselves available in the network, (2) when clients open connections to hidden services, and (3) when clients exchange application data with hidden services. We expect (3), that is the RP cells, to consume most bytes here, so we're focusing on this only.
Furthermore, introduction points correspond to specific HSes, so publishing IP cell stats could reveal the popularity of specific HSes.
4.2. How to use these stats?
4.2.1. How to use rendezvous cell statistics
We plan to extrapolate reported values to network totals by dividing values by the probability of clients picking relays as rendezvous point. This approach should become more precise for faster relays and as more relays report these statistics.
We also plan to compare reported values with "cell-*" statistics to learn what fraction of traffic can be attributed to hidden services.
Ideally, we'd be able to compare values to "write-history" and "read-history" lines to compute similar fractions of traffic used for hidden services. The goal would be to avoid enabling "cell-*" statistics by default. In order for this to work we'll have to multiply reported cell numbers by the default cell size of 512 bytes (we cannot infer the actual number of bytes, because cells are end-to-end encrypted between client and service).
4.2.2. How to use HSDir HS statistics
We plan to extrapolate this value to network totals by calculating what fraction of hidden-service identities this relay was supposed to see. This extrapolation will be very rough, because each hidden-service directory is only responsible for a tiny share of hidden-service descriptors, and there is no way to increase that share significantly.
Here are some numbers: there are about 3000 directories, and each descriptor is stored on three directories. So, each directory is responsible for roughly 1/1000 of descriptor identifiers. There are two replicas for each descriptor (that is, each descriptor is stored under two descriptor identifiers), and descriptor identifiers change once per day (which means that, during a 24-hour period, there are two opportunities for each directory to see a descriptor). Hence, each descriptor is stored at four places in identifier space throughout a 24-hour period. The probability that any given directory sees a given hidden-service identity is 1-(1-1/1000)^4 = 0.00399, or roughly 1/250. This approximation constitutes an upper bound, because it assumes that services are running all day. An extrapolation based on this formula will lead to undercounting the total number of hidden services.
A possible inaccuracy in the estimation algorithm comes from the fact that a relay may not be acting as hidden-service directory during the full statistics interval. We'll have to look at consensuses to determine when the relay first received the "HSDir" flag, and only consider the part of the statistics interval following the valid-after time of that consensus.
4.3. Why does the obfuscation work?
By applying data binning, we smudge the original value, making it harder for attackers to guess it. Specifically, an attacker who knows the bin can only guess the underlying value with probability 1/bin_size.
By applying additive noise, we make it harder for the adversary to find out the current bin, which makes it even harder to get the original value. If additive noise was not applied, an adversary could try to detect changes in the original value by checking when we switch bins.
5. Acknowledgements
Thanks go to 'pfm' for the helpful Laplace graphs.
6. References
[GUARD-DISCOVERY]: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
[DIFF-PRIVACY]: http://research.microsoft.com/en-us/projects/databaseprivacy/dwork.pdf
[DATA-BINNING]: https://en.wikipedia.org/wiki/Data_binning
[CELL-LAPLACE-GRAPH]: https://raw.githubusercontent.com/corcra/pioton/master/vis/laplacePDF_mu0_b6... https://raw.githubusercontent.com/corcra/pioton/master/vis/laplaceCDF_mu0_b6...
[ONIONS-LAPLACE-GRAPH]: https://raw.githubusercontent.com/corcra/pioton/master/vis/laplacePDF_mu0_b3... https://raw.githubusercontent.com/corcra/pioton/master/vis/laplaceCDF_mu0_b3...