Hello,
Tor hidden services are meant to primarily provide server anonymity but they also provide various other properties. For example, their addresses are self-authenticated and their connections punch NAT. This post is about another property, which is that Tor does not reveal the popularity of a hidden service by default. That is, you can't easily get the user count of a specific hidden service.
This is not that surprising to hidden service operators, since that's also how the normal Internet works. In the normal Internet, someone cannot learn the user count of an IP or website, except if they are the operator or they control DNS or the site publishes analytics.
Over the past years, people have suggested various features that would provide us with interesting information and optimizations but would have the side effect of revealing the user count of hidden services (or only of popular hidden services) to the public.
Some examples of such popularity-leaking features:
- Hidden services dynamically set their number of Introduction Points depending on their popularity. Basically, they self-evaluate their popularity and use a formula to pick a number of Introduction Points between 3 and 10. For a while, we thought that this formula did not work properly (#4862, #8950), but we recently discovered that it seems to be working in some manner.
While an interesting and useful feature on its own, it has the side effect that it leaks how popular your hidden service is. Since a hidden service publishes a descriptor every hour, you can monitor hourly usage patterns of hidden services. Of course, you can't get the exact user count, but you might be able to get rough approximate numbers (we still haven't analyzed the formula enough to know *exactly* how much).
- During our recent work on hidden service statistics [0], people have suggested gathering statistics that would get us closer to learning the number of hidden service users [1]. The suggested way to do so is to have HSDirs or introduction points count the total number of introductions or descriptor fetches and publish that number in their extra-info descriptor. Since, given a hidden service address, you can easily learn its HSDirs or IPs, it should be possible to map those statistics to specific hidden services, which would leak their popularity (more on this later).
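To make the "you can easily learn its HSDirs" step concrete, here is a minimal sketch, assuming the current (v2) descriptor ID scheme as I understand it from rend-spec: descriptor-id = H(permanent-id | H(time-period | descriptor-cookie | replica)) with H = SHA-1, two replicas, and three responsible HSDirs following each descriptor ID on the fingerprint ring. The function names, the example onion label and the toy fingerprint ring are made up for illustration; this is not Tor's actual code.

```python
import base64
import bisect
import hashlib
import struct

def descriptor_ids(onion_label, now, cookie=b""):
    """Return the two v2 descriptor IDs (replica 0 and 1) for a 16-char onion label."""
    # 16 base32 characters decode to the 10-byte permanent-id.
    permanent_id = base64.b32decode(onion_label.upper())
    # The period rolls over at a per-service offset within the day.
    time_period = (now + permanent_id[0] * 86400 // 256) // 86400
    ids = []
    for replica in (0, 1):
        secret_part = hashlib.sha1(
            struct.pack(">I", time_period) + cookie + bytes([replica])
        ).digest()
        ids.append(hashlib.sha1(permanent_id + secret_part).digest())
    return ids

def responsible_hsdirs(desc_id, fingerprints, count=3):
    """Pick the `count` fingerprints at or after desc_id on the circular list."""
    ring = sorted(fingerprints)
    start = bisect.bisect_left(ring, desc_id) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(count)]

# Hypothetical label and timestamp, for illustration only.
ids = descriptor_ids("abcdefghijklmnop", 1428073053)
```

Nothing in this computation is secret (unless the service uses a descriptor cookie), so anyone who knows the address and has a consensus can redo it every day and look up the extra-info descriptors of the resulting 6 relays.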
There might be more examples that I'm missing, but this should be enough to demonstrate the leaks. For the rest of this post, I will be presenting various arguments for and against leaking popularity.
Disclaimer: I am still not 100% decided here, but I lean heavily towards the "popularity is private information and we should not reveal it if we can help it" camp, or maybe the "there needs to be very concrete positive outcomes before even considering leaking popularity" camp. Hence, my arguments will obviously be biased towards the negatives of leaking popularity. I invite someone from the opposite camp to articulate better arguments for why popularity-hiding is something worth sacrificing.
== Arguments for leaking popularity and reaping its benefits ==
Here are a few arguments that people use to shrug off popularity-hiding. I can relate to some of them, but I find the reasoning of some others funny or even dangerous.
- "If we don't care about leaking popularity we can get useful statistics"
Indeed, we've dismissed various statistics that we could collect because we were afraid that they would leak the popularity of hidden services. If we didn't have this fear, we would have a better idea on how much usage hidden services see, or whether people are conducting DoS attacks on hidden services.
- "If we don't care about leaking popularity we can get nice optimizations"
As an example, the dynamic IP calculation is one of those optimizations. I'm not aware of other optimizations, but I bet that we can think of a few more if we completely remove popularity-hiding from our threat model. Also, people have claimed that more statistics would reveal more optimizations that we could do.
- "Popularity-hiding is just a side-effect of the Tor protocol, and not a stated security goal"
People have claimed that popularity-hiding is not a stated security goal of hidden services, or that the name "hidden services" does not imply popularity-hiding in any way.
- "There are no realistic attacks that could happen from leaking popularity"
People have claimed that popularity is just a curiosity, and nothing bad can come from leaking it. They say that protecting popularity does not offer security against realistic or dangerous attacks.
Other people claim that popularity-revealing attack vectors are too noisy and contain too much random data, hence it's hard to get targeted popularity values out of them. They say that it might only be possible for very popular hidden services, or for unlikely edge cases.
- "There are probably other ways to reveal popularity. You can't fix them all"
That's actually a big fear of mine. That we are nitpicking about 2-3 popularity revealing vectors, while there are hundreds more currently open. See #8742 for example, but I bet there are more vectors that we need to think about.
== Arguments for protecting popularity ==
And here are arguments that make me believe that popularity is something that should be protected.
- Popularity attracts attention
Anonymity likes uniformity, but popularity attracts attention. There are literally infinite possible use cases where a hidden service wants to be public and still not attract attention.
However, since the above argument has not been particularly successful and only attack demonstrations will persuade a true skeptic's mind, here is an attack scenario:
Try hard to imagine a dystopian future where authorities are tracking down and hacking activist websites. They just received a big list of hidden services, the result of a messy interrogation, but they are all locked. Their hackers can hack some of them but not all. Not much time before revolution, end of dystopian future and happiness for all humanity. The dictator needs to decide which hidden services to hack to stop the revolution. Which??
With popularity being public, they can look up which of those hidden services are the most popular and target those first.
- Popularity can be used to find patterns in group movements now and in the past.
Even though you can't track specific users using popularity, you can still track groups of users. Also, these statistics are forever: even if you didn't care about a group of users in the past, but you start caring about them now, you can still look back and see their development over time.
Here is an attack scenario:
Imagine a community that practices very dangerous urban climbing [2]. Imagine thousands of friends climbing away in happiness from all over the world.
Imagine now that this community splinters into other smaller communities. If you monitor their popularity, it will be possible for you to observe the movement of that subculture.
As a further point, imagine now that the dystopian future comes and very dangerous urban climbing gets outlawed. The police catch an urban climber in New London and get a list of hidden services from her. They can then check _historically_ how many users those hidden services had. They can basically notice all the trends of the urban climbing scene in the past years. Creepy, no?
As you probably well know, anonymity is not a binary option. It's not like you are either super anonymous or not anonymous at all. It's more of a fuzzy variable that depends on many things. OPSEC is a big part of anonymity, and it seems to me that popularity has OPSEC consequences.
- Statistics noise will get reduced. Attacks only get better.
In the statistics we were talking about, each HSDir would reveal the number of descriptor fetches it received over the past day. We know that each HSDir serves about 150 hidden services, which means that the final value in the end will contain the popularity of 150 hidden services in one number. This is expected to be extremely noisy, and I think that's one of the main hopes of people who don't care about popularity hiding. That allows them to claim that popularity will only be leaked for very popular hidden services.
While this indeed seems reasonable, my main intuition is that attacks can only get better. Here are some ways that noise can be reduced. I will focus on the HSDir case, but the same arguments apply to other suggested statistics like the number of introductions per IP.
-- It's still early in the hidden services scene, so not many services get lots of traffic. I imagine that many of those 150 hidden services are going to be very inactive, and not provide much noise.
-- Hidden services publish hidden service descriptors to 6 HSDirs. This means that every day you will learn 6 noisy values for your target hidden service, not just 1. It's easier to remove noise that way.
-- Also, the other hidden services that provide the noise (roughly 150 per HSDir) each publish their own descriptors to 6 HSDirs. Applying the same logic as above, you might be able to learn information about the noise, which makes it easier to remove. In a way, you can put all the statistics measurements in a big system of equations, and start solving it to reduce noise in the equation you are interested in.
-- Think of crazy edge cases. Maybe an introduction point is very weak and unlikely to be picked and only got 10 HSes for a day. If one of them is the hidden service you are interested in, there is going to be much less noise than usual.
-- There might be other techniques for reducing noise, by combining other statistics (like the number of hidden services per HSDir which is already a stat), or by influencing the statistics yourself (like Aaron's attack on the stats aggregation protocol [3]).
What I'm trying to say here is that if you thought that the urban climbing example was ridiculous because such a community cannot be big enough to be visible in noisy statistics, maybe by reducing noise you can actually make it distinguishable.
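As a toy illustration of the "big system of equations" idea, here is a minimal sketch. It assumes a made-up world of 4 hidden services and 4 HSDirs, that the attacker knows which services sit on which HSDir (for example by computing it from a list of known addresses), and that each HSDir's published number is simply the sum of the fetch counts of the services it hosts. All names and numbers are invented.

```python
from fractions import Fraction

def solve(matrix, totals):
    """Solve matrix * x = totals by Gauss-Jordan elimination over exact fractions."""
    n = len(totals)
    rows = [[Fraction(v) for v in row] + [Fraction(t)]
            for row, t in zip(matrix, totals)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

# Rows are HSDirs, columns are services (column 0 is the target).
# A 1 means "this HSDir hosts a descriptor of this service".
hosting = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]
# What each HSDir would publish if the true counts were 40, 500, 7 and 120.
published = [547, 660, 167, 627]

per_service = solve(hosting, published)
```

In this toy case `per_service` recovers every individual count exactly, even though no single published number reveals the target's 40 fetches next to a neighbour with 500. The real network is much larger and the reports would be noisy, so an attacker would be fitting rather than solving exactly, but the structure of the problem is the same.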
- The benefits of ditching popularity-hiding are not that amazing.
To be honest, I have not heard convincing enough arguments that would make me ditch popularity hiding. Some extra statistics or some small optimizations do not seem exciting enough to me. Please try harder. This could be a nice thread to demonstrate all the positive things that could happen if we ditch popularity-hiding.
Also, there is a small difference here between the stats and the introduction point formula. The dynamic introduction point formula is something that we could disable by default, but also leave it as a configurable option for people who want to use it. That is, it will then be *the choice of the hidden service operator* whether he cares about popularity being hidden or not. With the statistics that have been proposed, you don't give any choice. You just do it for all hidden services forever.
- Principle of least surprise
Hidden service operators expect that hidden services are at least as secure as the normal Internet, plus more. On the normal Internet, popularity is private by default. Violating this assumption on hidden services might not be polite.
- Popularity-hiding is crucial to maintain the deep sea security model of hidden services
As I have mentioned in the past, some people think of the onion land as a very deep ocean. In some places of the ocean, you might be able to see some buoys (some more visible than others). To visit them, you need to wear your goggles and your snorkel, dive in and enter from underwater.
This might not seem like a very concrete security model, but in any case popularity is not revealed at any point. The sea is opaque and you can't see the divers entering the hidden services.
Anyway this post has grown to immense size, and I was really hoping it would be shorter.
On a more practical note, over the next few weeks, we should decide what we want to do with the dynamic introduction point formula and whether we should keep it or not (#4862). My current intuition is that it should be disabled but also kept there as an option for people who want to enable it. In any case, I hope that this thread can stimulate discussion.
Also, if you are a hidden service operator I'm curious to hear about whether you believe that popularity hiding is a security property that should be preserved if that's even possible.
Cheers!
[0]: https://blog.torproject.org/blog/some-statistics-about-onions
[1]: https://lists.torproject.org/pipermail/tor-dev/2015-February/008247.html
[2]: https://www.youtube.com/watch?v=kpS7vhvkIQM
[3]: https://lists.torproject.org/pipermail/tor-dev/2015-March/008404.html
Hi George,
I read your post. I am a hidden service operator... and I feel strongly that hiding popularity is in fact an anonymity property... and security property. I also wonder if it's even possible to fully hide popularity. I suggest we try to hide popularity at least until we understand more about it and how to properly design hidden services. (don't disallow what you cannot prevent philosophy)
You mention the "normal Internet" several times in your post... positing that it's not possible to discover popularity. I'd like to point out that your upstream ISP would have a solid notion of whether or not you were operating a popular website... and by the way so would xkeyscore most likely (you know... the single most important revelation from the Snowden documents).
cheers,
David
On Fri, Apr 3, 2015 at 2:57 PM, George Kadianakis desnacked@riseup.net wrote:
On Fri, Apr 03, 2015 at 03:57:33PM +0100, George Kadianakis wrote:
> I lean heavily towards the "popularity is private information and we should not reveal it if we can help it" camp
Hi George,
Thanks for your thoughts. I'm currently in this camp too.
> Also, these statistics are forever: even if you didn't care about a group of users in the past, but you start caring about them now, you can still look back and see their development over time.
To me this is one of the strongest arguments against.
> -- Hidden services publish hidden service descriptors to 6 HSDirs. This means that every day you will learn 6 noisy values for your target hidden service, not just 1. It's easier to remove noise that way.
I think tracking popularity by looking at reporting by HSDirs would be quite easy. The main reason is that each day every hidden service picks its own new set of 6 HSDirs. So even if there is noise confusing you today, tomorrow will be a new (independent) set of noise, etc. Doing an intersection attack on these values for your target hidden service should work nicely over time.
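A minimal simulation sketch of that intuition, under assumptions that are entirely made up: each of the 6 responsible HSDirs reports the target's share of fetches plus Gaussian noise from its other services, the noise is independent from day to day, and the attacker also reads an equal number of reports from HSDirs that do not hold the target to estimate the noise baseline.

```python
import random
import statistics

def estimate_target(days, target_fetches, rng, noise_mean=3000.0, noise_sd=200.0):
    """Estimate the target's per-HSDir fetch count from `days` of noisy reports."""
    with_target = []
    without_target = []
    for _ in range(days):
        for _ in range(6):
            # Report from an HSDir holding the target's descriptor today.
            with_target.append(target_fetches + rng.gauss(noise_mean, noise_sd))
            # Report from some other HSDir, used only to learn the baseline.
            without_target.append(rng.gauss(noise_mean, noise_sd))
    return statistics.mean(with_target) - statistics.mean(without_target)

rng = random.Random(1)
one_day = estimate_target(1, 50.0, rng)
one_year = estimate_target(365, 50.0, rng)
```

With these numbers a single day's estimate is dominated by noise (the target adds 50 to reports whose noise has a standard deviation of 200), but the error of the mean shrinks with the square root of the number of reports, so a year of daily re-draws should pin the target down to within a few fetches.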
> To be honest, I have not heard convincing enough arguments that would make me ditch popularity hiding. Some extra statistics or some small optimizations do not seem exciting enough to me. Please try harder. This could be a nice thread to demonstrate all the positive things that could happen if we ditch popularity-hiding.
It would be great if everybody here could do some brainstorming on this one. It would be a shame if we close a design door just because we weren't open-minded enough to think of benefits (as opposed to closing the design door because we weighed both sides and made an informed decision).
> The dynamic introduction point formula is something that we could disable by default, but also leave it as a configurable option for people who want to use it. That is, it will then be *the choice of the hidden service operator* whether he cares about popularity being hidden or not.
Makes sense to me.
> On the normal Internet, popularity is private by default.
I wish this were more true than it is. There are all sorts of mechanisms on the 'normal' Internet that track popularity at the large scale -- verisign and other people at the top of the dns root track requests and publish summaries; ISPs track clicklogs and publish summaries; and third-party vendors sucker millions of users into installing their surveillance toolbars so they can publish summaries.
So I would understand if you said "yeah, but those aren't built-in", but I think that line gets pretty blurry these days.
--Roger
Hi George,
Thanks for taking up the challenge I raised to you of coming up with use cases where leaking popularity is a threat.
Perhaps others have suggested that we don't worry about popularity at all, but for the arguments I had been trying to make these are straw men. I don't suggest that we completely ignore popularity. As one simple example, if you monitored and published the popularity of onion services at the level of seconds or minutes (maybe even coarser), adversaries could almost certainly construct practical intersection attacks on users of some onion services whose client-side traffic was being monitored.
You noted anonymity is not binary, but you have only addressed popularity at a binary level: protect it or ignore it. We have an unfortunate tendency to sometimes do this in the Tor design community. For example, any design choice that partitions (or more generally statistically separates in any way) clients by the portions of the network about which they've been given information is not even worthy of consideration because partitioning is just bad. On the other hand, some pseudonymous profiling by exits is simply acceptable because of practicality considerations (and indeed, time to keep opening new connections on existing circuits has recently been significantly increased in Tor Browser Bundle for usability reasons---with a bit of discussion, but no significant analysis and no Tor Proposal). These are just single examples on each side for contrast, but others are easy to produce. I don't want to get into addressing the problem of this tendency in general here, I just want to make sure that we avoid specifically doing that for this problem.
I think I mentioned to you previously the sorts of popularity statistics I would like to gather. But perhaps I was unclear. I'll set it out here publicly for others to comment on. Details might change, and of course we'd have to worry about particular protocols. That's no different than anything else in Torland. But I want to assume that something like the following is basically feasible. As an argument from authority, I talked to Aaron a bit about how you might do this and we were both convinced it should be feasible to do this securely.
So, assume we have an onion service statistics gathering protocol that outputs say weekly the number of connections and bandwidth carried by the highest 5 percent, then 10 percent, then 20 percent, then 40 percent, then bottom 60 percent of onion services. I take it as given that these would be useful for many reasons, some of which you cited. We can revisit usefulness as needed.
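A minimal sketch of what such an output could look like for connection counts, assuming the bins are read cumulatively (top 5 percent, top 10 percent, and so on, with the bottom 60 percent as the complement of the top 40). It only shows the final aggregate, not the secure protocol that would have to compute it without anyone seeing per-service counts; the function name and the sample data are hypothetical.

```python
def binned_totals(counts):
    """Sum connection counts over the most popular fractions of services."""
    ranked = sorted(counts, reverse=True)
    n = len(ranked)
    bins = {}
    for pct in (5, 10, 20, 40):
        k = n * pct // 100  # number of services in the top pct percent
        bins["top %d%%" % pct] = sum(ranked[:k])
    # Everything not in the top 40 percent.
    bins["bottom 60%"] = sum(ranked[n * 40 // 100:])
    return bins

# Hypothetical week: 100 onion services with 1, 2, ..., 100 connections.
weekly = binned_totals(range(1, 101))
```

Here `weekly` is a handful of totals, for example 490 connections for the top 5 percent and 1830 for the bottom 60 percent, and no entry is tied to an individual onion address.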
The question I would like to have answered is what sort of practical threat could be posed by leaks from this. One could imagine an active attacker that hugely fluctuates the volume of a given onion service to determine which bin it had been in (assuming, very generously, that this isn't covered by the noise of other onion services), or a very long attack on a service whose volume does not otherwise change.
These statistics are not a threat in the parkour example. They do not reveal historical volumes of individual onion sites.
In the dystopian future scenario, the authorities know which hidden services are run by the rebels but not which ones are popular, and they want to take down the popular ones quickly since the revolution is imminent. If they happen to guess the right few they could inflate the activity (if they can access the onion site) and learn in a week that they were popular (assuming that they are lucky enough to be sure that, e.g., noise doesn't obscure that). This is a pretty iffy and low bang for the buck attack. As a contrasting example, authorities could easily locate the guard(s) of targeted onion sites (we're assuming they can access targeted onion sites) via congestion attacks and then just monitor the network activity from the guard or its ISP to see the popularity of targeted onion sites in realtime. Not to mention deanonymizing anyone they are watching on the client side. This could be done faster, easier, and more productively than using the statistics.
Tor is full of security vs. performance and/or efficiency and/or usability trade-offs. If we're going to rule out any onion service popularity statistics, I'd like some indication of a realistic potential threat. So far I don't feel I've heard that.
aloha, Paul