Andrew Lewman:
I had a conversation with a vendor yesterday. They are interested in including Tor as their "private browsing mode" and basically shipping a re-branded Tor Browser that lets people toggle connectivity to the Tor network on and off.
They very much like Tor Browser and would like to ship it to their customer base. Their product accounts for 10-20% of the global market of roughly 2.8 billion Internet users.
As Tor Browser is open source, they are already working on it. However, their concern is scaling up to handle some percentage of global users with "tor mode" enabled. They're open to offering their resources to help us solve the scalability challenges of handling hundreds of millions of users and relays on Tor.
As this question keeps coming up from businesses that see privacy as the next "must have" feature in their products, I'm trying to compile a list of tasks we would need to solve in order to scale. The old 2008 three-year roadmap looked at performance: https://www.torproject.org/press/2008-12-19-roadmap-press-release.html.en
I've been through the specs, https://gitweb.torproject.org/torspec.git/tree/HEAD:/proposals to see if there are proposals for scaling the network or directory authorities. I didn't see anything directly related.
The most recent research papers I see directly addressing scalability are Torsk (http://www.freehaven.net/anonbib/bibtex.html#ccs09-torsk) and PIR-Tor (http://www.freehaven.net/anonbib/bibtex.html#usenix11-pirtor).
These research papers basically propose a total network overhaul to deal with the problem of Tor relay directory traffic overwhelming the Tor network and/or Tor clients.
However, I believe that with only minor modifications, the current Tor network architecture could support 100M daily directly connecting users, assuming we focus our efforts on higher capacity relays and not simply adding tons of slower relays.
The core problem is that the fraction of network capacity that you spend telling users about the current relays in the network can be written as:
f = D*U/B
D is current Tor relay directory size in bytes per day, U is number of users, and B is the bandwidth per day in bytes provided by this Tor network. Of course, this is a simplification, because of multiple directory fetches per day and partially-connecting/idle clients, but for purposes of discussion it is good enough.
To put some real numbers on this, if you compare https://metrics.torproject.org/bandwidth.html#dirbytes with https://metrics.torproject.org/bandwidth.html#bandwidth, you can see that we're currently devoting about 2% of our network throughput to directory activity (~120MiB/sec out of ~5000MiB/sec). So we're not exactly hurting at this point in terms of our directory bytes per user yet.
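To make that arithmetic concrete, here is a minimal Python sketch that plugs the approximate throughput figures above straight into the ratio (the two byte rates are just the rough values quoted from the metrics pages, nothing more):

# Back-of-envelope check of the directory overhead fraction,
# using the approximate rates quoted from the metrics pages above.
dir_rate = 120 * 2**20       # ~120 MiB/sec of directory traffic
total_rate = 5000 * 2**20    # ~5000 MiB/sec of total network throughput

f = dir_rate / total_rate
print("directory overhead: {:.1%}".format(f))   # ~2.4%, i.e. roughly the 2% above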
But because this fraction rises with both D and U, these research papers rightly point out that you can't keep adding relays *and* users and expect Tor to scale.
However, when you look at this f=D*U/B formula, what it also says is that if you can reduce the relay directory size by a factor c, and also grow the network capacity by this same factor c, then you can multiply the userbase by c, and have the same fraction of directory bytes.
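As a toy check of that claim (all values below are made-up normalized units, with c=8 chosen purely for illustration):

# f = D*U/B stays fixed if the directory size D shrinks by the same factor c
# that the userbase U grows by; the matching capacity growth by c is what
# carries the extra users' traffic on top of that.
def overhead(D, U, B):
    return D * U / B

c = 8
print(overhead(1.0, 1.0, 50.0))          # baseline: 0.02, i.e. 2%
print(overhead(1.0 / c, c * 1.0, 50.0))  # still 0.02 with c times the users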
This means that rather than undertaking a major network overhaul like Torsk or PIR-Tor to support hundreds of thousands of slow, junky relays, we can scale the network by focusing on improving the situation for high capacity relay operators, so that we provide more network bandwidth for the same number of directory bytes per user.
So, let's look at ways to reduce the size of the Tor relay directory, and each way we can find to do so means a corresponding increase in the number of users we can support:
1. Proper multicore support.
Right now, any relay with more than ~100Mbit of capacity really needs to run an additional tor relay instance on that link to make use of it. If they have AES-NI, this might go up to 300Mbit.
Each of these additional instances is basically wasted directory bytes for those relay descriptors.
But with proper multicore support, such high capacity relays could run only one relay instance on links as fast as 2.5Gbit (assuming an 8-core AES-NI machine).
Result: 2-8X reduction in consensus and directory size, depending on the number of high capacity relays on multicore systems we have.
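For a rough sense of the numbers behind this (the ~300Mbit-per-process figure is the AES-NI estimate above; the rest is illustrative):

# One single-threaded tor process tops out around ~300 Mbit/s with AES-NI, so
# an 8-core box currently needs up to 8 instances (and 8 directory entries) to
# fill its link; a multicore-aware tor would need just one entry.
per_process_mbit = 300
cores = 8

print(per_process_mbit * cores / 1000.0)   # ~2.4 Gbit/s from a single box
print(cores, "->", 1)                      # directory entries before -> after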
2. Cut off relays below the median capacity, and turn them into bridges.
Relays in the top 10% of the network are 164 times faster than relays in the 50-60% range, 1400 times faster than relays in the 70-80% range, and 35000 times faster than relays in the 90-100% range.
In fact, many relays are so slow that they provide fewer bytes to the network than it costs to tell all of our users about them. There should be a sweet spot where we can set this cutoff such that the overhead from directory activity balances the loss of capacity from these relays, as a function of userbase size.
Result: ~2X reduction in consensus and directory size.
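To illustrate the sweet-spot idea, here is a hedged sketch of the break-even test; the entry size, fetch frequency, and user count are assumptions picked for illustration, not measured values:

# A relay is only worth listing if the bytes it relays per day exceed the bytes
# the network spends telling every client about it.
def worth_listing(relay_bytes_per_day, users, entry_bytes=500, fetches_per_day=4):
    directory_cost = entry_bytes * fetches_per_day * users   # bytes/day to advertise it
    return relay_bytes_per_day > directory_cost

users = 2500000
print(worth_listing(20 * 1024 * 86400, users))      # ~20 KiB/s relay: False, costs more than it gives
print(worth_listing(10 * 1024**2 * 86400, users))   # ~10 MiB/s relay: True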
3. Switching to ECC keys only.
We're wasting a lot of directory traffic on incompressible RSA1024 keys, which are 4X larger than ECC keys and less secure. Right now, we're also listing both. When we finally remove RSA1024 entirely, the directory should get quite a bit smaller.
Result: ~2-4X reduction in consensus and directory size.
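The 4X figure falls straight out of the raw key sizes (ignoring encoding overhead):

# RSA1024 public keys are 128 bytes of essentially incompressible data;
# Curve25519/Ed25519 public keys are 32 bytes.
rsa1024_bytes = 1024 // 8
ed25519_bytes = 32
print(rsa1024_bytes / ed25519_bytes)   # 4.0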
4. Consensus diffs.
With proposal 140, we can save 60% of the directory activity if we send diffs of the consensus to regularly connecting clients. Calculating the benefit is complicated, since if clients leave the network for just 16 hours, there is very little benefit to this optimization. These numbers are highly dependent on churn, though, and it may be that by removing most of the slow junk relays there is actually less churn in the network, and thus smaller diffs: https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/140-consensus...
Let's just ballpark it at 50% for the typical case.
Result: 2X reduction in directory size.
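As a crude way to see how churn drives the benefit, here is a sketch where the consensus size and hourly churn rate are purely assumed values:

# A client that already holds an older consensus only needs the entries that
# changed since then; after enough churn the diff approaches the full document
# and the optimization stops helping.
consensus_bytes = 1500000     # assumed size of a full consensus
hourly_churn = 0.06           # assumed fraction of entries changing per hour

def diff_bytes(hours_since_last_fetch):
    return int(min(1.0, hourly_churn * hours_since_last_fetch) * consensus_bytes)

for hours in (1, 3, 8, 16):
    print(hours, diff_bytes(hours))   # at 16 hours the diff is nearly the full consensus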
5. Invest in the Tor network.
Based purely on extrapolating from the Noisebridge relays, we could add ~300 relays, and double the network capacity for $3M/yr, or about $1 per user per year (based on the user counts from: https://metrics.torproject.org/users.html).
Note that this value should be treated as a minimum estimate. We actually want to ensure diversity as we grow the network, which may make this number higher. I am working on better estimates using replies from: https://lists.torproject.org/pipermail/tor-relays/2014-September/005335.html
Automated donation/funding distribution mechanisms such as https://www.oniontip.com/ are especially interesting ways to do this (and can even automatically enforce our diversity goals) but more traditional partnerships are also possible.
Result: 100% capacity increase for each O($3M/yr), or ~$1 per new user per year.
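The per-relay and per-user costs implied by those numbers (all approximate):

# ~$3M/yr buys ~300 high-capacity relays; spread over the current userbase
# that is on the order of a dollar per user per year.
yearly_cost = 3000000
relays_added = 300
current_users = 2500000   # rough daily-user count from metrics.torproject.org

print(yearly_cost / relays_added)          # ~$10k per relay per year
print(yearly_cost / float(current_users))  # ~$1.2 per current user per year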
So if we chain #1-4 all together, using the low estimates (and conservatively discounting for overlap between them), we should be able to reduce directory size by at least 8X.
Going back to the f=D*U/B formula, this means that we should be able to add capacity to support 8X more users, while *still* maintaining our 2% directory overhead percentage. This would be 2.5M users * 8, or 20M directly connecting users.
If we were willing to tolerate 10% directory overhead, this would allow five times as many users. In other words, 100M daily connecting users.
We would still need to find some way to fund the growth of the network to support this 40X increase, but there are no actual *technical* reasons why it cannot be done.
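Putting the conservative estimates together, the arithmetic behind those figures is simply:

# Chained directory reduction from items 1-4, then the extra headroom from
# accepting 10% rather than 2% directory overhead.
current_users = 2500000
directory_reduction = 8              # conservative combined factor
overhead_headroom = 10 / 2.0         # 2% -> 10% budget

at_2_percent = current_users * directory_reduction       # ~20M users
at_10_percent = int(at_2_percent * overhead_headroom)    # ~100M users
print(at_2_percent, at_10_percent, at_10_percent // current_users)   # 20M, 100M, 40X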