Andrew Lewman transcribed 1.8K bytes:
| The last research paper I see directly addressing scalability is Torsk (http://www.freehaven.net/anonbib/bibtex.html#ccs09-torsk) or PIR-Tor (http://www.freehaven.net/anonbib/bibtex.html#usenix11-pirtor)
Using Private Information Retrieval (PIR) to retrieve a 50 KB/s relay would likely make the current network [0] completely unusable.
_______________________________________________________________________________
First, the PIR-Tor paper makes a few naïve assumptions which grossly affect the analysis of the overhead of directory fetches:
1. The authors assume that a client's three [sic] Guard Nodes will be the directory mirrors for that client, accepting PIR queries for descriptors.
2. There is zero handling of cases where multiple directory mirrors have differing data.
3. There is zero handling of colluding directory mirrors.
4. The *increase* in descriptor size is not well factored into the analysis.
In the PIR-Tor paper, §4.2:

| "Even if all three guard relays are compromised, they cannot actively
| manipulate elements in the PIR database since they are signed by the
| directory authorities[...]"

Single descriptors are *not* signed; the set of all descriptors is. Therefore, if a client wanted to actually check that all (or a majority) of the Directory Authorities had signed a descriptor, that client would need to either:
4.a. Download all of them and check, which negates the whole point of this PIR thing; or
4.b. Have every Directory Authority sign each and every descriptor individually. This would cause the current microdescriptors, which range from roughly 900 bytes up to around 2.5 KB for the larger ones, to each grow by roughly 4 KB (the combined size of the DirAuth signatures on the current consensus).
Result: Each microdescriptor becomes roughly 4X larger, and the Directory Authorities need to do several orders of magnitude more signatures. (A back-of-the-envelope sketch of the size increase follows.)
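To make that size increase concrete, here is a quick sketch in the same style as the calculation further down. (The count of nine DirAuths is my assumption, matching the current network; the per-signature size is simply the ~4 KB total from 4.b divided evenly, which is only an approximation.)

>>> n_dirauths = 9             # assumed: the current number of DirAuths
>>> sig = 4096 // n_dirauths   # ~455 bytes/signature, from the ~4 KB total in 4.b
>>> 900 + n_dirauths * sig     # a small (900 byte) microdescriptor, individually signed
4995
>>> 2500 + n_dirauths * sig    # a large (2.5 KB) microdescriptor, individually signed
6595

So the smallest microdescriptors would grow by about 5.5X, and the largest by about 2.6X.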
_______________________________________________________________________________
Second, because the PIR-Tor paper's design doesn't seem to be what we would actually want to implement, the following is an analysis of a scheme closer to what we would want.
Using some safe assumptions about what security guarantees we would likely want to require from any PIR scheme we used... That is, we would likely want a 1-roundtrip (RT), b-private, b-Byzantine, k-out-of-m databases [1] PIR scheme. (Meaning that there are m total directory mirrors, only k of which actually answer a given set of your queries, and you want both the privacy of your queries and the authenticity of the answers protected, even if b of the directory mirrors are malicious and colluding with one another.)
Then, with those assumptions in mind, from Theorem 7.7 in §7.1 of this survey of PIR schemes [2] (which is, admittedly, a bit dated), used in one of Dan Boneh's classes, we have a size complexity [3] of

    O((k/(3b)) * n^(1/[(k-1)/3]) * m * log_2(m))

where b < k/3 (meaning that the number of malicious, colluding directory mirrors which answered must be less than one third of the total number of mirrors which answered).
>>> import math
>>> m = 3000  # total number of directory mirrors
>>> k = 30    # number of mirrors which answer
>>> b = 4     # number of malicious, colluding mirrors (b < k/3)
>>> n = 160   # the number of bits in a relay fingerprint
>>> int(math.ceil((((float(k) / (3 * float(b)))
...                 * (n ** (1.0 / ((float(k) - 1.0) / float(3)))))
...                * float(m)) * math.log(float(m), 2.0)))
146449
The above space complexity comes out to O(146449) bits per lookup. (Where by "lookup" I mean "the set of all queries pertaining to fetching a single relay descriptor", since, in PIR, a "query" is usually for a single bit.) This means that adding any sane PIR scheme for directory requests would result in something like a 900X increase in the bytes used for each request: 146449 bits of queries to retrieve a 160-bit record is a factor of roughly 915.
While one of the benefits of PIR would be that clients no longer need to request the descriptors for all relays listed in the consensus, this benefit doesn't actually help as much as it would seem, because switching to a PIR scheme would require a client to make three of these roughly 146449-bit lookups (one per relay in the path), every ten minutes or so, whenever the client wants a new circuit. (A sketch of the resulting overhead follows.)
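For a rough sense of the ongoing cost (a sketch; the one-new-circuit-every-ten-minutes figure and the three-lookups-per-circuit figure are from above, and I'm ignoring preemptive circuit building and all other directory traffic):

>>> bits_per_lookup = 146449
>>> bits_per_circuit = 3 * bits_per_lookup  # one lookup per relay in the path
>>> bits_per_circuit // 8                   # bytes of PIR queries per new circuit
54918
>>> (bits_per_circuit // 8) * 6 * 24        # bytes per client per day, at 6 circuits/hour
7908192

That is, roughly 54 KB of PIR traffic per circuit, or ~7.5 MB per client per day, just for descriptor lookups.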
Don't get me wrong; PIR is awesome. But the current Tor network is likely *not* the correct real-world application of any of the current PIR schemes. At least, it isn't until we have some HUGE number of nodes in the network, which by #1 and #2 in my original reply to this thread, [4] we shouldn't.
[0]: Note, I said "the current network". An imaginary future Tor network which has 10,000,000 relays would be different. And then note the very last paragraph where I state that we probably shouldn't ever have 10,000,000 relays.
[1]: I assume that we need replication in Tor's use case. There are papers, such as:

Kushilevitz, Eyal, and Rafail Ostrovsky. "Replication is not needed: Single database, computationally-private information retrieval." Proceedings of the 38th Annual Symposium on Foundations of Computer Science (FOCS), 1997. http://www.cs.fiu.edu/~carbunar/teaching/cis5374/slides/pir-single.pdf

whose research doesn't apply here, because it targets computationally-hiding PIR schemes, and Tor obviously aims for information-theoretic security. Other than the PIR-Tor paper, I haven't found any literature which analyses PIR for anything close to Tor's use case. (Although I'd be stoked to hear about any such papers.)
[2]: Gasarch, William. "A survey on private information retrieval." Bulletin of the EATCS 82 (2004). https://crypto.stanford.edu/~dabo/courses/cs355_fall07/pir.pdf
[3]: I'm completely ignoring the fact that I'm supposed to do polynomial interpolation for this complexity, because I just want an estimate and don't feel like messing around with Lagrange polynomials or doing Gaussian elimination on an ill-conditioned Vandermonde system of equations or anything. Consequently, I removed the `t` variable from the original complexity equation, effectively setting t=1.
[4]: https://lists.torproject.org/pipermail/tor-dev/2014-September/007558.html