Hi Karsten, Thanks for your feedback. I will try to address your comments inline below. What follows is terse, and will require referring back to the tech report.
On 14-06-16 04:40 PM, Karsten Loesing wrote:
Just one question from taking a quick look over the report: how resilient are the two designs to failing tally key servers? It seems that the plan is to have around 10 of those, which is about the number of directory authorities. And even those are sometimes having difficulty producing a consensus every hour. We even have a dedicated service that watches out for problems with the consensus process. So, what if a subset of the tally key servers break temporarily or even permanently? I guess what I'm asking is how much coordination effort does it take to run your system? Would we need a new tally-key-server-health service?
The tally servers are indeed a point of failure, but not many of them have to be online at once, and when they are online they only need to be around for the duration of the epoch. Let me elaborate.
In both schemes new keys are generated (in S2 by the exits, in D2 by the TKSs) and sent to their respective recipients. That initializes the epoch and lets us know who is around. For instance, in S2 each exit enumerates all the known TKSs and sends keys to each one whose PrivEx listener it is able to connect to. This way only online TKSs take part in that epoch. In D2, each TKS has to generate an ephemeral key and send it to the PBB. If a TKS is down for an epoch, there will be no key for it in the PBB, and hence an exit will not include it in its key creation process.
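To make the epoch setup concrete, here is a rough sketch of how the set of participating TKSs falls out in each scheme. This is not the PrivEx code: the class, the function names, and the placeholder key strings are all hypothetical, and the `online` flag simply stands in for "the exit could connect to the PrivEx listener".

```python
from dataclasses import dataclass, field


@dataclass
class TKS:
    """Hypothetical stand-in for a tally key server."""
    name: str
    online: bool = True
    received_keys: dict = field(default_factory=dict)


def s2_epoch_participants(exit_name, tks_pool, epoch):
    """S2 sketch: the exit walks all known TKSs and hands a key to each
    one it can reach. Unreachable TKSs are simply left out of the epoch."""
    participants = []
    for tks in tks_pool:
        if not tks.online:  # models a failed connection to the listener
            continue
        # Placeholder string, not real key material.
        tks.received_keys[(exit_name, epoch)] = f"key-{exit_name}-{tks.name}-{epoch}"
        participants.append(tks.name)
    return participants


def d2_epoch_participants(pbb, tks_names, epoch):
    """D2 sketch: the exit only uses TKSs that posted an ephemeral key
    to the PBB for this epoch. `pbb` maps (tks_name, epoch) -> key."""
    return [name for name in tks_names if (name, epoch) in pbb]


pool = [TKS("tks-a"), TKS("tks-b", online=False), TKS("tks-c")]
s2 = s2_epoch_participants("exit-1", pool, epoch=7)

pbb = {("tks-a", 7): "eph-a", ("tks-c", 7): "eph-c"}  # tks-b never posted
d2 = d2_epoch_participants(pbb, ["tks-a", "tks-b", "tks-c"], epoch=7)
```

In both cases the offline server `tks-b` drops out of epoch 7 without any extra coordination step.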
Granted, a TKS may go down during an epoch, and then that epoch's data will be lost. We can mitigate this by reducing the size of the TKS pool and by bestowing TKS-hood only on servers with generally high uptime. It is not going to be foolproof, but at the very least it will fail secure.
The key takeaway is that only single epochs will be affected by this, and the general utility of the system can be maintained over the long term.
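The per-epoch containment can be sketched as follows. Again this is only an illustration under my own simplifying assumption that an epoch is usable exactly when every TKS that joined it is still around at the end; the function and variable names are hypothetical.

```python
def usable_epochs(participants_by_epoch, alive_at_end_by_epoch):
    """Return the epochs whose data survives.

    An epoch is kept only if every TKS that took part in it was still up
    when the epoch closed. Otherwise that one epoch is discarded, and no
    other epoch is touched.
    """
    usable = []
    for epoch in sorted(participants_by_epoch):
        participants = set(participants_by_epoch[epoch])
        alive = set(alive_at_end_by_epoch.get(epoch, ()))
        if participants <= alive:
            usable.append(epoch)
    return usable


participants_by_epoch = {
    1: ["tks-a", "tks-b", "tks-c"],
    2: ["tks-a", "tks-b", "tks-c"],  # tks-b dies partway through epoch 2
    3: ["tks-a", "tks-c"],           # tks-b never joined epoch 3
}
alive_at_end_by_epoch = {
    1: ["tks-a", "tks-b", "tks-c"],
    2: ["tks-a", "tks-c"],
    3: ["tks-a", "tks-c"],
}
kept = usable_epochs(participants_by_epoch, alive_at_end_by_epoch)
```

Here epoch 2 is lost because `tks-b` went down mid-epoch, while epochs 1 and 3 are unaffected.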
What's the timeline here? You say that the code will be released soon, that you hope to deploy exits during the June-August timeframe, and that you're hoping to get some review on design and implementation. In what order will these things happen? Stated differently: will people have sufficient time to look out for implementation flaws before you deploy your exits?
What we would like is your comments on the design and (as soon as I can get the code cleaned up) on the code as well, if it is not too much trouble. It is my intention to keep in close contact with the Tor community about deployment efforts so as to ensure transparency. I will be in touch about more developments as they occur.
Cheers, Tariq