It's occurred to me that we have yet to provide any official recommendations with respect to best practices for operating Tor relays.
This post is my attempt to define a reasonable threat model and use it to develop some recommendations. My plan is to post it here first, to subject it to review by the relay operator community. After operators have a chance to comment, the plan is to relocate the document to the Tor Wiki and/or a blog post.
As always, to focus our thoughts, we start with the adversary's goals.
Adversary Goals
There is a significant difference between adversaries that can see inside of router-to-router TLS vs those that cannot. I believe this capability distinction governs the adversary goals in terms of compromising relays as opposed to merely externally observing them.
Adversaries that can unwrap router TLS can perform every attack that an actual node can perform, at any location between the user and the node, and/or between the node and other nodes.
In particular, adversaries that can see inside router TLS can perform tagging attacks (see https://lists.torproject.org/pipermail/tor-dev/2012-March/003361.html) as well as perform circuit-specific active and passive timing analysis.
These attacks can be quite severe. An adversary that obtains Guard identity keys is free to perform a tagging attack anywhere on the Internet. In other words, an adversary interested in monitoring a particular user need only obtain the identity keys for that user's 3 guard nodes. From that point on, the adversary can transparently monitor everything that user does, using tagging to bias the user's paths so that they connect only to surveilled exit nodes whose identity keys have also been compromised.
Based on this distinction, it seems that some simple best practices can raise the cost for an adversary that wishes to compromise Tor traffic.
Let's now consider how the adversary goes about compromising router TLS.
Attack Vectors
There are two high-level vectors towards seeing inside node-to-node TLS (which uses ephemeral keys that are rotated daily and authenticated via the node's identity key). Both high-level vectors therefore revolve around node identity key theft.
Attack Vector #1: One-Time Key Theft
The one-time adversary is interested in grabbing keys once and then operating transparently upstream afterwards. This adversary will take the form of a coercive request at a datacenter/ISP to extract node identity key material and, from then on, operate externally as a transparent upstream MITM, creating fake ephemeral TLS keys authenticated with the stolen identity key. Operators of Tor nodes that encounter this adversary will likely see it in the form of unexplained reboots or mysterious downtime, both of which are inevitable in the lifespan of any Tor node.
Attack Vector #2: Advanced Persistent Threat Key Theft
If one-time methods fail or are beyond reach, the adversary has to resort to persistent machine compromise to retain access to node key material.
The APT attacker can use the same vector as #1 or perhaps an external vector such as daemon compromise, but they then must also plant a backdoor that would do something like trawl through the RAM of a machine, sniff out the keys (perhaps even grabbing the ephemeral TLS keys directly), and transmit them offsite for collection.
This is a significantly more expensive position for the adversary to maintain, because the backdoor can be noticed during a thorough forensic investigation of a perhaps unrelated incident, and it may inadvertently trigger firewall warnings or other common least-privilege defense alarms.
Unfortunately, it is also a more expensive attack to defend against, because it requires extensive auditing and assurance mechanisms on the part of the relay operator.
Defenses
The above suggests that, at minimum, relays should protect against one-time key compromise. Some further thought shows that it is possible to make the APT adversary's task harder as well, albeit with significantly more effort.
Let's deal with defending against each vector in turn.
Prevent Vector #1 (One-Time Key Theft): Deploy Ephemeral Identity Keys
The simplest way to defend against the adversary who attempts to extract relay keys through a reboot is to take advantage of the fact that even node identity keys can be ephemeral, and do not need to persist long term (certainly not past a reboot). This can be achieved with a boot script that wipes your keys (they live in /var/lib/tor/keys) at startup, or by using a ramdisk: http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
Of the two, the ramdisk option is superior, since it will prevent the adversary from easily re-using your old keys after you begin using the new ones.
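As a rough illustration of the boot-script option, here is a minimal sketch. The function name wipe_tor_keys and the TOR_KEY_DIR variable are hypothetical, and the default path is simply the key directory named above. It assumes the script runs before tor starts.

```shell
#!/bin/sh
# Hypothetical helper: remove all key files from a Tor key directory so that
# tor generates fresh keys the next time it starts.
# TOR_KEY_DIR is an illustrative variable, not a Tor setting.
wipe_tor_keys() {
    key_dir="${1:-${TOR_KEY_DIR:-/var/lib/tor/keys}}"
    # Nothing to do if the directory does not exist yet.
    [ -d "$key_dir" ] || return 0
    # Delete regular files only; the directory and its permissions are kept.
    find "$key_dir" -type f -exec rm -f {} +
}
```

Note that rm only unlinks the files. The old key blocks may still be recoverable from disk, which is one reason the ramdisk option is the stronger of the two.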
Additionally, ssh server key theft is another one-time vector that can be used to quickly bootstrap into node key theft. For this reason, node admins should always use ssh key auth for tor node administration accounts, since it prevents ssh server key theft from implying continuous server compromise: http://www.gremwell.com/ssh-mitm-public-key-authentication
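One way to sanity-check that a relay's administration accounts cannot fall back to password logins is sketched below. It assumes OpenSSH with its configuration in /etc/ssh/sshd_config. The function name password_auth_disabled is hypothetical, and the check deliberately ignores Match blocks and Include directives, so treat it as a first approximation only.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the given sshd_config explicitly sets
# "PasswordAuthentication no". OpenSSH uses the first occurrence of a keyword,
# keywords are case-insensitive, and the default when unset is "yes".
# Simplification: Match blocks and Include directives are not interpreted.
password_auth_disabled() {
    conf="${1:-/etc/ssh/sshd_config}"
    value=$(awk 'tolower($1) == "passwordauthentication" { print tolower($2); exit }' "$conf")
    [ "$value" = "no" ]
}
```

For the effective configuration as sshd itself sees it, running sshd -T as root is the more reliable source.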
Issues With Ephemeral Identity Keys
There are a few issues with deploying ephemeral identity keys.
Issues With Ephemeral Identity Keys: Client guard node loss
The primary issue with ephemeral identity keys is client Guard node loss. If your relay obtains the Guard flag, you should endeavor to keep it. If you have planned maintenance and controlled reboots, you should copy your identity keys to a safe location prior to reboot so that clients aren't forced to rotate their guards prematurely due to unnecessary rekeying.
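A minimal sketch of saving the identity key before a planned reboot follows. The function name backup_identity_key and both directory arguments are hypothetical; the destination is assumed to be storage you trust, such as removable media that does not stay attached to the relay.

```shell
#!/bin/sh
# Hypothetical helper: copy secret_id_key from a Tor key directory to a
# backup directory, keeping the copy readable only by its owner.
backup_identity_key() {
    key_dir="$1"
    dest_dir="$2"
    if [ ! -f "$key_dir/secret_id_key" ]; then
        echo "no identity key found in $key_dir" >&2
        return 1
    fi
    (
        # Restrictive umask so the directory and copy are not world-readable.
        umask 077
        mkdir -p "$dest_dir" &&
        cp "$key_dir/secret_id_key" "$dest_dir/secret_id_key" &&
        chmod 600 "$dest_dir/secret_id_key"
    )
}
```

After the planned reboot, the key would be copied back into the key directory before tor starts. After an unexplained reboot, the backup should be discarded instead.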
Issues With Ephemeral Identity Keys: MyFamily
The next issue is that identity key rotation makes the use of MyFamily very complicated. For large families, it's nearly impossible to update the MyFamily line of each node instance after every unexpected reboot.
We've filed a bug to see if we can find a more convenient way to provide the same feature, but it's not clear that MyFamily is worth maintaining: https://trac.torproject.org/projects/tor/ticket/5565
It is very likely that the security benefit from maintaining MyFamily pales in comparison to the gain from deploying ephemeral identity keys, and that MyFamily should be abandoned entirely.
Issues With Ephemeral Identity Keys: Tor Weather
The final issue with ephemeral identity keys is that node monitoring mechanisms such as Tor Weather become difficult to use in the face of rotating keys. We've filed this bug to improve Weather's subscription mechanisms: https://trac.torproject.org/projects/tor/ticket/5564
Preventing Vector #2: Isolation Hardening and Readonly Runtime
Once one-time key theft has been dealt with, you can begin to consider how to deal with the Advanced Persistent Threat.
The effort required to defend against this adversary is considerable, and it is not expected that all operators will devote the effort to do so.
To limit scope, we are not going to deal with the daemon compromise vector; for that, see your OS least-privilege mechanisms (such as SELinux, AppArmor, Grsec RBAC, Seatbelt, etc). Instead, we will deal with how you can attempt to protect your identity keys once an adversary already has root access.
If you are serious about defending against this adversary, the first thing you will want to do is disable access to the 'ptrace' system call from userland, which allows easy key theft using debugging tools. Note that all current built-in kernel mechanisms to do this still allow root users to use ptrace on arbitrary processes. In order to disable ptrace for root users, you need to load a kernel module to patch the syscall table to remove access to the syscall itself. Two options for this are: https://gist.github.com/1216637 http://people.baicom.com/~agramajo/misc/no-ptrace.c
Once access to the ptrace system call is removed, you need to disable module loading to prevent it from being restored. On Linux, this is accomplished via 'sysctl kernel.modules_disabled=1'. You should perform this operation as early in the boot process as possible. One technique that works on Red Hat-based systems is to place a shell script in /etc/rc.modules that loads the modules you need for operation, inserts the ptrace module, and then issues the sysctl to disable further module loading. Red Hat derivatives launch /etc/rc.modules first thing at the top of /etc/rc.sysinit.
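The ordering described above might look roughly like the following /etc/rc.modules sketch. The function name lock_down_modules, the RUN dry-run knob, the NO_PTRACE_KO variable, and the default module path are all hypothetical placeholders; substitute whichever ptrace-removal module you actually built and the modules your hardware needs.

```shell
#!/bin/sh
# Sketch of an /etc/rc.modules helper. Order matters: load required modules,
# insert the ptrace-removal module, then forbid any further module loading.
# RUN is a hypothetical dry-run knob: set RUN=echo to print the commands
# instead of executing them.
RUN="${RUN:-}"

lock_down_modules() {
    # "$@" lists the kernel modules the system needs for normal operation.
    for mod in "$@"; do
        $RUN modprobe "$mod" || return 1
    done
    # Placeholder path; point NO_PTRACE_KO at your own compiled module.
    $RUN insmod "${NO_PTRACE_KO:-/root/no-ptrace.ko}" || return 1
    # One-way switch: modules cannot be loaded again until the next reboot.
    $RUN sysctl -w kernel.modules_disabled=1
}

# In a real rc.modules the last line would be a call such as:
#   lock_down_modules e1000 ext4
```

Trying it first with RUN=echo is a cheap way to confirm the module list, since a module forgotten here cannot be loaded later without rebooting.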
After that comes ensuring runtime integrity. There are several ways to achieve this, but most are easily subverted by an attacker with direct access to the hardware. The most robust approach seems to be to create a small encrypted loopback filesystem that contains all of the libraries required to run the 'tor' process as well as all of the requisite configuration files. Requisite libraries can be determined via 'ldd /usr/bin/tor'. The encrypted loopback filesystem doesn't need to be more than ~25M in size, but you will also need an auxiliary var loopback that needs to be a hundred megs or so.
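To turn the ldd output into a list of files to copy into the loopback root, something like the sketch below may help. The function names ldd_paths and copy_tor_libs are hypothetical, and the parsing assumes the usual glibc ldd output format ("name => /path (address)" plus a bare absolute path for the dynamic loader).

```shell
#!/bin/sh
# Hypothetical filter: read ldd output on stdin and print the absolute path
# of each resolved library. Entries without a file on disk (such as
# linux-vdso) and "not found" entries are skipped.
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\//               { print $1 }'
}

# Hypothetical helper: copy every library a binary needs into a new root,
# preserving the directory layout and following symlinks to the real files.
copy_tor_libs() {
    binary="$1"
    root="$2"
    ldd "$binary" | ldd_paths | while IFS= read -r lib; do
        mkdir -p "$root$(dirname "$lib")" && cp -L "$lib" "$root$lib"
    done
}

# Example (paths are illustrative):
#   copy_tor_libs /usr/bin/tor /mnt/tor-root
```

Libraries that tor loads at runtime rather than links against will not show up in ldd output, so the resulting image should be tested by actually starting tor inside it.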
Here are the commands for creating the root loopback filesystem:
    dd if=/dev/urandom of=./tor-root.img bs=1k count=25k
    losetup /dev/loop1 ./tor-root.img
    cryptsetup luksFormat /dev/loop1
    cryptsetup luksOpen /dev/loop1 tor-root
    mkfs.ext4 /dev/mapper/tor-root
Once this encrypted loop is created such that it can run your relay's Tor processes, you should take the sha1sum of the file and store it offsite.
When you use this loopback, you will mount it readonly, and mount an unencrypted var directory inside of it, and a ramdisk for your keys inside of that:
    dd if=/dev/urandom of=./tor-var.img bs=1k count=200k
    losetup /dev/loop2 ./tor-var.img
    mkfs.ext4 /dev/loop2

    mount /dev/mapper/tor-root /mnt/tor-root -o ro
    mount /dev/loop2 /mnt/tor-root/var

    mkfs -q /dev/ram1 128
    mount /dev/ram1 /mnt/tor-root/var/lib/tor/keys
    cd /mnt/tor-root/
    chroot . start_tor.sh
Once you start your tor process(es), you will want to copy your identity key offsite, and then remove it. Tor does not need it to remain on disk after startup, and removing it ensures that an attacker must deploy a kernel exploit to obtain it from memory. While you should not re-use the identity key after unexplained reboots, you may want to retain a copy for planned reboots and tor maintenance.
    scp /mnt/tor-root/var/lib/tor/keys/secret_id_key offsite_backup:/mnt/usb/tor_key
    rm /mnt/tor-root/var/lib/tor/keys/secret_id_key
Upon suspicious reboots, you can verify the integrity of your tor image by simply calculating the sha1sum (perhaps copying the image offsite first). You do not need to do anything special with the var loopback.
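The integrity check can be scripted along these lines. The function name image_matches_sum is hypothetical, and the expected checksum is assumed to come from your offsite record rather than from anything stored on the relay itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the SHA-1 of the image file equals the
# checksum recorded offsite when the image was created.
image_matches_sum() {
    image="$1"
    expected="$2"
    # sha1sum prints "<hex digest>  <filename>"; keep only the digest.
    actual=$(sha1sum "$image" | awk '{ print $1 }')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example (the checksum is whatever you stored offsite):
#   image_matches_sum ./tor-root.img "$RECORDED_SHA1" || echo "image changed!"
```

Running the check on a copy of the image from a trusted machine is preferable, since a compromised host could tamper with the sha1sum binary as well.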
These steps should prevent even adversaries who compromise the root account on your system (by rebooting it, for example) from obtaining your identity keys directly, forcing them to resort to kernel exploits and memory gymnastics in order to do so.
Don't forget to periodically update the libraries stored on your loopback root using a trusted offsite source, as they won't receive security updates from your distribution.
One alternative to make your loopback fs creation, tor startup, and maintenance process simpler is to statically compile your image's tor binary on an offsite, trusted computer. If you do this, you should no longer need to bother with chrooting your tor processes or copying libraries around. However, it still does not save you from the need to recompile that binary whenever there is a security update to the underlying libraries, and it may come at a cost of exploit resistance due to the loss of per-library ASLR.
Ok, that's it. What do people think? Personally, I think that if we can require a kernel exploit and/or weird memory gymnastics for key compromise, that would be a *huge* improvement. Do the above recommendations actually accomplish that?
If so, should we work on providing scripts to make the loopback filesystem creation process easier, and/or provide loopback images themselves?
Even if the APT defenses end up not working out, I would sleep a lot better at night if most relays deployed only the defenses against one-time key theft... Thoughts on that?