Hey all,
I'm looking for a guide on how to take an existing service and convert it into an .onion-too service (what facebook and propublica did).
Part of why I wrote [1] *is* that I was having just this problem. (There aren't many guides on providing onion access to clearnet sites, and the knowledge I did find was spread out around the internet: subdomains, vanity onion domains, and what to do about hosting onion services in production.)
It was my first stab at it, and I hope that now (that we've published that and people are talking about it and this mailing list exists) the guide situation will get better.
Problem: Webservices tend to respond to a single URL only (like http://clearservice.com/), and won't deal gracefully with requests to http://onionservice.onion/; i.e. they might redirect to the clearservice address.
Before settling on a proxy, I thought of the ways I could maybe handle this.
1) You update your application to generate .onion URIs when it sees that a request is coming from the onion service.
2) An HTTP proxy at the onion service rewrites your application's responses to turn your clearnet URIs into onion URIs.
3) Rely on the client application to rewrite the URIs; such as with something like the darkweb-everywhere extension (or similar https-everywhere ruleset manipulation) or something yet to be decided on and built, see [2].
#1 was a non-starter because of the size of our site (years and years of content) and the issue of CMS maintenance (I'm not the one who does it, nor did I want to become the person to hack it and own it going forward).
#3 is not desirable because it is not ubiquitous (or does not exist yet), and we want all of our onion users to get the same clearnet-avoidance. (But FWIW, I very quietly submitted a pull request to add our onion site to darkweb-everywhere, many months before the onion site became publicly known.)
So, #2 proxying it is. (Looks like the freiheitsfoo config does the same thing.) A bonus with #2 is that it addresses the concern about a service which only responds properly to one domain: the proxy sets the upstream "Host" HTTP header to the expected clearnet domain (ours sets this to "www.propublica.org"), so that the original application acts normally. This keeps your application code simple.
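To illustrate the "Host" header point, here is a minimal Python sketch of what the proxy does to the request headers before passing them upstream. It is not the nginx config itself; the function name `upstream_headers` and the constant `UPSTREAM_HOST` are made up for this example.

```python
# Hypothetical constant: the clearnet name the application expects to see.
UPSTREAM_HOST = "www.propublica.org"


def upstream_headers(incoming):
    """Copy the request headers for the upstream, forcing Host to the clearnet name."""
    # Drop whatever Host the onion client sent (header names are case-insensitive).
    headers = {k: v for k, v in incoming.items() if k.lower() != "host"}
    # The application only ever sees its usual clearnet hostname.
    headers["Host"] = UPSTREAM_HOST
    return headers
```

A request that arrives with "Host: www.propub3r6espa33w.onion" would go upstream with "Host: www.propublica.org" and every other header unchanged.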
propublica has published their nginx setup to deal with this, but this looks a bit scary.
For one, they don't seem to rewrite protocol-relative URLs like href="//sub.clearservice.com/".
The config in question is [3]. Yep, admittedly it is a bit scary; the sheer amount of content-rewriting is possibly a bit dangerous. (Ours is particularly complicated because I was worried about rewriting things that shouldn't be, and because of some inconsistent cases where the same thing could live at multiple domain name variants. I like freiheitsfoo's if you want a simpler version of the same technique, though they don't handle protocol-relative URIs on input since I assume their clearnet site doesn't use them. And other very minor differences.)
I'll note that our config *does* try to rewrite protocol-relative URIs; it could be simpler and the config grouped a bit better. But for any of our domains, the regular expressions attempt to rewrite all three cases of http/https/proto-relative clearnet to protocol-relative onion, i.e. this line and others like it:
subs_filter (http:|https:)?//(www.)?propublica.org/ //www.propub3r6espa33w.onion/ gir;
(Chose protocol-relative onion since we're currently testing SSL out with a self-signed cert as we work on getting an EV SSL certificate for our onion site.)
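For anyone who wants to experiment with the pattern outside nginx, the substitution above can be sketched in Python roughly as follows. This is an illustration only, not the production rule: the names `CLEARNET_RE`, `ONION_PREFIX` and `rewrite_to_onion` are invented here, the dots in the hostname are escaped (an assumption about the intent), and `re.IGNORECASE` plus replace-all stand in for the case-insensitive, global behaviour of the nginx rule.

```python
import re

# Optional scheme, then "//", optional "www.", then the clearnet domain.
CLEARNET_RE = re.compile(r"(?:https?:)?//(?:www\.)?propublica\.org/", re.IGNORECASE)

# Protocol-relative onion prefix, so the link follows the scheme of the current page.
ONION_PREFIX = "//www.propub3r6espa33w.onion/"


def rewrite_to_onion(body):
    """Rewrite http, https and protocol-relative clearnet URLs to the onion prefix."""
    return CLEARNET_RE.sub(ONION_PREFIX, body)
```

All three input forms (http://, https:// and //) collapse to the same protocol-relative onion URL, while URLs on unrelated domains pass through untouched.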
The rules are a bit worse than that because we have some inconsistent conventions around our static assets in Amazon S3 (cloudfront CDN domain vs s3.amazonaws/<bucketname> vs <bucketname>.s3.amazonaws etc) and have many domains and things like that. (And now that I look, the gist is slightly messier and out of date too. Will see about cleaning up the publicly-posted version.) But I think maybe this is a point to bring up, too: for a large enough general-purpose clearnet site, any rewrite rules like this are perhaps going to evolve to become more specific to your site, your application, and your users.
And then there generally is the question of ensuring that no clearnet URL escapes the rewriting. I guess for that, you'd need to implement a more thorough link checker and not just some nginx filter rules.
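A very small version of such a checker could look like the Python sketch below: it scans an already-rewritten response body and lists any URL-like strings that still point at a clearnet host. Everything here is an assumption for illustration (the `CLEARNET_HOSTS` list, the function name `find_clearnet_leaks`, and the idea of matching on raw text rather than parsing HTML); a real checker would also need to fetch pages and cover every domain the site uses.

```python
import re

# Hypothetical list of clearnet hosts that should never appear in onion responses.
CLEARNET_HOSTS = ["propublica.org"]


def find_clearnet_leaks(body, hosts=CLEARNET_HOSTS):
    """Return the URL-like strings in body that still point at a clearnet host."""
    alternation = "|".join(re.escape(h) for h in hosts)
    # Optional scheme, "//", any subdomain labels, one of the hosts, then the
    # rest of the URL up to whitespace, a quote or an angle bracket.
    pattern = re.compile(
        r"(?:https?:)?//(?:[a-z0-9-]+\.)*(?:" + alternation + r")"
        r"(?![a-z0-9.-])[^\s\"'<>]*",
        re.IGNORECASE,
    )
    return pattern.findall(body)
```

Run against the proxy's output, an empty list means nothing matched these particular patterns; it does not prove that no clearnet reference exists in some other form (for example, built up in JavaScript).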
In theory, *maybe* we'd want something smarter than nginx with the substitutions filter. A limitation of that module is that we have no ability to rewrite strings inside the HTTP headers, we can only add/remove headers. But otherwise it does the job well.
For something that has to play the role of HTTP server or proxy, it'd have to be pretty performant, and nginx fits the bill quite well. (And I wonder what would accomplish this better than some smart regular expressions against the partial URI or hostname? Although something to test your site's content after the fact would be nice -- though wouldn't this also rely on some string/regex matching to see what was missed?)
And last: for our site (and I'm sure many other sites), outbound links (and even some assets/multimedia/etc) are always going to be a problem, since not all of the content we reference or use, and not all of the partners we work with, have onion sites. Such is life on clearnet websites, and when navigating the space between "pure" clearnet and "pure" onionspace.
Anyway, sorry for the long reply here.
Very interested to hear others' thoughts about this sort of proxying and keeping onion-users in onion space.
Best,
Mike Tigas
News Applications Developer, ProPublica
https://www.propublica.org/
@mtigas | https://mike.tig.as/ | 0x6E0E9923

[1]: https://www.propublica.org/nerds/item/a-more-secure-and-anonymous-propublica...
[1-onion]: http://www.propub3r6espa33w.onion/nerds/item/a-more-secure-and-anonymous-pro...
[2]: https://lists.torproject.org/pipermail/tor-talk/2016-January/039899.html
[3]: https://gist.github.com/mtigas/9a7425dfdacda15790b2#file-2-nginx