> Please define "crawling of .onion".
> I don't know enough about the details of what you're doing to have a strong opinion.
I mean search engines crawling HTML pages on .onion. Like doing:
ahmia.fi does do crawling. I leave further discussion to them.
OnionLink actually does *zero* crawling. I leave it to Google et al.
When Google crawls me, they are:
* using .onion addresses found via a search engine.
* using .onion addresses found on HTML pages on other .onion sites.
None of the rest. Nothing with HSDirs, etc. The *only* HSDir thing that has ever existed is caching NXDOMAIN responses from HSDirs to reduce the load Tor2web places on the Tor network. This was solely to *be kind to the operators*. However, as the caching caused some uproar, I've stopped caching NXDOMAINs and have returned to unnecessarily burdening the Tor network.
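For anyone unfamiliar with the idea, here is a minimal sketch of negative caching in general: remember for a short while that a lookup failed, so the same failing lookup is not repeated. This is not OnionLink's code; the class name `NegativeCache`, its methods, and the TTL value are all hypothetical and exist only for illustration.

```python
import time


class NegativeCache:
    """Remember failed lookups for a limited time (hypothetical sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._expires = {}          # onion address -> expiry timestamp

    def mark_missing(self, onion):
        """Record that a lookup for this address just failed."""
        self._expires[onion] = self.clock() + self.ttl

    def is_known_missing(self, onion):
        """Return True while a recorded failure is still fresh."""
        expiry = self._expires.get(onion)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Entry is stale: forget it so the next request looks it up again.
            del self._expires[onion]
            return False
        return True
```

A gateway using something like this would check `is_known_missing()` before asking the network and call `mark_missing()` after a failed lookup.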
> How do you access and index the web content on those .onion sites?
The accessing is just plain Tor2web HTTP requests. They announce themselves with the HTTP header `x-tor2web: true`. Google does the indexing.
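A .onion operator who wants to treat these requests differently could check for that header. A minimal sketch, assuming the request headers are available as a plain mapping of names to values; the function name `is_tor2web_request` is made up for this example, and HTTP header names are compared case-insensitively:

```python
def is_tor2web_request(headers):
    """Return True if the headers carry `x-tor2web: true`."""
    for name, value in headers.items():
        # Header names are case-insensitive, so normalise before comparing.
        if name.lower() == "x-tor2web" and value.strip().lower() == "true":
            return True
    return False
```

How the site responds (serve, block, or show a notice) is up to the operator; the sketch only covers detection.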
> How often do you access the site?
Looking at analytics from Googlebot accessing Onionlink, every 7-21 days.
> How many pages deep do you go on the site?
Don't know. I suppose Google goes as deep as possible.
> Do you follow links to other .onion sites?
Yes, subject to the other .onion site's /robots.txt policy.
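To illustrate what honouring a /robots.txt policy means in practice, here is a small sketch using Python's standard `urllib.robotparser`. It is not a description of how Google implements this; the helper name `allowed_by_robots`, the example policy, and the `example.onion` URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: everything is crawlable except /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this policy, `allowed_by_robots(EXAMPLE_ROBOTS, "Googlebot", "http://example.onion/index.html")` is allowed, while the same call for `http://example.onion/private/page` is not.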
> How do you make sure that Tor2web users are anonymised (as possible) when accessing hidden services?
I make a good-faith effort not to wantonly reveal personally identifying information. But in short, it's hard. I urge people to think of tor2web nodes as closer to Twitter, which records which links you click. I wholly support having the "where is Tor2web in regards to user privacy" discussion (hopefully we could even make some improvements to it!), but it is orthogonal to the "robots.txt on .onion" discussion. Let's address the robots.txt issue and then we can return to Tor2web user privacy.
> Please stop releasing logs.
> It could easily be seen as a provocative act.
Yeah, I understand. This is my 3rd or 4th attempt to discuss this and I was intentionally being a little pokey. I have no intention or desire to actually compromise anonymity.
-V