On 02/27/2014 03:14 AM, Roger Dingledine wrote:
On Sun, Feb 23, 2014 at 05:38:23PM +0530, Devang Thakkar wrote:
It's Devang here, a coding enthusiast studying at IIT Bombay. I am looking forward to contributing to Tor for the upcoming Google Summer of Code 2014 as a prospective student. So I wanted to know if there was a provision for web scraping using Tor. If there is, I would like to know more about it; if there isn't, is it a feasible Summer of Code project?
Web scraping using Tor is usually regarded as a bad thing -- first because it loads down the Tor network much more than normal browsing, and second because it makes destination websites more likely to get angry with Tor. For example, when Bing starts scraping Google over Tor in order to improve their search results, Google responds by making it harder to crawl Google over Tor, which impacts normal Tor users reaching Google too.
So I think we'd be happy to have a project on how to make website scraping through Tor less damaging to destinations and thus to users, but I think we're unlikely to find a "make it easier to scrape websites through Tor" project exciting.
Inconveniently enough, scraping websites (and hidden services) over Tor is exactly what a lot of the CMU Tor-related research involves. We have developed a few in-house tools for it (none of which are anywhere close to turnkey). We haven't put any serious thought into making it "less damaging to destinations," but I think we would be interested in helping with a project along those lines. Offhand I dunno if what's needed is code so much as best-practices documentation, though (what's an appropriate level of rate limiting, you really ought to run a private entry node, that sort of thing...)
zw
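
As a rough illustration of the rate-limiting point above, here is a minimal sketch of a per-host limiter that a scraper could call before each request. Nothing in it comes from the CMU tools mentioned in the thread: the class name `PerHostRateLimiter`, the `wait` method, and the 10-second default interval are all hypothetical, and the interval in particular is an arbitrary placeholder, not a recommended value. It uses only the Python standard library and does not talk to Tor itself.

```python
import time
from urllib.parse import urlsplit


class PerHostRateLimiter:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical helper for illustration; min_interval_s is an assumed
    placeholder, not an agreed-upon "appropriate" rate.
    """

    def __init__(self, min_interval_s=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the limiter can be tested without real delays
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was permitted

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds slept."""
        host = urlsplit(url).hostname or ""
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the time at which this request was released.
        self._last_request[host] = self._clock()
        return delay
```

A scraper would call `limiter.wait(url)` immediately before fetching `url`. Keying the delay on the hostname means one slow-crawled site does not hold up requests to unrelated sites, while no single destination sees requests more often than the chosen interval.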