Hi everyone!
It has been quite some time since the last update; the details in between are on the DIAL blog for the project [1]. As of now, we have a working project with the features listed in [2]. The dotted part (the consensus module), an idea taken from the Senser paper [3], hasn't been implemented yet, but I hope to implement it within a week or two at most. I was personally leaning towards comparing the structure of websites, but after some discussions with woswos and Micah Sherr [4], I've decided to implement the content-based approach too.
I'll briefly describe both methods below:

+ Structure of the website: This was chosen because we don't know in advance what changes a website may undergo. It is especially useful for dynamic websites and for websites whose language depends on geolocation (geotargeting). However, I have to use a filter list and a statistical method for this approach.
+ Content-based approach: Compares the content of the HTML data using a tree-like structure and hashes to determine how similar or different the structure is. Proxies in the same locations are used as vantage points to get better results.
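
To make the two ideas a bit more concrete, here is a minimal sketch, not the actual `Analyzer.py` logic. It assumes we already have two HTML bodies for the same URL (one fetched through a Tor exit, one through a control vantage point). The names `TagPathCollector`, `fingerprint` and `compare` are made up for illustration, and `difflib` similarity stands in for whatever statistical method ends up being used.

```python
import hashlib
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records the root-to-element tag path of every element and the visible text."""

    # Elements that never have a closing tag, so they must not be pushed on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.paths = []   # e.g. "html/body/div/p", one entry per element
        self.text = []    # visible text fragments

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Skip script/style bodies, keep everything else that is not whitespace.
        if self.stack and self.stack[-1] in ("script", "style"):
            return
        if data.strip():
            self.text.append(data.strip())


def fingerprint(html):
    """Return (tag paths, visible text, SHA-256 hash of the tag paths)."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    digest = hashlib.sha256("\n".join(parser.paths).encode()).hexdigest()
    return parser.paths, " ".join(parser.text), digest


def compare(html_a, html_b):
    """Compare two pages by structure (tag paths) and by content (words)."""
    paths_a, text_a, hash_a = fingerprint(html_a)
    paths_b, text_b, hash_b = fingerprint(html_b)
    return {
        "same_structure": hash_a == hash_b,
        "structure_similarity": SequenceMatcher(None, paths_a, paths_b).ratio(),
        "content_similarity": SequenceMatcher(
            None, text_a.split(), text_b.split()).ratio(),
    }
```

In this sketch, a geotargeted page that only changes language keeps the same structure hash while its content similarity drops. A block page would typically differ on both measures.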
That said, the methods above are meant for the case where websites partially block Tor. One good example is https://dan.me.uk/, which doesn't block Tor exit relays completely but serves an error page (a partial block) without an error HTTP response code. Checking the HTTP response codes is the low-hanging fruit and our first step. It performs well but can sometimes produce false positives (for example, reporting a website like https://cloudflare.com as completely blocked when it returns a CAPTCHA or is only partially blocked).
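
As a rough illustration of that first step, here is a minimal sketch. It assumes we already have the final HTTP status code (after redirects) from a control fetch and from a Tor exit fetch. `Verdict` and `first_pass` are hypothetical names and not part of the project code.

```python
from enum import Enum


class Verdict(Enum):
    INCONCLUSIVE = "inconclusive"
    BLOCKED = "blocked"
    NEEDS_CONTENT_CHECK = "needs content check"


def is_success(status):
    """True for 2xx codes (assumes redirects were already followed)."""
    return 200 <= status < 300


def first_pass(control_status, tor_status):
    """Status-code-only check comparing a control fetch with a Tor exit fetch."""
    if not is_success(control_status):
        # The site fails even without Tor, so nothing can be said about blocking.
        return Verdict.INCONCLUSIVE
    if not is_success(tor_status):
        # Possible false positive: a CAPTCHA or partial block served with an
        # error status would also land here and be reported as fully blocked.
        return Verdict.BLOCKED
    # A success code over Tor is not proof of access: an error page can be
    # served without an error status, so the page itself must be compared.
    return Verdict.NEEDS_CONTENT_CHECK
```

Only the `NEEDS_CONTENT_CHECK` cases would then go on to the structure and content comparison, which keeps the expensive analysis off the obvious cases.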
For demo purposes, one can refer to the experimental code [5] and its log [6]. (It isn't great code and is a bit old, but I wrote it to back up the first method, the structure of the website.) One could also look at `Analyzer.py` [7,8], which contains the most recent and improved analysis logic. I hope to improve it with every passing day. I also plan to create a FAQ [9] page with excerpts of discussions and answers explaining why a given approach was taken.
Thanks,
Apratim (irc: _ranchak_)
** Looking forward to suggestions and comments on how to improve it. Materials such as research papers in this domain would also be helpful. **
References:
[1] https://hub.osc.dial.community/t/tor-project-alexa-top-sites-captcha-and-blo...
[2] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021#updat...
[3] http://people.cs.georgetown.edu/~wzhou/publication/senser-acsac13.pdf
[4] https://seclab.cs.georgetown.edu/msherr/
[5] https://github.com/Hackhard/Fetcher/blob/b9f2fa8d09061862cf954537cbaad7921dd...
[6] https://raw.githubusercontent.com/Hackhard/Fetcher/main/status%20code/test_r...
[7] Consensus_lite branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/consensus_lite/s...
[8] Master branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/master/src/captc...
[9] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021/Faqs

On Mon, Jul 12, 2021 at 05:01:35PM +0530, Apratim Ranjan Chakrabarty wrote:
> ** Looking forward for suggestions and comments as to how to improve on it. Also materials like research paper in this domain would be helpful **

Section IV-C of the ICLab paper has a discussion of block page detection. The first pass is a regex match for known block pages, but there is also clustering by similar HTML structure and text.
https://censorbib.nymity.ch/#Niaki2020a
https://github.com/net4people/bbs/issues/52
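
A loose sketch of that two-stage idea, not ICLab's actual implementation: the regex patterns are illustrative placeholders rather than real block page signatures, the greedy clustering and the 0.8 threshold are assumptions, and only text similarity is used here for brevity.

```python
import re
from difflib import SequenceMatcher

# Placeholder signatures for illustration only; a real list would be curated
# from block pages actually observed in the wild.
KNOWN_BLOCK_PATTERNS = [
    re.compile(r"access denied", re.IGNORECASE),
    re.compile(r"captcha", re.IGNORECASE),
]


def matches_known_block_page(body):
    """First pass: does the page match any known block page signature?"""
    return any(pattern.search(body) for pattern in KNOWN_BLOCK_PATTERNS)


def cluster_pages(bodies, threshold=0.8):
    """Second pass: greedy clustering by word-level text similarity.

    Each page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `bodies`.
    """
    clusters = []
    for index, body in enumerate(bodies):
        for cluster in clusters:
            representative = bodies[cluster[0]]
            ratio = SequenceMatcher(
                None, representative.split(), body.split()).ratio()
            if ratio >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Large clusters of near-identical pages coming from many unrelated sites would then be candidates for new block page signatures.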

The 2016 "Do You See What I See?" study seems to be in line with your project. "The second-class treatment of anonymous users ranges from outright rejection to ... imposing hurdles such as CAPTCHA-solving.... Our study draws upon ... scans of the home pages of top-1,000 Alexa websites through every Tor exit..." Section V-A has to do with scans of top-ranked sites.
https://www.ndss-symposium.org/wp-content/uploads/2017/09/do-you-see-what-i-...
https://archive.org/details/ndss16doyousee