Hi everyone!

It has been quite some time since the last update; anyone who wants the details in between can find them in the DIAL blog posts for the project [1].
As of now, we have a working project that implements the logic described in [2]. The part shown dotted there (the consensus module), an idea taken from the Senser paper [3], hasn't been implemented yet, but I hope to implement it within a week or two at most. I was personally inclined towards comparing the structure of the website, but after some discussions with woswos and Micah Sherr [4], I've decided to implement the content-based approach too.

I'll briefly describe both the methods below:

+ Structure of the website: This was chosen because we can't know in advance what changes a website will show from one fetch to the next. It is especially useful for dynamic websites and for websites that pick their language based on geolocation (geotargeting). However, I have to use a filter list and a statistical method to approach the problem.

+ Content-based approach: Compares the content of the HTML data using a tree-like structure and hashes to determine how similar or different the structure is. Proxies in the same locations are used as vantage points to get better results.
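
To make the tree-and-hash idea a bit more concrete, here is a rough sketch of one way to compare two fetches of the same page. It is not the code in `Analyzer.py`; the names `TagPathCollector`, `structure_hashes` and `structural_similarity` are made up for illustration, and the choice of tag paths plus Jaccard similarity is an assumption rather than the method the project settled on.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: hashes every root-to-node tag path in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.path_hashes = set()

    def handle_starttag(self, tag, attrs):
        # Text and attributes are ignored on purpose: only the shape counts.
        path = "/".join(self.stack + [tag])
        self.path_hashes.add(hashlib.sha256(path.encode()).hexdigest())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structure_hashes(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.path_hashes


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two sets of tag-path hashes (0.0 to 1.0)."""
    a, b = structure_hashes(html_a), structure_hashes(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Because the text nodes are ignored, a page served in another language but with the same layout still scores 1.0, while a short error page scores much lower. Where to put the threshold between "same page" and "block page" is exactly the part that needs the statistical method mentioned above.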

That said, the methods above are meant for the case where a website partially blocks Tor. One good example is https://dan.me.uk/, which doesn't block Tor exit relays completely but serves an error page (a partial block) without an error HTTP response code. Checking the HTTP response codes is the low-hanging fruit, so it is our first step. It performs well but can sometimes produce false positives (e.g. it reports a website like https://cloudflare.com as completely blocked when it actually returns a CAPTCHA or is only partially blocked).
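
As a rough sketch of how that first step could hand over to the deeper analysis (again, not the project's actual code: the function name `status_code_check`, the labels it returns, and the rule of treating any status of 400 or above as an error are assumptions made for illustration):

```python
def status_code_check(control_status, tor_status):
    """First-pass verdict from the HTTP status codes of two fetches.

    control_status: status code when fetching without Tor
    tor_status:     status code when fetching through a Tor exit relay
    """
    if control_status >= 400:
        # The site fails without Tor as well, so nothing can be concluded.
        return "inconclusive"
    if tor_status >= 400:
        # Looks like a block, but a CAPTCHA page can land here too,
        # which is where the false positives come from.
        return "blocked_by_status"
    # Same "OK" status on both sides: a partial block can only be found
    # by comparing the pages themselves.
    return "needs_page_comparison"
```

With this split, a site like the dan.me.uk example (error page, but no error status code) would fall into the last branch and be passed on to the structure or content comparison.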

For demo purposes, one can refer to the experimental code [5] and its log [6] (it isn't great code and is a bit old, but it was written to back up the first method, the structure of the website). One could also look at `Analyzer.py` [7,8], which contains the most recent and improved analysis logic. I hope to improve it with every passing day. I also plan to create a FAQ [9] page with excerpts of discussions and answers explaining why a given approach was taken.

Thanks,
Apratim
(irc: _ranchak_)

** Looking forward to suggestions and comments on how to improve this. Pointers to materials such as research papers in this domain would also be helpful. **

References:
[1] https://hub.osc.dial.community/t/tor-project-alexa-top-sites-captcha-and-block-monitoring/2552
[2] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021#updated-logic
[3] http://people.cs.georgetown.edu/~wzhou/publication/senser-acsac13.pdf
[4] https://seclab.cs.georgetown.edu/msherr/
[5] https://github.com/Hackhard/Fetcher/blob/b9f2fa8d09061862cf954537cbaad7921ddb3d89/status%20code/test_run4/tr.py
[6] https://raw.githubusercontent.com/Hackhard/Fetcher/main/status%20code/test_run4/tr_bash_output
[7] Consensus_lite branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/consensus_lite/src/captchamonitor/core/analyzer.py
[8] Master branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/master/src/captchamonitor/core/analyzer.py
[9] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021/Faqs