-------- Original Message --------
Subject: [icfp-active] Griffin's ICFP February
Date: 2016-03-03 13:50
From: Griffin Boyce griffin@cryptolab.net
To: ICFP Active icfp-active@opentechfund.org
Hello all,
In February, I fought logistics and small delays, but got some interesting initial results from Russia. Ultimately, I was not able to gather enough data to complete a paper before the PETS deadline. However, I have proposed a talk for the HOPE conference in NYC this summer [1]. My co-author, Jeff Landale, and I will continue working on a paper with the goal of submitting it to USENIX FOCI.
Initial results from Russia have been surprising. Of the Alexa top 1M sites, 153,000 seem to be blocked. I've run the test twice and will re-run it from another location, but if the block percentage really is ~15.3%, that is quite extreme! I had guessed that around 2,000 sites would be blocked.
Towards the end of the month I prepared for and attended the Internet Freedom Festival in Valencia, Spain. As part of the paper-writing process, I wrote up a first draft of the project methodology, which appears below. Please let me know if you see any gaps or have additional tests to suggest.
Methodology:
# Website Diagnostics
The sites to be tested are divided into segments of 10,000 websites, plus custom tests for sites focused on circumvention. This way, if there’s an error during the test, it’s easier to perform a re-test without having to test over a million sites (again).
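The segmenting step could be sketched as below. This is a minimal illustration only: the function name `segment_sites` and the placeholder domains are assumptions made for the example, not part of the actual test scripts.

```python
from typing import Iterator, List


def segment_sites(sites: List[str], segment_size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive segments of at most `segment_size` sites."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    for start in range(0, len(sites), segment_size):
        yield sites[start:start + segment_size]


# Hypothetical input: 25,000 placeholder domains standing in for the real list.
sites = [f"site{i}.example" for i in range(25_000)]
segments = list(segment_sites(sites))

# If a run fails partway, only the affected segment (e.g. segments[1])
# needs to be re-tested; the other segments are untouched.
print(len(segments), [len(s) for s in segments])
```

Keeping each segment as an independent list means a failed run only costs one segment's worth of re-testing.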
The first test is a simple check to see whether the site is available (HTTP status code 200) or down. This test is performed from both the target region and a presumed-uncensored area (typically Germany or the US) to ensure that the site is not simply down for maintenance. In the process, I determine whether it’s a DNS block or an IP block. Once I have a list of sites that did not return code 200, I dig in deeper. I take a screenshot of all sites using EyeWitness. Then I check whether the site is simply being blocked or is serving a proper block page. If the block pages follow a common format *and* give the block reason, I scrape the reasons and map them to the domains. The stated reasons can differ in interesting ways from what a site actually contains. For example, in Indonesia SMBC Comics is blocked for ‘pornography and promoting bigotry and sectarian violence,’ when the site is not pornographic and contains minimal fantasy violence.
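The two-location comparison described above might look roughly like this. It is a sketch under assumptions: the function name, the label strings, and the convention that `None` means "no response" are all invented for illustration, and the real checks are run through the tools listed under Tests Performed.

```python
from typing import Optional


def classify_site(target_status: Optional[int], control_status: Optional[int]) -> str:
    """Compare one site's result from the target region and the control region.

    Each argument is an HTTP status code, or None if no response was received.
    """
    if control_status != 200:
        # Not available from the presumed-uncensored location either, so the
        # site may simply be down; it cannot be counted as blocked.
        return "down_or_unknown"
    if target_status == 200:
        return "available"
    # Available from the control location but not from the target region.
    return "needs_deeper_tests"


# Hypothetical results for four placeholder domains.
results = {
    "a.example": classify_site(200, 200),
    "b.example": classify_site(None, 200),
    "c.example": classify_site(None, None),
    "d.example": classify_site(403, 200),
}
```

Note that a block page served with status 200 would pass this check, so a status comparison alone is only a first pass.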
If the site is inaccessible for whatever reason, I check for TCP reset packets, SSL/TLS errors, and forced downloads, and for some sites I also check for MitM. The MitM test is performed only for sites where a user would typically be trying to download something, such as Lantern or Tor Browser. In cases of possible MitM, I collect PCAP data (packet captures) and attempt to download the relevant parts of the website to compare with a non-suspicious version of the website. While some regional differences may exist -- news in different languages, for example -- differences in executables or malware injections are indicators of active man-in-the-middle attacks.
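One simple way to compare a downloaded executable against a non-suspicious copy is to hash both and compare the digests. The sketch below assumes that approach; the function names and the in-memory stand-in data are hypothetical, and this is not necessarily how the custom comparison scripts work.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large executables need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def downloads_differ(target: BinaryIO, control: BinaryIO) -> bool:
    """True if the copy fetched in the target region differs from the control copy."""
    return sha256_of(target) != sha256_of(control)


# Hypothetical stand-ins for two downloads of the same installer.
control_copy = b"installer-bytes"
same = downloads_differ(io.BytesIO(control_copy), io.BytesIO(control_copy))
tampered = downloads_differ(io.BytesIO(control_copy + b"injected"), io.BytesIO(control_copy))
```

A mismatch is a prompt for closer inspection (alongside the packet captures) rather than proof of tampering on its own, since the two locations could legitimately be served different files.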
Once all website tests are performed, I categorize the sites by cross-referencing Alexa data. Categorizing website content is one of the harder problems at scale, so relying on Alexa for categorization appears to be the best solution. For example, it would not be feasible for me to manually categorize the 153,000 websites blocked from Russia.
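The cross-referencing step can be thought of as a lookup followed by a tally. The sketch below assumes the category data has already been loaded into a plain domain-to-category mapping; the domains, category names, and the "uncategorized" fallback are all hypothetical and do not reflect Alexa's actual data format.

```python
from collections import Counter
from typing import Dict, Iterable


def count_blocked_by_category(blocked: Iterable[str], categories: Dict[str, str]) -> Counter:
    """Tally blocked domains per category; unknown domains count as 'uncategorized'."""
    return Counter(categories.get(domain, "uncategorized") for domain in blocked)


# Hypothetical data: neither the domains nor the category names come from Alexa.
categories = {"news.example": "News", "games.example": "Games", "daily.example": "News"}
blocked = ["news.example", "daily.example", "unknown.example"]
tally = count_blocked_by_category(blocked, categories)
```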
# Tor & Lantern tests
Once all websites are categorized, I check whether Tor Browser and Lantern are usable within the country. Lantern has a built-in test for this purpose. For Tor Browser, I use a modified version of Philipp Winter’s ExitMap and feed it an up-to-date list of Tor nodes. It then cycles through thousands of individual Tor circuits to check whether every entry node is available from the test location. From there, I determine whether all of Tor is blocked or only some nodes. If only some guard nodes are blocked, this may mean (depending on the age of the unblocked nodes relative to the blocked nodes) either that the unblocked guard nodes were set up to track users, or that the system for blocking Tor nodes hasn’t been updated since those nodes were created. I then test whether different kinds of bridges are blocked within the country by making circuits using bridges (including obfs2/obfs3/obfs4/fte/scramblesuit and standard bridges). For some locations, we will also test flashproxy- and snowflake-backed Tor connectivity. All Tor-related tests are run twice, with pre-selected nodes where possible, to minimize errors and ensure test accuracy.
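The "all of Tor or only some nodes" determination can be illustrated with a small summary over per-node reachability results. This sketch does not use ExitMap's API; the function names, verdict labels, and node identifiers are hypothetical, and requiring a node to fail in both runs before calling it blocked is an assumption about how the two runs might be combined.

```python
from typing import Dict, List, Tuple


def summarize_reachability(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Turn a node -> reachable? mapping into an overall verdict plus the blocked nodes."""
    if not results:
        raise ValueError("no results to summarize")
    blocked = sorted(node for node, reachable in results.items() if not reachable)
    if not blocked:
        return "none_blocked", blocked
    if len(blocked) == len(results):
        return "all_blocked", blocked
    return "some_blocked", blocked


def blocked_in_both_runs(first: Dict[str, bool], second: Dict[str, bool]) -> List[str]:
    """Assumption: treat a node as blocked only if it was unreachable in both runs."""
    return sorted(n for n in first if not first[n] and not second.get(n, True))


# Hypothetical results from two runs against three placeholder guard nodes.
run1 = {"guardA": True, "guardB": False, "guardC": False}
run2 = {"guardA": True, "guardB": False, "guardC": True}
verdict, blocked = summarize_reachability(run1)
confirmed = blocked_in_both_runs(run1, run2)
```

Here `guardC` fails only in the first run, so it would be flagged for another look rather than counted as blocked.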
# Tests Performed
OONI: website accessibility, HTTP error codes, TCP reset packets
ExitMap (hacked/customized): checks Tor network node connectivity
Tor daemon: test bridge connectivity
EyeWitness: screenshots
Custom nmap script: to get data on SSL/TLS certificates
Custom scripts: to compare SSL/TLS certificates between test locations, compare websites for MitM determination, collecting blocked website HTML, collecting blocked website reasons (as needed), and formatting collected data
# Websites Tested
The top one million sites, as determined by Alexa’s traffic analyses. This includes a very diverse group of religious and LGBT websites. In addition, I am testing some circumvention websites. As mentioned above, these are grouped into segments so that re-tests can be performed without having to re-test everything.
Box status:
RU =)
UA =|
EG =)
TN =( -- coordinating local tests instead
KG =)
KZ =( -- ISP set up Crunchbang OS instead of Debian
[1] Due to this, I will not be attending PETS.