Griffin's ICFP February - tor-project

3 Mar 2016


-------- Original Message --------
Subject: [icfp-active] Griffin's ICFP February
Date: 2016-03-03 13:50
From: Griffin Boyce <griffin@cryptolab.net>
To: ICFP Active <icfp-active@opentechfund.org>
Hello all,
In February, I fought logistics and small delays, but got some 
interesting initial results from Russia.  Ultimately, I was not able to 
gather enough data to complete a paper before the PETS deadline.  
However, I have proposed a talk for the HOPE conference in NYC this 
summer [1].  My co-author, Jeff Landale, and I will continue working on 
a paper with the goal of submitting for USENIX:FOCI.
Initial results from Russia have been surprising.  Of the Alexa top 1M 
sites, 153,000 seem to be blocked.  I've run the test twice, and will 
re-run the test from another location, but if the block percentage 
really is ~15.3%, then that is quite extreme!  I was guessing that 
around 2000 sites would be blocked.
Towards the end of the month I prepared for and attended the Internet 
Freedom Festival in Valencia, Spain.  As part of the paper-writing 
process, I wrote up a first draft of the project methodology, which 
appears below.  Please let me know if you see any gaps or have 
additional tests to suggest.
Methodology:
# Website Diagnostics
The sites to be tested are divided into segments of 10,000 websites, 
plus custom tests for sites focused on circumvention.  This way, if 
there’s an error during the test, it’s easier to perform a re-test 
without having to test over a million sites (again).
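A minimal sketch of the segmenting step described above, assuming the site list is already loaded as a Python sequence of domain names. The function name `segment` and the example domains are hypothetical and only for illustration.

```python
from typing import Iterator, List, Sequence


def segment(sites: Sequence[str], size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive slices of `sites`, each at most `size` entries long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(sites), size):
        yield list(sites[start:start + size])


if __name__ == "__main__":
    # Placeholder domains; a real run would read the ranked site list instead.
    sites = [f"site{i}.example" for i in range(25_000)]
    for index, chunk in enumerate(segment(sites)):
        print(index, len(chunk), chunk[0])
```

If a test fails partway through, only the affected segment needs to be re-run rather than the full list.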
The first test is a simple check to see if the site is available (code 
200) or if it’s down.  This test is performed from both the target 
region and a presumed-uncensored area (typically Germany or the US) to 
ensure that the site is not simply down for maintenance.  (In the 
process, I determine whether it’s a DNS block or an IP block.)  Once I 
have a list of sites that did not return code 200, I dig in deeper.  I 
take a screenshot of all sites using EyeWitness.  Then I check whether 
the site is blocked outright or is serving a proper block page.  If the block pages 
follow a common format *and* give the block reason, I scrape the reasons 
and map them to the domains.  There can be interesting differences found 
here. For example, in Indonesia SMBC Comics is blocked for ‘pornography 
and promoting bigotry and sectarian violence,’ when the site is not 
pornographic and contains minimal fantasy violence.
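One way to picture the comparison between the target region and the presumed-uncensored vantage point is the sketch below. It is an assumption, not the actual test logic: the `Observation` record, the `classify` function, the decision order, and the label strings are all hypothetical, and real DNS or IP block detection would need more evidence than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What one vantage point saw for one site (hypothetical record layout)."""
    resolved: bool          # did the DNS lookup return an address?
    connected: bool         # did the TCP connection succeed?
    status: Optional[int]   # HTTP status code, or None if there was no response


def classify(target: Observation, control: Observation) -> str:
    """Label a site by comparing the target region against the control."""
    if control.status != 200:
        # Unreachable from the control too: likely down, not censored.
        return "down_or_unknown"
    if target.status == 200:
        return "accessible"
    if not target.resolved:
        return "possible_dns_block"
    if not target.connected:
        return "possible_ip_block"
    # Connected but not code 200: could be a block page, needs a closer look.
    return "needs_inspection"
```

For example, `classify(Observation(True, False, None), Observation(True, True, 200))` returns `"possible_ip_block"`, since the name resolved in the target region but the connection failed while the control saw code 200.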
If the site is inaccessible for whatever reason, I check for TCP reset 
packets, check if there is an SSL/TLS error, check for forced downloads, 
and for some sites check for MitM.  The MitM test is only performed for 
sites where a user would typically be trying to download something, such 
as Lantern or Tor Browser.  In cases of possible MitM, I collect PCAP 
data (packet captures) and attempt to download the relevant parts of the 
website to compare with a non-suspicious version of the website.  While 
some regional differences may exist -- news in different languages, for 
example -- differences in executables or malware injections are 
indicators of active man-in-the-middle attacks.
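The comparison of downloaded files against a non-suspicious copy could look roughly like the sketch below. It assumes both copies have already been fetched and are held in memory as bytes keyed by path; the function names and example paths are hypothetical. Hashing only flags that a file differs; deciding whether a difference is benign (a localized page) or hostile (a modified executable) still takes manual review.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_files(reference: Dict[str, bytes],
                    observed: Dict[str, bytes]) -> List[str]:
    """List paths whose observed content is missing or differs from the reference."""
    suspicious = []
    for path, ref_bytes in reference.items():
        obs_bytes = observed.get(path)
        if obs_bytes is None or sha256_hex(obs_bytes) != sha256_hex(ref_bytes):
            suspicious.append(path)
    return sorted(suspicious)


if __name__ == "__main__":
    # Placeholder contents standing in for real downloads.
    reference = {"installer.exe": b"original build", "index.html": b"<html>en</html>"}
    observed = {"installer.exe": b"tampered build", "index.html": b"<html>en</html>"}
    print(differing_files(reference, observed))
```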
Once all website tests are performed, I categorize the sites by 
cross-referencing Alexa data.  Categorizing website content is one of 
the harder problems at scale, so relying on Alexa for categorization 
appears to be the best solution.  For example, it would be impossible 
for me to manually categorize the 153,000 websites that appear to be 
blocked from Russia.
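As an illustration of the cross-referencing step, here is a small sketch that tallies blocked domains by category. It assumes the category data has already been exported into a plain domain-to-category mapping; that mapping, the function name, and the example categories are hypothetical and do not reflect any particular Alexa data format.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_categories(blocked: Iterable[str],
                     categories: Dict[str, str]) -> Counter:
    """Count blocked domains per category; unknown domains go to 'uncategorized'."""
    return Counter(categories.get(domain, "uncategorized") for domain in blocked)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    categories = {"news.example": "News", "games.example": "Games"}
    blocked = ["news.example", "games.example", "mystery.example"]
    for category, count in tally_categories(blocked, categories).most_common():
        print(category, count)
```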
# Tor & Lantern tests
Once all websites are categorized, I check whether or not Tor Browser 
and Lantern are usable within the country.  Lantern has a built-in test 
for this purpose.  For Tor Browser, I use a modified version of Philipp 
Winter’s ExitMap, and feed it an up-to-date list of Tor nodes.  It then 
cycles through thousands of individual Tor circuits to check whether 
every entry node is available from the test location.  From there, I 
determine whether all of Tor is blocked, or only some nodes.  If only 
some guard nodes are blocked, this may mean (depending on the age of the 
unblocked nodes relative to the blocked nodes) either that the unblocked 
guard nodes were set up to track users, or that the system for blocking 
Tor nodes hasn’t been updated since those nodes were created.  I 
then test whether different kinds of bridges are blocked within the 
country by making circuits using bridges (including 
obfs2/obfs3/obfs4/fte/scramblesuit and standard bridges).  For some 
locations, we will also test flashproxy- and snowflake-backed Tor 
connectivity.  All Tor-related tests are run twice, with pre-selected 
nodes where possible, to minimize errors and ensure test accuracy.
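The "all of Tor or only some nodes" decision, together with the two-run check, can be sketched as follows. This is not ExitMap code: it assumes each run has already been reduced to a mapping from node fingerprint to a reachable/unreachable flag, and the function names, label strings, and fingerprints are hypothetical.

```python
from typing import Dict, List


def summarize_reachability(results: Dict[str, bool]) -> str:
    """Summarize one run: are all, some, or none of the tested nodes blocked?"""
    if not results:
        raise ValueError("no nodes were tested")
    reachable = sum(1 for ok in results.values() if ok)
    if reachable == 0:
        return "all_blocked"
    if reachable == len(results):
        return "none_blocked"
    return "partially_blocked"


def disagreements(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> List[str]:
    """Return nodes tested in both runs whose results differ between them."""
    shared = run_a.keys() & run_b.keys()
    return sorted(node for node in shared if run_a[node] != run_b[node])


if __name__ == "__main__":
    # Placeholder fingerprints for illustration only.
    first = {"AAAA": True, "BBBB": False, "CCCC": False}
    second = {"AAAA": True, "BBBB": True, "CCCC": False}
    print(summarize_reachability(first))
    print(disagreements(first, second))
```

Nodes returned by `disagreements` are the ones worth re-testing before drawing conclusions about selective blocking.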
# Tests Performed
OONI: website accessibility, HTTP error codes, TCP reset packets
ExitMap (hacked/customized): checks Tor network node connectivity
Tor daemon: test bridge connectivity
EyeWitness: screenshots
Custom nmap script: to get data on SSL/TLS certificates
Custom scripts: compare SSL/TLS certificates between test locations, 
compare websites for MitM determination, collect blocked website 
HTML, collect the stated block reasons (as needed), and format 
collected data
# Websites Tested
The top one million sites, as determined by Alexa’s traffic analyses.  
This includes a very diverse group of religious and LGBT websites. In 
addition, I am testing some circumvention websites.  As mentioned above, 
these are grouped into segments so that re-tests can be performed 
without having to re-test everything.
Box status:
RU =)
UA =|
EG =)
TN =( -- coordinating local tests instead
KG =)
KZ =( -- ISP set up Crunchbang OS instead of Debian
[1] Due to this, I will not be attending PETS.
-- 
“We have to create; it is the only thing louder than destruction.”
~ Andrea Gibson