Hi Pierre,
thanks for this proposal. Gunes has already raised some good points and I won't repeat them here. This is part one of my feedback as I need a bit more time to think about the code example section.
Pierre Laperdrix:
Hi Tor Community,
My name is Pierre and I'm really interested in participating in a GSoC project this year with the Tor organization. Since I've been working on browser fingerprinting for the past two years, I'd love to build a Panopticlick-like website to improve the fingerprinting defenses of the Tor browser.
I've included my proposal below in case anyone has ideas or suggestions, especially on the technical section or on some of the open questions that I have. (It should be noted that the Torprinter name is subject to change.)
Summary - The Torprinter project: a browser fingerprinting website to improve Tor fingerprinting defenses

The capabilities of browser fingerprinting as a tool to track users online have been demonstrated by Panopticlick and other research papers since 2010. The Tor community is fully aware of the problem, and the Tor browser has been modified to follow the "one fingerprint for all" approach. Spoofing HTTP headers, removing plugins, including bundled fonts, preventing canvas image extraction: these are a few examples of the progress made by Tor developers to protect their users against such threats. However, due to the constant evolution of the web and its underlying technologies, it has become a true challenge to always stay ahead of the latest fingerprinting techniques.

I'm deeply interested in privacy and I've been studying browser fingerprinting for the past two years. Eighteen months ago I launched the AmIUnique.org website to investigate the latest fingerprinting techniques. Collecting data on thousands of devices is one of the keys to understanding and countering the fingerprinting problem. For this Google Summer of Code project, I propose to develop the Torprinter website, which will run a fingerprinting test suite and collect data from Tor browsers to help developers design and test new defenses against browser fingerprinting. For users, the website will be similar to AmIUnique or Panopticlick: they will get a complete summary with statistics after the test suite has been executed. It can be used to test new fingerprinting protections as well as to make sure that fingerprinting-related bugs have been correctly fixed with specific regression tests. The expected long-term impact of this project is to reduce the differences between Tor users and reinforce their privacy and anonymity online. In a second step, the website could open its doors to more browsers so that it could become a platform where vendors can implement significant changes in their browsers with regard to privacy and see the impact first-hand on the website. With the strong expertise I have acquired on the fingerprinting subject and the experience I have gained by developing the AmIUnique website, I believe I'm fully qualified to see such a project through to completion.
Website features

The main feature of the website is to collect a set of fingerprintable attributes on the client and calculate the distribution of values for each attribute, like Panopticlick or AmIUnique. The set of tests would not only include known fingerprinting techniques but also ones developed specifically for the Tor browser. The second main feature of the website would be for Tor users to check how close their current fingerprint is to the ideal unique fingerprint that most users should share. A list of actions should be added to help users configure their browser to reach this ideal fingerprint.
We might want to think about that ideal-fingerprint idea a bit. I think there is no such thing even for Tor Browser users, as we are e.g. rounding the content window size to a multiple of 200x100 for each user. Thus, we have at least one fingerprintable attribute where we say "you are good if you have one out of a bunch of possible values". The same holds for our security slider, which basically partitions the Tor Browser users. We could revisit these design decisions, and I am especially interested in getting data that is backing/not backing our decisions regarding them. Nevertheless, I assume we won't always be able to put users into just one bucket per attribute due to usability issues. And this, in turn, does not make the idea of helping users configure their browser any easier.
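To make that point a bit more concrete, here is a minimal sketch of what a per-attribute check could look like once an attribute has a set of acceptable values rather than a single ideal one. Only the 200x100 rounding reflects current Tor Browser behaviour; the class and method names are made up:

    import java.util.Set;

    public class AttributeBuckets {

        // Acceptable if both dimensions are non-zero multiples of the rounding
        // steps used for the content window (hypothetical helper).
        public static boolean isAcceptableWindowSize(int width, int height) {
            return width > 0 && height > 0 && width % 200 == 0 && height % 100 == 0;
        }

        // For attributes with a fixed set of expected values (e.g. a spoofed
        // User-Agent), the check degenerates to simple set membership.
        public static boolean isAcceptable(String value, Set<String> acceptableValues) {
            return acceptableValues.contains(value);
        }

        public static void main(String[] args) {
            System.out.println(isAcceptableWindowSize(1000, 600)); // true
            System.out.println(isAcceptableWindowSize(1003, 600)); // false: likely a manually resized window
        }
    }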
The third main feature would be an API for automated tests, as detailed on this page: https://people.torproject.org/~boklm/automation/tor-automation-proposals.htm... This would enable automatic verification of Tor protection features with regard to fingerprinting. When a new version is released, the output of specific tests would be compared against previous versions to check for any changes or regressions (a rough sketch of such a check follows the open questions below).

The fourth main feature I'd like to include is a complete stats page where the user can go through every attribute and filter by OS, browser version and more. The inclusion of additional features that go beyond the core functionalities of the site should be driven by the needs of the developers and the Tor community. Still, a lot of open questions remain that should be addressed during the bonding period to define precisely how each of these features should ultimately work. Some of these open questions include:
- How closed/private/transparent should the website be about its tests and the results? Should every test be clearly indicated on the webpage with its own description, or should some tests stay hidden to prevent spreading usable tests to fingerprint Tor users?
- Should a statistics page exist? Should we give read access to the database to every user (e.g. in the form of a REST API or another solution)?
- Where should the data be stored? How long should the data be kept? If tests are performed per version, should the data from an old TBB version be removed? Should the data be kept a week, a month or more?
I am not sure about how long the data should be kept. It probably depends on what kind of data we are talking about (e.g. aggregate or not). I think, though, that data we collected with Tor Browser A should not get deleted just because Tor Browser A+1 got released. In fact, we might want to keep that data, especially if we want to give users a guide on how to get a "better" fingerprint. But even if not, we might want to have this data to measure, e.g., whether a fix for a particular fingerprinting vector had an impact and, if so, which one.
- How should new tests be added: a pull request? A form where submissions are reviewed by admins? A link to the Tor tracker?
From a Tor perspective, opening a ticket and posting the test there, or ideally having a link to a test in the ticket that fixes the fingerprinting vector, seems like the preferred solution. I'd like to avoid the situation where tests get added to the system without us knowing about it, and we then have to deal with users who are scared because of the new results. So, yes, some review should be involved here.
- Should the website only be accessible through Tor?
I don't think so. I am fine with Chrome/IE etc. users trying to see how they fare on that test. This not-closing-down right from the start, and proper communication about it, might be important if we want to create a better test platform not only for Tor Browser but also for other vendors, as you alluded to above (which is a good idea, as it encourages collaboration and a better understanding of the fingerprinting problem in general).
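Coming back to the automation API for regression tests mentioned above, here is a rough, purely hypothetical sketch of the kind of comparison such a check could run for each new release. The attribute names and values are made up, and how the data gets fetched from the website is left out on purpose:

    import java.util.HashMap;
    import java.util.Map;

    public class FingerprintRegressionCheck {

        // Compare the attribute values reported by a new Tor Browser build against
        // the values recorded for the previous release and flag anything that changed.
        public static Map<String, String> diff(Map<String, String> previous, Map<String, String> current) {
            Map<String, String> regressions = new HashMap<>();
            for (Map.Entry<String, String> e : previous.entrySet()) {
                String newValue = current.get(e.getKey());
                if (newValue == null || !newValue.equals(e.getValue())) {
                    regressions.put(e.getKey(), e.getValue() + " -> " + newValue);
                }
            }
            return regressions;
        }

        public static void main(String[] args) {
            Map<String, String> previousRelease = new HashMap<>();
            previousRelease.put("canvasExtraction", "blocked");
            previousRelease.put("pluginCount", "0");

            Map<String, String> newRelease = new HashMap<>();
            newRelease.put("canvasExtraction", "allowed"); // hypothetical regression
            newRelease.put("pluginCount", "0");

            System.out.println(diff(previousRelease, newRelease));
            // prints: {canvasExtraction=blocked -> allowed}
        }
    }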
Technical choices

In my opinion, the website must be accessible and modular. It should have the ability to cope with a large number of connections and a large amount of data. With this in mind, and with the experience gained from developing AmIUnique, I plan on using the Play framework with a MongoDB database. Developing the website in Java opens the door to many developers to make the website better and more robust after its initial launch, since it is one of the most used programming languages in the world. On the storage and statistics side, MongoDB is a good fit because it is now a mature technology that can scale well with large amounts of data and many connections. Moreover, while the use of SQL databases for AmIUnique proved to be really powerful, maintenance after the website was launched became a tedious task, especially when modifying the underlying model of a fingerprint to collect new attributes. A more flexible and modular database seems a better fit for maintenance and for adding/removing tests.
If we look at the Tor side, I guess we have more experience with Python code (which includes me) than with Java. Thus, by using Python it might be easier for us to maintain the code in the longer run. That said, I am fine with the decisions as you made them, especially if you are already familiar with all these tools/languages. And, hey, we always encourage students to stay connected to us and get even deeper involved after GSoC has ended. So, this might then actually be an area for you... ;)
One thing I'd like you to think about, though, is that we have guidelines for developing services that might be running on Tor project infrastructure one day:
https://trac.torproject.org/projects/tor/wiki/org/operations/Guidelines
Not sure if the tools you had in mind above fit the requirements outlined there. If not, we should try to fix that. (Thanks to Karsten for pointing that out)
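Just to illustrate the flexibility argument you make for MongoDB, here is a minimal sketch (assuming the MongoDB Java driver; database, collection, field names and values are all made up): each fingerprint would be a single document, so adding a new attribute to the test suite later just means adding a key, with no migration of the records collected before the change, and per-attribute distributions can be computed with a server-side aggregation.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import org.bson.Document;

    import java.util.Arrays;

    public class FingerprintStore {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> fingerprints =
                    client.getDatabase("torprinter").getCollection("fingerprints");

            // One fingerprint per document; new attributes can be appended later
            // without touching records that were collected before the change.
            fingerprints.insertOne(new Document("tbbVersion", "5.5")
                    .append("attributes", new Document("userAgent", "Mozilla/5.0 ...")
                            .append("timezone", "UTC")
                            .append("canvasExtraction", "blocked")));

            // Distribution of values for one attribute, computed server-side.
            for (Document doc : fingerprints.aggregate(Arrays.asList(
                    Aggregates.group("$attributes.timezone", Accumulators.sum("count", 1))))) {
                System.out.println(doc.toJson());
            }

            client.close();
        }
    }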
Estimated timeline

You will find below a rough estimate of the timeline for the three months of the GSoC.
Community bonding period - Discuss with the mentors and the community the set of features that should be included in the very first version of the website and clarify the open questions raised in one of the previous paragraphs.
23 May - 27 June : Development of the first version of the website with the core features
Week 1 - Development of the first version of the fingerprinting script with the core set of attributes. Special attention will be given so that it is fully compatible with the most recent version of the Tor browser (and older ones too).
Week 2 - Start developing the front-end and the back-end to store fingerprints, with a page containing data on your current fingerprint (try adding a view to see how close/far you are from the ideal fingerprint).
Week 3 - Start developing the statistics page with the necessary visualization for the users. Modification of the back-end to improve statistics computation and lessen the server load.
Week 4 - Finishing the front-end development and refining the statistics page to surface the most relevant information. Adding and testing an API to support automated tests.
Week 5 - Finishing the first version so that it is ready for deployment. Start developing additional features requested by the community (REST API? account management?)
27 June - Mid July : Deployment of the first version online for a beta-test with bug fixing. Finishing development of additional features requested by the mentors/community. Defining the list of new features for the second version.
Mid July - 23rd August : Adding a system to make the website as flexible as possible so that tests can be added/removed easily (a pull-request system? a test submission form where admins review tests before they are included in the test suite?). Developing additional features for the website. Making sure that the website can be opened to more browsers (work done at design time to support any browser will be tested here). Bug fixing.
That looks like a good timeline estimation to me.
That's it for the first feedback,
Georg
[snip]