Hello everyone!
Thanks for the feedback! Please see my inline comments.
Santiago:
I wonder if you could share a link to the source code for people to give you feedback/audit or implement fixes/features themselves.
I will be posting the source code once I'm a little further along.
Roger:
(1) Looking at the DOM tree reminds me of Micah's paper from a few years back: "Validating Web Content with Senser" https://security.cs.georgetown.edu/~msherr/pubs.php
This is very interesting! Earlier on I had considered using Merkle Trees for this project as well, but we are currently looking at other, more suitable options.
Roger:
and (2) Be sure to check out the recent papers by the Berkeley group on this area, e.g. the "do you see what I see" paper and more recent ones: https://www1.icsi.berkeley.edu/~sadia/
Yes, I have read many of these! The "Do you see what I see" paper was definitely one of the inspirations for this, as well as some results of my previous work.
Gunner:
It may or may not be of any use, but here is content from an etherpad that a number of Tor folks worked on a while back regarding "tor friendly sites".
Thanks for sending this! The results of this project will definitely supplement this etherpad with things that we find are broken "in the wild," whether as a foreseen result of the Tor Browser's design choices or as an unforeseen consequence.
grarpamp:
You may be interested in this coupled pair of projects that may be studying a similar question from a different perspective. Note their needs list which might include integrating elements of your platform, OONI, etc.
Thanks for bringing these to our attention! These are certainly interesting projects that I think could benefit from the findings of our work, when we get there. We may be able to supplement the lists that are already built up with information about services that don't outright block Tor, but that make it difficult to use their Web service anonymously by relying on functionality that is dangerous to anonymity and blocked in Tor Browser.
Georg:
What are your criteria for saying "this is broken in Tor Browser" vs. "this is just rendered slightly different in Tor Browser"? For instance I suspect that you'd even get different ground-truths depending on the major Firefox version you use (like Firefox 65 vs. Firefox 60 ESR), yet you would hardly say "This is okay in Firefox 65 but broken in Firefox 60 ESR". Or maybe there *are* cases where you would say so? What I am saying is: mapping the creation of the DOM tree and logging JS execution might be a good means for your goal (I am not sure yet) but it does not seem to be sufficient to reach it.
This is a legitimate concern, and my e-mail is so late because I've been thinking it over. The main observation is that, as far as I know, the Tor Browser is a modification of a Firefox ESR release, not an entirely stand-alone browser. The goal, then, is to take as ground truth the version of Firefox that is as close to it as we can get.
The Tor Browser starts as a Firefox ESR release and then has changes applied to it (patches, extensions, etc.). If we use the Firefox ESR release associated with the most current Tor Browser, then we can count that as "ground truth," since the comparison should isolate only the changes made to Firefox to turn it into the Tor Browser.
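To make the comparison a bit more concrete, here is a minimal sketch of what diffing the two DOM trees could look like. It is only an illustration and not our actual tooling: the names `PathCollector`, `dom_paths`, and `dom_diff` are hypothetical, and it assumes we already have the serialized HTML of the same page from the Firefox ESR baseline and from the Tor Browser. It reduces each page to a multiset of root-to-element tag paths and reports what the Tor Browser is missing or has extra.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Counts the root-to-element tag path of every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return a Counter of tag paths for one serialized page (hypothetical helper)."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def dom_diff(baseline_html, tb_html):
    """Compare the Firefox ESR baseline against the Tor Browser snapshot."""
    base, tb = dom_paths(baseline_html), dom_paths(tb_html)
    # Counter subtraction keeps only positive counts.
    return {"missing_in_tb": base - tb, "extra_in_tb": tb - base}


if __name__ == "__main__":
    baseline = "<html><body><div><canvas></canvas><p>hi</p></div></body></html>"
    tor = "<html><body><div><p>hi</p></div></body></html>"
    print(dom_diff(baseline, tor))
```

A structural diff like this only says that something differs. Whether a difference counts as "broken" still needs criteria on top of it, which is exactly the open question above.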
Georg:
Secondly, I am wondering how you plan to deal with the fact that websites show different content if the logic behind them assumes you come from a different country/region. How does that get incorporated into your ground-truth, for example?
Our plan is to send the Firefox "ground-truth" collection through Tor as well, and specifically through the same exit node that the Tor Browser uses. This way we can isolate the variable to the differences between the browsers, rather than network location or other factors. In addition, we are developing a method for determining whether differing content is dynamically generated (and therefore different on every load) or actually broken.
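As a rough illustration of the dynamic-versus-broken question, one possible heuristic (an assumption on my part, not the method we are developing) is to load the page several times in the baseline browser: anything that changes between baseline loads is treated as dynamic, and only things that are stable in the baseline but absent from the Tor Browser load are flagged. The function name `classify` and the idea of representing a page load as a set of "features" (for example DOM paths or script URLs) are hypothetical.

```python
def classify(baseline_loads, tb_load):
    """Split page features into dynamic, possibly broken, and Tor-Browser-only.

    baseline_loads: iterable of feature collections, one per repeated
                    Firefox ESR load of the same page through the same exit.
    tb_load:        feature collection from one Tor Browser load.
    """
    loads = [set(load) for load in baseline_loads]
    if not loads:
        raise ValueError("need at least one baseline load")
    tb = set(tb_load)

    stable = set.intersection(*loads)  # present in every baseline load
    seen = set.union(*loads)           # present in at least one baseline load

    return {
        "dynamic": seen - stable,          # varies even without Tor Browser
        "possibly_broken": stable - tb,    # stable in baseline, absent in TB
        "only_in_tb": tb - seen,           # never seen in the baseline
    }


if __name__ == "__main__":
    loads = [
        {"html/body/div", "html/body/canvas", "ad-123"},
        {"html/body/div", "html/body/canvas", "ad-456"},
    ]
    print(classify(loads, {"html/body/div", "ad-789"}))
```

With only a few baseline loads, rarely changing content can still be misread as stable, so anything in "possibly_broken" would be a candidate for closer inspection and not a verdict.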
I hope this addresses all of the concerns. If not, or if there is more feedback, please let me know!
Thanks,
Kevin