Hello tor-dev!
My name is Kevin and I'm a PhD student at NYU. Recently I've been working on creating a "Tor Friendliness Scanner" (TFS), or a scanner that will measure what features of a given website are broken (non-functional) when accessed on the Tor Browser (TB), along with actionable suggestions to improve it. In order to do this, we first must get an approximation of ground-truth data of how a given website should work. We then need to compare it to how the website works on the TB to determine any changes.
To generate a method of determining ground-truth, we decided to modify* the Firefox (FF) browser to log all of the steps of the creation of the Content Tree (also called the DOM tree), and to log the execution of all JavaScript functions (currently underway). We then will apply these changes to the TB as well, and run a scan of popular Web sites using the modified FF and the modified TB on all three of the TB security slider settings. We will then compare the resulting logs to determine where the tree creation processes differed* and why. These differences could potentially help us illuminate two things:
1. what functionality issues the Tor Browser encounters on popular Web sites, and 2. what threats (beyond metadata surveillance) the TB is protecting its users from in-the-wild.
As far as I have considered, this method seems to capture a lot, but it's far from complete. For one thing, it obviously won't detect any difference that's spawned from user interaction or input (such as a script launched by an OnClick event). However, it does seem to make automation of scanning for Tor Friendliness possible, and can allow for wide-scale use.
We have moved ahead with development (though have not yet finished it) and are (hopefully) very close to a working prototype. I was wondering if there was feedback on this method, or if anyone can consider an angle we have not that would either make the TFS more robust, easier to create, or both.
Thanks for your time and consideration!
Kevin
*Note 1: Unfortunately we cannot just rely on JavaScript for examining the content tree, since this needs to work on all 3 security settings of the TB's security slider, and the "safest" setting deactivates JavaScript by default on all Web pages.
*Note 2: There can be non-functional differences in Web pages, such as different ads showing or the display of the current time. We are working on methods to distinguish these from functional differences, such as using ad blacklists to determine if a given request or script is part of an ad, and ignoring it as part of the difference between the two trees.
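A minimal sketch of the log comparison and the ad filtering from Note 2, under stated assumptions: the log format (one `(path, tag, src)` tuple per inserted node), the `AD_HOSTS` set, and all function names are hypothetical stand-ins, not the TFS's actual format or blacklist.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for a real ad blacklist.
AD_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_ad(src):
    """Return True if the node's resource URL is on a blacklisted host."""
    if not src:
        return False
    host = urlparse(src).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def functional_nodes(log):
    """Drop ad-related nodes; keep the remaining (path, tag) pairs as a set."""
    return {(path, tag) for path, tag, src in log if not is_ad(src)}


def diff_logs(firefox_log, tor_log):
    """Return (nodes only in the Firefox log, nodes only in the TB log)."""
    ff = functional_nodes(firefox_log)
    tb = functional_nodes(tor_log)
    return sorted(ff - tb), sorted(tb - ff)


# Assumed log format: (path in tree, tag name, resource URL or None).
firefox_log = [
    ("/html/body/div[1]", "div", None),
    ("/html/body/canvas[1]", "canvas", None),
    ("/html/body/script[1]", "script", "https://ads.example.com/banner.js"),
]
tor_log = [
    ("/html/body/div[1]", "div", None),
    ("/html/body/script[1]", "script", "https://cdn.ads.example.com/other.js"),
]

missing_in_tb, extra_in_tb = diff_logs(firefox_log, tor_log)
print("Only in Firefox:", missing_in_tb)
print("Only in Tor Browser:", extra_in_tb)
```

Here the two ad scripts differ between the runs but are ignored, so the only reported difference is the canvas node that never appears in the TB log. A real comparison would likely need to match nodes more robustly than by exact path.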
Hi,
We have moved ahead with development (though have not yet finished it) and are (hopefully) very close to a working prototype. I was wondering if there was feedback on this method, or if anyone can consider an angle we have not that would either make the TFS more robust, easier to create, or both.
I wonder if you could share a link to the source code for people to give you feedback/audit or implement fixes/features themselves.
Cheers! -Santiago
On Mon, Mar 04, 2019 at 03:58:58PM -0500, Kevin Gallagher wrote:
To generate a method of determining ground-truth, we decided to modify* the Firefox (FF) browser to log all of the steps of the creation of the Content Tree (also called the DOM tree) [...] We have moved ahead with development (though have not yet finished it) and are (hopefully) very close to a working prototype. I was wondering if there was feedback on this method
Neat stuff!
(1) Looking at the DOM tree reminds me of Micah's paper from a few years back: "Validating Web Content with Senser" https://security.cs.georgetown.edu/~msherr/pubs.php
and (2) Be sure to check out the recent papers by the Berkeley group on this area, e.g. the "do you see what I see" paper and more recent ones: https://www1.icsi.berkeley.edu/~sadia/
--Roger
Hi Kevin,
It may or may not be of any use, but here is some content from an etherpad that a number of Tor folks worked on a while back regarding "Tor-friendly sites".
Sounds like you have a robust way of going about this, so this is provided as food for thought.
peace gunner
Designing a web site to be "Tor-friendly"
The following represent an initial set of guidelines to help web site publishers to design and maintain sites that work well with the Tor Browser.
This is an incomplete set, and we welcome contributions, suggestions and feedback!
NOTE: Italicized comments are requests for additional input/responses...
Must do (as in "otherwise you undermine the core design goals of Tor and put user anonymity and privacy at risk")
Avoid using plugins like Adobe Flash, or any proprietary plugins that can not be audited, in any way, shape, or form
Avoid relying on users downloading and opening files such as pdfs or Microsoft Word Documents
What else can enable or cause leaking actual IP address info?
What else about site design/implementation could de-anonymize users?
The verb "avoid" here sounds like a MUST. Maybe here we should instead say "Do not use plugins..." and "Do not rely on users".
Should Do (as in "help to maximize the security and quality of Tor user experience")
Site design
Test all site pages and functionality using Tor Browser [maybe add the security level to test against? the browser security level allows different experiences]
Working in Tor Browser with the "Low" security level would actually be a MUST. Working with the "Medium" and "High" could remain a SHOULD.
https://tb-manual.torproject.org/en-US/security-slider.html
Verify site works without javascript enabled
I don't think you *have* to make your site work without JavaScript, but you should at least inform the user in a friendly way that JavaScript is required and the site will not work properly without it.
Ack on the previous comment but the site should have basic functionalities available with no JavaScript. Plus explain how to enable it in Tor Browser. We did that on https://tails.boum.org/install/download/ with https://tails.boum.org/install/inc/screenshots/allow_js.png.
The same as above, except for SVG and WebRTC instead of JavaScript
Serve all content over HTTPS
Use a trustworthy certificate authority such as LetsEncrypt.org
What is the purpose here? One CA is effectively the same as any other CA, so long as the CA isn't distrusted by Mozilla.
Don't depend on IP address for locale determination (allow users to set their own language)
Don't depend on or assume IP address will remain constant during user sessions
Don't expect a particular number of users per IP address
What else might break with new circuit or new identity based on site design assumptions?
Anything about "please don't fingerprint your users by network/device addresses or browser attributes" or "don't try to extract canvas details"?
Page design
If you actually have a feature (such as user avatar/image editing) that relies on canvas image extraction, allow a user to trigger the image extraction multiple times and confirm the resulting extracted image is what the user intends. This will allow friendly support of the Canvas Permission Prompt.
Do not rely on high-resolution timestamps from any date properties, such as performance.now() or new Date().getMilliseconds()
If you have a feature that relies on automatically detecting the user's timezone, allow the user to override the automatic selection with a manually chosen one
Before making use of DOM features (WebSpeech API, gamepad API, etc) perform feature detection to ensure the methods are present to avoid possible JavaScript errors
Minimize page "weight"/bandwidth needs
Make sure pages work properly with image loading turned off (that's not a "Should Do" item but "Please Do", if at all)
This is not a supported configuration of Tor Browser, so I don't think it should make the list.
Don't auto-start videos or multimedia content
And do not rely on the media statistic API to scale media performance. Alternately, detect the spoofed media statistics and ignore them.
Don't assume low latency or constant latency
Make sure pages work properly without SVG support enabled.
And/or display a note that SVG images are in use and what users are missing
Make sure pages work properly without being able to load fonts located remotely.
(Make sure pages work properly when the "Security Slider" is set to "High")
Server-side configuration
Anything? Technologies to be avoided?
If you use CloudFlare or another provider that treats Tor differently, enable uninterrupted access for Tor users (link to CF instructions)
Verify site works "without" cookies
(need correct/clarifying language to convey that actual session cookies after login make sense) (you could specify "third-party" cookies or more broadly "tracking cookies" if you want; right now I'd argue the third-party cookie item is actually a "Should Do" item as we currently have third-party cookies disabled; Hm. I wonder if that is not even a "Must Do" at the moment because if a website really relied on third-party cookies then it would be broken currently)
Please Do (as in "these things further enrich and protect Tor user experiences")
Make your site available via a corresponding .onion address [1]
Make your site available over IPv6 as well as IPv4 (provide both addresses in DNS)
Once Tor Browser supports it, using HTTP2 with Push will decrease the load time of your site.
In general, any tech that decreases load time (image spriting, minifying JS, etc) will get a magnified improvement in Tor Browser over other browsers.
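One of the "Should Do" items above asks sites not to depend on IP address for locale determination and to let users set their own language. A rough server-side sketch of that idea follows; the `SUPPORTED` tuple and the function names are hypothetical, and the parsing is a simplified reading of the Accept-Language header rather than a full implementation of the HTTP specification.

```python
SUPPORTED = ("en", "de", "fr")  # hypothetical set of site languages


def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(ranked)]


def choose_language(accept_language, user_choice=None, default="en"):
    """Pick a language without looking at the client's IP address."""
    # An explicit user setting always wins over anything inferred.
    if user_choice in SUPPORTED:
        return user_choice
    for tag in parse_accept_language(accept_language or ""):
        primary = tag.split("-")[0]
        if primary in SUPPORTED:
            return primary
    return default


print(choose_language("fr-CH, fr;q=0.9, en;q=0.8"))
print(choose_language("en-US,en;q=0.5", user_choice="de"))
```

The design choice is that the exit node's location never enters the decision: the user's explicit choice comes first, the browser's stated preference second, and a fixed default last.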
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
On 3/4/19, Kevin Gallagher kcg295@nyu.edu wrote:
Recently I've been working on creating a "Tor Friendliness Scanner" (TFS), or a scanner that will measure what features of a given website are broken (non-functional) when accessed on the Tor Browser (TB), along with actionable suggestions to improve it.
You may be interested in this coupled pair of projects that may be studying a similar question from a different perspective. Note their needs list which might include integrating elements of your platform, OONI, etc.
https://trac.torproject.org/projects/tor/wiki/org/projects/DontBlockMe https://trac.torproject.org/projects/tor/wiki/org/projects/WeSupportTor
In order to do this, we first must get an approximation of ground-truth data of how a given website should work. We then need to compare it to how the website works on the TB to determine any changes.
Hi!
Kevin Gallagher:
Hello tor-dev!
My name is Kevin and I'm a PhD student at NYU. Recently I've been working on creating a "Tor Friendliness Scanner" (TFS), or a scanner that will measure what features of a given website are broken (non-functional) when accessed on the Tor Browser (TB), along with actionable suggestions to improve it. In order to do this, we first must get an approximation of ground-truth data of how a given website should work. We then need to compare it to how the website works on the TB to determine any changes.
To generate a method of determining ground-truth, we decided to modify* the Firefox (FF) browser to log all of the steps of the creation of the Content Tree (also called the DOM tree), and to log the execution of all JavaScript functions (currently underway). We then will apply these changes to the TB as well, and run a scan of popular Web sites using the modified FF and the modified TB on all three of the TB security slider settings. We will then compare the resulting logs to determine where the tree creation processes differed* and why.
What are your criteria for saying "this is broken in Tor Browser" vs. "this is just rendered slightly different in Tor Browser"? For instance I suspect that you'd even get different ground-truths depending on the major Firefox version you use (like Firefox 65 vs. Firefox 60 ESR), yet you would hardly say "This is okay in Firefox 65 but broken in Firefox 60 ESR". Or maybe there *are* cases where you would say so? What I am saying is: mapping the creation of the DOM tree and logging JS execution might be a good means for your goal (I am not sure yet) but it does not seem to be sufficient to reach it.
Secondly, I am wondering how you plan to deal with the fact that websites show different content if the logic behind them assumes you come from a different country/region. How does that get incorporated into your ground-truth, for example?
Georg
Hello everyone!
Thanks for the feedback! Please see my inline comments.
Santiago:
I wonder if you could share a link to the source code for people to give you feedback/audit or implement fixes/features themselves.
I will be posting the source code once I'm a little further along.
Roger:
(1) Looking at the DOM tree reminds me of Micah's paper from a few years back: "Validating Web Content with Senser" https://security.cs.georgetown.edu/~msherr/pubs.php
This is very interesting! Earlier on I had considered using Merkle Trees for this project as well, but we are currently looking at other, more suitable options.
Roger:
and (2) Be sure to check out the recent papers by the Berkeley group on this area, e.g. the "do you see what I see" paper and more recent ones: https://www1.icsi.berkeley.edu/~sadia/
Yes, I have read many of these! The "Do you see what I see" paper was definitely one of the inspirations for this, as well as some results of my previous work.
Gunner:
It may or may not be of any use, but here is some content from an etherpad that a number of Tor folks worked on a while back regarding "Tor-friendly sites"
Thanks for sending this! The results of this project will definitely supplement this etherpad with things that we find are broken "in the wild," either as a foreseen result of the design choices of the Tor Browser or by unforeseen consequence.
grarpamp:
You may be interested in this coupled pair of projects that may be studying a similar question from a different perspective. Note their needs list which might include integrating elements of your platform, OONI, etc.
Thanks for bringing these to our attention! These are certainly interesting projects that I think could benefit from the findings of our work, when we get there. We may be able to supplement the lists that are already built up with more information about services that don't outright block Tor, but make it difficult to use their Web service anonymously by relying on functionality that is dangerous to anonymity and blocked on Tor Browser.
Georg:
What are your criteria for saying "this is broken in Tor Browser" vs. "this is just rendered slightly different in Tor Browser"? For instance I suspect that you'd even get different ground-truths depending on the major Firefox version you use (like Firefox 65 vs. Firefox 60 ESR), yet you would hardly say "This is okay in Firefox 65 but broken in Firefox 60 ESR". Or maybe there *are* cases where you would say so? What I am saying is: mapping the creation of the DOM tree and logging JS execution might be a good means for your goal (I am not sure yet) but it does not seem to be sufficient to reach it.
There is some legitimate concern here, and the reason my e-mail is so late is that I've been considering this. The main observation here is that, as far as I know, the Tor Browser is a modification of Firefox ESR, not an entirely stand-alone browser. The goal, then, is to take as ground truth the closest version of Firefox that we can.
The Tor Browser starts as a Firefox ESR release and then has changes applied to it (patches, extensions, etc). If we use the FF ESR release associated with the most current TB, then we can count that as "ground truth," since the comparison should then isolate only the changes made to FF to turn it into TTB.
Georg:
Secondly, I am wondering how you plan to deal with the fact that websites show different content if the logic behind them assumes you come from a different country/region. How does that get incorporated into your ground-truth, for example?
The way we intended to do this was to send our FF "ground-truth" collection through Tor, and specifically through the same exit node as TTB uses. This way we can isolate the variable to the differences in the browsers, rather than any network or other concerns. In addition, we are working on developing a method for determining if content is dynamically generated (and therefore different every time), or broken.
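One possible way to separate dynamically generated content from breakage, sketched under assumptions: load each page several times in each browser and treat anything that varies between repeated loads of the same browser as dynamic. The node identifiers and the function name below are hypothetical, and this is only one candidate heuristic, not necessarily the method under development.

```python
def classify_differences(firefox_loads, tor_loads):
    """Split observations into dynamic content and candidate breakage.

    Each argument is a non-empty list of sets, one set of node
    identifiers per repeated load of the same page.
    """
    ff_always = set.intersection(*firefox_loads)  # in every Firefox load
    ff_ever = set.union(*firefox_loads)           # in at least one Firefox load
    tb_ever = set.union(*tor_loads)               # in at least one TB load

    # Nodes that come and go between Firefox loads are dynamic content.
    dynamic = ff_ever - ff_always
    # Nodes Firefox always produces but the TB never does may be broken.
    candidate_broken = ff_always - tb_ever
    return sorted(dynamic), sorted(candidate_broken)


# Hypothetical node identifiers from two loads per browser.
firefox_loads = [
    {"nav", "article", "video-player", "ad-123"},
    {"nav", "article", "video-player", "ad-456"},
]
tor_loads = [
    {"nav", "article", "ad-789"},
    {"nav", "article"},
]

dynamic, candidate_broken = classify_differences(firefox_loads, tor_loads)
print("Dynamic:", dynamic)
print("Candidate breakage:", candidate_broken)
```

In this example the rotating ad nodes are classified as dynamic, while the video player, present in every Firefox load and absent from every TB load, is flagged as candidate breakage.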
I hope this addressed all concerns, and if not, or if there is more feedback, please let me know!
Thanks,
Kevin
Hi Kevin,
Really interesting project!
The way we intended to do this was to send our FF "ground-truth" collection through Tor, and specifically through the same exit node as TTB uses. This way we can isolate the variable to the differences in the browsers, rather than any network or other concerns. In addition, we are working on developing a method for determining if content is dynamically generated (and therefore different every time), or broken.
Won't this severely taint your "ground-truth" though? If you're ultimately using the exit-node for both, then you'll really only end up measuring what is friendly to the Tor Browser itself instead of Tor overall. Is that the goal?
If not, with direct access to an exit node, you could run the test direct from that node (not using Tor) for the "ground truth" and then use it as the exit node of choice for the Tor Browser to have a solid comparison between Tor vs Not-Tor.
I look forward to seeing this come together!
-Ryan
Hi,
On 17 Mar 2019, at 08:04, Ryan Duff ry@nduff.com wrote:
The way we intended to do this was to send our FF "ground-truth" collection through Tor, and specifically through the same exit node as TTB uses. This way we can isolate the variable to the differences in the browsers, rather than any network or other concerns. In addition, we are working on developing a method for determining if content is dynamically generated (and therefore different every time), or broken.
Won't this severely taint your "ground-truth" though? If you're ultimately using the exit-node for both, then you'll really only end up measuring what is friendly to the Tor Browser itself instead of Tor overall. Is that the goal?
If not, with direct access to an exit node, you could run the test direct from that node (not using Tor) for the "ground truth" and then use it as the exit node of choice for the Tor Browser to have a solid comparison between Tor vs Not-Tor.
Most sites block by IP (or IP range), so a direct connection using the exit node's IP should give you very similar results to a Tor circuit using the exit node's IP.
T
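As a rough illustration of the point about IP-based blocking: a range-based rule only sees the source address of the connection, so a direct connection from the exit node's IP and a Tor circuit exiting at that IP are treated alike. The ranges below are reserved documentation prefixes, and `BLOCKED_RANGES` and `is_blocked` are hypothetical names, not any real site's configuration.

```python
import ipaddress

# Hypothetical blocklist built from reserved documentation ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(source_ip):
    """Return True if the connection's source address is in a blocked range."""
    addr = ipaddress.ip_address(source_ip)
    # The check depends only on the address, not on how the traffic got there.
    return any(addr in network for network in BLOCKED_RANGES)


print(is_blocked("192.0.2.55"))    # address inside a blocked IPv4 range
print(is_blocked("198.51.100.7"))  # address outside every blocked range
```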
Hi again,
On Sun, Mar 17, 2019 at 7:40 PM teor teor@riseup.net wrote:
Most sites block by IP (or IP range), so a direct connection using the exit node's IP should give you very similar results to a Tor circuit using the exit node's IP.
Thanks teor! The point still stands, though, even though my solution to it is flawed. The thing being measured will be friendliness towards the Tor Browser instead of Tor overall. Basically, the measurement will be "friendly to Tor but not the Tor Browser". If that's the intent, then there is no real issue. I don't know how many sites will render for a Tor exit node but have issues only with the Tor Browser itself, but I'm definitely interested in seeing that data.
-Ryan
On 3/17/19 8:18 PM, Ryan Duff wrote:
Hi again,
On Sun, Mar 17, 2019 at 7:40 PM teor teor@riseup.net wrote:
Most sites block by IP (or IP range), so a direct connection using the exit node's IP should give you very similar results to a Tor circuit using the exit node's IP.
Thanks teor! The point still stands, though, even though my solution to it is flawed. The thing being measured will be friendliness towards the Tor Browser instead of Tor overall. Basically, the measurement will be "friendly to Tor but not the Tor Browser". If that's the intent, then there is no real issue. I don't know how many sites will render for a Tor exit node but have issues only with the Tor Browser itself, but I'm definitely interested in seeing that data.
Yes, I should be clear about this. I am interested in the issues related to the Tor Browser, since the network level issues are already very well studied. For this project, only the Tor Browser is being considered.
I suppose, then, that I should call it the "Tor Browser Friendliness Scanner," but I didn't give that much thought to the name. Sorry about that!
- Kevin
-Ryan