Hi all!
So, Karsten, Nicolas, and I sat together for a while and looked at past data to figure out how many users downloaded and updated their Tor Browser over time.
We actually got more questions than we were able to answer, but I guess that's fine for a start.
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
We annotated the graphs a bit, highlighting things we wanted to point people to.
The initial downloads are the number of package downloads from the website for all supported platforms (Windows, OS X and Linux). Apart from spike (5), none of the events seem to be related to Tor Browser itself.
* On the downloads graph we seem to have a spike (5) in new downloads with the release of Tor Browser 6.0. Maybe because it was much more widely publicized in the media than previous/later releases?
* On the same graph, a big spike (6) can be seen on the same day the new board was announced.
* There are other spikes on the initial downloads graph (1, 3, 4) where we have no idea what happened, while (2) is probably just an outlier.
The update pings are made by Tor Browser instances roughly twice a day and they indicate the number of active Tor Browser users. More importantly, one can see the decrease or increase of Tor Browser usage over time.
* Like (2) in the downloads graph, (7) seems to be an outlier as well.
* We don't know what caused (8) or (9), but it seems to us that we are losing users over time and are only getting them back slowly, if at all. A weekday/weekend pattern is visible there as well.
The graph with the update requests basically shows how fast users are updating to newly released Tor Browser versions.
* (10) shows a large spike correlating with the 6.0 release. It is not clear to us where all those update requests were coming from, given the update request pattern before/after 6.0. One plausible explanation could be that our infrastructure was heavily overloaded, causing clients to retry fetching the update.
We'd love to hear feedback, especially anything that could shed light on the events we could not explain.
Georg
On Sun, Sep 11, 2016 at 04:13:00PM +0000, Georg Koppen wrote:
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
The update pings are made by Tor Browser instances roughly twice a day and they indicate the number of active Tor Browser users. More importantly, one can see the decrease or increase of Tor Browser usage over time.
- We don't know what caused (8) or (9), but it seems to us that we are losing users over time and are only getting them back slowly, if at all. A weekday/weekend pattern is visible there as well.
Does Tor Browser continue checking for further updates in the span of time between when it downloads an update and when it is restarted? For example, you are running 6.0, the browser downloads the 6.0.1 update and stages it and asks you to restart; does the browser check for updates until you actually restart? If not, then the decreases in update pings might be people being tardy in restarting their browser.
On 9/11/16 3:45 PM, David Fifield wrote:
- We don't know what caused (8) or (9), but it seems to us that we are losing users over time and are only getting them back slowly, if at all. A weekday/weekend pattern is visible there as well.
Does Tor Browser continue checking for further updates in the span of time between when it downloads an update and when it is restarted? For example, you are running 6.0, the browser downloads the 6.0.1 update and stages it and asks you to restart; does the browser check for updates until you actually restart? If not, then the decreases in update pings might be people being tardy in restarting their browser.
That is a good theory, but I don't think update checks occur if there is a pending update. The code that checks and returns early is here:
https://gitweb.torproject.org/tor-browser.git/tree/toolkit/mozapps/update/ns...
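For readers who don't want to dig through the Mozilla JavaScript, here is a minimal Python sketch of that early-return behavior. The class and the "ready" marker file are illustrative assumptions only; the real logic lives in the nsUpdateService.js file linked above and its API differs.

    from pathlib import Path

    class UpdateChecker:
        """Illustrative sketch; not the actual nsUpdateService API."""

        def __init__(self, update_dir: Path):
            self.update_dir = update_dir

        def has_pending_update(self) -> bool:
            # Hypothetical marker: an update that has been downloaded and
            # staged and is waiting for the user to restart the browser.
            return (self.update_dir / "ready").exists()

        def check_for_updates(self) -> None:
            if self.has_pending_update():
                # Return early: no further update pings are sent until the
                # browser restarts and applies the staged update.
                return
            self.send_update_ping()

        def send_update_ping(self) -> None:
            # In the real browser this is a request to the update server;
            # these requests are what the update pings graph counts.
            pass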
On Mon, Sep 12, 2016 at 11:12:15AM -0400, Mark Smith wrote:
On 9/11/16 3:45 PM, David Fifield wrote:
- We don't know what caused (8) or (9), but it seems to us that we are losing users over time and are only getting them back slowly, if at all. A weekday/weekend pattern is visible there as well.
Does Tor Browser continue checking for further updates in the span of time between when it downloads an update and when it is restarted? For example, you are running 6.0, the browser downloads the 6.0.1 update and stages it and asks you to restart; does the browser check for updates until you actually restart? If not, then the decreases in update pings might be people being tardy in restarting their browser.
That is a good theory, but I don't think update checks occur if there is a pending update. The code that checks and returns early is here:
https://gitweb.torproject.org/tor-browser.git/tree/toolkit/mozapps/update/ns...
Oh, thanks for finding that source code link. I looked for that code and didn't find it.
But that's exactly what I'm saying: once someone has downloaded an update, they stop sending update pings until their next restart, which might explain the decreases in update pings at (8) and (9) in the graphs.
On 9/12/16 11:20 AM, David Fifield wrote:
Oh, thanks for finding that source code link. I looked for that code and didn't find it.
But that's exactly what I'm saying: once someone has downloaded an update, they stop sending update pings until their next restart, which might explain the decreases in update pings at (8) and (9) in the graphs.
Ah, right. Sorry for my confusion. So, yes, your theory really is a good one, although it is surprising that months later the update ping count did not return to its old value; e.g., the March counts are significantly higher than the August ones. Maybe our usage is dropping.
If we think the restart delay is a bad thing, we could be more aggressive about encouraging people to restart and apply updates.
On 13 Sep 2016, at 01:51, Mark Smith mcs@pearlcrescent.com wrote:
On 9/12/16 11:20 AM, David Fifield wrote:
Oh, thanks for finding that source code link. I looked for that code and didn't find it.
But that's exactly what I'm saying: once someone has downloaded an update, they stop sending update pings until their next restart, which might explain the decreases in update pings at (8) and (9) in the graphs.
Ah, right. Sorry for my confusion. So, yes, your theory really is a good one, although it is surprising that months later the update ping count did not return to its old value; e.g., the March counts are significantly higher than the August ones. Maybe our usage is dropping.
If we think the restart delay is a bad thing, we could be more aggressive about encouraging people to restart and apply updates.
That would mitigate issues where profile changes are ignored between when the update is applied and when users restart the browser: https://trac.torproject.org/projects/tor/ticket/18179
Tim Wilson-Brown (teor)
David Fifield:
On Mon, Sep 12, 2016 at 11:12:15AM -0400, Mark Smith wrote:
On 9/11/16 3:45 PM, David Fifield wrote:
- We don't know what caused (8) or (9), but it seems to us that we are losing users over time and are only getting them back slowly, if at all. A weekday/weekend pattern is visible there as well.
Does Tor Browser continue checking for further updates in the span of time between when it downloads an update and when it is restarted? For example, you are running 6.0, the browser downloads the 6.0.1 update and stages it and asks you to restart; does the browser check for updates until you actually restart? If not, then the decreases in update pings might be people being tardy in restarting their browser.
That is a good theory, but I don't think update checks occur if there is a pending update. The code that checks and returns early is here:
https://gitweb.torproject.org/tor-browser.git/tree/toolkit/mozapps/update/ns...
Oh, thanks for finding that source code link. I looked for that code and didn't find it.
But that's exactly what I'm saying: once someone has downloaded an update, they stop sending update pings until their next restart, which might explain the decreases in update pings at (8) and (9) in the graphs.
I am not convinced this is what actually happened, for at least two reasons:
1) There are more than two updates in the time span the graph shows, and I see no reason why a large number of users would have changed their restart behavior just for those two incidents (if they were related to new releases at all).
2) The decrease in update pings for (9) seems to start before the 6.0 release went out.
Cass Brewer had some ideas in a different mail which might be worth keeping in mind:
""" (8) Feb 25 was when Wired, Ars Technica, and others reported that the FBI had broken Tor, which might have led a lot of people to step away from their browser and Onion Services for a couple of weeks.
(9) May 15 was when the story about FBI/Isis hit popular sites like CNN and Entrepreneur.com. The FBI hack also continued to see a lot of popular coverage around that time, which might have suppressed browser use. """
On the other hand, as Arthur Edelstein pointed out in the last Tor Browser meeting: there does not seem to be a similar drop in update requests (i.e. actual .mar files containing the update). While one has to keep in mind that the updater falls back to requesting a complete .mar file if the incremental one can't be applied, it is still somewhat surprising to me that the number of update requests was not really affected by the update ping decreases.
Georg
Hi Georg,
I think the behavior you see can be explained by an overloaded download server. From the initial downloads graph you can see that there are on average 80,000 downloads a day. From the update pings and update requests graphs you can estimate that there are about 800,000 active Tor Browser users. So, when there is a new version of Tor Browser, the number of update requests massively overloads the download server. The saw-tooth form of the update requests graph is what you would expect in this situation: first you get an update request from all users; the next day you get a request from all users minus the users who were updated the previous day (at most 80,000); and so on.
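A minimal simulation of that saw-tooth, using the rough figures above (800,000 active users, a fixed capacity of 80,000 served updates per day); both numbers are estimates from this thread, not measured parameters:

    def sawtooth(active_users=800_000, daily_capacity=80_000, days=14):
        remaining = active_users
        for day in range(1, days + 1):
            # Every not-yet-updated browser requests the update each day...
            requests = remaining
            # ...but the overloaded server only completes a fixed number.
            served = min(remaining, daily_capacity)
            remaining -= served
            print(f"day {day:2d}: {requests:9,} requests, {served:7,} served")

    sawtooth()

Under these assumptions the request count decays linearly and reaches zero after ten days, which is part of why the lowest observed value mentioned below is surprising.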
I wonder if it is possible that failed downloads are counted too? That would explain the spikes. But systems that are so heavily overloaded can generate all kinds of weird results.
One thing bothers me: the update requests graph never touches zero. It should, because zero would mean that all Tor Browsers have been updated. 100,000 seems to be the lowest value.
On 12 September 2016 at 03:37, Rob van der Hoeven robvanderhoeven@ziggo.nl wrote:
One thing bothers me: the update requests graph never touches zero. It should, because zero would mean that all Tor Browsers have been updated. 100,000 seems to be the lowest value.
I'm not surprised by this at all. I think a very common mode of usage is people who have TB on their computer but don't use it regularly. (I have several friends like this.) Only when they want to search for something 'embarrassing' (medical conditions, etc.) will they use it. With an update cycle of one to two months between releases, it's likely these people are actually _never_ up to date (unless they choose to restart TB during their browsing session).
-tom
Hi,
On Mon, 12 Sep 2016, Rob van der Hoeven wrote:
Hi Georg,
I think the behavior you see can be explained by an overloaded download server. From the initial downloads graph you can see that there are on average 80,000 downloads a day. From the update pings and update requests graphs you can estimate that there are about 800,000 active Tor Browser users. So, when there is a new version of Tor Browser, the number
We can estimate about 800,000 active daily users if we assume that they are all running their browser at different times during the day, making two pings a day. But some of them probably run it only during some part of the day that is less than 12 hours, which is not enough to make two pings per day. So I think from the update pings we can only estimate that we have more than 800,000 active daily users, and less than 1,600,000.
To get a more precise estimate of active users, we could maybe count the total number of update downloads for each version (in another graph). Although this would be different from the update pings, as it would also include the occasional users who don't use it every day.
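To make the bound explicit, here is a back-of-the-envelope sketch; the 1,600,000 figure is just the daily ping count implied by the estimate above, not a measured number:

    def daily_user_bounds(pings_per_day):
        # Lower bound: every user ran the browser long enough to ping twice.
        # Upper bound: every user ran it briefly and pinged only once.
        return pings_per_day // 2, pings_per_day

    low, high = daily_user_bounds(1_600_000)
    print(f"between {low:,} and {high:,} active daily users")
    # -> between 800,000 and 1,600,000 active daily users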
of update requests massively overloads the download server. The saw-tooth form of the update requests graph is what you would expect in this situation: first you get an update request from all users; the next day you get a request from all users minus the users who were updated the previous day (at most 80,000); and so on.
I wonder if it is possible that failed downloads are counted too? That would explain the spikes. But systems that are so heavily overloaded can generate all kinds of weird results.
I think it is possible that failed downloads are counted too. What we are counting is the number of HTTP 302 (redirect) responses that initiate a download. If the redirect works but the actual download fails, the request is still counted.
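As an illustration, counting those redirects from a sanitized log could look like the sketch below. The download path prefix is a hypothetical example, and the real counting pipeline may well differ:

    def count_download_redirects(log_lines, path_prefix="/dist/torbrowser/"):
        # Count requests for download paths that were answered with an
        # HTTP 302 redirect. A redirect whose subsequent download failed
        # is still counted, as noted above.
        return sum(
            1
            for line in log_lines
            if path_prefix in line and '" 302 ' in line
        )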
Nicolas
On 11/09/16 18:13, Georg Koppen wrote:
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
Here's the same graph with more data, more request types, and of course a lot more shininess:
https://tor-metrics.shinyapps.io/webstats/
Note that this is just a prototype that will go away in the future without notice. But if enough people like it we might run our own Shiny Server at some point in the future. Enjoy! (And thanks, Isabela, for suggesting Shiny!)
All the best, Karsten
This is awesome, Karsten!
On Wed, Sep 14, 2016 at 1:16 PM, Karsten Loesing karsten@torproject.org wrote:
On 11/09/16 18:13, Georg Koppen wrote:
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
Here's the same graph with more data, more request types, and of course a lot more shininess:
https://tor-metrics.shinyapps.io/webstats/
Note that this is just a prototype that will go away in the future without notice. But if enough people like it we might run our own Shiny Server at some point in the future. Enjoy! (And thanks, Isabela, for suggesting Shiny!)
All the best, Karsten
Karsten Loesing:
On 11/09/16 18:13, Georg Koppen wrote:
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
Here's the same graph with more data, more request types, and of course a lot more shininess:
https://tor-metrics.shinyapps.io/webstats/
Note that this is just a prototype that will go away in the future without notice. But if enough people like it we might run our own Shiny Server at some point in the future. Enjoy! (And thanks, Isabela, for suggesting Shiny!)
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
I also would love to see a breakdown by operating systems if you consider it a reasonable thing to do.
Seeing these graphs really made my day; I had been hoping for them for a long time (#10675, which maybe should be closed). Thanks a lot!
Thanks,
Here's the same graph with more data, more request types, and of course a lot more shininess:
https://tor-metrics.shinyapps.io/webstats/
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
I also would love to see a breakdown by operating systems if you consider it a reasonable thing to do.
To these requests I would add a request for the methodology behind releasing these stats. Are they raw numbers? Rounded? More generally, how are the web logs sanitized? I’m interested in how safe these statistics are to release and how they might be changed to be even more privacy-preserving.
Thanks, Aaron
On 17/09/16 18:28, Aaron Johnson wrote:
Here's the same graph with more data, more request types, and of course a lot more shininess:
https://tor-metrics.shinyapps.io/webstats/
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
I also would love to see a breakdown by operating systems if you consider it a reasonable thing to do.
To these requests I would add a request for the methodology behind releasing these stats. Are they raw numbers? Rounded? More generally, how are the web logs sanitized? I’m interested in how safe these statistics are to release and how they might be changed to be even more privacy-preserving.
Good thinking! I summarized the methodology on the graph page as: The graph above is based on sanitized Tor web server logs [0]. These are a stripped-down version of Apache's "combined" log format without IP addresses, log times, HTTP parameters, referers, and user agent strings.
[0] https://webstats.torproject.org/
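As a rough illustration of that description, sanitizing a single "combined" log line could look like the sketch below. The regex and the output layout are my assumptions, not the actual webstats code; the 404 and GET/HEAD filtering matches what is described further down in this thread:

    import re

    # Apache "combined" format: IP, identd, user, time, request line,
    # status, size, referer, user agent.
    COMBINED = re.compile(
        r'\S+ \S+ \S+ \[[^\]]+\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "[^"]*"'
    )

    def sanitize(line):
        m = COMBINED.match(line)
        if m is None:
            return None
        if m.group("method") not in ("GET", "HEAD") or m.group("status") == "404":
            return None  # dropped entirely (see discussion below)
        path = m.group("path").split("?", 1)[0]  # strip HTTP parameters
        # Keep only method, parameter-free path, status code, and size;
        # IP address, log time, referer, and user agent are discarded.
        return "{} {} {} {}".format(
            m.group("method"), path, m.group("status"), m.group("size"))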
I guess we'll write down the sanitizing process in more detail once we make this part of CollecTor, but that may take a few more weeks or even months.
If you spot anything in the data that you think should be sanitized more thoroughly, please let us know!
All the best, Karsten
Good thinking! I summarized the methodology on the graph page as: The graph above is based on sanitized Tor web server logs [0]. These are a stripped-down version of Apache's "combined" log format without IP addresses, log times, HTTP parameters, referers, and user agent strings.
...
If you spot anything in the data that you think should be sanitized more thoroughly, please let us know!
Interesting, thanks. Here are some thoughts based on looking through one of these logs (from archeotrichon.torproject.org on 2015-09-20):
1. The order of requests appears to be preserved. If so, this allows an adversary to determine fine-grained timing information by inserting requests of his own at known times.
2. The size of the response is included, which potentially allows an adversary observing the client side to perform a correlation attack (combined with #1 above). This could allow the adversary to learn interesting things like (i) this person is downloading arm and thus is probably running a relay or (ii) this person is creating Trac tickets with onion-service bugs and is likely running an onion service somewhere (or is Trac excluded from these logs?). The size could also be used as a time-stamping mechanism alternative to #1 if the size of the request can be changed by the adversary (e.g. by blog comments).
3. Even without fine-grained timing information, daily per-server logs might include data from few enough clients that multiple requests can be reasonably inferred to be from the same client, which can collectively reveal lots of information (e.g. country based on browser localization used, platform, blog posts viewed/commented on if the blog server also releases logs).
I also feel compelled to raise the question of whether or not releasing these logs went through Tor’s own recommended procedure for producing data on its users (https://research.torproject.org/safetyboard.html#guidelines):
• Only collect data that is safe to make public.
• Don't collect data you don't need (minimization).
• Take reasonable security precautions, e.g. about who has access to your data sets or experimental systems.
• Limit the granularity of data (e.g. use bins or add noise).
• The benefits should outweigh the risks.
• Consider auxiliary data (e.g. third-party data sets) when assessing the risks.
• Consider whether the user meant for that data to be private.
I definitely see the value of analyzing these logs, though, and it definitely helps that some sanitization was applied :-)
Best, Aaron
Hi Aaron,
On 20/09/16 15:43, Aaron Johnson wrote:
Good thinking! I summarized the methodology on the graph page as: The graph above is based on sanitized Tor web server logs [0]. These are a stripped-down version of Apache's "combined" log format without IP addresses, log times, HTTP parameters, referers, and user agent strings.
...
If you spot anything in the data that you think should be sanitized more thoroughly, please let us know!
Interesting, thanks. Here are some thoughts based on looking through one of these logs (from archeotrichon.torproject.org on 2015-09-20): 1. The order of requests appears to be preserved. If so, this allows an adversary to determine fine-grained timing information by inserting requests of his own at known times.
Log files are sorted as part of the sanitizing procedure, so that request order should not be preserved. If you find a log file that is not sorted, please let us know, because that would be a bug.
- The size of the response is included, which potentially allows
an adversary observing the client side to perform a correlation attack (combined with #1 above). This could allow the adversary to learn interesting things like (i) this person is downloading arm and thus is probably running a relay or (ii) this person is creating Trac tickets with onion-service bugs and is likely running an onion service somewhere (or is Trac excluded from these logs?). The size could also be used as a time-stamping mechanism alternative to #1 if the size of the request can be changed by the adversary (e.g. by blog comments).
This seems less of a problem with request order not being preserved. And actually, the logged size is the size of the object on the server, not the number of bytes written to the client. Even if these sizes were scrubbed, it would be quite easy for an attacker to find out most of these sizes by simply requesting objects themselves. On the other hand, not including them would make some analyses unnecessarily hard. I'd say it's reasonable to keep them.
- Even without fine-grained timing information, daily per-server
logs might include data from few enough clients that multiple requests can be reasonably inferred to be from the same client, which can collectively reveal lots of information (e.g. country based on browser localization used, platform, blog posts viewed/commented on if the blog server also releases logs).
We're removing almost all user data from request logs and only preserving data about the requested object. For example, we're throwing away user agent strings and request parameters. I don't really see the problem you're describing here.
I also feel compelled to raise the question of whether or not releasing these logs went through Tor’s own recommended procedure for producing data on its users (https://research.torproject.org/safetyboard.html#guidelines):
Git history says that those guidelines were put up in April 2016 whereas the rewrite of the web server log sanitizing code happened in November 2015, with the original sanitizing process being written in December 2011. So, no, we didn't go through that procedure yet, but let's do that now:
• Only collect data that is safe to make public.
We're only using data after making it public, so we're not collecting anything that we think wouldn't be safe to make public.
• Don't collect data you don't need (minimization).
I can see us using sanitized web logs from all Tor web servers, not limited to Tor Browser/Tor Messenger downloads and Tor main website hits. I used these logs to learn whether Atlas or Globe had more users, and I just recently looked at Metrics logs to see which graphs are requested most often.
• Take reasonable security precautions, e.g. about who has access to your data sets or experimental systems.
We're doing that. For example, I personally don't have access to non-sanitized web logs, just to the sanitized ones, like everyone else.
• Limit the granularity of data (e.g. use bins or add noise).
We're throwing out time information and removing request order.
• The benefits should outweigh the risks.
I'd say this is the case. As you say below yourself, there is value of analyzing these logs, and I agree. I have also been thinking a lot about possible risks, which resulted in the sanitizing procedure that is in place, which comes after the very restrictive logging policy at Tor's Apache processes, which throws away client IP addresses and other sensitive data right at the logging step. All in all, yes, benefits do outweigh the risks here, in my opinion.
• Consider auxiliary data (e.g. third-party data sets) when assessing the risks.
I don't see a convincing scenario where this data set would make a third-party data set more dangerous.
• Consider whether the user meant for that data to be private.
We're removing the user's IP address, request parameters, and user agent string, and we're throwing out requests that resulted in a 404 or that used a different method than GET or HEAD. I can't see how a user meant the remaining parts to be private.
I definitely see the value of analyzing these logs, though, and it definitely helps that some sanitization was applied :-)
Glad to hear that.
We shall specify the sanitizing procedure in more detail as soon as these logs are provided by CollecTor. I could imagine that we'll write down the process similar to the bridge descriptor sanitizing process:
https://collector.torproject.org/#bridge-descriptors
However, the current plan is to keep using the data provided by webstats.torproject.org in the upcoming 9 months while we're busy with other things. Just saying, don't hold your breath.
All the best, Karsten
Log files are sorted as part of the sanitizing procedure, so that request order should not be preserved. If you find a log file that is not sorted, please let us know, because that would be a bug.
That’s great! It just appeared ordered in that multiple related requests appeared in sequence, but I see that sorting can have that effect too.
- The size of the response is included, which potentially allows
an adversary observing the client side to perform a correlation attack (combined with #1 above). This could allow the adversary to learn interesting things like (i) this person is downloading arm and thus is probably running a relay or (ii) this person is creating Trac tickets with onion-service bugs and is likely running an onion service somewhere (or is Trac excluded from these logs?). The size could also be used as a time-stamping mechanism alternative to #1 if the size of the request can be changed by the adversary (e.g. by blog comments).
This seems less of a problem with request order not being preserved. And actually, the logged size is the size of the object on the server, not the number of bytes written to the client. Even if these sizes were scrubbed, it would be quite easy for an attacker to find out most of these sizes by simply requesting objects themselves. On the other hand, not including them would make some analyses unnecessarily hard. I'd say it's reasonable to keep them.
Here is a concern: if the adversary can cause the size to be modified (say by adding comments to a blog page), then he can effectively mark certain requests as happening within a certain time period by setting a unique size for that time period.
- Even without fine-grained timing information, daily per-server
logs might include data from few enough clients that multiple requests can be reasonably inferred to be from the same client, which can collectively reveal lots of information (e.g. country based on browser localization used, platform, blog posts viewed/commented on if the blog server also releases logs).
We're removing almost all user data from request logs and only preserving data about the requested object. For example, we're throwing away user agent strings and request parameters. I don't really see the problem you're describing here.
This might be easiest to appreciate in the limit. Suppose you have a huge number of servers (relative to the number of clients) with DNS load-balancing among them. Each one basically has no requests or all those from the same client. Linking together multiple client requests allows them to collectively reveal information about the client. You might learn the language in one request, the platform in another, etc. A similar argument applies to splitting the logs across increasingly small time periods (per-day, per-hour, although at some point the time period gets below a given client’s “browsing session”). Obviously both of these examples are not near reality at some point, but the more you separate the logs across machines and over time, the more that requests might reasonably be inferred to belong to the same client. This presents a tradeoff you can make between accuracy and privacy by aggregating across more machines and over longer time periods.
let's do that now:
:-D
• Don't collect data you don't need (minimization).
I can see us using sanitized web logs from all Tor web servers, not limited to Tor Browser/Tor Messenger downloads and Tor main website hits. I used these logs to learn whether Atlas or Globe had more users, and I just recently looked at Metrics logs to see which graphs are requested most often.
A more conservative approach would be more “pull” than “push”, so you don’t collect data until you want it, at which point you add it to the collection list. Just a thought.
• The benefits should outweigh the risks.
I'd say this is the case. As you say below yourself, there is value of analyzing these logs, and I agree. I have also been thinking a lot about possible risks, which resulted in the sanitizing procedure that is in place, which comes after the very restrictive logging policy at Tor's Apache processes, which throws away client IP addresses and other sensitive data right at the logging step. All in all, yes, benefits do outweigh the risks here, in my opinion.
I think this is the ultimate test, and it sounds like you put a lot of thought into it (as expected).
• Consider auxiliary data (e.g. third-party data sets) when assessing the risks.
I don't see a convincing scenario where this data set would make a third-party data set more dangerous.
Are there any files that are of interest only to a particular user or user subpopulation? Examples might be an individual’s blog or Tor instructions in Kurdish. If so, revealing that they have been accessed could indicate if and when the user or subpopulation are active on the Tor site. Are there any files that might hold particular interest to some adversary? Examples might be a comparison in Mandarin between Psiphon tools and Tor. If so, revealing their access frequency could indicate to the adversary that they should pay close attention to whatever is signified by that file. A similar issue arose with the popularity of onion services, about which I believe the current consensus is that it should be hidden, the canonical example being a government that monitors the popularity of political opposition forums to determine which ones are beginning to be popular and thus need to be repressed.
• Consider whether the user meant for that data to be private.
We're removing the user's IP address, request parameters, and user agent string, and we're throwing out requests that resulted in a 404 or that used a different method than GET or HEAD. I can't see how a user meant the remaining parts to be private.
I’m happy to see that you’re removing 404s! Some things that occurred to me are avoided by doing this (e.g. inadvertent sensitive client requests).
We shall specify the sanitizing procedure in more detail as soon as these logs are provided by CollecTor. I could imagine that we'll write down the process similar to the bridge descriptor sanitizing process:
https://collector.torproject.org/#bridge-descriptors
I look forward to the writeup!
Best, Aaron
On 22/09/16 01:48, Aaron Johnson wrote:
Oops, this thread got lost in the Seattle preparations and only surfaced today while doing some housekeeping. Please find my response below.
Log files are sorted as part of the sanitizing procedure, so that request order should not be preserved. If you find a log file that is not sorted, please let us know, because that would be a bug.
That’s great! It just appeared ordered in that multiple related requests appeared in sequence, but I see that sorting can have that effect too.
Okay, glad you didn't find a bug there.
- The size of the response is included, which potentially
allows an adversary observing the client side to perform a correlation attack (combined with #1 above). This could allow the adversary to learn interesting things like (i) this person is downloading arm and thus is probably running a relay or (ii) this person is creating Trac tickets with onion-service bugs and is likely running an onion service somewhere (or is Trac excluded from these logs?). The size could also be used as a time-stamping mechanism alternative to #1 if the size of the request can be changed by the adversary (e.g. by blog comments).
This seems less of a problem with request order not being preserved. And actually, the logged size is the size of the object on the server, not the number of bytes written to the client. Even if these sizes were scrubbed, it would be quite easy for an attacker to find out most of these sizes by simply requesting objects themselves. On the other hand, not including them would make some analyses unnecessarily hard. I'd say it's reasonable to keep them.
Here is a concern: if the adversary can cause the size to be modified (say by adding comments to a blog page), then he can effectively mark certain requests as happening within a certain time period by setting a unique size for that time period.
Alright, I see your point. We should remove the sizes of requested objects that can be modified by users and hence adversaries. The blog is not affected here, because we're not including sanitized logs of the blog yet, and even if we were, comments are manually approved by the blog admin, which only happens a few times per day and which takes away control from an adversary.
But we do have Trac logs where users can easily add a comment or modify a wiki page. We should simply include 0 as the requested object size in those logs. And we should make sure we're doing the same with future sites where users can modify content. Added to my list.
- Even without fine-grained timing information, daily
per-server logs might include data from few enough clients that multiple requests can be reasonably inferred to be from the same client, which can collectively reveal lots of information (e.g. country based on browser localization used, platform, blog posts viewed/commented on if the blog server also releases logs).
We're removing almost all user data from request logs and only preserving data about the requested object. For example, we're throwing away user agent strings and request parameters. I don't really see the problem you're describing here.
This might be easiest to appreciate in the limit. Suppose you have a huge number of servers (relative to the number of clients) with DNS load-balancing among them. Each one basically has no requests or all those from the same client. Linking together multiple client requests allows them to collectively reveal information about the client. You might learn the language in one request, the platform in another, etc. A similar argument applies to splitting the logs across increasingly small time periods (per-day, per-hour, although at some point the time period gets below a given client’s “browsing session”). Obviously both of these examples are not near reality at some point, but the more you separate the logs across machines and over time, the more that requests might reasonably be inferred to belong to the same client. This presents a tradeoff you can make between accuracy and privacy by aggregating across more machines and over longer time periods.
So, I'm not sure if the following is feasible with the current sanitizing code. What we could do is merge all logs coming from different servers for a given site and day, sort them, and provide them as a single sanitized log file. That would address your concern here without making the logs any less useful for analysis. If we cannot implement this right now, I'll make a note to implement it when we re-implement this code in Java and add it to CollecTor. Added to my list, too.
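A minimal sketch of that merge-and-sort step, assuming the per-server sanitized logs are plain text files; the file layout here is hypothetical:

    from pathlib import Path

    def merge_site_day(per_server_files, out_file):
        # Combine the sanitized logs of all servers serving one site on
        # one day into a single file, so requests can no longer be
        # attributed to a particular server.
        lines = []
        for f in per_server_files:
            lines.extend(Path(f).read_text().splitlines())
        lines.sort()  # sorting also destroys any residual request order
        Path(out_file).write_text("\n".join(lines) + "\n")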
let's do that now:
:-D
Well, we did discuss benefits and risks at length a few years ago; we just didn't follow these guidelines because they didn't exist back then.
• Don't collect data you don't need (minimization).
I can see us using sanitized web logs from all Tor web servers, not limited to Tor Browser/Tor Messenger downloads and Tor main website hits. I used these logs to learn whether Atlas or Globe had more users, and I just recently looked at Metrics logs to see which graphs are requested most often.
A more conservative approach would be more “pull” than “push”, so you don’t collect data until you want it, at which point you add it to the collection list. Just a thought.
The downside is that we'd be losing history. I'm not in favor of that approach. To give a random example, it would have made the Tor Messenger analysis a lot less useful, because most downloads happened at the initial release a year ago. I'd rather have us ensure that sanitized logs don't contain sensitive parts anymore, and publishing them seems to me like a good way to learn about that.
• The benefits should outweigh the risks.
I'd say this is the case. As you say below yourself, there is value of analyzing these logs, and I agree. I have also been thinking a lot about possible risks, which resulted in the sanitizing procedure that is in place, which comes after the very restrictive logging policy at Tor's Apache processes, which throws away client IP addresses and other sensitive data right at the logging step. All in all, yes, benefits do outweigh the risks here, in my opinion.
I think this is the ultimate test, and it sounds like you put a lot of thought into it (as expected).
Yep.
• Consider auxiliary data (e.g. third-party data sets) when assessing the risks.
I don't see a convincing scenario where this data set would make a third-party data set more dangerous.
Are there any files that are of interest only to a particular user or user subpopulation? Examples might be an individual’s blog or Tor instructions in Kurdish. If so, revealing that they have been accessed could indicate if and when the user or subpopulation are active on the Tor site. Are there any files that might hold particular interest to some adversary? Examples might be a comparison in Mandarin between Psiphon tools and Tor. If so, revealing their access frequency could indicate to the adversary that they should pay close attention to whatever is signified by that file. A similar issue arose with the popularity of onion services, about which I believe the current consensus is that it should be hidden, the canonical example being a government that monitors the popularity of political opposition forums to determine which ones are beginning to be popular and thus need to be repressed.
I believe that we should only be using data that we're publishing. And I can see how we want to learn ourselves whether our outreach efforts are successful or not. So can others. I don't believe in using that information and at the same time trying to keep it secret.
• Consider whether the user meant for that data to be private.
We're removing the user's IP address, request parameters, and user agent string, and we're throwing out requests that resulted in a 404 or that used a different method than GET or HEAD. I can't see how a user meant the remaining parts to be private.
I’m happy to see that you’re removing 404s! Some things that occurred to me are avoided by doing this (e.g. inadvertent sensitive client requests).
Yes, keeping 404s would have been bad.
We shall specify the sanitizing procedure in more detail as soon as these logs are provided by CollecTor. I could imagine that we'll write down the process similar to the bridge descriptor sanitizing process:
https://collector.torproject.org/#bridge-descriptors
I look forward to the writeup!
You'll learn about the CollecTor re-implementation and documentation on this list or in the monthly team reports on tor-reports@. Though I'm not very optimistic that it will happen in the next 9 months, given that our roadmap is already quite full:
https://trac.torproject.org/projects/tor/wiki/org/teams/MetricsTeam#Roadmapf...
But it's on my list.
Thanks for your input here!
All the best, Karsten
On 17/09/16 17:52, Lunar wrote:
Karsten Loesing:
On 11/09/16 18:13, Georg Koppen wrote:
Here are the graphs showing initial downloads, update pings and update requests over time:
https://people.torproject.org/~karsten/volatile/torbrowser-annotated-2016-09...
Here's the same graph with more data, more request types, and of course a lot more shininess:
This is now updated with more data and a lot more text. Enjoy!
Note that this is just a prototype that will go away in the future without notice. But if enough people like it we might run our own Shiny Server at some point in the future. Enjoy! (And thanks, Isabela, for suggesting Shiny!)
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
Sure, added.
I also would love to see a breakdown by operating systems if you consider it a reasonable thing to do.
That is something I'd like to add very soon, but I'd first want to discuss whether the absolute numbers make sense before breaking them down by operating system, release channel, and locale.
I'll bring the database with me to Seattle. Maybe we can sit down and run some queries on it together? This was fun last weekend together with Georg and Nicolas, and it would for sure be fun with you and other interested folks in Seattle.
Seeing these graphs really made my day; I had been hoping for them for a long time (#10675, which maybe should be closed). Thanks a lot!
Oh, interesting. Maybe we can learn something from that ticket, too. Let's not close it just yet.
Thanks,
All the best, Karsten
Karsten Loesing:
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
Sure, added.
Thanks! These are interesting data points regarding the “but nobody ever checks the signature” claim that I hear now and then.
That is something I'd like to add very soon, but I'd first want to discuss whether the absolute numbers make sense before breaking them down by operating system, release channel, and locale.
I'll bring the database with me to Seattle. Maybe we can sit down and run some queries on it together? This was fun last weekend together with Georg and Nicolas, and it would for sure be fun with you and other interested folks in Seattle.
Sure! :)
If we're going to get such graphs running as a way to measure the number of Tor Browser users, I wonder if we should not also try to work with the Tails people to add their boot statistics, and maybe other projects that include Tor Browser without using the automated update mechanism.
-- Lunar lunar@torproject.org
On 20/09/16 17:46, Lunar wrote:
Karsten Loesing:
If you feel that's interesting enough, would it be possible to also add the number of downloads of cryptographic signatures to the graph?
Sure, added.
Thanks! These are interesting data points regarding the “but nobody ever checks the signature” claim that I hear now and then.
Great!
That is something I'd like to add very soon, but I'd first want to discuss whether the absolute numbers make sense before breaking them down by operating system, release channel, and locale.
I'll bring the database with me to Seattle. Maybe we can sit down and run some queries on it together? This was fun last weekend together with Georg and Nicolas, and it would for sure be fun with you and other interested folks in Seattle.
Sure! :)
Ah well, I found some time to make a graph for this and also found this to be a good excuse to prototype R Markdown files:
https://tor-metrics.shinyapps.io/webstats2/
Enjoy! And please let me know how to make that graph even more useful! (Keep in mind that this is a prototype and that the version on Tor Metrics is likely going to provide fewer options to examine the data; but we should use this prototype to learn which are the most important things we want to have on Tor Metrics.)
If we're going to get such graphs running as a way to measure the number of Tor Browser users, I wonder if we should not also try to work with the Tails people to add their boot statistics, and maybe other projects that include Tor Browser without using the automated update mechanism.
Good idea, added to the list.
All the best, Karsten