Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with the same referrer or common user agents, indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied.
A snapshot of 15 or 30 minutes would likely be sufficient as a sample. I'd analyze this log file on the server, delete it, and report my findings here.
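For illustration, the on-server analysis could be as simple as the following (a rough sketch; the snapshot file name and the combined-format regular expression are assumptions, not the actual setup):

#!/usr/bin/env python3
# Rough sketch of the planned on-server analysis: count requests per client
# IP address, user agent, and referrer in an Apache log using the default
# "combined" format. The file name is a placeholder.
import re
from collections import Counter

COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

ips, agents, referers = Counter(), Counter(), Counter()
with open("dist-snapshot.log") as log:
    for line in log:
        m = COMBINED.match(line)
        if not m:
            continue
        ips[m.group("ip")] += 1
        agents[m.group("agent")] += 1
        referers[m.group("referer")] += 1

for label, counter in (("IPs", ips), ("user agents", agents), ("referrers", referers)):
    print("Top %s:" % label)
    for value, count in counter.most_common(10):
        print("%8d  %s" % (count, value))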
This message has two purposes:
1. Is this approach acceptable? If not, are there more acceptable approaches yielding similar results?
2. Are there any theories what might keep the numbers from dropping below those 70,000 requests per day? What should I be looking for?
Thanks!
All the best, Karsten
On 18 Jul 2017, at 04:05, Karsten Loesing karsten@torproject.org wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied.
A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
This message has two purposes:
- Is this approach acceptable? If not, are there more acceptable
approaches yielding similar results?
Can you get similar results with a default apache log file, with the following changes:
- remove timestamps
- sort lines to destroy the original order
Without precise timing information, the data would be a lot less sensitive.
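For example, something like the following could run on the server before anyone looks at the file (just a sketch; the file names are placeholders):

#!/usr/bin/env python3
# Sketch of the suggested pre-processing: drop the timestamp field from each
# combined-format line and sort the result so that the original request
# order is lost. File names are placeholders.
import re

TIMESTAMP = re.compile(r"\[[^\]]+\] ")  # e.g. "[18/Jul/2017:04:05:00 +0000] "

with open("dist-snapshot.log") as src:
    lines = [TIMESTAMP.sub("", line, count=1) for line in src]

with open("dist-snapshot-stripped.log", "w") as dst:
    dst.writelines(sorted(lines))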
It might also be useful to know the distribution of requests over a 24 hour period, without any other details. This might help you work out how the activity is being triggered.
- Are there any theories what might keep the numbers from dropping
below those 70,000 requests per day? What should I be looking for?
There are 86,400 seconds in a day, which means that we're getting about 1 request per second. This could be a single bot caught in a loop.
Are you only counting GET requests? Do you count incomplete downloads? (A continually failing automated download process could cause this.)
T
--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
xmpp: teor at torproject dot org
On 2017-07-18 01:46, teor wrote:
On 18 Jul 2017, at 04:05, Karsten Loesing karsten@torproject.org wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied.
A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
This message has two purposes:
- Is this approach acceptable? If not, are there more acceptable
approaches yielding similar results?
Can you get similar results with a default apache log file, with the following changes:
- remove timestamps
Yes, we can remove timestamps. It's potentially useful information, but it's also potentially sensitive information. Let's leave it out for now, and if the remaining fields are not sufficient, let's reconsider including timestamps in some way for a possible second experiment.
- sort lines to destroy the original order
That's more difficult, because Apache doesn't have an option for that. But the order of dist.torproject.org requests is likely less sensitive than the order of www.torproject.org requests where users navigate over the site. I'd say let's leave the order unchanged to keep this experiment simple.
Without precise timing information, the data would be a lot less sensitive.
Agreed.
It might also be useful to know the distribution of requests over a 24 hour period, without any other details. This might help you work out how the activity is being triggered.
Oh, that's a good idea, too! We shouldn't do this overlapping with the currently planned experiment, but I'll put it on the list for a follow-up experiment.
- Are there any theories what might keep the numbers from dropping
below those 70,000 requests per day? What should I be looking for?
There are 86,400 seconds in a day, which means that we're getting about 1 request per second. This could be a single bot caught in a loop.
Are you only counting GET requests?
Yes.
Do you count incomplete downloads?
Yes. Apache only includes the size of the returned object, not the number of transferred bytes. What we do see though is HTTP range requests (code 206), which we do not count.
(A continually failing automated download process could cause this.)
Yes.
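In other words, the counting rule boils down to roughly the following per log line (a sketch, not the actual aggregation code; the path prefix is a made-up example):

# Sketch of the counting rule: a log line contributes to the daily initial
# download total only if it is a successful, complete GET for a Tor Browser
# package. The path prefix is a hypothetical example.
def counts_as_initial_download(method, path, status):
    return (method == "GET"
            and status == 200              # 206 range requests are excluded
            and path.startswith("/torbrowser/"))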
Thanks for your questions and ideas here!
All the best, Karsten
On Mon, Jul 17, 2017 at 08:05:30PM +0200, Karsten Loesing wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied.
A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
This message has two purposes:
- Is this approach acceptable? If not, are there more acceptable
approaches yielding similar results?
- Are there any theories what might keep the numbers from dropping
below those 70,000 requests per day? What should I be looking for?
Thanks!
All the best, Karsten
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
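Something along these lines, run by the admins before handing over the file, would do it (just a sketch; the file names are placeholders):

#!/usr/bin/env python3
# Sketch of the idea above: replace each client IP with HMAC(salt, IP),
# where the salt is generated on the server, never shown to the analyst,
# and discarded when the snapshot window is over. File names are placeholders.
import hashlib
import hmac
import os
import re

salt = os.urandom(32)                 # ephemeral; only ever lives in memory
IP_FIELD = re.compile(r"^\S+")

def pseudonymize(line):
    ip = IP_FIELD.match(line).group(0)
    tag = hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()[:16]
    return IP_FIELD.sub(tag, line, count=1)

with open("dist-snapshot.log") as src, open("dist-snapshot-hashed.log", "w") as dst:
    for line in src:
        dst.write(pseudonymize(line))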
On Mon, Jul 17, 2017 at 07:54:14PM -0400, Ian Goldberg wrote:
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
See https://www.eff.org/policy#cryptolog for how EFF does something similar. It looks like they use 24 hour intervals, and they do this all the time, but hopefully their cryptolog tool will be helpful if we opt to use it for the short term. https://github.com/efforg/cryptolog
Also, teor's question about partial downloads is a really good one: there are many "download accelerators" out there that fetch the first 5 kbytes of the file or something and then stop and do it again, over and over. In theory our current logs should be able to help there, since it should log how many bytes were fetched.
And for those wondering about our current logging approach, see
https://trac.torproject.org/projects/tor/ticket/20928
http://lists.spi-inc.org/pipermail/spi-general/2016-December/003645.html
https://anonscm.debian.org/cgit/mirror/dsa-puppet.git/tree/modules/apache2/f...
--Roger
On Mon, 17 Jul 2017 21:41:40 -0400 Roger Dingledine arma@mit.edu wrote:
On Mon, Jul 17, 2017 at 07:54:14PM -0400, Ian Goldberg wrote:
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
See https://www.eff.org/policy#cryptolog for how EFF does something similar. It looks like they use 24 hour intervals, and they do this all the time, but hopefully their cryptolog tool will be helpful if we opt to use it for the short term. https://github.com/efforg/cryptolog
Would a prefix-preserving scheme a la Crypto-PAn [0] be more useful?

[0] http://www.cc.gatech.edu/computing/Telecomm/projects/cryptopan/
Regards,
On Tue, Jul 18, 2017 at 02:11:47AM +0000, Yawning Angel wrote:
On Mon, 17 Jul 2017 21:41:40 -0400 Roger Dingledine arma@mit.edu wrote:
On Mon, Jul 17, 2017 at 07:54:14PM -0400, Ian Goldberg wrote:
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
See https://www.eff.org/policy#cryptolog for how EFF does something similar. It looks like they use 24 hour intervals, and they do this all the time, but hopefully their cryptolog tool will be helpful if we opt to use it for the short term. https://github.com/efforg/cryptolog
Would a prefix preserving scheme a la Crypto-PAn[0] be more useful? http://www.cc.gatech.edu/computing/Telecomm/projects/cryptopan/
I didn't think the subnet/prefix structure would be important to address Karsten's question, but that's a good technique to keep in mind.
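For reference, the core idea of a prefix-preserving scheme is roughly the following (a simplified illustration that uses HMAC-SHA256 as the per-prefix PRF; the real Crypto-PAn construction uses AES):

import hashlib
import hmac
import ipaddress

def prefix_preserving(ip, key):
    # Simplified illustration of prefix-preserving pseudonymization: each
    # output bit is the original bit XORed with a pseudorandom bit derived
    # from the preceding original bits, so two addresses sharing an n-bit
    # prefix map to outputs sharing an n-bit prefix.
    addr = ipaddress.ip_address(ip)
    nbits = addr.max_prefixlen
    value = int(addr)
    out = 0
    for i in range(nbits):
        prefix = value >> (nbits - i)       # the first i bits of the address
        prf = hmac.new(key, prefix.to_bytes(16, "big") + bytes([i]),
                       hashlib.sha256).digest()
        bit = (value >> (nbits - 1 - i)) & 1
        out = (out << 1) | (bit ^ (prf[0] & 1))
    return str(type(addr)(out))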
On 2017-07-18 03:41, Roger Dingledine wrote:
On Mon, Jul 17, 2017 at 07:54:14PM -0400, Ian Goldberg wrote:
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
See https://www.eff.org/policy#cryptolog for how EFF does something similar. It looks like they use 24 hour intervals, and they do this all the time, but hopefully their cryptolog tool will be helpful if we opt to use it for the short term. https://github.com/efforg/cryptolog
As answered to Ian, I'd like to keep this simple and leave out IP addresses for now.
Also, teor's question about partial downloads is a really good one: there are many "download accelerators" out there that fetch the first 5 kbytes of the file or something and then stop and do it again, over and over. In theory our current logs should be able to help there, since it should log how many bytes were fetched.
As answered to teor, we should count these download accelerators as 1 code 200 request and n code 206 requests, and since we only consider code 200 requests, we'd count them as 1 download.
All the best, Karsten
And for those wondering about our current logging approach, see
https://trac.torproject.org/projects/tor/ticket/20928
http://lists.spi-inc.org/pipermail/spi-general/2016-December/003645.html
https://anonscm.debian.org/cgit/mirror/dsa-puppet.git/tree/modules/apache2/f...
--Roger
On 2017-07-18 01:54, Ian Goldberg wrote:
On Mon, Jul 17, 2017 at 08:05:30PM +0200, Karsten Loesing wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied.
A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
This message has two purposes:
- Is this approach acceptable? If not, are there more acceptable
approaches yielding similar results?
- Are there any theories what might keep the numbers from dropping
below those 70,000 requests per day? What should I be looking for?
Thanks!
All the best, Karsten
Any chance you (i.e. a script) could replace the IP address with HASH(IP||salt) for a randomly chosen salt that you don't know, and which is deleted when the 30 minutes are up, before you get access to the log file?
Fine question. I'd like to keep this experiment simple and only use what Apache has built in. So, let's leave out IP addresses for the moment and see if the remaining fields, without timestamp and IP address, are sufficient to answer the question. We can always consider adding more fields and taking another 30-minute snapshot. And if we do that, we can see how other people have solved this problem, possibly using something similar to what you describe above.
Thanks!
All the best, Karsten
On 2017-07-17 20:05, Karsten Loesing wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied. A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
Based on the discussion here, my amended plan is to use the default Apache log file format but leave out timestamps and IP addresses.
I'll ask our sysadmins to produce such a log file some time tomorrow, unless there are further concerns/ideas.
All the best, Karsten
This message has two purposes:
- Is this approach acceptable? If not, are there more acceptable
approaches yielding similar results?
- Are there any theories what might keep the numbers from dropping
below those 70,000 requests per day? What should I be looking for?
Thanks!
All the best, Karsten
On 2017-07-19 18:50, Karsten Loesing wrote:
On 2017-07-17 20:05, Karsten Loesing wrote:
Hello list,
it's been almost two years since we started collecting sanitized Apache web server logs. During this time the number of Tor Browser initial downloads rarely went below 70,000 per day.
https://metrics.torproject.org/webstats-tb.html
Either there must be a steady demand for fresh binaries, or there is a non-zero number of bots downloading the Tor Browser binary several times per day.
I already double-checked our aggregation code that takes sanitized web server logs as input and produces daily totals as output. It looks okay to me.
I'd also like to double-check whether there's anything unexpected happening before the sanitizing step. For example, could it be that there are a few IP addresses making hundreds or thousands of requests?
Or are there lots of requests with same referrers or common user agents indicating bots?
My plan is to ask our admins to temporarily add a second Apache log file on one of the dist.torproject.org hosts with the default Apache log file format without the sanitizing that is usually applied. A snapshot of 15 or 30 minutes would likely be sufficient as sample. I'd analyze this log file on the server, delete it, and report my findings here.
Based on the discussion here, my amended plan is to use the default Apache log file format but leave out timestamps and IP addresses.
I'll ask our sysadmins to produce such a log file some time tomorrow, unless there are further concerns/ideas.
So, I did ask the admins for some logs without timestamps and without IP addresses, but those logs did not reveal anything unusual.
It might have helped to include IP addresses, but I wanted to keep this analysis simple and decided to instead ask the admins for a log with only timestamps and no further request details.
iwakeh and I looked at a few days of these timestamp-only logs. There is a daily pattern with a decline towards UTC midnight and a rise towards UTC noon.
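(The binning itself was nothing fancy; roughly the following on the timestamp-only log, with the file name and exact timestamp format being placeholders.)

#!/usr/bin/env python3
# Sketch of the hourly binning on the timestamp-only log. The file name and
# the timestamp format string are assumptions.
from collections import Counter
from datetime import datetime

hours = Counter()
with open("timestamps-only.log") as log:
    for line in log:
        ts = datetime.strptime(line.strip().strip("[]"),
                               "%d/%b/%Y:%H:%M:%S %z")
        hours[ts.hour] += 1

for hour in range(24):
    print("%02d:00  %7d" % (hour, hours[hour]))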
All in all, we did not find any hints that these download numbers are wrong. That doesn't mean they're right, but proving that is close to impossible.
Thanks to our friendly sysadmins for helping with this quick analysis!
All the best, Karsten