On 8/25/11 10:08 AM, Karsten Loesing wrote:
We have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
Here's a suggested sanitizing procedure for our web logs, which are in Apache's combined log format:
- Ignore everything except GET requests.
- Ignore all requests that resulted in a 404 status code.
- Rewrite log lines so that they only contain the following fields:
  - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests
    (as logged by our Apache configuration),
  - the request date (with the time part set to 00:00:00),
  - the requested URL (cut off at the first encountered "?"),
  - the HTTP version,
  - the server's HTTP status code, and
  - the size of the returned object.
- Write all lines from a given virtual host and day to a single output
file.
- Sort the output file alphanumerically to conceal the original order
of requests.
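A minimal Python sketch of the procedure above, for illustration only; it is not the script that produced the published logs. The names sanitize_line and sanitize_day are hypothetical. Several details are assumptions rather than part of the proposal: unparseable lines are dropped, lines whose address is not one of the two placeholder values are dropped, and the zeroed time is written with a fixed +0000 offset.

```python
import re

# Apache combined log format, up to and including the size field;
# the trailing Referer and User-Agent fields are deliberately not captured.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>\d{2}/\w{3}/\d{4}):\d{2}:\d{2}:\d{2} [+-]\d{4}\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

PLACEHOLDER_IPS = ("0.0.0.0", "0.0.0.1")  # HTTP / HTTPS, as logged by Apache


def sanitize_line(line):
    """Return the sanitized form of one log line, or None to drop it."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # assumption: unparseable lines are dropped
    if m.group("method") != "GET" or m.group("status") == "404":
        return None
    if m.group("ip") not in PLACEHOLDER_IPS:
        return None  # assumption: never pass a real address through
    url = m.group("url").split("?", 1)[0]  # cut at the first "?"
    # Assumption: the zeroed time is written with a fixed +0000 offset.
    return '%s - - [%s:00:00:00 +0000] "GET %s %s" %s %s' % (
        m.group("ip"), m.group("date"), url,
        m.group("version"), m.group("status"), m.group("size"))


def sanitize_day(lines):
    """Sanitize all lines of one virtual host and day, then sort them."""
    kept = (sanitize_line(line) for line in lines)
    return sorted(s for s in kept if s is not None)
```

The caller would pass sanitize_day the lines of one virtual host and day and write the result to that day's output file. Python's sorted() orders strings by code point, which is one reasonable reading of "alphanumerically".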
Pushing this forward. Here are the sanitized web logs that we'd like to publish on a daily basis for all our web servers and virtual domains for all of 2010 (155M):
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-01.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-02.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-03.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-04.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-05.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-06.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-07.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-08.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-09.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-10.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-11.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-12.tar
The webalizer output for www.torproject.org can be viewed here:
http://freehaven.net/~karsten/volatile/www.torproject.org-webalizer/
So. Is it safe to publish these logs on a daily basis? The same questions from my original mail apply here:
Is there still anything sensitive in that log file that we should remove? For example:
- Do the logs reveal how many pages were cached already on the
requestor's site (e.g. as repeat accesses)? Note that log files are sorted before being published.
- Are there other concerns about making these sanitized log files
publicly available?
Are the decisions to remove parts from the logs reasonable? In particular:
- Do we have to take out all requests with 404 status codes? Some of
these requests for non-existing URLs contain typos which may not be safe to make public. Should we instead put in some placeholder for the URL part and keep the 404 lines, so that we know how many 404s we have per day?
- Is there any good reason to keep the portion of a URL after a "?"?
- Is it possible to leave some part of Referers in the logs that helps
us figure out where our traffic originates and what search terms people use to find us?
- Can we resolve client IP addresses to country codes and include those
in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How would we handle countries with only a few users per day, e.g., should there be a threshold below which we consider requests to come from "a country with fewer than XY users"?
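To make the threshold idea in the last question concrete, here is a rough Python sketch of one possible approach, not a decided design. It assumes the country code for each request has already been resolved by some other means, and that "users" is approximated by distinct client IP addresses per day. The function name bucket_rare_countries and the "??" placeholder are hypothetical.

```python
from collections import defaultdict


def bucket_rare_countries(requests, threshold, placeholder="??"):
    """Replace rare country codes with a placeholder.

    requests: iterable of (client_ip, country_code) pairs for one day.
    Returns one country code per request, in input order, where every
    country seen from fewer than `threshold` distinct addresses is
    replaced by `placeholder`.
    """
    requests = list(requests)
    ips_per_country = defaultdict(set)
    for ip, country in requests:
        ips_per_country[country].add(ip)
    # Countries below the threshold are merged into a single bucket.
    rare = {c for c, ips in ips_per_country.items() if len(ips) < threshold}
    return [placeholder if country in rare else country
            for _, country in requests]
```

Only the returned country codes would go into the sanitized log; the client addresses are used for counting and then discarded.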
The next steps will be to make these sanitized logs available on a daily basis and to publish the sanitized archives from 2008, 2009, and 2011.
I'm going to wait another week (probably longer) for feedback before taking these next steps.
Best, Karsten