On 9/2/11 7:32 PM, Marsh Ray wrote:
> On 08/25/2011 03:08 AM, Karsten Loesing wrote:
>> Hi everyone,
>>
>> we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
>
> Why?
>
> I.e., what are the great benefits hoped to arise from such publication to outweigh the considerable risks?
The benefits include, for example, learning more about our website visitors so that we can make our websites more useful for them. We can also learn which packages users download, including their platforms, languages, etc., which may help us focus our efforts better. These are just two examples, but I think we agree that analyzing web logs does provide a benefit.
Our general approach to analyzing potentially sensitive data is to openly discuss the algorithm that removes the sensitive parts, make the resulting data publicly available, and analyze only those published data. Ideally, we wouldn't collect the sensitive parts at all, but sometimes that's not feasible (e.g., IP addresses in bridge descriptors, or request order in web server logs), so we need to post-process the data before publication.
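To make the post-processing idea concrete, here is a minimal sketch in Python of the kind of sanitization step described above for Apache combined-format log lines. The specific field choices (zeroing the client IP, dropping ident/user, coarsening timestamps to the day, omitting referer and user agent) are my assumptions for illustration, not the actual algorithm under discussion in #1641/#2489.

```python
import re

# Parse one Apache combined-format log line into named fields.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def sanitize(line):
    """Return a sanitized copy of a log line, or None to drop it."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # drop unparsable lines rather than risk leaking data
    # Zero out the client IP, drop ident/user, coarsen the timestamp to
    # the day, and omit referer and user agent entirely.
    day = m.group('time').split(':', 1)[0]
    return '0.0.0.0 - - [{}] "{}" {} {}'.format(
        day, m.group('request'), m.group('status'), m.group('size'))

line = ('1.2.3.4 - frank [10/Oct/2011:13:55:36 -0700] '
        '"GET /download/download.html HTTP/1.1" 200 2326 '
        '"http://example.com/" "Mozilla/5.0"')
print(sanitize(line))
# 0.0.0.0 - - [10/Oct/2011] "GET /download/download.html HTTP/1.1" 200 2326
```

Note that removing fields line by line is not sufficient on its own; as mentioned above, the request order itself can be sensitive, so a real pipeline would also need to reorder or aggregate entries before publication.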
I think the overall risk of our approach is considerably lower than that of trying to keep the data you plan to analyze private, because privately held data can always be lost or leaked.
See this paper and website for a better answer:
https://metrics.torproject.org/papers/wecsr10.pdf
https://metrics.torproject.org/formats.html
What are the considerable risks you're referring to?
Best,
Karsten