On 9/2/11 7:32 PM, Marsh Ray wrote:
> On 08/25/2011 03:08 AM, Karsten Loesing wrote:
>> Hi everyone,
>>
>> we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
>
> Why?
>
> I.e., what are the great benefits hoped to arise from such publication to outweigh the considerable risks?
The benefits include, for example, learning more about our website visitors so that we can make our websites more useful for them. We can also learn which packages users download, including their platforms, languages, etc., which may help us focus our efforts better. These are just two examples, but I think we agree that analyzing web logs does provide a benefit.
Our general approach to analyzing potentially sensitive data is to openly discuss the algorithm that removes the sensitive parts, make the resulting data publicly available, and analyze only those published data. Ideally, we wouldn't collect the sensitive parts at all, but sometimes that's not feasible (e.g., IP addresses in bridge descriptors, or request order in web server logs), so we need to post-process the data before publication.
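To make the post-processing idea concrete, here is a minimal sketch in Python of the kind of sanitization step described above for Apache combined-format log lines. The specific field choices (zeroing the client IP, dropping ident/user, coarsening timestamps to the day, omitting referer and user agent) are my assumptions for illustration, not the actual algorithm under discussion in #1641/#2489.

```python
import re

# Parse one Apache combined-format log line into named fields.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def sanitize(line):
    """Return a sanitized copy of a log line, or None to drop it."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # drop unparsable lines rather than risk leaking data
    # Zero out the client IP, drop ident/user, coarsen the timestamp to
    # the day, and omit referer and user agent entirely.
    day = m.group('time').split(':', 1)[0]
    return '0.0.0.0 - - [{}] "{}" {} {}'.format(
        day, m.group('request'), m.group('status'), m.group('size'))

line = ('1.2.3.4 - frank [10/Oct/2011:13:55:36 -0700] '
        '"GET /download/download.html HTTP/1.1" 200 2326 '
        '"http://example.com/" "Mozilla/5.0"')
print(sanitize(line))
# 0.0.0.0 - - [10/Oct/2011] "GET /download/download.html HTTP/1.1" 200 2326
```

Note that removing fields line by line is not sufficient on its own; as mentioned above, the request order itself can be sensitive, so a real pipeline would also need to reorder or aggregate entries before publication.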
I think the overall risk of our approach is considerably lower than that of trying to keep the data you plan to analyze private, because privately held data can always be lost or leaked.
See this paper and website for a better answer:
https://metrics.torproject.org/papers/wecsr10.pdf
https://metrics.torproject.org/formats.html
What are the considerable risks you're referring to?
Best,
Karsten