On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
> We have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the web servers in the first place, so sanitizing the logs should not be necessary. I would like to replace the current 0.0.0.0/0.0.0.1 scheme with a GeoIP lookup and log just the country code in place of the IP address. Apache can do this on the fly, between receiving the request and writing the log entry.
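Roughly, it could look like the snippet below (untested sketch; the module name, database path, and log format name are just illustrative, not something we've deployed):

  # Load mod_geoip and point it at a country database
  LoadModule geoip_module modules/mod_geoip.so
  GeoIPEnable On
  GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
  GeoIPOutput Env

  # Log the country code (GEOIP_COUNTRY_CODE env var set by mod_geoip)
  # in place of the client IP (%h) in the usual combined format
  LogFormat "%{GEOIP_COUNTRY_CODE}e %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" geoip_combined
  CustomLog logs/access_log geoip_combined

That way the IP address never touches the disk, and there's nothing to sanitize afterwards.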
> Is there still anything sensitive in that log file that we should remove? For example:
Referrers and requested URLs will be a nightmare to clean up. We literally get thousands of probes a day per site trying to exploit Apache (or Tomcat, or CGI, or a million other things). If we were the US military, we'd claim each probe is a hostile attack and whine about millions of attacks on our infrastructure a year. Clearly this is cyberwar and we need $3 billion to stop it or retaliate.
On the other hand, seeing the referrer data has been interesting because it tells us where our traffic originates. Our top referrers are Google and the Wikipedia pages about Tor in various languages. The search terms are also valuable if we want to buy keywords for ads some day. We've had two volunteers do this already through Google AdWords, and the results are surprising.
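For what it's worth, pulling the top referrer hosts out of a combined-format log is simple enough that we wouldn't lose much by publishing only aggregated numbers. A rough sketch (the log path and the "referrer is the second-to-last quoted field" assumption are mine, not a description of our actual scripts):

  import collections
  import re
  import sys
  from urllib.parse import urlsplit

  # Quoted fields in the combined log format: "%r" ... "%{Referer}i" "%{User-Agent}i"
  QUOTED = re.compile(r'"([^"]*)"')

  def top_referrer_hosts(log_path, n=20):
      counts = collections.Counter()
      with open(log_path, errors="replace") as log:
          for line in log:
              fields = QUOTED.findall(line)
              if len(fields) < 3:
                  continue
              referrer = fields[-2]      # second-to-last quoted field
              host = urlsplit(referrer).netloc
              if host:                   # skips "-" and empty referrers
                  counts[host] += 1
      return counts.most_common(n)

  if __name__ == "__main__":
      for host, hits in top_referrer_hosts(sys.argv[1]):
          print("%8d  %s" % (hits, host))

Something like that, run before the raw logs are thrown away, would keep the interesting part of the referrer data without us having to publish (or store) the full URLs.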