Sharif Olorin:
At what resolution is this type of netflow data typically captured?
For raw capture, timestamps are typically second-resolution. The resolution post-aggregation is a different question. Keep in mind that netflow is just the most common example; many networks don't use Cisco netflow but run something that meets the same requirements, storing somewhat more or less data (e.g., pmacct, bro).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
That's about right; some systems (e.g., pmacct in some configurations) store a four-tuple of (src,dest,tx,rx) while throwing out the ports and aggregating over the tx and rx flows such that connections can no longer be uniquely identified. What's stored from Cisco netflow is quite flexible[0]. Other systems like bro default to storing one record per connection, with all the information in a five-tuple plus things like IP TOS and byte counts.
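To make the difference concrete, here's a minimal sketch in Python of the two record shapes described above (field names are illustrative, not the actual pmacct or bro schemas):

    from dataclasses import dataclass

    # Aggregated four-tuple record, as in some pmacct configurations:
    # ports are dropped and tx/rx bytes are summed, so individual
    # connections can no longer be uniquely identified.
    @dataclass
    class AggregateRecord:
        src_ip: str
        dst_ip: str
        tx_bytes: int
        rx_bytes: int

    # Per-connection record, roughly what bro logs by default:
    # the full five-tuple plus extras such as IP TOS and byte counts.
    @dataclass
    class ConnectionRecord:
        src_ip: str
        src_port: int
        dst_ip: str
        dst_port: int
        protocol: int   # e.g. 6 (TCP) or 17 (UDP)
        ip_tos: int
        tx_bytes: int
        rx_bytes: int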
Are timestamps always included?
Yes, to some granularity (there's not much point in storing connection info without times, for any of the reasons people normally store connection info). The most recent system I set up (bro) records connections with second-precision timestamps; the one before that (pmacct) stored ten-second aggregates of (src,dest,tx,rx).
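To illustrate the pmacct-style case, here's a rough sketch (illustrative only, not pmacct's actual implementation) of bucketing flow byte counts into ten-second windows keyed by (src, dest):

    from collections import defaultdict

    WINDOW = 10  # seconds; assumed aggregation interval for illustration

    def aggregate(flow_updates):
        """flow_updates: iterable of (timestamp, src_ip, dst_ip, tx_bytes, rx_bytes).
        Returns {(window_start, src_ip, dst_ip): [tx_total, rx_total]}."""
        buckets = defaultdict(lambda: [0, 0])
        for ts, src, dst, tx, rx in flow_updates:
            window_start = int(ts) - (int(ts) % WINDOW)
            buckets[(window_start, src, dst)][0] += tx
            buckets[(window_start, src, dst)][1] += rx
        return dict(buckets)

In that scheme, all that survives is per-window byte totals between address pairs, not per-connection behaviour.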
So in the bro-based system (which sounds higher resolution) the final logged data was second-precision timestamps on full connection tuples?
So if I have a connection to a Tor Guard node open for 8 hours, at the end of the session, would your system record a single record with: (my_ip,my_port,guard_ip,guard_port,tx,rx,timestamp_open,timestamp_close)?
Or would it record 8*60*60 == 28800 records, with one record stored per second that the connection was open/active?
I think for various reasons (including this one) we're going to want some degree of padding traffic on the Tor network relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might design padding and connection usage against this and other issues.
arma or others can probably explain why this is a hard problem; I don't know enough in this area to comment.
I think any system that stores connection-level data (as opposed to one record per timeslice of activity on a tuple) is likely to be rather easy to defend against correlation, since such records reduce a connection to its endpoints, byte totals, and open/close times rather than its fine-grained timing.
I also think that systems which store only sampled data will be very easy to defend against correlation. Murdoch's seminal IX-analysis work required 100-500M transfers to get any accuracy out of sample-based correlation at all, and even then false positives were a serious problem, even when correlating a small number of connections.
We have a huge problem right now where all of the research in this area has claimed extremely high success rates while sweeping any mitigating factors under the rug (especially false positives and the effects of large numbers of concurrent users or additional activity).
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
I don't think UDP helps you at all here. What makes you think it might?
Well, with UDP it seems harder to store a full connection tuple from open until close, because you have no idea when the connection actually closed (unless you are recording a tuple for every second during which there is any activity, or similar).
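One common way collectors handle the "or similar" case is an inactivity timeout: a UDP "flow" is flushed after some period without packets, so the recorded open/close times are really first-seen/last-seen bounded by the collector's timeout. A minimal sketch of that logic (assumed timeout value; not any particular collector's implementation):

    IDLE_TIMEOUT = 30  # seconds without packets before a UDP "flow" is flushed (assumed)

    def expire_idle_flows(active_flows, now):
        """active_flows: {five_tuple: {'first_seen': t, 'last_seen': t, 'bytes': n}}.
        Returns records for flows idle longer than IDLE_TIMEOUT and removes them."""
        expired = []
        for tup, state in list(active_flows.items()):
            if now - state['last_seen'] >= IDLE_TIMEOUT:
                expired.append((tup, state['first_seen'], state['last_seen'], state['bytes']))
                del active_flows[tup]
        return expired

Under that kind of scheme, a long-lived but intermittently active UDP stream ends up as several records split at the collector's timeout rather than one open-to-close record.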