Sharif Olorin:
At what resolution is this type of netflow data typically captured?
For raw capture, timestamps are typically second-resolution. The resolution post-aggregation is a different question. Keep in mind that netflow is just the most common example; many networks don't use Cisco netflow but run something that meets the same requirements, storing somewhat more or less data (e.g., pmacct, bro).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
That's about right; some systems (e.g., pmacct in some configurations) store a four-tuple of (src,dest,tx,rx) while throwing out the ports and aggregating over the tx and rx flows such that connections can no longer be uniquely identified. What's stored from Cisco netflow is quite flexible[0]. Other systems like bro default to storing one record per connection, with all the information in a five-tuple plus things like IP TOS and byte counts.
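To make the difference concrete, here's a minimal sketch in Python of the two record shapes described above (field names are illustrative, not the actual pmacct or bro schemas):

    from dataclasses import dataclass

    # Aggregated four-tuple record, as in some pmacct configurations:
    # ports are dropped and tx/rx bytes are summed, so individual
    # connections can no longer be uniquely identified.
    @dataclass
    class AggregateRecord:
        src_ip: str
        dst_ip: str
        tx_bytes: int
        rx_bytes: int

    # Per-connection record, roughly what bro logs by default:
    # the full five-tuple plus extras such as IP TOS and byte counts.
    @dataclass
    class ConnectionRecord:
        src_ip: str
        src_port: int
        dst_ip: str
        dst_port: int
        protocol: int   # e.g. 6 (TCP) or 17 (UDP)
        ip_tos: int
        tx_bytes: int
        rx_bytes: int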
Are timestamps always included?
Yes, to some granularity (there's not much point in storing connection info without times, for any of the reasons people normally store connection info). The most recent system I set up (bro) records connections with second-precision timestamps; the one before that (pmacct) stored ten-second aggregates of (src,dest,tx,rx).
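To illustrate the pmacct-style case, here's a rough sketch (illustrative only, not pmacct's actual implementation) of bucketing flow byte counts into ten-second windows keyed by (src, dest):

    from collections import defaultdict

    WINDOW = 10  # seconds; assumed aggregation interval for illustration

    def aggregate(flow_updates):
        """flow_updates: iterable of (timestamp, src_ip, dst_ip, tx_bytes, rx_bytes).
        Returns {(window_start, src_ip, dst_ip): [tx_total, rx_total]}."""
        buckets = defaultdict(lambda: [0, 0])
        for ts, src, dst, tx, rx in flow_updates:
            window_start = int(ts) - (int(ts) % WINDOW)
            buckets[(window_start, src, dst)][0] += tx
            buckets[(window_start, src, dst)][1] += rx
        return dict(buckets)

In that scheme, all that survives is per-window byte totals between address pairs, not per-connection behaviour.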
So in the bro-based system (which sounds higher resolution) the final logged data was second-precision timestamps on full connection tuples?
So if I have a connection to a Tor Guard node open for 8 hours, at the end of the session, would your system record a single record with: (my_ip,my_port,guard_ip,guard_port,tx,rx,timestamp_open,timestamp_close)?
Or would it record 8*60*60 == 28800 records, with one record stored per second that the connection was open/active?
I think for various reasons (including this one) we're going to want some degree of padding traffic on the Tor network relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might design padding and connection usage against this and other issues.
arma or others can probably explain why this is a hard problem; I don't know enough in this area to comment.
I think any system that stores connection-level data (as opposed to one record per timeslice of activity on a tuple) is likely to be rather easy to defend against correlation, since such records reduce a connection to its endpoints, byte totals, and open/close times rather than its fine-grained timing.
I also think that systems which store only sampled data will be very easy to defend against correlation. Murdoch's seminal IX-analysis work required 100-500M transfers to get any accuracy out of sample-based correlation at all, and even then false positives were a serious problem, even when correlating a small number of connections.
We have a huge problem right now where all of the research in this area has claimed extremely high success rates while sweeping any mitigating factors under the rug (especially false positives and the effects of large numbers of concurrent users or additional activity).
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
I don't think UDP helps you at all here. What makes you think it might?
Well, with UDP it seems harder to store a full connection tuple from open until close, because you have no idea when the connection actually closed (unless you are recording a tuple for every second during which there is any activity, or similar).
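One common way collectors handle the "or similar" case is an inactivity timeout: a UDP "flow" is flushed after some period without packets, so the recorded open/close times are really first-seen/last-seen bounded by the collector's timeout. A minimal sketch of that logic (assumed timeout value; not any particular collector's implementation):

    IDLE_TIMEOUT = 30  # seconds without packets before a UDP "flow" is flushed (assumed)

    def expire_idle_flows(active_flows, now):
        """active_flows: {five_tuple: {'first_seen': t, 'last_seen': t, 'bytes': n}}.
        Returns records for flows idle longer than IDLE_TIMEOUT and removes them."""
        expired = []
        for tup, state in list(active_flows.items()):
            if now - state['last_seen'] >= IDLE_TIMEOUT:
                expired.append((tup, state['first_seen'], state['last_seen'], state['bytes']))
                del active_flows[tup]
        return expired

Under that kind of scheme, a long-lived but intermittently active UDP stream ends up as several records split at the collector's timeout rather than one open-to-close record.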