Mike,
At what resolution is this type of netflow data typically captured?
For raw capture, timestamps typically have one-second resolution. The resolution after aggregation is a different question. Keep in mind that netflow is just the most common example; many networks don't use Cisco netflow, but have something that meets the same requirements and stores somewhat more or less data (e.g., pmacct, bro).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
That's about right; some systems (e.g., pmacct in some configurations) store a four-tuple of (src,dest,tx,rx) while throwing out the ports and aggregating over the tx and rx flows such that connections can no longer be uniquely identified. What's stored from Cisco netflow is quite flexible[0]. Other systems like bro default to storing one record per connection, with all the information in a five-tuple plus things like IP TOS and byte counts.
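To make the loss of per-connection detail concrete, here is a minimal Python sketch of that kind of aggregation. The record layout and field names are hypothetical, chosen only for illustration; this is not pmacct's actual schema or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnRecord:
    # Hypothetical per-connection record; field names are illustrative only.
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    tx_bytes: int
    rx_bytes: int


def aggregate_pairs(records):
    """Collapse per-connection records into (src, dst) -> (tx, rx) totals.

    Ports and protocol are discarded, so connections between the same
    two hosts are merged and can no longer be told apart.
    """
    totals = defaultdict(lambda: [0, 0])
    for r in records:
        totals[(r.src, r.dst)][0] += r.tx_bytes
        totals[(r.src, r.dst)][1] += r.rx_bytes
    return {pair: (tx, rx) for pair, (tx, rx) in totals.items()}


records = [
    ConnRecord("10.0.0.1", 40000, "192.0.2.7", 443, "tcp", 1200, 5000),
    ConnRecord("10.0.0.1", 40001, "192.0.2.7", 443, "tcp", 300, 700),
    ConnRecord("10.0.0.2", 51000, "192.0.2.7", 80, "tcp", 100, 900),
]
pair_totals = aggregate_pairs(records)
```

The two connections from 10.0.0.1 end up as a single entry, which is the sense in which connections stop being uniquely identifiable.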
Are timestamps always included?
Yes, to some granularity (there's not much point in storing connection info without times, for any of the reasons people normally store connection info). The most recent system I set up (bro) records connections with second-precision timestamps; the one before that (pmacct) stored (src,dest,tx,rx) aggregates over ten-second intervals.
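As a rough illustration of what ten-second aggregation does to timing information, the Python sketch below buckets second-resolution samples into fixed intervals. The sample tuple layout and the function name are assumptions made up for this example, not the format of any particular collector.

```python
from collections import defaultdict


def aggregate_by_interval(samples, width=10):
    """Sum (tx, rx) per (interval_start, src, dst).

    Each sample is assumed to be (timestamp_seconds, src, dst, tx, rx).
    Timestamps are rounded down to the start of their interval, so
    anything finer than `width` seconds is lost.
    """
    totals = defaultdict(lambda: [0, 0])
    for ts, src, dst, tx, rx in samples:
        bucket = int(ts) - int(ts) % width
        totals[(bucket, src, dst)][0] += tx
        totals[(bucket, src, dst)][1] += rx
    return {key: (tx, rx) for key, (tx, rx) in totals.items()}


samples = [
    (1000, "10.0.0.1", "192.0.2.7", 100, 400),
    (1004, "10.0.0.1", "192.0.2.7", 50, 60),
    (1011, "10.0.0.1", "192.0.2.7", 10, 20),
]
interval_totals = aggregate_by_interval(samples)
```

Here the samples at 1000 and 1004 fall into the same interval and are summed, while the one at 1011 lands in the next interval.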
Are bidirectional transfer bytecounts always included?
You mean the number tx + rx, or the tuple tx,rx as opposed to just tx or rx? It's almost always the second one (tx,rx).
Are subsampled packet headers (or contents) sometimes/often included?
Contents storage is rare. Some universities store enough data to reconstruct most packets[1]; other ISPs usually don't. When full connection data is stored, it's deleted pretty fast (days or weeks at most).
Storing a subset of data from packet headers (ports, TOS) is very common, as is keeping counts of things like checksum mismatches.
What about UDP sessions? IPv6?
UDP is treated the same as TCP. IPv6 is the same as IPv4. ICMP et cetera are often stored too; these systems are normally thinking more in terms of IP packets than TCP segments or UDP datagrams.
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
arma or others can probably explain why this is a hard problem; I don't know enough in this area to comment.
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
I don't think UDP helps you at all here. What makes you think it might?
Sharif
[0] http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper...
[1] https://www.bro.org/community/time-machine.html