Hi Jesse,
I found your email address in the descriptors of exit relays run at USU [1].
There has been an interesting comment on a recent boingboing post [2] that implies that USU exit relays store a significant amount of logs, without going into much detail:
dweller_below August 5, 2015
A very similar thing happened to USU. We received a summons from Homeland/ICE to produce 3 months of records (plus identifying info) for an IP that was one of our TOR exit nodes.
I eventually managed to contact the Special Agent in charge of the investigation. He turned out to be a reasonable person. I explained that the requested info was an extremely active TOR exit node. I said that we had extracted and filtered the requested data, it was 90 4 gig files (for a total of 360 gigs of log files) or about 3.2 billion log entries.
[...]
If you can confirm that the comment is authentic, I'd be interested in what kind of Tor-related data you are logging at your exit relays and why.
thanks, nusenu
[1] https://atlas.torproject.org/#search/contact:C20BEC80
[2] http://boingboing.net/2015/08/04/what-happened-when-the-fbi-sub.html
On Sat, Aug 8, 2015 at 2:03 AM, nusenu nusenu@openmailbox.org wrote:
that implies that USU exit relays store a significant amount of logs
node. I said that we had extracted and filtered the requested data, it was 90 4 gig files (for a total of 360 gigs of log files) or about 3.2 billion log entries.
If you can confirm that the comment is authentic, I'd be interested in what kind of Tor-related data you are logging at your exit relays and why.
It's most likely netflow logs. Quite popular in Uni / regional ISP environments. People collect them for network stats, and to track down "security incidents". (Such logs, by their very existence, and in the absence of a very strong policy, also generally attract do-gooder suck-up to whoever comes calling for them, be it their own internal network / security / employment / political queries, or external parties.) The USA has no EU-style logging requirements, nor really laws beyond the mashup of wiretap / FERPA / PCI type stuff and internal "policy". Some might not keep anything; some 1wk / 1mo / 90d / 6mo / 1y or more, or whatever the disks can hold. Netstats can be aggregated down, but the other raw uses typically retain under "well, better keep them just in case". Once you start, it's really hard to stop. Some places do in fact have really good policies, either from the start or after years of debate.
If you can confirm that the comment is authentic, I'd be interested in what kind of Tor-related data you are logging at your exit relays and why.
It's most likely netflow logs. Quite popular in Uni / regional ISP environments. People collect them for network stats, and to track down "security incidents".
Yep. I worked as a network engineer for a large public university in the US. They logged everything they could. There was at least 30 days' worth of netflow, plus packet inspection of certain flows.
There was little to no policy regarding access to the logs by staff. Law enforcement requests went through the university's legal counsel.
I would expect most US universities to be logging netflow at the very least. Even if the Tor operator isn't keeping logs, it seems safe to assume the network operator is.
I would expect most US universities to be logging netflow at the very least. Even if the Tor operator isn't keeping logs, it seems safe to assume the network operator is.
I'd be surprised if it was different for non-US universities - I'd expect this to be the case for every university with its own AS, and probably most without. It's not specific to universities either; it would be a rare ISP that doesn't retain netflow for traffic accounting purposes. It's often somewhat aggregated, but to varying degrees - the last such system I worked on was designed to retain indefinitely at sub-minute granularity for training/cross-validation of network anomaly detection.
I'd be curious to know if anyone is running a relay that's not logged at all within its own AS; it seems like it'd be out of the reach of most operators, unless they have a friendly employer.
Sharif
Hi.
On 08/09/2015 07:44 AM, Sharif Olorin wrote:
I would expect most US universities to be logging netflow at the very least. Even if the Tor operator isn't keeping logs, it seems safe to assume the network operator is.
I'd be surprised if it was different for non-US universities - I'd expect this to be the case for every university with its own AS, and probably most without. It's not specific to universities either; it would be a rare ISP that doesn't retain netflow for traffic accounting purposes.
Perhaps we're entering a time when universities need to be producing transparency reports...
It also seems that since there is significant incentive to run exits in order to gain "traffic visibility," we need some sort of competing incentive. I don't know what that is, however. Or perhaps "extensive logging at exits" needs to be part of a more honest overview of Tor.
hope everyone is well
tim
++ 09/08/15 06:44 +0000 - Sharif Olorin:
I'd be curious to know if anyone is running a relay that's not logged at all within its own AS; it seems like it'd be out of the reach of most operators, unless they have a friendly employer.
Up until now, my host hasn't done anything like netflow - but I am pretty sure that will change sooner or later (but even then I don't expect that data to be retained for more than seconds).
Sharif Olorin:
I would expect most US universities to be logging netflow at the very least. Even if the Tor operator isn't keeping logs, it seems safe to assume the network operator is.
I'd be surprised if it was different for non-US universities - I'd expect this to be the case for every university with its own AS, and probably most without. It's not specific to universities either; it would be a rare ISP that doesn't retain netflow for traffic accounting purposes. It's often somewhat aggregated, but to varying degrees - the last such system I worked on was designed to retain indefinitely at sub-minute granularity for training/cross-validation of network anomaly detection.
Green & Sharif (& any others with direct netflow experience) -
At what resolution is this type of netflow data typically captured?
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
Are timestamps always included? Are bidirectional transfer bytecounts always included? Are subsampled packet headers (or contents) sometimes/often included?
What about UDP sessions? IPv6?
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network at some point relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
Thanks a bunch!
On Wed, Aug 12, 2015 at 7:45 PM, Mike Perry mikeperry@torproject.org wrote:
At what resolution is this type of netflow data typically captured?
Routers originally exported at 100% coverage, then many of them started supporting sampling at various rates (because routers were choking and buggy anyway, and netheads were happy with averages); some only do sampling. Plug flow probes into network taps and you can do whatever you want (netsec loves this and other tools).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
Are timestamps always included? Are bidirectional transfer bytecounts always included? Are subsampled packet headers (or contents) sometimes/often included?
What about UDP sessions? IPv6?
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
All of the above depends on which flow export version / aggregation you choose, until you get to v9 and IPFIX, for which you can define your fields. In short... yes.
Flow end time is the last matching packet seen, but a flow can span records when the time- (therefore space-, i.e. RAM-) limited mandatory expiry timers hit. UDP goes via that; TCP usually via flags. Records can span flows for which other semantic keys may not exist, as is often the case with UDP. But DPI can also be used in the exporter to do all sorts of fun stuff and enable other downstream uses (obviously TLS / IPSEC / crypto break some things there).
Tor already bundles multiple logical flows (only TCP for users today) into some number of physical TCP flows, so a UDP transport there might not need anything special. But consider looking at average flow lifetimes on the internet. There may be a case for going longer, bundling or turfing across a range of ports to falsely trigger a record / bloat, packet switching, and so forth.
and having more information about what is typically recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
"Typical" is really defined by the use case of whoever needs the flows, be it provisioning, engineering, security, operations, billing, bigdata, etc. And only limited by the available formats, storage, postprocessing, and customization. IPFIX and
https://en.wikipedia.org/wiki/NetFlow
https://en.wikipedia.org/wiki/IP_Flow_Information_Export
https://www.google.com/search?q=(netflow%7CIPFIX)+(probe%7Cexporter%7Cparser)
http://www.freebsd.org/cgi/man.cgi?query=ng_netflow&sektion=4
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network at some point relatively soon
Really? I can haz cake nao? Or only after I pump in this 3k email and watch 3k come out the other side to someone otherwise idling ;)
https://cdn.plixer.com/images/slider-3-icon.png ... and/or some other bigdata systems ...
grarpamp:
On Wed, Aug 12, 2015 at 7:45 PM, Mike Perry mikeperry@torproject.org wrote:
At what resolution is this type of netflow data typically captured?
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
All of the above depends on which flow export version / aggregation you choose, until you get to v9 and IPFIX, for which you can define your fields. In short... yes.
But consider looking at average flow lifetimes on the internet. There may be a case for going longer, bundling or turfing across a range of ports to falsely trigger a record / bloat, packet switching, and so forth.
This interests me, but we need more details to determine what this looks like in practice.
I suspect that this is one case where the switch to one guard may have helped us. However, Tor still closes the TCP connection after just one hour of inactivity. What if we kept it open longer? Or what if the first hop was an encrypted UDP-based PT, where it was not clear if the session was torn down or closed?
recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
"Typical" is really defined by the use case of whoever needs the flows, be it provisioning, engineering, security, operations, billing, bigdata, etc. And only limited by the available formats, storage, postprocessing, and customization. IPFIX and
"Typically", I appreciate your answers grarpamp. They're "typically" correct, but sometimes they have more flavor than I'm looking for, and in this case I am worried it may end up silencing the people I'd really like to hear from. I want real data from the field, here. Not speculation on what is possible.
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network at some point relatively soon
Really? I can haz cake nao? Or only after I pump in this 3k email and watch 3k come out the other side to someone otherwise idling ;)
You can say that, but then why isn't this being done in the real world? The Snowden leaks seem to indicate exploitation is the weapon of choice.
I suspect other factors are at work that prevent dragnet correlation from being reliable, in addition to the economics of exploits today (which may be subject to change). These factors are worth investigating in detail, and ideally before the exploit cost profiles change.
As such, I still look forward to hearing from someone who has worked at an ISP/University/etc where this is actually practiced. What is in *those* logs?
Specifically: Can we get someone (hell, anyone really) from Utah to weigh in on this one? ;)
Otherwise, the rest is just paranoid speculation, and bordering on trolled-up misinformation. :/
On Thu, Aug 13, 2015 at 3:40 AM, Mike Perry mikeperry@torproject.org wrote:
But consider looking at average flow lifetimes on the internet. There may be a case for going longer, bundling or turfing across a range of ports to falsely trigger a record / bloat, packet switching, and so forth.
This interests me, but we need more details to determine what this looks like in practice.
The NANOG list could link specific papers regarding the nature of the internet. The various flow exporters have sensible default timeouts that tend to cover that OK for the purposes intended.
I suspect that this is one case where the switch to one guard may have helped us.
In that various activities such as ssh, browsing, youtube, whatever are confined to being multiplexed in one stream, that makes sense.
However, Tor still closes the TCP connection after just one hour of inactivity. What if we kept it open longer?
The exporting host's open flow count is limited by memory (RAM). A longer flow might be forced to span two or more records. The "flags" field of some tools and versions may not mark a SYN seen in records 2+; the rest of the tuple would stay the same. Active timeout gives periodic data on longer flows, typically retaining the start time, but implementations can vary on state.
Here's an early IOS 12 default...
Active flows timeout in 30 minutes (1~60)
Inactive flows timeout in 15 seconds (10~600)
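To see what those timers do to a long-lived Tor connection, here's a toy Python model (my own sketch, using the IOS defaults above; real exporters differ, e.g. on whether the original start time is retained across splits, and the function names are made up):

# Toy model of classic NetFlow expiry, using the IOS 12 defaults above.
ACTIVE_TIMEOUT = 30 * 60   # seconds: export even a busy flow this old
INACTIVE_TIMEOUT = 15      # seconds: export a flow idle this long

def flow_records(packet_times):
    """Return (start, end) spans, one per exported flow record."""
    records = []
    start = last = packet_times[0]
    for t in packet_times[1:]:
        if t - last > INACTIVE_TIMEOUT or t - start > ACTIVE_TIMEOUT:
            records.append((start, last))  # timer hit: record exported
            start = t                      # same 5-tuple, fresh record
        last = t
    records.append((start, last))          # final flush
    return records

# A client chattering every 10s for two hours: one TCP connection, one
# 5-tuple, but the active timer still cuts it into periodic records,
# each with fresh byte counts and timestamps.
times = list(range(0, 2 * 3600, 10))
print(len(flow_records(times)))  # -> 4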
Also consider what is wished to hide: big iso download, little http clicks, start time of some characteristic session rippling across or appearing at edges, active data pumping attack. And what custom flowish things and flow settings an adversary might be doing to observe those. Traditional netflow seems useful as an idea base to form a better heuristic analysis system.
Or what if the first hop was an encrypted UDP-based PT, where it was not clear if the session was torn down or closed?
In old netflow, UDP, where the src/dst IP and port tuple is the same, just times out into a record. Some new exotic DPI might form a session-context flow record based on the application inside; crypto would stop the cleartext portion of DPI.
Some flow tools don't reassemble, so the frag game might slide by them in an arms race.
in this case I am worried it may end up silencing the people I'd really like to hear from. I want real data from the field, here.
There's no censor here. Other operators in the field can and should speak up on topic (and feel free to bash any of my errors or lack of paper / code / data posting).
You can say that, but then why isn't this being done in the real world? The Snowden leaks seem to indicate exploitation is the weapon of choice.
The Snowden leaks and Bamford also indicate NSA-UTAH, gigawatts, massive international cable tapping, CARNIVORE, NARUS, X-WHATEVER, etc. Exploitation could use a few offices full of hackers and some good peering points and hosts, not that huge $billions level of infrastructure and outside-your-borders expenditure. Though blackbagging to stand up an IP in the target net via an active tap could be useful, cost being the driver to localize.
I suspect other factors are at work that prevent dragnet correlation from being reliable, in addition to the economics of exploits today (which may be subject to change). These factors are worth investigating in detail, and ideally before the exploit cost profiles change.
Yes, bigdata still has to conform to at least some kind of cost benefit analysis and substantiation too. Where is the line on that these days... 2^40, 2^56, 2^80...?
As such, I still look forward to hearing from someone who has worked at an ISP/University/etc where this is actually practiced. What is in *those* logs?
The questions were of a general "intro to netflow" nature, thus the links; they and other resources describe all the data fields, formation of records, timeouts, aggregation, IPFIX extensibility, etc. Others and I on these lists know what "360 gigs" of netflow looks like. *What* specific info are you looking for beyond that? Applicability to exploit? Should people print out flows from a degenerate client-and-exit use case and line them up?
More tools and tech...
http://www.cisco.com/c/en/us/products/ios-nx-os-software/ios-netflow/index.h...
nfdump, silk
https://en.wikipedia.org/wiki/Network_Based_Application_Recognition
More generally, netflow is not the only passive analysis tool or idea out there. So focusing on it may be narrow, though it is a popular tool.
Specifically: Can we get someone (hell, anyone really) from Utah to weigh in on this one? ;)
Well, I'll speculate that by now they've lawyered up and silenced their OP such that it's unlikely that we'll hear from them (short of coderman firing a broadside FOIA at them ;) (We're speculating it's 360 gigs of netflow.)
Otherwise, the rest is just paranoid speculation, and bordering on trolled-up misinformation. :/
https://www.youtube.com/watch?v=rh1Amz8s8MI http://www.multivax.com/last_question.html
Everybody's now asking questions about what is possible; my job is done ;)
grarpamp:
On Thu, Aug 13, 2015 at 3:40 AM, Mike Perry mikeperry@torproject.org wrote:
However, Tor still closes the TCP connection after just one hour of inactivity. What if we kept it open longer?
The exporting host's open flow count is limited by memory (RAM). A longer flow might be forced to span two or more records. The "flags" field of some tools and versions may not mark a SYN seen in records 2+; the rest of the tuple would stay the same. Active timeout gives periodic data on longer flows, typically retaining the start time, but implementations can vary on state.
Here's an early IOS 12 default...
Active flows timeout in 30 minutes (1~60)
Inactive flows timeout in 15 seconds (10~600)
This is helpful. To clarify, when a record is split due to timeout, will a new record have the start and end timestamps for the new flow?
Do collectors tend to recombine these split flows?
Otherwise, from these defaults, it sounds like Tor's one-hour timeout on client TLS connections is reasonable, and perhaps not worth raising, since even if we were using padding and keep-alives, the flow data would still record a fresh byte count record + timestamp every 30 minutes?
As such, I still look forward to hearing from someone who has worked at an ISP/University/etc where this is actually practiced. What is in *those* logs?
The questions were of a general "intro to netflow" nature, thus the links; they and other resources describe all the data fields, formation of records, timeouts, aggregation, IPFIX extensibility, etc. Others and I on these lists know what "360 gigs" of netflow looks like.
Well, right, then. Let's get to the meat of it.
*What* specific info are you looking for beyond that?
I am looking to understand what "360 gigs" aka "(3.2 billion records)" of netflow over 3 months looks like, and also if we can expect this to be standard practice, somewhat outside the norm, or indicative of someone who has specifically tuned their netflow config to attack Tor (should the opportunity arise).
Assuming the boingboing comment is accurate, and it's just one exit IP, then we're probably looking at two exits' worth of data (either UtahStateExit0+UtahStateExit1, or UtahStateExit2+UtahStateExit3).
Each of these exit pairs appears to have averaged a little over 10Mbit/sec sustained over the most recent 3 month period according to https://globe.torproject.org. The exits are running some version of the Reduced Exit Policy, so there should be no bittorrent traffic. Likely mostly web traffic by connection count, and probably even by byte count.
In three months, there are 7,776,000 seconds. So we're looking at about 412 records per second in this dataset.
For 10Mbit/sec worth of sustained web traffic, that sounds like roughly connection-level resolution to me. Do you agree?
Mike Perry:
grarpamp:
The questions were of a general "intro to netflow" nature, thus the links; they and other resources describe all the data fields, formation of records, timeouts, aggregation, IPFIX extensibility, etc. Others and I on these lists know what "360 gigs" of netflow looks like.
Well, right, then. Let's get to the meat of it.
*What* specific info are you looking for beyond that?
I am looking to understand what "360 gigs" aka "(3.2 billion records)" of netflow over 3 months looks like, and also if we can expect this to be standard practice, somewhat outside the norm, or indicative of someone who has specifically tuned their netflow config to attack Tor (should the opportunity arise).
Assuming the boingboing comment is accurate, and it's just one exit IP, then we're probably looking at two exits' worth of data (either UtahStateExit0+UtahStateExit1, or UtahStateExit2+UtahStateExit3).
Each of these exit pairs appears to have averaged a little over 10Mbit/sec sustained over the most recent 3 month period according to https://globe.torproject.org. The exits are running some version of the Reduced Exit Policy, so there should be no bittorrent traffic. Likely mostly web traffic by connection count, and probably even by byte count.
In three months, there are 7,776,000 seconds. So we're looking at about 412 records per second in this dataset.
For 10Mbit/sec worth of sustained web traffic, that sounds like roughly connection-level resolution to me. Do you agree?
(Yay! Thinking once and posting two posts at once to three different lists. I'm like some kind of Internet champion! ;)
I think I needed to do one more division. This is roughly one record per 3KB of traffic (which I think you alluded to earlier). Rather high if we expect this to be web traffic, even if there was only 1 web request per connection.
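For anyone checking my arithmetic, here's the whole envelope in a few lines of Python (inputs are the Globe figures quoted above; the variable names are mine):

records = 3.2e9                      # claimed log entries over ~3 months
seconds = 90 * 24 * 3600             # = 7,776,000 seconds
rate_bytes = 10e6 / 8                # ~10 Mbit/s sustained = 1.25 MB/s

rec_per_sec = records / seconds      # ~412 records/second
bytes_per_record = rate_bytes / rec_per_sec

print(round(rec_per_sec), round(bytes_per_record))  # 412 3038, i.e. ~3KB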
So then, what is the most likely configuration that would generate this many records? Is it indeed likely to be some BOFH scenario, or might there be some common (if half-insane) policy that ends up producing this many records?
Here's Globe for UtahStateExit2 and 3 for easy access:
https://globe.torproject.org/#/relay/B4E641BC42DDB6FD2526CFF80504AB5221B0EB8...
https://globe.torproject.org/#/relay/7E4E1CC167300932F05AC70ECD2B9A298732C6E...
The bandwidth histories have no current data, but you can click on the 3 month tab to get the numbers I used.
On Thu, Aug 13, 2015 at 07:39:45PM -0700, Mike Perry wrote:
Otherwise, from these defaults, it sounds like Tor's one-hour timeout on client TLS connections is reasonable, and perhaps not worth raising, since even if we were using padding and keep-alives, the flow data would still record a fresh byte count record + timestamp every 30 minutes?
Also check out https://trac.torproject.org/projects/tor/ticket/6799#comment:6 which got merged into Tor 0.2.5.5-alpha: https://gitweb.torproject.org/tor.git/tree/ChangeLog?id=tor-0.2.5.5-alpha#n9 where we randomize the time before we close an idle TLS conn.
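A toy illustration of the idea in Python (the parameters here are made up for illustration only; the ChangeLog above has the real merged behavior):

import random

BASE_IDLE_TIMEOUT = 60 * 60  # the old fixed 1-hour close

def pick_idle_timeout():
    # Hypothetical spread: draw each connection's close time uniformly
    # around the base, so a watcher can't key on a constant teardown
    # interval in flow records.
    return random.uniform(0.5 * BASE_IDLE_TIMEOUT, 1.5 * BASE_IDLE_TIMEOUT)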
--Roger
grarpamp:
On Thu, Aug 13, 2015 at 3:40 AM, Mike Perry mikeperry@torproject.org wrote:
But consider looking at average flow lifetimes on the internet. There may be a case for going longer, bundling or turfing across a range of ports to falsely trigger a record / bloat, packet switching, and so forth.
This interests me, but we need more details to determine what this looks like in practice.
The NANOG list could link specific papers regarding the nature of the internet. The various flow exporters have sensible default timeouts that tend to cover that OK for the purposes intended.
I suspect that this is one case where the switch to one guard may have helped us.
In that various activities such as ssh, browsing, youtube, whatever are confined to being multiplexed in one stream, that makes sense.
However, Tor still closes the TCP connection after just one hour of inactivity. What if we kept it open longer?
The exporting host's open flow count is limited by memory (RAM). A longer flow might be forced to span two or more records. The "flags" field of some tools and versions may not mark a SYN seen in records 2+; the rest of the tuple would stay the same. Active timeout gives periodic data on longer flows, typically retaining the start time, but implementations can vary on state.
Here's an early IOS 12 default...
Active flows timeout in 30 minutes (1~60)
Inactive flows timeout in 15 seconds (10~600)
Also consider what is wished to hide: big iso download, little http clicks, start time of some characteristic session rippling across or appearing at edges, active data pumping attack. And what custom flowish things and flow settings an adversary might be doing to observe those. Traditional netflow seems useful as an idea base to form a better heuristic analysis system.
I submitted a proposal to tor-dev describing a simple defense against this default configuration: https://lists.torproject.org/pipermail/tor-dev/2015-August/009326.html
I'm also working on an implementation of that defense: https://trac.torproject.org/projects/tor/ticket/16861
Anyone with netflow experience should feel free to chime in there (or here if you are not subscribed to tor-dev), but please be mindful of the adversarial considerations in section 3 (unless you believe that adversary model to be invalid, but please explain why).
On Fri, Aug 21, 2015 at 12:30 AM, Mike Perry mikeperry@torproject.org wrote:
I submitted a proposal to tor-dev describing a simple defense against this default configuration: https://lists.torproject.org/pipermail/tor-dev/2015-August/009326.html
nProbe should be added to the router list; it's a very popular opensource IPFIX / netflow tap. http://www.ntop.org/products/netflow/nprobe/
For those into researching other flow capabilities... There are also some probes in OS kernels and some other opensource taps; they're not as well known or utilized as nProbe. Other large hardware vendors include Brocade, Avaya, Huawei, and Alcatel-Lucent.
Lots of SDN and monitoring projects can plug in with gear like this, because, FTW...
http://telesoft-technologies.com/technologies/mpac-ip-7200-dual-100g-etherne... http://www.hitechglobal.com/IPCores/100GigEthernet-MAC-PCS.htm http://www.napatech.com/sites/default/files/dn-0820_nt100e3-1-ptp_data_sheet... https://www.cesnet.cz/wp-content/uploads/2015/01/hanic-100g.pdf http://www.ndsl.kaist.edu/~kyoungsoo/papers/2010-lanman-100Gbps.pdf http://info.iet.unipi.it/~luigi/netmap/
grarpamp:
On Fri, Aug 21, 2015 at 12:30 AM, Mike Perry mikeperry@torproject.org wrote:
I submitted a proposal to tor-dev describing a simple defense against this default configuration: https://lists.torproject.org/pipermail/tor-dev/2015-August/009326.html
nProbe should be added to the router list; it's a very popular opensource IPFIX / netflow tap. http://www.ntop.org/products/netflow/nprobe/
While ntop is FLOSS, nProbe itself seems to be closed source. There's a FAQ on the page about it.
As such, I was only able to discover that its default inactive/idle timeout is 30s. I couldn't find a range.
For those into researching other flow capabilities... There are also some probes in OS kernels and some other opensource taps; they're not as well known or utilized as nProbe. Other large hardware vendors include Brocade, Avaya, Huawei, and Alcatel-Lucent.
Out of all of these, I was only able to find info on Alcatel-Lucent. It uses cflowd, which appears to be a common subcomponent. Its timeout ranges are the same as Cisco IOS.
What I really need now are examples of common routers that have a default inactive/idle timeout below 10s, or that allow you to set it below 10s. So far I have not found any.
Lots of SDN and monitoring projects can plug in with gear like this, because, FTW...
http://telesoft-technologies.com/technologies/mpac-ip-7200-dual-100g-etherne... http://www.hitechglobal.com/IPCores/100GigEthernet-MAC-PCS.htm http://www.napatech.com/sites/default/files/dn-0820_nt100e3-1-ptp_data_sheet... https://www.cesnet.cz/wp-content/uploads/2015/01/hanic-100g.pdf http://www.ndsl.kaist.edu/~kyoungsoo/papers/2010-lanman-100Gbps.pdf http://info.iet.unipi.it/~luigi/netmap/
I think these devices are wandering into the "adversarial admin" territory (see section 3 of the proposal). I want to focus on the case where the adversary demands/sniffs/exploits routers likely to be installed in most networks.
On 8/21/15, Mike Perry mikeperry@torproject.org wrote:
... What I really need now are examples of common routers that have a default inactive/idle timeout below 10s, or that allow you to set it below 10s. So far I have not found any.
i recall a switch vendor that used an overflow condition to trim timeouts lower, but this is different from a hard, low limit by configuration.
i'll see what i can dig up...
best regards,
P.S. flow tracking systems always make me point at c++ & scapy userspace driven raw injection around massive flow sybils as retort in their raw take and analytics. most efficient state representation of TCP behavior in memory? it's a fun challenge :P [ P.P.S. this may just crash your in-path, rather than DoS. keep a backup route! ]
On Sat, Aug 22, 2015 at 1:09 AM, Mike Perry mikeperry@torproject.org wrote:
As such, I was only able to discover that its default inactive/idle timeout is 30s. I couldn't find a range.
What I really need now are examples of common routers that have a default inactive/idle timeout below 10s, or that allow you to set it below 10s.
Not common, unless you consider all the places where software is being used as a network tap, whether by a legit operator or by an adversary.
nProbe can timestamp in milliseconds.
[ 21] %LAST_SWITCHED            %flowEndSysUpTime       SysUptime (msec) of the last flow pkt
[ 22] %FIRST_SWITCHED           %flowStartSysUpTime     SysUptime (msec) of the first flow pkt
[152] %FLOW_START_MILLISECONDS  %flowStartMilliseconds  Msec (epoch) of the first flow packet
[153] %FLOW_END_MILLISECONDS    %flowEndMilliseconds    Msec (epoch) of the last flow packet
Some define-and-assignment logic sets the defaults. It's software, so anyone could adjust the options to at least fall within the "hardcoded" integer type... 1 to u_short. That doesn't mean it's sensible or that other bits in the code won't need to be munged; I didn't look.
u_short idleTimeout, lifetimeTimeout, sendTimeout;
#define DUMP_TIMEOUT 30 /* seconds */
readOnlyGlobals.idleTimeout = DUMP_TIMEOUT;
readOnlyGlobals.lifetimeTimeout = 4*DUMP_TIMEOUT;
readOnlyGlobals.idleTimeout = atoi(optarg);
readOnlyGlobals.lifetimeTimeout = atoi(optarg);
if(readOnlyGlobals.lifetimeTimeout == 0) {
  readOnlyGlobals.lifetimeTimeout = 1;
printf("[--lifetime-timeout|-t] <timeout> | It specifies the maximum (seconds) flow\n"
       "                                  | lifetime [default=%d]\n", readOnlyGlobals.lifetimeTimeout);
printf("[--idle-timeout|-d] <timeout>     | It specifies the maximum (seconds) flow\n"
       "                                  | idle lifetime [default=%d]\n", readOnlyGlobals.idleTimeout);
I think these devices are wandering into the "adversarial admin" territory (see section 3 of the proposal). I want to focus on the case where the adversary demands/sniffs/exploits routers likely to be installed in most networks.
Sniffs... Lavabit was [nearly/actually] forced to install devices on his network for some while, so I see no "Sorry, my vendor's config range doesn't support it" distinction here. Telecoms like AT&T don't fight, and Vampires don't care.
Demands... The point with the NICs is that even 100Gbit taps are old news. With that comes deployment of flow / bro / etc like things that use them, and the logs get saved because humans love to create, collect, and save useless stuff... to supply on demand. Software taps are popular, probably more so at the network edges... universities, corp, regional / city, colo, etc. But costs are dropping, tech is rising, depts are doing these things.
Yes, a legit operator may be unlikely to adjust, or to set the timeouts too low of their own free will, since: 1) why, 2) storage space, 3) processing cpu / bandwidth.
Exploits... It's all software in the end.
I know, I'm partly diverging from legit operator context.
On 8/21/15, Mike Perry mikeperry@torproject.org wrote: ...
For those into researching other flow capabilities... There are also some probes in OS kernels and some other opensource taps, they're not as well known or utilized as nProbe. Other large hardware vendors include Brocade, Avaya, Huawei, and Alcatel-Lucent.
Out of all of these, I was only able to find info on Alcatel-Lucent. It uses cflowd, which appears to be a common subcomponent. Its timeout ranges are the same as Cisco IOS.
for posterity, it would also be useful to scrutinize behavior of:
- Arbor Peakflow SP
- Narus Insight Manager
- Lancope StealthWatch Xe
with respect to soft or hard fixed NetFlow limits within analysis or as pushed to tapped switches.
best regards,
While reducing network traffic to various accounting schemes such as netflow may enable some attacks, look at just one field of it... bytecounting.
Assume you've got a nice global view courtesy of your old bed buddies AT&T, Verizon, Sprint, etc, in addition to your own bumps on the cables.
You know the IP's of all Tor nodes (and I2P, etc). So you group them into one "cloud" of overlay IP's. For the most part any traffic into that cloud from an IP on the left, after it bounces around inside, must terminate at another IP on the right.
There are roughly 7000 relays, but because many of them are aggregable at the ISP/colohouse, peering and other good vantage point levels, you don't need 7000 taps to see them all.
You run your client and start loading and unloading the bandwidth of your target in one hour duty cycles for a few days. Meanwhile, record the bytecount every minute for every IP on the internet into some RRD.
There are only about 2.8 billion IPv4 addresses in BGP [Potaroo]. Some usage research says about 1.3 billion of the 2.6 billion in BGP are actually in use [Carna Census 2012]. IPv6 is minimal, but worth another 2.8 billion if mapped today. Being generous at 3.7 billion users (half the world [ITU]), that's 2^44 64-bit datapoints every three days... 128TiB.
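For anyone who wants to check that envelope, the arithmetic in Python (inputs are the figures cited above):

import math

ips = 3.7e9               # generous count of in-use addresses
minutes = 3 * 24 * 60     # one bytecount sample per minute, three days
points = ips * minutes    # 64-bit datapoints

print(math.log2(points))   # ~43.9, i.e. ~2^44 datapoints
print(points * 8 / 2**40)  # ~116 TiB raw; 2^44 * 8 bytes rounds to 128TiB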
Now, can you crunch those 3.7B curves to find one whose bytecount deltas match those of your datapump?
How fast can you speed it up?
And can you find Tor clients of clearnet services using a similar method, since you are not the datapump there?
What if you're clocking out packets and filling all the data links on your overlay net 24x7x365 such that any demand loading is now forced to ride unseen within instead of bursting out the seams?
Hi Mike.
On 08/21/2015 05:30 AM, Mike Perry wrote:
Anyone with netflow experience should feel free to chime in there (or here if you are not subscribed to tor-dev), but please be mindful of the adversarial considerations in section 3 (unless you believe that adversary model to be invalid, but please explain why).
I have some experience with netflow from $previousGig, and only had two potentially relevant thoughts when looking at your proposal.
- It is common practice at SPs to set the active timeout to 1 min in order to speed up detection of attacks with Arbor and similar tools.
- Cisco IOS (and likely other platforms) will immediately export flows if the cache fills to capacity. This will result in flows being exported in less than the inactive timeout, and my understanding is that this is a common occurrence.
I hope this helps.
hope you are well
tim
On 9/2/15, Tim Sammut tim@teamsammut.com wrote:
...
- Cisco IOS (and likely other platforms) will immediately export flows if the cache fills to capacity. This will result in flows being exported in less than the inactive timeout, ...
there is a second limit here, which is the netflow channel capacity / storage limit; if you introduce simulated flows at a rate beyond this capacity, you may become unobservable (via loss), resulting in failure to correlate.
this is why i asked about logical injection via userspace of billions of flows per minute as a resistance measure. (e.g. scapy or other raw inject across a border with cooperating peer, if needed.)
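a rough envelope of what "beyond capacity" means, in Python (the cache size and timeout here are hypothetical, not any particular router's):

CACHE_ENTRIES = 512 * 1024   # hypothetical exporter flow-cache size
INACTIVE_TIMEOUT = 15        # seconds before an idle entry is evicted

# steady state: each injected 5-tuple occupies a slot for roughly
# INACTIVE_TIMEOUT seconds, so injecting distinct 5-tuples faster than
# this rate keeps the cache pinned at capacity, forcing loss / early
# expiry of legitimate entries.
flows_per_sec = CACHE_ENTRIES / INACTIVE_TIMEOUT
print(flows_per_sec)  # ~35k new 5-tuples/second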
best regards,
On Thu, Sep 3, 2015 at 2:03 AM, coderman coderman@gmail.com wrote:
there is a second limit here, which is the netflow channel capacity / storage limit; if you introduce simulated flows at a rate beyond this capacity, you may become unobservable (via loss), resulting in failure to correlate.
I've seen an ISP saturate their own backbone with netflow during a nice UDP DoS; collectors had to be hung off local router ports after that.
this is why i asked about logical injection via userspace of billions of flows per minute as a resistance measure. (e.g. scapy or other raw inject across a border with cooperating peer, if needed.)
If the collector is not protected, you can inject bogus flows, implicate your neighbor, and fill disks.
On Thu, 13 Aug 2015, Mike Perry wrote:
As such, I still look forward to hearing from someone who has worked at an ISP/University/etc where this is actually practiced. What is in *those* logs?
I can speak to my experience in that capacity.
As an ISP, I need to be able to answer questions about the data flowing through my network, such as:
- "What if" scenarios for capacity planning, peering, etc. that require me to understand how much traffic (in bits/sec at 95th percentile) is coming and going to each BGP prefix (route).
- Be able to shut down customers that are attacking the outside world and be able to classify the type of attack. To do this, I need to be able to differentiate good traffic from attack traffic somehow.
- Be able to quickly blackhole customer IPs that are taking a volumetric DDoS attack that is sufficiently large to threaten my other customers' connectivity.
- Be able to bill customers for the quantity of bandwidth that they use. For most ISPs, this comes from per-port counters, but ISPs that bill differently for peering traffic or for international traffic need more detail on bandwidth sent per customer to various destinations.
- Be able to track a given connection back to a customer to be able to react to abuse complaints for at least a few weeks. In my environment, I can just look at which IPs are assigned to a port, but other ISPs will need to consult DHCP logs. Some ISPs do NAT at the ISP level, and if a single IP can be used simultaneously by multiple customers, they may need to keep per-connection logs.
Netflow is the standardized way to collect packet header information to be able to answer all of the above and from what I can tell is in use by at least a significant fraction of ISPs. There are a few options to capture netflow data:
- Limited netflow capability is built into all of my big routers, and is probably the easiest to set up.
- If doing the netflow collection on the router itself is not viable, either port mirroring or specialized copper or fiber network taps can easily send a copy of all traffic to a collector box running something like nProbe (http://www.ntop.org/products/netflow/nprobe/).
The netflow record format will be something like one of these: http://www.cisco.com/c/en/us/td/docs/net_mgmt/netflow_collection_engine/3-6/...
The default netflow logging is per-flow, so one record will be saved per {protocol, source IP, source port, destination IP, destination port} tuple per 5-30 minutes. Connections that live longer than the flow timeout will span multiple records. Timestamps are second granularity, and TCP flags are ORed together, so you can tell whether the connection started/ended/continued based on the TCP flags.
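As a toy illustration of that record shape (field and function names are mine, not a wire format), the per-flow aggregation looks roughly like this in Python:

from dataclasses import dataclass

@dataclass
class FlowRecord:
    proto: int
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    first_seen: int = 0   # unix seconds (second granularity)
    last_seen: int = 0
    packets: int = 0
    octets: int = 0
    tcp_flags: int = 0    # OR of all flags seen: SYN|ACK|FIN|RST|...

cache = {}

def account(ts, proto, sip, sport, dip, dport, size, flags=0):
    key = (proto, sip, sport, dip, dport)   # the 5-tuple
    rec = cache.get(key)
    if rec is None:
        rec = cache[key] = FlowRecord(proto, sip, sport, dip, dport,
                                      first_seen=ts, last_seen=ts)
    rec.last_seen = ts
    rec.packets += 1
    rec.octets += size
    rec.tcp_flags |= flags  # ORed, so SYN/FIN presence survives per record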
I try to avoid storing any raw per-flow data to disk. At the scale I'm operating, I can't store it for very long, and walking through it again is too slow. If I wanted to throw more hardware at netflow log processing, it's at least possible to do, though. Of the people I've heard doing this, they are mostly paranoid companies (not ISPs) who want to be able to trace security incidents after the fact.
Instead, I try to only store much smaller aggregate data, such as packets and bytes sent and received per 5 minutes to and from each /24, and the results of the netflow-based attack detector, which processes the flows as it gets them.
-- Aaron
On 2015-08-13 19:00, Aaron Hopkins wrote:
I try to avoid storing any raw per-flow data to disk. At the scale I'm operating, I can't store it for very long, and walking through it again is too slow. If I wanted to throw more hardware at netflow log processing, it's at least possible to do, though. Of the people I've heard doing this, they are mostly paranoid companies (not ISPs) who want to be able to trace security incidents after the fact.
I was surprised how many companies retained enough traffic to retroactively determine whether Heartbleed had previously been exploited. Neat, but scary.
On 2015-08-13 01:40, Mike Perry wrote:
As such, I still look forward to hearing from someone who has worked at an ISP/University/etc where this is actually practiced. What is in *those* logs?
I deal with two flow recording practices and resulting records retention at work.
The first is at the upstream ISP. Roughly 1 in 5k flows are sampled from routers for statistics gathering and general traffic measurement, in netflow format. Within two weeks, those records are used to generate simpler tuples of bandwidth used by member IP addresses and subnets, at decreasing resolution over time in a Cricket RRD.
The second is work's own security monitors. Full flows are generated and recorded from the same raw network data provided to IDSes. Records are retained for 1 year in practice.
The flow recorder is the open source Argus from Qosient.com. Argus will indicate a TIMEOUT situation when a TCP flow it has seen is still open at the time logs are rolled. Additional traffic on the flow will result in a new record, which is often annealed in post-processing with the previously seen flow.
In work's case Argus is configured to not record flows from problem hosts (high volume noise sources) or privacy-sensitive hosts (Tor nodes, others). Not all institutions will have that kind of configuration.
At first blush, it seems padding traffic may cause more TCP flows to be live for sampling hits in regimes like work's upstream ISP.
On the other hand, Argus may more easily confirm a TCP flow is still live in the case of padding traffic, but in practice "live" is already assumed for the lesser of a tcp.established default or a log roll, unless Argus saw a TCP teardown beforehand.
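As a rough envelope of the sampling odds (the 1-in-5000 rate is the upstream figure above; the flow counts are hypothetical):

SAMPLE_RATE = 1 / 5000  # roughly 1 in 5k flows sampled upstream

def p_any_sampled(num_flows):
    # chance that at least one of a client's flows lands in the sample;
    # padding that creates more flows pushes this up.
    return 1 - (1 - SAMPLE_RATE) ** num_flows

for n in (1, 100, 1000, 10000):
    print(n, round(p_any_sampled(n), 4))
# 1 0.0002 / 100 0.0198 / 1000 0.1813 / 10000 0.8647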
Richard
Mike,
At what resolution is this type of netflow data typically captured?
For raw capture, timestamps are typically second-resolution. The resolution post-aggregation is a different question. Keep in mind that netflow is just the most common example; many networks don't use Cisco netflow, but have something that meets the same requirements, storing relatively more or less data (e.g., pmacct, bro).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
That's about right; some systems (e.g., pmacct in some configurations) store a four-tuple of (src,dest,tx,rx) while throwing out the ports and aggregating over the tx and rx flows such that connections can no longer be uniquely identified. What's stored from Cisco netflow is quite flexible[0]. Other systems like bro default to storing one record per connection, with all the information in a five-tuple plus things like IP TOS and byte counts.
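A sketch of that port-discarding aggregation in Python (hypothetical data, not pmacct's actual implementation):

from collections import defaultdict

totals = defaultdict(lambda: [0, 0])  # (src_ip, dst_ip) -> [tx, rx]

def add_flow(src_ip, src_port, dst_ip, dst_port, tx_bytes, rx_bytes):
    totals[(src_ip, dst_ip)][0] += tx_bytes  # ports discarded here
    totals[(src_ip, dst_ip)][1] += rx_bytes

add_flow("10.0.0.1", 50432, "192.0.2.7", 443, 1200, 5200)
add_flow("10.0.0.1", 50433, "192.0.2.7", 443, 900, 4100)
print(dict(totals))  # one row; the two connections can't be told apart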
Are timestamps always included?
Yes, to some granularity (there's not much point in storing connection info without times, for any of the reasons people normally store connection info). The most recent system I set up (bro) records connections with second-precision timestamps; the one before that (pmacct) stored aggregates over ten seconds (src,dest,tx,rx).
Are bidirectional transfer bytecounts always included?
You mean the number tx + rx, or the tuple tx,rx as opposed to just tx or rx? It's almost always the second one (tx,rx).
Are subsampled packet headers (or contents) sometimes/often included?
Contents storage is rare. Some universities store enough data to reconstruct most packets[1]; other ISPs usually don't. When full connection data is stored, it's deleted pretty fast (days or weeks at most).
Storing a subset of data from packet headers (ports, TOS) is very common, as is keeping counts of things like checksum mismatches.
What about UDP sessions? IPv6?
UDP is treated the same as TCP. IPv6 is the same as IPv4. ICMP et cetera are often stored too; these systems are normally thinking more in terms of IP packets than TCP segments or UDP datagrams.
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network at some point relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
arma or others can probably explain why this is a hard problem; I don't know enough in this area to comment.
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
I don't think UDP helps you at all here. What makes you think it might?
Sharif
[0] http://www.cisco.com/en/US/technologies/tk648/tk362/technologies_white_paper... [1] https://www.bro.org/community/time-machine.html
Sharif Olorin:
At what resolution is this type of netflow data typically captured?
For raw capture, timestamps are typically second-resolution. The resolution post-aggregation is a different question. Keep in mind that netflow is just the most common example; many networks don't use Cisco netflow, but have something that meets the same requirements, storing relatively more or less data (e.g., pmacct, bro).
Are we talking about all connection 5-tuples, bidirectional/total transfer byte totals, and open and close timestamps, or more (or less) detail than this?
That's about right; some systems (e.g., pmacct in some configurations) store a four-tuple of (src,dest,tx,rx) while throwing out the ports and aggregating over the tx and rx flows such that connections can no longer be uniquely identified. What's stored from Cisco netflow is quite flexible[0]. Other systems like bro default to storing one record per connection, with all the information in a five-tuple plus things like IP TOS and byte counts.
Are timestamps always included?
Yes, to some granularity (there's not much point in storing connection info without times, for any of the reasons people normally store connection info). The most recent system I set up (bro) records connections with second-precision timestamps; the one before that (pmacct) stored aggregates over ten seconds (src,dest,tx,rx).
So in the bro-based system (which sounds higher resolution) the final logged data was second-precision timestamps on full connection tuples?
So if I have a connection to a Tor Guard node opened for 8 hours, at the end of the session, your system would record a single record with: (my_ip,my_port,guard_ip,guard_port,tx,rx,timestamp_open,timestamp_close)
Or would it record 8*60*60 == 28800 records, with one record stored per second that the connection was open/active?
I think for various reasons (including this one), we're going to want some degree of padding traffic on the Tor network at some point relatively soon, and having more information about what is typically recorded in these cases would be very useful to inform how we might want to design padding and connection usage against this and other issues.
arma or others can probably explain why this is a hard problem; I don't know enough in this area to comment.
I think any system that is storing connection-level data (as opposed to one record per timeslice of activity on a tuple) is likely to be rather easy to defend against correlation.
I think that systems that store only sampled data will also be very easy to defend against correlation. Murdoch's seminal IX-analysis work required 100-500M transfers to get any accuracy out of sample-based correlation at all, and even then the false positives were a serious problem, even when correlating a small number of connections.
We have a huge problem right now where all of the research in this area claimed extremely effective success rates, and swept any mitigating factors under the rug (especially false positives and the effects of large amounts of concurrent users or additional activity).
Information about how UDP is treated would also be useful if/when we manage to switch to a UDP transport protocol, independent of any padding.
I don't think UDP helps you at all here. What makes you think it might?
Well, it seems harder to store a full connection tuple for open until close, because you have no idea when the connection actually closed (unless you are recording a tuple for every second during which there is any activity, or similar).
So in the bro-based system (which sounds higher resolution) the final logged data was second-precision timestamps on full connection tuples?
It's not higher-resolution, it just has different defaults. I can configure bro to capture/retain anything I want, up to and including every datagram passing through the interface.
So if I have a connection to a Tor Guard node opened for 8 hours, at the end of the session, your system would record a single record with: (my_ip,my_port,guard_ip,guard_port,tx,rx,timestamp_open,timestamp_close)
Not a single record, no. Have a read of the docs, or try it for yourself: https://www.bro.org/documentation/index.html
You can just set bro up to capture packets on your local machine or network.
I think any system that is storing connection-level data (as opposed to one record per timeslice of activity on a tuple) is likely to be rather easy to defend against correlation.
As I alluded to earlier, these systems typically work with network flows, not TCP connections.
Well, it seems harder to store a full connection tuple for open until close, because you have no idea when the connection actually closed (unless you are recording a tuple for every second during which there is any activity, or similar).
The raw capture is usually the latter; sometimes with stateful monitors these are postprocessed/aggregated into connections, depending on configuration. Again, the docs are out there and you can just run the system for yourself - I recommend setting it up on your home network for a week and seeing what you get. :)
Mike,
Additionally, I should clarify that bro and netflow have some fundamental differences and are usually used for different things (but both are common in large networks). Bro's very stateful and is more focused on IDS-type applications, whereas netflow is more directed towards traffic accounting, which is why bro has all the stateful stuff about TCP connections. bro would be more commonly found at a university, but netflow's probably more relevant if you're looking at what the typical ISP will retain for a long time.
Sharif Olorin:
Mike,
Additionally, I should clarify that bro and netflow have some fundamental differences and are usually used for different things (but both are common in large networks). Bro's very stateful and is more focused on IDS-type applications, whereas netflow is more directed towards traffic accounting, which is why bro has all the stateful stuff about TCP connections. bro would be more commonly found at a university, but netflow's probably more relevant if you're looking at what the typical ISP will retain for a long time.
Yes, unfortunately this is why "just set up bro/netflow at home and try it!" is not really helpful. It is obvious that these systems can in theory be configured to log+analyze all data for all time, especially if it is just my tiny DSL line with one person browsing the web over Tor and I have a few TB worth of disk to burn.
However, speculation about the evil BOFH who twiddles his mustache and tunes netflow to deanonymize all Tor users forever is rather boring to me. It's a scenario that's unlikely to happen at scale, or be practical for full analysis of the entire Tor network. Even if we are looking at such a BOFH in the Utah case, we have yet another datapoint against the evil BOFH correlation theory: These logs were useless!
The important question to me is: "If we assume honest Tor nodes, what level of logging is likely to be practiced at their ISP or AS today without their knowledge, and what technical measures are available to us to reduce that potential impact?"
In this Utah exit case, the exit operator in question is indeed honest, and we're looking at an upstream admin who just happened to be logging stuff, likely as per some standard (if heavy-handed) connection-level logging policy.
I suspect that type of adversary will be possible to defeat with similar amounts of padding that will defeat hidden service circuit setup fingerprinting, website traffic fingerprinting, traffic type classification, and a host of other low-resource attacks...