NetFlow, IPFIX, and sFlow provide a lot of great information, and we can gain more insight by doing reverse name lookups for IP addresses. Once domain names are resolved we can correlate traffic by domain even when different server IP addresses are exported. Popular domain names can be correlated with categories like "Social Media" and "News". There is only one problem: DNS lookups can take a long time.
The code running the Flow Analyzer works around DNS lookup latency as much as possible, but for some domains there's no way to get around it entirely. Here's how we try to work around DNS:
- Recommend using a local caching DNS server, which:
  - Quickly returns already cached records
  - Saves bandwidth
  - Reduces load on upstream DNS servers
  - Doesn't abuse public DNS servers like Google's 8.8.8.8 / 8.8.4.4
- Cache records in Python for direct access
- Use timeouts for DNS lookups
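The last two items can be sketched in a few lines of Python. This is a minimal illustration and not the Flow Analyzer's actual code: the names `reverse_lookup`, `DNS_TIMEOUT`, and `CACHE_TTL` are hypothetical, the TTL value is an arbitrary assumption, and the `resolver` parameter exists only so the lookup function can be swapped out.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

DNS_TIMEOUT = 3.0    # seconds; same value as the 3 second timeout profiled in this post
CACHE_TTL = 1800.0   # seconds; hypothetical value, tune it for your network

_cache = {}          # ip -> (name or None, expiry timestamp)
_pool = ThreadPoolExecutor(max_workers=8)


def reverse_lookup(ip, resolver=socket.gethostbyaddr, timeout=DNS_TIMEOUT):
    """Return the reverse DNS name for ip, or None on failure or timeout."""
    now = time.monotonic()
    hit = _cache.get(ip)
    if hit is not None and hit[1] > now:
        return hit[0]                       # answered from the in-process cache

    # Run the blocking lookup in a worker thread so we can stop waiting on it.
    future = _pool.submit(resolver, ip)
    try:
        name = future.result(timeout=timeout)[0]
    except (FutureTimeout, OSError):
        name = None                         # remember failures too, so we don't stall twice
    _cache[ip] = (name, now + CACHE_TTL)
    return name
```

Calling `reverse_lookup("192.0.2.1")` twice only hits the resolver once. One caveat with this approach: a lookup that times out keeps running in its worker thread until the operating system's resolver gives up, so a burst of slow lookups can still tie up the pool.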
Even so, it isn't always quick. Say a flow containing an IP address that belongs to example.com is exported and collected by the Flow Analyzer. If the record for that IP address (with the domain name example.com) isn't cached in Python we have to query the local caching server. If the local caching server doesn't have that record it has to go to an upstream server. This would normally be an ISP's server or Google's 8.8.8.8, Amazon's Route 53, Dyn, or others. If that upstream server doesn't have the record cached then it has to pass the request on. This request travels up the DNS hierarchy until it's referred to a server that has an answer, or the lookup fails.
There's another wrinkle in this as well, and that's the protocol DNS uses. While DNS lookups can be done over TCP or UDP, most systems default to using UDP first. UDP is connectionless, otherwise known as a "fire and forget" protocol. If a UDP packet is dropped in transit the sender won't know it because there isn't any handshaking or receipt acknowledgement. Using UDP means that DNS requests are quick and produce low traffic overhead, but it can also mean packets get dropped silently. A failed request because of "lost" UDP packets won't be re-attempted unless the lookup is run again, which introduces more latency.
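To make the "fire and forget" behavior concrete, here's a small sketch using plain UDP sockets. It is not a DNS client: the payload is just opaque bytes, and the function name `udp_request` and its parameters are made up for illustration. The point is that the only way the sender learns about a lost datagram is by waiting out a timeout and trying again.

```python
import socket


def udp_request(payload, server, timeout=1.0, attempts=2):
    """Send payload over UDP and wait for a reply.

    Returns (reply, attempts_used), or (None, attempts) if nothing came back.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for attempt in range(1, attempts + 1):
            sock.sendto(payload, server)    # no handshake, no delivery receipt
            try:
                reply, _ = sock.recvfrom(4096)
                return reply, attempt
            except socket.timeout:
                continue                    # silence is the only sign of a lost packet
    return None, attempts
```

Every retry costs a full timeout before it even starts, which is exactly where the extra latency comes from.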
Here's a profile of the Flow Analyzer's packet processing time in seconds over 220 packets:
| DNS Status | Average Packet Processing Time (sec) | Median Packet Processing Time (sec) |
|---|---|---|
| DNS Enabled, 3 sec. Timeout | 1.691327273 | 0.004 |
| DNS Enabled, No Timeout | 3.73955 | 0.008 |
With DNS lookups disabled the average packet processing time is 0.003 seconds, and the median processing time is in line with that. Enabling DNS lookups with a 3 second timeout on slow / failed lookups introduces some pretty big latency spikes. Notice how the 1.691 second average is far off the 0.004 second median, a strong indication that there are latency spikes and outliers. The final row is much worse, and shows how bad DNS latency can be when it's allowed to run amok.
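If it isn't obvious why a gap between the average and the median points to outliers, here's a quick sketch with made-up numbers. These are not the measurements from the table above, just an assumed mix of 217 fast packets and 3 packets that each waited out a 3 second timeout.

```python
from statistics import mean, median

# Hypothetical timings in seconds, NOT the real profile data:
# 217 packets at 4 ms each, plus 3 packets stuck behind a 3 second DNS timeout.
timings = [0.004] * 217 + [3.004] * 3

print(f"mean   = {mean(timings):.3f} s")    # pulled upward by the three slow packets
print(f"median = {median(timings):.3f} s")  # unaffected, still the typical packet
```

Just three slow packets out of 220 push the mean to roughly eleven times the median, while the median doesn't move at all.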
If you're thinking about enabling DNS lookups, be aware that there will be an impact on performance. Test this functionality during off-peak hours and use the info debug level to see packet processing times.