The secret life of DNS packets: investigating complex networks
The Domain Name System (DNS) is a key piece of the infrastructure used to facilitate communication across networks. It’s often described as a phone book: in its most basic form, DNS provides a way to look up a host’s address by an easy-to-remember name. For example, looking up the domain name stripe.com will direct a client to the IP address 53.187.159.182, the location of one of Stripe’s servers. Before any communication can take place, the first thing a host must do is query a DNS server for the address of the target host. Because these lookups are a prerequisite for communication, maintaining a reliable DNS service is extremely important. DNS issues can quickly lead to severe, widespread outages, and you could find yourself in a real bind.
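As a quick illustration of that kind of lookup, a command-line query for the name returns the address a client would connect to (the exact address you see may differ, since large sites publish multiple records):

    dig +short stripe.com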
It’s important to establish good observability practices for these systems so that when problems arise, you can clearly understand how they’re failing and act quickly to minimize the impact. A well-instrumented system provides insight into how it operates; setting up monitoring and collecting reliable metrics is critical to responding to incidents effectively. It’s also essential for post-incident analysis, when you’re trying to understand the root cause and prevent recurrences.
In this post, I’ll describe how we monitor our DNS systems and how we used a variety of tools to investigate and fix an unexpected spike in DNS errors that we recently encountered.
Stripe’s DNS infrastructure
At Stripe, we operate a cluster of DNS servers running Unbound, a popular open source DNS resolver that can recursively resolve DNS queries and cache the results. These resolvers are configured to forward queries to different upstream destinations based on the domain in the request. Queries used for service discovery are forwarded to our Consul cluster. Queries for domains we’ve configured in Route 53, along with any other domains on the public internet, are forwarded to the VPC resolver, the DNS resolver that AWS provides as part of its VPC offering. We also run a resolver locally on every host, which provides an additional layer of caching.
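To make the routing concrete, here is a minimal sketch of what this kind of forwarding configuration can look like in unbound.conf. The zone names and addresses are illustrative (Consul’s default DNS domain and an example VPC resolver address), not our actual configuration:

    forward-zone:
        name: "consul."          # service discovery queries go to Consul
        forward-addr: 10.0.0.10  # illustrative Consul DNS endpoint

    forward-zone:
        name: "."                # everything else, including Route 53 zones
        forward-addr: 10.0.0.2   # the VPC resolver in this example VPC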
Unbound exposes a wealth of statistics, which we collect and feed into our metrics pipeline. This allows us to track metrics such as the number of queries being served, the query types, and the cache hit rate.
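One way to pull these counters, assuming Unbound’s remote-control interface is enabled, is unbound-control; names like total.num.queries and total.num.cachehits appear in its output:

    # Dump counters without resetting them, so repeated scrapes stay consistent
    unbound-control stats_noreset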
We recently observed that for a few minutes every hour, the cluster’s DNS servers were returning SERVFAIL responses for a small number of internal requests. SERVFAIL is a generic response a DNS server returns when an error occurs, and it doesn’t tell us much about the cause of the error.
Having made little progress at first, we found another clue in the request list depth metric. (You can think of this as Unbound’s internal to-do list, where it keeps track of all the DNS requests it needs to resolve.)
An increase in this metric indicates that Unbound is unable to process messages in a timely manner, often due to increased load. However, our metrics did not show a significant increase in the number of DNS queries, nor did resource consumption appear to be hitting any limits. Since Unbound resolves queries by contacting external name servers, another explanation could be that those upstream servers were taking longer to respond.
Tracing the source
Following this clue, we logged into one of the DNS servers and inspected Unbound’s request list.
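If you’re following along, the command below is how Unbound exposes this list (again assuming the remote-control interface is enabled); each entry includes the query name and type along with its current status:

    unbound-control dump_requestlist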
This confirmed that requests were piling up in the request list. We also noticed some interesting details: most of the entries in the list corresponded to reverse DNS lookups (PTR records), and they were all waiting on a response from 10.0.0.2, the IP address of the VPC resolver.
We then used tcpdump to capture DNS traffic on one of the servers to better understand what was going on and try to identify any patterns. We wanted to be sure we captured traffic during one of these spikes, so we configured tcpdump to write data to capture files over a period of time. We split the captures into 60-second intervals to keep the file sizes small and easy to work with.
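A capture invocation along these lines does the trick; the interface name and file naming pattern here are illustrative rather than what we actually used:

    # Rotate to a new pcap every 60 seconds so no single file grows too large
    tcpdump -i eth0 -w 'dns-%Y%m%d-%H%M%S.pcap' -G 60 port 53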
The packet captures showed that during the hourly spikes, 90% of the requests to the VPC resolver were reverse DNS queries for IPs in the 104.16.0.0/12 CIDR range, and the vast majority of those queries were failing with a SERVFAIL response. We used dig to query the VPC resolver for some of these addresses and confirmed that the responses were taking longer than usual to arrive.
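A spot check looks something like the query below; the address is just an illustrative one inside that range, and 10.0.0.2 is the VPC resolver:

    dig -x 104.16.1.1 @10.0.0.2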
By looking at the source IPs of the clients issuing the reverse DNS queries, we noticed that they all came from hosts in our Hadoop cluster. We maintain a database of when Hadoop jobs start and end, so we were able to correlate those times with the hourly spikes. We eventually narrowed the source of the traffic down to a job that analyzes network activity logs and performs reverse DNS lookups on the IP addresses found in those logs.
One of the more surprising details in the tcpdump data was that the VPC resolver was not sending responses back for many of the queries. During one 60-second collection period, the DNS server sent 257,430 packets to the VPC resolver, but the VPC resolver responded to only 61,385 of them, an average of 1,023 packets per second. We realized we might be running into the AWS limit on how much traffic can be sent to the VPC resolver, which is 1,024 packets per second per network interface. Our next step was to establish better visibility into our cluster to validate this hypothesis.
Counting packets
AWS exposes the VPC resolver at a static IP address: the VPC’s base IP address plus two (for example, if the base IP is 10.0.0.0, the VPC resolver will be at 10.0.0.2). We needed to track the number of packets per second sent to this IP address. One tool that can help here is iptables, because it keeps a count of the packets that match each rule.
We created a rule that matches traffic to the VPC resolver’s IP address and added it to the OUTPUT chain, the set of iptables rules applied to every packet sent from the host. We configured the rule to jump to a new chain named VPC_RESOLVER and added a single empty rule to that chain. Since the host may contain other rules in the OUTPUT chain, giving this match its own chain isolates it and makes the output easier to parse. Listing the rules then shows the number of packets sent to the VPC resolver:
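The commands below are a sketch of that setup, using 10.0.0.2 as the resolver address from the example above:

    # Dedicated chain whose counters we can read back cleanly
    iptables -N VPC_RESOLVER

    # Match packets destined for the VPC resolver and jump to that chain
    iptables -A OUTPUT -d 10.0.0.2/32 -j VPC_RESOLVER

    # Empty rule: its packet counter tallies everything entering the chain
    iptables -A VPC_RESOLVER

    # List the chain with numeric addresses and exact counters
    iptables -nvxL VPC_RESOLVER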
With this in place, we wrote a simple service that reads the packet counter from the VPC_RESOLVER chain and reports the value through our metrics pipeline.
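A minimal sketch of that reporting loop might look like the script below; report_metric is a hypothetical stand-in for whatever ships values into your metrics pipeline:

    #!/bin/bash
    # Read the packet counter from the VPC_RESOLVER chain every 10 seconds
    # and emit it. report_metric is a placeholder, not a real command.
    while true; do
        packets=$(iptables -nvxL VPC_RESOLVER | awk 'NR == 3 { print $1 }')
        report_metric "unbound.vpc_resolver.packets" "${packets}"
        sleep 10
    done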
Once we started collecting this metric, we could see that the hourly spikes in SERVFAIL responses coincided with periods when the servers were sending too much traffic to the VPC resolver.
Traffic amplification
The iptables data (the number of packets sent to the VPC resolver per second) showed that traffic to the VPC resolver increased significantly during these periods, and we wanted to better understand what was going on. Looking closely at the shape of the traffic flowing from the Hadoop job into the DNS servers, we noticed that the client was sending five requests for each failing reverse lookup. Because the reverse lookups were taking a long time or being dropped by the server, the local caching resolver on each host was timing out and retrying the requests. On top of that, the DNS servers themselves were also retrying requests, amplifying the request volume by an average of 7x.
Spreading the load
One thing to keep in mind is that the VPC resolver limit is applied per network interface. Instead of performing reverse lookups only on the DNS servers, we could distribute the load and have each host contact the VPC resolver independently. Because we already run Unbound on every host, controlling this behavior was easy. Unbound allows you to specify different forwarding rules per DNS zone. Reverse lookups use the special domain in-addr.arpa, so configuring this behavior only required adding a rule to forward requests for that zone to the VPC resolver.
We understood that reverse lookups for private addresses stored in Route 53 were likely to return faster than reverse lookups for public IPs, which require communication with external name servers. So we decided to create two forwarding configurations: one for resolving private addresses (the 10.in-addr.arpa. zone) and one for all other reverse queries (the in-addr.arpa. zone). Both rules were configured to send requests to the VPC resolver. Unbound computes its retry timeout from a smoothed average of the historical round-trip times to the upstream server, and it maintains a separate calculation for each forwarding rule. Even though the two rules share the same upstream target, their retry timeouts are calculated independently, which helps keep inconsistent query performance in one zone from skewing the timeout calculation for the other. See the configuration sketch below.
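A sketch of those two per-host forwarding rules, again using 10.0.0.2 as the example VPC resolver address:

    forward-zone:
        name: "10.in-addr.arpa."   # reverse lookups for private 10.0.0.0/8 addresses
        forward-addr: 10.0.0.2

    forward-zone:
        name: "in-addr.arpa."      # all other reverse lookups
        forward-addr: 10.0.0.2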
After applying the forwarding configuration change to the local Unbound resolvers on the Hadoop nodes, we found that the hourly load spikes on the VPC resolver disappeared, eliminating the SERVFAIL responses we had been seeing.
The new VPC resolver packet rate metric gives us a more complete picture of what’s happening in our DNS infrastructure. It alerts us if we are approaching the resource limit and points us in the right direction when the system is unhealthy. Other improvements we’re considering include collecting rolling tcpdumps of DNS traffic and periodically logging the output of some of Unbound’s debugging commands, such as the contents of the request list.
Visibility into complex systems
When operating critical infrastructure such as DNS, it’s crucial to understand the health of each component of the system. The metrics and command-line tools Unbound provides give us insight into one of the core elements of our DNS system. As we saw in this case, these kinds of investigations often uncover areas where monitoring can be improved, and addressing those gaps is important for being better prepared for incident response. Gathering information from multiple sources gives you different views into what’s going on inside your systems, which helps narrow down the root cause during an investigation. That same information also confirms whether the remediation you applied is having the desired effect. As these systems grow in size and complexity, the way you monitor them must evolve so you understand how the different components interact and can stay confident that your systems are operating effectively.