Overview of dnsmasq Vulnerabilities: The Dangers of DNS Cache Poisoning


Category: Unit 42



Executive Summary

DNS masquerade (dnsmasq) is a widely used open source DNS resolver. While one might not be familiar with dnsmasq by name, it is used by many projects and in device firmware around the world, from Kubernetes to routers and other products.

Over the years, multiple critical vulnerabilities have been found in dnsmasq. Recently, security researchers discovered new issues that continue to make dnsmasq vulnerable. These vulnerabilities can lead to DNS cache poisoning, denial of service (DoS) and possibly remote code execution (RCE). In this blog, I will review these vulnerabilities in dnsmasq, with a deep dive on DNS cache poisoning. I will also cover the effect such issues have on cloud products such as Kubernetes.

Palo Alto Networks customers are protected from the attacks outlined in this blog with Next-Generation Firewall with DNS Security, and Prisma Cloud.

Background on DNS Vulnerabilities

As covered in detail in my previous blog, “The History of DNS Vulnerabilities,” port and transaction ID randomization is one of the key methods a modern DNS resolver uses to protect against cache poisoning.

Cache poisoning is an attack in which one poisons the DNS resolver’s cache by sending malicious responses. The attack happens after a DNS resolver sends a request to an upstream server. At this point, the attacker sends fake responses that appear to come from the server the resolver contacted. The DNS resolver receives the malicious responses and caches them. From then on, when a victim organization asks the DNS resolver for this domain, it will answer with the IP address of an attacker-controlled server. This results in redirections that can be very hard to detect. For example, a user might browse to a bank’s website, having typed the URL correctly, but instead of returning the IP address of the bank, a poisoned DNS resolver would send the user to an attacker’s IP address. Because of the serious implications of this possibility, DNS vulnerabilities are often critical.

One of the mitigations against this attack is to randomize the transaction ID and source port of the request from the resolver to the upstream server. Each of those two values is 16 bits long, so randomizing both gives each request a 32-bit key that an attacker has to match in order for malicious responses to be accepted by the DNS resolver.

Figure 1. Recap of a DNS cache poisoning attack.

CVEs

Two types of vulnerabilities were recently discovered in dnsmasq:

Bugs in the implementation of the DNS protocol, such as validation issues, that can be leveraged for DNS cache poisoning attacks: CVE-2020-25684, CVE-2020-25685 and CVE-2020-25686.

And buffer overflow bugs that can lead to DoS attacks.

So far, the implications of the buffer overflow vulnerabilities seem limited to DoS attacks.

The first group of vulnerabilities, however, can be exploited to perform devastating cache poisoning attacks, and I will be focusing on them in this blog.

A few key design choices are specific to dnsmasq, and knowing how they work helps provide a good understanding of the recent vulnerabilities.

CVE-2020-25684: Transaction ID and Port Randomization Done Incorrectly

dnsmasq implements transaction ID and port randomization. By default, it supports up to 64 ports at the same time. This means that dnsmasq can hold open sockets of 64 ports simultaneously and wait for responses on each of those port sockets. An attacker has to guess the source port (any of the 64 that are open) correctly. Otherwise, malicious packets will simply be dropped on the dnsmasq server.

That sounds like a good mitigation, but the problem is that dnsmasq didn’t tie each transaction ID to the source port its query was sent from. Instead of having to guess the exact transaction ID and the exact source port, an attacker needs to guess just the transaction ID and any of the 64 open ports. This makes the port randomization mitigation 64 times weaker.

How Does dnsmasq Match Responses to Requests?

A frec, or a forward record, represents a query that was received by dnsmasq but whose answer isn’t in the cache, so dnsmasq has to forward the request to an upstream server. As long as the request isn’t fulfilled, meaning dnsmasq hasn’t yet received a response for the associated request, it is tracked as a frec. An attacker who wants to poison dnsmasq needs to send it a fake response for a frec. frecs are removed once they are fulfilled or once they time out.

But how does dnsmasq match each response to the correct frec? dnsmasq saves only the hash of the question section in the DNS query and discards the rest. Each DNS query has a “questions” section. It is where the query holds the actual question, for example: where is www.example.com.

It can do that because each DNS response also contains the question it is answering, so both the request and the response have this section. This section is completely identical on both sides, which makes it great to use as a key. dnsmasq hashes the question before it sends the request and then hashes the question in the received response and matches it to a question it asked, meaning a frec.

This is a simplified version of dnsmasq’s algorithm for matching a response received from the upstream server to a frec:

1.0 check if response’s destination port matches any of the open ports

1.1 if not: drop the response

2.0 set key = (hash of the response’s question, transaction ID)

3.0 search for matching frec by key

3.1 if not found: drop the response

4.0 process the response
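
The steps above can be sketched in Python. This is a toy model under stated assumptions, not dnsmasq's actual code: the Frec class, hash_question and match_response are hypothetical names, and zlib.crc32 stands in for dnsmasq's own hash function. The point it illustrates is that the port a query was sent from is stored but never used when matching.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Frec:
    """A pending forwarded query (hypothetical model, not dnsmasq's struct)."""
    question_hash: int
    transaction_id: int
    source_port: int  # stored, but never consulted during matching


def hash_question(question: str) -> int:
    # Stand-in hash: dnsmasq used its own CRC32 variant, not zlib's.
    return zlib.crc32(question.lower().encode("ascii"))


def match_response(frecs, open_ports, dst_port, txid, question):
    """Return the frec a response is matched to, or None if it is dropped."""
    # Steps 1.0 / 1.1: the response must arrive on any currently open port.
    if dst_port not in open_ports:
        return None
    # Step 2.0: the key is (question hash, transaction ID); the port is not in it.
    key = (hash_question(question), txid)
    # Steps 3.0 / 3.1: look for a pending frec with that key.
    for frec in frecs:
        if (frec.question_hash, frec.transaction_id) == key:
            return frec
    return None


# One pending query sent from port 40001; a second socket is open on 40002.
pending = [Frec(hash_question("www.example.com"), 0x1234, 40001)]
ports = {40001, 40002}

# A response arriving on the other open port is still accepted.
wrong_port_hit = match_response(pending, ports, 40002, 0x1234, "www.example.com")
# A response arriving on a port that is not open is dropped.
closed_port_hit = match_response(pending, ports, 40003, 0x1234, "www.example.com")
```

In this model, wrong_port_hit is the pending frec even though the response arrived on a different port than the query left from, which is the behavior behind CVE-2020-25684.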

By default, dnsmasq supports up to 150 frecs at the same time. This means that it is possible that in a given time there will be 150 open frecs. Each of those frecs uses one of the 64 randomized ports that we covered earlier.

There are two ways to abuse this behavior. Both lead to the same outcome, and I cover each of them in the next two sections.

CVE-2020-25685: Weak Hashing Algorithm

dnsmasq uses a custom CRC32 function as its hashing function for the key. Unfortunately, CRC32 is not a cryptographically secure hash function. In other words, it is easy to find different inputs that produce the same output. In this case, different question sections can result in the same hash. An attacker can abuse that to craft a special list of domains that all result in the same custom-CRC32 hash and then send 150 queries to dnsmasq using that list.
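
To give a feel for how cheap collisions are in a 32-bit checksum, here is a minimal sketch that runs a birthday search over random names. It uses the standard zlib.crc32 rather than dnsmasq's custom variant, and find_crc32_collision, the seed and the example.com suffix are all illustrative assumptions. It only finds a pair of colliding names; building a full list of 150 names with one shared hash would rely on CRC's linear structure, which is not shown here.

```python
import random
import string
import zlib


def find_crc32_collision(seed: int = 0, max_tries: int = 2_000_000):
    """Birthday search for two distinct names with the same zlib CRC32."""
    rng = random.Random(seed)
    seen = {}  # CRC32 value -> first name that produced it
    for _ in range(max_tries):
        label = "".join(rng.choices(string.ascii_lowercase, k=12))
        name = f"{label}.example.com"
        digest = zlib.crc32(name.encode("ascii"))
        other = seen.get(digest)
        if other is not None and other != name:
            return other, name
        seen[digest] = name
    return None  # no collision within max_tries


collision = find_crc32_collision()
```

Because the output is only 32 bits wide, a collision is expected after on the order of 2^16 random inputs, so the search typically finishes in well under a second on ordinary hardware.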

CVE-2020-25686: Pending Queries Are Not Checked

dnsmasq allows multiple pending queries for the same domain at the same time. This means that an attacker can just query www.example.com 150 times before dnsmasq is able to receive results for any of the requests, and there will be a time frame in which there are 150 open frecs for those 150 queries.

Note that if an attacker issues an attack from a web browser, most modern web browsers will block further requests to the same domain and this method will not work. In such a case, the attacker would have to use the previous method, which is a bit more complicated.

Outcome

Both of those problems can result in the same thing. An attacker who wants to send fake responses can choose to do that in a time frame in which dnsmasq holds 150 frecs, with the exact same hashed question key, with 150 different transaction IDs and with 64 different ports.

An attacker needs to time this correctly and send fake responses in that tiny time frame when all open 150 frecs are of the same domain. The attacker’s fake responses will also be for that domain. In such a case, the attacker's chances of hitting any of the open frecs are 150 times greater than in a regular attack.

No Response Verification

When dnsmasq receives a response from an upstream DNS server, it does not verify it. In order to understand the problem here, one needs to be familiar with CNAME records.

A Canonical Name or CNAME record is a type of DNS record that maps an alias name to a true or canonical domain name. CNAME records are typically used to map a subdomain such as www or mail to the address hosting that subdomain’s content. For example, a CNAME record can map the domain mail.example.com to the mail server of the domain example.com.

In dnsmasq, however, one can send any A record after the CNAME record and dnsmasq will simply trust the response without verifying it. No one asked about those A records, so dnsmasq will not forward those records to the client. Instead, it simply caches all the A records that it is given. Not only that, dnsmasq will also overwrite any already cached addresses. For example:

www.example[.]com  CNAME  www.bank[.]com

www.bank[.]com     A      13.37.13[.]37

In the above example, if dnsmasq receives such a response for an open request for www.example[.]com and the attacker got the correct transaction ID and source port, dnsmasq will overwrite the cached address of www.bank[.]com regardless of the TTL (Time to Live) of previously cached addresses. From then on, any client of this resolver that tries to access the bank’s website will end up accessing an attacker-controlled website.
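
The caching behavior described above can be modeled with a few lines of Python. This is a toy model, not dnsmasq's code: cache_answers is a hypothetical helper, the cache is a plain dictionary, and 192.0.2.10 is a documentation address standing in for the bank's legitimate IP.

```python
def cache_answers(cache: dict, answers: list) -> None:
    """Cache every A record in a response without checking who asked for it."""
    for name, rtype, value in answers:
        if rtype == "A":
            # Overwrites any existing entry, regardless of its remaining TTL.
            cache[name] = value


# www.bank.com is already cached with its legitimate address.
cache = {"www.bank.com": "192.0.2.10"}

# Forged response to a pending query for www.example.com.
forged = [
    ("www.example.com", "CNAME", "www.bank.com"),
    ("www.bank.com", "A", "13.37.13.37"),
]
cache_answers(cache, forged)
```

After the call, the entry for www.bank.com holds the attacker's address, even though the pending query was for www.example.com and the old entry had not expired.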

This sounds crazy, but it isn’t. If dnsmasq had to verify every record after the CNAME, that would reduce its performance dramatically. It would have to contact each domain in the CNAME response and make sure the A record given is indeed its address, which would make the entire idea of CNAME redundant.

So instead, dnsmasq trusts the source of the response, on the assumption that it is nearly impossible to beat the transaction ID and source port randomization mitigation.

Attack Scenario Explained

Theory

In order to craft a successful DNS cache poisoning attack, one must correctly guess the transaction ID and source port. Usually this means guessing a 32-bit key – 16 bits for the port and 16 bits for the transaction ID.

In our case, as I demonstrated so far, CVE-2020-25684 helps the attacker by increasing the chances of hitting the correct port by 64 times, so the port part of the key is reduced to 16 - log2(64) = 10 bits long. Combined with the transaction ID key, our key is still 26 bits long. The chances of hitting a successful frec so far are 1 to 2^26 = 67,108,864 with a single response.

With either CVE-2020-25685 or CVE-2020-25686, we concluded that issuing 150 requests with the same question hash, either for the same domain or from a specially crafted list, increases the chances of hitting the correct frec by 150 times. Combining that with CVE-2020-25684 increases the chances of hitting a pending frec by 64 * 150 = 9,600 times. The chances of a single response hitting a pending frec are therefore about 1 in 2^32 / 9,600 ≈ 447,392.
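
The arithmetic in the two paragraphs above can be checked with a short Python snippet. The constant names are my own; the values 64 and 150 are the dnsmasq defaults quoted earlier in this post.

```python
import math

FULL_KEY_SPACE = 2 ** 32  # 16-bit transaction ID times 16-bit source port
OPEN_PORTS = 64           # default number of simultaneously open ports
OPEN_FRECS = 150          # default number of simultaneous frecs

# CVE-2020-25684 alone: any of the 64 open ports is accepted.
port_bits = 16 - int(math.log2(OPEN_PORTS))
key_space_after_25684 = 2 ** (16 + port_bits)

# Adding CVE-2020-25685 or CVE-2020-25686: any of 150 transaction IDs matches.
speedup = OPEN_PORTS * OPEN_FRECS
effective_key_space = FULL_KEY_SPACE // speedup

print(port_bits, key_space_after_25684, speedup, effective_key_space)
```

The division is rounded down; the exact quotient is slightly above 447,392.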

In addition, an attacker would usually have to wait for a DNS entry’s Time to Live value to expire (so that the entry is removed from the cache) in order to poison it. Otherwise, the DNS resolver would simply answer from the cache right away and no poisoning would take place. In our case, however, an attacker can abuse the CNAME record in order to poison any entry desired.

Attack Scenario

An attacker wants to poison www.bank[.]com. Usually, this would require waiting for www.bank[.]com to be out of cache, guessing the port correctly and guessing the transaction ID correctly. Only then would the attacker have a chance of actually succeeding.

If the DNS resolver is dnsmasq, the attacker can skip the first part and just query any domain that would not be in the cache. Instead of responding with a regular A record response, the attacker would respond with a CNAME, as described earlier.

The attacker still needs to guess the port and the transaction ID, which together form a 32-bit key.

Figure 2. Cache poisoning attack on dnsmasq.

Using the above vulnerabilities, an attacker can issue 150 requests, either for the same domain or with a list of domains with the same custom-CRC32 hashes. This way, they will have 150 possible transaction IDs to guess and need to hit only one of them. As calculated above, using this method, the attacker will have a 9,600 times greater chance of succeeding with each response.

If the attacker manages to hit with one of the fake responses, dnsmasq will receive the CNAME response and will cache all the A records in it.

Chances for a Successful Attack

In my last blog about DNS, I dived into the probability calculations needed to estimate the likelihood of a successful DNS cache poisoning attack under different combinations of mitigations. The calculations are the same as last time, but with a key space of 2^32 / 9,600 ≈ 447,392.

An attacker needs to hit any of the 150 transaction IDs and any of the 64 open ports to succeed in the attack.
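
As a rough illustration of what this key space means, the sketch below estimates how many forged responses are needed for a given chance of success. It is a simplified model and not the exact calculation from the previous blog: it assumes every forged response is an independent, uniformly random guess and that all 150 frecs and 64 ports stay open the whole time, which a real attack window does not guarantee. The function names are hypothetical.

```python
import math

# Per-response hit probability from the figures above: 9,600 chances in 2^32.
P_HIT = (64 * 150) / 2 ** 32


def success_probability(responses: int, p: float = P_HIT) -> float:
    """Chance that at least one of `responses` independent guesses hits."""
    return -math.expm1(responses * math.log1p(-p))


def responses_needed(target: float, p: float = P_HIT) -> int:
    """Smallest number of guesses whose success chance reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log1p(-p))


half_chance = responses_needed(0.5)
print(half_chance, success_probability(half_chance))
```

Under these assumptions, a few hundred thousand forged responses give an even chance of a hit, a volume that is easy to generate from inside an internal network.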

I managed to create a reliable proof of concept (PoC) in a simulated network, which could poison a dnsmasq resolver from inside an internal network in less than five minutes in the worst case, and under 10 seconds on average for a successful attack.

Kubernetes and OpenShift

dnsmasq is part of kube-dns, which is the DNS service of Kubernetes. The kube-dns service is made up of three containers running in a kube-dns pod in the kube-system namespace.

The three containers are:

  • kube-dns: A container that runs SkyDNS, which performs DNS query resolution.
  • dnsmasq: Caches responses from SkyDNS.
  • sidecar: A sidecar container that handles metrics reporting and responds to health checks for the service.

In the past, Kubernetes and OpenShift used kube-dns for DNS services.

As of Kubernetes v1.12, CoreDNS is the recommended DNS Server, replacing kube-dns. If your cluster originally used kube-dns, you may still have kube-dns deployed rather than CoreDNS. kube-dns uses dnsmasq, which is potentially dangerous as discussed in this article. However, CoreDNS doesn’t use dnsmasq, so newer versions of Kubernetes are not affected.

Applications

A Kubernetes or OpenShift cluster that uses dnsmasq will be vulnerable to the attacks described above. It is enough for a single deployment to be breached in order to poison the entire cluster. Another possible situation involves a dnsmasq machine that was misconfigured and is open to the internet, instead of just the internal network. That scenario is even worse, as DNS resolvers open to the internet are easy to find by bots scanning public IPs.

Conclusion

These vulnerabilities show once again why DNS cache poisoning is relevant to this day, despite all the mitigations that have been added over the years.

Having said that, with secure browsing and certificate usage nowadays, even if an attacker manages to poison a DNS server for a domain, it is most likely that the victim’s browser won’t load the page because the certificates won’t match. But the technique can still be used for enormous DoS attacks by poisoning an ISP’s DNS resolver and forwarding requests to a nonexistent IP address.

In the cloud, however, the problem is much bigger. As discussed earlier, Kubernetes used dnsmasq as part of its default DNS service until not so long ago. Even today, users can still choose to use dnsmasq instead of CoreDNS. In such a scenario, an attacker who manages to poison a cluster’s DNS resolver can cause an enormous amount of damage because, unlike browsers, cloud applications usually don’t perform certificate verification.

An attacker who breaches a single container can easily use this attack to spread further into the entire cluster, even without a container escape vulnerability.

Palo Alto Networks customers are protected from the attacks outlined in this blog in a variety of ways. Palo Alto Networks Next-Generation Firewalls can block DNS attacks by detecting suspicious DNS queries and anomalous DNS responses. Attacks related to exhaustion or high quantities of traffic are handled with DoS or zone protection profiles that are built into the firewall. These protections are in addition to coverage of DNS threats and indicators through DNS Security.

Furthermore, Prisma Cloud customers are protected from this threat through the Prisma Cloud Compute Host and Container Vulnerability Scanner, which alerts on vulnerable software components. The attack described in this blog requires one of the following, in addition to an old and vulnerable dnsmasq deployment:

  • A breach in the internal network of the cluster to be able to send malicious packets to the DNS server from the inside.
  • A misconfigured DNS resolver that is open to the internet.

Such scenarios almost always arise from attackers exploiting outdated and misconfigured applications, which are monitored by the Prisma Cloud Compute Compliance Protection.