Executive Summary
Configuration data that changes across each instance of deployed malware can be a gold mine of information about what the bad guys are up to. The problem is that configuration data in malware is usually difficult to parse statically from the file, by design. Malware authors know the intelligence value of this data, because it provides the directives for how the malware should behave.
Malware is like most complex software systems in that there are many advantages for code reuse and abstraction. Therefore, it is not surprising to see that the concept of software configuration is pervasive across the various malware families we analyze. After all, it’s pretty hard to imagine a stereotypical cybercriminal wanting to bother with recompiling their code to change an IP address or whatever else, when going after different targets.
But the good news is that statically armored configuration data can often easily be found and parsed directly from memory. We will cover a nice example of an IcedID (information stealer) configuration, how it was obfuscated and how we’ve extracted it.
Palo Alto Networks customers receive improved detection for the evasions discussed in this blog through Advanced WildFire. As we continue to parse and extract this information from malware families at scale, we hope to build out a pool of threat intelligence that will help us better understand the campaigns and tactics of the threat actors targeting different organizations.
Related Unit 42 Topics | Memory Detection |
What Are Malware Configurations?
So what exactly do we mean by the term “configuration” when talking about malware? Outside the context of malware, we think of configuration in terms of defining how systems should behave. For example, we would consider the rules used to define which networking routes for a firewall are allowed, or which font size your web browser uses while you read this, as configurable information.
For malware, this is no different. Malware configurations are just collections of elements that define how malware operates, such as the following:
- Command-and-control (C2) network addresses
- Passwords for remote administrators
- File paths in which to drop persistent payloads
The way these elements are embedded in malware components tends to be specific to each malware family. Also, they might evolve over time as malware undergoes development, or when malware authors change their build process.
Generally speaking, malware configuration elements tend to be the properties of malware that the authors want to make easily editable between campaigns and deployments without requiring manual code edits for each one. Malware configuration elements can also expose latent behaviors and malware infrastructure that are not typically observable under routine dynamic analysis.
Malware configurations have intelligence value for security practitioners because they provide insights into campaigns over time. In some cases, defenders could use them as actionable artifacts for network detection, or for identifying infected hosts. The successful extraction and validation of a malware configuration can also be used to reinforce our confidence when identifying a file as malicious.
Because malware configurations have value to security systems and defenders alike, it is state-of-practice for modern malware authors to protect their configuration elements using different techniques. These protections often include a blend of encryption, obfuscation and compression. They might also be layered with evasive techniques.
This protection poses a significant challenge for malware configuration extractors that operate solely by using static analysis, because all of these protections must be detected and bypassed before extraction can be performed. Using an advanced dynamic analysis sandbox combined with intelligent runtime memory analysis makes it possible to bypass many of these protections and pinpoint the best opportunities to perform extraction.
When we represent and store these configurations using standardized schemas, it enables us to extract maximum value through automation, machine learning and interactive analysis. The DC3-MWCP library defines a schema for many of the most common configuration element types, and it provides a simple library for serialization to JSON.
The MITRE MAEC and STIX projects also provide us with a more general vocabulary for representing malware configuration elements. This also allows us to correlate the elements with observable objects collected during dynamic analysis.
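To make the idea of a standardized schema concrete, here is a minimal sketch of serializing extracted configuration elements to JSON. It uses only the Python standard library. The `ExtractedConfig` class and its field names are hypothetical and chosen for illustration; they are not the DC3-MWCP, MAEC or STIX schemas.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ExtractedConfig:
    """Hypothetical container for configuration elements (not a real schema)."""
    family: str
    sample_sha256: str
    c2_addresses: List[str] = field(default_factory=list)
    campaign_id: Optional[int] = None


def to_json(config: ExtractedConfig) -> str:
    # Sorted keys keep the output stable, which helps when diffing
    # or indexing many extracted configurations.
    return json.dumps(asdict(config), sort_keys=True, indent=2)


example = ExtractedConfig(
    family="IcedID",
    sample_sha256="05a3a84096bcdc2a5cf87d07ede96aff7fd5037679f9585fee9a227c0d9cbf51",
    c2_addresses=["bayernbadabum[.]com"],
    campaign_id=1139942657,
)
print(to_json(example))
```

Keeping every family's output in one shape like this is what makes it practical to query and correlate configurations across samples.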
IcedID Analysis
Let’s look at one IcedID binary and how its configurations are encrypted.
Hash | 05a3a84096bcdc2a5cf87d07ede96aff7fd5037679f9585fee9a227c0d9cbf51 |
This particular attack chain, shown in Figure 1, was discovered in early November 2022. It delivered IcedID, an information stealer also known as Bokbot, as the final payload. This threat is well-known malware that has been attacking people since 2019.
The following diagram shows the infection chain.
Authors of IcedID took pains to hide their configurations. Recent samples of IcedID stage two would only be downloaded if the victim’s machine matched the requirements of the threat actor.
The configurations of IcedID consisted of C2 URLs and their campaign IDs. The C2 URLs included some that might not be revealed during the execution of the IcedID binaries. The campaign ID links IcedID samples back to specific threat actors.
We will go through the following steps to extract the configurations found in the IcedID stage one and two binaries:
- Unpack the IcedID binary
- Locate the encrypted configuration data blob
- Extract the encryption key
- Decrypt the configuration data blob with the encryption key
Unpacking IcedID Stage One
IcedID stage one unpacks itself by first allocating memory using the VirtualAlloc function. It then zeroes the allocated memory using the memset function, as shown in Figure 2. Finally, it copies the unpacked data to the allocated memory using the memmove function.
To dump the unpacked data, we set a breakpoint at memmove. The second argument of memmove (the source pointer) contains the address of the unpacked data. Figure 2 also shows the DOS MZ header of the unpacked IcedID stage one on the right-hand side of the hex dump.
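Once a buffer has been dumped from the debugger, a quick sanity check can confirm that it really starts with a PE image. The following is a minimal sketch, assuming the dump has already been read into a bytes object. The function name `looks_like_pe` is our own. The check relies only on the standard PE layout: the MZ magic at offset 0, the `e_lfanew` field at offset 0x3C, and the `PE\0\0` signature it points to.

```python
import struct


def looks_like_pe(dump: bytes) -> bool:
    """Return True if the buffer starts with a plausible DOS/PE header."""
    # The DOS header is 0x40 bytes long and begins with the "MZ" magic.
    if len(dump) < 0x40 or dump[:2] != b"MZ":
        return False
    # e_lfanew (little-endian DWORD at 0x3C) holds the offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", dump, 0x3C)
    # An out-of-range offset yields a short slice, so this compares as False.
    return dump[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"
```

A check like this only validates the headers. It says nothing about whether the rest of the image was dumped completely.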
Locating the Encrypted Configuration Data Blob
Next, we located the encrypted configuration data blob using the unpacked stage one IcedID. While debugging the unpacked IcedID stage one file, we set a breakpoint at the address that called WinHttpConnect, as shown in Figure 3. The address pointed to by register RDI contains the string of the C2 URL.
By backtracing the code, we located a function that used the decrypted configuration as shown in Figure 4.
Tracing the code flow back, we found the loop that decrypted the configuration, as shown in Figure 5.
The instruction at 0x7FEF33339CD loaded the address of the encrypted configuration data blob (Encrypted_Config) into register RDX.
Extracting the Encryption Key
The instruction at 0x7FEF33339D4 reads the encryption key. The key is located 0x40 bytes past the address of Encrypted_Config. We also learned that the configuration is 0x20 bytes long and that it is decrypted with an XOR loop.
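The layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script from the figures: it assumes each configuration byte is XORed with the key byte at the same index, so the key is also 0x20 bytes long. The names `decrypt_stage_one`, `KEY_OFFSET` and `CONFIG_LEN` are our own.

```python
KEY_OFFSET = 0x40   # key starts 0x40 bytes after the start of Encrypted_Config
CONFIG_LEN = 0x20   # length of the encrypted configuration


def decrypt_stage_one(blob: bytes) -> bytes:
    """XOR the first CONFIG_LEN bytes with the key bytes at KEY_OFFSET.

    Assumption: the key is applied byte for byte and has the same length
    as the configuration.
    """
    if len(blob) < KEY_OFFSET + CONFIG_LEN:
        raise ValueError("blob is too short to hold both config and key")
    return bytes(blob[i] ^ blob[KEY_OFFSET + i] for i in range(CONFIG_LEN))
```

The input is expected to be the bytes dumped from memory starting at the Encrypted_Config address, at least 0x60 bytes long.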
Decrypting the Configuration Data Blob With the Encryption Key
After gathering the encryption key, the encrypted data blob and the decryption routine, we can now decrypt the configuration using the script shown in Figure 6.
The decrypted IcedID stage 1 configuration has the following format, as shown in Figure 7.
From the decrypted configuration, we can extract the following IoCs:
C2 URL | bayernbadabum[.]com |
Campaign ID | 1139942657 |
Now, we will decrypt the configuration for the IcedID stage two binary.
Unpacking the IcedID Stage Two Binary
As the IcedID stage two binary uses the same packer as stage one, we will not repeat the unpacking steps here.
Locating the Encrypted Configuration Data Blob
We set a breakpoint at the address that calls WinHttpConnect, as shown in Figure 8.
After tracing the code, we located the function that used the decrypted configuration, as shown in Figure 9.
Extracting the Encryption Key
Tracing the code flow even further back, we found the function that decrypts the configuration. The first few instructions located the encrypted configuration blob. The encrypted blob is 0x25c bytes long. The encryption key is the last 0x10 bytes of the encrypted configuration blob, as shown in Figure 10.
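The blob layout described above can be expressed as a small helper that separates the key from the data it protects. This is a minimal sketch, assuming the 0x25c-byte length includes the trailing 0x10-byte key. The function name `split_stage_two_blob` is our own, and the sketch covers only the split, not the decryption loop.

```python
BLOB_LEN = 0x25C   # total size of the encrypted configuration blob
KEY_LEN = 0x10     # the key occupies the last 0x10 bytes


def split_stage_two_blob(blob: bytes):
    """Return (encrypted_data, key) for a stage two configuration blob.

    Assumption: the key is counted as part of the 0x25c bytes.
    """
    if len(blob) != BLOB_LEN:
        raise ValueError(f"expected {BLOB_LEN:#x} bytes, got {len(blob):#x}")
    return blob[:-KEY_LEN], blob[-KEY_LEN:]
```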
After retrieving the encryption key, the next step is the loop to decrypt the encrypted blob, as shown in Figure 11.
Decrypting the Configuration Data Blob With the Encryption Key
We replicated the instructions in the decryption loop using Python. After gathering the encryption key, encrypted data blob and the decryption routine, we can now decrypt the configuration using the following script (shown in Figure 12).
The decrypted IcedID stage two configuration has the following format, shown in Figure 13.
From the decrypted configuration, we can extract the following indicators of compromise (IoCs):
C2 URLs | newscommercde[.]com, spkdeutshnewsupp[.]com, germanysupportspk[.]com, nrwmarkettoys[.]com |
C2 URI | news |
Campaign ID | 1139942657 |
We have manually decrypted the configuration for both the IcedID stage one and two binaries.
Scaling Up
Now that we’ve discussed the work of figuring out how to target the configuration data in memory, the next challenge is to figure out how to perform this at scale. The massive scale of most malware processing systems means that practitioners looking to build out a configuration extraction system need to be careful about adding overhead. This means that we need a mechanism to intelligently identify only the samples of interest for each parser, so we’re not unnecessarily running dozens of parsers across millions of samples.
We think a reasonable approach to this problem involves using intelligent runtime memory analysis, as it provides us with excellent visibility into the secrets malware authors want to protect. A typical workflow for our malware configuration extractors includes the following activities:
- Scanning memory and/or other dynamic analysis artifacts
- Applying a noise filter on the results to identify the best candidates for extraction
- Performing extraction using the best fitting module and storing the results for reporting and indexing
Generalizing this common workflow presented us with the opportunity to make the following improvements:
- Optimizing the search phase by only scanning analysis data once in most cases
- Applying abstractions and reusable code for many common tasks
- Limiting the impact of modules with problematic inputs or other bugs
- Giving our security researchers visibility into the performance of their modules
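The workflow and improvements above can be illustrated with a small dispatcher. This is a simplified sketch and not the framework described in this post: the `Extractor` class, its `marker` field and the `run_extractors` function are hypothetical names. It filters candidates with a cheap byte-pattern check, isolates failures so one buggy module cannot break the others, and records per-module timing. For simplicity it checks each marker separately; a production system would more likely use a multi-pattern scanner so the data is only scanned once.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Extractor:
    """Hypothetical extraction module for one malware family."""
    name: str
    marker: bytes                        # byte pattern that flags a candidate
    extract: Callable[[bytes], dict]     # returns the parsed configuration


def run_extractors(memory_dump: bytes, extractors: List[Extractor]) -> List[dict]:
    results = []
    for module in extractors:
        # Noise filter: skip modules whose marker is absent from the dump.
        if module.marker not in memory_dump:
            continue
        started = time.perf_counter()
        try:
            config, error = module.extract(memory_dump), None
        except Exception as exc:         # contain bugs and bad inputs per module
            config, error = None, repr(exc)
        results.append({
            "module": module.name,
            "config": config,
            "error": error,
            "seconds": time.perf_counter() - started,  # visibility into cost
        })
    return results
```

Recording the error and elapsed time alongside each result gives researchers a simple way to spot modules that fail often or run slowly.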
The following example shows some of the IoCs from a recent IcedID extractor after being deployed at scale. Having a nice framework for deploying configuration extractors means that once you are finished crafting a configuration extraction script, it’s time to kick your feet up and relax while hundreds of configurations flow into your malware configuration database.
Conclusion
Thank you for joining us in this overview of malware configurations and why we are working hard to parse this information at scale in Advanced WildFire. Reverse engineering variants of each malware family allows us to build out parsers to extract meaningful and relevant data for all of them at scale.
There is a staggering amount of diversity among payloads in the malware landscape, which makes the task of supporting them all more or less impossible. Where possible, we use metrics-based approaches to prioritize focus on the malware families and variants most relevant to our customers. In this ongoing area of research, our team will continue to expand support for new malware families and variants.
Palo Alto Networks customers receive protections from threats such as those discussed in this post with Advanced WildFire.
Indicators of Compromise
05a3a84096bcdc2a5cf87d07ede96aff7fd5037679f9585fee9a227c0d9cbf51
Additional Resources
- Blowing Cobalt Strike Out of the Water With Memory Analysis – Unit 42, Palo Alto Networks
- Navigating the Vast Ocean of Sandbox Evasions – Unit 42, Palo Alto Networks
- Machine Learning Versus Memory Resident Evil – Unit 42, Palo Alto Networks
- Tailoring Sandbox Techniques to Hidden Threats – Unit 42, Palo Alto Networks
Updated May 17, 2023, at 6:00 a.m. PT.