Abstract

The contest between malware authors and security analysts is an ever-changing battlefield in which each side tries to out-innovate the other. While security analysts are constantly updating their signatures of known malware, malware variants change their signature each time they infect a new host, leading to an endless game of cat and mouse. This survey provides a thorough review of the obfuscation and metamorphic techniques commonly used by malware authors. The main topics covered in this work are (1) an overview of the string-scanning techniques used by antivirus vendors and the impact malware has had from a security and monetary perspective; (2) an overview of the methods of obfuscation encountered during disassembly, as well as methods of concealment using a combination of encryption and compression; (3) a comprehensive list of the datasets available in malware research, including tools to obfuscate malware samples; and (4) a discussion of the various ways Windows APIs are categorized and vectorized to identify malicious binaries, especially in the context of identifying obfuscated malware variants. This survey gives security practitioners a better understanding of the nature and makeup of the obfuscation employed by malware. It also reviews the main barriers to reverse-engineering malware for the purposes of uncovering its complexity and purpose.

1. Introduction

Digital resources and infrastructure have become some of the most crucial concerns in the field of cyber security. As we encourage a greater use of the Internet to delegate the tasks of everyday life, we expose ourselves and our information to potential exploitation by malicious actors. The biggest culprit is malware, a portmanteau of malicious software. Malware takes on many forms, but put simply, the ultimate goal of malware is to carry out a series of actions for nefarious purposes. Whether the end goal is espionage, disrupting services, or exploiting systems for financial gain, the costs associated with inaction are increasing every year as new malware variants are deployed on unsuspecting enterprises and victims. Every year several antivirus (AV) vendors publish their annual white papers regarding the current state of malware worldwide. From a research standpoint, researchers are concerned about three aspects of malware behavior: the ability of malware to disguise its own structure to avoid detection; modification and/or utilization of the host operating system (OS) resources; and the communication malware aims to establish externally [1] to so-called command and control (CnC) servers. These aspects of malware behavior can be summarized as follows:
Obfuscation: Malware employs various obfuscation techniques, such as packing and encryption, in order to avoid signature-based detection methods. Obfuscated malware is also cumbersome to disassemble, which makes it difficult to produce accurate control-flow graphs (CFG) when reverse engineering.
Resources: Malware will utilize various resources of the host operating system in order to carry out its predefined objectives. Malware will call several Windows application programming interfaces (APIs), make changes to the registry, read and write to the file system, and create and spawn new child processes and threads.
Network: Malware will attempt to communicate with an outside CnC server in order to relay information. Communication may be used to serve a greater botnet network and relay confidential personal details obtained from surveillance of the target OS, or it may be used to detect the presence of a sandbox environment in antiemulation and stealth malware.

The scope of malware worldwide is widespread and includes infections on both the Macintosh and Windows operating systems, affecting businesses, governments, and individuals alike. A total of 20% of individuals have experienced a malware attack in one form or another, a 14% increase from 2018 [2]. Estimates for 2019 recorded 24 million Windows and 30 million Macintosh infections [3], with Kaspersky noting over 24 million unique malicious objects detected in 2019 alone [4]. While the infections recorded span several different types of OS, approximately 94% of malware developed is, in fact, Windows targeted [5, 6]. Malware takes on many shapes and sizes and includes archetypes such as Trojans, adware, spyware, viruses, worms, ransomware, rootkits, exploits, cryptojackers, and keyloggers. These all carry out some form of invasion, damage, or disabling of systems for the direct or indirect benefit of the malicious actor. More recently, the availability of free and open source software distributions has posed significant risks, as so-called “script kiddies,” users who have little to no experience in writing software themselves, have made use of these tools for nefarious purposes. The ready availability of distributions such as REMnux and Kali Linux (Offensive Security, New York City, NY) has made it even easier for users to deploy various forms of reconnaissance and penetration testing tools with out-of-the-box software. As natural language processing (NLP) tools become more sophisticated, chatbots such as ChatGPT can act as personal advisers in red-teaming and blue-teaming drills, and they can subsequently be used by black hats for their own vulnerability campaigns.

Businesses are some of the most susceptible targets of malware attacks, as they are potential victims of ransomware attacks for monetary gain and experience service downtime due to denial of service (DoS) attacks. For example, in late 2019, the average downtime for a ransomware attack was 16.2 days and the average ransom payment was 81,116 USD, almost double the 41,198 USD seen earlier in 2019 [7]. The average cost of a data breach to a business was estimated at 3.8 million USD [8], and the average cost of a DoS attack was placed at 2 million USD [9]. The prevalence of malware in the business environment is evident, with 95% of organizations recording a malicious infection [10] and 81% having been affected by such an infection [11]. While total malware detections have seen a small increase of 1% year over year, the business sector saw a 13% increase in 2019 [3, 12]. The top 10 malware variants which target business infrastructure saw triple-digit increases in their number of infections between 2018 and 2019 [3]. Small businesses represent 43% of infected businesses reported, likely due to their inability to mitigate, flag, and respond to infections appropriately [7] and the fact that 37% of businesses spend less than 200,000 USD on information technology (IT) security and 78% do not have a formal incident response plan in place [13, 14]. Security experts encourage IT security personnel to adopt the 1-10-60 rule: threats are to be detected within the first minute, investigated within 10 minutes, and addressed with an appropriate action within the first 60 minutes [15]. Businesses are prime targets for malware due to the financial motivation, with 71% of all breaches being financially motivated and 25% being motivated by espionage [7, 8]. Furthermore, North America is one of the leading regions where corporate ransomware is a pressing concern, with 68% of businesses having experienced attacks in the last year [16].

AV vendors are particularly interested in the emergence of new forms of malware because these represent unique instances of malware that have never been seen before and they pose a significant threat to security infrastructure. A report by FireEye noted that over 100,000 unique malware signatures are reported each day by AV vendors [10]. Zero-day attacks are of particular concern as they require AV vendors to develop signatures of these new malware instances, which demands significant domain-level knowledge and constant revision of their signature database. New and emerging threats are evident, with 60% of ransomware variants identified in the last 6 months of 2016 having been developed within the previous year [17]. Moreover, a small but mutable subset of malware variants, totaling only 50 malware families, was noted to make up 80% of all successful malware infections [10]. This propensity for malware infections to originate from a small family of malware instances is due to the polymorphism built into their development. Polymorphism allows malware to change its signature upon each iteration of its propagation, leading to previously unseen threats and new instances of zero-day attacks [18–21]. As the stakes increase for both cybercriminals and businesses, so have the tools they develop to penetrate and mitigate threat vectors, respectively. The call for cyber security expertise has never been greater, with 62% of organizations planning on investing more in cyber security in 2020 [22]. The prevalence of polymorphic malware and its variants has expanded how we approach the field of cyber security for threat mitigation. Legacy methods, which classify new malware based on previously known signatures, are no longer effective in identifying polymorphic malware [23], lending credence to the development of a more adaptable, behavioral, and cognitive-based approach to how we detect malware [24]. The vast majority (93.6%) of malware observed today is polymorphic [25], and the necessary steps must be taken to ensure our intrusion detection systems (IDS) and security information and event management (SIEM) systems are equipped to keep up with the ever-mutating nature of today’s malware landscape.

This review will cover several aspects of metamorphic malware, from the limitations of current signature-based methods to the various obfuscation techniques employed by malware. This survey discusses the constantly evolving threat characteristics of metamorphic malware, which provide the basis for building more sophisticated heuristic and analytical tools based on potential feature sets. In addition, a broad discussion of metamorphic engines and antiarmoring techniques covers the challenges researchers face in isolating malware variants in a controlled environment. We hope to improve the current understanding of metamorphic malware research by making the following core contributions in this work:
(i) Summarize the common obfuscation methods which, in turn, can be used to develop better heuristic techniques for feature engineering in machine learning pipelines.
(ii) Present the inner workings of a metamorphic engine and polymorphism more generally. Understanding how a malicious payload can persist in memory without ever being written to disk will allow researchers to find indicators of compression or encryption when a candidate binary is presented.
(iii) Outline the current metamorphic engines broadly available in the literature which can be used by researchers to obfuscate their own binaries to incorporate robustness into their own work.

Section 1.1 covers the basic signature techniques used by AV vendors, including the most common scanning techniques and considerations for scanners. Section 2 builds on the limitations of these techniques by introducing malware obfuscation, which is the routine most commonly used by metamorphic engines in their obfuscation stage. In Section 3, the idea of obfuscation is put into perspective with a deep dive into a metamorphic engine, which involves the ability of malware to unpack, obfuscate, compress, and encrypt its payload on the fly. Finally, Section 4 provides an overview of the most well-studied datasets used in malware research, with Section 4.2 covering popular metamorphic kits that can be used by researchers to create their own metamorphic binaries.

1.1. Signature Analysis and Creation

Signatures are used to help identify malicious code segments present, either existing as independent executables or attached to benign files known as benignware. It is imperative that AV vendors constantly update their signature databases in order to cross-reference known malicious binaries with files suspected of being malicious. Acting as unique fingerprints for malware, signatures are plagued with several fundamental issues. First, signatures are incapable of identifying emerging malware variants. In an environment where approximately 60% of new ransomware are never-before-seen variants according to the most recent estimates [3], this creates a significant shortfall in detection rates for new variants. In addition, when the vast majority of malware is polymorphic [25], signatures are sometimes not generalized to catch obfuscated instances of previously flagged malware.

The art of file scanning is in and of itself a laborious process, requiring trade-offs between speed and specificity. Incorporating longer signatures provides a more specific identification of malware and malware families but is unable to catch the subtleties of minute changes [26]. Short signatures provide better coverage but result in more false positives [27, 28]. AV vendors therefore must come up with a series of rules to both generalize their signatures and improve their scanning efficiency. Some of the basic scanning strategies are shown in Table 1 and described in the following:
String scanning is the de facto standard for any string match scanning. The scanner looks for the exact sequence of bytes at any offset.
Wildcards method allows for the use of wildcard variables. In the example shown in Table 1, the use of “??” acts as a placeholder for 2 bytes of any string, while %3 prompts the scanner to look for the subsequent byte sequence in any of the following 3 byte positions. This is extremely effective for catching register swapping and instruction replacement obfuscations.
Mismatch method incorporates the idea of a partial match of any given byte sequence. In the example provided in Table 1, if the scanner allows for up to 1 mismatch, the scanner reports a match as long as 2 of the 3 byte sequences are found.
Generic method allows for the detection of malware families through the use of both wildcards and mismatch sequences. This method extracts the core malware artifacts of a malware family, thereby capturing any subtle alterations to the bytecode sequence that may arise in the future. For example, the Win95/Regswap virus uses similar opcodes between generations. Through a combination of wildcard string matching with mismatch, the entire Regswap generation can be flagged based on a few common signatures.
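The string, wildcard, and mismatch strategies described above can be illustrated with a minimal Python sketch. It is not a vendor implementation: the signature syntax, the function names `parse_signature` and `scan`, and the choice to let each `??` token stand for exactly one byte are assumptions made for this example and may differ from the convention used in Table 1. The `%n` skip operator is not implemented.

```python
from typing import List, Optional


def parse_signature(sig: str) -> List[Optional[int]]:
    """Turn 'B8 ?? 00' into [0xB8, None, 0x00]; None matches any byte (assumed syntax)."""
    return [None if tok == "??" else int(tok, 16) for tok in sig.split()]


def scan(data: bytes, sig: str, max_mismatch: int = 0) -> List[int]:
    """Return every offset where sig matches with at most max_mismatch wrong bytes."""
    pattern = parse_signature(sig)
    hits = []
    for off in range(len(data) - len(pattern) + 1):
        misses = 0
        for i, want in enumerate(pattern):
            # Wildcard positions never count as a mismatch.
            if want is not None and data[off + i] != want:
                misses += 1
                if misses > max_mismatch:
                    break
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append(off)
    return hits


# Hypothetical byte strings: the second differs from the first in one byte.
first = bytes.fromhex("90 B8 04 00 00 00 C3")
second = bytes.fromhex("90 BB 04 00 00 00 C3")

exact = scan(second, "B8 04 00 00 00")                     # string scanning
wild = scan(second, "?? 04 00 00 00")                      # wildcard scanning
fuzzy = scan(second, "B8 04 00 00 00", max_mismatch=1)     # mismatch scanning
```

In this sketch the exact signature taken from `first` misses `second`, while both the wildcard and the one-mismatch variants still locate it, which is the trade-off between specificity and coverage discussed above.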

In addition to generating unique signatures as part of building a larger signature database for malware, scanning files requires a dedicated strategy, and in some cases, dedicated hardware. For example, while a signature may be located in any one of the portable executable (PE) sections, such as .idata, it may also be located in the PE file header. In addition, cross-referencing thousands of malicious signatures with an incoming data stream using regex patterns requires intrapacket or interpacket scanning to process them effectively [29]. AV vendors utilize cheaper operations, such as checking the file length, before committing to a more expensive task such as a checksum [28]. In practice, a signature can act as a representation for a series of bytes, a whole file, or certain sections. The ways in which AV vendors carry out simple scanning on a binary are described in the following sections.

1.1.1. Top-and-Tail Scanning

This mode of scanning is used to extract signatures from the top and bottom of files. This is especially useful for viruses that attach themselves to the front or back of the targeted host program. Since the address of the main entry point of a program is in its header section, manipulation of this address to point to the attached malicious binary is possible [30]. As an example, the Polimer.512.A virus prepends itself to the front of the executable and shifts the original program content after itself. Alternatively, the Vienna virus is 1,881 bytes long and appends itself to the end of the host file.
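As a rough sketch of the idea, the following Python function restricts a signature search to a window at the start and at the end of a file. The function name `top_and_tail_scan`, the default window of 2,048 bytes, and the placeholder marker bytes are assumptions chosen for illustration only.

```python
def top_and_tail_scan(data: bytes, signature: bytes, window: int = 2048) -> bool:
    """Look for the signature only in the first and last `window` bytes of the file."""
    head = data[:window]
    tail = data[-window:]
    return signature in head or signature in tail


# Harmless placeholder data standing in for a host file and a virus body.
host = b"\x00" * 10000
marker = b"MARKER"

prepended = marker + host                        # body placed at the front
appended = host + marker                         # body placed at the end
embedded = host[:5000] + marker + host[5000:]    # body placed mid-file
```

Scanning only the two windows keeps the cost independent of file size, at the price of missing a body placed elsewhere in the file, as the `embedded` case shows.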

1.1.2. Entry Point Scanning

This mode of scanning is used to extract signatures from the byte sequence at program entry points. Malware routinely alters program entry points so as to avoid detection by rerouting the execution flow to a decryptor stub which decrypts the original binary [31]. The Zmorph virus follows such behavior, whereby the decryptor rebuilds the instructions line by line by pushing the result into the stack memory. This can lead to “black hole” scenarios where useless operations are inserted early in the process flow to burden the reverse engineering analysis.

In addition, an assembly encoder or an altered JUMP statement can be configured to run encoded information in a “code cave,” so as not to increase the file size of the binaries. This would normally impact the binary file header values, and any changes will alter relative/absolute offsets, so the pointers need to be changed accordingly. As previously mentioned, the Polimer.512.A virus prepends itself to the infected program, and in doing so adds exactly 512 bytes. This consistent file size differential raises flags and makes possible infected files easy to identify.

Viruses such as Win32/Simile are able to avoid changing the entry point of an infected file by altering call instructions which reference ExitProcess() so that they point to the virus code. The entry point of the infected file therefore remains unchanged. Other viruses such as W32/Bistro and W32/SMorph obfuscate their entry point [32]. SMorph is able to use existing API calls in the infected file to call its own import address table containing references to API imports.

1.1.3. Integrity Checking

This mode of scanning can be an extremely powerful tool for detecting manipulation of system files which should never change [27]. A checksum database can be used for reference when performing routine integrity checking of the system and files to detect any alterations [33, 34]. Common checksums include MD4, MD5, and CRC32. Checksums are routinely computed over the byte values in suspected areas of a virus body, thereby reducing the total number of checksums required.
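A checksum baseline of this kind can be sketched in a few lines of Python using the standard `zlib` and `hashlib` modules. The helper names (`fingerprint`, `build_baseline`, `changed_files`) and the in-memory dictionary standing in for a file system are assumptions made to keep the example self-contained.

```python
import hashlib
import zlib
from typing import Dict, List


def fingerprint(data: bytes) -> Dict[str, object]:
    """Compute a cheap CRC32 and a stronger MD5 digest for one file's content."""
    return {
        "crc32": zlib.crc32(data) & 0xFFFFFFFF,
        "md5": hashlib.md5(data).hexdigest(),
    }


def build_baseline(files: Dict[str, bytes]) -> Dict[str, Dict[str, object]]:
    """Record a reference fingerprint for every file that should never change."""
    return {name: fingerprint(content) for name, content in files.items()}


def changed_files(files: Dict[str, bytes],
                  baseline: Dict[str, Dict[str, object]]) -> List[str]:
    """Return the names whose current fingerprint differs from the baseline."""
    return sorted(
        name for name, content in files.items()
        if baseline.get(name) != fingerprint(content)
    )


# Hypothetical file contents standing in for protected system files.
system_files = {"kernel.sys": b"original kernel bytes", "shell.exe": b"original shell bytes"}
baseline = build_baseline(system_files)

system_files["shell.exe"] = b"original shell bytes" + b"\x90"  # simulated tampering
```

A file absent from the baseline is reported as changed in this sketch, which is a design choice rather than a requirement of integrity checking.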

Alternatively, certain types of infections, such as companion infections, may attempt to mimic the name of an infected file and redirect the header section of an EXE, which stores the address of the main entry point of a program, to the start of the virus code [35]. The virus may also change the extension to COM, as the Windows OS gives a higher priority to COM over EXE extensions. In order to account for this, products such as McAfee’s network security platform can assign a magic number to file types and will flag files whose extensions have been tampered with [36].
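The magic-number idea can be sketched as follows. This is not McAfee’s implementation; the table `MAGIC_NUMBERS`, the extensions it allows, and the function name `is_extension_tampered` are assumptions for illustration. The only facts relied upon are that DOS/PE executables begin with the bytes `MZ` and that PDF files begin with `%PDF`.

```python
import os
from typing import Dict, Set

# Leading bytes of a file mapped to the extensions considered consistent with them.
# The allowed-extension sets are illustrative assumptions, not an exhaustive list.
MAGIC_NUMBERS: Dict[bytes, Set[str]] = {
    b"MZ": {".exe", ".dll", ".sys"},
    b"%PDF": {".pdf"},
}


def is_extension_tampered(filename: str, data: bytes) -> bool:
    """Flag a file whose content type, judged by magic number, contradicts its extension."""
    ext = os.path.splitext(filename)[1].lower()
    for magic, allowed in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return ext not in allowed
    # Unknown content type: nothing to compare the extension against.
    return False


# A companion-style rename: executable content carrying a COM extension.
suspicious = is_extension_tampered("calc.com", b"MZ\x90\x00")
legitimate = is_extension_tampered("calc.exe", b"MZ\x90\x00")
```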

2. Obfuscation

This section provides an overview of the common obfuscation techniques employed by malware. Examples of these techniques are provided, along with some actual code snippets from well-known malware variants. Finally, a brief overview is given of encryption and compression, two techniques that are important to understand. This section focuses on obfuscations made specifically via changes to the opcodes and operands, which together form the CPU instructions specifying which operation is performed and on which data. Opcode examples include both Intel and AT&T syntax, with the former being readily apparent as the source operand is always on the right side of the instruction and the destination on the left (e.g., mov eax, 1).

2.1. Dead-Code Insertion

Dead-code insertion, sometimes referred to as garbage code insertion, is an obfuscation technique which inserts byte code sequences into a binary without affecting functionality [37–40]. This obfuscation relies on the fact that instructions can be added to code which do not perform any meaningful function or, in other scenarios, carry out an operation and then perform it in reverse [41, 42]. An example of this type of obfuscation is shown in Table 2, where a series of nop instructions are used to pad the instructions. Typically, dead-code insertion is used to carry out one of three functions:
(1) Insertion of a pointless operation such as nop; mov eax, eax; add eax, 0; and eax, −1; or or eax, 0. In practice, these instructions do not change the content of CPU registers or memory, as they are all semantically equivalent to nop; however, they may modify the status of the flag register in the CPU. These instructions also have different opcodes.
(2) Insertion of operations with the purpose of burdening the reverse engineering process by altering values in registers and then reversing the instruction. An example would be incrementing a register with add eax, 1 and then reversing the instruction by decrementing it with sub eax, 1. Other examples would be push and pop, and inc and sub. This form does change the values of the CPU registers but simply undoes the operation sometime afterwards.
(3) Insertion of dead code within branches of code that are never actually called, which may or may not make changes to variables in other branches of code which are never executed. An example would be a set of variables in Function A which are manipulated, but Function A is never executed because it is bypassed with a jmp statement.
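The first two forms above can be sketched on a toy instruction listing in Python. The listing format (one instruction per string), the tables `NOP_EQUIVALENTS` and `REVERSIBLE_PAIRS`, and both function names are assumptions for this example; the naive stripper also ignores side effects on the flag register, which a real analysis could not do.

```python
import random
from typing import List

# Instructions treated as no-ops in this sketch (flag side effects are ignored).
NOP_EQUIVALENTS = ["nop", "mov eax, eax", "add eax, 0", "or eax, 0"]
# An operation followed immediately by the operation that undoes it.
REVERSIBLE_PAIRS = [("add eax, 1", "sub eax, 1"), ("push ebx", "pop ebx")]


def insert_dead_code(code: List[str], rng: random.Random, rate: float = 0.5) -> List[str]:
    """Before each instruction, optionally insert a no-op or a self-cancelling pair."""
    out: List[str] = []
    for ins in code:
        if rng.random() < rate:
            if rng.random() < 0.5:
                out.append(rng.choice(NOP_EQUIVALENTS))
            else:
                out.extend(rng.choice(REVERSIBLE_PAIRS))
        out.append(ins)
    return out


def strip_dead_code(code: List[str]) -> List[str]:
    """Remove known no-ops and adjacent self-cancelling pairs (a naive normalizer)."""
    out: List[str] = []
    for ins in code:
        if ins in NOP_EQUIVALENTS:
            continue
        if out and (out[-1], ins) in REVERSIBLE_PAIRS:
            out.pop()  # drop the first half of the pair together with the second
            continue
        out.append(ins)
    return out


original = ["mov eax, 5", "add eax, ebx", "mov ecx, eax", "ret"]
variant = insert_dead_code(original, random.Random(0), rate=1.0)
recovered = strip_dead_code(variant)
```

The third form, code placed in branches that are never executed, cannot be removed by local pattern matching of this kind and would require control-flow analysis.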

Garbage code insertion is used successfully in the implementation of W95/Bistro, a later implementation of W32/Zperm, which utilizes a random block insertion engine that is placed directly after the virus entry point. Upon entering this block of code, millions of instructions are run, thereby overburdening the emulator before the virus instructions are even executed. Other popular examples of viruses utilizing garbage code insertion are W32/Evol and W32/Zmist. Zmist is notable for its use of the executable trash generator (ETG). W32/Evol in particular is able to utilize garbage code insertion to produce very different variants with different opcodes and string signatures, thereby evading signature scanning techniques as no sequence of bytes is similar between the two generations. An example of 3 variations of the same code is shown in Table 3.

Garbage code insertion is useful in avoiding AV scanning for two reasons. First, the garbage code inserted is unique to each virus generation, thereby sidestepping previously seen AV signatures [44]. Second, garbage code from benignware can be inserted into malware to increase the false negative rate. In [45], the authors created binaries with approximately 30% dead code along with 10% benign code and showed classification scores similar to those of benignware. In the work of [46], proportions of garbage code between 5 and 35% were used to determine their effectiveness at evading detection, with 10% noted as sufficient. In an earlier work [47], the authors combined various proportions of garbage code insertion with subroutine reordering for a total of 25 different combinations. Two different obfuscation engines, AVFUCKER and DSPLIT, also known as crypters, were used in [48] to produce obfuscated code with dead-code insertion. Since garbage code insertion can take a wide variety of forms, from single nops to intermeshed garbage code blocks, string scanning is fairly ineffective against this form of obfuscation.

2.2. Register Reassignment

Register reassignment, sometimes referred to as register renaming, is an obfuscation technique which swaps unused registers or memory variables with those currently used by the program [44]. In its simplest form, as demonstrated in Figure 1, register reassignment can replace the eax register with ebx, with no change in functionality.

The downside to using register reassignment is that string scanning techniques, such as wildcard or half-byte techniques, can be used to detect any possible combination of registers used. The bytes that do not encode a register remain constant between generations of register reassignment, so these generations are easily flagged by scanners. The virus W95/Regswap (hence the name) effectively made use of register reassignment, as demonstrated in Table 4.

In Table 5, the string signatures of version 1 and version 2 have a 60% similarity in their hexadecimal representation [49]. With the help of regular expressions, detection accuracy is greatly increased across variations of a similar instruction set [50]. Along with garbage code insertion, these primary obfuscation techniques make it considerably harder to flag new variants of malware.
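To illustrate why swapped registers remain detectable, the sketch below uses a byte-level regular expression that ignores which register is used. It relies on two x86 encoding facts: mov r32, imm32 is encoded as a single opcode byte in the range B8 to BF followed by a 4-byte little-endian immediate, and push r32 is encoded as a single byte in the range 50 to 57. The two sample byte strings are hypothetical and are not taken from W95/Regswap or from Table 5.

```python
import re

# mov <any 32-bit register>, 4 followed by push <any 32-bit register>.
# The register is folded into the opcode byte, so a character class covers all eight.
REGISTER_AGNOSTIC = re.compile(rb"[\xB8-\xBF]\x04\x00\x00\x00[\x50-\x57]")

generation_1 = bytes.fromhex("B8 04 00 00 00 50")  # mov eax, 4 ; push eax
generation_2 = bytes.fromhex("BB 04 00 00 00 53")  # mov ebx, 4 ; push ebx

exact_hit = generation_1 in generation_2            # plain string scanning
regex_hit_1 = REGISTER_AGNOSTIC.search(generation_1) is not None
regex_hit_2 = REGISTER_AGNOSTIC.search(generation_2) is not None
```

Note that this pattern does not require the two instructions to use the same register; enforcing that would need additional logic beyond a character class.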

2.3. Instruction Substitution

The instruction-substitution technique introduces an additional layer of obfuscation on top of the techniques already discussed. The power of instruction substitution comes from the fact that there is a seemingly endless diversity to the substitutions that can be introduced to an existing instruction framework. Table 6 demonstrates an example of a 2–4 instruction substitution (2 instructions are replaced with 4 to perform the same function) [51]. Another instruction substitution would be replacing push eax; mov eax, ebx with push eax; push ebx; pop eax. Semantically, these are equivalent, but the push, pop sequence is in fact slower, as a direct register write with mov is quicker. This exact substitution is utilized by the W95/Zmist virus, along with interchanging xor/sub and or/test instructions.
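The semantic equivalence of the mov and push/pop sequences above can be checked with a very small interpreter. The sketch below is an assumption-laden toy: it models only register-to-register mov, push, and pop, and it ignores flags, memory addressing, and timing.

```python
from typing import Dict, List, Tuple


def run(program: List[str], regs: Dict[str, int]) -> Tuple[Dict[str, int], List[int]]:
    """Execute a toy listing and return the final registers and stack contents."""
    regs = dict(regs)          # work on a copy so callers can reuse the initial state
    stack: List[int] = []
    for ins in program:
        op, _, rest = ins.partition(" ")
        args = [a.strip() for a in rest.split(",")] if rest else []
        if op == "mov":
            regs[args[0]] = regs[args[1]]       # register-to-register only
        elif op == "push":
            stack.append(regs[args[0]])
        elif op == "pop":
            regs[args[0]] = stack.pop()
        else:
            raise ValueError(f"unsupported instruction: {ins}")
    return regs, stack


original = ["push eax", "mov eax, ebx"]
substituted = ["push eax", "push ebx", "pop eax"]

start = {"eax": 1, "ebx": 2}
result_original = run(original, start)
result_substituted = run(substituted, start)
```

Both listings end in the same register and stack state even though their opcode sequences differ, which is why opcode-level signatures fail against this obfuscation.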

Instruction substitution is utilized very effectively in several high-profile viruses such as Evol, MetaPHOR, Zperm, and Avron. Since instruction substitutions produce different opcode representations, they render opcode frequency and accompanying n-gram techniques effectively useless. Researchers have attempted to draw from the basic set of fundamental operations in order to track the malware’s original intentions. In [52], a clue set was established for the Evol virus upon which all rewritten instructions were based. This approach was found to be very effective at characterizing the metamorphic engine Evol uses. A similar approach was taken by [53], where the complex instructions the virus would create were transformed back into their simple representations using their similar semantics. In Table 7, two versions of the W95/Bistro virus are shown, using different instruction substitutions in each generation. Similar to register reassignment, the generations contain similar string signatures, making them susceptible to wildcard and half-byte scanning techniques. While this manuscript is focused on obfuscators based on the Intel x86 instruction set, compile-time instruction set obfuscators can also create semantically similar rule sets for basic operations in other instruction sets [54, 55].

2.4. Code Transposition

Code transposition, sometimes called instruction permutation, is an obfuscation technique which utilizes conditional or unconditional jmp statements to reorder single instructions or blocks of instructions [18]. Since jmp instructions can theoretically be used for every line of instruction, the total number of permutations grows with the number of lines rearranged [44]. Code transposition carries out a very similar function to subroutine reordering, with the exception that there is a change in the process flow; therefore, they will be discussed together. Subroutine reordering, also known as block reordering, is an obfuscation technique that reorders the process flow by rearranging blocks of code that have independent subroutines [56]. If a program were to be divided into n subroutines, then n! permutations of subroutines are available for rearrangement [40, 50, 57, 58]. A simple program with 10 subroutines would therefore be able to produce over 3.6 million possible iterations. Subroutine reordering requires that the instruction sets are independent of one another, allowing them to be reordered without having an impact on functionality. In Table 8, an example of a set of instructions exhibiting multiple forms of obfuscation is shown. In the example, code transposition, subroutine reordering, garbage code insertion, and instruction substitution are all used.
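The count quoted for 10 subroutines is 10! = 3,628,800. The following Python sketch, built on a toy listing format that is an assumption of this example (labels ending in a colon, unconditional jmp only), shuffles blocks on disk while chaining them with jmp statements so that the executed order is preserved.

```python
import math
import random
from typing import List


def transpose(blocks: List[List[str]], rng: random.Random) -> List[str]:
    """Emit the blocks in random order, linked by jmp so execution order is unchanged."""
    order = list(range(len(blocks)))
    rng.shuffle(order)
    listing = ["jmp B0"]                      # always start at the first logical block
    for idx in order:
        listing.append(f"B{idx}:")
        listing.extend(blocks[idx])
        nxt = f"B{idx + 1}" if idx + 1 < len(blocks) else "END"
        listing.append(f"jmp {nxt}")
    listing.append("END:")
    return listing


def execution_trace(listing: List[str]) -> List[str]:
    """Follow the jmp chain from the top and return the instructions as executed."""
    labels = {line[:-1]: i for i, line in enumerate(listing) if line.endswith(":")}
    pc, trace = 0, []
    while pc < len(listing):
        line = listing[pc]
        if line.startswith("jmp "):
            pc = labels[line[4:]]
        else:
            if not line.endswith(":"):
                trace.append(line)
            pc += 1
    return trace


blocks = [["mov eax, 1"], ["add eax, 5"], ["mov ecx, eax"]]
shuffled = transpose(blocks, random.Random(1))
trace = execution_trace(shuffled)
permutations_of_ten = math.factorial(10)
```

Every shuffled layout yields a different byte sequence on disk, yet `execution_trace` returns the same instruction order for all of them.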

Several jmp statements are employed to permute blocks of instructions which can be run independently from each other. Instruction substitution is used to add more sophisticated instructions based on the simple instruction set add eax, 5; mov ecx, eax. jnk insertions are used to add complexity to the existing code, and they are also added following the jmp F1 statement, where they are never actually executed. This jnk could include code from benignware that would normally fail to compile if it were embedded within the existing obfuscated framework but may confuse scanning techniques nonetheless. Table 8 also displays another form of obfuscation called subroutine outlining [32]. This obfuscation explicitly turns instruction blocks into subroutines and uses the call instruction to perform an unconditional jump to the location indicated by the label operand. Subroutine inlining carries out the reverse: subroutines are unraveled and placed in order so as to preserve the process flow. Unlike simple jmp instructions, call preserves the location to return to when the subroutine is completed.

This sophisticated form of obfuscation is used by the W95/Zperm and W32/Ghost viruses, with the former employing the real permutation engine to perform subroutine reordering. Zperm divides the code into frames which are independent subroutines, which are then repositioned randomly and connected using branch instructions to preserve process flow. When Zperm initializes, it allocates a 64 KB buffer filled with zeros and then fills it with obfuscated code and randomly positioned jmp statements [43]. This means that a constant body is never generated between generations and is never present in memory. Similar to Table 8, garbage code is inserted between frames to fool string detection, as in the Zmist virus. W95/Zmist also inserts jmp instructions after every instruction, making it a formidable shield against heuristic detection. In [39], 30% subroutine reordering was used to sidestep a similarity metric developed to compare benignware to malware based on the similarity of their transpositions. From a security analysis standpoint, it is extremely difficult to know where the virus begins when it is embedded within existing code and is encrypted. Partial emulation is one avenue whereby code can be reconstructed and then used to completely decrypt the virus. But deciding when and how to decrypt during emulation is still a laborious process in and of itself.

3. Encryption, Compression, and Metamorphism

Metamorphism, and obfuscation techniques more generally, make up the backbone of most new and emerging malicious threats we see today. As the signature-based scanning techniques of AV vendors improved, so did the levels of obfuscation employed by malicious actors to thwart said techniques [49, 59]. Along with obfuscation came various forms of armoring, stealth behavior, and antiemulation tactics, which made the job of a security researcher that much more burdensome.

To understand how mutation came to be, it is worth mentioning the earliest forms of obfuscation and how they came into existence. Viruses make use of entry point obscuration (EPO) in order to avoid any consistency in the execution order of the virus code in relation to the infected file. As shown in Figure 2, the file header would point to an address that would execute virus code, which would then point back to the host file so that the virus would execute without the user’s knowledge.

The CASCADE virus in 1986 became one of the first known viruses to implement encryption, thereby requiring a separate decryption routine to carry out decryption and push the instructions into memory for execution. Since the form of encryption would become apparent as the virus propagated, the decryptor routine itself would have to be mutated, leading to the establishment of the first series of oligomorphic viruses.
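The weakness described above, an encrypted body that changes while the decryptor stays the same, can be sketched with a single-byte XOR cipher. This is a generic illustration and not the scheme used by CASCADE; the stub bytes, the layout (stub, then key byte, then encrypted body), and the function names are all placeholders assumed for this example.

```python
# Placeholder bytes standing in for a fixed, unencrypted decryptor routine.
DECRYPTOR_STUB = b"<constant decryptor routine>"


def xor_crypt(body: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in body)


def make_generation(body: bytes, key: int) -> bytes:
    """Assumed layout: constant stub, then the key byte, then the encrypted body."""
    return DECRYPTOR_STUB + bytes([key]) + xor_crypt(body, key)


def recover(sample: bytes) -> bytes:
    """What the stub would do at run time: read the key and decrypt the body."""
    key = sample[len(DECRYPTOR_STUB)]
    return xor_crypt(sample[len(DECRYPTOR_STUB) + 1:], key)


body = b"harmless placeholder body"
generation_1 = make_generation(body, 0x21)
generation_2 = make_generation(body, 0x5A)
```

The two generations share no body bytes that a string scanner could key on, but both begin with the same stub, so the stub itself becomes the signature. This is the pressure that led to mutating the decryptor.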

3.1. Oligomorphism

Oligomorphism began as a reaction to the signature-based scanning techniques widely utilized for flagging possible virus infections. With the help of scanning techniques such as wildcard and mismatch, a greater swath of possible infections could be characterized by a few unique signatures. Furthermore, since virus code would either append or prepend itself to an existing file, top-and-tail scanning was an effective tool for extracting signatures from select sections of a file. Emulators could also be used to uncover the decryption routine, meaning that the decryption routine itself had to be altered in some form or another. Emulators wait as the virus is decrypted one instruction at a time and as it rebuilds itself by pushing the stack into memory. Once control is passed to the stack memory, the emulator monitors the stack, and the code can be dumped. Oligomorphic malware was the start of a new breed of malware that obfuscated the decryption routine itself, meaning that viruses differed from one generation to the next. The first oligomorphic virus was the Whale DOS virus, first identified in 1990. In Figure 3(a), an obfuscated, encrypted decryption routine is used to carry out decryption of the virus body and to avoid detection.

However, a major limitation to oligomorphism is that the loop of possible decryptors is finite. For example, the W95/Memorial virus had exactly 96 different decryptors to choose from. Once an oligomorphic generator is exhausted, the entirety of its possible generational variance is also exhausted and understood. The natural extension to this problem is to introduce obfuscation into the decryptor routine itself, leading to an infinite number of possible decryption routines [60]. This led to the first generation of polymorphic viruses such as 1260, and popularized generators such as Phalcon/Skism mass-produced code generator (PS-MPC) and virus creation lab (VCL), which are still used to this day.

3.2. Polymorphism

Polymorphic malware was seen as a complete package, equipped with a compiler that could decrypt, obfuscate, and then recompile everything back together. The unencrypted virus body would create a new mutated decryptor using a random encryption algorithm and then allow the decryptor to encrypt itself before linking both sections back together. However, the core problem of emulation remains: the virus code section is decrypted into memory, where it can be detected and flagged by security researchers. Prior generations of obfuscators also suffered from several limitations [61]:
(1) Constant size of virus code between generations (Polimer.512.A or Vienna viruses)
(2) Appending or prepending to the infected host file meant signature scanning could target these sections exclusively
(3) Similar virus code segments between generations mean the virus is subject to entropy analysis

To address some of these deficiencies, the metamorphic engine was introduced.

3.3. Metamorphism

Metamorphic viruses introduced, for the first time, the idea that no two generations of a virus need share a similar signature, as no constant body is present as there is with polymorphic malware [43]. In Figure 3(b), an example of a metamorphic virus is shown. Unlike polymorphism, the virus code itself is obfuscated, meaning that the entirety of the virus is present in an obfuscated state. This is the fundamental issue: since “metamorphics are body-polymorphics” [62], they have no constant body, and they reinforce the notion that anomaly-based detection is NP-complete [63, 64]. The first metamorphic viruses were W95/Regswap in 1998 [65], followed by W32/Ghost, identified in 2000 [66]. W32/Ghost contained 10 submodules, so subroutine reordering alone allowed 10! = 3,628,800 (over 3.6 million) possible variations.
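The variation count quoted for W32/Ghost follows from counting the orderings of its submodules. A minimal sketch of that arithmetic, assuming every submodule can be placed independently of the others (the function name is illustrative, not taken from the cited works):

```python
import math


def reordering_count(num_subroutines: int) -> int:
    """Number of distinct orderings of independently placeable subroutines."""
    if num_subroutines < 0:
        raise ValueError("number of subroutines must be non-negative")
    return math.factorial(num_subroutines)


# Ten submodules, as reported for W32/Ghost.
print(reordering_count(10))  # 3628800
```

The factorial growth is the point: each added subroutine multiplies the number of layouts a fixed string signature would have to cover.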

In light of the graphic shown in Figure 3(b), the separation between the decryptor and the virus body is no longer possible and the level of obfuscation means that encryption is no longer needed. Furthermore, as is typically the case, the decryption routine is scattered in the benign code. The executed code in the virus body mutates entirely along with the decryptor, and it does not need to unpack to create a new constant virus body like polymorphics [50]. One of the most utilized and effective metamorphic generators is W32/NGVCK created in 2001. Metamorphic viruses have a sophisticated mutation engine that contains many subprocesses. These will be discussed in the following section.

3.4. Metamorphic Engine

A metamorphic engine is responsible for the obfuscation and reconstruction of the binary so that the file can remain operational. In Figure 4, an illustration of a complete metamorphic engine is shown. Some of the key components of the metamorphic engine are described as follows [65, 67]:
Disassembler: Responsible for turning the virus's machine code into assembly instructions. This creates an intermediate form that is independent of the CPU architecture for future adoption with different OS and CPU architectures [43]. Within the disassembler, a code analyzer gathers information related to control flow, subroutines, variables, and registers and provides it to a code transformer module.
Shrinker: Eliminates much of the garbage and other nonsequential code produced by obfuscation in previous generations. This step also carries out code shrinking, a form of code substitution that turns previous 1 to 2 or 1 to 3 instruction substitutions back into their semantically equivalent primitive forms [68].
Permutor: Carries out much of the obfuscation using permutations of subroutines, often in a randomized fashion. Insertion of jmp instructions is also common to divert control flow.
Expander: Performs instruction substitution to convert instructions into another equivalent set of instructions. In addition, registers are reassigned and variables are reselected according to fixed probabilities using substitution tables [65, 69]. Garbage and other do-nothing code is added, and functions are inlined/outlined [70, 71]. Both the permutor and expander steps are quite sophisticated in the metamorphic W32/Etap and W32/Zmist viruses [60].
Assembler: Restructures the control flow and converts the assembly code back into machine binary code where it can become operational again.
Virus code: Contains the core instruction set that will execute on all new generations of the virus. It also contains the instructions that coordinate with the mutation engine and other components.

The mutation engine does not have to operate at the assembly and source code level but can also operate at an intermediate representation (IR) bytecode level [70]. In [72, 73], morphing techniques are seen as deterministic automata, whereby transitions following formal grammar are made to symbols and new mutations are produced. In [69], a template is used which illustrates how simple representations of formal grammar can produce several possible mutations. The depiction shown in Figure 4 includes all the core components with the exception of a decryption routine. A metamorphic engine with the addition of a decryption routine is shown in Figure 1 and follows a sequence of steps to decrypt, obfuscate, and link everything back together. The steps, in order, are as follows:
(1) First, the decryption routine decrypts the virus body and executes an instance of it.
(2) The decryption routine then decrypts the mutation engine and executes it.
(3) The shrinker component of the mutation engine goes to work to deobfuscate the virus body.
(4) Obfuscation takes place by introducing a new and unique decryption routine using the various techniques discussed in Section 2.
(5) The virus body is then obfuscated by the mutation engine to produce a unique generation using the various techniques discussed in Section 2. The virus body is then encrypted using a unique algorithm, a static key, or a host-specified temporary key. More is given on this in the following section.
(6) Finally, the mutation engine is encrypted.

Once all three components are reobfuscated into seemingly new binaries, with the mutation engine and virus body encrypted, the virus relinks its components and can execute on a new host by decrypting its payload through its newly obfuscated decryption routine.

The authors in [57] provide a detailed summary of the production of, and considerations for creating, a metamorphic generator, and [74] does the same for a metamorphic worm. One of the more sophisticated metamorphic viruses is W32/Simile, also known as MetaPHOR or Etap. The author, “Mental Driller,” referred to the expansion, contraction, and permutation of instructions as the “Accordion Model” [61, 67] based on the changing form that garbage code takes when it becomes obfuscated. The Simile virus was also unique in that 90% of the virus code was dedicated to the metamorphic engine itself, with the decryptor being placed at the end of the code section and the virus body being partitioned elsewhere [43, 52].

3.5. Encryption

While encryption was briefly touched upon at the beginning of Section 3, obfuscation engines make use of a variety of encryption techniques to avoid detection [49]. The earliest form of encryption was carried out by the CASCADE virus on DOS [40] and did so using a simple xor (see Figure 5).

The CASCADE virus, first identified in the late 1980s, was shown to increase the size of infected files by 1701 or 1704 bytes and mainly comprised its encryption loop and main body. The virus uses a technique called “cascading” to conceal its presence. When the infected files are executed, the virus code is executed first, causing the virus to infect more files and directories. This creates a cascading effect, making it difficult for antivirus programs to detect and remove the virus [75]. The decryption routine in Figure 5 is fairly simple: the stack pointer, sp, acts as the key and the si register keeps track of the current position in the virus body. At each step of the decryption loop, si is incremented by one and sp is decremented by one, and the loop jumps back using jnz until sp reaches 0. To illustrate the principle, applying a simple xor operation to each byte using an 8-bit value as the encryption key will produce the encrypted text. The bytes 2D03 002E, when xor’d with the key 0xFF, produce D2FC FFD1. Applying the same operation again with the same key recovers the original text, so encryption and decryption are performed with only one key.
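The single-key xor example in the text can be checked directly. The following is a minimal sketch, assuming a fixed 8-bit key applied to every byte; the helper name xor_bytes is illustrative and does not reproduce the routine in Figure 5, which varies its key through sp.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with the same 8-bit key."""
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in 8 bits")
    return bytes(b ^ key for b in data)


plain = bytes.fromhex("2D03002E")
cipher = xor_bytes(plain, 0xFF)
print(cipher.hex().upper())  # D2FCFFD1

# XOR is its own inverse, so the same call with the same key decrypts.
restored = xor_bytes(cipher, 0xFF)
print(restored.hex().upper())  # 2D03002E
```

Because a constant key maps identical plaintext bytes to identical ciphertext bytes, this scheme leaves byte-frequency patterns intact, which is one reason later viruses moved to varying keys.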

Conventional decryption relies on the virus’ own decryptor loop to decrypt the virus body. It did not take long for malicious actors to rely on multiple decryptors instead of one, such as the DOS/Whale virus in 1990, which utilized dozens of different decryptors and chose one randomly for each infection. It may also be the case that, rather than being performed serially, decryption is performed in a seemingly random order, as is the case for W32/MetaPHOR, with each instruction being decrypted only once. In malware deployments, a crypter is typically used, which carries out encryption for antianalysis and obfuscation purposes. A crypter contains a stub that carries out the decryption while generating a new payload and key with each new generation [48, 76]. All of this occurs in memory, and nothing is written to disk. Decryption can take place on the stack, but then the key is not writable; the alternative, allocating memory, is easily flagged by emulators that monitor memory. On Intel x86 platforms, 24 bytes or more of modified memory is characteristic of a decryption routine [28]. Once the stub passes control to the virus body after decryption, a new encryption key is created and all executables and .text sections are encrypted with the new key. Depending on the file type, a TEA cipher can be used for EXE files and RC4 for DLLs, as is the case for HackedTeam’s core-packer [77]. The key is then stored in the decryptor stub or elsewhere.

Basic encryption can be performed as mentioned previously with a single decryptor key (see Figure 6), using 1 to 1 byte to byte mapping, with zero operand using inc or neg, or reversible instructions such as add or xor. Alternatively, sliding key encryption makes use of the starting key which updates as it progresses and may even utilize the characters most recently encrypted (see Figure 6(b)) or based on an algorithm, as shown in Figure 6(c). Flow encryption determines a key stream in advance equal to the size of the encrypted text and then encrypts the body instruction by instruction. Key generation can also be varied amongst decryptor routines, where a key(s) can be located in the decryptor stub itself, hidden among the virus body, generated uniquely from the host system, or alternatively, randomly generated and not stored at all.
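The difference between a fixed key and a sliding key can be illustrated on harmless data. The sketch below assumes one simple update rule, adding a constant step to the key after every byte; the rule, the function name, and the parameters are assumptions chosen for illustration and do not reproduce the exact schemes in Figure 6.

```python
def sliding_xor(data: bytes, start_key: int, step: int = 1) -> bytes:
    """XOR each byte with a key that advances by `step` (mod 256) per byte.

    The key schedule depends only on the position, not on the data, so
    calling the function twice with the same arguments restores the input.
    """
    out = bytearray()
    key = start_key & 0xFF
    for b in data:
        out.append(b ^ key)
        key = (key + step) & 0xFF
    return bytes(out)


# Identical plaintext bytes no longer map to identical ciphertext bytes.
print(sliding_xor(b"\x00\x00\x00", 0x10).hex())  # 101112

message = b"test data"
assert sliding_xor(sliding_xor(message, 0x42, 3), 0x42, 3) == message
```

A variant that feeds the most recently encrypted byte back into the key would need separate encode and decode routines, since its key schedule depends on the data.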

The sources for the encryption key can vary but can either be hardcoded in one form or another or obtained through the host. In the case of variable key generation, the decryptor can develop the encryption key based on its own function calls. Alternatively, environmental key generation does not involve any descriptors from the viral payload or stub itself, but rather, retrieves them from the infected host. One example of environmental key generation is the use of a trusted platform module (TPM) chip, which is a hardware component built into many modern computers and devices [78]. The TPM can generate unique encryption keys that are tied to specific physical attributes of the device, such as the device’s BIOS, firmware, or other hardware components. This makes it much more difficult for an attacker to access the key and decrypt the protected data even if they are able to physically access the device. In the case of the RDA.Fighter virus family, the virus checks the BIOS address at FFFF : 000E0, and if it returns advanced technology (AT), as in AT-class computer, the time stamp is retrieved from the CMOS buffer; otherwise, it is retrieved from the system clock. The timestamp is then used to create a 16-bit number that is used to decrypt the next code section using a mirror table lookup as a mask. In addition to time, the current date, timer tick, host filename, and even the hard disk serial number can act as sources for developing the encryption key. As a form of armoring, the key can be stored on a distant web server, and outside of a typical host environment, such as in virtualization or emulation, the virus can disable itself and fail to run.

Decryption routines are not limited to a single loop or to a single key. For example, the RDA.Fighter virus family utilizes 16 layers of decryption and does so in a backward fashion, making it laborious to automate the disassembly process [28]. Multiple layers of encryption are also utilized by the W32/Harrier and Bradley viruses [79]. To avoid all forms of local or external storage of the key, a random decryption algorithm (RDA) can be used to brute force the key. The key can be any generated word value, and the decoding method will verify a checksum after each decoding attempt to identify when it has successfully found the key. In the RDA.Fighter family, RDA is used as a secondary form of encryption on top of environmental key generation.
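The key search that RDA relies on is the same search an analyst can run to recover an unknown key. The sketch below is a toy model under stated assumptions: a 16-bit key xor'd over the data in little-endian byte order, and CRC-32 (zlib.crc32) standing in for the checksum. Neither choice is taken from RDA.Fighter, whose actual algorithm is not described here.

```python
import zlib
from typing import Optional


def xor_word_key(data: bytes, key: int) -> bytes:
    """XOR data with a 16-bit key, alternating its low and high bytes."""
    key_bytes = key.to_bytes(2, "little")
    return bytes(b ^ key_bytes[i % 2] for i, b in enumerate(data))


def brute_force_key(encoded: bytes, expected_crc: int) -> Optional[int]:
    """Try every 16-bit key until the decoded data matches the checksum."""
    for key in range(0x10000):
        if zlib.crc32(xor_word_key(encoded, key)) == expected_crc:
            return key
    return None


plain = b"benign test data"
stored_crc = zlib.crc32(plain)           # the only value kept alongside the data
encoded = xor_word_key(plain, 0xBEEF)    # the key itself is then discarded

recovered = brute_force_key(encoded, stored_crc)
print(hex(recovered))  # 0xbeef
```

With only 65,536 candidates the search finishes quickly, which shows why a word-sized key burdens emulators that budget their instruction count more than it burdens an analyst who can run the loop offline.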

3.6. Compression

Compression represents an additional level of obfuscation on top of a possible decryption routine and other forms of obfuscation. A packer is defined as a utility which applies some form of compression to the executable, either to reduce file size to avoid entropy analysis or to introduce a layer of obfuscation to the PE header. It has been estimated that 80% of all malware uses some form of packer [80], as do 90% of all worms [43]. Two of the most popular packers are Ultimate Packer for eXecutables (UPX (https://upx.github.io/)) and ASPACK (https://www.aspack.com/). In addition to significant compression ratios and great performance, these packers work for a variety of executable formats with no memory overhead due to in-place decompression.

Packers are ultimately tasked with producing executables that consist of uncompressed unpacking code and a compressed payload. Packers compress the code to avoid reverse engineering and bypass firewalls. Malware makes use of packers by initially converting an Image Section (see Figure 7(a)) into a Packed Section and Unpacking Section (see Figure 7(b)). The Unpacking Section is then set to be the initial point of entry once the file is executed. Upon execution, the Packed Section is decompressed to become the Unpacked Section (Figure 7(c)) and is executed in virtual memory [81]. One of the more devious consequences of packers for malware analysis is that the original PE header is hidden, as the visible import functions are those utilized by the packer itself. Since packers such as UPX, ASProtect, PECompact, and Themida are widely used for nonnefarious purposes as well, there is no sure indication that the file is malicious based on the import functions [82–84].

One of the more comprehensive techniques for detecting malicious packers is entropy analysis [1]. In the work of [85], 28 different packers were used to classify a control flow graph as an image representation through the use of a convolutional neural network (CNN). The work of [86] used CNNs for a similar purpose, but to categorize 9300 malware variants into 25 malware families based simply on the malware binary. These techniques have the advantage of allowing the neural network to learn which PE sections are important in identifying maliciousness; in doing so, it uses an advanced form of entropy analysis which can identify malware family usage of packers, encryption, and garbage code obfuscation [86]. When compression is coupled with encryption, as is the case with so-called Protectors, the resulting binary has high entropy levels, making it susceptible to classification. In [58], a file segmentation method that utilized entropy with wavelet analysis was used to classify metamorphic malware based on edit distance between file segments. This motivation was derived from the earlier work of [87], which established that the homogeneity of each malware’s binary section is characteristic of the complexity of its data order. Along with this insight, polymorphic malware can be identified using these techniques, albeit with a high rate of false positives [87].
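The entropy measure underlying these techniques is typically the Shannon entropy of the byte distribution, reported in bits per byte. The following is a minimal sketch of that calculation over a whole buffer; the cited works apply more elaborate per-section and wavelet-based variants that are not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A constant buffer carries no information; a buffer in which all 256
# byte values are equally likely reaches the 8-bit maximum.
low = shannon_entropy(b"\x00" * 1024)
high = shannon_entropy(bytes(range(256)) * 4)
print(low, high)
```

Packed or encrypted sections tend toward the upper end of this range, while plain code and text sit noticeably lower, which is what makes a per-section entropy profile a useful first-pass indicator.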

In Figure 8, a historic summary is provided, which is complete with major milestones in obfuscation and new malware deployments.

4. Metamorphic Datasets, Generation Kits, and Armoring

While metamorphic malware has grown in sophistication, so have the tools available to researchers to thwart it. One such resource is publicly available datasets, such as DARPA99, a popular dataset released to improve intrusion detection systems. Datasets encourage the development of classification tools by leaving the details of collecting representative samples in a controlled environment and at scale to others. Datasets also provide a baseline against which to compare competing algorithms, usually with the aim of increasing true positive rates and decreasing false positives. One of the downsides is that these datasets are typically outdated and are not representative of new and emerging threats. If researchers make raw malicious binaries available, as is the case with the SOREL dataset [88], they cannot do the same for benign binaries due to issues with copyright. One workaround used in SOREL is to dump the entire metadata of the binary and use that metadata dump to create features for a model to learn from. This section will touch on some of the more useful malware datasets used historically and then transition into covering some aspects of malware generation kits and armoring behavior.

4.1. Malware Datasets

The DARPA dataset was created in 1998 and contains 7 weeks of raw TCP/IP dumps of a simulated attack scenario on an Air Force base. The dataset contains both host and network files. The KDD99 dataset was created based on the DARPA dataset [89], with a reduced size and a total of 24 attack types and an additional 14 existing solely in the test dataset [90, 91]. Based on the observations of [91], KDD99 was the most widely used dataset in IDS research between the years 2010 and 2015. Several issues arose with the use of KDD99, namely, the time-to-live (TTL) values for benign and malicious packets were different [92, 93], and the data rates were not characteristic of real-world networks [94]. Many of these issues were highlighted in the critique carried out by [93], prompting much-needed modifications to the existing dataset. In addition, since the KDD99 dataset was large for many trainable models and contained duplicates of attacks such as DoS, it was further reduced to become its most recent version, NSL-KDD [93]. Another dataset containing network traffic is UNSW-NB15. The dataset was created by the IXIA PerfectStorm tool at the Cyber Range Lab at the Australian Center for Cyber Security [95]. A TCP dump tool was used to capture 100 GB of raw traffic, with a total of 49 features generated using a set of tools and algorithms. Other lesser known network datasets include CAIDA [96] and ISCX 2012 [97] for network intrusion detection and CICIDS2017 [98]. The CICIDS2017 dataset is unique in that the authors included behavior for Windows (XP, 7, 8, and 10), macOS, iOS, and Linux operating systems, encompassing attacks from Botnets, DoS, DDoS, Brute Force FTP, Brute Force SSH, Heartbleed, Web Attack, and Infiltration [98]. For a thorough summary of network-based datasets, we refer the reader to the review carried out by [99].

Several datasets have been used to represent the content of the malware binary rather than relying on network activity. One of the more utilized datasets is the Microsoft Malware Classification Challenge dataset, which was popularized in a Kaggle competition in 2015. The raw data of a virus’ binary are represented in hexadecimal, with a compilation of metadata retrieved using the IDA disassembler tool. Representations of malware binaries as images have also become popular as datasets for image analysis, with the Malimg dataset [100] having the greatest impact in recent years [101–109]. Other alternatives include the Malicia dataset [110], which contains 11,668 malicious binaries from 54 families retrieved from 500 drive-by downloads over 11 months. However, the project was ultimately discontinued in 2016. The Malsign dataset [111] contains 142,000 signed malware and potentially unwanted program (PUP) binaries obtained from 2012 to 2015 for the Windows platform [112].

Mobile and internet-of-things (IoT) security plays a unique but important role in malware security, as these devices make up a larger proportion than ever of how we connect with others and exchange information. The Drebin dataset [113, 114] is one of the most used datasets in mobile security, with 5500+ malware samples belonging to 20 families, collected from 2010 to 2012. The Android adware and general malware dataset (AAGM) [115, 116] includes network activity of 1900 adware, general malware, and benignware samples running on Android smartphones. IoTID20 [117] is a more recent dataset of simulated network attacks captured from two smart home devices. The dataset consists of 42 pcap files encompassing simulated attacks produced from Nmap and from the Mirai botnet [118, 119].

Several datasets include features extracted directly from PE files, including the ClaMP and EMBER datasets. ClaMP [120] includes features from the DOS header, file header, and optional header of PE files. The integrated dataset includes 68 features: 28 features are from the raw dataset, 26 features are Boolean (file and optional header), and 14 are derived features. A second version of the dataset exists which consists of 56 features. The EMBER dataset [121] is considerably larger, with a total of 1.1 million binary files. The authors in [122] include additional tools to extract features from the PE files to further encourage the use of the dataset for training benchmark models. The EMBER dataset was the largest of such datasets until the introduction of the SOREL dataset in 2020, which expanded from 1.1 million binaries to 20 million binaries, including 10 million disarmed malware samples ready for feature extraction [88]. The Australian Defense Force Academy (ADFA) is the author of two datasets: the Linux dataset (LD) [123, 124] and Windows dataset (WD) [125]. Both datasets provide a comprehensive simulation of a HIDS based on the collection of system calls; however, a significant downside exists for the ADFA-WD as it was collected solely on Windows XP, which limits the applicability to future generations of Windows OS [125].
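Header-derived features of the kind used in these datasets can be read with a few fixed-offset lookups. The sketch below assumes the standard PE layout (the MZ magic at offset 0, e_lfanew at offset 0x3C, and the 20-byte COFF file header after the PE signature) and returns only a handful of fields; it is not the ClaMP or EMBER extraction code, and the helper build_minimal_header exists only to provide harmless synthetic input.

```python
import struct


def extract_header_features(pe_bytes: bytes) -> dict:
    """Read a few DOS-header and file-header fields from a PE image."""
    if len(pe_bytes) < 0x40 or pe_bytes[:2] != b"MZ":
        raise ValueError("missing DOS header (MZ magic)")
    (e_lfanew,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
    # Characteristics (little-endian, 20 bytes in total).
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", pe_bytes, e_lfanew + 4)
    return {
        "e_lfanew": e_lfanew,
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
    }


def build_minimal_header() -> bytes:
    """Synthetic header-only buffer for demonstration (not a runnable file)."""
    buf = bytearray(0x40 + 4 + 20)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\x00\x00"
    struct.pack_into("<HHIIIHH", buf, 0x44, 0x014C, 3, 0, 0, 0, 0xE0, 0x0102)
    return bytes(buf)


features = extract_header_features(build_minimal_header())
print(features["Machine"], features["NumberOfSections"])  # 332 3
```

Boolean and derived features, such as whether the timestamp is zero or whether the section count is unusually high, can then be computed from this dictionary.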

Insider threats are considered an emerging source of security vulnerabilities for governments and firms. CERT identified that 15–24% of firms experience an insider incident perpetrated by a business partner [126]. It has also been noted that a quarter of cyber security risks are due to insider threats, meaning that current or close business partners are considered as much of a threat as ransomware from a security standpoint [17]. That is why a dataset such as the CERT insider threat V.2 dataset is so important to our understanding and tracing of threats that exist in network topologies [127]. The dataset includes several synthetic threat scenarios, accompanied by information related to HTTP records, employee info, and log on/off times, among other indicators. A summary of the datasets discussed along with some information on their makeup is shown in Table 9.

Virus repositories are also a source of millions of malicious binaries and source code for malware research. theZoo (https://github.com/ytisf/theZoo) from [263] contains hundreds of malicious binaries that are updated on a regular basis as new threats emerge and as virus source code becomes available [264]. VirusTotal (https://www.virustotal.com/gui/home/) contains one of the most comprehensive repositories used in the industry today. Malicious binaries can be uploaded or searched via MD5 hash to provide a detailed summary of the threat and other metadata. VirusTotal also comes equipped with a public and private API that allows threats to be uploaded while returning a detailed report, along with which AV vendors have already developed a signature for the given binary. VirusShare (https://virusshare.com/) is a searchable sample database, boasting 34 million+ malware samples for use by analysts, researchers, and the security community [265]. Other less popular repositories for sharing malware for research purposes include Malshare, VirusBay, and Das Malwerk.
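Searching a repository by hash starts with computing a digest of the sample's bytes. The following is a minimal sketch using Python's hashlib; the function name is illustrative, and no repository API is called, since the request formats of the services above are outside the scope of this survey.

```python
import hashlib


def sample_digests(data: bytes) -> dict:
    """Hex digests commonly used as lookup keys for a binary sample."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# The MD5 of empty input is a well-known constant, useful as a sanity check.
digests = sample_digests(b"")
print(digests["md5"])  # d41d8cd98f00b204e9800998ecf8427e
```

Because any single-byte change yields a different digest, hash lookup only identifies exact copies, which is precisely the limitation that obfuscated and metamorphic variants exploit.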

4.2. Metamorphic Generation Kits

Virus generation kits facilitate the creation of the bulk of the newly generated virus signatures we see every day. These kits perform some, if not all, types of obfuscation outlined in Section 2 to evade signature-based techniques and are a significant problem for AV vendors and researchers alike. In addition, some kits even provide functionality whereby users can customize the level of obfuscation and encryption to introduce variation into the malware generation and are even able to enact antiemulation and armoring behavior. Some generation kits have been easily flagged by AV vendors since their generated code would contain similar code between generations; therefore, only a few signatures could flag the entire generation, rendering the generation kit obsolete. Depending on the generation kit, COM and EXE viruses can be produced directly, while other kits generate the virus assembly code. For example, Borland Turbo Assembler (TASM) 5.0 can assemble an ASM file into an object file, and then TLINK takes the object files and libraries and links them together to produce virus executables. As demonstrated in Figure 9, disassemblers such as IDA Pro can be used to produce the ASM files [266]. The ASM files can then be used to extract opcodes and other feature sets for use in malware classification [267]. This section will discuss several popular generation kits used in research, with a brief description of some of the obfuscation techniques used by each generator.

The Phalcon/Skism mass-produced code generator (PS-MPC) was developed in 1992 and includes over 25 options for different types of encryption and payload types, as well as options to be memory resident. The generator employs its own decryption routine but lacks options for stealth techniques. PS-MPC generates files that reside in memory long enough to infect all COM and EXE files. The advantage of PS-MPC at the time of creation was the ability to carry out code generation in batches, as the generator is script-driven and operates as a code-morphing engine [43]. While all PS-MPC-generated code is readily flagged by AV vendors today, the generator is still used for research on metamorphic malware [31, 51, 268–271]. The mass-produced code generation kit (MPCGEN) was first developed in 1993 and was used to create CFG files which were then passed to PS-MPC followed by TASM to produce 32-bit executables. The name “mass-produced” comes from the fact that the process of generating, compiling, and assembling can be carried out for 500 files in as little as 25 minutes. Similarly, MPCGEN is used to produce a high quality and quantity of metamorphic variants for research purposes [51, 56, 271–275].

The second-generation virus generator (G2) was developed in 1993 and produces COM and 16-bit EXE infectors. It employs several code substitution techniques and, as an extension to PS-MPC, introduces antidebugging and antiemulation features, as well as resident and nonresident viruses. G2 has easily modifiable source code to allow customization by an advanced programmer, and the routines it uses are semipolymorphic. To this day, G2 is a go-to for generating polymorphic variants [31, 50, 51, 56, 58, 59, 67, 73, 268–278].

Virus creation lab for Windows 32 (VCL32) was created in 1992 but was revamped in 2003. Created by a virus writer named Nowhere Man, a member of a group called NuKE, this generator can produce the assembly source code of viruses. This means the assembly code needs to be assembled and linked afterwards before the viruses are active. The versatility of VCL32 comes from being able to customize activation conditions based on date, time of day, number of infected files, computer country code, version of DOS, or the amount of RAM available. VCL32 supports COM file infections, generating companion viruses, as well as various encryption and infection strategies. As a complete package with a GUI and drop-down menus, the most recent version of VCL32, released in 2004, is commonly used in research [31, 50, 51, 56, 268, 272, 274, 279, 280].

The next generation virus construction kit (NGVCK) is one of the more popular virus construction kits available. Developed in 2001 with the most recent version released in 2003, NGVCK has been widely adopted for use in developing 32-bit PE-EXE polymorphic malware, especially in a research environment [31, 39, 47, 50, 51, 56, 58, 67, 268–275, 277–288]. Options for encryption include rotate without carry (ROR/ROL), two's complement negation (NEG), one's complement negation (NOT), logical exclusive or (XOR), and addition/subtraction (ADD/SUB). NGVCK can carry out dead code insertion, subroutine reordering, code substitution, and register renaming, all of which are very effective techniques for obfuscation. In [51], NGVCK was compared to other popular generation kits, including G2, MPCGEN, and VCL32, and was noted to produce the highest rates of obfuscation. A similarity metric was used to compare assembly programs: no similarity was found with G2 and MPCGEN, up to 2.4% was found with VCL32, and normal files had similarities between 0.98% and 1.2%. In [271], only a 10% similarity was found between NGVCK variants when run over multiple iterations, meaning that the kit produces a large amount of variability between uses. An example of two virus variations produced by the NGVCK generation kit is shown in Table 10. Obfuscation produces two semantically similar variants using garbage code insertion, instruction substitution, and subroutine reordering as techniques.
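The similarity metric used in [51] is not reproduced in this survey. As a stand-in, the sketch below scores two opcode sequences with Python's difflib.SequenceMatcher, a generic matching-subsequence ratio that is swapped in here for illustration and is not the metric from the cited work. The opcode lists are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def opcode_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0.0, 1.0] between two opcode sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


original = ["push", "mov", "add", "pop", "ret"]
# Hypothetical variant with do-nothing instructions inserted.
variant = ["push", "nop", "mov", "add", "nop", "pop", "ret"]
unrelated = ["xor", "jmp", "call"]

print(opcode_similarity(original, original))   # 1.0
print(opcode_similarity(original, unrelated))  # 0.0
print(opcode_similarity(original, variant))
```

Garbage insertion alone lowers the score only moderately under this measure; subroutine reordering and instruction substitution are what push generator output toward the low similarity figures reported above.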

A more recent polymorphic engine was introduced in [69] as the virus and metamorphic worm (MWOR) generation kit. The effectiveness of the generation kit was demonstrated in [270], where it was able to fool common statistical analysis. The kit has also found more recent interest in research because it allows control over the proportion of garbage code and subroutine reordering [270, 271, 273, 282, 283, 286]. This is extremely effective because inserting a certain amount of garbage code from benign files has demonstrated an improved ability to thwart AV scanners [39]. This chapter does not provide an exhaustive list of generation kits; on the contrary, these kits represent a small subset of the kits that are widely distributed. Websites such as VxHeavens were one such source until the website was taken down in March 2012 by Ukrainian police. Repositories containing over 200 generation kits once hosted on VxHeavens can be found circulating online to this day. Also included in these kits, as discussed, are antiarmoring and antiemulation capabilities. Some of these will be discussed in the next section.

4.3. Anti-Emulation, Stealth, and Code Protection

Antiemulation is an all-encompassing term that includes all the various armoring, stealth, and/or code protection techniques that are used to thwart or burden the process of reverse engineering a malware sample. According to Symantec, approximately 28% of malware is VM-aware [12]. One of the shortcomings of virtual machines and other honeypot deployments is that the environment they are deployed in is static, with several configurations set to default. It is for this reason that antiemulation malware can check the environment for indicators of virtualization and either fail to execute or burden the reverse engineering analysis with cumbersome instructions. This section will cover some of the actions taken by antiemulation malware to exploit their virtual environment and prevent security experts from understanding the full breadth of their behavior. Antiemulation checks fall into four categories: human interaction, configuration-specific, environment-specific, and VMware-specific checks [289, 290].

4.3.1. Human Interaction

These checks determine whether actions routinely carried out by a user are being performed. This includes mouse movements, use of the clipboard, and opening and closing windows. The Cuckoo Sandbox, for example, has a setting which provides this sort of functionality for each malware submission. Trojan Upclicker is a virus variant that monitors user input in the form of a left click in order to identify sandbox environments. It does this by using the SetWindowsHookEx() and GetLastInputInfo() APIs to determine the rate of user input over time. This identifies the presence of sandbox environments because automated analysis does not require the use of an auxiliary keyboard and mouse [291].

4.3.2. Configuration-Specific

These checks use time periods or other configuration to execute at a later time and date, and only if certain conditions are met. The Duqu virus, which was first identified in 2011, included a series of antistealth techniques in the form of delays as a precautionary measure [292]. Code injection only occurs after approximately 10–15 minutes, and the lifespan of Duqu is set by an unknown communication module that removes its hooks, deletes its kernel driver, and removes its registry key once the timer has elapsed [292, 293]. The Kelihos botnet and Nap Trojan both make use of SleepEx() and NtDelayExecution() for extended sleep calls, with the Kelihos botnet having affected 41,000 users before being identified and taken down. Hastati has a hardcoded check so that it executes only at 2 pm on March 20, 2013. It does not execute if GetLocalTime() returns a time earlier than that, which it takes as indicating the presence of a virtualized environment [294].

4.3.3. Environment-Specific

These checks look at the settings and parameters of the host operating system and hardware and decide whether to execute based on those findings [295]. Virtual machines incorporate virtual hardware which tends to have consistent configurations between VM deployments. Hardware such as network adapters, USB controllers, and audio adapters is all virtualized, meaning that MAC addresses, USB controller types, and SCSI device types are all telling signs of virtualization. The Scoopy Doo tool developed by Tobias Klein uses Windows Script Host to read registry keys located in HKEY_LOCAL_MACHINE\HARDWARE\DEVICEMAP\Scsi\ and HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class associated with SCSI and can also look up keys that are associated with IO and ports for strings containing “VMware.” In another application, malware can utilize the internal processor tick counter via the Read Time-Stamp Counter (RDTSC) instruction. Based on a random bit value that is returned, the decryptor contained within the malware will decode and execute the virus body; otherwise, it will bypass and exit.
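An analyst hardening a sandbox can audit the same indicators before submitting samples. The sketch below, in Python, checks a MAC address against vendor prefixes (OUIs) and scans registry value strings for the “VMware” marker. The OUI list is an assumption based on prefixes publicly registered to VMware and is not exhaustive, and all function names are hypothetical.

```python
import string

# Assumed, non-exhaustive list of OUI prefixes registered to VMware.
VMWARE_OUIS = {"00:05:69", "00:0C:29", "00:1C:14", "00:50:56"}


def normalize_mac(mac: str) -> str:
    """Return a MAC address as upper-case, colon-separated hex pairs."""
    digits = "".join(ch for ch in mac if ch in string.hexdigits).upper()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def has_vmware_oui(mac: str) -> bool:
    """True if the first three octets match a listed VMware prefix."""
    return normalize_mac(mac)[:8] in VMWARE_OUIS


def mentions_vmware(registry_values) -> bool:
    """True if any registry value string contains 'VMware' (case-insensitive)."""
    return any("vmware" in value.lower() for value in registry_values)
```

A guest whose adapter or SCSI identifiers trip these checks is likely to be recognized by environment-aware samples, so such defaults are worth changing before analysis.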

4.3.4. VMware-Specific

These checks allow malware to look for specific indicators of virtualization based on the VMware software used by the host. One of the best examples is VMware Workstation’s WinXP Guest virtual hardware, which includes a running VMtools service and 300 references to VMtools in the registry. Another interesting example of VMware-specific behavior is Pushdo. Pushdo uses PspCreateProcessNotify() to deregister sandbox routines [5, 290]. It also checks the physical hard drive serial number to see whether it is set to a default value of 00, which is typical of virtual machines. In the work of [296], the authors looked at antiemulator behavior in Android malware and noted that volume identifiers, network interfaces, and invoking the GPU were all techniques used to obfuscate Dalvik virtual machines. Other evasion techniques, such as exception process timing, IMEI checking, and checking the variability in sensors, have all been traced to emulation evasion in Android malware [297–302].

Alongside the specific checks mentioned above, general antidebugging makes it difficult for researchers to extract signatures or strings to develop systems to protect against them. An example is the Bistro virus, which inserts garbage code and dummy loops before the decryptor stub. As a result, millions of instructions are executed before the malware has even unpacked, which burdens the emulator, and Bistro fails to run. During analysis, many malware variants are memory-resident, thereby requiring careful monitoring of the viral payload as it loads itself into memory before it can be dumped and analyzed [61]. In the past, malware authors have been one step ahead in their efforts to thwart the monitoring of memory dumps or memory snapshotting. An example is the Zmorph virus, whose decryptor rebuilds its instructions line by line by pushing the result into stack memory. One of the earlier adopters of this sort of technique was DOS/DarkParanoid, which contained 10 different encryption functions that it used to encrypt its previously run instructions while only allowing its current instruction to be decrypted at any point in time. Without a conventional decryption loop, it is a true polymorphic memory-resident variant. Other so-called “stealth viruses” employed reconnaissance of the OS by waiting until AV products check-summed programs to check for changes. When a file was read, as opposed to executed as is the case with user input, the virus took that as an indication of check-summing by the AV and removed itself from the target executable. Finally, it waited until the file was closed and then reinfected the file [303]. Using this process, it can follow the AV and infect every file on disk. A thorough summary of antidisassembly, antidebugging, and antiemulation techniques can be found in [43]. For a summary of Android application hardening used by malware authors and developers, we refer the readers to the work of [304].

5. Approaches to Feature Analysis

Malware features are typically categorized into two types: static and dynamic. Static features incorporate all the unique compositional information of the executable, irrespective of the contextual information of the target system [305–308]. That is to say, the static features of an executable would be the same regardless of what machine the malware is deployed on. Static features typically include the portable executable (PE) structure, assembly code instructions [5], list of DLLs, n-grams, and byte sequences. PE structure features would include information related to PE sections, resources, application programming interface (API) calls, as well as which dynamic link libraries (DLL) are imported/exported. Most modern antivirus (AV) products employ a signature database which contains known signatures of the static features of malware. Alternatively, dynamic features include API and DLL call graphs, information gathered from the file system and registry, as well as process and thread activity and the consumption of kernel resources. Dynamic analysis can also include temporal snapshots of process execution, memory, network, and system call logs [309]. Dynamic analysis is OS-specific because, depending on the system resources, account privileges, and other environmental variables, the malware will behave differently and have a different signature as a result.
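As a concrete instance of a static feature, byte n-grams can be counted directly from the file contents without executing anything. The following Python sketch illustrates this under simple assumptions; the function name and the sample bytes are hypothetical, and real pipelines usually restrict the n-gram vocabulary before classification.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte window in a byte string."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Placeholder bytes: three NOP opcodes followed by a RET opcode on x86.
SAMPLE = b"\x90\x90\x90\xc3"
BIGRAMS = byte_ngrams(SAMPLE, 2)
```

Because the counts depend only on the bytes of the file, the same feature vector is obtained on any machine, which is the defining property of static features described above.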

The ability for malware to mutate has also presented a problem for researchers, as it renders many of the legacy static approaches to malware research obsolete. As a result, dynamic analysis has been presented as the de facto standard in classification approaches, as it is impervious to routine obfuscation and packing carried out by mutating malware. Nowadays, dynamic analysis represents some 51% of the analysis methods in the body of literature examined [306], with a unique combination of feature sets and model architectures being used to perform classification. It has been noted that malware classification is not a trivial problem, with some presenting the identification of a bounded-length mutating virus, or a polymorphic variant of one, as an NP-complete problem [63, 310]. Characterizing malware is the fundamental issue of concern, and researchers and practitioners are constantly refining their methods to stay ahead of the curve. Figure 10 provides an illustration of the feature pipeline used for most malware classification approaches. Both static and dynamic features form the bedrock in the characterization of malicious behavior. Any number of these features can be combined to form a hybridized model for feature analysis, which is unofficially the third form of characterization.

Many of these methods are covered in the comprehensive reviews of [308, 309], but this work will simply provide a narrow overview of malware detection approaches as they concern API calls. While API calls are just one of the many forms of static and dynamic behavior, they are one of the most consequential and information-rich sources of discrimination. But first, an introduction to the source of APIs, files known as dynamically linked libraries, is required and will be the topic of the next section.

5.1. Dynamically Linked Libraries

Dynamically linked libraries, or DLLs, are libraries of code that are written by vendors such as Microsoft as well as third parties to coordinate and manage resources on the Windows OS. DLLs are fundamentally libraries of code that contain one or more functions, indicated in their Export Address Table (EAT), which identifies the functions that are available for export to other processes. DLLs are structurally equivalent to executables, with the exception being that their main function is called DllMain, and they cannot be executed without the use of the helper programs RUNDLL.exe or RUNDLL32.exe, for 16-bit and 32-bit DLLs, respectively. DLLs are useful because they allow multiple processes to share the same library of code loaded into memory, thereby reducing the time required to recompile each process and the memory overhead that would result if the same code segments had to be loaded in memory multiple times. Because each process does not need to include static code of its functions, file sizes stay smaller overall when a process can connect to an already running copy of the library of functions. It also has the advantage of allowing the OS vendor to update a catalogue of core DLL libraries which can work with subsequent versions of the OS.

When an EXE requests that a DLL be loaded, Windows checks some default directories first. There is a registry key, KnownDLLs, that tells Windows that the well-known DLLs should be found in the System32 path; otherwise, it searches in the .exe directory, the current working directory, the %SystemRoot% directory, the 16-bit system path, and then the directories in the environment PATH. DLL search-order hijacking is the process by which malicious actors inject their own malicious DLLs somewhere in this load order so that their payload is loaded instead of a legitimate DLL. For example, ntshrui.dll is loaded by explorer.exe, but it is not a known DLL and therefore can be susceptible to load-order hijacking. DLLs that are fully protected can recursively load other DLLs that are not protected, which forces the next executable to follow the default search order and be prone to hijacking. The tool Dependency Walker (https://www.dependencywalker.com/) can be used to see the dependency tree between loaded DLLs on the OS. Legacy malware would change the Import Address Table (IAT) to point to a new address in memory for the DLL it needs. Changing pointers to new malicious address locations with malicious payloads has since been rectified on newer versions of Windows, and it is also readily apparent: if all the address locations for functions are in the higher memory space around 0x7C86 and a single function is loaded at 0x3420, then most likely that IAT entry has been changed with a hook by a rootkit. Alternatively, malware can modify the DLL inline, requiring no changes in pointers but only in the code, leading to a vulnerability commonly known as DLL proxying, which is much harder to detect but can be flagged using integrity checking.
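The load-order logic described above can be modeled as a first-match lookup over an ordered list of directories. The Python sketch below is a simplified model, not a reproduction of the Windows loader: the search order is passed in as a parameter, the directory contents are a hypothetical in-memory mapping, and all names are illustrative. It reports which directories an attacker could plant a DLL in so that it is found before the legitimate copy.

```python
from typing import Dict, List, Optional, Set


def resolve_dll(name: str, search_order: List[str],
                dir_contents: Dict[str, Set[str]],
                known_dlls: Set[str], system_dir: str) -> Optional[str]:
    """Return the directory a DLL would be loaded from under this model."""
    key = name.lower()
    if key in known_dlls:          # KnownDLLs bypass the directory search
        return system_dir
    for directory in search_order:  # otherwise the first match wins
        if key in dir_contents.get(directory, set()):
            return directory
    return None


def planting_dirs(name: str, search_order: List[str],
                  legit_dir: str, known_dlls: Set[str]) -> List[str]:
    """Directories searched before the legitimate location of the DLL."""
    if name.lower() in known_dlls:
        return []
    if legit_dir not in search_order:
        return list(search_order)
    return search_order[:search_order.index(legit_dir)]


# Hypothetical layout: explorer.exe lives in C:\Windows, so that directory
# is searched before System32 for DLLs that are not in KnownDLLs.
SYSTEM32 = r"C:\Windows\System32"
ORDER = [r"C:\Windows", SYSTEM32]
KNOWN = {"kernel32.dll"}
```

In this model, planting_dirs("ntshrui.dll", ORDER, SYSTEM32, KNOWN) returns the executable's own directory, which matches the ntshrui.dll example in the text, while a DLL listed in KnownDLLs yields no candidate directories.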

Potentially vulnerable DLLs can be observed using tools such as SysInternals’ Process Monitor (Procmon (https://docs.microsoft.com/en-us/sysinternals/downloads/procmon)). In Procmon, if a DLL is not found and it is not core to the functionality of the process, it will return an entry NAME NOT FOUND. Using an out-of-the-box option like Metasploit’s (https://www.metasploit.com/) msfvenom will produce a DLL that can be put in place of the missing DLL, thereby running the malicious payload and executing a successful DLL hijacking. Other tools such as the SANS (https://www.sans.org/blog/detecting-dll-hijacking-on-windows/) tool can be used to search for DLLs that appear multiple times, are unsigned, and are in unusual folders. More common in research, the Dependency Walker tool (https://www.dependencywalker.com/) makes it easy to view the mapping of imported DLLs and even a hierarchical view of all dependencies between modules by looking at the IAT. The authors in [wang 2008] separated DLL usage according to implicit dependency, delay-load dependency, and forward dependency, which are all responsible for the static loading of DLLs in 3 tiers of hierarchy. Tier 1 consists of the DLLs used by the main program, Tier 2 consists of DLLs invoked by other DLLs that are not in the main executable, and Tier 3 is the entire statically loaded tree. The authors created a one-hot encoded vector indicating whether each particular DLL existed in the program and used that feature mapping for classification. In [311], a similar approach was taken which relied on the DLL dependency tree but incorporated a string encoding of the tree dependencies. The authors looked at all the tiers of DLLs which loaded and created a depth-first representation where the original executable is the root node and all nodes from root to leaf are assigned a unique integer value. 
They then used CMTreeMiner, which extracts closed frequent subtrees that exist in a particular executable, and one-hot encoded a feature vector according to whether a particular subtree exists in the executable. Looking at depths of subtrees from 3 to 6, accuracies of 98% and above were obtained with random forest and naive Bayes classifiers. The work of [312] did not go into as much depth as [310], but the authors looked at the number of API calls by a DLL in addition to the list of DLLs used and the API calls made. In any case, while DLLs do provide a good proxy of malicious intent, it is in fact the API calls that are made that are the real discriminator. For this reason, researchers turn their focus towards API calls and their usage among malware variants.
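The two encodings described above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the cited authors' implementation: the vocabulary, the nested-tuple tree format, and the choice to assign integers to DLL names in order of first visit are all assumptions.

```python
def one_hot_dlls(imported, vocabulary):
    """1 if the DLL from the vocabulary is present in the program, else 0."""
    present = {dll.lower() for dll in imported}
    return [1 if dll in present else 0 for dll in vocabulary]


def dfs_encode(node, ids=None):
    """Depth-first (preorder) integer labels for a dependency tree.

    A node is a (name, children) tuple; each distinct module name receives
    the next unused integer the first time it is visited.
    """
    if ids is None:
        ids = {}
    name, children = node
    labels = [ids.setdefault(name.lower(), len(ids))]
    for child in children:
        labels.extend(dfs_encode(child, ids))
    return labels


# Hypothetical vocabulary and dependency tree for illustration only.
VOCAB = ["kernel32.dll", "user32.dll", "ws2_32.dll"]
TREE = ("sample.exe", [
    ("kernel32.dll", [("ntdll.dll", [])]),
    ("user32.dll", [("ntdll.dll", [])]),
])
```

Here the executable is the root with label 0, and a DLL that appears under two parents keeps the same label in both places, so repeated subtrees remain recognizable to a frequent-subtree miner.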

5.2. Windows Application Programming Interface

Windows API calls are interfaces provided by DLLs to access low-level resources [313]. API calls come in two flavors: user-level and kernel-level APIs. User-level APIs operate at Ring 3 and provide the average user just enough privileges to access system resources to perform typical workloads. Access to the actual hardware, on the other hand, runs in kernel mode, which makes use of the kernel-level APIs that are not directly available to users for the sake of the security and stability of the OS. From the stability perspective, a user-level crash results in an error message, while a kernel-level crash results in the OS crashing. From the security side, malware could reside in the kernel and operate at a layer that is invisible to the user or any Ring 3 defenses. Nowadays, it is much less likely to see malware residing in the kernel, as the Windows OS has made it more difficult to run code in the kernel and make use of rootkits. Ultimately, to make use of the kernel, all userland code uses Kernel32.dll as a gateway to communicate with Ntdll.dll which, in turn, communicates with the kernel.

The fascination with API calls comes down to the fact that API calls provide a higher resolution of analysis of the operation of any given process. API functions and system calls are related to the services provided by the OS [309, 314, 315]. As the API is responsible for all system resource management, it is a particularly discriminating feature for malware classification, as it provides the basic functionality for everything from networking to saving files to disk. The usage of APIs and patterns in usage can be very telling. Similar to the overarching view of static and dynamic analysis of behavior, APIs are approached from a static and dynamic perspective as well. In dynamic analysis, the run-time behavior is monitored, and ideally, all code segments are traced to reveal the behavior of the malware. This circumvents the obfuscation techniques of encryption, packing, and polymorphism [316]. Static analysis, on the other hand, can be fooled by adding fake API calls [317] or API calls typical of benign event activity [318]. It is also the case, as mentioned in Section 5.1, that the imported functions of a DLL may or may not ever be called, which can be used as a distraction from the real nefarious purpose of the malware.

Features such as the API call function names, parameters, and the return values of an executable can be extracted from the APIs [319]. Monitoring the API calls is an approach to detecting the malicious behavior of software; however, there is no clear distinction between malicious APIs and benign APIs as all native APIs are a helpful utility given the right context. The next section will outline some of the nefarious usages of APIs by malware authors and how they balance stealthiness with functionality.

5.2.1. Malicious Windows Application Programming Interface Usage

Broadly speaking, API usage can be categorized into 7 categories based on the functionality they provide to a process [314, 320]. Researchers have also made use of similar categories to classify malicious intent [184]. Some of the malicious functionality that APIs can provide to executables includes the following:
File: create a file in sensitive folders; delete or hide files; file directory traversal
Process: inject DLL into a running system process; create mutex to prevent execution
Memory: free up or occupy memory; minimize memory usage
Registry: add or delete system service; autorun, hide, and protect
Network: open and listen on a port; communicate over e-mail service; communicate with CnC server
Windows Service: terminate Windows update or firewall; set up Telnet or SSH
Others: hooking keyboard; hiding window; scan for existing vulnerabilities and configuration

Code injection usually begins with the usage of third-party DLLs or injecting code into a Windows DLL. Malware makes use of Ntdll.dll indirectly to make use of kernel APIs, so checking the stack trace of event activity is important [321]. Malware authors have to balance gaining increased functionality against the cost of raising suspicion, so a careful deliberation of which APIs to use is always in mind [322]. Native Windows API calls that begin with NtQuery are popular for malware, as they include functions such as NtQuerySystemInformation and NtQueryInformationProcess which provide much more information about the host system. More invasively, early rootkits would make changes to the System Service Descriptor Table (SSDT), which contains addresses to the kernel functions, so that the entries would instead point to malicious driver functions. If, for example, a typical address of a kernel function is set to 804d7000 for ntoskrnl.exe, then one can look for unfamiliar addresses that instead fall within the address space typical for kernel drivers. With x64 versions of Windows starting with XP, PatchGuard prevents modification of the kernel and the kernel code in the SSDT and the Interrupt Descriptor Table (IDT). The IDT takes care of exception handling, so rerouting the response to interrupts to malicious code would be highly disruptive. As a precaution to prevent changes to native Microsoft DLLs and APIs, Windows Vista was the first Windows version to introduce digitally signed drivers. Some of the example use-cases and APIs used by malware are the following:
(a) File: if software wishes to make use of the file system, it can do so using CreateFile, ReadFile, and WriteFile. Malware can make use of CreateFileMapping or MapViewOfFile, which loads the file into RAM, avoiding writing to disk altogether. 
Some malware types, like ransomware, perform high-volume file and encryption operations to carry out their function [323].
(b) Process: it is typical for malware to use OpenMutex to check if a mutex exists for a running malware executable. Malware can make use of DLL injection or direct injection. Code can be injected into a running process using VirtualAllocEx and WriteProcessMemory. When the code is injected into an executable such as Explorer.exe, it runs with the same privileges as the executable it is injected into. Asynchronous procedure call (APC) injection is a process by which malicious code is attached to the APC queue of a process’ thread. WaitForSingleObjectEx is the most common call, with QueueUserAPC being used for queues running on a thread. It can be run from the kernel using KeInitializeApc and KeInsertQueueApc. APC remains a known vulnerability on the MITRE ATT&CK knowledge base [324].
(c) Registry: when it comes to making use of the Windows registry, malware can gain persistence so that it can load whenever Windows restarts [316, 325]. Most commonly, the Run key located in HKLM\Software\Microsoft\Windows\CurrentVersion\Run can set executables to run automatically. The Sysinternals tool Autoruns (https://docs.microsoft.com/en-us/sysinternals/downloads/autoruns) can be used to check dozens of registry locations, drivers loaded into the kernel, and any other DLLs. Other options for persistence include running Services, which typically hold more powerful privileges than an administrator. Other registry entries include AppInit_DLLs, which is a registry key that contains DLLs that are attached to processes that load User32.dll. This option is disabled in Windows 8 and later versions when Secure Boot is enabled. WinLogon Notify launches during log on, sleep, or when the lock screen is open. 
Adding a malicious DLL to the ServiceDll parameter in the registry allows a malicious service to load its malicious service DLL into a running svchost.exe [326].
(d) Networking: certain network API usage can be indicative of malicious intent, as networking APIs provide different levels of flexibility. For example, the APIs in Wininet.dll provide higher-level APIs for HTTP and HTTPS communications. Malware might use the raw Winsock libraries located in ws2_32.dll if there is a need to provide further flexibility to their malicious arsenal. The Metasploit framework can produce shellcode that acts as a listener on a port by creating a simple process using CreateProcess. The configuration for STARTUPINFO is set to a socket, thereby creating a remote shell. This setup allows for I/O and error handling for cmd.exe and does so with the command window suppressed to remain stealthy.
(e) Other: malware downloaders and launchers use URLDownloadToFileA to download a file from a URL and then execute the file by making a call to WinExec. Keyloggers use hooking or polling. Hooking uses an API such as SetWindowsHookEx to be notified of a key press, while polling is conducted using GetAsyncKeyState and GetForegroundWindow to poll key states during any time period.
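The grouping of API names into functional categories lends itself to a simple lookup table. The Python sketch below maps only API names that appear in the use-cases above to the categories under which the text lists them; the table is illustrative and far smaller than the category databases used in the cited works, and the sample trace is hypothetical.

```python
from collections import Counter

# Illustrative mapping limited to API names mentioned in the text.
API_CATEGORY = {
    "CreateFile": "File",
    "ReadFile": "File",
    "WriteFile": "File",
    "CreateFileMapping": "File",
    "MapViewOfFile": "File",
    "OpenMutex": "Process",
    "VirtualAllocEx": "Process",
    "WriteProcessMemory": "Process",
    "QueueUserAPC": "Process",
    "URLDownloadToFileA": "Other",
    "WinExec": "Other",
    "SetWindowsHookEx": "Other",
    "GetAsyncKeyState": "Other",
    "GetForegroundWindow": "Other",
}


def category_profile(trace):
    """Count how many calls in a trace fall into each category."""
    return Counter(API_CATEGORY.get(api, "Unmapped") for api in trace)


# Hypothetical trace resembling process injection after a file read.
TRACE = ["CreateFile", "ReadFile", "VirtualAllocEx",
         "WriteProcessMemory", "Sleep"]
PROFILE = category_profile(TRACE)
```

A per-category count like this is a coarse feature; as the surrounding text notes, the same categories are exercised by benign software, so such profiles are generally combined with frequency or sequence information.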

Researchers have looked beyond individual API calls and have investigated API call distribution [327]. A summary of some of these classes of API usage used by researchers is shown in Table 11. The issue that arises is that it requires significant domain expertise to create and update a database of API calls for particular malware variants or families. It is also the case that there is significant overlap between malicious and benign API usage, thereby making it difficult to flag malware without raising false positives. The work of [328] developed a similarity metric to trace the similarity between malware variants and Stuxnet based on groups of API calls. It stands to reason that groups of API calls in succession, or the distribution of API calls, can provide further insight into malicious behavior [334]. For this reason, we investigate some of these research methods in the following section.

5.2.2. Classification of Windows Application Programming Interfaces

The investigation of API calls in the context of feature extraction is sometimes referred to as API call sequences or API call traces. Under either definition, we are concerned with the patterns that arise in the sequence of API calls used one after another. Early adopters of this form of investigation used Hofmeyr API call sequences, whereby behavior profiles were established between two sequences of API calls based on Hamming distance [335]. Originally, UNIX system calls were traced, and the investigators were motivated by the immune system in their attempt to draw an analogy between sequences of system calls and chains of amino acids in the human body. API call sequences have been leveraged in several applications involving malware detection [160, 184, 316, 336–339], as well as in tracing API calls during event activity [316, 340–343]. Overall, API call frequency and API sequences are effective techniques in identifying data-flow dependencies in a process [315].
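A Hamming-distance comparison of call sequences can be sketched as follows in Python. This is a minimal sketch of the general idea, assuming fixed-length sliding windows and a profile built from a single normal trace; the window length, the function names, and the sample traces are hypothetical and are not taken from [335].

```python
def windows(sequence, k):
    """All overlapping windows of length k over a call sequence."""
    return [tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length windows differ."""
    return sum(x != y for x, y in zip(a, b))


def anomaly_score(trace, profile, k):
    """Largest distance from any trace window to its nearest profile window."""
    trace_windows = windows(trace, k)
    if not trace_windows or not profile:
        return 0
    return max(min(hamming(w, p) for p in profile) for w in trace_windows)


# Hypothetical normal behavior and a trace that deviates by one call.
NORMAL = ["CreateFile", "ReadFile", "WriteFile", "CloseHandle"]
PROFILE = set(windows(NORMAL, 3))
SUSPECT = ["CreateFile", "ReadFile", "WriteProcessMemory", "CloseHandle"]
```

A score of 0 means every window of the trace was seen during normal behavior, while larger scores indicate windows that differ from all known windows in more positions.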

5.2.3. Application Programming Interface Frequency

One of the more primitive approaches to API analysis is API frequency analysis. It stands to reason that if malware and benignware make use of similar API libraries, then malware must make use of certain libraries or “malicious” APIs more frequently than others. In [319], considering API frequency alone was effective in achieving 97% accuracy in a multicategorical classification problem involving metamorphic malware variants. One takeaway was that incorporating sequential information did improve the accuracy of the models, so frequency analysis is certainly a useful preliminary step in behavioral analysis. The work of [344] developed an end-to-end malware detector based on the frequency of occurrence of opcodes and API calls. Their detector, coined OPEM, demonstrated an increased area under the curve (AUC) and fewer false positives (FPs) with static calls and a hybrid approach. Unfortunately, the authors did not account for obfuscated malware, which tends to be packed and to have polymorphic engines that obfuscate the opcodes. Their hybrid approach, which included API execution traces, did outperform all other feature sets used in their work [344]. Certain works, like that of [245], decided to use the frequency of a subset of 794 API calls extracted from 500 thousand malware samples. The authors then fused this feature set with other static techniques such as entropy and features extracted from the PE file such as the total number of assembly instructions in the .data and .rsrc sections. The drawback to these approaches is that taking the most frequent API calls leaves out information on potential edge cases; it is also a fact that frequent API calls by malware are still routine events carried out by benignware, such as reserving memory, creating a file, etc. The work of [345] approached the problem in a similar fashion, where they eliminated API calls with low frequency. 
Again, doing so removes important edge cases and is typically done to reduce the size of the feature vector space and improve training times. The aforementioned works all made use of ML techniques to classify malicious behavior. Other works make use of statistical similarity metrics to differentiate malicious from benign samples by using one or more metrics of comparison. For example, in [304], the authors made use of information gain to select features based on the sequence of opcodes from Android applications. Based on some key obfuscation techniques discussed thus far, including control flow obfuscation and string encryption, in addition to advanced techniques such as class encryption and reflection, the authors found that several ML approaches were effective in detecting obfuscated samples.
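The frequency features and the low-frequency pruning discussed above can be illustrated with a short Python sketch. The function names, the threshold, and the two sample traces are hypothetical; the cited works use much larger vocabularies and their own selection criteria.

```python
from collections import Counter


def prune_vocabulary(traces, min_count):
    """Keep only API names seen at least min_count times across all traces."""
    totals = Counter(api for trace in traces for api in trace)
    return sorted(api for api, count in totals.items() if count >= min_count)


def frequency_vector(trace, vocabulary):
    """Relative frequency of each vocabulary API within a single trace."""
    counts = Counter(trace)
    total = len(trace) or 1  # avoid division by zero on empty traces
    return [counts[api] / total for api in vocabulary]


# Hypothetical traces for illustration.
TRACES = [
    ["VirtualAlloc", "CreateFile", "CreateFile"],
    ["CreateFile", "SetWindowsHookEx"],
]
VOCAB = prune_vocabulary(TRACES, min_count=2)
```

With a threshold of 2, only CreateFile survives, and the single SetWindowsHookEx call (a keylogging indicator in the earlier discussion) is discarded, which is exactly the loss of edge cases that the text cautions against.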

In [346], the cosine similarity was proposed to compare API call frequency between two vectors to represent the similarity in vector space of a known signature to a new malware sample. The expression for cosine similarity is shown in equation (1). The motivation for using cosine similarity is that the measure computes the similarity between two vectors while excluding their magnitude. This has the effect of ignoring the impact of magnitude if one vector were to use an API much more frequently than the other, as the angle in equation (1) is indifferent to their magnitude.

The extended Jaccard measure is another similarity metric that is useful in measuring the degree of overlap between two sets [346]. As an extension of Jaccard for use with continuous or count attributes, it is effective in demonstrating the similarity, or the ratio of set intersection, between two sets in the context of set theory. The equation for this relationship is shown in equation (2). The numerator can be seen as expressing the set intersection, while the denominator can be seen as the union, which acts as a form of normalization.
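Both measures can be computed directly from two API frequency vectors. The Python sketch below assumes the standard definitions, cos(A, B) = (A · B) / (‖A‖ ‖B‖) for cosine similarity and EJ(A, B) = (A · B) / (‖A‖² + ‖B‖² − A · B) for the extended Jaccard measure; the function names and the example vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def extended_jaccard(a, b):
    """Extended Jaccard (Tanimoto) coefficient for count vectors."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


# Hypothetical API count vectors: SAMPLE uses each API twice as often.
SIGNATURE = [1, 2, 0]
SAMPLE = [2, 4, 0]
```

For these vectors the cosine similarity is 1 because the two vectors point in the same direction, whereas the extended Jaccard measure is 10 / 15 ≈ 0.67 because its denominator still reflects the difference in magnitude.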

The cosine similarity was used effectively to create a similarity matrix between the rarest 20–30% of raw security events and the events of the training set [160]. This approach was used to significantly reduce the dimensionality of the set by focusing on the similarities between a baseline set of unusual events and the dataset more broadly. In [347], similarity metrics were computed for API sequences that appear frequently, and both assembly instructions and API calls were considered. The API-based approach was noted to be faster owing to its smaller signature; however, the authors noted that it is poorly suited to network applications such as PuTTY and to encrypted files, which show few or no API calls. Their work did rely on unpacked executables as it was limited to static analysis. In [346], an API call frequency similarity measure was used, followed by a chi-square test to test the representation based on a distribution from a known signature. Families of APIs of known metamorphic mutation engines were categorized and compared to one another and to the same mutation engine using both the cosine similarity and the extended Jaccard measure. An interesting finding was that comparing a similarity metric between variants from the same mutation engine provided a measure of the degree of obfuscation, which was shown to be the largest for the next generation virus creation kit (NGVCK), a well-known mutation engine. The work of [275] completed similar work, whereby a proximity index table was set up to compare the similarities between mutation engine families. Due to the sheer number of possible API calls, feature dimensionality reduction was carried out on the original 1000 or so APIs according to frequency. The authors noted that common APIs were used between mass code generator (MPCGEN) and NGVCK-generated viruses. 
An approach that included data mining was taken in [320], whereby the calling frequencies of the raw features are calculated to select a subset of features, and then principal component analysis (PCA) is used for dimensionality reduction of the selected features. In total, 24,662 API function calls, 792 DLL features, along with PE header info, were considered in their feature set while considering only the top 30 DLLs according to frequency [320]. To address the issue with high-dimensional data, the authors in [336] developed a string-based malware detection system that focused on the top 3,000 interpretable strings that included API names using a max-relevance algorithm. Their feature parser extracted strings from 9,838 executables and classified them as Backdoors, spyware, Trojans, and worms, in addition to benignware. While these techniques have been proven useful in many controlled scenarios, frequency-based analysis is still prone to malware which can obfuscate themselves to avoid heuristic detection. For this reason, sequence analysis is used.

5.2.4. Application Programming Interface Sequences

The investigation of API sequences has become the de facto standard for many behavioral approaches, as the information contained within sequences is too valuable to rely on API frequency alone. It has also led to the adoption of natural language approaches, which will be discussed in Section 5.4. The work of [316] provided an example of the flow of information surrounding a process that can act as a template for how to carry out sequence analysis of APIs. The three flow paths are as follows:
(1) The API call GetModuleFileName takes NULL as its first argument, which returns the malware file path.
(1.1) The path can be passed to CopyFile to open the executable and run its processes.
(1.2) Or, if desired, a process can call CopyFile on itself with the share permission set to NULL, thereby preventing applications from opening and scanning the file.

This example serves to demonstrate that two very different uses of CopyFile can indicate malicious behavior, and only once the whole context is understood can a detection system raise an alert. This was performed successfully in [337], where 2,727 unique APIs were categorized into 26 groups based on functionality such as hooking, files and directories, registry modification, and others. Based on the sequence of the APIs, critical patterns were uncovered which were essential for core functionality such as screen capturing and DLL injection. Results demonstrated F1 scores as high as 0.999 with a focus on the longest common subsequence between existing malicious signatures and those of unknown variants. A similar approach was taken in [1], where 11 hand-crafted signatures of dynamic and static behaviors were created based on malicious operations spanning registry operations, device operations, and kernel operations. These signatures were converted into semantic blocks based on the largest common subsequences between dynamic and static APIs. The work of [348] created a formulation that treats API sequences as part of a temporal domain and the pointers passed to API calls as spatial information. The motivation is similar to that of [316]: an API call such as LocalAlloc takes uBytes as an argument, and its value is statistically lower for malicious files than for benign files during heap allocation. Capturing this information in the spatial domain, while modeling the sequences of APIs in the temporal domain, was effective in classifying 516 executables with accuracies as high as 0.966. Rather than focusing on API sequences as they pertain to general malicious behavior, researchers have also explored common API sequence usage among malware variants and types. In [330], five classes of malware, including Worm, Trojan-Downloader, Trojan-Spy, Trojan-Dropper, and Backdoor, were associated based on the presence of 26 API categories and sequences.
In total, 534 malware variants were hooked and then categorized based on the presence of these API sequences, which were characteristically different for different malware types that pursue different objectives through their API usage. In [349], the authors considered 9 behaviors based on sequences of 2–4 APIs in succession, while [315] looked at combinations of 3 APIs (such as CreateFile, WriteFile, and CloseHandle). The work of [350] obtained a 99.7% detection rate using several API call sets, which included sequences of different lengths.
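
The longest common subsequence comparison mentioned above can be sketched with the standard dynamic-programming formulation. This is a minimal illustration rather than the method of [337] or [1]; the two traces and the normalized score are hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two API-call lists."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the subsequence itself
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

# Hypothetical signature and trace, for illustration only
known_signature = ["GetModuleFileName", "CopyFile", "RegSetValue", "CreateProcess"]
unknown_trace = ["GetModuleFileName", "Sleep", "CopyFile", "WriteFile", "CreateProcess"]

common = longest_common_subsequence(known_signature, unknown_trace)
score = len(common) / len(known_signature)  # fraction of the signature recovered
print(common, score)
```

Because the subsequence need not be contiguous, calls inserted between the signature's calls (here Sleep and WriteFile) do not break the match, which is why this comparison tolerates some junk-call insertion.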

When it comes to determining appropriate sets of API calls for classification, researchers have pursued data mining approaches that optimize a set of association patterns towards a particular objective [351], in this case whether a sample is malicious or benign. Several papers have been published in this area, in particular those out of Xiamen University [352–354], which focused on malware classification. Ultimately, regardless of the particular mining algorithm used, the idea is to find a set of API calls that supports the objective of separating malware from benignware. In [353], this was performed using a frequency pattern growth algorithm [355]. The goal is to create a frequency pattern tree which encodes sequences in a tree-like structure similar to a Huffman coding, where the parents of a node are encoded as longer extensions of the child sequences. So, a given API call API_i would exist as a leaf node, while its parent nodes would contain sequences that include API_i, such as (API_i, API_j) or (API_i, API_k). This is performed recursively up the tree, with frequencies stored as satellite information at each node, and this is how rules are generated. A new sample is then matched against the rules in descending order of the rules' confidence and support [356]. The motivation is to maximize the likelihood that rules exist which can discriminate one objective from the other. This procedure was further described in [352] and used successfully to generate rules which parse 29,850 Windows PE files, half of which were malicious. In the approach of [356], the authors compared frequency mining approaches to ML approaches including SVM, decision trees, and naïve Bayes and noted a 2–9% improvement in classification accuracy.
Because these approaches extract the APIs statically from the PE files, they are not effective for packed malware or for APIs that are imported by the executable but never used. In a later paper by Ye and Yu [143], rule pruning was used to remove duplicate rules, and only the top 100 API calls were used, as no further improvement was shown beyond 100. Using a linear SVM, an associative classifier, and a novel hierarchical associative classifier, 26 thousand malicious samples were parsed and a precision as high as 96% was achieved, but with a low recall of 34%. A thorough examination of the state of data mining approaches as they pertain to cyber security is covered in [357]. While handcrafting sequence signatures can be time-consuming and requires knowledge of specific patterns in API usage, the alternative is to consider all possible subsequences of a given length and the usage patterns of all sequences simultaneously. While data mining does provide a compact representation to do this, more innovative works allow models to discern these rules on their own when coupled with ML approaches. For this purpose, the n-gram representation is used.
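
The rule-matching step can be illustrated with a small sketch. It enumerates API itemsets by brute force rather than building a frequency pattern tree, so it is not the algorithm of [353] or [355]; it only shows how support and confidence are computed and how a new sample is matched against rules in descending order of confidence and support. The samples, labels, and thresholds are hypothetical.

```python
from itertools import combinations

def mine_rules(samples, max_len=2, min_support=0.25):
    """Score rules of the form itemset -> label by support and confidence."""
    total = len(samples)
    apis = sorted(set().union(*(s for s, _ in samples)))
    rules = []
    for k in range(1, max_len + 1):
        for itemset in combinations(apis, k):
            items = set(itemset)
            # Labels of every sample that imports all APIs in the itemset
            covered = [label for s, label in samples if items <= s]
            for label in sorted(set(covered)):
                hits = covered.count(label)
                support = hits / total
                confidence = hits / len(covered)
                if support >= min_support:
                    rules.append((itemset, label, support, confidence))
    # Descending confidence, then descending support; names break ties
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules

def classify(sample, rules, default="benign"):
    """Return the label of the first (best-ranked) rule the sample satisfies."""
    for itemset, label, _, _ in rules:
        if set(itemset) <= sample:
            return label
    return default

# Hypothetical imported-API sets with labels
samples = [
    ({"CreateFile", "WriteFile", "RegSetValue"}, "malicious"),
    ({"CreateFile", "WriteFile", "CreateRemoteThread"}, "malicious"),
    ({"CreateFile", "ReadFile"}, "benign"),
    ({"CreateFile", "WriteFile", "CloseHandle"}, "benign"),
]
rules = mine_rules(samples)
print(classify({"CreateFile", "CreateRemoteThread"}, rules))
```

Brute-force enumeration grows combinatorially with the number of APIs, which is the cost that tree-based miners are designed to avoid.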

5.2.5. Application Programming Interface n-Grams

One of the earliest forms of sequence analysis in the malware domain was carried out in [358]. It was also the first successful application of n-grams, which involves translating a sequence of APIs into subsequences $n$ long and doing so for every possible subsequence that exists in the original API sequence. This has the effect of incorporating information about the sequences of APIs with little preprocessing required. For any given API sequence, a sequence of length $L$ would have $L - n + 1$ n-grams, where $n$ is the length of the subsequences and assuming a stride length of one. So, for an API sequence 10 APIs long, we would have $10 - 5 + 1 = 6$ subsequences for $n = 5$. The number of possible n-gram combinations would be $|A|^5$, which represents all the unique combinations of five APIs in sequence that are possible in the set of APIs $A$. The authors in [358] looked at short byte string n-grams of the PC boot sector, which was 512 bytes long. They utilized an ML approach that removed the sigmoid activation and stored the weights as 5/6-bit integers. The technique became part of the IBM AV package and was successfully deployed to millions of machines.
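
A sliding-window extraction of API n-grams can be sketched as follows. The trace and the vocabulary size of 300 APIs are hypothetical and chosen only to make the counts concrete.

```python
from collections import Counter

def api_ngrams(sequence, n, stride=1):
    """Slide a window of n calls over the sequence, advancing `stride` calls per step."""
    return [tuple(sequence[i:i + n])
            for i in range(0, len(sequence) - n + 1, stride)]

# Hypothetical trace of 10 API calls
trace = ["CreateFile", "ReadFile", "CloseHandle", "CreateFile", "WriteFile",
         "CloseHandle", "RegOpenKey", "RegSetValue", "RegCloseKey", "ExitProcess"]

grams = api_ngrams(trace, n=5)
print(len(grams))  # 10 - 5 + 1 = 6 windows with a stride of one

# n-gram counts are what a frequency-based model would consume
bigram_counts = Counter(api_ngrams(trace, n=2))

# Size of the space of possible 5-grams for an assumed vocabulary of 300 APIs
vocabulary_size = 300
possible_5grams = vocabulary_size ** 5
print(possible_5grams)
```

The second printed value shows why long n-grams become impractical: the feature space grows exponentially in the window length even for a modest API vocabulary.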

The versatility of n-grams means that one can use a smaller $n$ to generate shorter signatures, which are noisy but more generalizable, or a larger $n$ to create more specific signatures, which lead to fewer false positives (FP) at the cost of fewer true positives (TP). The application of n-grams is known to have low FP rates with increasing sequence length $n$; however, the space complexity of n-gram sequences is exponential in the length of the sequences [71]. The work of [359] focused on the PE header and body and carried out static analysis using the top 500 most common 4-grams [360], representing DLL names. Results demonstrated that header-only features are as relevant as body information and that, separately, they both have a use case [359]. Similarly, in [361], a 4-gram representation was used to model API sequences. The authors developed average confidence values of benign and malicious activity and used the average confidence of malware as a threshold. This simple thresholding obtained 90% accuracy; however, the work provided no indication of FP rates to support its findings. The work of [342] went one step further and carried out n-gram modeling of API call sequences based on file system, network, and registry activity. This work was unique in that it separated API events by these three categories to provide a further analysis of how each event category fares as a discriminator. In all, the authors looked at over 17,900 malicious executables and obtained 92.5% test accuracy. Finally, [345] used 3- and 4-gram representations but focused on the dynamic API usage after process execution. This resulted in 94% accuracy, and when coupled with static feature sets based on frequency, the accuracy improved beyond 97%.
The shortfall of n-grams is that values of $n$ beyond 4 or 5 are impractical to model due to the number of permutations of API calls, which significantly hinders the ability of models to attend to different behaviors. For this reason, we can pursue graph-based approaches in an attempt to consider different behaviors simultaneously.

5.3. Graph-Based Approaches

Graph-based approaches to malware detection have a long history. The earliest applications of graph-based analysis include the use of control flow graphs (CFG) to evaluate unique control flow sequences of a program. A CFG is created as a directed graph where the nodes represent individual program instructions or blocks of instructions and the edges represent the control flow between statements [310]. Within each CFG, we have a subgraph that is isomorphic to the whole graph. Trying to map a subgraph from one sample to another is an instance of the subgraph isomorphism problem, which is NP-complete [362]. In Figure 11, we can see an illustration of the control flow from the Trojan.Emotet virus. This instruction segment belongs to the set of instructions that are responsible for spawning a child process, which depends on the initial call to CreateEvent at the top of Figure 11. When examining such a control flow, the question becomes which segment(s) of instructions are responsible for malicious behavior. While this segment was carefully selected to show the behavior of Emotet, extracting similar segments from the entire malicious execution is cumbersome, especially when they include diversions and dead ends. Extracting such segments as signatures and generalizing these signatures to flag future malware samples is the goal of CFG-based malware classification.

Most applications of CFGs extract some subset of the control flow to compare against other samples and establish a baseline for malicious control flow. One approach used by [363] looked at jmp, jcc, call, ret, inst, and ret opcode instructions and built the CFG based on only these instructions, thereby creating a reduced graph and leaving placeholders for the rest. Based on these, the authors created unique signatures for malware detection. In [364], the authors looked at the system call functions, which included call, jump, and conditional jump expressions in the x86 Intel instruction set. In [365], the authors looked at the most frequent subgraphs and simply excluded the rest. The sample set used by [366] included 25,145 functions which were 5 nodes (simple instructions) large and 15,439 unique functions which were 5 nodes long. Setting the threshold at 5 ensures that only atypical calls and procedures are included. One of the issues associated with CFGs is that the control flow is either (a) similar among all executables, regardless of malicious activity (also known as boilerplate code) or (b) sometimes padded with benign code segments that are never executed but can confuse string-based scanning techniques [366]. This was considered by [367] in their CFG reconstruction based on system call logs extracted using Procmon. To remove boilerplate code, their approach did not look at functions that were not loaded by the dynamic linker. However, this is a double-edged sword, as malware does not only rely on its Import Address Table (IAT) to fetch the APIs it needs; it can load those statically as well. An alternative approach used in [368] looked at contrast subgraphing [369], which is the opposite of graph isomorphism since it looks for the smallest subgraph of one graph that does not appear in another.
This approach lends itself well to looking for characteristically significant differences between malware and benignware, rather than developing signatures that look for similarities among classes. Alternatively, one can consider creating signatures as coopcode graphs that belong to malware families and therefore create high-level signatures that can be used to classify malware families based on the coopcode graph similarity [319]. While opcodes have been investigated extensively, Windows API usage has been shown to perform well at detecting polymorphic variants [143, 160, 364], but the large size of potential subgraphs remains a limitation of graph-based approaches. Going more in depth, [370] examined not just the API functions used but also their input arguments among file system, registry, socket, and process operations. This provides additional insight into the calling process, such as the bytes written when using WriteFile or the destination key when setting a registry value using RegSetValue. The work of [289] looked at opcode similarity to detect polymorphic variants. The authors developed a weighted directed graph where the edges were the probabilities that one opcode followed the next. They then computed scores between metamorphic viruses and between viruses and benign files and developed a threshold score for maliciousness. This approach performed well since metamorphic viruses are created with a select few metamorphic engines; therefore, the signatures developed are in fact tracing the obfuscation used by a given mutation engine [364, 371].
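
The weighted directed opcode graph described for [289] can be sketched as a table of transition probabilities. The distance function below (mean absolute difference in edge weights) is an assumption made for illustration and is not claimed to be the scoring used in [289]; the opcode sequences are hypothetical.

```python
from collections import Counter, defaultdict

def opcode_graph(opcodes):
    """Weighted directed graph: graph[a][b] is the probability that b follows a."""
    counts = defaultdict(Counter)
    for current, following in zip(opcodes, opcodes[1:]):
        counts[current][following] += 1
    graph = {}
    for src, successors in counts.items():
        total = sum(successors.values())
        graph[src] = {dst: c / total for dst, c in successors.items()}
    return graph

def graph_distance(g1, g2):
    """Assumed score: mean absolute edge-weight difference over the union of edges."""
    edges = {(s, d) for g in (g1, g2) for s, succ in g.items() for d in succ}
    if not edges:
        return 0.0
    diff = sum(abs(g1.get(s, {}).get(d, 0.0) - g2.get(s, {}).get(d, 0.0))
               for s, d in edges)
    return diff / len(edges)

# Hypothetical opcode sequences from two samples
sample_a = ["mov", "push", "call", "mov", "push", "call", "ret"]
sample_b = ["mov", "push", "call", "ret"]

graph_a = opcode_graph(sample_a)
graph_b = opcode_graph(sample_b)
print(graph_a["call"])
print(graph_distance(graph_a, graph_b))
```

A threshold on such a score would then separate samples that share a mutation engine's transition profile from those that do not; the threshold value itself would have to be tuned on labeled data.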

Another factor to consider when using CFGs is how to establish a comparison between CFGs from malicious and nonmalicious control flows. The authors in [362] examined the detection of metamorphic code based on a cross-comparison of the control flow graphs of known malware. The authors normalized the code to remove dead or unreachable code, removed common subexpressions, removed dead paths, and analyzed indirect control flow transitions to remove longer chains of control flow and avoid misdirections. The authors recorded a 96.5% true positive rate while producing almost no false positives. A Jaccard similarity matrix between system call subsequences was used in [367]. The cosine similarity is another approach used [372], but all similarity metrics suffer from drawbacks because they are all subject to the selection of subgraph, as discussed earlier. Even with reliable subgraphs that perform well on a particular set of malware, the work of [373] demonstrated that 23 algorithmic graph features, including betweenness centrality, closeness, degree centrality, density, and number of edges and nodes, can be used in adversarial analysis and result in a 100% misclassification rate. Their approach targeted IoT malware, but Android malware is also an ongoing field of study [374–376]. With all the shortcomings that come with the graph-isomorphism problem, newer advances in this field remove the need for graphs altogether and convert the entire graph into feature vectors [373, 377]. Once features are vectorized, this opens the door for other machine learning models to act as discriminators for the classification step.
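
Converting a graph into a feature vector can be illustrated with a few of the simpler features named above. The sketch covers only node count, edge count, density, and maximum degree centrality; it omits betweenness and closeness and is not the 23-feature set of [373]. The toy CFG and its node names are hypothetical.

```python
def cfg_features(cfg):
    """Map a CFG given as {node: [successors]} to [nodes, edges, density, max degree centrality]."""
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    n = len(nodes)
    edges = {(s, d) for s, successors in cfg.items() for d in successors}
    e = len(edges)
    # Density of a directed graph without self-loops: edges over n * (n - 1)
    density = e / (n * (n - 1)) if n > 1 else 0.0
    degree = {v: 0 for v in nodes}
    for s, d in edges:
        degree[s] += 1  # outgoing edge
        degree[d] += 1  # incoming edge
    max_centrality = max(degree.values()) / (n - 1) if n > 1 else 0.0
    return [n, e, density, max_centrality]

# Hypothetical four-block control flow graph
toy_cfg = {
    "entry": ["check"],
    "check": ["spawn", "exit"],
    "spawn": ["exit"],
    "exit": [],
}
features = cfg_features(toy_cfg)
print(features)
```

A fixed-length vector of this kind can be passed to any standard classifier, which sidesteps subgraph matching at the cost of discarding most of the graph's structure.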

5.4. Natural Language Processing Approaches

The use of natural language processing (NLP) approaches applied to API call sequences was a natural extension to developing models that can predict malicious behavior. Malicious behavior is not simply a product of individual API usage or frequency of APIs; rather, it is a consideration of the pattern in API usage over time. Similar to how word usage and context can indicate whether an email is spam, the context of APIs called in succession can reveal something about malicious intent. This has the effect of being able to attend to different behaviors simultaneously and allows the model to learn on its own what malicious behaviors exist.

Many popular vectorization techniques used in NLP applications have also been adopted for malware research. Two of these techniques were demonstrated in the work of [378], which used a bag-of-words (BoW) model and term frequency-inverse document frequency (tf-idf). The background specifics of these techniques will be discussed in the next section. Their work created fixed-length vectors from behavioral reports produced in virtual machines and automated the feature extraction step. Finally, an ensemble of ML techniques, such as random forest, k-nearest neighbors (k-NN), support vector machine (SVM), and XGBoost, was used, with majority voting summarizing the end predictions over the models. An application that did involve APIs was carried out in [1], which looked at both dynamic and static behaviors and hand-crafted groups of signatures based on operation. The authors created 11 different types of malicious operations, spanning from registry operations to device I/O to kernel operations. APIs were converted to semantic blocks which looked at the largest common subsequences between dynamic and static behavior. Following the sequencing, tf-idf was used to vectorize the contribution of each API, with a focus on rarely used APIs that drive malicious behavior. In [160], tf-idf was used to convert the sequence of unique event names into a representation for a machine learning model to learn, which included both 1-dimensional convolutional neural network (CNN) and long short-term memory (LSTM) architectures. A similar line of work was followed in [379], where an LSTM was used to model the sequential API usage of 20 thousand malware samples run on a Windows 7 machine using the Cuckoo sandbox. The authors considered only 342 API calls, limiting their investigation to those that were used at least 10 times among all samples in the training set.
When coupled with tf-idf, this has the effect of focusing more on rarely used APIs, and by setting the minimum threshold to 10, there are enough training examples for the model to learn the importance of those features. In more recent work [380], graph neural networks (GNN) were used to identify dynamic malware execution in a sandbox using the techniques developed in [315] and used in [381]. Windows APIs were vectorized with n-grams and tf-idf, with malware execution being performed in sandbox snapshots with different benignware executions to simulate different potential host environments. The use of GNNs allowed the model to learn patterns in API usage by combining learned patterns from neighboring nodes that represent different hierarchies in process execution. This has the effect of learning not only the API usage of a single process but also that of all its child and parent processes, thereby magnifying the discriminatory power of the model in identifying malicious behavior.
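
The emphasis that tf-idf places on rarely used APIs can be seen in a small sketch. It uses the plain unsmoothed definition (term frequency times the log of the inverse document frequency); library implementations often apply smoothing or normalization, so the exact weights would differ. The traces are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(traces):
    """Return (vocabulary, one tf-idf vector per trace) using unsmoothed idf."""
    vocab = sorted({api for trace in traces for api in trace})
    n_docs = len(traces)
    # Document frequency: number of traces in which each API appears at least once
    doc_freq = Counter(api for trace in traces for api in set(trace))
    idf = {api: math.log(n_docs / doc_freq[api]) for api in vocab}
    vectors = []
    for trace in traces:
        counts = Counter(trace)
        length = len(trace) or 1  # guard against empty traces
        vectors.append([(counts[api] / length) * idf[api] for api in vocab])
    return vocab, vectors

# Hypothetical API traces from three samples
traces = [
    ["CreateFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "ReadFile", "CloseHandle"],
    ["CreateFile", "WriteFile", "CreateRemoteThread", "CloseHandle"],
]
vocab, vectors = tfidf_vectors(traces)
for api, weight in zip(vocab, vectors[2]):
    print(api, round(weight, 4))
```

APIs that appear in every trace, such as CreateFile here, receive a weight of zero, while an API seen in a single trace receives the largest weight in that trace.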

In addition to the form of vectorization, modern NLP models allow the model itself to learn the importance of each word (or API) relative to the context of the surrounding words. For this purpose, word embeddings were developed which can learn the semantic relationship between words and map that relationship to vector space [382]. This has the effect of giving closely related words a high cosine similarity. A modest application by [383] used 300-dimensional word embeddings followed by a similarity matrix to cluster malware and benignware using k-means. This way, the cluster index was a dense representation of malware and benignware. A more end-to-end approach was used in [381], whereby API stack traces were modeled as an NLP problem. Embedding dimensions of size 50 to 200 were used to map the API stack trace, which included APIs that communicated all the way to the kernel. With the use of a transformer architecture, which learns latent representations of the sequences, F1 scores as high as 96.2% were obtained when considering registry APIs. The authors in [384] looked at developing a semantic transition matrix to segregate API calls which have similar contexts into clusters. This was conducted by capturing the relationship between API calls that represent malware and benignware using Word2Vec [382], a word embedding technique which has more powerful encoding ability than vanilla word embedding approaches. More powerful encoders translate to a better ability to learn context, which was evident in their FP rate of only 1%. A similar use of Word2Vec was followed by an LSTM in [385] to analyze opcodes and API function names. In total, 1369 API function names and opcodes were used, of which 958 were API calls.

Several works have made use of the Windows PE malware API sequence dataset [379], a dataset of API call sequences extracted from 7017 malicious binaries from 8 malware classes: Adware, Backdoors, Downloaders, Droppers, Spyware, Trojans, Viruses, and Worms. For this dataset, [386] achieved poor results, with a 0.38 F1 score when using a 32-dimensional embedding to represent the API sequences followed by a 2-layer LSTM. Their approach used 342 API calls and discarded those that were used fewer than 10 times. Similarly poor results were obtained in [387], which reported F1 scores ranging from 0.33 to 0.72 for the 8 malware types based on a similar LSTM approach. The work of [388] went one step further and compared an LSTM approach to a transformer and finally to bidirectional encoder representations from transformers (BERT). BERT relies on learning latent representations from the context both before and after each position in a sequence, meaning that it does a better job of encoding the context of the API sequence. The authors of [388] also used the Windows PE malware dataset and found similar issues classifying the 8 classes, with a weighted F1 score of 0.51 on their best-performing BERT model. One approach that did find success using BERT was that of [389], which implemented fastText [390], a text vectorization technique based on n-grams. After removing redundant API calls, such as NtDelayExecution, accuracies as high as 96.76% were obtained using BERT.

6. Conclusions

This paper provides a systematic review of commonly used obfuscation techniques used by malware variants and mutation engine kits. This survey of the literature touched upon several key indicators of obfuscation employed by malware, which serves to better understand the nature of the reverse-engineering process. Our work makes four core contributions.

We noted the scope of malware and obfuscation worldwide and presented some of the key red flags noted by antivirus (AV) vendors and researchers. The numbers suggest an aggressive increase in the number of threats and the monetary cost associated with breaches, system intrusions, and downtime. In addition, we discussed some of the string-scanning techniques that are still very much in use by AV vendors to this day.

We provided an examination of the popular obfuscation techniques used to translate the opcode sequences of malware into semantically equivalent but different instructions. These techniques have been integrated into popular mutation engines for over a decade now and render much of the reverse-engineering and legacy signature-based techniques obsolete if used effectively. This presents a fundamental problem for researchers and practitioners, but it has led to the field of dynamic analysis, which examines the run-time behavior of malicious executables. We also touched upon the structure of metamorphic mutation engines, along with encryption and compression, two very important behaviors that serve as key indicators of maliciousness for a given binary.

We provided a review of popularized malware datasets that are commonly used in malware research. These datasets span applications in mobile malware, intrusion detection, networking, and binaries. We also touched upon some antiemulation and antiarmoring tactics in use by malware to protect from examination under virtualized environments.

Finally, we introduced some common approaches to feature analysis and discussed the various ways Windows APIs are categorized and vectorized to identify malicious binaries, especially in the context of identifying obfuscated malware variants.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research has been financially supported by Mitacs Accelerate (IT15018) in partnership with Canadian Tire Corporation and is supported by the University of Manitoba.