On transferring knowledge about malware internals
This post explores how knowledge about malware internals can be transferred, in particular knowledge of a malware sample's functionality down to the bytes of its functions.
Cybersecurity companies publish technical reports about malware families (or new versions of them). These reports dive into the disassembled or decompiled code, explain which techniques the malware sample uses, describe the structures it relies on, and provide artifacts for detection purposes, such as YARA rules and IOCs. Once a report is published, malware analysts can use this information to identify malware families during analysis, which speeds the analysis up, and, depending on the goal, an analyst can automate configuration extraction or string decryption, or write a C2 communication emulator, among other things. At the same time, SOC members may use the TTPs in the report to form hypotheses for internal threat hunting, reducing mean time to detect, and the IOCs (such as hashes and IPs) and YARA rules for detection in EDR, IDS/IPS, etc., and for external threat hunting (proactively identifying new C2 infrastructure). However, there are two issues: (1) malware analysts cannot directly benefit from the analyzed code, as they first have to identify the malware family of a given sample, read through the report, and then reverse engineer the code anyway, using the information from the report to recover, at the very least, the names of the functions inside the given binary; (2) SOC members do not use the analyzed code (procedures) beyond the conclusion about what the code is responsible for (tactics and techniques). An ideal solution would solve both of these issues.
As for the first issue, it could be possible to use FLIRT signatures or, considering the limitations of FLIRT, WARP. However, companies do not provide FLIRT signatures in their reports, as these are meant to be used in binary analysis tools only and would therefore be useless for most report consumers, not to mention WARP's limited support in other tools and the fact that FLIRT is primarily meant for identifying library functions. The second existing solution is the Lumina server, but it is too specific (an MD5 hash of the function) and proprietary. For the second issue, the option (which would also be a third existing solution for the first issue) is to use YARA or CAPA rules (the latter offer more possibilities because a disassembler is involved) to detect key functions, meaning functions that exhibit specific TTPs tied to a particular malware family, or combinations of them, in other binaries. This, however, would mean creating rules that are too specific: a rule's logical expression can only evaluate to true or false, whereas a graded result is needed when the code changes slightly.
Considering the disadvantages of the tools mentioned, the preferable solution would need to: (1) have an API, so that it can be integrated with other, existing security tools; (2) generate signatures that are not overly specific, so that the solution can produce a similarity score for two functions (one from a given sample and one from a known malware sample), like BinDiff or Diaphora, but without access to the known malware sample itself, and can compare the functions inside a binary against a list of functions belonging to known malware samples, in other words compare 1 function to N functions using some sort of signatures; (3) include a disassembler to identify functions inside the binary; (4) attach attributes to a function's signature that carry the malware family, the function's name and description, an MBC or MITRE ATT&CK identifier, etc.
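To make the "1 function to N functions" requirement more concrete, here is a minimal sketch of what a non-specific signature with attached attributes could look like. It uses a plain MinHash over byte 4-grams, chosen here only because it is easy to show with the standard library; it is not the signature format of any tool named in this post. All names (KnownFunction, best_matches, the family and function names) are hypothetical, and the ATT&CK identifiers are only illustrative attribute values.

```python
import hashlib
from dataclasses import dataclass

NUM_HASHES = 64   # signature length; an arbitrary choice for this sketch
SHINGLE_SIZE = 4  # bytes per n-gram; also an arbitrary choice


def shingles(data: bytes, size: int = SHINGLE_SIZE) -> set[bytes]:
    """Return the set of overlapping byte n-grams of a function body."""
    if len(data) < size:
        return {data}
    return {data[i:i + size] for i in range(len(data) - size + 1)}


def minhash(data: bytes, num_hashes: int = NUM_HASHES) -> tuple[int, ...]:
    """Build a MinHash signature: for each salted hash, keep the minimum value."""
    grams = shingles(data)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams))
    return tuple(signature)


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


@dataclass(frozen=True)
class KnownFunction:
    """A shared signature plus attributes; the original bytes are not needed."""
    signature: tuple[int, ...]
    family: str
    name: str
    attack_id: str


def best_matches(candidate: bytes, known: list[KnownFunction],
                 threshold: float = 0.5) -> list[tuple[float, KnownFunction]]:
    """Compare one function against N known signatures, best score first."""
    sig = minhash(candidate)
    scored = [(similarity(sig, k.signature), k) for k in known]
    hits = [s for s in scored if s[0] >= threshold]
    return sorted(hits, key=lambda s: s[0], reverse=True)


if __name__ == "__main__":
    original = bytes(range(200))  # stand-in for a function body
    known = [KnownFunction(minhash(original), "ExampleFamily",
                           "decrypt_strings", "T1140")]
    patched = bytearray(original)
    patched[100] ^= 0xFF          # simulate a small code change
    for score, fn in best_matches(bytes(patched), known):
        print(f"{score:.2f} {fn.family} {fn.name} {fn.attack_id}")
```

Unlike a YARA condition, the result is a score rather than a boolean, so a slightly modified function still ranks close to its known counterpart, and the consumer only needs the signature and its attributes, not the sample.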
A tool that meets the needs outlined above is Binlex, which is, as far as I know, the first open source tool that makes it possible to search for similar functions and basic blocks inside a Milvus database using a vector generated by a graph neural network and the k-nearest neighbors algorithm, while the actual data about the function, including its bytes, is stored in MinIO object storage. On top of the vector comparison, Binlex can generate and compare the MinHash, the TLSH (although it is quite limited) and the sizes of two functions in order to reduce false positives. Companies could provide at least vectors and MinHashes for key functions of a malware family through MISP or OpenCTI, so that (1) malware analysts could compare all the functions from a malware sample and identify the functions that have already been analyzed (with the possibility of transferring even more data in the supplementary attributes); (2) SOC members could leverage this information to identify code reuse in binaries (which can be new malware samples) coming from EDR or other security solutions.
It would take a lot of computing resources to compare every single binary in the organization's network (as each binary has to be disassembled, and vectors and other data have to be generated), which means that there must be criteria for when a binary's functions should be checked against known malware functions. Those criteria could be an alert in EDR or a suspicious binary discovered during incident response. In any case, compared with simple hashes, a decrease in coverage is inevitable, although the probability of detecting a new malware sample (assuming it did not trigger a YARA rule, if there is one) is higher.
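The idea of a nearest-neighbor search over vectors followed by MinHash and size checks can be sketched as below. This is only an illustration of the principle with a brute-force search over an in-memory list; it does not use Binlex or Milvus, and it does not claim to reproduce how Binlex combines these signals. The record layout, the function names and the thresholds are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical shared record: an embedding, a MinHash and a size."""
    name: str
    vector: tuple[float, ...]
    minhash: tuple[int, ...]
    size: int


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def minhash_similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of MinHash slots that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def size_ratio(a: int, b: int) -> float:
    """Smaller size divided by larger size, in the range 0..1."""
    return min(a, b) / max(a, b) if max(a, b) else 0.0


def shortlist(query: FunctionRecord, known: list[FunctionRecord], k: int = 3,
              min_minhash: float = 0.5, min_size_ratio: float = 0.7
              ) -> list[tuple[float, FunctionRecord]]:
    """Take the k nearest vectors, then drop candidates whose MinHash or
    size disagrees too much with the query (assumed thresholds)."""
    ranked = sorted(((cosine(query.vector, r.vector), r) for r in known),
                    key=lambda pair: pair[0], reverse=True)[:k]
    return [(score, r) for score, r in ranked
            if minhash_similarity(query.minhash, r.minhash) >= min_minhash
            and size_ratio(query.size, r.size) >= min_size_ratio]
```

A candidate whose vector is close to the query but whose MinHash shares nothing with it is dropped by the second stage, which is the false-positive reduction described above. In a real deployment the first stage would be a query against the vector database rather than a sort over a Python list.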
On a side note, regarding the detection of new malware samples by SOC analysts with existing solutions that do not focus on malware functions: MISP has supported fuzzy hashing, including SSDEEP, since 2018. It can be used in conjunction with YARA rules if the security tool permits the use of SSDEEP.