Due to the exploding number of unique malware binaries on the Internet and the slow process required for manually analyzing these binaries, security practitioners today have only limited visibility into the functionality implemented by the global population of malware. To date, little work has focused explicitly on quickly and automatically detecting the broad range of high-level malware functionality, such as the ability of malware to take screenshots, communicate via IRC, or surreptitiously operate users’ webcams.
To address this gap, we debut CrowdSource, an open source, machine learning-based reverse engineering tool. CrowdSource approaches the problem of malware capability identification in a novel way: by training a malware capability detection engine on millions of technical documents from the web. Our intuition for this approach is that malware reverse engineers already rely heavily on the web “crowd” (performing web searches to discover the purpose of obscure function calls and byte strings, for example), so automated approaches that use the tools of machine learning should also take advantage of this rich and as yet untapped data source.
As a novel malware capability detection approach, CrowdSource does the following:
- Generates a list of detected software capabilities for novel malware samples (such as the ability of malware to communicate via a particular protocol, perform a given data exfiltration activity, or load a device driver);
- Provides traceable output for capability detections by including “citations” to the web technical documents that detections are based on;
- Provides probabilistic malware capability detections when appropriate: e.g., system output may read, “given the following web documents as evidence, it is 80% likely the sample uses IRC as a C2 channel, and 70% likely that it also encrypts this traffic.”
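To make the three outputs above concrete, here is a minimal sketch in Python. It is not CrowdSource's actual algorithm, which this abstract does not specify. The tiny corpus, its URLs, the capability labels, the `detect` function, and the `per_match` weight are all hypothetical, and the noisy-OR rule used to combine evidence is an assumption chosen for simplicity.

```python
import re
from collections import defaultdict

# Hypothetical labeled corpus of (url, capability, text). The URLs and texts
# are invented for illustration; a real system would learn from web documents.
DOCS = [
    ("https://example.com/irc-bot-tutorial", "irc_c2",
     "connect socket send PRIVMSG JOIN NICK to IRC server"),
    ("https://example.com/winsock-irc", "irc_c2",
     "use WSAStartup connect then send NICK and USER"),
    ("https://example.com/gdi-screenshot", "screenshot",
     "BitBlt CreateCompatibleBitmap GetDC capture the screen"),
]


def tokens(text):
    """Return the set of identifier-like tokens of three or more characters."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", text))


def detect(sample_strings, docs=DOCS, per_match=0.3):
    """Score capabilities for a sample given strings extracted from it.

    Assumption: each token shared with a document is independent evidence
    with weight `per_match`, and evidence is combined with a noisy-OR.
    """
    sample = set()
    for s in sample_strings:
        sample |= tokens(s)

    evidence = defaultdict(list)
    for url, capability, text in docs:
        shared = sample & tokens(text)
        if shared:
            strength = 1 - (1 - per_match) ** len(shared)
            evidence[capability].append((url, sorted(shared), strength))

    results = {}
    for capability, items in evidence.items():
        miss = 1.0
        for _, _, strength in items:
            miss *= 1 - strength
        results[capability] = {
            "probability": 1 - miss,
            # "Citations": the documents and tokens behind the detection.
            "citations": [(url, shared) for url, shared, _ in items],
        }
    return results


if __name__ == "__main__":
    found = detect(["PRIVMSG #chan :%s", "NICK %s", "BitBlt"])
    for capability, info in sorted(found.items()):
        print(f"{capability}: {info['probability']:.0%}")
        for url, shared in info["citations"]:
            print(f"  cited {url} (matched {', '.join(shared)})")
```

Each detection carries both a probability and the documents that support it, so an analyst can check the evidence rather than trust a bare label. A realistic system would also need to discount tokens that are common across many documents, which this sketch does not attempt.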
CrowdSource is funded under the DARPA Cyber Fast Track initiative, is being developed by the machine learning and malware analysis group at Invincea Labs, and is scheduled for a beta, open source release to the security community this October. We’ll be presenting CrowdSource in more detail at BlackHat shortly, including results demonstrating that CrowdSource can already rapidly reverse engineer a variety of currently active malware variants.
Josh Saxe is a lead research engineer at Invincea Labs, where he serves as technical lead on the DARPA Cyber Genome program, seeking to produce automated systems that discover, analyze and visualize evolutionary relationships between malicious software artifacts. Josh also serves as technical lead on a DARPA Cyber Fast Track effort dubbed “CrowdSource,” on which he leads the development of algorithms for rapidly and automatically characterizing novel malware binaries’ functionality using crowdsourced, machine learning-based methods.