In the ongoing attempt to keep our computers safe from malicious programs, or malware, there is always an “arms race” going on between the Bad Guys who launch the attacks, and the system administrators and users who try to defend against them. Most users are familiar with anti-virus programs, which work primarily by comparing suspect material against a “signature” database, which lists characteristics of known malware. (For example, a given virus might have a particular 16-byte sequence starting 32 bytes into the file.) This approach can work well, but its obvious limitation is that the anti-virus vendor has to have seen samples of the malware in order to derive the signatures, meaning that it is not useful against totally new threats.
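To make the idea concrete, here is a minimal sketch of signature-based scanning in Python. The signature bytes and the offset are made up for illustration; real anti-virus engines use large databases and much more sophisticated matching.

```python
# Toy signature scanner: the pattern and offset below are hypothetical,
# standing in for one entry in a real signature database.
SIGNATURE = bytes.fromhex("deadbeef" * 4)  # a 16-byte pattern
OFFSET = 32                                # expected position in the file

def matches_signature(data: bytes) -> bool:
    """Return True if the 16-byte signature appears 32 bytes into the data."""
    return data[OFFSET:OFFSET + len(SIGNATURE)] == SIGNATURE

clean = bytes(64)                               # all zeros, no signature
infected = bytes(32) + SIGNATURE + bytes(16)    # signature at offset 32

print(matches_signature(clean))     # False
print(matches_signature(infected))  # True
```

The weakness the article describes follows directly: any file whose bytes at that offset differ, including a brand-new virus, sails past the check.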
To get around this problem, the defenders employ various heuristics, which it is hoped will work even against a new malicious program. Some of these are behavior-based: a program that tries to carry out certain actions (for example, modifying the processing of keyboard interrupts) is regarded as ipso facto suspicious. Another approach is to assume that the malware, because it must by definition contain executable code, will have certain characteristics that differentiate it from, say, plain text. Some malware authors disguise the “dirty work” part of their product (the payload) by encoding it, but even so there must be at least a simple routine, left in the clear, that does the decoding and start-up. So some detection schemes focus on looking for that decoding program.
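The encode/decode idea can be sketched in a few lines of Python. This is only an illustration of the principle (real shellcode encoders work on machine code, and the key and payload bytes here are invented): the payload can be scrambled arbitrarily, but the decoding routine itself must remain unscrambled so it can run first, and that is what the heuristic scanners hunt for.

```python
# Toy XOR encoder/decoder. KEY and the payload bytes are hypothetical
# stand-ins; real malware encodes actual machine code this way.
KEY = 0x5A

def encode(payload: bytes) -> bytes:
    """Scramble the payload so signature scanners don't recognize it."""
    return bytes(b ^ KEY for b in payload)

def decode(blob: bytes) -> bytes:
    """The decoding loop: this part must ship in the clear, which is
    exactly what decoder-hunting detection schemes look for."""
    return bytes(b ^ KEY for b in blob)

payload = b"\x90\x90\xcc"          # stand-in bytes for the "dirty work"
scrambled = encode(payload)
assert scrambled != payload        # the payload is disguised...
assert decode(scrambled) == payload  # ...but recoverable at run time
```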
Unfortunately, a new paper [PDF] presented at last week’s ACM Conference on Computer and Communications Security suggests that this approach is not as robust as one might wish. The researchers — Joshua Mason and Sam Small from Johns Hopkins University, Fabian Monrose from the University of North Carolina, and Greg MacManus of iSight Partners — have developed a technique for encoding malicious software so that both the payload and the initial decoding program are disguised as pseudo-English text. The resulting text would not necessarily seem sensible to a human reader, but it would have the same superficial statistical properties as English text, making automatic detection very difficult. For example, the following text was generated to encode a routine that simply calls the system exit(0) function; some of the text is just padding, skipped over by immediately preceding “jump” instructions:
There is a major center of economic activity, such as Star Trek, including the Ed Sullivan Show. The former Soviet Union. International organization participation Asian Development Bank, established in the United States Drug Enforcement Administration, and the Palestinian Territories …
This is clearly going to come across as a bit odd to a human reader, but will be very hard for a text analysis program to distinguish from, say, a newspaper article. (The overall “flavor” of the generated text depends on the body of legitimate text used to derive the encoding.)
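One simple way to see why such text fools statistical filters is to compare letter-frequency profiles. The sketch below (my own illustration, not the detection method any particular product uses) scores candidate text against an English reference by cosine similarity of letter frequencies: the generated pseudo-English lands close to real English, while raw binary-looking gibberish does not.

```python
from collections import Counter
from math import sqrt

def letter_freqs(text: str) -> dict:
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [c.lower() for c in text if c.isalpha()]
    n = len(letters) or 1
    return {c: k / n for c, k in Counter(letters).items()}

def cosine(p: dict, q: dict) -> float:
    """Cosine similarity between two frequency distributions."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / (norm or 1.0)

english = letter_freqs("There is a major center of economic activity")
pseudo = letter_freqs("The former Soviet Union established in the United States")
gibberish = letter_freqs("qzxjvkqpwzzxx")

# Pseudo-English scores far closer to the English profile than gibberish does.
print(cosine(english, pseudo) > cosine(english, gibberish))  # True
```

A real filter would use n-grams and a much larger reference corpus, but the principle is the same: the encoded shellcode inherits the statistics of whatever legitimate text was used to derive the encoding.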
These results are another reminder that there is no silver bullet for detecting malicious code mixed in with arbitrary data streams, and that diligence in preventing code injection attacks remains of vital importance.
Update Friday, November 27, 14:05
The New Scientist now has an article about this research.