the cups blog

07-20-11

VizSec 2011: Malware Images

Lakshmanan Nataraj, Karthikeyan Shanmugavadivel, Gregoire Jacob and B.S. Manjunath, “Malware Images: Visualization and Automatic Classification”

The authors visualize the bytes of malware files to produce small images. These images give you a high-level sense of a file and what its different components are.
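To make the byte-to-image idea concrete, here is a minimal sketch of how a file's bytes could be laid out as a grayscale image, one byte per pixel. The function names, the fixed row width, and the zero padding of the last row are assumptions for illustration; the talk notes don't say how the authors chose image dimensions.

```python
from typing import List


def bytes_to_rows(data: bytes, width: int = 16) -> List[List[int]]:
    """Reshape raw bytes into rows of grayscale pixel values (0-255).

    `width` is a hypothetical parameter: the number of bytes per image row.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        # Pad a short final row with zeros so every row has the same length.
        rows.append(list(chunk) + [0] * (width - len(chunk)))
    return rows


def to_pgm(rows: List[List[int]]) -> str:
    """Serialize the rows as a plain-text PGM (P2) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P2\n{width} {height}\n255\n"
    body = "\n".join(" ".join(str(v) for v in row) for row in rows)
    return header + body + "\n"


# Ten stand-in bytes instead of a real binary.
sample = bytes(range(10))
image = bytes_to_rows(sample, width=4)
pgm_text = to_pgm(image)
```

Writing `pgm_text` to a `.pgm` file gives something most image viewers can open. For a real file you would read the bytes with `open(path, "rb").read()` and pick a much larger width.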

If you look across variants of the same malware family, they look visually similar to each other while looking dissimilar to malware from other families.

Once malware is converted to an image representation, you can use that to characterize the malware. The authors used texture features that are normally used to identify different landscapes or other images. They used k-nearest neighbors for classification, with a Euclidean distance measurement to determine how similar images are.
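The classification step can be sketched roughly as follows. This is not the authors' code: it assumes each image has already been reduced to a numeric feature vector, and the two-dimensional vectors, family labels, and `k=3` below are made-up stand-ins for real texture features.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training, k=3):
    """Return the majority label among the k training items nearest to query.

    `training` is a list of (feature_vector, family_label) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors for two invented families.
training = [
    ([0.10, 0.20], "family_a"),
    ([0.15, 0.25], "family_a"),
    ([0.12, 0.18], "family_a"),
    ([0.90, 0.80], "family_b"),
    ([0.85, 0.95], "family_b"),
    ([0.88, 0.82], "family_b"),
]

predicted = knn_classify([0.14, 0.22], training, k=3)
```

Here the query vector sits close to the `family_a` samples, so all three nearest neighbors vote for that label. Sorting the whole training set is fine for a sketch; a larger dataset would want an index structure for nearest-neighbor search.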

They took 2000 malware samples comprising eight malware families, converted them to images, and extracted image-texture-based features. The authors were able to get around 98% classification accuracy.

What about packing?

Images after packing look completely different from the unpacked executable.

Common wisdom says that everything packed by the same packer should look the same and not like the original. The authors tried packing each malware sample with each of three packers. Even after packing, the authors were able to identify family groups with high accuracy.

They used 25k malware samples from the Anubis and VxHeavens datasets, labeled them using Microsoft Security Essentials, and kept the top 100 families. They still got high accuracy.

They also tried 64k malware samples with 531 families and still got high accuracy.

The biggest advantage of image-based malware analysis is speed. It only takes about 50ms. It also doesn’t require execution or disassembly.

One limitation of this work is that it is data driven, so it doesn’t cope well with zero-day attacks. Also, the characterization groups malware based on images, not on actual functionality.

Questions

Q1: How do forensic malware analysts see using this work? What about the low accuracy points?

A1: Low accuracy could be countered by using more AV labeling.

Q2: What you are doing is visualizing signatures. Will this work on polymorphic malware? Is this different enough from existing software, since it is only classifying known malware, not separating it from good code?

A2: We did try adding in a bunch of non-executable default Windows files as an extra “family” and were able to tell the difference between this “family” and the others.