Posted: Wed, September 19, 2007
The rise of PDF spam
by Nick Johnston
Spammers are known to be highly creative and versatile in their attempts to bypass spam filters. For years, image spam has been very popular, with spammers using a variety of different techniques to
randomise their images, making detection more difficult. As both MessageLabs and the wider anti-spam community have improved their image processing techniques, spammers are increasingly switching
to a new format: PDF.
The beginning of PDF spam
PDF (Portable Document Format) is a popular document format invented by Adobe Systems, and is widely used for document exchange in the business world. As such, it is a "trusted" format, and many
naïve anti-spam solutions automatically whitelist all messages containing a PDF file. Such is the importance and general acceptance of PDF in the business world that practically all computers in a
corporate environment will have a PDF viewer installed. This makes PDF an excellent "vector" for spam messages.
MessageLabs first saw large-scale PDF spam in the middle of June 2007. This "spam run" or "campaign" was a "pump and dump" scam promoting a German stock. Many new types of spam start
primitively, and PDF spam was no exception. This first spam run included exactly the same document in each message, making it easy to stop the messages using hashes or "fingerprints".
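As a rough illustration of why identical attachments are easy to stop, the sketch below fingerprints an attachment with a cryptographic hash and looks it up in a set of known spam digests. The choice of SHA-256 and the names `fingerprint`, `known_spam` and `is_known_spam` are assumptions made for this example, not a description of MessageLabs' systems.

```python
import hashlib

def fingerprint(attachment: bytes) -> str:
    """Return the hex SHA-256 digest of the raw attachment bytes."""
    return hashlib.sha256(attachment).hexdigest()

# Hypothetical blocklist of digests taken from already-identified spam PDFs.
known_spam = set()

def is_known_spam(attachment: bytes) -> bool:
    """True if this exact byte sequence has been seen in spam before."""
    return fingerprint(attachment) in known_spam

# Placeholder bytes standing in for the document sent in every message.
first_seen = b"%PDF-1.4\n...identical spam document...\n%%EOF\n"
known_spam.add(fingerprint(first_seen))

print(is_known_spam(first_seen))            # same bytes: matched
print(is_known_spam(first_seen + b"\x00"))  # one byte changed: no match
```

A single changed byte produces a completely different digest, which is why this approach only works while every message carries exactly the same file.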
PDF spam evolves quickly
Soon after seeing the first major PDF spam run, MessageLabs began seeing more. But this time, each message had a different PDF file attached. Spammers have long had the ability to randomise
images, and have now updated their botnet software to simply insert these random images into PDF documents. This technique means that each PDF that a spammer sends out will be different, and will
be more difficult to stop.
Randomised PDFs
The images below (taken from randomised PDFs) illustrate this concept well. Each image includes exactly the same text, but the shape of the image is different, as are the colours:



In contrast to legitimate business PDF files, PDF files from this randomised spam run do not use standard paper sizes such as A4 or Letter. For example, one document might be 74.4 × 96 mm,
and another might be 168.3 × 54.7 mm.
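The page-size observation can be turned into a simple signal. The sketch below is one possible approach, assuming the page dictionary is stored uncompressed so that its `/MediaBox` entry can be found with a regular expression, and assuming the default PDF unit of 1/72 inch. The function names and the 1 mm tolerance are illustrative choices.

```python
import re

MM_PER_POINT = 25.4 / 72  # default PDF user-space unit is 1/72 inch

# Portrait dimensions in millimetres; landscape is handled by swapping.
STANDARD_SIZES_MM = {"A4": (210.0, 297.0), "Letter": (215.9, 279.4)}

NUM = rb"(-?\d+(?:\.\d*)?)"
MEDIABOX = re.compile(rb"/MediaBox\s*\[\s*" + rb"\s+".join([NUM] * 4) + rb"\s*\]")

def page_sizes_mm(pdf_bytes: bytes) -> list:
    """Return (width, height) in mm for each /MediaBox found in the raw bytes."""
    sizes = []
    for match in MEDIABOX.finditer(pdf_bytes):
        x0, y0, x1, y1 = (float(v.decode("ascii")) for v in match.groups())
        sizes.append((abs(x1 - x0) * MM_PER_POINT, abs(y1 - y0) * MM_PER_POINT))
    return sizes

def is_standard_size(width_mm: float, height_mm: float, tolerance_mm: float = 1.0) -> bool:
    """True if the page is close to A4 or Letter in either orientation."""
    for w, h in STANDARD_SIZES_MM.values():
        for sw, sh in ((w, h), (h, w)):
            if abs(width_mm - sw) <= tolerance_mm and abs(height_mm - sh) <= tolerance_mm:
                return True
    return False

# A page of roughly 74.4 x 96 mm, like the example in the text.
sample = b"<< /Type /Page /MediaBox [0 0 210.9 272.1] >>"
for width, height in page_sizes_mm(sample):
    print(round(width, 1), round(height, 1), is_standard_size(width, height))
```

An unusual page size is not proof of spam on its own, so a check like this would more plausibly feed a score than trigger a block.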
Corrupted files to avoid detection
Many images in image spam were deliberately corrupted - in other words, the images were constructed without complying with the appropriate specification or standard. By corrupting files, spammers make
it more difficult for the analysis tools used by the anti-spam companies to open and analyse the images. Some computer programs would fail to process such images, and indeed these images could cause
some programs to become much slower, use more resources or crash. However, spammers rely on the fact that other applications (like many common email clients) are more forgiving and display the
images without problems.
MessageLabs has seen similar tactics employed with PDF spam, detecting many corrupted PDF documents. It's unclear if this corruption is accidental or deliberate, but as with corrupt images, strict
processing programs tend to fail on these PDF files, and so analysis and identification become more difficult. The messages can still be viewed by the recipients, though, because Adobe's Acrobat Reader
displays the PDF correctly (by rebuilding part of the PDF document's internal structure). Some older versions of Acrobat Reader briefly display a dialog box telling the user that the file is damaged and is
being repaired, but this requires no interaction from the user.
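The sketch below shows the kind of strict structural check that a forgiving viewer skips. It tests only three things: the `%PDF-` header, the `%%EOF` marker, and whether the last `startxref` offset points at a cross-reference table or stream. The function name and the choice of checks are assumptions for illustration; a real detector would need to look at far more.

```python
import re

def structural_problems(data: bytes) -> list:
    """Return a list of basic structural faults found in a PDF byte string."""
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("no %PDF- header at start of file")
    if b"%%EOF" not in data[-1024:]:
        problems.append("no %%EOF marker near end of file")
    offsets = re.findall(rb"startxref\s+(\d+)", data)
    if not offsets:
        problems.append("no startxref offset")
    else:
        # The last startxref in the file is the one a reader should follow.
        offset = int(offsets[-1])
        target = data[offset:offset + 32]
        points_at_xref = (target.startswith(b"xref")
                          or re.match(rb"\d+\s+\d+\s+obj", target) is not None)
        if not points_at_xref:
            problems.append("startxref offset does not point at a cross-reference section")
    return problems

# A tiny hand-built file whose startxref offset is correct.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
good = (body + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
        + b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n")

print(structural_problems(good))
print(structural_problems(body + b"startxref\n3\n"))
```

A file that fails checks like these but still opens in a viewer is exactly the mismatch described above, and that mismatch can itself be treated as suspicious.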
Variable length PDF files
A more recent tactic seen by MessageLabs is the use of variable length PDF documents. Until recently, most PDF documents sent in spam were simple, single page documents. In contrast, with variable
length PDF spam, the first half or so of the first page includes the spam message, and the rest of the page and a random number of subsequent pages contain text "poison".
This poison is designed to foil statistical anti-spam techniques such as Bayesian filtering. We have seen spam PDF documents containing up to 14 pages of poison. The poison can be random words,
programmatically-generated "nonsense text" or legitimate text "scraped" off popular web sites. Some examples of this text include:
But in light of their back-stabbing, Artificial Intelligence-inspired offenses and their sinister,
temptation-ridden environment this response is degenerate.
Ships from and sold by Amazon.
I also had my tripod and took several amazing long exposure shots of the interior.
It is likely that spammers think longer PDF documents are more likely to be considered legitimate business documents like reports, manuals and so on.
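To see why poison text can hurt a statistical filter, consider the toy naive Bayes scorer below. The per-token spam probabilities are invented for the example, and the scorer assumes equal prior odds and independent tokens; it is a sketch of the general idea rather than any particular product's filter.

```python
import math

# Invented per-token spam probabilities, as a trained filter might hold them.
TOKEN_SPAM_PROB = {
    "stock": 0.95, "buy": 0.90, "target": 0.85,         # typical of stock spam
    "tripod": 0.10, "exposure": 0.15, "interior": 0.10,  # typical of ordinary mail
}

def spam_probability(tokens, default=0.5):
    """Combine token probabilities by summing their log-odds."""
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token, default)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))

message = ["stock", "buy", "target"]
poisoned = message + ["tripod", "exposure", "interior"]

print(spam_probability(message))   # close to 1
print(spam_probability(poisoned))  # noticeably lower
```

Each innocent-looking word pulls the combined score back towards "legitimate", so a few pages of scraped text can outweigh a short spam message.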
PDF spam diversity
Most press coverage around PDF spam has solely concentrated on "pump-and-dump" stock spam. MessageLabs has seen PDF documents used in other types of spam, such as pharmaceutical spam
and online casino spam. Recent examples include:

PDF spam construction
Spammers are using a wide variety of tools to produce their PDF documents. Many tools include their name as the document "producer" or "creator" in the PDF file itself. Some spammers are using
common office applications such as Microsoft Word and OpenOffice:
/Producer(GNU Ghostscript 7.07)
/Creator(OpenOffice.org 1.1.4)
/Title(Microsoft Word - sancashtemplate.doc)
/Creator(PScript5.dll Version 5.2.2)
Some spammers have also used tools like PowerPDF, text2pdf and so on to produce their PDF documents. More recently, spammers have written their own tools to produce PDF documents. This
gives them maximum flexibility, and lets them specify random "producer" names and titles which are difficult for anti-spam software to detect, for example:
Title: One of the most interesting things about the
present development of the automobile is the trend to
give cars a retro look.
Producer: For pure and simple ugly no one has been able
to beat them
Title: , has a new promotion that puts its money where
its mouth is.
Producer: The flights will be convenient for travellers
coming from the U
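The metadata fields shown above can be pulled out of a PDF with a short script. The sketch below assumes the information dictionary is uncompressed and uses simple literal strings in parentheses, as in the `/Producer(...)` examples; it does not handle nested parentheses, hex strings or encrypted files. The list of known tool prefixes is a hypothetical example built from the names quoted in this article.

```python
import re

# Matches /Producer(...), /Creator(...) and /Title(...) with simple literal strings.
INFO_FIELD = re.compile(rb"/(Producer|Creator|Title)\s*\(((?:\\.|[^\\()])*)\)")

# Hypothetical list of prefixes for tools that commonly appear in these fields.
KNOWN_TOOL_PREFIXES = ("GNU Ghostscript", "OpenOffice.org", "PScript5.dll")

def info_fields(pdf_bytes: bytes) -> list:
    """Return (field name, value) pairs found in the raw PDF bytes."""
    return [(m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
            for m in INFO_FIELD.finditer(pdf_bytes)]

def unfamiliar_tools(pdf_bytes: bytes) -> list:
    """Return Producer/Creator values that match no known tool prefix."""
    return [value for name, value in info_fields(pdf_bytes)
            if name in ("Producer", "Creator")
            and not value.startswith(KNOWN_TOOL_PREFIXES)]

office = b"/Producer(GNU Ghostscript 7.07)\n/Creator(OpenOffice.org 1.1.4)\n"
custom = b"/Producer(For pure and simple ugly no one has been able to beat them)\n"

print(unfamiliar_tools(office))
print(unfamiliar_tools(custom))
```

Because spammers control these fields completely, an unfamiliar producer name is only a weak hint, and a familiar one proves nothing.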
Although many people are familiar with PDF documents, there are also some related formats which are comparatively unknown. Recently MessageLabs has seen spam claiming to have FDF (Forms
Data Format) attachments, which also open with Adobe's Acrobat Reader. The attachments are actually PDF files merely labelled with a '.fdf' extension. This is likely to be another attempt by the
spammers to bypass anti-spam software that only looks at the file extension ('.fdf' in this case), rather than doing reliable checking of the actual file.
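Checking the actual file rather than its name can be as simple as reading the first few bytes. PDF files begin with `%PDF-` and genuine FDF files begin with `%FDF-`, so the sketch below compares that header with the extension the attachment claims. The function names are illustrative, and only these two formats are covered.

```python
import os

# Leading bytes of each format mapped to the extension it should carry.
MAGIC = {b"%PDF-": ".pdf", b"%FDF-": ".fdf"}

def sniff_type(data: bytes):
    """Return the extension implied by the file header, or None if unrecognised."""
    for magic, extension in MAGIC.items():
        if data.startswith(magic):
            return extension
    return None

def extension_mismatch(filename: str, data: bytes) -> bool:
    """True if the content is a recognised type that differs from the claimed extension."""
    claimed = os.path.splitext(filename)[1].lower()
    actual = sniff_type(data)
    return actual is not None and actual != claimed

print(extension_mismatch("statement.fdf", b"%PDF-1.4\n..."))  # PDF labelled as FDF
print(extension_mismatch("report.pdf", b"%PDF-1.4\n..."))     # label matches content
```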
Meet the threat
PDF spam is an increasing problem and now accounts for around 20% of spam. The damage that spam can cause any business should never be underestimated. Efficiency, productivity and profitability
can all take a serious hit if electronic junk email gains access to inboxes, with valuable time and effort eaten up in identifying and deleting unwanted messages. MessageLabs stops PDF spam using
several different broad techniques:
- Skeptic® heuristics updated around the clock to ensure the highest protection possible from PDF spam
- Automatic fingerprint-based blocking of known spam PDF files
- Honeypot monitoring systems for identifying new PDF spam runs
- Tools to detect corrupted PDF files
- Generic approaches such as IP blacklisting
About the Author
Nick Johnston is with Anti-Spam Development at MessageLabs. The company offers a managed anti-spam service, allowing customers to benefit from seamless, continual system improvement.
This is combined with operations and development teams working 24 hours a day, 7 days a week, 365 days a year, ensuring that MessageLabs customers are always protected against PDF spam and other emerging
spam threats.
For more information visit www.messagelabs.com.
Send a comment about this article to editor@itwales.com.