Annoyance Filter: Adaptive Junk Mail Filter

Adaptive Bayesian Junk Mail Filter

Business propaganda must be obtrusive and blatant. It is its aim to attract the attention of slow people, to rouse latent wishes, to entice men to substitute innovation for inert clinging to traditional routine. In order to succeed, advertising must be adjusted to the mentality of the people courted. It must suit their tastes and speak their idiom. Advertising is shrill, noisy, coarse, puffing, because the public does not react to dignified allusions. It is the bad taste of the public that forces the advertisers to display bad taste in their publicity campaigns.

— Ludwig von Mises, Human Action

Based upon this wise observation, made more than half a century ago, we shall now undertake to put an end to the aggravation of being incessantly affronted, assaulted, and insulted by electronic mail we never requested and have no desire to receive.

I'm not going to take it any more! Junk mailers, you're finished; behold the apparatus of your eradication. You have scuttled from ISP to ISP as access has been denied; you've emigrated from respectable countries toward jurisdictions where the rule of law is more accommodating to your ilk. You have found ways to disguise your messages to fly below the radar of ever more elaborate attempts to block your abuse of our electronic commons.

Check, and mate. Right here, right now, we commence the end-game. The game's not over, but it's afoot. You're finished—better start getting used to it and contemplate other, preferably less-disreputable ways to make a living. You can run, you can hide, you can use every trick to disguise who you are or whence you're spewing your sewage onto the net, but one thing never changes—because it cannot—and that's the nature of what you're sending: advertising! As von Mises said so eloquently, it is inherently “shrill, noisy, coarse, puffing”; it uses a different vocabulary and syntax from other communications, as it must. Were it not so, it would not work, and so wouldn't be worth doing.

The Annoyance Filter is a program which exploits the indelible signature of advertising to identify it before it ever reaches the eyes of the reader, with a very low likelihood of junk mail being confused with legitimate messages. It accomplishes this by scanning collections of an individual user's mail, the more the better, which have been manually sorted into piles of legitimate mail and junk. From these archives, Annoyance Filter computes statistics for the words which appear in the two collections of messages, determining for each the probability that its appearing in a message is indicative of junk mail. This is the training phase, and results in a dictionary of word probabilities.

With this dictionary in hand, Annoyance Filter may now be used to classify incoming messages as they arrive. Incoming messages are parsed precisely like those used to train the program, and based on the words with the greatest probability of appearing predominantly in junk or non-junk, a probability for the message as a whole is computed. This probability is then tested against a threshold which, if exceeded, indicates with a high degree of confidence the mail is junk. Each message is marked with its classification, and what happens from there on is up to you. Unix users of Procmail may easily direct mail Annoyance Filter deems junk to a suitable destination. But there's nothing Unix- or Procmail-specific about Annoyance Filter. It can be built on any platform with a standard C++ compiler and integrated into any mail system which permits an external program to filter incoming mail. The details, of course, may be complicated, messy, and tedious, but the concept is straightforward.
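The classification step can be sketched in the same spirit. The code below is an assumption-laden illustration, not the program's actual implementation: `junkProbability` and `isJunk` are hypothetical names, and the choice of the 15 most significant words, the value 0.4 for words not in the dictionary, and the 0.9 threshold are the figures I recall from Graham's essay rather than anything stated here. It also assumes dictionary probabilities lie strictly between 0 and 1.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Combine the probabilities of the most significant words in a message
// into a single junk probability (naive Bayesian combination).
double junkProbability(const std::string &message,
                       const std::map<std::string, double> &dictionary,
                       std::size_t mostSignificant = 15,
                       double unknownWord = 0.4) {
    // Each distinct word counts once, however often it appears.
    std::istringstream in(message);
    std::set<std::string> words;
    std::string w;
    while (in >> w) words.insert(w);

    std::vector<double> probs;
    for (const auto &word : words) {
        auto it = dictionary.find(word);
        probs.push_back(it == dictionary.end() ? unknownWord : it->second);
    }

    // The most significant words are those furthest from the neutral 0.5.
    std::sort(probs.begin(), probs.end(), [](double a, double b) {
        return std::fabs(a - 0.5) > std::fabs(b - 0.5);
    });
    if (probs.size() > mostSignificant) probs.resize(mostSignificant);

    double junk = 1.0, legit = 1.0;
    for (double p : probs) {
        junk *= p;
        legit *= 1.0 - p;
    }
    return junk / (junk + legit);  // 0.5 when the message has no words
}

// Test the message probability against a threshold (assumed value).
bool isJunk(const std::string &message,
            const std::map<std::string, double> &dictionary,
            double threshold = 0.9) {
    return junkProbability(message, dictionary) > threshold;
}
```

Restricting the calculation to the words furthest from 0.5 means a long message full of neutral text cannot dilute a handful of strongly indicative words.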

A brief history of Annoyance Filter

In a real sense, this program has been twenty-five years in the making. The seed was planted in the 1970's while thinking about Jim Warren's concept of “datacasting”. He envisioned using subcarriers of FM stations (or perhaps data encoded in the vertical retrace interval of television signals) to transmit digital information freely accessible to all. Not Xanadu or the Internet, mind you…this remained a one-to-many broadcast medium, but one capable of providing information in a form which the then-emerging personal computers could receive, digest, and present in a customised fashion to their users.

“But who pays?” Well, that detail, which played a large part in the inflation and demise of the recent .com bubble, was central to the feasibility of datacasting as well. Jim Warren's view was that the primarily advertiser-supported business model adopted by most U.S. print and broadcast media would be equally applicable to bits flung into the ether from a radio antenna. As I recall, he cited the experience of suburban weekly newspapers, which discovered their profits increased when they moved from a paid subscription/per-copy readership to free distribution—circulation went up, advertising rates rose apace, and the bottom line changed from red to green.

Intriguing…but still I had my doubts. When you read a newspaper or magazine, you can't avoid the advertising—you can flip past it, to be sure, but you still have to look at it, at least momentarily, so there's always the possibility a sufficiently clever image or tag line may motivate you to read the rest. I asked Jim why, once a document was in an entirely digital form, folks couldn't develop filters to remove the advertising before it ever reached their eyes. This would destroy the free distribution model and render an advertising-supported digital broadcasting service unworkable. Jim wasn't too concerned about this. In his estimation, discriminating advertising from editorial content would require artificial intelligence which did not exist and wasn't remotely on the horizon.

That's when von Mises' words on advertising came back to me. Advertising is advertising—perforce, it speaks with a different vocabulary than the sports page, letters to the editor, police blotter, national and international news, and commentary (aside, perhaps, from Maureen Dowd's columns in the New York Times). Given a sufficiently large collection of known editorial copy and advertising, might it not be possible to extract a signature, in the sense of radar signatures to discriminate warheads from decoys in ballistic missile defence, with which a sufficiently clever program could identify advertising and remove it, with a high level of confidence, before the reader ever saw it?

Fast forward—or, more precisely, pause…. By the late 1970's I'd concluded the best strategy to make the most of the ambient malaise was to amass a huge pile of money. Money may not buy happiness, but at the very least it would mitigate many of the irritations of that bleak, collectivist era. Being a nerd, I immediately turned to technology for a quick fix, and what should I espy but an exploding market in affordable home video cassette recorders—VCRs—which were, in those days, becoming a fixture in more and more households. Many VCRs were purchased to play rented movies, but, being also able to automatically record programs off-the-air on a preset schedule, they could be used for “time-shifting”—recording broadcast programs for later viewing. But why, thought I, sit through all those tedious commercials you've recorded along with the programs you intend to watch? Certainly, people quickly learned to “zip”—use the fast forward to skip past commercials—but what if you could detect commercials and “zap” them—never record them in the first place? It occurred to me that inventing a device which accomplished this might be lucrative indeed.

The concept couldn't have been simpler—a little box which monitors the video and audio of the channel you're recording and, based on real-time analysis of the signal, pauses and resumes recording of the program on your VCR, yielding a tape free of advertising. It was easy to imagine such a gizmo succeeding like the contemporary “Demon Dialer” telephone speed dialer add-on, selling in the tens of millions in a matter of months. Well of course it occurred to me that widespread adoption of such a device would motivate advertisers to disguise the tags that discriminated commercials from programs. (But hey—by the time that happened I'd have already cashed the customers' checks and blown the joint. There was a bit of the Ferengi in me then. Truth be told, there still is.) Imagine the dismay of advertisers and my own contented avarice as I watched the money bin fill deep enough for high diving. No more laps round the worry room for me!

I must confess to some inside information in this regard. While working for a regrettable employer in an odious swamp, I'd twigged to the fact that network television advertisers tagged their commercials with a signature in the vertical retrace interval to permit audit bureaux to measure how many network affiliates actually broadcast each commercial. This tag appeared to me the Achilles' heel of television advertising. As long as one could distinguish tagged commercials from an un-tagged program, it would be more or less straightforward to detect when a commercial was being transmitted and pause the VCR until the program resumed.

If only…. In reality, only nationally broadcast commercials bore the tag, and only some of them. Local commercials were never tagged. This created a difficult marketing dilemma for my grand scheme. While it might have been possible to block some of the most ubiquitous and irritating commercials on mass-market network series, the bottom feeders who watch those shows probably enjoyed the commercials and wouldn't be prospects for my gadget, while those like myself, infuriated by incessant commercials interrupting late night movies, would find the device ineffective since local commercials on independent stations were never tagged. Real-time analysis of video or even audio in the 1970's and early 80's was technologically out of the question for a product aimed at a mass consumer market. So, I put the idea of an annoyance filter for television aside and occupied myself with other endeavours.

We now arrive at the late 1980's. I'd spent the last decade or so filling up the money bin more or less flat out, and having reached a level I judged more than adequate, I began to turn my attention to matters I'd neglected during those laser-focused years.

Writing science fiction, for one thing. There was something about the advertising filter which had dug its way into my brain so deeply that nothing could dislodge it. The year is 1989; the Berlin Wall is about to tumble; and I'm scribbling a story about two programmers spending the downtime between Christmas and New Year's Day (the period when I'd accomplished about half of my own productive work over the previous half decade) prowling the nascent Internet for evidence of an extraterrestrial message already received, but not recognised as such. In “We'll Return, After this Message”, it is an Annoyance Filter which recognises an extraterrestrial message for what it is, advertising, and as von Mises observed, distinguishable by its own strident clamouring for attention.

A decade later, in the very years in which I set my science fiction story, I launched my own search for a message from our Creator hidden in the most obvious of locations—no results so far. Yet still I scour the Net.

Which brings us, more or less, to the present. The idea of an annoyance filter continued to intermittently occupy my thoughts, especially as the volume of junk arriving in my mailbox incessantly mounted despite ongoing efforts to filter it with increasingly voluminous and clever Procmail rules. Then, in August 2002, my friend and colleague Kern Sibbald brought to my attention Paul Graham's brilliant design for an adaptable, Bayesian filter to discriminate junk and legitimate mail by word frequencies measured in actual samples of mail pre-sorted into those categories. Now that sounded promising! Here was a design which was simple in concept, theoretically sound, and best of all, it seemed to work. Graham implemented his prototype filter in the “Arc” Lisp dialect used in his research. I decided to build a deployable tool in industrial-strength C++, founded on his design, and handling all the details required so the filter could, as much as possible, interpret mail the same way a human would—decoding, translating, and extracting wherever necessary to defeat the techniques junk mailers adopt to hide their content from naïve filtering utilities.

This is not a simple task. Consider—you can probably sort out a message you're interested in reading from unsolicited junk in a fraction of a second, but that assumes it's presented to you after all of the mail transfer and content encodings have been peeled away to reveal the true colours of the content. Long gone are the days when E-mail was predominantly ASCII text. Today, it's more than likely to be HTML (if not a Flash animation or some other horror), often transmitted in Quoted-Printable or Base64 encodings largely in the interest of “stealth”—to hide the content from filters not equipped with the decoding facilities of a full-fledged mail client.
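To give a feel for what peeling away these encodings involves, here is a minimal C++ sketch of decoders for the two transfer encodings just mentioned. It is an illustration only, not code from Annoyance Filter: the function names `decodeQuotedPrintable` and `decodeBase64` are hypothetical, and the handling of malformed input (passing a stray `=` through, skipping characters outside the Base64 alphabet) is a simplifying assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Decode Quoted-Printable text: "=XX" is a byte given as two hex digits,
// and "=" immediately before a line break is a soft break to be removed.
std::string decodeQuotedPrintable(const std::string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') {
            out += in[i];
            continue;
        }
        if (i + 1 < in.size() && in[i + 1] == '\n') {  // soft break, LF
            i += 1;
            continue;
        }
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') {
            i += 2;                                    // soft break, CRLF
            continue;
        }
        if (i + 2 < in.size() &&
            std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            out += static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16));
            i += 2;
            continue;
        }
        out += '=';  // malformed escape: pass it through unchanged
    }
    return out;
}

// Decode Base64 text, skipping line breaks and any other character
// outside the alphabet, and stopping at the first '=' padding character.
std::string decodeBase64(const std::string &in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned buffer = 0;  // accumulates 6 bits per input character
    int bits = 0;         // number of not-yet-emitted bits in buffer
    for (char c : in) {
        if (c == '=') break;
        const std::size_t v = alphabet.find(c);
        if (v == std::string::npos) continue;
        buffer = (buffer << 6) | static_cast<unsigned>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return out;
}
```

A filter that looked only at the raw message would see `RlJFRSBPRkZFUg==` where a reader's mail client displays the words "FREE OFFER", which is why decoding has to happen before any words are counted.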

The Annoyance Filter is based on Graham's crystalline vision of Bayesian scoring of messages by empirically determined word probabilities. It includes the tedious but essential machinery required to parse MIME multi-part mail attachments, decode non-plain-text parts, and interpret character sets in languages the user isn't accustomed to reading. This makes for great snowdrifts of software, but fortunately few details about which the typical user need fret.
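The first step in that machinery, splitting a multi-part body into its parts, might look something like the following C++ sketch. This is a deliberately simplified assumption of how such a splitter could work, not the program's own parser: `splitMultipart` is a hypothetical name, the boundary string is assumed to have been extracted from the Content-Type header already, and nested multi-part messages and trailing whitespace on delimiter lines are not handled.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a MIME multi-part body into its parts. Each part begins after a
// line reading "--boundary"; the line "--boundary--" ends the last part.
// Text before the first delimiter (preamble) and after the closing
// delimiter (epilogue) is discarded.
std::vector<std::string> splitMultipart(const std::string &body,
                                        const std::string &boundary) {
    const std::string delimiter = "--" + boundary;
    const std::string closing = delimiter + "--";
    std::vector<std::string> parts;
    std::istringstream in(body);
    std::string line, current;
    bool inPart = false;
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings by dropping the trailing CR.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line == closing) {
            if (inPart) parts.push_back(current);
            inPart = false;
            break;
        }
        if (line == delimiter) {
            if (inPart) parts.push_back(current);
            current.clear();
            inPart = true;
            continue;
        }
        if (inPart) {
            current += line;
            current += '\n';
        }
    }
    // Keep a final part even if the closing delimiter is missing.
    if (inPart) parts.push_back(current);
    return parts;
}
```

Each returned part still carries its own headers (such as Content-Type and Content-Transfer-Encoding), which determine how the remainder of that part must be decoded before its words can be examined.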

Preliminary tests indicate Annoyance Filter is inordinately effective in discriminating legitimate from junk mail. But this entire endeavour remains very much an active area of research and, consequently, Annoyance Filter has been implemented as a toolkit intended to facilitate experiments with various filtering strategies and measuring the characteristics which best identify mail worth reading. You're more than welcome to build and install the program using the cookbook instructions but, if you're inclined to delve deeper, feel free to jump in—the programming's fine! Everyone is invited to contribute their own wisdom and creativity toward bringing to an end this intellectual pollution. Remember, when nobody ever sees junk mail, nobody will bother to send it. Let us commence rowing toward that happy landfall.

A log of the detailed history of the development of Annoyance Filter and its ongoing evolution appears near the end of the program listing.

Audience

The Annoyance Filter is potentially addressed to anybody beset by unwanted mail. But realistically, because this is a new program, under active development, which requires some effort and experience to install, train, and integrate with existing mail processing tools, at the moment only “bleeding edge” early adopters are encouraged to experiment with it. As I noted above, there's nothing platform-specific about the program, but it has been developed and tested on Linux and Solaris systems with gcc/g++ (version 2.96) and naturally will be easiest to install on like systems. Reports of experiences installing on other systems are welcome, especially if they include suggestions and/or corrections to remedy portability problems.

Developers with experience on other platforms, in particular Microsoft Windows and the Macintosh, are invited to help integrate Annoyance Filter with their mail facilities. Given that Annoyance Filter is standard C++ (and the ubiquity of gcc in any case), porting the program isn't the big problem—it's the integration of filtering with the mail processing system, plus whatever is needed to transform the mail system's archives into the Unix mail folder format the program uses while training.

Great, but where's the documentation?

The Annoyance Filter is written using the Literate Programming methodology. A literate program is as much an essay addressed to other people as input to a compiler. When done well, a literate program should be as rewarding to read as to run. Literate programs are their own documentation; we who write them consider explanation and implementation inseparable. You don't “read the flippin' manual”, you “read the bloody program”—it's the ultimate authority—so why read anything else?

Complete user and internal documentation, including an annotated listing of the program source code, is automatically generated in PDF format whenever the program is modified. You can download this documentation from the links below or read it on-line if your browser is equipped with a PDF plug-in. (When you download the source code, the PDF documentation is included; there's no need to download it separately.) Free software fundamentalists appalled at the thought of installing Acrobat Reader (which is free, but not Free) can always use Xpdf or Ghostscript.

Traditionalists loath to consult a PDF file, or those who prefer a more telegraphic form of description, may consult the manual page, which is supplied in both troff -man and HTML formats. Quick start information may be found in the README and INSTALL files included in the source archive.

Download

	annoyance-filter C++ source code: annoyance-filter-1.0d.tar.gz (the source code archive includes all the following items)
	Ready-to-run 32-bit Windows executable (Zipped archive): annoyance-filter.zip
	Read annoyance-filter source code. [PDF]
	Read annoyance-filter manual page.
	Read statistical library source code. [PDF]

Prior releases may be downloaded from the archives.

Developers may obtain the current source code from the Annoyance Filter Project on SourceForge.

Distribution

This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty.

Author

John Walker
http://www.fourmilab.ch/

August 5th, 2004