Spam Filtering Methods and Statistics



The Computer Science Department mail servers receive hundreds of thousands of messages per day. The reality of the modern internet, unfortunately, is that the vast majority of these messages are scams, unsolicited bulk commercial e-mail, mail-based denial of service attacks, or outright gibberish, all otherwise known as spam.

Systems Staff employ a wide array of methods on our central mail servers in order to protect our users from spam, including:

  • Scanning for known viruses, blacklisted scammer URLs, and known spam indicators
  • Opt-in Greylisting
  • Watermarking outbound mail, to distinguish legitimate mail from "backscatter"
  • Requiring correct behavior of remote relays

These methods are explained in detail below, with instructions on how users can take advantage of these features, where applicable. As always, if you have questions or need assistance setting something up, contact staff@cs.umd.edu.

Over the past 14 days, we've received 530,749 messages from off-campus, averaging 37,910 per day. We blocked 414,669 of those messages (78%).


Virus and Spam Scanning

Traditional virus and spam testing is currently performed by MailScanner, ClamAV, and SpamAssassin. Every mail message accepted by Department servers is run through these software packages, which inspect the content for known viruses, phishing scams, blacklisted URLs, and other indicators associated with spam. Additionally, we hand-maintain local SpamAssassin rules to supplement scoring on the newest variations from spammers.

Each mail message is assigned a score based on the occurrence of spam indicators. If the score is high enough, the mail is marked as spam with the X-Spam-Status: Yes header, so that users may filter or inspect the mail as they see fit. Viruses and extremely high-scoring spam are quarantined, meaning they are not delivered, but kept in storage for one month in the unlikely case that they are legitimate and needed.
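
The scoring described above can be sketched roughly as follows. This is only an illustration: the indicator names, their weights, and both thresholds are assumptions made for the example, not the values used on Department servers.

```python
# Hypothetical indicators and weights, for illustration only.
RULE_SCORES = {
    "blacklisted_url": 3.5,
    "forged_sender": 2.0,
    "all_caps_subject": 1.0,
}

SPAM_THRESHOLD = 5.0         # assumed score at which mail is marked as spam
QUARANTINE_THRESHOLD = 15.0  # assumed score at which mail is quarantined


def classify(matched_rules, has_virus=False):
    """Return (score, action) for a message given the indicators it matched."""
    # Unknown indicator names contribute nothing to the score.
    score = sum(RULE_SCORES.get(rule, 0.0) for rule in matched_rules)
    if has_virus or score >= QUARANTINE_THRESHOLD:
        return score, "quarantine"
    if score >= SPAM_THRESHOLD:
        return score, "deliver with X-Spam-Status: Yes"
    return score, "deliver"
```

For instance, a message matching both "blacklisted_url" and "forged_sender" would score 5.5 under these made-up weights and be delivered with the spam header set.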

Any user with a Unix account can take advantage of the X-Spam-Status header by setting up procmail filters. Users who need assistance setting up filtering with this or other mail setups should contact staff@cs.umd.edu. Additionally, if you would like to use custom scoring with SpamAssassin, you can also run your mail through your own instance of the program.
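
For users who would rather script their filtering than write procmail recipes, the header test itself is simple. The sketch below uses Python's standard email module instead of procmail; the function name is hypothetical, and it assumes the header value begins with "Yes" when the message has been marked.

```python
from email import message_from_string


def is_marked_spam(raw_message: str) -> bool:
    """Return True if the message carries an X-Spam-Status header starting with Yes."""
    message = message_from_string(raw_message)
    status = message.get("X-Spam-Status", "")
    # Assumption: a marked message has a value such as "Yes" or "Yes, score=...".
    return status.strip().lower().startswith("yes")
```

A delivery script could use the result to choose between the inbox and a separate spam folder.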

Greylisting

Greylisting is a facility that takes advantage of the poor behavior of most spam bots. Spammers are generally in a great hurry to send out as much mail to as many people as possible. A server that implements greylisting temporarily refuses the first delivery attempt from an unfamiliar remote server and asks it to try again in a few minutes. Most spam engines never retry, and thus are unable to deliver their spam.

Legitimate mail servers that fully implement the protocols involved in e-mail will wait and have their delivery accepted. Once a mail server has proved itself able to behave, it is added to an exceptions list, permitting it to deliver in the future without delays. Additionally, we permit mail from major mail providers such as Gmail or AOL without any delays, since these are relays that are known to behave correctly.
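
The greylisting decision can be sketched as below. This is a simplified model, not the Department's implementation: the class name, the five-minute delay, and keying on the relay alone are assumptions for illustration. Production greylisting software often keys on more than the relay address.

```python
import time


class Greylist:
    """Defer unfamiliar relays until they retry after a minimum delay."""

    def __init__(self, delay_seconds=300, known_good=(), clock=time.time):
        self.delay_seconds = delay_seconds  # assumed value for "a few minutes"
        self.exceptions = set(known_good)   # relays allowed through without delay
        self.first_seen = {}                # relay -> time of first attempt
        self.clock = clock

    def check(self, relay):
        """Return "accept" or "defer" for a delivery attempt from relay."""
        if relay in self.exceptions:
            return "accept"
        now = self.clock()
        first = self.first_seen.setdefault(relay, now)
        if now - first >= self.delay_seconds:
            # The relay waited and retried, so remember it as well behaved.
            self.exceptions.add(relay)
            del self.first_seen[relay]
            return "accept"
        # A real server would answer here with a temporary SMTP failure.
        return "defer"
```

The known_good argument plays the role of the pre-approved major providers, and the clock parameter exists only so the behavior can be exercised without waiting.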

While we strongly recommend the service based on its tremendous success rate at blocking spam, we recognize that it may not be for everyone, since there is the risk of an initial delay in receiving mail from legitimate senders. Contact staff@cs.umd.edu if you'd like to try it out.

Watermarking

Watermarking is a tactic for eliminating "backscatter" messages. Backscatter attacks (sometimes called "joe-jobs") occur when a spammer shotguns a large amount of spam, while forging the sender's address as that of some innocent third party. The victim sees the result as thousands of bounces for messages they never sent.

To combat these attacks, our mail servers insert a cryptographically signed header into every message originating inside the department. Any bounce message missing this header is presumed to be backscatter, and is quarantined.
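
One way such a watermark could work is sketched below. The description above does not say which signing scheme is used, so this example assumes a keyed HMAC over the Message-ID; the header name, the key, and the function names are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical key known only to the mail servers
HEADER_NAME = "X-Dept-Watermark"            # hypothetical header name


def make_watermark(message_id: str, key: bytes = SECRET_KEY) -> str:
    """Compute the watermark value to attach to an outbound message."""
    return hmac.new(key, message_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_backscatter(bounced_headers: dict, key: bytes = SECRET_KEY) -> bool:
    """Return True if the headers quoted in a bounce lack a valid watermark."""
    message_id = bounced_headers.get("Message-ID", "")
    watermark = bounced_headers.get(HEADER_NAME, "")
    expected = make_watermark(message_id, key)
    # compare_digest avoids leaking how much of the value matched.
    return not hmac.compare_digest(watermark.encode("utf-8"), expected.encode("utf-8"))
```

Because a spammer forging the sender address does not know the key, bounces for forged messages cannot carry a value that verifies.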

This feature is automatically enabled for all department users.

Remote Relay DNS and Timing Checks

The great bulk of spam does not originate from big mail servers and major e-mail providers, but in fact from compromised home and office desktop computers all over the world, often called "bots" or "zombies."

One common indicator that a connecting mail server is part of a botnet is that its DNS name is mismatched or missing, since home ISPs often do not assign static, matching names to all of their users' computers. Department mail servers require that all computers delivering mail have correct forward and reverse DNS lookups: the connecting address must map to a name, and that name must map back to the same address. This measure by itself blocks around 70% of all spam.
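
A forward and reverse DNS consistency check can be sketched with Python's standard socket module. This is an illustrative approximation rather than the check the mail servers actually run; the function name is hypothetical, and the lookup functions used here handle IPv4 only.

```python
import socket


def has_matching_dns(ip_address,
                     reverse_lookup=socket.gethostbyaddr,
                     forward_lookup=socket.gethostbyname_ex):
    """Return True if ip_address resolves to a name that resolves back to it."""
    try:
        # Reverse lookup: address -> hostname.
        hostname, _aliases, _addresses = reverse_lookup(ip_address)
        # Forward lookup: hostname -> list of addresses.
        _name, _aliases, forward_addresses = forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (missing or broken records).
        return False
    return ip_address in forward_addresses
```

The lookup functions are parameters only so the logic can be tried against canned answers without touching the network.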

Occasionally a legitimate but poorly configured remote site will be unable to send mail to us due to this restriction. The sender should get an error such as "No address associated with hostname" or "cannot find your hostname". If this happens, contact staff@cs.umd.edu and request that we add the remote site to our exceptions list; the site will then be allowed through.

Spam engines will often attempt to deliver many messages in a great hurry, so in addition to DNS restrictions, our servers limit how quickly a remote site may open connections. This does not block any mail, but does require spam engines to slow down in their deliveries.
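
A connection pacing rule of this kind might look like the sketch below. The class name and the minimum interval are assumptions for illustration; the actual limits enforced by the servers are not stated here.

```python
class ConnectionThrottle:
    """Space out connections from each remote host without refusing any."""

    def __init__(self, min_interval_seconds=1.0):
        self.min_interval = min_interval_seconds  # assumed spacing between connections
        self.next_allowed = {}                    # host -> earliest time for its next connection

    def delay_for(self, host, now):
        """Return how many seconds this connection from host should be held."""
        earliest = self.next_allowed.get(host, now)
        start = max(now, earliest)
        # Reserve the following slot so a burst queues up instead of being dropped.
        self.next_allowed[host] = start + self.min_interval
        return start - now
```

A host that connects at a reasonable pace always sees a delay of zero, while a host that connects in rapid bursts is made to wait progressively longer.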

These features are automatically enabled for all department users.


Tuesday, 02-Aug-2011 12:26:23 EDT -building@cs