Under the hood: Spam Wars

In the ongoing war against the spammers, we have put a lot of effort over the last year or two into measuring the effectiveness of various methods, and we thought it might be helpful to give a behind-the-scenes look at some of the lists and methodologies we use and how well each performs. What these stats don't show is the number of false positives or the number of spams we miss: with our diverse user base, it is impossible to measure those accurately.

We try to be quite aggressive at detecting spam because the majority of our users make use of what we call auto-whitelisting: anyone they send an email to is automatically added to their whitelist and isn't checked for spam in the future (well, not at stage 2 anyway – see below).
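
To illustrate the idea (this is just a simplified sketch, not our actual code – the names and data structures are made up), auto-whitelisting amounts to something like this:

    # Simplified sketch of auto-whitelisting; names and data structures are
    # illustrative only, not our production code.
    whitelists = {}  # maps a user's address to the set of senders they trust

    def on_outbound(user, recipient):
        """Whenever a user sends an email, remember the recipient as trusted."""
        whitelists.setdefault(user, set()).add(recipient.lower())

    def needs_spam_scan(user, sender):
        """Inbound mail from a whitelisted sender skips the stage-2 scan."""
        return sender.lower() not in whitelists.get(user, set())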

The first stage of our spam blocking is the most aggressive, and also the most sensitive. We reject connections outright based on the IP address that is trying to connect to us, so if we have false positives here we tend to find out about them.
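
Mechanically, these IP lists are DNS-based blacklists: you reverse the octets of the connecting address and look the result up under the list's zone. A rough sketch of the lookup (using Spamhaus Zen as the example zone – this is an illustration, not our production code):

    # Minimal DNSBL lookup: an A record under the list's zone means the IP is
    # listed; NXDOMAIN means it isn't. Illustrative only.
    import socket

    def is_listed(ip, zone="zen.spamhaus.org"):
        """Return True if `ip` (dotted-quad IPv4) appears on the given DNSBL."""
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)  # e.g. 1.2.3.4 -> 4.3.2.1.zen.spamhaus.org
            return True                  # a 127.0.0.x answer means "listed"
        except socket.gaierror:
            return False                 # NXDOMAIN: not listed

    # At connection time the mail server would then do something like:
    # if is_listed(client_ip): reject the SMTP connection with a 5xx response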

I’ve included links below so you can investigate and find out more about any particular list.

Firstly, let's look at connections to our servers. Taking a sample day, Wednesday December 14th 2016, we received a total of 6,680,134 inbound SMTP connections. Here's what we did with them.

Rejected (Spamhaus Zen): 5,403,727
Rejected (Invaluement ivmSIP): 236,211
Accepted: 1,040,196

Of those 1,040,196 accepted connections, we received 1,008,901 individual emails. These broke down as follows:

Whitelisted: 138,055
Blacklisted: 3,513
Too large to scan: 6,960
Not scanned (user hasn't enabled anti-spam): 131,480
Scanned: 728,893

So we now have a grand total of 728,893 emails to feed into our anti-spam servers. These run a piece of software called SpamAssassin, which looks for patterns that suggest an email is probably spam and scores it accordingly. Unfortunately, the spammers have access to SpamAssassin too, and the good ones are very clever at making their spam not look like spam to a computer (though it is still obviously spam to a human), so we rely quite heavily on various blacklists to identify spam for us.
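
The scoring model itself is simple enough; in spirit it works something like this (a toy sketch – the rules, scores and threshold are invented, and the real SpamAssassin rule set is vastly larger):

    # Toy sketch of rule-based scoring in the SpamAssassin style. The rules,
    # scores and threshold below are invented for illustration only.
    import re

    RULES = [
        (re.compile(r"100% free", re.I), 2.1),
        (re.compile(r"click here", re.I), 1.5),
        (re.compile(r"^Subject:\s*$", re.M), 1.8),  # empty subject header
    ]
    THRESHOLD = 5.0  # SpamAssassin's default required_score is also 5.0

    def score(message_text):
        """Sum the scores of every rule that matches the raw message."""
        return sum(points for pattern, points in RULES if pattern.search(message_text))

    def is_spam(message_text):
        return score(message_text) >= THRESHOLD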

In the last couple of years, the spammers have become even more sophisticated and found ways to send out millions of spams before the blacklists are able to list them. The blacklists are fighting back, however, with new lists such as InstantRBL and faster listing times (URIBL is particularly good at this).

Taking our sample day, with 728,893 emails to be scanned, here is how many were caught by each method/list employed. These stats show unique hits only: if something is caught by two lists, or by one list plus other SpamAssassin rules, it isn't credited to any individual entry below (which is why the rows don't add up to the total caught). There's a small sketch of this accounting after the table.

Spamhaus Zen: 2,062
URIBL: 8,137
Invaluement ivmSIP24: 2,029
Invaluement ivmURI: 10,194
Barracuda: 9,109
InstantRBL: 8,442
Protected Sky: 5,468
Other SpamAssassin rules: 32,007
Total caught: 113,439
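
To make the "unique hits" accounting concrete, here is roughly how such a tally could be produced (a sketch, not our actual reporting code):

    # "Unique hit" accounting: a list is only credited with an email when it
    # was the sole method that caught it; emails caught by two or more methods
    # still count towards the total. Illustrative only.
    from collections import Counter

    def tally(caught_emails):
        """`caught_emails` is an iterable of sets of method names that fired."""
        per_method = Counter()
        total_caught = 0
        for methods in caught_emails:
            if not methods:
                continue                  # nothing fired: not counted as spam
            total_caught += 1
            if len(methods) == 1:         # unique hit: credit the one method
                per_method[next(iter(methods))] += 1
        return per_method, total_caught

    # Example: three spams, the middle one hit by two lists, so it only
    # appears in the total.
    counts, total = tally([{"URIBL"}, {"URIBL", "Barracuda"}, {"Barracuda"}])
    print(counts, total)  # Counter({'URIBL': 1, 'Barracuda': 1}) 3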

It's hard to draw a pretty chart from all of this, but here are the headline figures: 6,680,134 inbound connections resulted in 891,949 emails delivered to inboxes, around 13% of the total. 131,480 of those were never scanned because the recipient hadn't enabled the anti-spam feature.

So there you have it: it's an ongoing battle, and the battleground keeps shifting. The spammers have access to all of the same tools that we do (that's the nature of the internet), so they will keep trying to find new ways to beat the system, and we will keep trying to find new ways to stop them.

Under the hood: Upgrading MySQL 5.1 -> 5.6 with zero downtime

Downtime is not an option for us. We might get away with a minute or two in the middle of the night, but running an email service means handling millions of emails a day, at all hours.

So, after years of resisting, I was finally lured by the charms of the latest and greatest MySQL – well, almost. After wasting a day of my life trying to import my data into MySQL 5.7, I switched to 5.6 and everything worked like a dream. So, having run with MyISAM tables all this time, why switch to InnoDB? Here were the reasons for us:

  1. Performance. In our testing with a particularly complex query that examines lots of rows, 5.1 with MyISAM took 1 minute 2 seconds; 5.6 with InnoDB took 39 seconds the first time and 28 seconds on every subsequent run. This is probably related to point 2.
  2. InnoDB caches all data in memory, not just indexes. Our dataset is around 20GB in size, so this all fits nicely in memory, giving a speed boost, and reducing disk accesses (always a good thing).
  3. Row-level locking. This is a biggie. MyISAM has always been a bit of a problem in this regard, and in unexpected ways. Perform the query in point 1, which only reads from the database, and you hit an unexpected problem: because it reads from logging tables that are written to frequently, as soon as a write comes through (within 0.1 seconds in our situation), that write blocks behind the long-running read, and, more importantly, all subsequent reads block behind the waiting write. Before you know it, rather than 100 connections to the database you suddenly have 1,000, and emails start queueing on our system. With InnoDB this problem goes away completely.
  4. Crash recovery. Although we have 100% power uptime, dual feeds etc. in our data centre, it's nice to know that if the worst does happen the databases won't be corrupted, so there are no lengthy repair times and no risk of data loss.

So, how do we get from here to there without a lot of pain in the middle? The answer: authoritative DNS servers and a spare MySQL server. All servers that use the database must access it via a name (we use 'sql' for the main read-write database and 'sql-ro' for the replication slave). In our situation, they query our caching DNS servers, which are also authoritative for our domain, so if I make a change it takes effect instantly for all our servers.
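
In client code that simply means always connecting by name, never by IP, so a DNS change (plus forcing reconnects) is all it takes to repoint everything. A trivial sketch of the idea (mysql-connector-python is used purely as an example client, and the credentials are placeholders):

    # Clients always connect by name, never by IP, so repointing the database
    # is just a DNS change followed by forcing reconnects. Credentials and the
    # connector library are illustrative, not our actual setup.
    import mysql.connector

    def connect_rw():
        # 'sql' resolves to whichever server is currently the read-write master.
        return mysql.connector.connect(host="sql", user="app",
                                       password="secret", database="mail")

    def connect_ro():
        # 'sql-ro' resolves to whichever slave is currently serving reads.
        return mysql.connector.connect(host="sql-ro", user="app",
                                       password="secret", database="mail")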

The process then goes like this. The existing live servers are called sql001 (master) and sql002 (slave). Our level of traffic is such that one server can cope with the full load, particularly at off-peak times, so we will use that to our advantage.

  1. Point sql-ro at sql001. This takes sql002 out of live service.
  2. Use mysqldump to take a copy of the database on sql002.
  3. Set up a new server, sql003, with MySQL 5.6, restore the mysqldump data onto it, and make sql003 another slave of sql001.
  4. Alter all tables to use the InnoDB engine (there's a scripted sketch of this step after the list).
  5. Have a good look through my.cnf and set it up so that plenty of memory is available for InnoDB caching instead of MyISAM caching (chiefly innodb_buffer_pool_size rather than key_buffer_size).
  6. Test, test, test. And once happy, point sql-ro at sql003 to make sure all runs well on that server.
  7. Upgrade sql002 to MySQL 5.6 and set it to slave off sql003. To do this, you'll need to run mysqldump on sql003 with --single-transaction (which only works for InnoDB tables, but lets you take the dump without locking them) – there's a sketch of this after the list.
  8. Now it's time to do the dangerous stuff. Switch the DNS for sql to point to sql003. As soon as this is done, shut down the MySQL server on sql001. In our case, we quickly got replication conflicts on sql003 because clients still connected to sql001 were writing logging entries there. We weren't too worried about those, but it's best to shut sql001 down so all clients are forced to reconnect – to the correct server.
  9. Now that everything is pointing at sql003, it's time to upgrade sql001 to 5.6.
  10. Shut down the MySQL server on sql002 and copy the database contents to sql001. Also copy my.cnf, and don't forget to alter the server-id. Remove auto.cnf (which stores the server's UUID) from /var/db/mysql on sql001 before starting it, or MySQL will think it has two slaves with the same ID.
  11. Once sql001 is running OK, use 'mysqldump --single-transaction --master-data' from sql001 to set up sql002 as a slave of sql001. So now we have sql003 as master, with sql001 as a slave of sql003 and sql002 as a slave of sql001.
  12. Change sql and sql-ro to point at sql001 and switch off sql003 server.
  13. Double-check sql002 and then point sql-ro at it. Job done.
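
Step 4 is easy to script. Here's a minimal sketch of the idea (it assumes the mysql command-line client with login details in an option file, and 'mydb' is a placeholder schema name – not our actual tooling):

    # Find every remaining MyISAM table in a schema and convert it to InnoDB
    # (step 4 above). Assumes the `mysql` CLI can log in via ~/.my.cnf;
    # "mydb" is a placeholder schema name.
    import subprocess

    SCHEMA = "mydb"  # placeholder

    def run_sql(sql):
        """Run a statement through the mysql CLI and return its raw output."""
        result = subprocess.run(["mysql", "-N", "-B", "-e", sql],
                                capture_output=True, text=True, check=True)
        return result.stdout

    tables = run_sql(
        "SELECT table_name FROM information_schema.tables "
        f"WHERE table_schema = '{SCHEMA}' AND engine = 'MyISAM'"
    ).split()

    # Each ALTER rebuilds the table, so on a big dataset this is best done on
    # the new, not-yet-live server (sql003), not on the live master.
    for table in tables:
        print(f"Converting {SCHEMA}.{table} ...")
        run_sql(f"ALTER TABLE `{SCHEMA}`.`{table}` ENGINE=InnoDB")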
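
Steps 7 and 11 are the same reseed-and-repoint dance, and look roughly like this (a sketch only: hostnames, the replication user and password are placeholders, and login details are assumed to come from option files):

    # Reseed a slave from its new master: dump with --single-transaction
    # (InnoDB-only, no table locks) and --master-data (which embeds the binlog
    # coordinates as a CHANGE MASTER TO statement in the dump), load it into
    # the server being rebuilt, then start replication. Placeholders throughout.
    import subprocess

    SOURCE = "sql003"     # server to dump from (the new master)
    TARGET = "sql002"     # server being rebuilt as a slave
    REPL_USER = "repl"    # hypothetical replication user
    REPL_PASS = "secret"  # placeholder

    # Equivalent to: mysqldump -h sql003 ... | mysql -h sql002
    dump = subprocess.Popen(
        ["mysqldump", "-h", SOURCE, "--all-databases",
         "--single-transaction", "--master-data"],
        stdout=subprocess.PIPE)
    subprocess.run(["mysql", "-h", TARGET], stdin=dump.stdout, check=True)
    dump.stdout.close()
    dump.wait()

    # The dump has already set MASTER_LOG_FILE/MASTER_LOG_POS, so only the
    # host and credentials remain to be set before starting replication.
    subprocess.run(
        ["mysql", "-h", TARGET, "-e",
         f"CHANGE MASTER TO MASTER_HOST='{SOURCE}', "
         f"MASTER_USER='{REPL_USER}', MASTER_PASSWORD='{REPL_PASS}'; "
         "START SLAVE;"],
        check=True)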

This is very much an overview, and you should google the exact commands to use, my.cnf settings and so on to suit your own situation. There will undoubtedly be some gotchas along the way that you'll have to look out for (for example, after upgrading to 5.6 we had to recompile any code on that server that used MySQL, as the 5.1 client libraries had been removed).


And now for the gotchas, which I will list as I discover them:

  1. InnoDB tables do not support the 'INSERT DELAYED' command. This is probably because there's no need for it, as there's no table-level locking. However, rather than just ignoring the 'DELAYED' keyword, MySQL rejects the whole statement. We used this for some of our logging, so we lost some log info following the changeover.
  2. Weird one, this. For storing some passwords, we use one-way encryption via the ENCRYPT() function. This can optionally be supplied with a 'salt' of 2 characters. We supply 2 or more characters based on the account number of the password being encrypted. It would seem that with MySQL 5.1, if you supply more than 2 characters it switches from a simple MD5 scheme to SHA-512. With MySQL 5.6, it ignores everything but the first two characters and sticks with MD5. To work around this, we have to call the function twice: once with the 2-character account number, and once with the full account number preceded by '$6$' (a sketch of that double check is below). Told you it was weird!
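
Since ENCRYPT() is a thin wrapper around the system crypt() routine, the double check looks roughly like this when written out (Python's crypt module is used as a stand-in purely for illustration – it's Unix-only and was removed from the standard library in 3.13 – and the values are made up):

    # Sketch of the workaround: compute the hash both ways and accept either.
    # crypt.crypt() stands in for MySQL's ENCRYPT(), which calls the same
    # system crypt() routine. Account number and hashes are made-up examples.
    import crypt
    from hmac import compare_digest

    def password_matches(password, account_number, stored_hash):
        # Old-style hash: only the first two characters of the account number
        # end up being used as the salt.
        candidate_short = crypt.crypt(password, account_number[:2])
        # New-style hash: force the SHA-512 scheme explicitly with a "$6$" prefix.
        candidate_sha512 = crypt.crypt(password, "$6$" + account_number)
        return (compare_digest(stored_hash, candidate_short)
                or compare_digest(stored_hash, candidate_sha512))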