An explanation is in order about the zimbra002 downtime this morning, so please bear with me while I try to explain.
The Zimbra servers have a backup mechanism that can only back up to a mounted drive, so either a local drive or an NFS mount.
Every night they do an incremental backup, and once a week they do a full backup. The NFS server that zimbra002 and zimbra003 back up to crashed in the night (zimbra003 never seems to be affected in the same way). It crashed in such a way that it was still responding to pings, but not to NFS calls, so this wasn’t picked up by our monitoring systems.
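A check that touches the mounted drive itself, rather than pinging the server, would have noticed this kind of failure. The following is only a minimal sketch, not what our monitoring currently does: the function name `mount_responds` and the 5-second timeout are assumptions made for illustration. The probe runs in a child process because any access to a dead NFS mount can block indefinitely, and we don't want the check itself to get stuck.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if a stat of `path` completes within `timeout_s` seconds.

    Hypothetical helper: the stat runs in a child process so that a hung
    NFS mount blocks the child rather than the monitoring check.
    """
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the stat succeeded; anything else is a failure.
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in NFS I/O may not exit until the
        # server comes back, so we deliberately do not wait for it here.
        probe.kill()
        return False
```

A monitoring plugin could call `mount_responds()` with the backup mount point and raise an alert whenever it returns False.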
When an NFS mount goes away, a Linux server treats this as a very bad thing, and basically waits around for it to come back again.
This tends to affect only the process that is attempting to access the mounted drive, and other processes normally continue unaffected. Because Zimbra is a monolithic piece of Java code, however, the stuck backup process causes the rest of Zimbra to misbehave.
Even so, it still responds on all the ports that we check with our monitoring systems. The problems only occur after logging in, so once again this wasn’t picked up by our monitoring systems.
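To catch problems that only appear after login, a check has to log in and do something, not just confirm that a port is open. Here is a rough sketch of that idea using IMAP. It assumes the mailbox is reachable over IMAP with SSL and that a dedicated test account exists; the function name `mailbox_usable`, its parameters and the 10-second timeout are all made up for illustration and are not part of our current monitoring.

```python
import imaplib


def mailbox_usable(host, user, password, timeout_s=10.0,
                   connect=imaplib.IMAP4_SSL):
    """Return True if we can log in and open INBOX within the timeout.

    Hypothetical check: `connect` is a parameter only so that a fake
    connection class can be substituted when testing.
    """
    try:
        conn = connect(host, timeout=timeout_s)  # timeout needs Python 3.9+
        try:
            conn.login(user, password)
            # Opening the mailbox exercises the server beyond the port check.
            status, _ = conn.select("INBOX", readonly=True)
            return status == "OK"
        finally:
            conn.logout()
    except (imaplib.IMAP4.error, OSError):
        # Covers refused logins, protocol errors, timeouts and dead sockets.
        return False
```

The important part is the `select` call: a server in the state described above would accept the connection, but the check would fail or time out once it tried to open the mailbox.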
While we are still uncertain of the causes of the backup server outage, we’re going to look at two approaches to avoid this problem recurring.
1) Start backing up zimbra002 to a different server.
2) Mount the disk, do the backup, then unmount it straight afterwards. That way, if the backup server has a problem, hopefully the mount will fail, and we’ll skip the backup rather than breaking the mail server.
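The second approach could look something like the sketch below. This is not our actual backup script. It assumes the backup mount has an entry in /etc/fstab that isn't mounted automatically, so `mount <mount point>` is enough. The function name, the 60-second mount timeout and the `run` parameter (there only so the logic can be tested without really mounting anything) are all hypothetical, and the backup command is left as a placeholder.

```python
import subprocess


def backup_with_temporary_mount(mount_point, backup_cmd,
                                run=subprocess.run, mount_timeout_s=60):
    """Mount, run the backup, then always unmount.

    Returns False (backup skipped) if the mount fails or times out.
    A failure of the backup command itself is raised to the caller.
    """
    try:
        run(["mount", mount_point], check=True, timeout=mount_timeout_s)
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired, OSError):
        # Backup server unavailable: skip tonight's backup and leave
        # the mail server alone.
        return False
    try:
        run(backup_cmd, check=True)
    finally:
        # Unmount even if the backup failed, so nothing is left mounted.
        run(["umount", mount_point], check=False)
    return True
```

A nightly cron job would call this with the real mount point and backup command, and should send an alert when it returns False so that a skipped backup doesn't go unnoticed.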
Once again, apologies to all affected by this today, and hopefully this goes some way towards explaining things!