Monday, February 09, 2009

Impact of ESX snapshot backups on Microsoft database servers

Up until Update 2, ESX 3.5 uses a VMWare tool called the Sync Manager as part of the snapshot backup process. The Sync manager quiesces the file system (pauses all incoming I/O requests and dumps dirty data to disk) before the snapshot backup is taken. This allows the backup to be crash-consistent.

If you were to take a snapshot without pausing the I/O requests and then restore the snapshot, the virtual machine will start up for the first time thinking that it is recovering from a power failure type of crash. But the recovered system will not be able to find the I/O requests that were stored in the memory (RAM) at the time the snapshot was taken. So data loss is possible.

What does this have to do with database servers? For servers housing databases (Active Directory, Exchange, or SQL servers), stopping I/O requests with the Sync manager halts incoming requests to the database without notifying the database of what is happening. The database is waiting on that information to arrive – so when the data doesn't come in when expected, errors are logged to the event log and in some cases, the databases become corrupt. I happened to find this out on an active directory domain controller, and the sequence of errors looks like this:

First, you'll see event ID 1 logged with source LGTO_Sync. This is the Sync driver starting to do its work quiescing the file system.

On domain controllers, AD requests will begin to fail. The description will differ based on the request, but the Event ID stays the same.

For domain controllers running DNS, dynamic updates will fail as well:

For DHCP servers, you will see this:

On Exchange servers, you'll see Autodiscover errors:

If you are seeing these errors, stop using the sync manager now. Eventually you will corrupt your database.

Workaround
Stop quiescing guest database servers before taking snapshots, and start adding snapshots of virtual machine memory to your backups. Most backup applications allow you to do this. If yours doesn't, you can script it using vimsh.

Example:
vimsh -n -e "vmsvc/createsnapshot [VmId] [snapshotName] [snapshotDescription] [includeMemory]"

vimsh -n -e "vmsvc/createsnapshot XXXX FIRST_SNAPSHOT MY_FIRST_SNAPSHOT_1"

By taking a snapshot of the guest machine's memory, you are creating a full snapshot. When you restore, you restore the memory on top of the file system. When the machine starts, it will be able to access all the necessary information in memory to start normally - no crash.

Resolution
To combat the Sync Manager problem, ESX has released update2, which includes an ESX VSS tool that integrates with Microsoft VSS. It works by using the windows operating system to hold I/O requests, eliminating the need for the sync driver. When the operating system is in charge of halting its own I/O activity, the databases are notified that a backup is taking place. The databases can then pause their own processing of requests, and no errors occur.

This update is relatively new, and many third-party backup applications do not support update 2 yet, which is why I have offered the workaround here.

One last note about Microsoft domain controllers and Vmware snapshot backups
In 2006 (revised in Dec. 2008), Microsoft released KB 888794 (http://support.microsoft.com/kb/888794/en-us), which states that

"Active Directory does not support any method that restores a snapshot of the operating system or the volume the operating system resides on. This kind of method causes an update sequence number (USN) rollback. When a USN rollback occurs, the replication partners of the incorrectly restored domain controller may have inconsistent objects in their Active Directory databases. In this situation, you cannot make these objects consistent. "

In reality, the BURFLAGS registry referred to in Microsoft KB290762 (http://support.microsoft.com/kb/290762) can be set so that the virtual DCs are nonauthoritative, and an existing domain controller can be set to authoritative. This will allow the USN to be overwritten by the authoritative domain controller, and no USN rollback will occur.

2 comments:

Iben Rodriguez said...

I wonder how VMware's Update Manager snapshots impact virtual machines running databases?

Seems like it would be similar... no?

Maybe we should recommend not using snapshots with Update Manager?

K.Brown said...

Just a comment...the article you linked to for BURFLAGS addresses FRS replication only (i.e. SYSVOL).

USN rollback affects Active Directory replication (I.e. data in NTDS.DIT).

Configuring BURFLAGS with a "D2" will not correct AD replication issues - it will only force FRS data to be re-replicated from another FRS source (i.e. SYSVOL or other FRS replication data).

FYI...one of your screen shots contains the FQDN of your DNS domain - you may want to erase that also.