Thursday, November 8, 2012

Patching Exchange 2010 servers that are members of a Database Availability Group (DAG)

Administrators who come from a non-clustered environment to a clustered one must learn that patching is done differently. Non-clustered servers can often simply be set to use WSUS or any other patching product that installs the patches, most likely followed by a restart of services or a reboot of the Windows box. With a clustered solution this is not best practice, and doing it this way often breaks something in the process.
The preferred way of patching clustered servers is to move resources off the node you’re about to patch and then apply the patches. This is a manual process and often takes time. On the positive side, administrators have complete visibility of the process, can see if anything doesn’t work as expected, and can take action on the failed node while the other nodes of the cluster still maintain service to users. This approach also lets administrators work during regular office hours instead of working late at night on patching day or on weekends.
Exchange 2010 ships with some scripts that help administrators manage all of this manually.
Three scripts in the scripts directory are relevant here: StartDagServerMaintenance.ps1, StopDagServerMaintenance.ps1 and RedistributeActiveDatabases.ps1.
Start on the first node by running StartDagServerMaintenance.ps1 -serverName <node name>
This moves the active databases from the first node to another node in the DAG and, if needed, also moves the Cluster Group to another node, to maintain quorum and keep the Primary Active Manager online, which is important. The script does not just move things off the first node; it also reconfigures some parameters to stop databases from moving back if a failure occurs on another node during maintenance. The same applies to the “Cluster Group” resource.
Now you can patch and do other maintenance on the first node as needed, without databases being activated on it in the meantime.
To bring it back online after maintenance, run StopDagServerMaintenance.ps1 -serverName <node name>
This removes the configuration that stops databases from moving to the server. From here on the server is free to host active databases again or to run the Primary Active Manager. The script will not move any databases back to the server; it only configures the server to be a possible owner of databases again.
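The start/stop cycle for a single node might look like the following in the Exchange Management Shell. This is only a minimal sketch: the server name MBX1 is a placeholder, and it assumes the $exscripts variable points at the Exchange scripts directory, as it does in a default Exchange 2010 Management Shell session.

```powershell
# Placeholder name; replace with the DAG member you are about to patch.
$node = "MBX1"

# The maintenance scripts live in the Exchange scripts directory.
Set-Location $exscripts

# Move active databases and cluster resources off the node and block activation on it.
.\StartDagServerMaintenance.ps1 -serverName $node

# ... install patches and reboot the node here ...

# Remove the activation block so the node can host active databases again.
.\StopDagServerMaintenance.ps1 -serverName $node
```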
The next step is to repeat all of this for every other node in your DAG. If you have a large DAG, it is possible to patch multiple servers at the same time. But be careful to maintain quorum, otherwise things will break.
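As a rough aid for the quorum arithmetic, the helper below estimates how many voters can be offline at once. It is a sketch, not part of Exchange: the function name Get-DagMaxVoterLoss is made up for illustration, and it assumes the usual DAG quorum model, where a DAG with an even number of members also counts the witness server as a voter and a majority of voters must stay online.

```powershell
# Hypothetical helper, not an Exchange cmdlet.
function Get-DagMaxVoterLoss {
    param([int]$MemberCount)

    # An even member count adds the witness server as one extra voter.
    $voters = if ($MemberCount % 2 -eq 0) { $MemberCount + 1 } else { $MemberCount }

    # Quorum needs more than half of the voters online.
    $majority = [int][math]::Floor($voters / 2) + 1

    # Number of voters that can be offline while the majority is still met.
    return $voters - $majority
}

Get-DagMaxVoterLoss -MemberCount 4
```

For a four-member DAG this gives 2, and for a two-member or three-member DAG it gives 1. Remember that the witness server counts as one of the voters in the even case, so losing it uses up part of that margin.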
The last step, once all servers are back online in a normal state, is to use the last script, RedistributeActiveDatabases.ps1.
This script has several parameters, but the one you want to use now is -BalanceDbsByActivationPreference
When it is run, each database is moved to the mailbox server that has the lowest activation preference number set on that individual database, so hopefully this is already set in a way that suits your environment.
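A call to the redistribution script could look like the sketch below. DAG1 is a placeholder DAG name, and $exscripts is assumed to point at the Exchange scripts directory, as in a default Exchange Management Shell session.

```powershell
Set-Location $exscripts

# DAG1 is a placeholder; use the name of your own DAG.
# -ShowFinalDatabaseDistribution prints a report of where the databases ended up.
.\RedistributeActiveDatabases.ps1 -DagName DAG1 -BalanceDbsByActivationPreference -ShowFinalDatabaseDistribution
```

The script asks for confirmation before moving databases, which is a good moment to check that the activation preferences are what you expect.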
All these steps take time and are manual. You should also verify between each step that replication is working and that no backup is running at the same time. Replication can be verified with the Get-MailboxDatabaseCopyStatus cmdlet.
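A quick check between steps might look like this. It is a sketch with MBX1 as a placeholder server name; which values count as acceptable queue lengths is up to you.

```powershell
# MBX1 is a placeholder server name.
# Show the state of every database copy hosted on the server.
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

# BackupInProgress is only populated when -Status is specified.
Get-MailboxDatabase -Server MBX1 -Status |
    Format-Table Name, BackupInProgress -AutoSize
```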
What happens if you simply apply patches to a server that is a member of a DAG and reboot it afterwards?
In theory nothing should stop your DAG from serving users. This is also what I have experienced in real life, as long as you don’t reboot multiple servers at the same time and lose quorum. What sometimes happens, though, is that replication and indexing break, leaving some tasks for admins to clean up later.
Smaller organizations often don’t have the time to babysit servers while patching them. They just want to configure patches in WSUS and have them applied automatically during the night without anything bad happening.
To make this work you must first configure the Windows Update client on each server so that the DAG members do not apply patches at the same time. Next, I have created a script that you must schedule to run a little ahead of the time when Windows Update does its work. Let’s say you configure Windows Update to apply patches at 1 am; then schedule the script to run 30 to 60 minutes before that.
On the next node in your DAG you schedule Windows Update to apply patches at 3 am, so on this server schedule the script to run sometime between 2:00 and 2:30 am.
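One way to set up such a schedule is with schtasks, as in the sketch below. Everything specific here is an assumption for illustration: the task name, the script path C:\Scripts\PrePatchDag.ps1, the service account CONTOSO\svc-patch, and the daily frequency, which should mirror whatever install schedule you gave Windows Update.

```powershell
# First node: Windows Update installs at 01:00, so run the script 45 minutes earlier.
# /RP * makes schtasks prompt for the account password.
schtasks /Create /TN "DAG pre-patch" /SC DAILY /ST 00:15 `
    /RU "CONTOSO\svc-patch" /RP * `
    /TR "powershell.exe -NoProfile -File C:\Scripts\PrePatchDag.ps1"

# On the second node, where Windows Update installs at 03:00, the same command
# would use a start time between 02:00 and 02:30, for example /ST 02:15.
```

A script started this way does not run inside the Exchange Management Shell, so it has to load the Exchange cmdlets itself, for example with Add-PSSnapin Microsoft.Exchange.Management.PowerShell.E2010.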
The script does about the same thing as the Start/StopDagServerMaintenance scripts, but in a slightly different way. It does not configure databases and servers to be blocked from activation. It checks for running backups and does not move any databases while a backup is running. It also tries to fix some errors, such as the content index or replication being in a failed state. Fixing errors should not be needed in a perfect world, but it can happen. The script only does some very basic things, and it is up to you to change it to take whatever action you think is right in different situations.
The script loops, looking at database and replication health state. If something is not good, it tries to take action on it, and it also tries to move databases to another node in your DAG. Between each loop there is a pause of one minute to let replication and other things catch up and come to a steady state.
It will not loop forever, but has a limit of ten loops. I figured that if the actions could not be performed in 10 tries, it is not worth trying anymore. You probably have something worse to handle than just a script handling a little glitch in your DAG.
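To make the loop easier to picture, here is a heavily simplified sketch of that kind of logic. It is not the script itself: the variable names and the queue length threshold are assumptions, the repair actions are left out, and it only waits, re-checks and moves the active databases once the local copies look healthy and no backup is running.

```powershell
# Assumed values; the one-minute pause and the limit of ten loops follow the description above.
$maxLoops       = 10
$sleepSeconds   = 60
$maxQueueLength = 10                   # hypothetical threshold for copy and replay queues
$server         = $env:COMPUTERNAME

for ($i = 1; $i -le $maxLoops; $i++) {
    # Databases on this server that currently have a backup running.
    $backup = Get-MailboxDatabase -Server $server -Status |
        Where-Object { $_.BackupInProgress }

    # Copies on this server that are not in a good state.
    $copies = Get-MailboxDatabaseCopyStatus -Server $server
    $bad = $copies | Where-Object {
        $_.Status -notmatch '^(Mounted|Healthy)$' -or
        $_.ContentIndexState -ne 'Healthy' -or
        $_.CopyQueueLength -gt $maxQueueLength -or
        $_.ReplayQueueLength -gt $maxQueueLength
    }
    $active = $copies | Where-Object { $_.Status -eq 'Mounted' }

    if (-not $backup -and -not $bad) {
        if (-not $active) { break }    # nothing active here anymore, done
        # Let Active Manager pick the best target for each active database.
        Move-ActiveMailboxDatabase -Server $server -Confirm:$false
    }

    # Give replication time to settle before checking again.
    Start-Sleep -Seconds $sleepSeconds
}
```

If the loop runs out without ever reaching a clean state, this sketch simply stops, which matches the idea that ten failed tries point to a bigger problem than a script should handle.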
The parameters that you probably need to set to suit your environment are the sleep time on line 120 and the number of loops on line 124. The copy and replay queue length thresholds on line 110 also need to be set to values that work in your organization.
There are a lot of IFs between lines 23 and 110 that try to handle different situations. Look through these and see if they are OK for you.
The script doesn’t have any special error handling or output. I figured that if it is scheduled at nighttime there isn’t much need for output anyway. Logging can be added for admins to look at later.
Why not schedule the Start/StopDagServerMaintenance scripts? This can of course be a viable solution. The problem as I see it is that they set servers and databases to not become active automatically, and you want to set everything back to automatic once each server is patched. You don’t know how long each server takes to patch, so picking the correct schedule could be tricky, and if you only have a small DAG with 2 servers you want them to be available as much of the time as possible. My script also tries to correct some issues before moving databases to another node.
The script is on my SkyDrive.