I’ve noticed over the years that while VMware admins tend to really understand maintenance mode, a lot of others in adjacent spaces (storage, network, etc.) have a very murky perspective on it. In fact, I’d bet that if you sat a storage person down at vCenter and told them to evacuate a host with maintenance mode (MM), odds are they would be really confused when the task hung or failed. I know I was the first time I tried it.
In case you don’t know, maintenance mode is an option for a host that is designed to non-disruptively clear off any running VMs via vMotion (and optionally migrate powered-off/suspended VMs) so that “something” can be done to that host. A lot of times this is just a reboot. Most of our lives (“us” being storage guys) are spent directing other people to do this. “Just put it in maintenance mode,” we tell the VMware admin, implicitly assuming that this will vMotion all the guests off automagically. But that isn’t always true. In fact, it is only true in one case.
Maintenance mode is always an option for a host, regardless of whether the cluster has DRS enabled or what mode it is in. But the behavior of MM is contingent on the DRS settings for the cluster.
When enabled, DRS has three different modes of operation. Briefly those are:
- Manual – This option generates placement and migration recommendations that the VMware admin can apply, but DRS never automatically moves or places VMs
- Partially Automated – This option automatically places VMs on a host when they are powered on, but migrations to balance the cluster are only ever generated as recommendations; they are not applied automatically
- Fully Automated – This option automatically places VMs at power-on and actively moves VMs to balance the cluster load. An additional slider controls how conservatively (infrequently) or aggressively (frequently) those moves happen
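The three levels really differ along two axes: initial placement and load-balancing migrations. Here’s a minimal Python sketch of that matrix (my own names, a toy model, not a VMware API):

```python
from enum import Enum

class DrsLevel(Enum):
    """The three DRS automation levels (hypothetical enum, my naming)."""
    MANUAL = "manual"
    PARTIALLY_AUTOMATED = "partiallyAutomated"
    FULLY_AUTOMATED = "fullyAutomated"

def automatic_placement(level: DrsLevel) -> bool:
    """Does DRS pick a host on its own when a VM powers on?"""
    return level in (DrsLevel.PARTIALLY_AUTOMATED, DrsLevel.FULLY_AUTOMATED)

def automatic_migration(level: DrsLevel) -> bool:
    """Does DRS apply vMotion move recommendations on its own?"""
    return level is DrsLevel.FULLY_AUTOMATED
```

Note that only Fully Automated answers “yes” to both questions, which is exactly why it is the only mode where MM evacuates a host hands-free.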
So you have a cluster with DRS enabled in Partially Automated mode, you try to put a host into maintenance mode, and it just sits and spins. Eventually the task times out and fails. Why?
This is because maintenance mode actually generates DRS move recommendations, and in Partially Automated mode (as well as Manual mode) DRS won’t apply move recommendations on its own. The host is waiting for you to either manually relocate the VMs or apply the generated recommendations. Here I’ve put a host into MM in a Manual mode cluster, and you can see the DRS recommendations available.
But those recommendations will just sit there until applied, and the host will sit waiting to enter maintenance mode until you either apply those recommendations or you manually migrate with vMotion.
One good thing about the DRS integration is that there isn’t really a need to disable or tweak DRS when entering MM, even in Fully Automated mode. Think about it – if DRS is responsible for balancing load across all hosts, then as a host evacuated its VMs for MM, DRS would see “a free host with nothing on it” and start loading it back up. It would be moving guests on while MM was moving them off. But because DRS is maintenance-mode aware, we don’t have to worry about that.
Another nice thing about MM is that it won’t complete until the VMs are migrated off. So say you didn’t know about all these details around MM not being fully automated in certain clusters. It isn’t going to actually put the host into MM while there are still running VMs on it, causing an outage. The MM task will simply time out and fail.
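That safety behavior can be modeled as a simple wait loop: the task only succeeds once the host is empty, and otherwise times out rather than forcing an outage. A toy sketch (my own function and names, not the real vSphere task):

```python
import time

def enter_maintenance_mode(running_vms, evacuator=None, timeout_s=30.0, poll_s=0.1):
    """Toy model of the MM task: succeed only once the host is empty.

    `running_vms` is a mutable list standing in for the host's VM inventory.
    `evacuator`, if given, is called each poll and may migrate VMs away
    (in a Fully Automated cluster, DRS plays this role).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if evacuator is not None:
            evacuator(running_vms)
        if not running_vms:              # host is empty: safe to enter MM
            return "entered-maintenance-mode"
        time.sleep(poll_s)
    return "timed-out"                   # never forces an outage on running VMs
```

With an evacuator supplied (the Fully Automated case) the task completes; without one, and with VMs still on the host, it eventually just times out – exactly the symptom from the Partially Automated scenario above.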
So we have four DRS options for a given cluster. Here is how they break down with MM:
- DRS Disabled – in this case a host will start to enter maintenance mode, but will not complete until all VMs are evacuated. Because DRS is disabled, an administrator is required to manually migrate all VMs to other hosts with vMotion.
- DRS Enabled, Manual or Partially Automated – Similar to DRS disabled, the host will start to enter MM but won’t complete until all VMs are evacuated. Manual migration via vMotion is still an option, but an easier way is to go to the DRS recommendations page, which should have recommendations to evacuate all guests because the host is entering MM. Then you just apply the recommendations.
- DRS Enabled, Fully Automated – This mode will automatically evacuate the host with no administrator intervention required.
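The breakdown above boils down to one question: who evacuates the host? A hedged sketch of the decision (my own function and strings, not a VMware API):

```python
def mm_evacuation(drs_enabled, level=None):
    """Toy model: how running VMs leave a host entering maintenance mode."""
    if not drs_enabled:
        return "admin must vMotion every VM manually"
    if level in ("manual", "partiallyAutomated"):
        return "admin applies the generated DRS recommendations (or vMotions manually)"
    if level == "fullyAutomated":
        return "DRS evacuates the host automatically"
    raise ValueError("unknown DRS automation level: %r" % level)
```

Only the last branch needs no human in the loop, which is the single case where “just put it in maintenance mode” works unattended.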
Here are some other things to keep in mind.
First, HA should have no meaningful interaction with maintenance mode because there should be no outage (guest or host) during the process. There is a maintenance option with HA, seen here.
This is intended for anyone doing network maintenance on the management network. Essentially, if the management network goes down but the guest networks stay up, we don’t want the whole cluster to freak out. But again, there is no need to monkey with this if you are just doing maintenance mode.
Second, aside from random reboots, vSphere Update Manager is going to be a big source of MM as well. If you are applying new VIBs or patching, it will likely need to put hosts into MM, and it follows the same rules as regular MM (behavior is based on the DRS settings). This is really important to note if you have scheduled patch updates that you expect to complete automatically after hours! If your DRS cluster is in anything other than Fully Automated, administrator intervention will be required to complete the process.
Next, don’t forget about any VM Overrides with respect to the DRS settings. Sometimes in Fully Automated clusters you may have VM Overrides set for VMs that you don’t want moving during the day, like VoIP-related servers. If a VM Override sets a VM to Partially Automated or Manual, that VM will also require administrator intervention.
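The override logic is simple to model: the per-VM setting, if present, beats the cluster default, and anything short of Fully Automated needs a human. A toy sketch with made-up VM names:

```python
def effective_level(cluster_level, overrides, vm):
    """Per-VM DRS automation level: a VM Override wins over the cluster default."""
    return overrides.get(vm, cluster_level)

def needs_admin_for_mm(cluster_level, overrides, vms):
    """VMs that will NOT be auto-evacuated when their host enters MM."""
    return [vm for vm in vms
            if effective_level(cluster_level, overrides, vm) != "fullyAutomated"]
```

So even in a Fully Automated cluster, one overridden VoIP server is enough to leave a host stuck waiting on you.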
Finally, the conservative/aggressive slider. Because MM is tied into DRS and generates DRS recommendations, there isn’t really an issue no matter where this slider sits. For the paranoid running Fully Automated DRS, though, you can move the slider to the most conservative setting, which essentially stops DRS from generating load-balancing moves while still honoring the mandatory moves for MM and affinity rules.
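One way to picture the slider: DRS recommendations carry priorities 1 (mandatory) through 5, and the slider sets the cutoff for what a Fully Automated cluster applies on its own. MM evacuations and affinity-rule fixes are priority 1, so they survive even the most conservative setting. A toy sketch (my own data shape, not a VMware API):

```python
def drs_applies(recommendations, threshold):
    """Toy model of the migration threshold slider.

    `recommendations` is a list of (name, priority) tuples; `threshold`
    is the slider position, 1 = most conservative, 5 = most aggressive.
    Priority-1 recommendations (MM evacuation, rule fixes) always make
    the cut, even at threshold 1.
    """
    return [name for name, priority in recommendations if priority <= threshold]
```

At threshold 1 a host still drains for MM; it just stops getting shuffled around for load balancing.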
An interesting question here is, what if I have anti-affinity rules that would be violated by the MM setting? An easy enough thing to check as I only have two hosts in my lab. I created an anti-affinity rule for two VMs:
With my cluster in Partially Automated mode and the VMs distributed on separate hosts, I tried to put one host into MM and then checked the DRS recommendations.
Notice that while there are two recommended moves here, there is not a move for linuxdns which is also on that host. This is because that move would violate the anti-affinity rule I put in place. So I have to manually migrate that VM with vMotion.
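That filtering is easy to model for a two-host lab like mine: a move is withheld whenever the VM’s anti-affinity partner already lives on the only possible destination. A toy sketch (“dns-peer” and the other names are made up; only linuxdns comes from my lab):

```python
def mm_recommendations(vms_on_host, placements, anti_affinity_pairs, target_host):
    """Toy model: split an evacuating host's VMs into moves DRS can
    recommend to target_host and moves withheld because an anti-affinity
    partner is already running there."""
    recommended, withheld = [], []
    for vm in vms_on_host:
        partners = {a if b == vm else b
                    for a, b in anti_affinity_pairs if vm in (a, b)}
        if any(placements.get(p) == target_host for p in partners):
            withheld.append(vm)   # the admin must find this VM a legal home
        else:
            recommended.append(vm)
    return recommended, withheld
```

In a two-host cluster there is no legal home, so the withheld VM simply blocks MM until you intervene; with more hosts, DRS usually has somewhere else to put it.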
Also kind of interesting: the vMotion compatibility checker recognizes that a host is going into MM and won’t let you vMotion to it.
After the vMotion completes, the host enters MM like normal. The same thing happens if the cluster is in Fully Automated mode.
So in summary, when using MM keep the cluster DRS settings, as well as any VM Overrides and affinity/anti-affinity rules, in mind so that your MM and updates aren’t impacted. With a larger cluster, anti-affinity rules can likely still be satisfied even as hosts go into MM, but remember that there are different kinds of affinity rules, and that VUM can patch hosts in parallel if you have enough resources in your cluster – so multiple hosts may be entering MM at the same time. There are a lot of different configuration options, but hopefully this post helps clear up the behavior for you.