2013
05.02
Here I am, working on a Sunday (when I wrote this post). It’s not so bad; it’s raining outside, so that rules out going for a walk or doing some photography. I jumped onto Twitter and saw someone moaning that they had to work on a Sunday to patch their Hyper-V cluster. To me that’s a WTF! moment.
Windows Server 2012 Failover Clustering gives us Cluster Aware Updating (CAU).  Using this you can patch a Hyper-V cluster without getting manually involved in “maintenance modes” and Live Migration.  The process will:
  1. Download updates from Microsoft, WSUS, etc., or a file share, to the hosts (and this is extensible to third-party updates, such as those from OEMs).
  2. Put host 1 into maintenance mode – that drains it of virtual machines using Live Migration and … Quick Migration (for VMs marked as LOW priority, by default, which I DO NOT agree with).  You can make it 100% Live Migration so no services suffer an outage during the moves.  The more bandwidth your Live Migration network has, the faster this will be – using 1 Gbps networking for 512 GB RAM hosts is stupid!  Even at full line rate, pushing 512 GB of VM memory across a 1 Gbps link takes more than an hour.
  3. Patch and reboot host 1
  4. Wait for host 1 to come back online
  5. Bring host 1 out of maintenance mode
  6. Repeat steps 2-5 for each host
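To make steps 2-6 concrete, here is a minimal sketch of the rolling pattern in Python. It is a toy simulation, not how CAU is implemented: the `Host` class and the `drain`, `patch_and_reboot`, and `rolling_update` functions are hypothetical names made up for illustration, the "least-loaded host" placement rule is an assumption, and step 1 (downloading updates) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a cluster node; not a real CAU or Hyper-V object."""
    name: str
    vms: list = field(default_factory=list)
    in_maintenance: bool = False
    patched: bool = False


def drain(host, others):
    """Step 2: enter maintenance mode and move every VM to another host."""
    host.in_maintenance = True
    while host.vms:
        vm = host.vms.pop()
        # Assumed placement rule: the least-loaded host not in maintenance mode.
        target = min((h for h in others if not h.in_maintenance),
                     key=lambda h: len(h.vms))
        target.vms.append(vm)


def patch_and_reboot(host):
    """Steps 3-4: placeholder for installing updates and waiting for the reboot."""
    host.patched = True


def rolling_update(cluster):
    """Steps 2-6: handle one host at a time so the VMs always have somewhere to run."""
    log = []
    for host in cluster:
        others = [h for h in cluster if h is not host]
        drain(host, others)
        log.append(f"{host.name} drained")
        patch_and_reboot(host)
        host.in_maintenance = False  # Step 5: bring the host out of maintenance mode
        log.append(f"{host.name} patched")
    return log


cluster = [
    Host("host1", ["vm1", "vm2"]),
    Host("host2", ["vm3"]),
    Host("host3", ["vm4", "vm5"]),
]
log = rolling_update(cluster)
total_vms = sum(len(h.vms) for h in cluster)
print(log)
print("All patched:", all(h.patched for h in cluster), "- VMs still placed:", total_vms)
```

The point of the sketch is that only one host is ever drained at a time, so every VM stays placed on some host for the whole run.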
CAU orchestrates the entire process.  All you’ve got to do is make it happen:
  • You can manually invoke CAU from a Failover Cluster Manager console not running on a cluster member
  • You can set up a special CAU role on the cluster with a patching schedule – it’s a clustered role so it will move just like the VMs
And the process is customizable, e.g. don’t proceed/continue if Y hosts are offline.
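As a rough illustration of that kind of stop condition, here is a small Python sketch. The function names and the `stop_threshold` parameter are hypothetical, and the rule "stop once Y hosts are offline" is only my reading of the sentence above, not CAU's actual logic or option names.

```python
def may_proceed(offline_hosts, stop_threshold):
    """Allow the run to start or continue only while fewer than Y hosts are offline."""
    return offline_hosts < stop_threshold


def plan_run(hosts, offline_before_each, stop_threshold):
    """Walk the hosts in order and stop before the first one where the check fails.

    offline_before_each[i] is the number of hosts seen offline just before
    hosts[i] would be patched (made-up observations for this example).
    """
    patched = []
    for host, offline in zip(hosts, offline_before_each):
        if not may_proceed(offline, stop_threshold):
            break  # Too many hosts are down: leave the remaining hosts untouched.
        patched.append(host)
    return patched


# Two hosts are found offline just before host3, so the run halts there.
print(plan_run(["host1", "host2", "host3", "host4"], [0, 0, 2, 0], stop_threshold=2))
# prints ['host1', 'host2']
```

Checking before every host, rather than once at the start, means a cluster that degrades halfway through the run is not pushed any further.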
So … let me ask you a question.  If your VMs are moving around using Live Migration, and their services never go offline … why do you need a maintenance window?  Why exactly do you want to be a sad bastard like me and work on a Sunday?
Me, I think I’d do my host patching on a Wednesday morning, at around 11am, in a typical business.  Why?  A few reasons:
  1. Live Migration keeps services online so the business should not notice.
  2. I’m “in” the office already.  If something does go wrong, I am not getting a call at 3am or at the weekend.  I’m sober, awake (as much as I will be, anyway), and able to respond immediately.
  3. Any support services will have their primary staff available.  If I do need to call someone for hardware or software support, they are online, and I’m not dealing with the red-eye team at 3am on a Sunday morning.
  4. I can monitor for exceptions quite happily.
  5. The business doesn’t need to pay me overtime or give me time-in-lieu.
  6. Peak business in IT is at either end of the week (“password reset Monday” and “I didn’t want to bother you” Friday afternoons) so Wednesday seems like a nice balance.
So yeah, I do think that CAU should kill the Hyper-V cluster patching window.
Edit 1:
The same person was on Twitter many hours later, complaining that patching Hyper-V took them “11 hours”.  Really!?!?! Hmm, I think if that was me I’d be asking what I was doing wrong.  Just sayin’  is all …
You can learn more about Windows Server 2012 Hyper-V from the book, Windows Server 2012 Hyper-V Installation And Configuration Guide.

7 comments so far

  1. I’d like to add that CAU works flawlessly. We have been using it on all our Hyper-V clusters since RTM, during business hours, for the reasons Aidan mentioned. Bear in mind that our business is more than 95% virtualized on Windows Server 2012 Hyper-V clusters, so we cannot afford to mess this up. It will always return your cluster to the state it was in when you started, whether the updates fail or not. We have grown to trust it. You can even use it to deploy firmware, BIOS updates, etc.

  2. All well and good, but that’s just for the hosts; one still has VMs, and not all of those are always clustered.

    • Correct; but you’ll see by the title of the post, I was not referring to the VMs.

  3. It all looks nice on paper, but in reality I get blue screens from time to time when moving machines between nodes, and it is always the same BSoD: vmswitch.sys.
    That points to the same problem: various versions of the Intel drivers, I guess.
    The issue is that it is not repeatable. But in a 3-node cluster with 300 GB of RAM and 40 VMs, it happens regularly. I tried 3 different versions of the drivers and can’t confirm any of them as guilty; the BSoDs still occur sporadically.
    The only common factor is the Intel 520 10G adapters in converged mode: each adapter has its own VM switch, with jumbo frames enabled.
    So I have a huge block on using Live Migration, because I never know when it is going to crash the node.
    Therefore CAU is limited to what people are comfortable running.
    Simple live migrations are usually fine; massive movements of 100 GB RAM VMs result in a BSoD one time in 4.

    • Open a support case with your h/w vendor. If you buy the h/w, you should expect it to work correctly.

  4. I have yet to do even a HV 2008R2 project where we do not patch the cluster during the day. Not automated (unless they also have SCorch and use it) but even that works great.

  5. Is there a VMware or KVM (or any other platform, I don’t want to enumerate all of them) equivalent to this? It seems that this is an absolute killer feature for a common scenario. I just heard of a person who moved their pfSense out of a VirtualBox VM and onto a physical machine because the host updates would take his network down all the time.
    BTW. Aidan, we love you. Keep the Hyper-V flame burning!
