The Big Changes In WS2012 Cluster Shared Volume (CSV)

Microsoft made lots of changes with CSV 2.0 in Windows Server 2012.  But it seems like that message has not gotten through to people.  I’ve responded to quite a few comments here on the blog and I’m seeing stuff on forums.  What’s really annoying is that when you tell people that X has changed, they don’t listen.

I would strongly recommend that people take some time (I don’t care about excuses) to watch the TechEd presentation, Cluster Shared Volumes Reborn in Windows Server 2012: Deep Dive, by Rob Hindman and Amitabh Tamhane (Microsoft).  There are lots of changes.  But I want to focus on the big ones that people repeatedly question.

OK, what are the major changes?

There IS NO Redirected IO in WS2012 CSV Backup

Let me restate that in another way: Windows Server 2012 does not use Redirected IO to back up CSVs.

This has been made possible thanks to substantial changes in how VSS places VMs that are stored on CSV into a quiescent state.  The backup agent (VSS Requestor) kicks off a backup request with a list of virtual machines.  The Hyper-V Writer identifies the storage location(s) of the VMs’ files.  A new component, the CSV Writer, is responsible for coordinating the Hyper-V nodes in the cluster … meaning that all VMs on a CSV that is being backed up are placed into a quiescent state at the same time.  This allows for a single distributed VSS snapshot of each CSV.  The provider (hardware, software or system) can then go to work and get the snapshot.


This is much simpler than what CSV did in Windows Server 2008 R2.  [The following does not happen in WS2012] There was no CSV Writer.    There was no coordination, so Redirected IO was required.  The node performing a snapshot needed exclusive access to the volume so all IO went through it for the time being.  A lot of people knew that bit up to there.  The bit that most people didn’t know was that each node (hosting VMs that were being backed up) took snapshots of each CSV that was being backed up.  And that could cause problems.

I’ve heard several times now from people who’ve experienced issues with volumes going offline during backup.  There were two causes that I’ve seen, and both were related to a third party hardware VSS provider:

  • Using a hardware VSS provider that did not support CSV
  • The rapidly rotating and repeated snapshot process caused chaos in the SAN with the hardware snapshots

But, all that is G-O-N-E when backing up CSV on Windows Server 2012:

  • There is no redirected IO
  • There is a single VSS snapshot performed
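To put rough numbers on that difference, here is a minimal sketch in Python. It only encodes what is described above: on Windows Server 2008 R2 each node hosting VMs in the backup took a snapshot of each CSV in the backup, while Windows Server 2012 takes a single distributed snapshot per CSV. The function names and the example figures (4 nodes, 3 CSVs) are my own illustration, not anything that VSS or the cluster exposes.

```python
def snapshots_ws2008r2(nodes_hosting_backed_up_vms: int, csvs_in_backup: int) -> int:
    """WS2008 R2: every node hosting backed-up VMs snapshots every CSV in the backup."""
    return nodes_hosting_backed_up_vms * csvs_in_backup


def snapshots_ws2012(csvs_in_backup: int) -> int:
    """WS2012: the CSV Writer coordinates one distributed snapshot per CSV."""
    return csvs_in_backup


if __name__ == "__main__":
    nodes, csvs = 4, 3  # hypothetical cluster, purely for illustration
    print("WS2008 R2 snapshots:", snapshots_ws2008r2(nodes, csvs))
    print("WS2012 snapshots:   ", snapshots_ws2012(csvs))
```

The interesting part is that the WS2012 count no longer depends on how many nodes are hosting the VMs being backed up.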

SCSI3 Reservation Starvation Should Go Away

Every node in a Hyper-V cluster used SCSI3 persistent reservations and SCSI3 reservations to connect to CSVs.  Every SAN has a finite number of those persistent reservations and reservations.  The SCSI3 persistent reservations were a bottleneck.  No manufacturer shares that number, and it’s a hell of a lot smaller than you’d expect – we typically find out about it during a support call.  To compound this, each host required a number of SCSI3 persistent reservations, and that multiplied based on:

  • Number of hosts in the cluster
  • Number of CSVs
  • Number of storage channels per host (possibly even a multiple of the number of physical HBAs/NICs, depending on the SAN)

What happens when you deploy too many nodes, CSVs, or storage channels?  CSVs go offline.  Yup.  The SAN is starved of resources to connect the hosts to the LUNs.  I saw this with small deployments with an entry level SAN, 3 hosts, and 5 CSVs.  And it ain’t pretty.
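To get a feel for how quickly that multiplication adds up, here is a rough back-of-the-envelope sketch in Python. It assumes the simplest possible model, a straight product of the three factors listed above. That is my assumption: real SANs may count differently, and the value of 2 storage channels per host is hypothetical. Treat the result as an order-of-magnitude estimate, not a number your SAN vendor would sign off on.

```python
def legacy_reservation_estimate(hosts: int, csvs: int, channels_per_host: int) -> int:
    """Estimate legacy (WS2008 R2) reservation usage as a plain product.

    Assumption: one reservation per host, per CSV, per storage channel.
    Some SANs apply a further multiple, so this is a lower bound at best.
    """
    return hosts * csvs * channels_per_host


if __name__ == "__main__":
    # The small deployment mentioned above: 3 hosts and 5 CSVs.
    # channels_per_host=2 is a hypothetical figure for illustration.
    print(legacy_reservation_estimate(hosts=3, csvs=5, channels_per_host=2))
```

Under that model, adding one more host to the small deployment adds another 10 reservations, which is why an entry level SAN can run dry sooner than you would guess.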

Imagine a cluster with 64 nodes!?!?!  With Windows Server 2012, each node gets a static key instead of using the legacy persistent reservation multiplication.  That means your SAN can support more CSVs and more hosts running Windows Server 2012 than it would have with Windows Server 2008 R2.  Note that the static key is assigned when the node is added to the cluster.

You can find the static keys in the registry of your cluster nodes in HKEY_LOCAL_MACHINE\Cluster\Nodes\<Node Number>\ReserveID (REG_QWORD).  You can identify which node number is which host by the NodeName (REG_SZ) value.  You can see an example of this below.

image
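If you would rather not click through regedit on every node, something like the following Python sketch could list the keys. It uses the standard library winreg module and the key and value names given above (Cluster\Nodes, NodeName, ReserveID). The function names and the hex formatting are my own choices, and I am assuming the script runs on a cluster node with enough rights to read the cluster hive.

```python
import sys

# Registry location described above, relative to HKEY_LOCAL_MACHINE.
NODES_KEY = r"Cluster\Nodes"


def format_reserve_id(value: int) -> str:
    """Render a REG_QWORD as a fixed-width, 64-bit hexadecimal string."""
    return f"0x{value:016X}"


def read_reserve_ids() -> dict:
    """Return {NodeName: ReserveID as hex} for every node subkey.

    Only works on Windows, and only on a machine that has the cluster hive.
    """
    if sys.platform != "win32":
        raise RuntimeError("The cluster registry hive only exists on Windows cluster nodes")
    import winreg  # imported here so the module still loads on other platforms

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NODES_KEY) as nodes:
        subkey_count = winreg.QueryInfoKey(nodes)[0]
        for index in range(subkey_count):
            node_number = winreg.EnumKey(nodes, index)
            with winreg.OpenKey(nodes, node_number) as node:
                node_name, _ = winreg.QueryValueEx(node, "NodeName")
                reserve_id, _ = winreg.QueryValueEx(node, "ReserveID")
            result[node_name] = format_reserve_id(reserve_id)
    return result
```

Calling read_reserve_ids() on a cluster node should hand back one entry per host, which you can compare against what regedit shows.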

This new system, which replaces persistent reservations, gives you better cluster infrastructure scalability, but it doesn’t eliminate the scalability limits of your SAN.

7 thoughts on “The Big Changes In WS2012 Cluster Shared Volume (CSV)”

  1. Really useful blog you have. Does this mean hardware providers that didn’t work due to issues with CSVs in 2008 R2 should work in 2012? Thinking about the Dell MD Series and possibilities of using it as SAS JBOD storage for SMEs. Thanks for any advice!

    1. No. H/W VSS providers are the responsibility of the OEMs. Don’t look to Microsoft for help there. If you buy a SAN from X, you look for a H/W VSS provider that (a) supports WS2012 CSV and (b) is well written. If it messes up, take it up with company X.

      1. Much appreciated. As I thought. I’ve been trying to get an answer from Dell for a month or so and have not had one that I’m happy with, so I’m not going to risk it until they say it’s tried and tested.

  2. Excellent, and very informative! thanks for the link and an excellent blog. Have your opinions changed on AV on host servers now that there is more compliance in 2012?

    1. The guidance hasn’t changed. I still prefer not to have AV on Hyper-V hosts for the same reasons as before.

  3. Hi Aidan,
    Just to be clear – SANs are still required to support persistent reservations in Server2012 CSV 2.0 – it’s just that it will need fewer of them, 1 per node. Is that correct?
    Thanks,
    Tim
