Why I Dislike Dynamic VHD in Production

With this post, I’m going to try to explain why I recommend against using Dynamic VHD in production.

What is Dynamic VHD?

There are two types of VHD you may use in production:

  • Fixed: All of the allocated storage is consumed up front.  For example, if you want 60 GB of virtual machine storage, a VHD file of roughly 60 GB is created immediately.
  • Dynamic: The VHD only consumes as much space as is required, plus a little buffer.  If you allocate 60 GB of storage, a tiny VHD is created.  It grows in small chunks to accommodate new data, always leaving a small amount of free space.  It kind of works like a SQL Server database/log file.  Eventually the VHD will reach 60 GB and you’ll run out of space in the virtual disk.  (See the sketch after this list.)
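
To make the difference concrete, here’s a minimal sketch using the New-VHD cmdlet from the Hyper-V PowerShell module.  That module shipped with Windows Server 2012 and later; on W2008 R2 you’d use the New Virtual Hard Disk wizard or diskpart instead.  The paths and sizes are just examples:

```powershell
# Minimal sketch: creating the two VHD types with the Hyper-V PowerShell module
# (Windows Server 2012 or later). Paths and sizes are examples only.

# Fixed: the full 60 GB is allocated on the volume immediately.
New-VHD -Path 'C:\ClusterStorage\Volume1\VM1\Data-Fixed.vhd' -SizeBytes 60GB -Fixed

# Dynamic: starts tiny and grows in chunks as the guest writes data,
# up to the 60 GB ceiling.
New-VHD -Path 'C:\ClusterStorage\Volume1\VM1\Data-Dynamic.vhd' -SizeBytes 60GB -Dynamic
```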

With Windows Server 2008 we knew that Dynamic VHD was just too slow for production.  The VHD would grow in very small amounts, and often lots of growth was required at once, creating storage write latency.

Windows Server 2008 R2

We were told that was all fixed when Windows Server 2008 R2 was announced.  Trustworthy names stood in front of large crowds and told us how Dynamic VHD would nearly match Fixed VHD in performance.  The solution was to increase the size of the chunks that were added to the Dynamic VHD.  After RTM there were performance reports that showed us how good Dynamic VHD was.  And sure enough, this was all true … in the perfect, clean, short-lived lab.

For now, let’s assume that the W2008 R2 Dynamic VHD can grow fast enough to meet write activity demand, and focus on the other performance negatives.

Fragmentation

Let’s imagine a CSV with two Dynamic VHDs on it.  Both start out as small files.


Over time, both VHDs will grow.  Because the two files grow in interleaved chunks on the same volume, each one becomes fragmented.  That’s going to impact reads and overwrites.


And over the long term, it doesn’t get any better.


Now imagine that with dozens of VMs, all with one or more Dynamic VHDs, all getting fragmented.

The only thing you can do to combat this is to run a defrag operation on the CSV volume.  Realistically, you’d have to run the defrag at least once per day. Defrag is an example of an operation that’s going to kick in Redirected Mode (or Redirected Access).  And unlike backup, it cannot make use of a Hardware VSS Provider to limit the impact of that operation.  Big and busy CSVs will take quite a while to defrag, and you’re going to impact the performance of production systems.  And you really need to be aware of what that impact would be on multi-site clusters, especially those that are active(site)-active(site).
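
If you do go down this road, here’s a rough sketch of what a defrag pass over every CSV might look like, using the FailoverClusters PowerShell module.  I’m assuming defrag.exe is happy being pointed at a CSV mount-point path; test it on a quiet volume first, and remember the CSV will be in Redirected Mode while it runs:

```powershell
# Rough sketch: run defrag against each CSV mount point in the cluster.
# Assumes the FailoverClusters module (W2008 R2 or later) and that defrag.exe
# accepts a mount-point path. Schedule it for a quiet window.
Import-Module FailoverClusters

foreach ($csv in Get-ClusterSharedVolume) {
    # FriendlyVolumeName is the C:\ClusterStorage\VolumeX mount point for the CSV
    $path = $csv.SharedVolumeInfo[0].FriendlyVolumeName
    Write-Host "Defragmenting $($csv.Name) at $path"
    defrag.exe $path /U /V
}
```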

Odds are you should be doing the occasional CSV defrag even if you use Fixed VHD.  Stuff gets messed up over time on any file system.

Storage Controllers

I am not a storage expert.  But I talked with some Hyper-V engineers yesterday who are.  They told me that they’re seeing SAN storage controllers that really aren’t dealing well with Dynamic VHD, especially if LUN thin provisioning is enabled.  Storage operations are being queued up, leading to latency issues.  Sure, Dynamic VHD and thin provisioning may reduce the amount of disk you need, but at what cost to the performance/stability of your LOB applications, operations, and processes?

CSV and Dynamic VHD

I became aware of this one a while back thanks to my fellow Hyper-V MVPs.  It never occurred to me at all – but it does make sense.

In scenario 1, the CSV1 coordinator role is on Host1.  A VM is running on Host1, and it has Dynamic VHDs on CSV1.  When that Dynamic VHD needs to expand, Host1 can take care of it without any fuss.


In scenario 2, things are a little different.  The CSV1 coordinator role is still on Host1, but the VM is now on Host3.  Now when the Dynamic VHD needs to expand, we see something different happen.


Redirected Mode/Access kicks in so the CSV coordinator (Host1) for CSV1 can expand the Dynamic VHD of the VM running on Host3.  That means all storage operations for that CSV on Hosts 2-3 must traverse the CSV network (maybe 1 Gbps) to Host1, and then go through its iSCSI or Fibre Channel link.  This may be a very brief operation, but it’s still something that has a cumulative effect on latency, with potential storage I/O bottlenecks in the CSV network, Host1, Host1’s HBA, or Host1’s SAN connection.
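
A quick way to see your exposure is to compare which node coordinates (owns) each CSV with which node each clustered VM is actually running on.  This is only a minimal sketch with the FailoverClusters module; mapping each VM’s VHDs back to a specific CSV is still up to you:

```powershell
# Minimal sketch (FailoverClusters module, W2008 R2 or later):
# compare CSV coordinator nodes with the nodes the clustered VMs run on.
Import-Module FailoverClusters

# Which node coordinates (owns) each Cluster Shared Volume right now
Get-ClusterSharedVolume | Select-Object Name, OwnerNode

# Which node each clustered VM resource is running on
# (the 'Virtual Machine' resources, not the 'Virtual Machine Configuration' ones)
Get-ClusterResource | Where-Object { $_.ResourceType -like 'Virtual Machine' } |
    Select-Object OwnerGroup, OwnerNode
```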


Now take a moment to think bigger:

  • Imagine lots of VMs, all with Dynamic VHDs, all growing at once.  Will the CSV ever not be in Redirected Mode? 
  • Now imagine there are lots of CSVs with lots of Dynamic VHDs on each.
  • When you’re done with that, now imagine that this is a multi-site cluster with a WAN connection adding bandwidth and latency limitations for Redirected Mode/Access storage I/O traffic from the cluster nodes to the CSV coordinator.
  • And then imagine that you’re using something like an HP P4000/LeftHand where each host must write to each node in the storage cluster, and that redirected storage traffic is going back across that WAN link!

Is your mind boggled yet?  OK, now add in the usual backup operations, and defrag operations (to handle Dynamic VHD fragmentation) into that thought!

You could try to keep the VMs on CSV1 running on Host1.  That’ll eliminate the need for Redirected Mode.  But things like PRO and Dynamic Optimization in SCVMM 2012 will play havoc with that, moving VMs all over the place if they are enabled – and I’d argue that they should be enabled because they increase service uptime, reliability, and performance.

We need an alternative!

Sometimes Mentioned Solution

I’ve seen some say that they use Fixed VHD for data drives where there will be the most impact.  That’s a good start, but I’d argue that you need to think about those system VHDs (the ones with the OS) too.  Those VMs will get patched. Odds are that will happen at the same time, and you could have a sustained level of Redirected Mode while Dynamic VHDs expand to handle the new files.  And think of the fragmentation!  Applications will be installed/upgraded, often during production hours.  And what about Dynamic Memory?  The VM’s paging file will increase, thus expanding the size of the VHD: more redirected I/O and fragmentation.  Fixed VHD seems to be the way to go for me.

My Experience

Not long after the release of Windows Server 2008 R2, a friend of mine deployed a Hyper-V cluster for a business here in Ireland.  They had a LOB application based on SQL Server.  The performance of that application went through the floor.  After some analysis, it was found that the W2008 R2 Dynamic VHDs were to blame.  They were converted to Fixed VHD and the problem went away.

I also went through a similar thing in a hosting environment.  A customer complained about poor performance of a SQL VM.  This was read activity – fragmentation was causing the disk heads to bounce around, increasing latency.  I converted the VHDs to Fixed and the run time for reports immediately improved by 25%.

SCVMM Doesn’t Help

I love the role of the library in SCVMM. It makes life so much easier when it comes to deploying VMs, and SCVMM 2012 expands that exponentially with the deployment of a service.

If you are running a larger environment, or a public/private cloud, with SCVMM, then you will need to maintain a large number of VM templates (VHDs in MSFT lingo, but the rest of the world has been calling them templates for quite a long time). You may have Windows Server 2008 R2 with SP1 Datacenter, Enterprise, and Standard. You may have Windows Server 2008 R2 Datacenter, Enterprise, and Standard. You may have W2008 with SP1 x64 Datacenter, Enterprise, and Standard. You may have W2008 with SP1 x86 Datacenter, Enterprise, and Standard. You get the idea. Lots of VHDs.

Now you get that I prefer Fixed VHDs.  If I build a VM with Fixed VHD and then create a template from it, I’m going to eat up disk space in the library.  Now it appears that some believe that disk is cheap.  Yes, I can get 1 TB of disk for €80.  But that’s a dumb, slow USB 2.0 drive.  That’s not exactly the sort of thing I’d use for my SCVMM library, let alone put in a server or a datacenter.  Server/SAN storage is expensive, and it’s hard to justify 40 GB+ for each template that I’ll store in the library.

The alternative is to store Dynamic VHDs in the library.  But SCVMM does not convert them to Fixed VHD on deployment.  That’s a manual process – and one that is not suitable for the self-service nature of a cloud.  The same applies to storing a VM in the library; it seems pointless to store Fixed VHDs for an offline VM, but converting the stored VMs to Dynamic VHD is yet another manual process.

It seems to me that:

  • If you’re running a cloud, then you realistically have to use Fixed VHDs for your library templates (library VHDs in Microsoft lingo).
  • If you’re a traditional IT-centric deploy/manage environment, then store Dynamic VHD templates, deploy the VM, and then convert from Dynamic VHD to Fixed VHD before you power up the VM (a rough conversion sketch follows this list).
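
For the second option, the conversion step might look something like this sketch.  Convert-VHD and Set-VMHardDiskDrive are from the Hyper-V PowerShell module in Windows Server 2012 and later; on W2008 R2 the equivalent is the Edit Disk wizard in Hyper-V Manager.  The VM must be powered off, and the paths, VM name, and controller locations are examples only:

```powershell
# Rough sketch of the post-deployment conversion step (Hyper-V module,
# Windows Server 2012 or later). The VM must be powered off; paths, VM name,
# and controller locations below are examples only.
$source = 'C:\ClusterStorage\Volume1\VM1\OS-Dynamic.vhd'
$target = 'C:\ClusterStorage\Volume1\VM1\OS-Fixed.vhd'

# Create a fixed copy of the dynamic disk
Convert-VHD -Path $source -DestinationPath $target -VHDType Fixed

# Point the VM at the new fixed disk; remove the old dynamic disk
# once you're happy the VM boots from the new one.
Set-VMHardDiskDrive -VMName 'VM1' -ControllerType IDE -ControllerNumber 0 `
    -ControllerLocation 0 -Path $target
```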

What Do The Microsoft Product Groups Say?

Exchange: “Virtual disks that dynamically expand are not supported by Exchange”.

Dynamics CRM: “Create separate fixed-size virtual disks for Microsoft Dynamics CRM databases and log files”.

SQL Server: “Dynamic VHDs are not recommended for performance reasons”.

That seems to cover most of the foundations for LOB applications in a Microsoft-centric network.

Recommendation

Don’t use Dynamic VHD in production environments.  Use Fixed VHD instead (and pass-through disks on those rare occasions where they’re required).  Yes, you will use more disk with Fixed VHD because of all that white space, but you’ll get the best possible performance while still using flexible and more manageable virtual disks.

If you have implemented Dynamic VHD:

  • Convert to Fixed VHD if you can (this requires shutting down the VM).  Then defrag the CSV, and set up a less frequent recurring defrag job.
  • If you cannot convert, then figure out when you can run frequent defrag jobs.  Try to control VM placement relative to CSV coordinator roles to minimize impact.  The script will need to figure out the CSV coordinator for the relevant CSV (because it can fail over), and Live Migrate VMs on that CSV to the CSV coordinator, assuming that there is sufficient resource and performance capacity on that host (a rough sketch of such a script follows this list).  Yes, the Fixed VHD option looks much more attractive!
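
For what it’s worth, here’s a very rough sketch of what that placement script could look like with the FailoverClusters module.  The CSV name and the list of VM cluster groups that keep their VHDs on that CSV are assumptions you’d have to supply (or discover) for your own environment:

```powershell
# Very rough sketch of the placement script described above
# (FailoverClusters module, W2008 R2 or later).
Import-Module FailoverClusters

$csvName  = 'Cluster Disk 1'       # the CSV of interest (example name)
$vmGroups = 'VM01', 'VM02'         # VM cluster groups with VHDs on that CSV (assumed)

# Find the current coordinator (owner) node for the CSV - it can fail over,
# so look it up every time the script runs.
$coordinator = (Get-ClusterSharedVolume -Name $csvName).OwnerNode.Name

foreach ($vm in $vmGroups) {
    $group = Get-ClusterGroup -Name $vm
    if ($group.OwnerNode.Name -ne $coordinator) {
        # Live Migrate the VM to the coordinator, assuming it has the capacity.
        Move-ClusterVirtualMachineRole -Name $vm -Node $coordinator -MigrationType Live
    }
}
```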

11 thoughts on “Why I Dislike Dynamic VHD in Production”

  1. For what it’s worth, a lot of storage vendors do NOT support dynamically expanding VHDs, especially not when they have thin/over-provisioning, dedupe, etc. … and frankly, in that case, who needs it?

  2. I’ve been going through these conversations a lot lately.  In a clustered environment the issues are obviously more severe.  What we are seeing is that a C: drive with nothing being written to it except patches is fine; patching is staggered, so we don’t notice anything at all.  The interesting thing is that we run thin provisioning and dedupe on the arrays too, but we are also using 2 TB of Flash Cache (PAM), so that may be masking the issue.

  3. I would like to counter this slightly. In an SMB environment we use dynamic disks to reduce backup sizes dramatically (using the Hyper-V VSS provider).
    We do, however, have a partition per VM, so fragmentation is less of an issue.
    If the Hyper-V VSS provider was intelligent enough to not back up free space, we would definitely use fixed VHD.

  4. Hi Aidan,
    I have read your article on dynamic disks, and I have almost run out of disk space on my partition now. What is the best route for me to convert them to fixed? Thanks,

    1. Add disk space, power down the VMs, do a disk conversion to create new disks, edit the settings of the VMs to use the new disks. Remove the old dynamic disks.
