Notes: Enabling Disaster Recovery for Hyper-V Workloads Using Hyper-V Replica

I’m taking notes from VIR302 in this post. I won’t be repeating stuff I’ve blogged about previously.

Outage Information in SMEs

Data from Symantec SMB Disaster Preparedness Survey, 2011. 1288 SMBs with 5-1000 employees worldwide.

Average number of outages per year? 6
What does this outage cost per day? $12,500

That’s an average cost of $75,000 per year (assuming each outage lasts about a day)! To an SME! That could be 2 people’s salaries for a year.

% That do not have a recovery plan: 50%. I think even more businesses in this space don’t have DR.
What is their plan? Scream help and ask for pity.

Hyper-V Replica IS NOT Clustering And IT IS NOT a Cluster Alternative

Hyper-V Replica IS ALSO NOT Backup Replacement

It is a replication solution for replicating VMs to another site. I just know someone is going to post a comment asking if they can use it as a cluster alternative [if this is you – it will be moderated to protect you from yourself so don’t bother. Just re-read this section … slowly].

Failover Clustering HA: Single copy, automated failover within a cluster. Corruption loses the single copy.
Hyper-V Replica: Dual asynchronous copy with recent changes, manual failover designed for replication between sites. Corruption will impact original immediately and DR copy within 10 minutes.
Backup: Historical copy of data, stored locally and/or remotely, with the ability to restore a completely corrupted VM.

Certificates

Certificate-based authentication is for machines that are not domain joined or are members of non-trusted domains. The hoster should issue certs to the customer in the hosted DR scenario.

Compression

You can disable it if you have WAN optimizers that don’t work well with pre-optimised (already compressed) traffic.

Another Recovery History Scenario

The disaster brought down VMs at different points. So VMA died at time A and VMB died at time C. Using this feature, you can reset all VMs back to time A to work off of a similar set of data.

You can keep up to 15 recovery points. Each recovery point is an hour’s worth of data, so that is up to 15 hours of history.

The VSS option (application consistent recovery) fires on a schedule, every two hours in this example. Every 2nd hour in the cycle (or whatever interval you set with the VSS slider) it triggers VSS. All the writes in the guest get flushed. That replica is then sent over.

Note that the Hyper-V VSS action will not interfere with backup VSS actions. Interoperability testing has been done.

So if you’re keeping recovery snapshots, you’ll have standard replicas and application consistent (VSS) replicas. They’ll all be an hour apart, and alternating (if every 2nd hour). Every 5 minutes the changes are sent over, and every 12th one is collapsed into a snapshot (12 x 5 minutes is where the 1 hour comes from).
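To picture how standard and application consistent recovery points interleave, here is a small Python sketch. It only models the schedule described above (hourly recovery points, VSS on every Nth hour). The function name and the labels are mine for illustration, not anything HVR exposes.

```python
def recovery_point_schedule(points, vss_every_hours):
    """Label hourly recovery points as 'standard' or 'vss'.

    Hour 1 is the first recovery point after replication starts.
    Every vss_every_hours-th hourly point is treated as application
    consistent. This is a simplified model of the behaviour in these
    notes, not a description of HVR internals.
    """
    if points < 0 or vss_every_hours < 1:
        raise ValueError("points must be >= 0 and vss_every_hours >= 1")
    return [
        (hour, "vss" if hour % vss_every_hours == 0 else "standard")
        for hour in range(1, points + 1)
    ]


# Six hourly recovery points with the VSS slider at every 2nd hour.
for hour, kind in recovery_point_schedule(points=6, vss_every_hours=2):
    print(f"hour {hour}: {kind}")
```

Moving the interval to 4 in this model leaves only every 4th point application consistent, which is the trade-off against the VSS performance hit in the guests.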

Every 4 hours appears to be the sweet spot because VSS does have a performance impact on the guests.

Clusters

You can replicate to/from clusters. You cannot replicate from one node to another inside a cluster (can’t have duplicate VM GUIDs and you have shared storage).

Alerting

If 20% of cycles in the last hour are missed then you get a warning. This will self-close when replication is healthy again.
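A rough Python sketch of that warning rule, for anyone wanting to reason about it. It assumes 12 cycles per hour (the 5 minute interval) and treats "20% missed" as "20% or more"; the session did not say whether the comparison is inclusive, and the function name is made up.

```python
def replication_warning(missed_cycles, cycles_in_last_hour=12, threshold=0.20):
    """Return True when the share of missed cycles reaches the threshold.

    cycles_in_last_hour=12 assumes a 5 minute replication interval.
    Using >= rather than > is an assumption, not something stated
    in the session.
    """
    if cycles_in_last_hour <= 0:
        raise ValueError("cycles_in_last_hour must be positive")
    if not 0 <= missed_cycles <= cycles_in_last_hour:
        raise ValueError("missed_cycles must be between 0 and cycles_in_last_hour")
    return missed_cycles / cycles_in_last_hour >= threshold


# 2 of 12 missed is about 17%, 3 of 12 is 25%.
print(replication_warning(2))
print(replication_warning(3))
```

With 12 cycles in the hour, that means the third missed cycle is the one that tips a VM into the warning state under these assumptions.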

PowerShell

24 Hyper-V Replica cmdlets:

19 of them via Get-Command -Module Hyper-V | where {$_.Name -like "*replication*"}
5 more via Get-Command -Module Hyper-V | where {$_.Name -like "*failover*"}

Measure-VMReplication will return status/health of Hyper-V Replica on a per-VM basis.

Measure-VMReplication | where {$_.ReplicationHealth -eq "Critical"}

Could use that as a part of a scheduled script, and then send an email with details of the problem.

Replica Mechanism

The speakers refer to the HRL (Hyper-V Replica Log) process as a write splitter. They use HTTP(S) for WAN traffic robustness. It’s also hosting company friendly. Before sending, the current HRL is swapped out for a new HRL.

There is a threshold where the HRL cannot exceed half the VHD size. If WAN/storage goes down and this happens then HVR goes into a “resync state” (resynchronisation). When the problem goes away HVR automatically re-establishes replication.
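The threshold rule is simple enough to sketch in Python. This is only an illustration of the "half the VHD size" rule as I noted it; the function name and the byte-based inputs are hypothetical, and how HVR measures the log internally was not covered.

```python
GIB = 1024 ** 3


def needs_resync(hrl_bytes, vhd_bytes):
    """True when the replica log has grown past half the size of the VHD."""
    if hrl_bytes < 0 or vhd_bytes <= 0:
        raise ValueError("hrl_bytes must be >= 0 and vhd_bytes must be > 0")
    return hrl_bytes > vhd_bytes / 2


# A 100 GiB VHD: a 30 GiB log is fine, a 60 GiB log is over the limit.
print(needs_resync(30 * GIB, 100 * GIB))
print(needs_resync(60 * GIB, 100 * GIB))
```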

VM Mobility

HVR policy follows the VM with any kind of migration scenario. Remember that replication is host/host. When the VM is moved from host A to host B, replication for the VM from host A is broken. Replication for the VM starts on host B. Host B must already be authorized on the replica host(s); this is easier with the Hyper-V Replica Broker on a cluster.

IP Addressing VMs In DR Site

Inject static address – Simplest option IMO
Auto-assignment via DHCP – Worst option IMO because DHCP on servers is messy
Preserve IP address via Network Virtualisation – Most scalable option for DR clouds IMO with seamless failover for customers with VMs on a corporate WAN. It’s the only one that gives seamless name resolution, I think, unless you spend lots on IP virtualisation in the WAN.

Failover Types

Planned Failover (downtime during failover sequence):

Shut down primary VM
Send last log – run planned failover action from primary site VM. That’ll do the rest for us.
Failover replica VM
Reverse replication

Test Failover (no downtime):

You can test any recovery point on an isolated test network without affecting replication.

Start test failover, selecting which copy to test with (if enabled). It does the rest for you.
Copies VM (new copy called “<original VM name> – test”) using a snapshot
Connects VM to test virtual switch
Starts up test VM

Network Planning

Capacity planning is critical. Designed for low bandwidth
Estimate rate of data change
Estimate for peak usage and effective network bandwidth

My idea is to analyse incremental backup size, and estimate how much data is created every 5 minutes.
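Here is that idea as a back-of-envelope Python sketch. It assumes the change is spread evenly across the backup window (it won’t be, so treat the result as a floor and plan for peaks), uses 1 GB = 1024 MB, and the function names and the compression_ratio parameter are mine.

```python
def churn_per_cycle_mb(incremental_backup_gb, backup_window_hours=24.0, cycle_minutes=5.0):
    """Average data changed per replication cycle, in MB.

    Assumes the incremental backup covers backup_window_hours and that
    change is spread evenly over that window.
    """
    cycles = backup_window_hours * 60 / cycle_minutes
    return incremental_backup_gb * 1024 / cycles


def required_mbps(mb_per_cycle, cycle_minutes=5.0, compression_ratio=1.0):
    """Sustained megabits per second needed to ship one cycle within the cycle.

    compression_ratio=1.0 means no compression; 2.0 would mean the data
    halves in size on the wire (a guess you would need to measure).
    """
    seconds = cycle_minutes * 60
    return mb_per_cycle * 8 / compression_ratio / seconds


# Example: a 20 GB nightly incremental backup.
per_cycle = churn_per_cycle_mb(20)
print(f"{per_cycle:.1f} MB per 5 minute cycle")
print(f"{required_mbps(per_cycle):.2f} Mbps sustained, before compression")
```

An incremental backup may also understate churn, because a block that is rewritten many times between backups probably only shows up once in the backup but gets replicated each time it changes.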

Use WS2012 QoS to throttle replication traffic.

Replicating multiple VMs in parallel:

Higher concurrency leads to resource contention and latency
Lower concurrency leads to underutilisation and less protection for the business

Manage initial replication through scheduling. Don’t start everything at once for online initial synchronisation.

What they have designed for:

Server Impact of HVR

On the source server:

Storage space: proportional to the writes in the VM
IOPS is approx 1.5 times write IOPS

On the replica server:

Storage space: proportional to the write churn. Each additional recovery point approx 10% of the base VHD size.
Storage IOPS: 0.6 times write IOPS to receive and convert. 3-5 times write IOPS to receive, apply, merge, for additional recovery points.

There is a price to pay for recovery points. RECOMMENDATION by MSFT: Do not use replica servers for normal workloads if using additional recovery points because of the IOPS price.

Memory: Approx 50 MB per replicating VM

CPU impact: <3%
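To turn those rules of thumb into numbers for a given host, a Python sketch like this might help. It just multiplies out the factors from the session (1.5x, 0.6x, 3-5x, 10% per recovery point, 50 MB per VM); the function name, parameters and dictionary keys are hypothetical, and real sizing should be measured.

```python
def hvr_server_impact(write_iops, base_vhd_gb, replicating_vms, extra_recovery_points=0):
    """Apply the session's rule-of-thumb factors to one workload.

    Returns a dict of estimates. The replica IOPS figure is a
    (low, high) range because the session quoted 3-5x when additional
    recovery points are kept.
    """
    if extra_recovery_points > 0:
        # Receive, apply and merge for additional recovery points.
        replica_iops_range = (3 * write_iops, 5 * write_iops)
    else:
        # Receive and convert only.
        replica_iops_range = (0.6 * write_iops, 0.6 * write_iops)

    return {
        "source_iops": 1.5 * write_iops,
        "replica_iops_range": replica_iops_range,
        # Each additional recovery point is roughly 10% of the base VHD.
        "recovery_point_storage_gb": 0.10 * base_vhd_gb * extra_recovery_points,
        "memory_mb": 50 * replicating_vms,
    }


# Example: 200 write IOPS, 100 GB of VHD, 10 VMs, 4 extra recovery points.
for name, value in hvr_server_impact(200, 100, 10, 4).items():
    print(f"{name}: {value}")
```

The jump from 0.6x to 3-5x on the replica side is the IOPS price behind the MSFT recommendation above.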

