2014
01.14

I’ve had a number of requests to specify the pieces of a solution where there is a Windows Server 2012 R2 Hyper-V cluster that uses SMB 3.0 to store virtual machines on a Scale-Out File Server with Storage Spaces (JBOD). So that’s what I’m going to try to do with this post. Note that I am not going to bother with pricing:

  • It takes too long to calculate
  • Prices vary from country to country
  • List pricing is usually meaningless; work with a good distributor/reseller and you’ll get a bid/discount price.
  • Depending on where you live in the channel, you might be paying distribution price, trade price, or end-customer price, and that determines how much margin has been added to each component.
  • I’m lazy

Scale-Out File Server

Remember that an SOFS is a cluster that runs a special clustered file server role for application data. A cluster requires shared storage. That shared storage will be one or more Mini-SAS-attached JBOD trays (on the Storage Spaces HCL) with Storage Spaces supplying the physical disk aggregation and virtualization (normally done by SAN controller software).

On the blade versus rack server question: I always go rack server. I’ve been burned by the limited flexibility and high costs of blades. Sure you can get 64 blades into a rack … but at what cost!?!?! FlexFabric-like solutions are expensive, and strictly speaking, not supported by Microsoft – not to mention they limit your bandwidth options hugely. The massive data centres that I’ve seen and been in use 1U and 2U rack servers.  I like 2U rack servers over 1U because 1U rack servers such as the R420 have only 1 full-height and 1 half-height PCI expansion slots. That half-height slot makes for tricky expansion.

For storage (and more) networking, I’ve elected to go with RDMA networking. Here you have two good choices:

  • iWARP: More affordable and running at 10 GbE – what I’ve illustrated here. Your vendor choice is Chelsio.
  • Infiniband: Amazing speeds (56 Gbps with faster to come) but more expensive. Your vendor choice is Mellanox.

I’ve ruled out RoCE. It’s too damned complicated – just ask Didier Van Hoye (@workinghardinit).

There will be two servers:

  • 2 x Dell R720: Dual Xeon CPU, 6 GB RAM, rail kits, on-board quad port 1 GbE NICs. The dual CPU gives me scalability to handle lots of hosts/clusters. The 4 x 1 GbE NICs are teamed (dynamic load distribution) for management functionality. I’d upgrade the built-in iDRAC Essentials to the Enterprise edition to get the KVM console and virtual media features. A pair of disks in RAID1 configuration is used for the OS in each of the SOFS nodes.
  • 10 x 1 GbE cables: This is to network the 4 x 1 GbE onboard NICs and the iDRAC management port in each server. Who needs KVM when you’ve already bought it in the form of iDRAC?
  • 2 x Chelsio T520-CR: Dual port 10 GbE SFP+ iWARP (RDMA) NICs. These two rNICs are not teamed (NIC teaming is not compatible with RDMA). They will reside on different VLANs/subnets for SMB Multichannel (cluster requirement). The role of these NICs is to converge SMB 3.0 storage and cluster communications. I might even use these networks for backup traffic.
  • 4 x SFP+ cables: These are to connect the two servers to the two SFP+ 10 GbE switches.
  • 2 x LSI 9280-8e Mini-SAS HBAs: These are dual port Mini-SAS adapters that you insert into each server to connect to the JBOD(s). Windows MPIO provides the path failover.
  • 2 x Windows Server Standard Edition: We don’t need virtualization rights on the SOFS nodes. Standard edition includes Failover Clustering.

Regarding the JBODs:

Only use devices on the Microsoft HCL for your version of Windows Server. There are hardware features in these “dumb” JBODs that are required. And the testing process will probably lead to the manufacturer tweaking their hardware.

Note that although “any” dual channel SAS drive can be used, some firmware versions are actually better than others. DataOn Storage maintains its own HCL of tested HDDs, SSDs, and HBAs. Stick with the list that your JBOD vendor recommends.

How many and what kind of drives do you need? That depends. My example is just that: an example.

How many trays do you need? Enough to hold your required number of drives :D Really though, if I know that I will scale out to fill 3 trays then I will buy those 3 trays up front. Why? Because 3 trays is the minimum required for tray fault tolerance with 2-way mirror virtual disks (LUNs). Simply going from 1 tray to 2 and then 3 won’t do because data does not relocate.

Also remember that if you want tiered storage then there is a minimum number of SSDs (STRONGLY) recommended per tray.

Regarding using SATA drives: DON’T DO IT! The available interposer solution is strongly discouraged, even by DataOn.  If you really need SSD for tiered storage then you really need to pay (through the nose).

Here’s my EXAMPLE configuration:

  • 3 x DataOn Storage DNS-1640D: 24 x 2.5” disk slots in each 2U tray, each with a blank disk caddy for a dual channel SAS SSD or HDD drive. Each has dual boards for Mini-SAS connectivity (A+B for server 1 and A+B for server 2), and A+B connectivity for tray stacking. There is also dual PSU in each tray.
  • 18 x Mini-SAS cables: These connect the LSI cards in the servers to the JBOD(s) and daisy chain (stack) the trays. I’m not looking at a diagram and I’m tired, so 18 might not be the right number. The connections are fault tolerant – hence the high number of cables. They’re short cables because the servers sit on top of/under the JBOD trays and the entire storage solution (servers + JBODs) is just 10U in height.
  • 12 x STEC S842E400M2 400GB SSD: Go google the price of these for a giggle! These are not your typical (or even “enterprise”) SSD that you’ll stick in a laptop. I’m putting 4 into each JBOD, the recommended minimum number of SSDs in tiered storage if doing 2-way mirroring.
  • 48 x Seagate ST900MM0026 900 GB 10K SAS HDD: This gives us the bulk of the storage. There are 20 slots free (after the SSDs) in each JBOD and I’ve put 16 disks into each. That gives me loads of capacity and some wiggle room to add more disks of either type.

And that’s the SOFS, servers + JBODs with disks.
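
For anyone who wants to sanity check the storage numbers in the example, here is a minimal Python sketch. The disk counts and sizes come from the spec above; the "usable = raw / 2" model for 2-way mirroring is my simplification and ignores pool metadata, write-back cache, and any capacity you hold back for rebuilds. The function and variable names are made up for illustration.

```python
# Rough sizing of the EXAMPLE SOFS storage. Figures are from the spec above.
# Assumption: a 2-way mirror stores 2 copies, so usable is about raw / 2.
TRAYS = 3
SLOTS_PER_TRAY = 24
SSD_PER_TRAY, SSD_GB = 4, 400
HDD_PER_TRAY, HDD_GB = 16, 900
MIRROR_COPIES = 2


def tier_summary(disks_per_tray, disk_gb):
    """Return disk count, raw GB and mirrored GB for one tier across all trays."""
    disks = TRAYS * disks_per_tray
    raw_gb = disks * disk_gb
    return {"disks": disks, "raw_gb": raw_gb, "mirrored_gb": raw_gb // MIRROR_COPIES}


ssd = tier_summary(SSD_PER_TRAY, SSD_GB)
hdd = tier_summary(HDD_PER_TRAY, HDD_GB)
free_slots_per_tray = SLOTS_PER_TRAY - SSD_PER_TRAY - HDD_PER_TRAY

# 2 x 2U servers plus 3 x 2U JBOD trays.
rack_units = 2 * 2 + TRAYS * 2

print("SSD tier:", ssd)
print("HDD tier:", hdd)
print("Free slots per tray:", free_slots_per_tray)
print("Rack units:", rack_units)
```

Change the constants to match your own tray count and disk sizes; just keep the disks spread evenly across the trays if you want tray fault tolerance.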

Just to remind you: it’s a sample spec. You might have one JBOD, you might have 4, or you might go with the 60 disk slot models. It all depends.

Hyper-V Hosts

My hosting environment will consist of one Hyper-V cluster with 8 nodes. It could just as easily be:

  • A few clusters, all sharing the same SOFS
  • One or more clusters with some non-clustered hosts, all sharing the same SOFS
  • Lots of non-clustered hosts, all sharing the same SOFS

One of the benefits of SMB 3.0 storage is that a shared folder is more flexible than a CSV on a SAN LUN. There are more sharing options, and this means that Live Migration can span the traditional boundary of storage without involving Shared-Nothing Live Migration.

Regarding host processors, the L2/L3 cache plays a huge role in performance. Try to get as new a processor as possible. And remember, it’s all Intel or all AMD; do not mix the brands.

There are lots of possible networking designs for these hosts. I’m going to use the design that I’ve implemented in the lab at work, and it’s also one that Microsoft recommends. A pair of rNICs (iWARP) will be used for the storage and cluster networking, residing on the same two VLANs as the cluster/storage networks that the SOFS nodes are on. Then two other NICs are going to be used for host and VM networking. These two NICs could be 1 GbE or 10 GbE or faster, depending on the needs of your VMs. I’ve got 4 pNICs to play with so I will team them.

  • 8 x Dell R720: Dual Xeon CPU, 256 GB RAM, rail kits, on-board quad port 1 GbE NICs. These are some big hosts. Put lots of RAM in because that’s the cheapest way to scale. CPU is almost never the 1st or even 2nd bottleneck in host capacity. The 4 x 1 GbE NICs are teamed (dynamic load distribution) for VM networking and management functionality. I’d upgrade the built-in iDRAC Essentials to the Enterprise edition to get the KVM console and virtual media features. A pair of disks in RAID1 configuration is used for the management OS.
  • 40 x 1 GbE cables: This is to network the 4 x 1 GbE onboard NICs and the iDRAC management port in each host. Who needs KVM when you’ve already bought it in the form of iDRAC?
  • 8 x Chelsio T520-CR: Dual port 10 GbE SFP+ iWARP (RDMA) NICs. These two rNICs are not teamed (NIC teaming is not compatible with RDMA). They will reside on the same two VLANs/subnets as the SOFS nodes. The role of these NICs is to converge SMB 3.0 storage, SMB 3.0 Live Migration (you gotta see it to believe it!), and cluster communications. I might even use these networks for backup traffic.
  • 16 x SFP+ cables: These are to connect the eight hosts to the two SFP+ 10 GbE switches.
  • 8 x Windows Server Datacenter Edition: The Datacenter edition gives us unlimited rights to install Windows Server into VMs that will run on these licensed hosts, making it the economical choice. Enabling Automatic Virtual Machine Activation in the VMs will simplify VM guest OS activation.

There are no HBAs in the Hyper-V hosts; the storage (SOFS) is accessed via SMB 3.0 over the rNICs.
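
The cable counts in both lists follow the same per-server rule, so they are easy to recompute if you change the number of nodes. A minimal sketch, assuming every server (SOFS node or host) has the 4 onboard 1 GbE ports, 1 iDRAC port, and one dual port rNIC described above; the `cables` helper is a name I made up for this example.

```python
# Per-server port counts taken from the spec above.
ONBOARD_1GBE_PORTS = 4
IDRAC_PORTS = 1
RNIC_PORTS = 2  # one dual port Chelsio card per server


def cables(servers):
    """Return the 1 GbE and SFP+ cable counts for a group of identical servers."""
    return {
        "1gbe": servers * (ONBOARD_1GBE_PORTS + IDRAC_PORTS),
        "sfp+": servers * RNIC_PORTS,
    }


sofs = cables(2)   # the two SOFS nodes
hosts = cables(8)  # the eight Hyper-V hosts

print("SOFS nodes:", sofs)
print("Hyper-V hosts:", hosts)
```

Run it with a different host count to see how quickly the 1 GbE cabling grows compared to the converged 10 GbE cabling.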

Other Stuff

Hmm, we’re going to need:

  • 2 x SFP+ 10 GbE Switches with DCB support: Data Center Bridging really is required to do QoS of RDMA traffic. You would need PFC (Priority Flow Control) support if using RoCE for RDMA (not recommended – do either iWARP or Infiniband). Each switch needs at least 12 ports – allow for scalability. For example, you might put your backup server on this network.
  • 2 x 1 GbE Switches: You really need a pair of 48 port top-of-rack switches in this design due to the number of 1 GbE ports being used and the need for growth.
  • Rack
  • PDU
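
The switch sizing above can be checked with a quick port budget. This is only a sketch under two assumptions of mine: each server cables one rNIC port to each of the two 10 GbE switches, and the 1 GbE ports (NICs plus iDRAC) are spread as evenly as possible across the two 1 GbE switches.

```python
import math

SOFS_NODES, HYPERV_HOSTS = 2, 8
SERVERS = SOFS_NODES + HYPERV_HOSTS
SWITCHES_PER_NETWORK = 2

# 10 GbE: one rNIC port per server lands on each switch (assumption).
ten_gbe_ports_per_switch = SERVERS
ten_gbe_spare_on_12_port = 12 - ten_gbe_ports_per_switch

# 1 GbE: 4 onboard NIC ports + 1 iDRAC port per server.
one_gbe_ports_total = SERVERS * (4 + 1)
fits_on_one_48_port_switch = one_gbe_ports_total <= 48
one_gbe_ports_per_switch = math.ceil(one_gbe_ports_total / SWITCHES_PER_NETWORK)
one_gbe_spare_per_switch = 48 - one_gbe_ports_per_switch

print("10 GbE ports used per switch:", ten_gbe_ports_per_switch)
print("Spare ports on a 12 port 10 GbE switch:", ten_gbe_spare_on_12_port)
print("1 GbE ports needed in total:", one_gbe_ports_total)
print("Fits on a single 48 port switch:", fits_on_one_48_port_switch)
print("Spare 1 GbE ports per 48 port switch:", one_gbe_spare_per_switch)
```

The 1 GbE total already exceeds a single 48 port switch before you add a backup server or management nodes, which is why a pair is listed.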

And there’s probably other bits. For example, you might run a 2-node cluster for System Center and other management VMs. The nodes would have 32-64 GB RAM each. Those VMs could be stored on the SOFS or even on a JBOD that is directly attached to the 2 nodes with Storage Spaces enabled. You might run a server with lots of disk as your backup server. You might opt to run a pair of 1U servers as physical domain controllers for your infrastructure.

I recently priced up a kit, similar to above. It came in much cheaper than the equivalent blade/SAN configuration, which was a nice surprise. Even better was that the SOFS had 3 times more storage included than the SAN in that pricing!

30 comments so far

  1. Very interesting !

    I was planning for something similar but on the cheaper side (10Gb but no RDMA, 7k2 + SSD SAS instead of 10k + SSD SAS) but this post makes me want to upgrade the budget :)

    Of course I know this is just an example, but I’ve got a few questions about your design:

    - you go for 2xSOFS nodes and 3xJBOD : i was thinking of maybe doing 3/2 or 3/3, so do you think complete JBOD Failure is higher prob than 2 SOFS failure (or one failure during maintenance of the other node) ?

    - if i go for the 3 JBOD and 2way mirror, do i need to populate each JBOD with one third of the disks like you did, or can i just put one half in two (let’s say i only have 8 SSD) and keep the third JBOD empty for witness purpose and future upgrade ?

    Thx a lot for your blog, it’s a pleasure to read (as were your last two books!).

    • If you plan on patching then expect each SOFS node to go offline at least once per month. If you don’t plan on patching then please do your boss/customer a favour and resign now ;-) JBODs are unlikely to fail because they are fairly dumb but the possibility is there. The disks must be spread evenly across the JBODs if you want JBOD fault tolerance.

      • Of course i plan to update my SOFS ! that’s why i’v got concerns about only having 2 nodes in it ;-) (if patch goes wrong on the first node, we’ll be running on one leg)

        What i meant about the third JBOD being empty, is that one storage pool doesn’t need to span across all JBODs, does it ? (of course, i need all my columns to be mirrored in 2 separated JBODs)

        I fear i’m not being clear, so i’ll explain that again :

        If i create a 2 way mirror volume with disks from JBOD1 and JBOD2 (all columns from JBOD1 being mirrored to columns in JBOD2) will JBOD3 help to protect against JBOD failure ?

        Or do I need to put columns in each JBOD? (columns 1a 1b in JBOD 1 / 2a 2c JBOD 2 / 3b 3c JBOD 3)

        Thanks for your time.

        • Or i’m completly wrong and 3 way mirror is the only way to get enclosure fault tolerance ?

          • You are completely wrong. Read the above.

        • Read the above.

  2. Hello Aidan,

    first i want to thank you for your great blog! You make even complicated topics easy to understand! I’m following your blog every day since one year and i learned so much about Hyper-V and the new storage features in Windows.

    I know you already posted about a small environment deployment with two Hyper-V/Storage hosts and one JBOD. It would be great if you could create a post like this for the small environment solution. I already tested the Clustered Storage Spaces inside of Hyper-V (you described that) but I don’t understand the exact setup of Hyper-V ON the Clustered Storage Spaces and whether I get a small HA environment with this.

    • Take the SOFS node design above, add RAM.

  3. Do you have any recommendations on the SFP+ switches? Dell PowerConnect 8132F?

    • None at all. Not a networks guy.

    • Aaron:
      Dell PowerConnect 8100 series are being replaced by Dell Networking N4000 series so I would recommend the latter. Of course if possible you can also go up one notch to the Dell Force10 S4810.

    • We use the PC8132F as top of rack switches & PFC config is easy enough and can be tested. The more challenging part is ETS setup and testing, especially when you have multiple hops. They are good switches. Take note of what Andreas said on the N4000 series and see if that fits your schedule. Otherwise the PC8100 series is here today. You can do Force10, we leverage the S4810 switches with VTL as our core switches.
      Another issue are network guys who don’t understand SMB Direct & look at you like you’re nuts. DCB for them is iSCSI/FCoE and they don’t like convergence on storage switches.

  4. Excellent article, thank you.

    Just a couple of questions in respect of the SOFS disk controllers. You mention the LSI 9280-8e which I believe is a RAID controller. Any reason for not choosing a simpler SAS controller, say LSI 9300-8e? Also, if considering more than 4 SSD’s in a JBOD tray, at what point does a 6GB/s controller become a bottleneck?

    • Apologies, I just read your article again more carefully and it is a SAS controller you refer to. So just the one question please regarding 6GBs v 12GBs controllers when used with 12 or more fast SSDs, as you might plug into a 60 bay JBOD?

      Thanks again.

      • Don’t know if 12 Gbps SAS controllers have WS2012 R2 support yet. And you’d also need to verify that they are also on the HCL of the JBOD.

        • Bob is correct regarding the LSI 9280-8e which are LSI MegaRAID controllers and thus wouldn’t work with clustered storage spaces since you need to expose the drives to the SOFS nodes as simple SAS-connected drives. Just a simple mistake I assume since you know this stuff better than most.

          Suitable choices are SAS HBAs from the HCL for the DataON DNS-1640 JBOD mentioned. The obvious choice there would be the LSI SAS HBA 9207-8e 6Gb/s which can take advantage of PCIe 3.0 x8.

          Personally I would contact DataON and ask about the LSI SAS HBA 9300-8e and 9300-16e which are the new 12Gb/s SAS HBA’s from LSI. They are on the Windows Server 2012 R2 HCL so DataON compatibility would be the only thing to check for.

          • Incorrect. The devices (a) are on the DataOn HCL and (b) are being used in my lab for a SOFS.

          • Stumbled upon the reason why LSI 9280-8e works for Aidan. Most LSI MegaRAID controllers can be flashed into either IR (Integrated RAID) or IT (Initiator Target). The IR mode gives you RAID and IT doesn’t. IT should be used when you need to present drives directly to the OS without RAID, which makes it just the thing you want for Storage Spaces or ZFS.

            So if you want the added flexibility of controllers that can be used either as RAID controllers or as HBAs, that is possible. Just make sure you can flash your controller into an HBA (IT mode in LSI’s world) if you want it to work with Storage Spaces.

    • The 6 Gbps is in fact 4 * 6 Gbps channels. Same bottleneck as you would have in a modern SAN.

  5. I would seriously consider the RDMA option with Chelsio. We have just managed to purchase 4x Chelsio T540-CR quad port SFP+ cards for the same price as, if not cheaper than, the equivalent Intel/Broadcom cards without RDMA. We also found the quad ports came in at only a fraction more than the dual ports, so we thought why not, to allow for future expansion.

  6. Do you have also any recommendations regarding the switching equipment?

    • Nothing other than the above.

  7. Hi Aidan,
    Can you elaborate on your comment that “FlexFabric-like solutions are expensive, and strictly speaking, not supported by Microsoft”

    Daniel

    • What I said.

  8. What’s your opinion on using a DataON CIB for SOFS instead of what you listed above?

    • Can do, absolutely.

  9. Hey Aidan,
    What is your opinion of the recommended configured maximums. I got this from the FAQ on the TechNet wiki.

    In Windows Server 2012 R2, the following are the recommended configuration limits:

    •Up to 240 physical disks in a storage pool; you can, however, have multiple pools of 240 disks.
    •Up to 480 TB of capacity in a single storage pool.
    •Up to 128 storage spaces in a single storage pool.
    •In a clustered configuration, up to four storage pools per cluster.

    Would you think it could mean I could use 240 4TB disks and split it up into many storage pools. Then make a mirrored 480TB virtual disk. Since its mirrored my volumes should be 240TB roughly of course mirrored volumes using NTFS? That way I fit under the 256TB max volume size of NTFS? Also be supported for the disk limit and the max recommended capacity of the storage pool? The numbers in this theory are all rough numbers… I’m tired like you.

    Thanks for your awesome blog and all the useful info.

    • 240 * 4 TB > 480 TB. So … no.

  10. I have my SOFS set up without RDMA. I’m having a 90 Mbps write throughput and a 300Mbps read throughput using Intel’s X540T2 nics and a Netgear M7100-24x switch. Would RDMA help that much?

    • You’ll get some improvement on latency. Real benefit of RDMA is that you offload from the CPU. So is your load impacting CPU to the point of becoming a bottleneck or impacting VM performance? If not, then you probably don’t need RDMA.
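
The pool limit question in comment 9 comes down to simple arithmetic, which can be sketched as follows. The two limits are the recommended WS2012 R2 figures quoted in that comment; treating the 480 TB limit as raw (pre-mirroring) pool capacity is my reading of the reply, and `pool_within_limits` is a made-up helper name.

```python
# Recommended limits quoted in comment 9 (Windows Server 2012 R2).
MAX_DISKS_PER_POOL = 240
MAX_POOL_TB = 480


def pool_within_limits(disk_count, disk_tb):
    """True if a pool of identical disks stays inside both recommended limits.

    Assumption: the capacity limit is applied to raw pool capacity.
    """
    raw_tb = disk_count * disk_tb
    return disk_count <= MAX_DISKS_PER_POOL and raw_tb <= MAX_POOL_TB


print(pool_within_limits(240, 4))  # 960 TB raw, over the capacity limit
print(pool_within_limits(120, 4))  # 480 TB raw, exactly at the limit
print(pool_within_limits(240, 2))  # 480 TB raw with the full 240 disks
```

In other words, with 4 TB disks you would hit the capacity recommendation at 120 disks, long before the 240 disk recommendation.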
