Few Hyper-V topics burn up the Internet quite like “performance”. No matter how fast it goes, we always want it to go faster. If you search even a little, you’ll find many articles with long lists of ways to improve Hyper-V’s performance. The less focused articles start with general Windows performance tips and sprinkle some Hyper-V-flavored spice on them. I want to use this article to tighten the focus down on Hyper-V hardware settings only. That means it won’t be as long as some others; I’ll just think of that as wasting less of your time.
1. Upgrade Your System
I guess this goes without saying but every performance article I write will always include this point front-and-center. Each piece of hardware has its own maximum speed. Where that speed barrier lies in comparison to other hardware in the same category almost always correlates directly with cost. You cannot tweak a go-kart to outrun a Corvette without spending at least as much money as just buying a Corvette, and that's without considering the time element. If you bought slow hardware, then you will have a slow Hyper-V environment.
Fortunately, this point has a corollary: don’t panic. Production systems, especially server-class systems, almost never experience demand levels that compare to the stress tests that admins put on new equipment. If typical load levels were that high, it’s doubtful that virtualization would have caught on so quickly. We use virtualization for so many reasons nowadays, we forget that “cost savings through better utilization of under-loaded server equipment” was one of the primary drivers of early virtualization adoption.
2. BIOS Settings for Hyper-V Performance
Don’t neglect your BIOS! It contains some of the most important settings for Hyper-V.
- C States. Disable C States! Few things impact Hyper-V performance quite as strongly as C States! Names and locations will vary, so look in areas related to Processor/CPU, Performance, and Power Management. If you can’t find anything that specifically says C States, then look for settings that disable/minimize power management. C1E is usually the worst offender for Live Migration problems, although other modes can cause issues.
- Virtualization support: A number of features have popped up through the years, but most BIOS manufacturers have since consolidated them all into a global “Virtualization Support” switch, or something similar. I don’t believe that current versions of Hyper-V will even run if these settings aren’t enabled. Here are some individual component names, for those special BIOSs that break them out:
- Virtual Machine Extensions (VMX)
- AMD-V — AMD CPUs/mainboards. Be aware that Hyper-V can’t (yet?) run nested virtual machines on AMD chips
- VT-x, or sometimes just VT — Intel CPUs/mainboards. Required for nested virtualization with Hyper-V in Windows 10/Server 2016
- Data Execution Prevention: DEP means less for performance and more for security. It’s also a requirement. But, we’re talking about your BIOS settings and you’re in your BIOS, so we’ll talk about it. Just make sure that it’s on. If you don’t see it under the DEP name, look for:
- No Execute (NX) — AMD CPUs/mainboards
- Execute Disable (XD) — Intel CPUs/mainboards
- Second Level Address Translation: I'm including this for completeness. It's been many years since any system was built new without SLAT support. If you do have such a system, following every point in this post to the letter still won't make it fast. Client Hyper-V has required SLAT since Windows 8; server Hyper-V requires it as of Server 2016. Names that you will see SLAT under:
- Nested Page Tables (NPT)/Rapid Virtualization Indexing (RVI) — AMD CPUs/mainboards
- Extended Page Tables (EPT) — Intel CPUs/mainboards
- Disable power management. This goes hand-in-hand with C States. Just turn off power management altogether. Get your energy savings via consolidation. You can also buy lower wattage systems.
- Use Hyperthreading. I’ve seen a tiny handful of claims that Hyperthreading causes problems on Hyper-V. I’ve heard more convincing stories about space aliens. I’ve personally seen the same number of space aliens as I’ve seen Hyperthreading problems with Hyper-V (that would be zero). If you’ve legitimately encountered a problem that was fixed by disabling Hyperthreading AND you can prove that it wasn’t a bad CPU, that’s great! Please let me know. But remember, you’re still in a minority of a minority of a minority. The rest of us will run Hyperthreading.
- Disable SCSI BIOSs. Unless you are booting your host from a SAN, kill the BIOSs on your SCSI adapters. They don't do anything good or bad for a running Hyper-V host, but they do slow down physical boot times.
- Disable BIOS-set VLAN IDs on physical NICs. Some network adapters support VLAN tagging through boot-up interfaces. If you then bind a Hyper-V virtual switch to one of those adapters, you could encounter all sorts of network nastiness.
3. Storage Settings for Hyper-V Performance
I wish the IT world would learn to cope with the fact that rotating hard disks do not move data very quickly. If you just can’t cope with that, buy a gigantic lot of them and make big RAID 10 arrays. Or, you could get a stack of SSDs. Don’t get six or so spinning disks and get sad that they “only” move data at a few hundred megabytes per second. That’s how the tech works.
Performance tips for storage:
- Learn to live with the fact that storage is slow.
- Remember that speed tests do not reflect real world load and that file copy does not test anything except permissions.
- Learn to live with Hyper-V’s I/O scheduler. If you want a computer system to have 100% access to storage bandwidth, start by checking your assumptions. Just because a single file copy doesn’t go as fast as you think it should, does not mean that the system won’t perform its production role adequately. If you’re certain that a system must have total and complete storage speed, then do not virtualize it. The only way that a VM can get that level of speed is by stealing I/O from other guests.
- Enable read caches
- Carefully consider the potential risks of write caching. If acceptable, enable write caches. If your internal disks, DAS, SAN, or NAS has a battery backup system that can guarantee clean cache flushes on a power outage, write caching is generally safe. Internal batteries that report their status and/or automatically disable caching are best. UPS-backed systems are sometimes OK, but they are not foolproof.
- Prefer few arrays with many disks over many arrays with few disks.
- Unless you’re going to store VMs on a remote system, do not create an array just for Hyper-V. By that, I mean that if you’ve got six internal bays, do not create a RAID-1 for Hyper-V and a RAID-x for the virtual machines. That’s a Microsoft SQL Server 2000 design. This is 2017 and you’re building a Hyper-V server. Use all the bays in one big array.
- Do not architect your storage to make the hypervisor/management operating system go fast. I can’t believe how many times I read on forums that Hyper-V needs lots of disk speed. After boot-up, it needs almost nothing. The hypervisor remains resident in memory. Unless you’re doing something questionable in the management OS, it won’t even page to disk very often. Architect storage speed in favor of your virtual machines.
- Set your fibre channel SANs to use very tight WWN masks. Live Migration requires a hand-off from one system to another, and the looser the mask, the longer that takes. With 2016, the guests shouldn't crash, but the hand-off might be noticeable.
- Keep iSCSI/SMB networks clear of other traffic. I see a lot of recommendations to put each and every iSCSI NIC on a system into its own VLAN and/or layer-3 network. I’m on the fence about that; a network storm in one iSCSI network would probably justify it. However, keeping those networks quiet would go a long way on its own. For clustered systems, multi-channel SMB needs each adapter to be on a unique layer 3 network (according to the docs; from what I can tell, it works even with same-net configurations).
- If using gigabit, try to physically separate iSCSI/SMB from your virtual switch. Meaning, don’t make that traffic endure the overhead of virtual switch processing, if you can help it.
- Round robin MPIO might not be the best, although it’s the most recommended. If you have one of the aforementioned network storms, Round Robin will negate some of the benefits of VLAN/layer 3 segregation. I like least queue depth, myself.
- MPIO and SMB multi-channel are much faster and more efficient than the best teaming.
- If you must run MPIO or SMB traffic across a team, create multiple virtual or logical NICs. It will give the teaming implementation more opportunities to create balanced streams.
- Use jumbo frames for iSCSI/SMB connections if everything supports it (host adapters, switches, and back-end storage). You’ll improve the header-to-payload bit ratio by a meaningful amount.
- Enable RSS on SMB-carrying adapters. If you have RDMA-capable adapters, absolutely enable that.
- Use dynamically-expanding VHDX, but not dynamically-expanding VHD. I still see people recommending fixed VHDX for operating system VHDXs, which is just absurd. Fixed VHDX is good for high-volume databases, but mostly because they’ll probably expand to use all the space anyway. Dynamic VHDX enjoys higher average write speeds because it completely ignores zero writes. No defined pattern has yet emerged that declares a winner on read rates, but people who say that fixed always wins are making demonstrably false assumptions.
- Do not use pass-through disks. The performance is sometimes a little bit better, but sometimes it’s worse, and it almost always causes some other problem elsewhere. The trade-off is not worth it. Just add one spindle to your array to make up for any perceived speed deficiencies. If you insist on using pass-through for performance reasons, then I want to see the performance traces of production traffic that prove it.
- Don’t let fragmentation keep you up at night. Fragmentation is a problem for single-spindle desktops/laptops, “admins” that never should have been promoted above first-line help desk, and salespeople selling defragmentation software. If you’re here to disagree, you better have a URL to performance traces that I can independently verify before you even bother entering a comment. I have plenty of Hyper-V systems of my own on storage ranging from 3-spindle up to >100 spindle, and the first time I even feel compelled to run a defrag (much less get anything out of it) I’ll be happy to issue a mea culpa. For those keeping track, we’re at 6 years and counting.
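The jumbo frame point above is easy to quantify. This is a rough sketch, assuming a standard 1500-byte MTU versus a 9000-byte jumbo MTU, with illustrative per-frame overhead figures (38 bytes of Ethernet framing plus 40 bytes of IPv4/TCP headers); your exact numbers will vary with protocol options:

```python
# Rough payload-efficiency comparison for standard vs. jumbo frames.
# Overhead figures are illustrative assumptions: 38 bytes of Ethernet
# framing (preamble, header, FCS, inter-frame gap) plus 20 bytes IPv4
# and 20 bytes TCP headers per frame.
ETHERNET_OVERHEAD = 38
IP_TCP_HEADERS = 40

def payload_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that carry actual payload."""
    payload = mtu - IP_TCP_HEADERS
    wire_size = mtu + ETHERNET_OVERHEAD
    return payload / wire_size

standard = payload_efficiency(1500)
jumbo = payload_efficiency(9000)
print(f"1500 MTU: {standard:.1%}, 9000 MTU: {jumbo:.1%}")
```

The jump from roughly 95% to roughly 99% payload efficiency is the "meaningful amount" in question, and it comes for free once every device in the path supports it.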
4. Memory Settings for Hyper-V Performance
There isn’t much that you can do for memory. Buy what you can afford and, for the most part, don’t worry about it.
- Buy and install your memory chips optimally. Multi-channel memory is somewhat faster than single-channel. Your hardware manufacturer will be able to help you with that.
- Don’t over-allocate memory to guests. Just because your file server had 16GB before you virtualized it does not mean that it has any use for 16GB.
- Use Dynamic Memory unless you have a system that expressly forbids it. It’s better to stretch your memory dollar farther than wring your hands about whether or not Dynamic Memory is a good thing. Until directly proven otherwise for a given server, it’s a good thing.
- Don’t worry so much about NUMA. I’ve read volumes and volumes on it. Even spent a lot of time configuring it on a high-load system. Wrote some about it. Never got any of that time back. I’ve had some interesting conversations with people that really did need to tune NUMA. They constitute… oh, I’d say about 0.1% of all the conversations that I’ve ever had about Hyper-V. The rest of you should leave NUMA enabled at defaults and walk away.
5. Network Settings for Hyper-V Performance
Networking configuration can make a real difference to Hyper-V performance.
- Learn to live with the fact that gigabit networking is “slow” and that 10GbE networking often has barriers to reaching 10Gbps for a single test. Most networking demands don’t even bog down gigabit. It’s just not that big of a deal for most people.
- Learn to live with the fact that a) your four-spindle disk array can’t fill up even one 10GbE pipe, much less the pair that you assigned to iSCSI and that b) it’s not Hyper-V’s fault. I know this doesn’t apply to everyone, but wow, do I see lots of complaints about how Hyper-V can’t magically pull or push bits across a network faster than a disk subsystem can read and/or write them.
- Disable VMQ on gigabit adapters. I think some manufacturers are finally coming around to the fact that they have a problem. Too late, though. The purpose of VMQ is to redistribute inbound network processing for individual virtual NICs away from CPU 0, core 0 to the other cores in the system. Current-model CPUs are fast enough to handle many gigabit adapters on a single core, so at gigabit speeds VMQ delivers little benefit while its buggy driver implementations cause real problems.
- If you are using a Hyper-V virtual switch on a network team and you’ve disabled VMQ on the physical NICs, disable it on the team adapter as well. I’ve been saying that since shortly after 2012 came out and people are finally discovering that I’m right, so, yay? Anyway, do it.
- Don’t worry so much about vRSS. RSS is like VMQ, only for non-VM traffic. vRSS, then, is the projection of VMQ down into the virtual machine. Basically, with traditional VMQ, the VMs’ inbound traffic is separated across pNICs in the management OS, but then each guest still processes its own data on vCPU 0. vRSS splits traffic processing across vCPUs inside the guest once it gets there. The “drawback” is that distributing processing and then redistributing processing causes more processing. So, the load is nicely distributed, but it’s also higher than it would otherwise be. The upshot: almost no one will care. Set it or don’t set it, it’s probably not going to impact you a lot either way. If you’re new to all of this, then you’ll find an “RSS” setting on the network adapter inside the guest. If that’s on in the guest (off by default) and VMQ is on and functioning in the host, then you have vRSS. woohoo.
- Don’t blame Hyper-V for your networking ills. I mention this in the context of performance because your time has value. I’m constantly called upon to troubleshoot Hyper-V “networking problems” because someone is sharing MACs or IPs or trying to get traffic from the dark side of the moon over a Cat-3 cable with three broken strands. Hyper-V is also almost always blamed by people that just don’t have a functional understanding of TCP/IP. More wasted time that I’ll never get back.
- Use one virtual switch. Multiple virtual switches cause processing overhead without providing returns. This is a guideline, not a rule, but you need to be prepared to provide an unflinching, sure-footed defense for every virtual switch in a host after the first.
- Don’t mix gigabit with 10 gigabit in a team. Teaming will not automatically select 10GbE over the gigabit. 10GbE is so much faster than gigabit that it’s best to just kill gigabit and converge on the 10GbE.
- 10x gigabit cards do not equal 1x 10GbE card. I’m all for only using 10GbE when you can justify it with usage statistics, but gigabit just cannot compete.
6. Maintenance Best Practices
Don’t neglect your systems once they’re deployed!
- Take a performance baseline when you first deploy a system and save it.
- Take and save another performance baseline when your system reaches a normative load level (basically, once it's hosting its expected number of VMs).
- Keep drivers reasonably up-to-date. Verify that settings aren’t lost after each update.
- Monitor hardware health. The Windows Event Log often provides early warning symptoms, if you have nothing else.
If you carry out all (or as many as possible) of the above hardware adjustments, you will witness a considerable jump in your Hyper-V performance. That I can guarantee. However, for those who don't have the time, the patience, or, in some cases, the budget to make the necessary investment, Altaro has developed an e-book just for you. Find out more about it here: Supercharging Hyper-V Performance for the time-strapped admin.
The category of questions that I most commonly field are related to host design. Provisioning is a difficult operation for small businesses; they don't do it often enough to obtain the same level of experience as a large business and they don't have the finances to absorb either an under-provision or an over-provision. If you don't build your host large enough, you'll be buying a new one while the existing one still has life in it. If you buy too much, you'll be wasting money that could have been used elsewhere. Unfortunately, there's no magic formula for provisioning, but you can employ a number of techniques to guide you to a right-sized build.
1. Do Not Provision Blindly
Do not buy a pre-packaged build, do not have someone on a forum recommend their favorite configuration, and do not simply buy something that looks good. Vendors are only interested in profit margins, forum participants only know their own situations, and no one can adequately architect a Hyper-V host in a void.
2. Have a Budget in Mind
Everyone hates it when the vendor asks, "How much do you want/have to spend?" I completely understand why you don't want to answer that question at all, and I agree with the sentiment. We all know that if you say that you have $5,000 to spend, your bill will somehow be $5,027. Unless you have a history with the vendor in question, you don't know if the vendor is truly sizing against your budget or if they're finding the highest-margin solution that more or less coincides with what you said you were willing to pay. That said, even if you don't give the answer, you must know the answer. That answer must truly be an amount that you're willing to spend; don't say that you'll spend $5,000 if what you're truly able to spend is $3,000. I worked for a vendor of solid repute that earned their reputation, so I can tell you from direct experience that it's highly unlikely that you'll ever be sold a system that is meaningfully smaller than what you can afford even if your reseller isn't trying to oversell. Every system that I ever architected for a small business made some compromises to fit within budget. The more money they could spend, the fewer compromises were necessary.
3. Storage and Memory are Your Biggest Concerns
Part of the reason that virtualization works at all is because modern CPU capability greatly outmatches modern CPU demand. I am one of the many people that can remember days when conserving CPU cycles was important, but I can clearly see that those days are long gone. Do not try to buy a system that will establish a 1-to-1 ratio of physical CPUs to virtual CPUs. If you’re a small business that will only have a few virtual machines, it would be difficult to purchase any modern server-class hardware that doesn’t have enough CPU power. For you, the generation of the CPU is much more important than the core count or clock speed.
Five years ago, I would (and did) say that memory was your largest worry. That’s no longer true, especially for the small business. DDR3 is substantially cheaper than DDR2, and, with only a few notable exceptions, the average system’s demand on memory has not increased as quickly as the cost has decreased. For the notable exceptions (Exchange and SharePoint), the small business can likely get better pricing by choosing a cloud-based or non-Microsoft solution as opposed to hosting these products on-premises. Even if you choose to host them in-house, a typical server-class system with 32 GB of RAM can hold an 8 GB SharePoint guest, an 8 GB Exchange guest, and still have a good 14 GB of memory left over for other guests (assuming 2 GB for the management operating system). Even a tight budget for server hardware should be able to accommodate 32 GB of RAM in a host.
Storage is where you need to spend some time applying thought. For small businesses that won’t be clustering (rationale on my previous post), these are my recommendations:
- Internal storage provides the best return for your dollar.
- For the same dollar amount, prefer many small and fast disks over a few large and slow disks.
- A single large array containing all of your disks is superior to multiple arrays of subsets.
- Hardware array controllers are worth the money. Tip: if the array controller that you’re considering offers a battery-backed version, it is hardware-based. The battery is worth the extra expense.
Storage sizing is important, but I am intentionally avoiding going any further about it in this article because I want it to be applicable for as many small businesses as possible. There are two takeaways that I want you to glean from this point:
- CPU is a problem that mostly solves itself and memory shouldn’t take long to figure out. Storage is the biggest question for you.
- The storage equation is particular to each situation. There is no one-size-fits-all solution. There isn’t a one-sized-fits-most solution. There isn’t a typical, or a standard, or a usual, or a regular solution that is guaranteed to be appropriate for you. Vendors that tell you otherwise are either very well-versed in a particular vertical market that you’re scoped to and will have the credentials and references to prove it or they’re trying to get the most money out of you for a minimum amount of time invested on their part.
Networking is typically the last thing a small business should be worried about. As with storage sizing, I can't be specific enough to cover everyone I'd like this post to be relevant to, but it's safe to say that 2 to 6 gigabit Ethernet connections per host are sufficient.
4. Do not be Goaded or Bullied into 10 Gigabit Ethernet
I won’t lie, 10 GbE is really nice. It’s impressive to see it in operation. But, the rest of the truth is that it’s unnecessary in most small businesses, and in lots of medium businesses too. You can grow to a few thousand endpoints before it even starts to become necessary as an interswitch backbone.
A huge part of the reasoning is simple economics:
- A basic business-class 20-port gigabit switch can be had for around $200 USD. You can reasonably expect to acquire gigabit network adapters for $50 or less per port.
- A basic 12-port 10GbE switch costs at least $1,500 USD. Adapters will set you back at least $250 per port.
When you’re connecting five server-class hosts, $1,500 for a switch and $500 apiece for networking doesn’t seem like much. When you’re only buying one host for $5,000 or less, the ratio isn’t nearly as sensible. That price is just for the budget equipment. Since 10GbE adapters can move network data faster than modern CPUs can process it, offloading and VMQ technologies are quite important to get the most out of 10GbE networking. That means that you’re going to want something better than just the bare minimum.
What might even be more relevant than price is the fact that most people don’t use as much network bandwidth as they think they do. The most common tests do not even resemble typical network utilization, which can fool administrators into thinking that they don’t have enough. If you need to verify your usage, I’ve written an article that can help you do just that with MRTG. This leads into a very important point.
5. You Need to Know What You Need
Unless you’re building a host for a brand-new business, you’ve got an existing installation to work with. Set up Performance Monitor or any monitoring tool of your choice and find out what your systems are using. Measure CPU, disk, memory, and networking. Do not even start trying to decide what hardware to buy until you have some solid long-term metrics to look at. I’m surprised at how many messages I get asking me to recommend a hardware build that have little or no information about what the environment is. I’m guessing that the questioners are just as surprised when I respond, “I don’t know.” It doesn’t take a great deal of work to find out what’s going on. Do that work first.
6. Build for the Length of the Warranty
Collecting data on your existing systems only tells you what you need to know to get through the first day. You’re probably going to need more over time. How much more depends on your environment. Some businesses have reached equilibrium and don’t grow much. Others are just kicking things off and will triple in size in a few months. Since those truly new environments are rare, I’m going to aim this next bit at that gigantic majority that is building for the established institutions. Decide how much warranty you’re willing to buy for the new host and use that as your measuring stick for the rest of it. How you proceed depends upon growth projections:
- If system needs won't grow much (for example, 5-10% annually), then build the system with a long warranty period in mind. If the business has been experiencing a 5% average annual growth rate and is currently using 300 GB of data, a viable option is to purchase a system with 500 GB of usable storage and a 5-year warranty.
- If system needs will grow rapidly, you have two solid options:
- Buy an inexpensive system with a short warranty (1-3 years). Ensure that it’s understood that this system is not expected to live long. If decision-makers appear to be agreeing without understanding, you’re better off getting a bigger system.
- Buy a system that’s a little larger with a longer warranty (5 years). Plan a definite growth point at which you will scale out to a second host. Scaling out can become more than twice as expensive as the original, especially when clustering is a consideration, so do not take this decision lightly.
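The slow-growth example above works out as a simple compound-growth projection; the 300 GB starting point and 5% rate are the figures from that example:

```python
# Projecting storage need over a warranty period at a steady annual
# growth rate, using the 300 GB / 5% / 5-year example above.
def projected_need(current_gb: float, annual_growth: float, years: int) -> float:
    return current_gb * (1 + annual_growth) ** years

need = projected_need(300, 0.05, 5)
print(f"Projected need after 5 years: {need:.0f} GB")
print(f"500 GB build is sufficient: {need <= 500}")
```

At 5% annual growth, five years takes 300 GB to roughly 383 GB, so the 500 GB build rides out the whole warranty with headroom to spare.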
Most hardware vendors will allow warranty extensions, which gives you some incentive to oversize. If you make a projection for a five-year system and it isn't at capacity at the end of those five years, extending the warranty helps to maximize the initial investment.
In case it’s not obvious, future projections are much easier to perform when you have a solid idea of the environment’s history. There’s more than one reason that I make such a big deal out of performance monitoring.
7. Think Outside the One Box
Small businesses typically only need one physical host. There isn't any line at which you cross over into "medium" business and magically need a second host. There isn't any concrete relationship between business size and the need for infrastructure capacity. Just as I preach against the dangers of jumping into a cluster needlessly, I am just as fervent that you not settle for less than what is adequate. Scaling out can undoubtedly be expensive, but when it's time, it's time.
Clustering isn’t necessarily the next step up from a single host. Stagger your host purchases across years so that you have an older system that handles lighter loads and a newer system that takes on the heavier tasks. What’s especially nice about having two Hyper-V hosts is that you can have two domain controllers on separate equipment. Even though I firmly stand behind my belief that most small businesses operate perfectly well with a single domain controller, I am just as certain that anyone who can run two or more domain controllers on separate hosts without hardship will benefit from the practice.
8. Maybe Warranties are Overrated
I've been in the industry long enough to see many hardware failures, some of legendary quality. That doesn't change the fact that physical failures are very nearly a statistical anomaly. I work in systems administration; my clients and workers from other departments never call me just to be social. Anyone in my line of work deals with exponentially more failures than anyone outside my line of work. So, while I will probably always counsel you to spend extra to get new equipment, or something new enough that a manufacturer's warranty is still available for it, I can also acknowledge that there are alternatives when the budget simply won't allow for it.
In my own home, most of the equipment that we use is not new. As a technophile, I have more computers than people, and that’s before you start counting the devices that don’t have keyboards or the units that are only used for my blogging. I rarely buy anything new. I am unquestionably a fan of refurbished and cast-off systems. It’s all very simple to understand: I want to own more than I can technically afford to own, and this practice satisfies both my desire for tech and my need for frugality. Is that any way to run a business? Well…
Cons of Refurbished and Used Hardware
On the one hand, no, this is not a good idea. If any of this fails, I have to either repair it, live without it, or replace it. If you don't have the skills for the first, the capacity for the second, or the finances for the third, that leaves your business in the lurch. If you'd have to make that choice, then no, don't do this. Another concern is that if you're doing this to be cheap, a lot of cheap equipment doesn't meet the criteria to be listed on http://www.windowsservercatalog.com and might be more trouble than the savings are worth. And of course, even if it's good enough for today's version, it might not work with tomorrow's version.
For another thing, I’ve seen a lot of really cheap business owners use equipment that they had to repair all the time and that was so inefficient that it impacted worker productivity. That sort of thing is a net loss. Avoid these conditions, even if it means spending more money. Remember what I said earlier about compromises? Sometimes the only viable compromise is to spend more money on better hardware.
If you go the route of having hardware that doesn’t carry a warranty, you need to be prepared to replace it at all times. Warranty repairs are commonly no longer than next-business-day in this era. Buying replacement hardware could have days or even weeks of lead time. Having replacement hardware on hand can cost more than just buying new with warranty.
Pros of Refurbished and Used Hardware
On the other hand, I spent less to acquire many of these things than their original owners did on their warranties, and, with the law of averages, most of my refurbished equipment has never failed. I quite literally have more for less. Something else to remember is that a lot of refurbished hardware is very new. Sometimes they’re just returns that can no longer be sold as “new”. You can often get original manufacturer warranties on them. The only downside to purchasing that sort of hardware is that you don’t get to pick exactly what you want. For the kind of savings that can be had, so what?
In case you’re curious, all of the places that I’ve worked pushed a very hard line of only selling new equipment. “Used” and “refurbished” carry a very strong negative connotation that no one I worked for wanted to be attached to. However, I didn’t work for anyone that would turn away a client that was using used or refurbished equipment that they acquired independently. I’ve encountered plenty of it in the field. It didn’t fail any more often than new equipment did. I’ll say that I do feel more comfortable about “refurbished” than “used”. I also know what it’s like to be looking at a tight budget and needing to make tough decisions.
I will say that I would prefer to avoid used hardware for a Hyper-V host. I understand that it can be enticing for the very small business budget so I will stop short of declaring this a rule. It’s reasonable to expect used hardware to be unreliable and short-lived. Used hardware will consume more of your time. Operating from the assumption that your time has great value, I encourage you to consider used hardware as a last resort.
9. Architect for Backup
I expect point #8 to stir a bit of controversy. I fully expect many people to disagree with any notion of non-new equipment, especially those that depend on margins from sales of new hardware. I don’t mind the fight; until someone comes up with 100% failure-free new hardware, there will never be a truly airtight case for only buying new.
If you want to give yourself a guaranteed peace of mind, backup is where you need to focus. I may not know the statistics around failures of new versus used or refurbished equipment, but I know that all hardware has a chance of breaking. What doesn’t break due to defect or negligence can be destroyed by malice or happenstance, so you can never rely too much upon the quality of a purchase.
What this means is that when you’re deciding how many hard disks to buy and the size of the network switch to plug the host into, you also need to be thinking about where you’re going to copy the data every night. External hard disks are great, as long as they’re big enough. Offsite service providers are fine, as long as you know that your Internet bandwidth can handle it. If you don’t know this in advance, you run the risk of needing to sacrifice something in your backup rotations. I have yet to see any sacrifice in this aspect that was worth it.
I often lament the way that small businesses are treated by the majority of technical writers and a significant number of software companies. You know the ones I’m talking about — those who think of “small business” as a nebulous concept that means “something under a few thousand employees”. They take what they know and just scale it down to some arbitrary minimum that in no way represents a small business.
Fortunately, I’ve spent most of my career working with and for truly small businesses, right down to the loners building up companies out of their basements. I know the struggles of insufficient time, budget shortfalls, and a lack of expertise. I also know what works and what doesn’t. What I intend to share with you is the distillation of all that experience into the strategy that I would follow if I ever struck out on my own to build a startup. This is only the first of my articles targeting the small business environment. I’ve got plenty for you.
To begin, I want to introduce a set of 8 Hyper-V best practices for small businesses that are important to consider.
1. Backup is Your Highest Priority
In my second full-time IT position, one of my very first responsibilities was assisting small business customers across the country with backup problems. I held that position for five years. During that period, I accumulated stories that would give you nightmares. I’ve had to say the words, “I’m sorry, but your data is unrecoverable,” more than once. The most disheartening thing I see is forum posts from small businesses wanting to know what the “best free backup” program is. “Best” is not where they’re drawing the line; it’s drawn at “free”. Don’t get me wrong; free is good — but only if it meets every single one of your needs. Skimping on backup can very easily mean placing your entire business at risk, unless your business is able to withstand a complete data loss event. Even if it can withstand that today, are you not planning to grow?
I expect you to ask, “How can I know if a particular backup solution meets my needs?” That’s a wonderful question, and I’ll answer it even if you didn’t ask. Here’s how you know:
- Every last piece of data that you care about must be backed up for at least as long as you’ll ever care about it. If you’re in a particular industry, such as finance, make sure that you understand the regulations around data retention that apply to you. Regulations aren’t always the deciding factor; I once had to recover a 7-year-old e-mail to aid a plaintiff’s case. Some free backup applications have a limit on retention. When evaluating an application on these grounds, remember that “retention” only applies to the age of the backup itself; if the data that was backed up last night is 5 years old, then the backup age is less than a day.
- Any backup data in the same geographical location as the source data is forfeit to the whims of chance. You need some way to move it offsite, preferably out of the range of any natural disaster that might destroy your main location. For small businesses, there is a limit to the effectiveness of very long-range offsite storage. If your town is entirely destroyed by a flood and all of your customers are in that town, then how far do you really need to move your data? If your business is an insurance agency, you’ll likely answer that question differently than would someone who owns a landscaping outfit. Of course, your own business is likely insured, so you’ll need to protect any data that would be needed to file any claims in the event of a business-ending disaster. The takeaway of this point is that the backup solution that you use must allow you to take your data offsite and you must follow through.
- The backed-up data must be recoverable. I often hear backup referred to as “an insurance policy”. That analogy is awful. People go out of their way to avoid using insurance policies because filing a claim often leads to premium increases, and in the worst cases, policy cancellations. Practice data restoration often. The primary benefit is that it proves that your backup data is good, your backup application is good, and that your skills are good. I have worked with several applications through the years that take absolutely lovely backups that no one in the world can restore data from. I can’t imagine anything more useless. You won’t know if yours is one of those without trying it out. The second benefit is that practice makes perfect. You don’t want to be learning how to restore data in the middle of a crisis.
- You can’t be the only one that knows. For the most part, this isn’t entirely about the backup application. Working from the assumption that your business is important to more people than you, such as those that your income supports, make sure that you leave some form of documentation that would allow someone else to recover your data in the event that you are… unavailable. This affects your backup application choices in that you don’t want some obscure application that no one else can figure out.
- Encryption matters. You only want to be mentioned in news headlines for good things. “Small Business Corp Loses Customer Records on Unencrypted Data Tapes” does not qualify.
I certainly understand that cost is important, but it is secondary at most. Give up something else before you skimp on backup.
2. Backup is Your Redundancy
Despite the word “backup” in the title, backup is not the target of this point. “Redundancy” is. I know at least a few people that seem to derive great pleasure from terrorizing forum posters over a “single point of failure”. Ignore them.
With Hyper-V being the focal technology of this post, we’ll start with that. In the “all businesses are the same only different sizes” philosophy, you’ll have at least two clustered hosts connected to fully redundant shared storage with every component connected to a minimum of two switches. That sounds good, and believe me, it is. But it’s more than twice as expensive as a single host with internal storage connected to a single switch. “Single point of failure!” scream the people that don’t have to deal with your budgetary issues. Well, I took a course last year from an instructor who continually spoke of the “single point of success.” “Gives it a more positive tone, don’t you think?” he asked when I inquired about it. He’s right. Fewer than 2% of computer problems are caused by hardware failure and fewer parts means less complexity and fewer things to troubleshoot and maintain. If you have a good backup and you know how to use it, take comfort in the knowledge that the odds are heavily in your favor.
3. Quality Matters
A core requirement to making the previous point work is choosing quality at every point. If you save $500 on a shiny new mid-tower to run Hyper-V but have to spend three days in the forums and several hours on the phone or e-mail with technical support because Hyper-V won’t start, that is three days and several hours that you weren’t out drumming up business. If you have trouble getting it to work, you’re probably going to have trouble keeping it working.
Hardware components fail sometimes, and sometimes they’re shipped out new in non-working condition. It’s rare, but unavoidable. What you need to concern yourself with is choosing the equipment that is most likely to succeed. Start on WindowsServerCatalog.com. Ask around. Perform an Internet search for “<vendorname> Hyper-V problems”.
Oh, and quality isn’t important just for hardware.
4. Get Help
If I were to start a new business, I probably would not hire any technical assistance to start. I’m fairly certain I know what I’m doing. However, if I started putting in a noticeable amount of my time doing technical support, I wouldn’t hesitate to pick up the phone. Neither should you. Even if you’re knowledgeable today, running a business is going to consume a lot of time. You’re eventually going to have to choose between maintaining your top-tier technical status and running a business. At some point, you’re going to have to pass the burden on.
Of course, as a small business, you’re not hiring any full-time engineers. But, there are many small technology services companies out there that cater to the small business. Find them.
In your search, get help with that too. Point #3 applies here just as strongly as it does to hardware. It’s a sad fact that many technology service companies are terrible with technology and owe their entire existence to customers that have even less knowledge. A few pointers:
- If you’re not technically inclined, attend local small business expos and networking sessions where you can meet other local small business owners. They love to share their horror stories and most will gladly make recommendations.
- If you are technically inclined, you can spend a bit of time researching technical subjects, at least enough to keep abreast of the current state of the art. Sometimes, you’ll just find people that will flat-out tell you how to spot frauds. For instance, if you have a service provider insisting that you can’t oversubscribe CPU, must turn off Dynamic Memory, and can never use dynamically expanding virtual hard disks, that provider is uneducated and/or trying to sell you hardware that you don’t need. I have one tip for those times when you discover that your provider is clueless and/or unethical: do not attempt to argue with them or help them get better. Fire them immediately and move on. In almost twenty years of being in this industry, I have never seen any evil or willfully ignorant technology provider come to the light.
- Good engineers do not work for cheap. When I left the provider industry to move to internal IT, I was charging a higher rate than anyone else that I knew of. However, I charged individual customers noticeably fewer hours than my competitors, so I needed that higher rate to stay afloat. Do you want the firm that charges $60 per hour and will need a permanent office of their own in your building or the firm that charges $200 per hour and will be done in two days?
5. Most “Best Practices” were Written by People Who Do Not Believe that You Exist
As far as the technical world is concerned, you might as well name your company “The Tooth Fairy” or “Santa Claus”. Two or more domain controllers on separate physical units? Who is going to pay for all of that? Multiple Internet providers to one building? Not a chance. Redundant everything? Maybe if you win the lottery — assuming you can ever afford a ticket.
Most “Best Practices” lists are built on solid reasoning and they are important to understand. With solid understanding, you can also discern where your small business can deviate. Hopefully, you’ll have a good provider or assistant (#4) that can guide you in the proper direction. If your provider is blindly quoting and adhering to a list they printed off of someone else’s website, that’s a bad sign. The word “Why” is the most important tool in your kit in these situations. A provider’s explanation is less important than the way that they answer. If they don’t seem certain, it’s because they aren’t. If they have a canned, practiced answer that in no way involves or references your situation, they do not understand what they are doing.
6. Use Hyper-V in a Way that Saves Money
I used to work with a fairly standard build for small business: one server installation that did everything. Best practice? Absolutely not. But, when clients are working with annual technology budgets that are smaller than a first-class ticket to the nearest airport, you make do. Nowadays, choices are simpler with things like Office 365 and Exchange Online. The recent pricing shake-up that Microsoft enacted with its OneDrive product likely, and deservedly, has small business owners nervous about signing on with any Microsoft cloud service. However, Small Business Server is no longer an option (and frankly, was never a wonderful option) and the cost of acquiring and maintaining some Microsoft servers on-premises is prohibitively expensive.
Considerations for hosting on-premises:
- If you can’t get reliable, affordable Internet access, skip cloud providers. They don’t care if you can’t connect.
- If you’re going to have any servers on-premises at all, Active Directory is a must-have. Only go workgroup if you’ll only be using personal computers.
- SQL Server Express is far cheaper to host in-house than almost any cloud provider can match. However, treat this like #1; don’t buy less than you need and be ready for growth.
- Non-Microsoft software may be cheaper than cloud solutions, especially if you’d have to rent a virtual machine or dedicated host.
Whatever servers you choose to bring in-house, your goal should be to fit them all in Hyper-V. When I had to do one-host installations, I always had to worry about compatibility problems with all of those applications, including Active Directory, sitting on one unit. You don’t have to do that. You may have to purchase more Windows Server licenses, but you do not need to purchase more hosts.
Oh, and even though I already mentioned it: oversubscribe CPU, turn on Dynamic Memory (where appropriate), and use dynamically expanding disks (where appropriate).
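Both of those settings can be applied from PowerShell as well as from Hyper-V Manager. Here is a minimal sketch; the VM name, sizes, and path are illustrative assumptions, and note that Dynamic Memory settings can only be changed while the virtual machine is off:

```powershell
# Enable Dynamic Memory on an existing, stopped VM (name and sizes are examples)
Set-VM -Name 'SmallBizDC' -DynamicMemory `
    -MemoryStartupBytes 1GB -MemoryMinimumBytes 512MB -MemoryMaximumBytes 4GB

# Create a dynamically expanding VHDX (path and maximum size are examples)
New-VHD -Path 'C:\VMs\SmallBizDC\Data.vhdx' -SizeBytes 60GB -Dynamic
```

The dynamically expanding disk only consumes physical space as data is written, which is exactly the sort of over-allocation that saves a small business money.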
The reason I wrote this point is that many people are going to tell you that the purpose of Hyper-V is so that you can buy more hosts and switches and storage to “avoid single point of failure!”. Virtualization for you is a money-saving move. Backup is your redundancy.
7. Small Is OK
I would never buy a Hyper-V host with fewer than 16 cores total or less than 256 gigabytes of memory and any host that I purchase will absolutely connect to my EMC SAN. I also have a campus agreement for ridiculously cheap Windows Server licenses, a history of volume purchases that nets me extremely favorable hardware pricing, and an IT budget larger than many towns’ entire budgets. Many other writers are either in the same place as I am or have never worked directly with small businesses and just do not understand what you’re going through.
Let’s set some realistic upper and lower boundaries:
- I don’t see many people over-recommending CPU, so you’re probably not in any real danger. Most small businesses’ Windows Server instances just need to have 2 vCPU available for contention issues. Few will put any meaningful load on them. I would consider keeping the ceiling at 8 total physical cores in each physical host just so you don’t run afoul of per-core licensing coming in 2016. I’d love to try to give you a hard number, but unfortunately, that’s not possible. However, had I ever had the sort of hardware and hypervisors available to me then that I do now, I don’t think any of my 20-or-fewer user clients would have needed anything larger than six total cores in a host.
- Memory is also easy to oversize. I would say that a well-proven average for a Windows Server instance running basic services is 2 GB, including the management operating system. Of course, the more roles and applications that you pack into an instance, the more it is likely to need. The big point is, don’t overdo it, and make any provider reason out what they’re recommending to within a few gigabytes. Don’t try to drop in 128 GB just because everyone else is doing it. If you’re only going to run two virtual machines, 16 GB of physical memory is probably more than you’ll need but gives you breathing room at an affordable price point.
- Disk is where you can save.
- Internal storage is just fine.
- RAID-10 is not a requirement, despite what many claim. RAID-5 and -6 are both fine, but spend extra for a hardware controller. If you’re not certain whether or not the RAID controller advertised for a system is hardware, the give-away is whether or not it can be battery-supported. You do want battery support.
- For local storage, I would skip Storage Spaces for at least one more version iteration. Battery protection isn’t even an option, for one thing. For another, its redundancy features are handled on your CPUs — the same ones I told you not to oversize. In my opinion, Storage Spaces is still better when it’s used as remote, shared storage. I would back that up by pointing out that Microsoft is one of those vendors that are stymied by the small business.
- Rather than buying a few very large disks, try for a few more smaller disks. The performance and resiliency is superior, especially when using RAID-5 or -6.
- 10 GbE is an absurd expenditure for a small business. Ditto RDMA, SR-IOV and other high-end networking features. If you are using internal storage, two to four 1GbE connections are perfect. Add 2 if you’re using iSCSI.
- VDI’s benefits are in features. It is not a money-saver. The typical small (and medium) business should avoid VDI.
- OEM pricing usually works out better than volume pricing for a small business. If you have a decent reseller, they will help you verify that. They should not just give you a pricing sheet and leave it to you to figure out, nor should they give you a 5-minute synopsis of whatever the current state of volume licensing is and expect you to decide on your own. Those are signs that you need to find another provider.
- Your backup solution is part of your server solution, not an afterthought. Price out the software and the supporting hardware at the same time.
- Buy an uninterruptible power supply for your host. I have seen an uncountable number of instances in which thousands of dollars of server-class equipment went to the recyclers and days of transactions were lost due to refusal to purchase a $250 device. If you’re banking on insurance coverage, be aware that insurers also realize the value of a UPS; it will be one of the first things they ask about when you file your claim.
8. This Version is OK
When it comes to technology, don’t try to keep up with the Joneses. I liked Microsoft a lot better when they kept their operating system releases fairly far apart and issued periodic Service Packs. Those days are gone. However, you don’t need to feel pressured to jump to each new version as soon as it comes out. The people that call you on each new release date are not looking out for your best interests. All they can see is “licensing margins” and “engagement fees”. There are only a few good reasons to upgrade:
- The version that you have is no longer sufficient for your needs in a way that is addressed by a newer release.
- You are purchasing a replacement host.
- Your current version is nearing the end of its support lifecycle.
What I’d like to do in future articles is expand #7 from a generic list into some practical build ideas to help guide the perplexed. If you’re a small business or provide service to small businesses, I’d like to hear your stories, suggestions, and questions.
It’s not difficult to find all sorts of lists and discussions of best practices for Hyper-V. There’s the 42 best practices for Balanced Hyper-V systems article that I wrote. There is the TechNet blog article by Roger Osborne. Best practices lists are a bit tougher to find for failover clustering, but there are a few if you look. What I’m going to do in this article is focus on the overlapping portion of the Hyper-V/failover clustering Venn diagram. Items that apply only to Hyper-V will be trimmed away and items that would not apply in a failover cluster of roles other than Hyper-V will not be included. What’s also not going to be included is a lot of how-to, otherwise this document would grow to an unmanageable length. I’ll provide links where I can.
As with any proper list of best practices, these are not rules. These are simply solid and tested practices that produce predictable, reproducible results that are known to result in lower cost and effort in deployment and/or maintenance. It’s OK to stray from them as long as you have a defensible reason.
1. Document Everything
Failover clustering is a messy business with a lot of moving parts, few of which operate in complete isolation. Most things that go wrong will do so without world-stopping consequences up until the point where their cumulative effect is catastrophic. The ease of working with cluster resources masks a complex underpinning that could easily come apart. It is imperative that you keep a log of things that change for your own record and for anyone that will ever need to assist or replace you. Track:
- Virtual machine adds, removes, and changes
- Node adds, removes, and changes
- Storage adds, removes, and changes
- Updates to firmware and patches
- All non-standard settings, such as possible/preferred owner restrictions
- Errors, crashes, and other problems
- Performance trends
- Variations from expected performance trends
For human actions taken upon a cluster, what is very important is to track why something was done. Patches are obvious. Why was a firmware update applied? Why was a node added?
For cluster problems, keep track of them. Don’t simply look at a log full of warnings and say, “Oh, I understand them all” and then clear it. You’d be amazed at how useful these histories are. Even if they are benign, what happens if you get a new IT Manager who says, “You’ve known about these errors and you’ve just been deleting them?” Even if you have a good reason, the question is inherently designed to make you look like a fool whether or not that’s the questioner’s intent. You want to answer, “I am keeping a record and continuing to monitor the situation.” Furthermore, if those messages turn out to be not as benign as you thought, you have that record. You can authoritatively say, “This began on…”
2. Automate Everything
Even in a small cluster, automation is vital. For starters, it clears up your time from tedious chores. All of those things that you know that you should be doing but aren’t doing become a whole lot less painful. Start with Cluster-Aware Updating. If you want to get a record of all of those warnings and errors that we were talking about in point #1, how does this suit you?
Get-WinEvent -LogName "*failoverclustering*", "*hyper-v*" -FilterXPath '*[System[Level=2 or Level=3]]' |
select LogName, LevelDisplayName, Id, Message |
Export-Csv -Path "C:\temp\$($env:ComputerName)-$($(Get-Date).ToLongDateString())-ClusterLogs.csv" -NoTypeInformation
Be aware that the above script can take quite a while to process, especially if you have a lot of problems and/or haven’t cleared your logs in a while, but it is very thorough. Also be aware that it should be run against every node. The various Hyper-V logs will be different on each system. While many of the failover clustering logs should be the same, there will be discrepancies. Given how easy the script is to run and how little space its output requires, it’s better to just have them all.
One of the things that I got in the habit of doing is periodically changing the ownership of all Cluster Shared Volumes. While the problem may have since been fixed, I had issues under 2008 R2 with nodes silently losing connectivity to iSCSI targets that they didn’t have live virtual machines on, which later caused problems when an owning node rebooted or crashed and the remaining node could not take over. Having a background script occasionally shuffling ownership served the dual purpose of keeping CSV connectivity alive and allowing the cluster to log errors without bringing the CSV offline. The command is Move-ClusterSharedVolume.
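A background script for that shuffle might look something like the following sketch, which rotates each CSV to the next available node. The round-robin selection is just an illustration; any scheme that periodically changes owners accomplishes the same goal:

```powershell
# Rotate ownership of every Cluster Shared Volume across the nodes that are up.
# Run from any cluster node (or pass -Cluster to the cmdlets for remote use).
$nodes = @(Get-ClusterNode | Where-Object State -eq 'Up')
$i = 0
foreach ($csv in Get-ClusterSharedVolume) {
    $target = $nodes[$i % $nodes.Count]
    if ($csv.OwnerNode.Name -ne $target.Name) {
        # Ownership moves are non-disruptive to running virtual machines
        Move-ClusterSharedVolume -Name $csv.Name -Node $target.Name
    }
    $i++
}
```

Scheduled as a recurring task, this both exercises each node’s storage connectivity and surfaces any errors in the logs before a failover depends on it.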
Always be on the lookout for new things to automate. There’s a pretty good rule of thumb for that: if it’s not fun and you have to do it once, you can pretty much guarantee that you’ll have to do it again, so it’s better to figure out how to get the computer to do it for you.
3. Monitor Everything
If monitoring a Hyper-V host is important, monitoring a cluster of Hyper-V hosts is doubly so. Not only does a host need to be able to handle its own load, it also needs to be able to handle at least some of the load of at least one other node at any given moment. That means that you need to be keeping an eye on overall cluster resource utilization. In the event that a node fails, you certainly want to know about that immediately. By leveraging some of the advanced capabilities of Performance Monitor, you can, with what I find to be a significant amount of effort, have a cluster that monitors itself and can use e-mail to notify you of issues. If your cellular provider has an e-mail-to-text gateway or you have access to an SMS conversion provider, you can even get hosts to text or page you so that you get urgent notifications quickly. However, if your resources are important enough that you built a failover cluster to protect them, they’re also important enough for you to acquire a proper monitoring solution. This solution should involve at least one system that is not otherwise related to the cluster so that complete outages are also caught.
Even just taking a few minutes here and there to click through the various sections of Failover Cluster Manager can be beneficial. You might not even know that a CSV is in Redirected Access Mode if you don’t look.
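On 2012 and later, you can also check for redirected access from PowerShell instead of clicking through the console. A quick sketch:

```powershell
# List any CSV that is not in direct access mode on any node.
# StateInfo values other than 'Direct' (e.g. FileSystemRedirected) warrant a look.
Get-ClusterSharedVolume |
    Get-ClusterSharedVolumeState |
    Where-Object StateInfo -ne 'Direct' |
    Format-Table Name, Node, StateInfo -AutoSize
```

No output means every CSV is in direct access mode everywhere, which is what you want to see.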
4. Use the Provided Auditing Tools
A quick and easy way to locate obvious problems is to let the system look for them.
It’s almost imperative to use the Cluster Validation Wizard. For one thing, Microsoft will not obligate itself to provide support for any cluster that has not passed validation. For another, it can uncover a lot of problems that you might otherwise not ever be aware of. Remember that your validation report must be kept up to date. Get a new one if you add or remove any nodes or storage. Technically, you should also update it if you update firmware or drivers, although that’s substantially less critical. Reports are saved in C:\Windows\Cluster\Reports on every node for easy viewing later. This wizard does cause Cluster Shared Volumes to be briefly taken offline, so only run this tool on existing clusters during scheduled maintenance windows.
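Validation can also be run from PowerShell with Test-Cluster. As a sketch, assuming a cluster named 'HVCluster1' (substitute your own), you can skip the disruptive storage tests when you need a report outside of a maintenance window:

```powershell
# Full validation (takes CSVs briefly offline; use during maintenance windows)
Test-Cluster -Cluster 'HVCluster1'

# Validation that skips the Storage test category, safe for a live cluster,
# at the cost of an incomplete report
Test-Cluster -Cluster 'HVCluster1' -Ignore 'Storage'
```

The cmdlet writes the same HTML report as the wizard and returns the file object so you can archive it alongside your documentation from point #1.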
Don’t forget about the Best Practices Analyzer. The analyzer for Hyper-V is now rolled into Server Manager. If you combine all the hosts for a given cluster into one Server Manager display, you can run the BPA against them all at once. If you’re accustomed to writing off Server Manager because it was not so useful in previous editions, consider giving it another look. Add additional hosts using the second option on the first page:
Adding other hosts in Server Manager
I don’t want to spend a lot of time on the Best Practices Analyzer in this post, but I will say that the quality of its output is more questionable than a lot of other tools. I’m not saying that it isn’t useful, but I wouldn’t trust everything that it says.
5. Avoid Geographically Dispersed Hyper-V Clusters
Geographically-dispersed clusters, also known as “stretched” clusters or “geo-clusters”, are a wonderful thing for a lot of roles, and can really amp up your “cool” factor and buzzword-compliance, but Hyper-V is really not the best application. If you have an application that requires real-time geographical resilience, then it is incumbent upon the application to provide the technology to enable that level of high availability. The primary limiting factor is storage; Hyper-V is simply not designed around the idea of real-time replicated storage, even using third-party solutions. It can be made to work, but doing so typically requires a great deal of architectural and maintenance overhead.
If an application does not provide the necessary features and you can afford some downtime in the event of a site being lost or disconnected, Hyper-V Replica is the preferred choice. Build the application to operate in a single site and replicate it to another cluster in another location. If the primary site is lost, you can quickly fail over to the secondary site. A few moments of data will be lost and there will be some downtime, but the amount of effort to build and maintain such a deployment is a fraction of what it would take to operate a comparable geo-cluster.
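Setting up Hyper-V Replica is mostly wizard-driven, but it can be sketched in PowerShell as well. The VM name, replica broker FQDN, and port below are assumptions; Kerberos authentication assumes both sites are in the same or trusted domains:

```powershell
# Enable replication of a VM to the replica broker at the secondary site
# (names and port are illustrative; the replica server must already be
# configured to accept replication)
Enable-VMReplication -VMName 'ERPServer' `
    -ReplicaServerName 'replica-broker.contoso.local' `
    -ReplicaServerPort 80 `
    -AuthenticationType Kerberos

# Seed the replica with the first full copy
Start-VMInitialReplication -VMName 'ERPServer'
```

From there, a planned or unplanned failover at the secondary site is a matter of minutes, not the ongoing architectural burden of a stretched cluster.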
Of course, “never say never” wins the day. If you must build such a solution, remember to leverage features such as Possible Owners. Take care with your quorum configuration, and make doubly certain that absolutely everything is documented.
6. Strive for Node Homogeneity
Microsoft does not strictly require that all of the nodes in a cluster have the same hardware, but you should make it your goal. Documentation is much easier when you can say, “5 nodes of this” rather than maintain a stack of different build sheets annotated with every difference.
This is a bigger deal for Hyper-V than for most other clustered roles. There aren’t any others that I’m aware of that have any noticeable issues when cluster nodes have different CPUs, beyond the expected performance differentials. Hyper-V, on the other hand, requires virtual machines to be placed in CPU compatibility mode or they will not Live Migrate. They won’t even Quick Migrate unless turned off. The end effects of CPU compatibility mode are not documented in an easy-to-understand fashion (you can take a look), but it is absolutely certain that the full capabilities of your CPU are not made available to any virtual machine in compatibility mode. The effective impact depends entirely upon what CPU instructions are expected by the applications on the guests in your Hyper-V cluster, and I don’t know that any software manufacturer publishes that information.
Realistically, I don’t expect that setting CPU compatibility mode for most typical server applications will be an issue. However, better safe than sorry.
7. Use Computer-Based Group Policies with Caution
Configurations that don’t present any issues on stand-alone systems can cause problems when those same systems are clustered. A notorious right that causes Live Migration problems when tampered with is “Create Symbolic Links”. It’s best to either avoid computer-scoped policies or only use those that are known and tested to work with Hyper-V clustering. For example, the GPO templates that ship with the Microsoft Baseline Security Analyzer cause no problems, with only one potential exception: they disable the iSCSI service. Otherwise, use them as-is.
8. Develop Patterns and Practices that Prevent Migration Failures
Using dissimilar CPUs and bad GPOs aren’t the only way that a migration might fail. Accidentally creating a virtual machine with resources placed on local storage is one potential problem. A practice that will avoid this is to always change the default locations on every host to shared storage. This helps control for human errors and for the (now fixed) bug in Failover Cluster Manager where it sometimes caused some components to be placed in the default storage location when it created virtual machines. A related pattern is to discourage the use of Failover Cluster Manager to create virtual machines.
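Changing the default locations is a one-line operation per host. A sketch, assuming an SMB 3 share (substitute your own shared storage path, and run it on every node):

```powershell
# Point this host's default VM configuration and VHD locations at shared
# storage so that nothing lands on local disk by accident
Set-VMHost -VirtualMachinePath '\\fileserver\VMs' `
           -VirtualHardDiskPath '\\fileserver\VMs\Virtual Hard Disks'
```

Once set, anything created with the defaults is automatically eligible for migration.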
A few other migration-breakers:
- Always use consistent naming for virtual switches. A single-character difference in a virtual switch name will prevent a Live/Quick Migration.
- Avoid using multiple virtual switches in hosts to reduce the possibility that a switch naming mismatch will occur.
- Do not use private or internal virtual switches on clustered hosts. A virtual machine cannot Live Migrate if it is connected to a switch of either type, even if the same switch appears on the target node.
- Use ISO images from shared storage. If you must use an image hosted locally, remember to eject it as soon as possible.
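A quick way to catch the switch-naming mismatches described in the first bullet is to compare switch inventories across nodes before anything migrates. A sketch with hypothetical host names:

```powershell
# List every virtual switch on every node; names must match exactly for migrations
'svhv1', 'svhv2' | ForEach-Object {
    Get-VMSwitch -ComputerName $_ |
        Select-Object ComputerName, Name, SwitchType
}
```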
9. Use Multiple Shared Storage Locations
The saying, “Don’t place all your eggs in one basket,” comes to mind. Even if you only have a single shared storage location, break it up into smaller partitions and/or LUNs. The benefits are:
- Logical separation of resources; examples: general use storage, SQL server storage, file server storage.
- Performance. If your storage device doesn't use tiering, you can take advantage of two basic facts about spinning disks: performance is better for data closer to the outer edge of the platter (where the first data is written), and data is more quickly accessed when it is physically close together on the platter. While I don't worry much about either of these facts and have found the FUD around them to be much ado about nothing, there's no harm in leveraging them when you can. Following the logical separation bullet point, I would place SQL servers in the first LUNs or partitions created on new disks and general-purpose file servers in the last LUNs. This limits how much fragmentation will affect either and keeps the more performance-sensitive SQL data in the optimal region of the disk.
- An escape hatch. In the week leading up to writing this article, I encountered some very strange problems with the SMB 3 share that I host my virtual machines on. I tried for hours to figure out what it was and finally decided to give up and recreate it from scratch. Even though I only have the one system that hosts storage, it had a traditional iSCSI Clustered Shared Volume on it in addition to the SMB 3 share. I used Storage Live Migration to move all the data to the CSV, deleted and recreated the share, and used Storage Live Migration to move all the virtual machines back.
- Defragmentation. As far as I’m concerned, disk fragmentation is far and away the most overblown topic of the modern computing era. But, if you’re worried about it, using Storage Live Migration to move all of your VMs to a temporary location and then moving them all back will result in a completely defragmented storage environment with zero downtime.
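The move-out-and-back approach from the last two bullets can be scripted with Storage Live Migration. This is only a sketch; both storage paths are assumptions, and you'd want to confirm free space at the scratch location first:

```powershell
# Shift every VM's storage to a scratch location, then bring it all back
foreach ($VM in Get-VM) {
    Move-VMStorage -VM $VM -DestinationStoragePath ('\\svstore\Scratch\' + $VM.Name)
}
foreach ($VM in Get-VM) {
    Move-VMStorage -VM $VM -DestinationStoragePath ('\\svstore\VMs\' + $VM.Name)
}
```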
10. Use at Least Two Distinct Cluster Networks, Preferably More
As you know, a cluster will define networks based on unique TCP/IP subnets. What some people don’t know is that it will create distinct TCP/IP streams for inter-node communication based on this fact. So, some people will build a team of network adapters and only use one or two cluster networks to handle everything: management, cluster communications such as heartbeating, and Live Migration. Then they’ll be surprised to discover network contention problems, such as heartbeat failures during Live Migrations. This is because, without the creation of distinct cluster networks, it might attempt to co-opt the same network stream for multiple functions. All that traffic would be bound to only one or two adapters while the others stay nearly empty. Set up multiple networks to avoid this problem. If you’re teaming, create multiple virtual network adapters on the virtual switch for the hosts to use.
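For the teamed configuration mentioned at the end, the extra host virtual adapters look something like the following. The switch name and subnets are assumptions, not a prescription; the important part is that each adapter lands in its own unique subnet so the cluster defines distinct networks:

```powershell
# Add management-OS virtual adapters so the cluster sees distinct networks
New-VMNetworkAdapter -ManagementOS -SwitchName 'ConvergedSwitch' -Name 'Cluster'
New-VMNetworkAdapter -ManagementOS -SwitchName 'ConvergedSwitch' -Name 'LiveMigration'
# Unique subnets per adapter are what create separate cluster networks
New-NetIPAddress -InterfaceAlias 'vEthernet (Cluster)' -IPAddress 192.168.10.11 -PrefixLength 24
New-NetIPAddress -InterfaceAlias 'vEthernet (LiveMigration)' -IPAddress 192.168.11.11 -PrefixLength 24
```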
11. Minimize the Number of Network Adapter Teams
In the 2008 R2 days, people would make several teams of 1GbE adapters: one for management traffic, one for cluster traffic, and one for Live Migration. Unfortunately, people are still doing that in 2012+. Please, for your own sake, stop. Converge all of these into a single team if you can. It will result in a much more efficient and resilient utilization of hardware.
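A converged layout of the kind described above might be built like this. Team member names, team name, and modes are illustrative only; Dynamic load balancing assumes 2012 R2 or later:

```powershell
# One team from all four onboard adapters (adapter names are hypothetical)
New-NetLbfoTeam -Name 'ConvergedTeam' -TeamMembers 'NIC1', 'NIC2', 'NIC3', 'NIC4' `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
# One virtual switch on top of the team carries everything
New-VMSwitch -Name 'ConvergedSwitch' -NetAdapterName 'ConvergedTeam' -MinimumBandwidthMode Weight
```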
12. Do Not Starve Virtual Machines to Benefit Live Migration or Anything Else
It’s really disheartening to see 35 virtual machines crammed on to a few gigabit cards and a pair of 10 GbE cards reserved for Live Migration, or worse, iSCSI. Live Migration can wait and iSCSI won’t use that kind of bandwidth often enough to be worth it. If you have two 10 GbE cards and four built-in 1GbE ports, use the gigabit ports for iSCSI and/or Live Migration. Better yet, just let them sit empty and use convergence to put everything on the 10GbE adapters. Everything will be fine. Remember that your Hyper-V cluster is supposed to be providing services to virtual machines; forcing the virtual machines to yield resources to cluster services is a backwards design.
13. Only Configure QoS After You’ve Determined that You Need QoS
I’ve seen so many people endlessly wringing their hands over “correctly” configured QoS prior to deployment that I’ve long since lost count. Just stop. Set your virtual switches to use the “Weight” mode for QoS and leave everything at defaults. If you want, set critical things to have a few guaranteed percentage points, but stop after that. Get your deployment going. Monitor the situation. If something is starved out, find out what’s starving it and address that because it’s probably a run-away condition. If you can’t address it because it’s normal, consider scaling out. If you can’t scale out, then configure QoS. You’ll have a much better idea of what the QoS settings should be when you actually have a problem to address than you ever will when nothing is wrong. The same goes for Storage QoS.
14. Always be Mindful of Licensing
I’m not going to rehash the work we’ve already done on this topic. You should know by now that every Windows instance in a virtual machine must have access to a fully licensed virtualization right on every physical host that it ever operates on. This means that, in a cluster environment, you’re going to be buying lots and lots of licenses. That part, we’ve already explained into the ground. What some people don’t consider is that this can affect the way that you scale a cluster. While Microsoft is taking steps in Windows Server 2016 licensing to increase the cost of scaling up on a single host, it’s still going to be cheaper for most people in most situations than scaling out, especially for those people that are already looking at Datacenter licenses. In either licensing scheme, keep in mind that most people are not actually driving their CPUs nearly as hard as they could. Memory is likely to be the bigger bottleneck to same-host scaling than CPU.
15. Keep in Mind that Resources Exhaust Differently in a Cluster
When everything is OK, virtual machines in a cluster really behave exactly like virtual machines scattered across separate stand-alone hosts. But, when something fails or you try to migrate something, things can get weird. A migration might fail for a virtual machine because it is currently using more RAM than is available on the target host. But, if its host failed and the destination system is recovering from a crash condition, it might succeed. That’s because the virtual machine’s Startup Dynamic RAM setting is likely to be lower than its running RAM.
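The Startup-versus-running gap comes straight from the Dynamic Memory settings. A hypothetical configuration to illustrate; the VM name and sizes are mine:

```powershell
# Hypothetical Dynamic Memory configuration
Set-VMMemory -VMName 'svtest' -DynamicMemoryEnabled $true `
    -MinimumBytes 512MB -StartupBytes 1GB -MaximumBytes 8GB
# A crash recovery only needs the 1GB StartupBytes free on the target node,
# even if the VM had ballooned toward 8GB while running on the failed host
```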
Of course, that’s only talking about a single virtual machine. What about the much more probable scenario that a host has crashed and several VMs need to move? That’s when all of those priority settings come into play. If you have more than two nodes in your cluster, the cluster service will do its best job of getting everyone online wherever they fit. But, you need to have decided in advance which virtual machines were most important. If you haven’t, then the creep of virtual machine sprawl might have left you in a situation of needing to make some hard decisions to turn off healthy virtual machines in order to bring up vital crashed machines. Manual intervention defeats a lot of the purpose of clustering.
Shared storage behaves differently than local storage in more than one way. Sure, it's remote, and that has its own issues. But the shared part is what I'm focusing on. Remember how I said it wasn't a good idea to put your iSCSI on 10 GbE if it would take away from virtual machines? This is why. Sure, maybe, just maybe, your storage really can supply 20 Gbps of iSCSI data. But can it simultaneously supply 60 Gbps for three Hyper-V hosts that each have two 10 GbE NICs using MPIO? And if it can, do you really have only three hosts accessing it? (If you answered "yes" and "yes", most admins in the world are jealous.) The point here is that the size of the pipe a host has into your storage sets the upper limit on how much of that storage system's resources the host can consume, and potentially dominate. Remember that it's much tougher to get a completely clear picture of how storage performance is being consumed in a cluster than on a standalone system.
16. Do Not Cluster Virtual Machines that Don’t Need It
Don’t make your virtualized domain controllers into highly available virtual machines. The powers of Hyper-V and failover clustering pale in comparison to the native resiliency features of Active Directory. If you have any other application with similar powers, don’t make it HA via Hyper-V clustering either. Remember that you have to make licenses available everywhere that a VM might run anyway. If it provides better resiliency to just create a separate VM and configure application high availability, choose that route every time.
Both groups require the same licensing, but the second group is more resilient
17. Don’t Put Non-HA Machines on Cluster Shared Volumes
Storage is a confusing thing and it seems like there are no right or wrong answers sometimes, but you do need to avoid placing any non-HA VMs on a CSV. Microsoft doesn’t support it and it will cause VMM to panic. Things can get a little “off” when the node that owns the CSV isn’t the node that owns the non-HA virtual machine placed on it. It’s also confusing when you’re looking at a CSV and see more virtual machine files than you can account for in Failover Cluster Manager. Use internal space, singularly attached LUNs, or SMB 3 storage for non-HA VMs.
18. Minimize the Use of non-HA VMs in a Cluster
Any time you use a non-HA VM on a clustered host, make sure that it’s documented and preferably noted right on the VM. This helps eliminate confusion later. Some helpful new admin might think you overlooked something and HA the VM for you, even though you had a really good reason not to do so. I’m not saying “never”; I do it myself in my system for my virtualized domain controllers. But, if I had extra stand-alone Hyper-V hosts, that’s where I’d put my non-HA VMs.
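A quick way to keep tabs on the non-HA VMs discussed above is the IsClustered property that Get-VM exposes:

```powershell
# List every VM on this host that is not clustered, along with where it lives
Get-VM | Where-Object { -not $_.IsClustered } | Select-Object Name, State, Path
```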
19. Ask Somebody
If you don’t know, start with your favorite search engine. The wheels for both Hyper-V and failover clustering are very round, and there are lots of people that have written out lots of material on both subjects.
If that fails (and please, only if that fails), ask somebody. My favorite place is the TechNet forums, but that’s only one of many. However, on behalf of myself and everyone else who writes publicly, please don’t try to contact an individual directly. There’s a reason that companies charge for support; it’s time-consuming and we’re all busy people. Try a community first.
One of the things I commonly lament over is the poor state of the management tools available for Hyper-V (from Microsoft; I’m pointedly not talking about third party solutions). One issue I see a lot of is that there isn’t a quick way, when looking at the Hyper-V-specific tools, to know how much free memory a host has. People then have to resort to other tools like Task Manager to determine this. These methods are usually effective, but imperfect. Sometimes, you are unable to match up what those tools display against what happens in Hyper-V.
I could write out a long and complicated script that would display some fairly detailed information on memory usage in your systems, and someday I might do that. However, as this article is part of the ongoing Hyper-V and PowerShell series, the primary focus of this article will be to help you get to the information that you need as quickly as possible with steps that you have a chance of remembering. The secondary lessons in this article are to introduce you to custom objects and basic error handling in PowerShell.
The first stop on the PowerShell memory-exploration train will be WMI. Many people get a quick look at WMI and run away screaming. I can’t blame them. WMI is one of those things that can actually get scarier as you learn more about it. However, there’s no denying the raw power that you can harness with it. In this case, you can breathe easy; it’s fairly trivial to get sufficient information about memory from WMI. Even better, there isn’t a language out there that can interact with WMI as easily as PowerShell. To find out how much memory you’ve got left in your system, you only need to ask the Win32_OperatingSystem class:
Get-WmiObject -Namespace 'root\cimv2' -Class 'Win32_OperatingSystem' | Select-Object -Property FreePhysicalMemory
Wow! Looks like a whole lot of typing and things to remember, doesn’t it? Well, it only looks that way because I have a policy of showing you the entirety of my PowerShell cmdlets because it makes things a lot easier to decode, comprehend, tear apart, and tinker with later. You don’t really need all of that. This will do just as well:
gwmi Win32_OperatingSystem | select FreePhysicalMemory
You don’t even have to capitalize Win32_OperatingSystem, if you don’t want to.
You can run it against multiple hosts simultaneously by passing their names to the -ComputerName parameter:
gwmi Win32_OperatingSystem -ComputerName svhv1, svhv2 | select PSComputerName, FreePhysicalMemory
How accurate is it? Well, that’s a two-part answer. I’ll tackle the easy one first. These two screenshots were taken from the same system at about the same time:
Task Manager Memory Check
GWMI Memory Check
As you can see, they’re quite close. Memory may fluctuate a bit from moment to moment and the precision of Task Manager isn’t the same as the precision of WMI so you shouldn’t expect a perfect match-up, but they’re close enough.
Unfortunately, this isn’t the entire story. The host has used up what it needs, the other guests have used up what they need, so you should be able to start up a guest with something around 2.6 GB of Startup RAM assigned, correct? The answer: maybe. There have been more than a few times when people have been unable to start a guest that was definitely below the Available line. So, what gives?
The answer to that is also the second part of the accuracy question around the WMI call. In the Task Manager screenshot, do you see the small, thin line in the Memory Composition graph? Hold your mouse over it, and you get this:
The segment after the used (shaded area) and before the completely unused (clear area at the far right) represents Standby memory. Just as the tooltip says, it “contains cached data and code that is not actively in use”. Basically, this is stuff that the system could page out to disk but, since there’s currently no pressing need to release it, is holding in RAM. Technically, this memory is available. When I was testing for this article, I was able to start a virtual machine with 2GB of fixed RAM without trouble. However, I’ve fielded questions from people in similar situations that could not start VMs that were within the “Available” range. My only guess is that the host couldn’t page enough of it for some reason. Without personally being there to investigate and never having that happen to me, I can’t definitively say what was going on. But, that’s not the point of this article.
What would really be nice to know is how much RAM is unquestionably available. I didn’t screen shot it, but in Task Manager, if you continue sliding the mouse to the right into the last segment of the Memory composition graph, it will show in the tool tip how much memory is completely open. But what about PowerShell? There’s probably some WMI way to do it, but I don’t know what that is — one of the worst things about WMI is discoverability. Most things in WMI are either painfully obvious or painfully obscure without a lot of middle ground. You could go digging around in WMI Explorer to see if there’s a field to query. Fortunately, there’s a really easy way just for us Hyper-V users:
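The cmdlet shown at this point is almost certainly Get-VMHostNumaNode, the same one that the script later in this article relies on. A likely form:

```powershell
# Per-NUMA-node memory figures straight from the Hyper-V module
Get-VMHostNumaNode | Select-Object NodeId, MemoryAvailable
```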
That’s all you need. Sort of. One my test systems, this is what I get:
The MemoryAvailable field is quite accurate and shows an almost identical number to what Task Manager displays as Free. As I went back and forth, they stayed within a couple dozen megabytes of each other, which is likely accounted for by the brief overhead of running the PowerShell cmdlet. I know that a virtual machine with 1340 megabytes or less of Startup memory will start without trouble.
But (there’s always a “but”), there is a problem. In the displayed system, this is a useful readout because I only have a single NUMA node in my test lab systems. If you’ve got a dual or quad socket host, then you have multiple NUMA nodes. Each node is going to get its own separate statistics set. If you’re not using NUMA spanning, then that’s OK; the largest MemoryAvailable reading represents the largest VM you’ll be guaranteed to be able to start. But, most of us are using NUMA spanning (there are precious few good reasons to turn it off). Unfortunately, there’s no quick and simple way to funnel all of that into a single reading. Sure, I could spend some time and craft a clever one-liner that would do the trick, but such things are very difficult to understand, almost impossible to remember, and day-to-day operations are not good places to use those clever one-liners. So, I worked up a script that is relatively simple but still shows you what you need to know:
<#
.SYNOPSIS
Retrieves the upper and lower bounds of available Hyper-V host memory.
.DESCRIPTION
Retrieves the upper and lower bounds of available Hyper-V host memory. The upper value is the reported free space. The lower value is the truly free space.
.EXAMPLE
PS C:\Scripts> Get-VMHostAvailableMemory
Shows available memory on the local system.
.EXAMPLE
PS C:\Scripts> Get-VMHostAvailableMemory -ComputerName svhv2
Shows available memory on svhv2.
.NOTES
Author: Eric Siron
Authored date: November 5, 2015
Copyright 2015 Altaro Software
v1.1 January 28, 2016: Corrected for inconsistencies in the way Get-VMHostNumaNode works on multi-node hosts
#>
#requires -Modules Hyper-V
function Get-VMHostAvailableMemory {
    param([Parameter()][String[]]$ComputerName = @('.'))
    foreach ($Computer in $ComputerName) {
        $FreeMemoryBounds = New-Object -TypeName PSObject
        if ($ComputerName.Count -gt 1) {
            Add-Member -InputObject $FreeMemoryBounds -MemberType NoteProperty -Name PSComputerName -Value $Computer
        }
        Add-Member -InputObject $FreeMemoryBounds -MemberType NoteProperty -Name LowerBoundMB -Value 0
        Add-Member -InputObject $FreeMemoryBounds -MemberType NoteProperty -Name UpperBoundMB -Value 0
        try {
            $FreeMemoryBounds.UpperBoundMB = [int]((Get-WmiObject -ComputerName $Computer -Namespace root\cimv2 -Class Win32_OperatingSystem -ErrorAction Stop).FreePhysicalMemory / 1KB)
            foreach ($NodeMemoryAvailable in (Get-VMHostNumaNode -ComputerName $Computer -ErrorAction Stop).MemoryAvailable) {
                $FreeMemoryBounds.LowerBoundMB += $NodeMemoryAvailable
            }
            $FreeMemoryBounds
        }
        catch {
            Write-Error -ErrorRecord $_
        }
    }
}
This is designed to be dot-sourced, so save it as a .PS1 (Get-VMHostAvailableMemory.ps1 would be good) and then dot-source it from a PowerShell prompt:
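If you haven't dot-sourced before, it's just a dot, a space, and the path to the file. Assuming you saved it to C:\Scripts:

```powershell
cd C:\Scripts
. .\Get-VMHostAvailableMemory.ps1
Get-VMHostAvailableMemory
```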
You can also add it to your profile using steps you can find most anywhere, including earlier in our series. Another option would be to add a single line at the very end of the script with only “Get-VMHostAvailableMemory”; you can then run the script directly, but you can’t use Get-Help on it and you can’t feed it ComputerName qualifiers (well, you can, but they won’t do anything useful).
PowerShell Lessons from this Script
I tried to keep the complexity to a minimum but there are some new tricks in this script that I haven’t shown you before.
PowerShell Custom Objects
First is the usage of New-Object and Add-Member. I don’t want to use Write-Host or anything of the sort because it is inappropriate here, but I do need to ensure that I’m not just dumping raw numbers on you without context. The New-Object line creates a blank object, conveniently called “FreeMemoryBounds”. If you supplied more than one computer name, the first Add-Member line appends a property called PSComputerName and populates it with the name that you submitted; this is to simulate the action of a normal implicit remoting command since this script masks the underlying remoting capabilities of the cmdlets that it relies on. The next two lines add properties named “LowerBoundMB” and “UpperBoundMB”.
After all the processing is done, there is a single line near the end of the try block that contains just the object, all by itself. What that does is place the object into the pipeline. Once the script finishes, you're given a full PowerShell object with these properties attached to it. You can pass that object to other routines and access its properties with a dot, just like any other PowerShell object. If you simply run the script, it outputs these fields to the screen in an easy-to-follow format. Also, because of the "foreach" loop, you'll be given one object per valid ComputerName input.
PowerShell Error Handling with Try-Catch
The second thing is the usage of try-catch. I had two major reasons to use it here. The first is that, without it, it would be trivial to create a FreeMemoryBounds object that had invalid data. For just one host queried in an interactive session, that wouldn’t be a problem. If you were performing an automated run against 30 hosts, that would be a much bigger problem. The second reason is that, without the try block, an invalid computer name (or blocked firewall, or insufficient permissions, etc.) would result in both of the information retrieval cmdlets being run, even though an error condition on the first one is enough to let you know that the second isn’t worth the effort. Again, not a big deal for a single iteration, but problematic when there are many.
I chose to take a very simple route. I took the two cmdlets that are most likely to fail and put them into a single try block. Also, because I only want the new FreeMemoryBounds object to be placed in the pipeline if it has valid data, I included it in the same try block. That way, if either of the two cmdlets before it fails, that line is skipped and the object is silently destroyed. If an error does occur, I just emulate the same mechanism that PowerShell uses to place the error into the error stream (Write-Error). I do this because I don’t want to suppress the error but I also don’t care about the error itself. What I care about is not allowing an error to affect the product of my script.
There is one very, very important lesson to be learned. I even see some of the top-tier PowerShell experts make mistakes here. For every cmdlet exception that you wish to be caught in a try block, you must include the ErrorAction parameter and set it to “Stop”. PowerShell’s default error action is to display the error and keep going (“Continue”). Some people override that on their systems, but most people won’t. If the default action isn’t Stop and you don’t set ErrorAction to Stop, then the try block leads a pointless existence that just makes your script a little harder to read.
While this is important to remember, especially because it’s not intuitive, it also highlights a special feature(?) of PowerShell that you won’t find in traditional computer programming languages: if you have a cmdlet inside a try block that you don’t want to trigger the error handling system, then just set its ErrorAction to something other than “Stop”. Ordinarily, you would just keep cmdlets of that kind outside a try block in the first place, but this capability grants you quite a bit of flexibility to selectively ignore and capture error conditions as necessary.
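To see the ErrorAction behavior for yourself, here's a quick sketch; the path is deliberately bogus:

```powershell
# Non-terminating error: the error displays, but the catch block never runs
try { Get-Item 'C:\DoesNotExist' } catch { 'caught' }

# -ErrorAction Stop converts it to a terminating error; this one prints "caught"
try { Get-Item 'C:\DoesNotExist' -ErrorAction Stop } catch { 'caught' }

# Selectively ignore one cmdlet inside a try block while still catching another
try {
    Get-Item 'C:\DoesNotExist' -ErrorAction SilentlyContinue  # ignored on purpose
    Get-WmiObject Win32_OperatingSystem -ErrorAction Stop     # this one we care about
} catch { 'caught' }
```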