Last year, Nirmal Sharma wrote a fantastic article on this blog titled 23 Best Practices to improve Hyper-V and VM Performance. This sparked up a very lively discussion in the comments section; some were very strongly in favor of some items, some very strongly opposed to others. What I think was perhaps missed in some of these comments was that, as Nirmal stated in the title, his list was specifically “to improve Hyper-V and VM performance.” If squeezing every last drop of horsepower out of your Hyper-V host is your goal, then it’s pretty hard to find any serious flaws with his list.
Just cause a Group can be brought to consensus, does not make them right. Assess the Risks of being wrong before proceeding on their say so.
— guy w wallace (@guywwallace) February 27, 2015
As you probably know, or can at least guess, I’m not the biggest fan of “best practices” lists. As I’ve said many times in the past, I think far too many administrators have an unhealthy obsession with performance and, as a result, build very wasteful environments. So, what I’d like to do with my “best practices” list is shift the goal. That particular goal is in my title as Nirmal’s was, but I’m also going to add it to the very first entry in this best practices list:
- In advance, determine the metric by which you will judge your virtual environment to be functioning successfully, and build for that. It can be a single metric or a combination of metrics. The only restriction is that you must be able to defend your choice of metric to anyone who asks. By “anyone”, I usually mean corporate leadership.
My choice? My metric is end-user experience. If the users are happy, then I’m happy. If the users are unhappy, I first find out why they’re unhappy and if my virtual environment has anything to do with it before I make any changes at all.
That means that I’ve sometimes built virtual environments that run at 110% or more of normal capacity in a fair weather state, which means they’d be running even hotter if something failed and we had to fall back on redundant systems. For the fair weather portion, if it doesn’t bother the users, then it doesn’t bother me. For the slower performance in failover times, if the performance hit is bearable for the users and they can understand that we’re working on the problem, then it doesn’t bother me either. The organization paid for 100% of the CPU, memory, and disk space/speed in their virtualization systems. As far as I’m concerned, tuning and balancing them so that they only use 4% of that capacity means that there was little point in virtualizing any of those guests at all (and that some administrators never learn that corporate servers are not video game systems).
- Do your research. Since you’re here reading our blog, I’m going to assume that’s not asking too much of you. What you really need to do for yourself is go deeper than these lists. When you find one, make sure you understand why a best practice is recommended. If it’s only there because it was on someone else’s list, and you can’t find your way back to the source, and the reason for it is not axiomatic, then you should be suspicious of it. Even if something does look well-supported, be at least a little suspicious. Its assumptions may not fit your situation. You’ll notice I didn’t say anything about the credentials of a list’s author. That’s because they are irrelevant. A person’s certifications and awards do not make that person correct. Only being correct does that. If you do something wrong because an “expert” was wrong, you’re the one that will get in trouble.
- Overload your Hyper-V Servers. Overcommitment is what we do in the virtualization space. Assign more vCPUs than you have physical cores. Turn on Dynamic Memory. Use dynamically expanding hard disks. If you’re not ready for any one of those items, you are not ready to virtualize. You paid for all the hardware, so use it all. Unless, of course, your hands are tied by a software vendor that’s not ready to virtualize. That’s on them, not you.
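To make "overcommitment" concrete, here is a minimal Python sketch that computes how far a host is overcommitted on CPU, memory, and disk. The `Host` and `Guest` classes, their field names, and the sample figures are all hypothetical, invented for illustration; nothing here queries Hyper-V. What ratio is acceptable is whatever still satisfies the metric you chose in #1.

```python
from dataclasses import dataclass


@dataclass
class Host:
    physical_cores: int
    physical_memory_gb: float
    storage_capacity_gb: float


@dataclass
class Guest:
    name: str
    vcpus: int
    max_memory_gb: float  # Dynamic Memory maximum (illustrative field)
    max_disk_gb: float    # maximum size of its dynamically expanding disk


def overcommit_ratios(host, guests):
    """Return assigned-to-physical ratios; anything above 1.0 is overcommitted."""
    return {
        "cpu": sum(g.vcpus for g in guests) / host.physical_cores,
        "memory": sum(g.max_memory_gb for g in guests) / host.physical_memory_gb,
        "disk": sum(g.max_disk_gb for g in guests) / host.storage_capacity_gb,
    }


# Hypothetical host and ten identical guests, purely for illustration.
host = Host(physical_cores=16, physical_memory_gb=128, storage_capacity_gb=2000)
guests = [Guest(f"vm{i}", vcpus=4, max_memory_gb=16, max_disk_gb=300) for i in range(10)]
ratios = overcommit_ratios(host, guests)
print(ratios)
```

A ratio above 1.0 is not a problem by itself. It only means the guests could collectively ask for more than the host has, which is exactly the bet that virtualization makes.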
- Avoid creation of multiple Hyper-V switches. That was in Nirmal’s list, and I totally agree with that sentence. Since we’re not trying to shoot for the absolute maximum on performance, I’ll disagree with his “must always” sentence, though. If using a private/internal virtual switch alongside your external switch will help you isolate and you can still meet #1, create that switch. Just ensure that you understand what’s going on.
- Configure anti-virus. That item is also from Nirmal’s list and it might be my favorite. Even in a list designed to maximize performance, he doesn’t counsel you to just throw out a basic security mechanism as so many others do. He doesn’t succumb to that contagious FUD that tells you never to use AV on a Hyper-V system because it might slow you down a bit or, if misconfigured, will delete your VMs. Both of those things are true, but avoiding antivirus because you don’t want to follow the directions is hardly a valid excuse. Nirmal’s list for AV configuration is good, but I did find a few more thorough resources. You can find them on step 6 of my Hyper-V security post.
- Don’t spend any time stressing yourself over whether some VMs have integration services and some don’t. In the big picture, that just doesn’t make a meaningful amount of difference to performance, and makes absolutely no difference when you’re just trying to build a functioning virtual infrastructure. Trying to juggle all that will cause you more lost time in configuration than you’d ever lose by just not worrying about it.
- If you have to store Hyper-V guests on the same storage as the management operating system, it will all be OK. After boot-up, the management operating system doesn’t use its disks that much. In fact, when I had the honor of helping Jeffery Hicks do a bit of field testing on his PowerShell Hyper-V health script, we found that he had to do some retuning because my production hosts were unexpectedly sustaining 0% disk usage on a regular basis.
If you have a standalone Hyper-V host and you’re only going to use internal storage, then you will get better overall performance by putting all those disks in a single RAID array and running everything from it than you will by segregating a few spindles just for the management operating system. I would use whatever tools I had at my disposal to create at least two logical volumes, whether in hardware or in Windows, so that the management OS’s storage is separated from the VMs that way. If you can’t keep them apart, remember #12 (because you always read ALL the steps before following ANY of them, right?).
When you are tuning your hard disk performance, always remember that the bottlenecks are the disks. Getting fancy by spreading I/O across controllers is a complete waste of time.
- Let your virtual machines share volumes. Not doing so is a disturbingly incredible waste of space. One of the things that brought Hyper-V out of the dark ages was the ability to have multiple virtual machines in a cluster share underlying disk space. This is one of the few places I think that Nirmal’s original article is fundamentally wrong. The various VM worker processes will use the same amount of I/O regardless of where the storage really is, so you gain nothing in that area no matter what you do. The bigger reason people often counsel against sharing volumes is because of the fragmentation boogeyman, but that’s mostly because they don’t understand how disk access works in this century.
I would counsel you to not put all your VMs in a single location, though. Multiple CSVs in a cluster help it to balance access across nodes. Multiple storage locations let you use Storage Live Migration to clean up things that you didn’t see coming. Having extra unassigned storage can help you out when you accidentally misconfigure something storage related. But, putting each VM in its own storage sandbox takes all of this way too far.
- Do use multiple physical NICs in the Hyper-V host. I would take Nirmal’s advice a bit further and encourage you to use convergence rather than dividing up the physical NICs into multiple teams. Too many people are still using 2 pNICs for management, 2 pNICs for guest traffic, 2 pNICs for one cluster role, 2 pNICs for another cluster role, etc. This is wasteful of the networking hardware’s capacity because some will be idle while others could use more power.
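The waste in dedicated teams is easy to see with a little arithmetic. The following Python sketch compares how much offered traffic gets served when each role is capped by its own 2-pNIC team versus when all adapters are converged into one pool. The role names, the demand figures, and the 1 Gbps link speed are assumptions made up for the example, and the model deliberately ignores per-flow teaming limits and QoS.

```python
def served_with_dedicated_teams(demands_mbps, team_capacity_mbps):
    """Each role is capped by its own team; idle capacity on other teams is unusable."""
    return sum(min(demand, team_capacity_mbps) for demand in demands_mbps.values())


def served_with_convergence(demands_mbps, total_capacity_mbps):
    """All roles draw from one shared pool of adapters."""
    return min(sum(demands_mbps.values()), total_capacity_mbps)


# Hypothetical momentary demand per role, in Mbps.
demands = {"management": 200, "guests": 3000, "cluster": 100, "live_migration": 100}

# Four teams of two 1 Gbps pNICs each, versus the same eight pNICs converged.
dedicated = served_with_dedicated_teams(demands, team_capacity_mbps=2000)
converged = served_with_convergence(demands, total_capacity_mbps=8000)
print(dedicated, converged)
```

With these made-up numbers, the guest team saturates at 2000 Mbps while three other teams sit mostly idle. The converged design serves the full demand from the same hardware.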
- Use any operating system as a guest that you need to meet your organization’s needs. If it’s not supported but fully functional, then you lose nothing. If it’s not fully functional but you can get by, you’ve met organizational needs. If it doesn’t work, then you don’t need me or anyone else to tell you not to use it. Our jobs are to provide the compute resources that the other people in our organization require to do their jobs, not meet some arbitrary checklist from someone that doesn’t have to architect around our problems.
- Don’t always use Generation 2 virtual machines. They don’t work for operating systems prior to Windows Server 2012 and desktop Windows 8. I’ve personally had some odd, difficult-to-reproduce stability issues with them and I know others have as well. Your software vendors may not be comfortable with them. Using a new technology just because it exists is inappropriate behavior for a systems administrator, especially at the infrastructure level. They are a bit faster to boot up, though, so if you’re still in that hyper-performance mindset then they are your first choice.
- Monitor all your systems all the time. This entry, or something like it, should exist on every best practices list for all administrators in all fields. But, I wouldn’t say to monitor them specifically to crank up performance metrics. If your performance is decreasing and you know why (as in, you’ve added more guests or something) and your users are not impacted, then carry on.
A lot of very poor “best practices” are designed around the idea of protecting you from yourself, such as, not running out of hard disk space. This is what monitoring is for. Do this, not that.
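As a minimal sketch of what "this is what monitoring is for" can look like, the Python snippet below checks free space on a volume and flags it when it drops below a threshold. The function names and the 15% default are assumptions for illustration only. A real environment would feed this kind of check into whatever monitoring system it already runs rather than a standalone script.

```python
import shutil


def is_low_on_space(free_bytes, total_bytes, min_free_fraction=0.15):
    """Return True when the free fraction falls below the chosen threshold."""
    return (free_bytes / total_bytes) < min_free_fraction


def check_volume(path, min_free_fraction=0.15):
    """Read real usage for the volume containing `path` and evaluate it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_low_on_space(usage.free, usage.total, min_free_fraction)


if __name__ == "__main__":
    # "." stands in for a VM storage path; substitute your own.
    if check_volume("."):
        print("Free space is below threshold; investigate before a guest pauses.")
    else:
        print("Free space is fine.")
```

The point is the alert, not the script: once something tells you that space is running low, you no longer need to cripple dynamically expanding disks or checkpoints to protect yourself.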
- Quit it with the constant defragmentation jobs already. Are you running Hyper-V off a single 5400 RPM platter with seek times in the high double- or even triple-digits of milliseconds? No? Then fragmentation is unlikely to be a serious problem. Most computer systems will perform just fine with a surprisingly high degree of fragmentation, but too many administrators don’t know that because they’ve never let it happen, nor have they updated their knowledge base on drive performance characteristics beyond 1997. Actually, let me correct that: most administrators are still uncritically following recommendations written by other people that are still following recommendations that were designed in 1989 and updated no later than 1996 by people that pioneered the spinning disk small enough to fit into a desktop chassis and have long since retired and would laugh at you if they knew you were still following them.
Disk access in a modern operating system environment is naturally scattered because multiple processes are being constantly shuffled and their data and paged memory lives in different places on the disk. Disk access in modern virtual environments is naturally highly scattered. If that’s a problem for you, you’ve either been listening a little too closely to someone who makes money off of selling you storage solutions or you’re just not ready for virtualization. Or systems administration, for that matter. Defragmentation is almost exclusively for single-disk desktops and laptops, and even there it’s overdone. Defragmentation puts a lot of wear on your read/write heads and will never repay you for it. Quit it. If you need more disk performance, add spindles and/or switch to/augment with SSD.
- Always install integration services if they’re available. Yes. I can’t think of any time you wouldn’t want to do this, so it’s an uncontested best practice.
- Only use fixed VHD(X) files when you must. Again, I must remind you that Nirmal’s post was about maximizing performance. The performance variance between fixed and dynamically expanding is small – not nearly enough that it should prevent you from using them if a denser environment is within the parameters of the goal you set in #1. The myths surrounding this technology are extremely powerful but you must resist them. Overcommitment is what we do in virtualization. Get with the program.
If you’re ready to crawl out from under the shadows of the “never use dynamic VHD(X)” FUD, then I have an article for you on the use of Hyper-V Dynamic Disks. It does need to be updated, but I’d still say that it’s aged well.
- Use Dynamic Memory. I’m a little surprised to see this in Nirmal’s list, to be honest. In a truly high-performance environment, I would avoid it. That’s because it takes, I don’t know, maybe a whole millisecond for Hyper-V to reassign memory if a guest’s demand exceeds its allocation. All joking aside, if a guest needs more and it’s not there, that could hurt performance.
But, this post isn’t about maximum performance. I definitely say use Dynamic Memory when you can.
- Consider separating data onto a SCSI disk. I’ll confess that I’m a bit confused on Nirmal’s usage here. I think that he means to place data on a virtual disk connected to a virtual SCSI controller in your virtual machine. In most cases, it won’t matter. Virtual IDE isn’t supposed to be measurably slower than virtual SCSI. In practice, I’ve seen people encounter some starkly different results, which I attribute to some characteristic of the underlying storage system. If he’s talking about physical SCSI as opposed to physical SATA, then the performance difference is probably not because of the interface, but because SCSI drives tend to have higher rotational speeds and lower access latencies.
Anyway, Nirmal’s post says “always” (because of performance, remember?). I say, “do what makes sense in pursuit of #1”. One part of his post that doesn’t make any sense is to further separate things such as SQL logs onto other virtual SCSI disks. That’s another example of trying to overcome multi-millisecond disk latencies by tinkering with multi-nanosecond controller functions. For a VM, they’re not even really separate controllers. If you separate a VM’s disks into separate virtual disks that aren’t actually on separate physical disks, you’ve accomplished nothing except logical separation.
- Don’t stress about your page file’s performance. If a virtual machine’s page file is being used so much that you have to worry about its performance, what you need to stress about is that you have provisioned that virtual machine’s CPU and/or memory very, very badly. The same concept goes behind excluding the VM’s page file disk for replication. That’s a good idea and I won’t argue with Nirmal on it, but really, your VM should not be churning its page file that hard.
- Choose between Hyper-V Server, Windows Server in Core mode, and Windows Server in one of the GUI modes as best suits your needs. I personally prefer to run only Hyper-V Server. But, I work with a lot of people that don’t yet know Hyper-V that well and almost never use Server in Core mode, and some of them have to be my backup when they’re on-call or when I’m on vacation. Making their lives difficult just to squeeze a couple more megabytes of memory out of the system is absolutely not worth it.
- Close unnecessary windows (and connections, and PowerShell sessions, and everything else). Nirmal was really kind about this. Personally, I’d be a bit more forceful. Disconnected RDP sessions waste lots of memory and are easier for an attacker to hijack than a system with no active sessions. Completely logging out of and closing all idle sessions should just be something that all administrators do in all situations where some environmental factor (e.g., really poorly designed server application software) doesn’t prevent it.
- Use Certified Hardware when possible. I really wish everyone could do this all the time, as it would help reduce their problems. Budgets often dictate otherwise, though. Don’t worry too much if that’s you. Of all the support teams I’ve ever worked with, Microsoft is the most forgiving. They will make a best effort to help you with whatever you’ve got.
- Provision your virtual machines wisely. Here is where we move away from Nirmal’s list and into mine. A lot of people can avoid a lot of things like needing to tune for page file performance and needing to stand up extra Hyper-V hosts if they would just spend a little more time designing virtual machines. Even where I work, where we have so many machines that we strive for a common build, we still spend time to architect virtual machines so as not to be either wasteful or restrictive.
- Provision your hosts wisely. 64GB of RAM and quad 12-core CPUs? Oops. Somebody goofed on #2.
- Leave slack memory for the management operating system. It’s supposedly possible to calculate the exact memory that your management operating system needs. I even included a formula in my book on building a Hyper-V cluster. The problem is, the system just needs more than that. I’ve seen people spend a whole bunch of time using Sysinternals tools to discover where that extra 70 or 80 megabytes is that’s preventing a guest from turning on, and it’s painful. I know that memory isn’t dirt cheap, but it’s not that expensive, either. Estimate high on management operating system usage. Where to draw the line depends on a lot of factors, but 2 GB for most hosts and 4 GB for high-RAM, high-density hosts would be a good rule of thumb.
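The rule of thumb above translates into a very small calculation. This Python sketch reserves 2 GB or 4 GB for the management operating system and checks whether a set of guests fits in what remains. Whether a host counts as "high-density" is left to the caller because the post does not define a cutoff. The function names and sample guest sizes are hypothetical.

```python
def management_reserve_gb(high_density=False):
    """Rule-of-thumb reserve: 2 GB for most hosts, 4 GB for high-RAM, high-density hosts."""
    return 4 if high_density else 2


def guests_fit(host_ram_gb, guest_startup_gb, high_density=False):
    """Return True if the guests' startup memory fits after the host reserve."""
    budget = host_ram_gb - management_reserve_gb(high_density)
    return sum(guest_startup_gb) <= budget


# Hypothetical 64 GB host: 62 GB is left for guests with the 2 GB reserve.
print(guests_fit(64, [16, 16, 16, 14]))                     # exactly fills the budget
print(guests_fit(64, [16, 16, 16, 15]))                     # one GB over
print(guests_fit(64, [16, 16, 16, 14], high_density=True))  # 4 GB reserve leaves 60 GB
```

Estimating high costs you a little memory on paper. Estimating low costs you an afternoon with Sysinternals tools hunting for the missing 70 or 80 megabytes.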
- Join your Hyper-V host to the domain unless it is in a perimeter network. The first reason is that workgroup security is never superior to domain security. The second reason is that working with machines in a workgroup environment, especially remotely, is a micromanagement hell that you do not want any part of. I also want to write a BP that says to keep your Hyper-V host out of the perimeter network and let the isolation techniques of the Hyper-V virtual switch do their job, but I’m not sure that’s universal enough to quite reach “best practice”.
- Don’t install anything in the management operating system that isn’t meant to aid the management operating system or its guests. Antimalware is fine. Backup software is fine. RRAS is not. IIS is not. Active Directory Domain Services is out of the question. Web browsers are something I would fire you for installing on any server (web browsing is a laptop/workstation activity). Put that stuff in a guest. If you install it in the management operating system, you forfeit a guest licensing privilege anyway, so you really have no excuse.
- If your Hyper-V host has multiple NICs, ensure that only the one you are using for management is registered in DNS. So many problems can be avoided this way. If a non-routable adapter is in DNS, then something out there is going to try to talk to your host on that IP and it’s not going to work, and Bad Things™ will result.
- Tune network performance to match #1, not Live Migration. OK, fast Live Migration is really cool, but if your users can’t move enough data to do their jobs while your Live Migrations are running, you did it wrong. Or, what you decided in #1 is that user needs don’t matter, in which case someone made a mistake in hiring you. It’s true that VMs rarely use very much of their available bandwidth and it’s fine to architect with that in mind, but remember to balance that against the services your guests provide, which should be considered of paramount importance.
- Do not create a virtual adapter for the management operating system unless you have a defined use for it. This is that pesky “allow management operating system to share this adapter” checkbox. If you don’t know exactly why you’re checking it, don’t (#2). I do it because I have a converged network and I will use it for management and cluster roles. If you broke out all your physical adapters into separate teams and your virtual switch does nothing but carry VM traffic, don’t check that box.
- Quit worrying about binding order. Learn how TCP/IP works before you start tinkering with Binding Order, because this is probably the biggest waste-of-time red herring that new Hyper-V admins find themselves sent after. If you “need” to adjust binding order to fix a problem, you probably did something wrong in #29.
- Don’t fiddle with network offloading tech until you’ve identified a problem. Yes, for some people, turning off VMQ makes things better. For some, it doesn’t. Some of them might have been fixed by a driver update, too.
- Be mindful of snapshots, not fearful. These are great tools when used appropriately, especially when you are dutifully following #12. Unfortunately, there is a lot of senseless fear around them, because that one time, that one thing, and Bad Things™, and Never Again™. There is a popular and long-standing post written by Microsoft’s Roger Osborne that tells you to set snapshot locations to “a non-existent location”. I have always despised this advice, because breaking a technology to avoid abusing it is a terrible way to run an IT shop. But, I like it here because, as wonderful as that document is otherwise and as hesitant as I am to criticize Mr. Osborne, it’s a great example of why you should never blindly follow any “best practices” list. This is because you will find that you are unable to enact this strategy in Hyper-V anymore, at least in 2012 R2. It will literally not allow you to assign a non-existent location for snapshots. So, the morals are: follow #2, follow #12, train who you can, and get rid of who you can’t (or at least revoke admin powers).
- Do not mentally lump snapshots (checkpoints) and dynamic disks together. When you’re reading all the FUD bits about dynamically expanding disks, you’ll usually find them erroneously equated to differencing disks. The differencing disk is what snapshots use. It has no upper limit on growth, is separate from its parent, can result in reduced performance as I/O gets bounced between parent and child, and requires even more space to merge back into the parent once you’re done with it. Dynamically expanding disks have none of these issues. While I already linked to my FUD-busting article on this in #15, I do want to repeat one point here: dynamically expanding disks will stop growing once they hit their maximum assigned value. If yours exceed available space, you did something wrong in #12 or #22 or #23.
- Don’t be afraid of the virtualized domain controller. It can save you time and money.
- Avoid physical-to-virtual conversions. Microsoft updated version 3 of its virtual machine converter so that you have a free way to perform physical-to-virtual conversions. Of course, there are other options, like Disk2vhd and Clonezilla and commercial solutions. However, it is almost always better to start with a native virtual machine and migrate software to it. P2V should be considered an option of last resort.
- When testing, use testing tools. Sometimes, a sentence is so axiomatic you wonder why it needed to be said. That’s one of them, right? Unfortunately, it is not. If you come to me and tell me that your host is having a network slowness problem and the only reason you know this is because you did a file copy, then do not be surprised if I don’t even respond. If you discovered that your network was slow because of a file copy and then you ran NTttcp to confirm it, then I will listen to you.
- Get good backups. I shouldn’t have to tell you that, since you’re a system administrator and all, but seriously, run a backup. Regularly. Test them. Regularly. Monitor them. Always.
- Use backup for backup, replica for replica, and snapshots (checkpoints) for snapshots. These technologies are meant for different purposes and are not interchangeable. Backup is meant to be a long-term multi-layered strategy against major disasters and minor accidents that go unnoticed for a while. Replica is meant for disaster recovery, as in, “all is lost!” situations. It is not long-term and does not maintain much of a history. Snapshots are meant for extremely short-term protection when recovery needs to be exceedingly quick.
- Get licensing in shape before going into production. This applies to the management operating systems, the guest operating systems, and all the software that you’ll be using. You might be on the phone for fifteen minutes sorting it out, but it’s better than the fifteen minute stopover you’ll make in your boss’s office on the way to the unemployment line as he writes a $200,000 check to the Business Software Alliance for piracy.
- Not sure about something? Try it. This isn’t always feasible, but I would say about 60% of the questions I see floating around about Hyper-V could be answered very quickly just by trying. For example, “can I assign more virtual CPUs to a guest than I have cores in my host?”
- Get training if you need it. I’m all about doing a lot of research and figuring things out on my own, but having been a Microsoft Certified Trainer for a number of years and having been a student of some good trainers for years before that, I can tell you unequivocally that there is benefit in sitting in a classroom and doing those labs and asking those questions, especially if you are brand-new to a technology. Hyper-V has grown a lot since its introduction and it is tough to know everything. Whereas I really only had classroom training available in my day, you now can use the Microsoft Virtual Academy, offline training videos from third-party vendors, online classrooms, and many other resources.
- Periodically review everything. That VM you built in 2009 and have been bringing along through successive updates to Hyper-V – do you still use it? Do you know what it does? Is it using the right amount of memory and CPU? What about the QoS settings on your virtual switch? It made sense when you designed it, but now that it’s been live for a while, does your switch push so much traffic that you can even justify the calculation overhead? Why are those VMs sharing that particular CSV? When was your last backup? When was the last time you tested your backup? How is the licensing situation?