We’ve got some great DRS posts on the Altaro VMware Blog, such as how to set up DRS and how to use affinity rules, but one gap I noticed was the lack of a troubleshooting post for DRS. This post will set that right. It is broken into a few sections because errors and warnings can come from multiple areas: we’ll start with cluster problems, move on to host problems, and finish up with virtual machine problems.
A Cluster is Red Because of an Inconsistent Resource Pool
When working with DRS and clusters, you need to verify that your resource pool tree is internally consistent. If it is not, the cluster will not have enough resources to satisfy the reservations of all running virtual machines, and the cluster turns red. You can check this by selecting the cluster, opening the Summary tab, and looking at the vSphere DRS section. Also make sure that the sum of all child reservations is not greater than the parent pool’s non-expandable reservation.
You’ll want to check all resource pools in the cluster. If possible, remove them, and only use resource pools if you absolutely need to. Too many times I’ve seen people implement them, run into issues, and then modify the very limits and reservations that were the reason they created the pools in the first place.
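To make the reservation check concrete, here is a minimal Python sketch. It does not talk to vCenter: the `Pool` class, the `find_overcommitted_pools` function, and the sample numbers are all hypothetical, and you would fill the tree in by hand or from your own inventory export. It only illustrates the rule that the children of a non-expandable pool must not reserve more than the pool itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pool:
    """Hypothetical stand-in for a resource pool (or a VM leaf) in the tree."""
    name: str
    reservation_mhz: int = 0
    expandable: bool = False
    children: List["Pool"] = field(default_factory=list)


def find_overcommitted_pools(pool: Pool) -> List[str]:
    """Return names of non-expandable pools whose children reserve more than the pool."""
    problems = []
    child_total = sum(child.reservation_mhz for child in pool.children)
    if pool.children and not pool.expandable and child_total > pool.reservation_mhz:
        problems.append(pool.name)
    # Check every level of the tree, not just the top one.
    for child in pool.children:
        problems.extend(find_overcommitted_pools(child))
    return problems


# Made-up example tree: "Prod" reserves 6000 MHz but its children ask for 7000 MHz.
root = Pool("Cluster", reservation_mhz=10000, children=[
    Pool("Prod", reservation_mhz=6000, children=[
        Pool("db01", reservation_mhz=4000),
        Pool("app01", reservation_mhz=3000),
    ]),
    Pool("Test", reservation_mhz=2000),
])

print(find_overcommitted_pools(root))  # ['Prod']
```

The sketch treats an expandable pool as always fine, which is a simplification: an expandable pool can still run out if its parents have nothing left to lend.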
Load Imbalances on a Cluster
This is a very common DRS issue. Basically, your cluster can’t move VMs! Ouch! There are a few reasons why this happens, and I’ve listed the most common ones below.
- VM/VM or VM/Host rules prevent virtual machines from being moved.
  - This one can be a self-inflicted wound. If you create rules that say certain VMs cannot sit next to each other on the same host, VMware will not let you move them to the same host. The fix is rather simple: add more hosts so the VM can go somewhere else, or disable the rule.
- DRS is disabled for the machine.
  - Again, this one is a self-inflicted wound. If you disable DRS on a machine because you want to keep it in one place, that’s fine, but it will cause issues with load balancing.
- CD/ISO files are mounted that are local to a host.
  - This one will drive us all crazy at some point. You need to make sure you’re not using local ISO files. If they’re on shared devices, that’s fine, but be sure all the hosts can see the storage devices. If a host can’t see the same device, the VM won’t move. It’s a vMotion thing: DRS will try to vMotion the VM and can’t.
- vMotion is not enabled or set up.
  - If you don’t have vMotion, DRS can’t load balance automatically. VMs will still be placed by DRS when they are powered on (as long as they’re on shared storage), but even a Fully Automated cluster won’t be able to migrate running VMs.
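The checklist above can be sketched as a small Python function. This is only an illustration under stated assumptions: `VmInfo`, `migration_blockers`, the field names, and the datastore names are all hypothetical and are not part of any VMware API. You would populate the fields from whatever inventory data you already have.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class VmInfo:
    """Hypothetical summary of the VM settings that matter for DRS moves."""
    name: str
    drs_enabled: bool = True             # False if DRS is disabled for this VM
    pinned_by_rule: bool = False         # True if a VM/VM or VM/Host rule rules out every other host
    iso_datastore: Optional[str] = None  # datastore backing a mounted ISO, None if nothing is mounted


def migration_blockers(vm: VmInfo, shared_datastores: Set[str], vmotion_enabled: bool) -> List[str]:
    """Return the reasons from the checklist that apply to this VM."""
    reasons = []
    if vm.pinned_by_rule:
        reasons.append("affinity/anti-affinity rule")
    if not vm.drs_enabled:
        reasons.append("DRS disabled for VM")
    if vm.iso_datastore is not None and vm.iso_datastore not in shared_datastores:
        reasons.append("ISO on non-shared datastore")
    if not vmotion_enabled:
        reasons.append("vMotion not enabled")
    return reasons


# Made-up inventory: datastores every host in the cluster can see.
shared = {"san-lun01", "nfs-iso"}
vms = [
    VmInfo("web01"),
    VmInfo("db01", pinned_by_rule=True),
    VmInfo("build01", iso_datastore="esx01-local"),
]
for vm in vms:
    print(vm.name, migration_blockers(vm, shared, vmotion_enabled=True))
```

An empty list means none of these four causes applies, so the next place to look is the host and vMotion configuration itself.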
DRS Doesn’t Move Any Virtual Machines from a Host
If virtual machines don’t move off a host, it can signal a few issues. It can be affinity rules again, or something else, such as the hosts not having compatible CPU architectures. DRS is often limited in what it can do if vMotion doesn’t work between the hosts. If it was working and now isn’t, I generally check a few things:
- Check that the hosts can vMotion to each other manually. Test with another virtual machine.
- Verify that the VMkernel ports are configured properly and no recent changes have occurred.
It’s also possible that the cluster is already balanced and DRS does not need to move anything. If you’re running a Fully Automated cluster, don’t expect to run DRS manually and see much activity. You really should see nothing, as VMware is already doing it for you. Check the cluster’s migration threshold slider.
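As a rough way to sanity-check whether hosts are already evenly loaded, you can compare per-host utilization yourself. The Python sketch below is not how DRS makes its decision; the `load_spread` and `looks_balanced` functions, the 0.10 threshold, and the host numbers are all assumptions chosen for illustration.

```python
from statistics import pstdev
from typing import Dict


def load_spread(host_cpu_usage: Dict[str, float]) -> float:
    """Population standard deviation of per-host CPU utilization (0.0 to 1.0)."""
    return pstdev(list(host_cpu_usage.values()))


def looks_balanced(host_cpu_usage: Dict[str, float], threshold: float = 0.10) -> bool:
    """True if the spread is at or below the (hypothetical) threshold."""
    return load_spread(host_cpu_usage) <= threshold


# Made-up utilization figures taken from the host summary pages.
even = {"esx01": 0.62, "esx02": 0.58, "esx03": 0.60}
skewed = {"esx01": 0.95, "esx02": 0.30, "esx03": 0.25}

print(looks_balanced(even))    # True
print(looks_balanced(skewed))  # False
```

If the hosts look like the `skewed` example and DRS still isn’t moving anything, that points back to rules, vMotion, or the migration threshold rather than to a balanced cluster.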
Virtual Machine Problems
A Virtual Machine Power On Fails
This is often seen when you try to power on a VM in a DRS cluster and it simply fails to power on, even though it did in the past. You likely need to check the resources in the cluster. If the VM was working in the past, it might now have a reservation that can no longer be satisfied. You might have to adjust the reservation down, or check the resource pool it is in and any parent resource pools above it.
You might also consider reducing reservations on any sibling virtual machines in the cluster. Be very cautious not to give out too many reservations in a DRS cluster. They might seem like a good thing, but if the virtual machine does not use them, you’re leaving resources unused that other VMs might need.
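The idea of checking the VM’s pool and then its parents can be sketched in Python as a walk up the pool chain. This is a simplified model, not vCenter’s admission control: `PoolNode`, `can_reserve`, and the numbers are hypothetical, and it assumes a pool with an expandable reservation may borrow only its shortfall from its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolNode:
    """Hypothetical resource pool with a link to its parent pool."""
    name: str
    reservation_mb: int       # memory reserved for this pool
    used_mb: int              # reservation already handed out to its children
    expandable: bool = False  # may borrow from the parent when it runs short
    parent: Optional["PoolNode"] = None


def can_reserve(pool: Optional[PoolNode], needed_mb: int) -> bool:
    """Walk up the pool chain and report whether the reservation can be met."""
    while pool is not None:
        free = pool.reservation_mb - pool.used_mb
        if free >= needed_mb:
            return True
        if not pool.expandable:
            return False
        # Use what this pool has left and ask the parent for the remainder.
        needed_mb -= max(free, 0)
        pool = pool.parent
    return False


# Made-up example: "Prod" has 500 MB free and can borrow from the cluster root.
cluster_root = PoolNode("Cluster", reservation_mb=8000, used_mb=7000)
prod = PoolNode("Prod", reservation_mb=2000, used_mb=1500, expandable=True, parent=cluster_root)

print(can_reserve(prod, 1000))  # True: 500 MB from Prod, 500 MB from the root
print(can_reserve(prod, 2000))  # False: the root only has 1000 MB left to lend
```

When the second case is what you hit, the options are the ones described above: lower the VM’s reservation, or free up reservation in the pool, its parents, or sibling VMs.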
DRS will move Fault Tolerant VMs only when EVC is enabled, so if you turn on Fault Tolerance, you’ll need to monitor these machines closely. Fault Tolerant VMs are machines that keep running when a host fails, because VMware keeps two identical instances running on separate hosts in the cluster.
Check that your virtual machine is not set to manual migration mode. Check the DRS tab and see if it wants you to apply any recommendations. It’s possible to have 99% of the virtual machines in the cluster set to automatic and one, for instance your vCenter, set to manual. Most people I hear from who do this say it’s because, if they have an issue, they want to know where vCenter is. My recommendation is to create a host group with 3-4 hosts in it and a rule that says vCenter must always reside in this group. Sure, vCenter isn’t always in exactly the same place, but this gives it the ability to migrate as needed.
Hopefully, this post is useful when you run into DRS migration issues. Generally, always try to check vMotion ports first, because I have seen them cause problems more often than I’d like to admit! DRS is, for the most part, a set-it-and-forget-it tool, so it’s odd when you do run into issues. Just know that most of them can be resolved without any downtime. Have you experienced any other issues not mentioned here? Let us know in the comments below!