Troubleshooting Hyper-V Webinar – Q & A Follow Up

A couple of weeks ago, Didier Van Hoye and I conducted a webinar on troubleshooting Hyper-V. We promised to follow up with a blog post on any questions that we were not able to get to during the webinar. Well, here it is! If your question was not answered during the webinar, it should be in the list below. Hopefully this answers all the remaining questions, and don’t be afraid to use the comments section below for follow-up questions as needed!

Also note that Didier and I have split up the questions with fellow Altaro blogger Eric Siron. Collectively, we’ll be answering all of the remaining questions below.

Revisit the Webinar

You can watch the webinar recording below, then proceed to the questions section underneath.
https://www.altaro.com/hyper-v/category/powershell-automation/

The Questions

Q: Hi, we are seeing a number of “CPU Wait Time per Dispatch” errors during our backup window (we use Altaro), have you experienced this at all? 

Q: On our Windows Server 2012 Hyper-V Hosts, I have to keep stopping the Volume Shadow Copy service manually every week or so because the SSD array keeps running out of space. How do I get around this situation?

Q: Where is the best place to get assistance in the best way to configure your VMs on Hyper-V?

This is a very generic question, so it’s difficult to provide a one-size-fits-all answer. If you want to learn some more general information about configuring Hyper-V and the associated VMs, be sure to check out our Hyper-V blog HERE.

Q: Hey guys – common problem: Linux (Debian) within VM on 2012 R2 core: During backup, hard disk within linux is mounted read-only, but after the backup is finished, there are times it remains in read-only mode. Thoughts?

This isn’t something that I’ve run into, and I do run quite a few Debian boxes myself. I would be curious to know which version of Debian and which version of the Linux kernel are in play here. Without that info, I would first make sure that all packages on the system are up to date using APT, and then check that you’re running a recent version of the Linux kernel. It could also be that the particular file system in use has issues during a backup operation, so that is an angle I would investigate further as well.

Q: Can I Backup to a Linux Server?

Q: I’ve Seen Applications failing during Live Migration (connection issues). I’ve never seen this in a VMware Environment. How can I troubleshoot this?

Q: We have a multi-node Hyper-V Cluster that has recently developed an issue with intermittent failure of live migrations.

We noticed this when one of our CAU runs failed because it could not place the hosts into maintenance mode or successfully drain all the roles from them. 

Scenario:

Place any node into maintenance mode/drain roles.

Most VMs will drain and live migrate across onto other nodes. Randomly, one or a few will refuse to move (it always varies in regards to the VM and which node it is moving to or from). The live migration ends with a failure generating event IDs 21502, 22038, 21111, and 21024. If you run the process again (drain roles) it will migrate the VMs, or if you manually live migrate them they will move just fine. Manually live migrating a VM can result in the same intermittent error, but re-running the process will succeed after one or two times, or after just waiting for a couple of minutes.

This occurs on all nodes in the cluster and can occur with seemingly any VM in the private cloud. Ideas?

Didier has a post that may help HERE. This type of behavior is usually caused by a network misconfiguration, so be sure to test all the links between the two endpoints.

Q: Any good sources for PowerShell scripts or examples of the steps to create and set up Hyper-V VHDXs instead of trying to do it via the GUI? I know there are MS Virtual Academy videos and stuff, but any good books or websites that would give scripting help?

Check out Jeff Hicks’s article HERE for help with the VHDX question. Other than that, we have a PowerShell section on our blog HERE. PowerShell.org is a great resource as well.

Q: What are your thoughts on SCVMM? Would you recommend it versus the FCM and Hyper-V Manager?

Q: What is the difference between Hyper-V backup and VM Backup?

If you’re talking about Altaro products, Hyper-V backup is the name of the older version of our flagship backup application that only provided protection to Hyper-V based workloads. Altaro VM Backup is the new renamed version and now supports protecting VMware based workloads as well.

Q: Does Altaro support backing up VMs housed on SMB shares? Are there any special considerations for doing so?

We fully support backing up VMs housed on SMB shares. No special considerations are needed as long as our software can talk with the Hyper-V host. That’s all we require for backing up the target virtual machines.

Q: Can you tell me if a VM Backup can be restored to another Hyper-V host, assuming the two hosts have the same Windows and Hyper-V Versions installed?

Yes! Our software supports restoring to another host in the environment. The only limitation is we cannot restore cross-platform. Meaning, we cannot restore a Hyper-V VM to a VMware ESXi host and vice-versa.

Q: Should Hotfixes be removed after a new patch comes out? For example, a new update rollup?

You’re going to want to look at this on a case by case basis with each hotfix. The documentation should state whether it is superseded by a new patch and what steps will need to be taken.

Q: We’ve had ongoing VM data network connectivity issues with onboard NICs on Hyper-V hosts. The hardware is HP Gen8 DL380s with Broadcom NICs. Is this a common issue with this vendor that might be fixed by using Intel NICs?

Q: Problems with updates and Gen 2 systems with secure boot. Is it worth just not enabling secure boot on Gen 2 VMS?

I would always use secure boot and generation 2 VMs whenever possible. All of the new advanced Hyper-V features are being developed for Gen 2 VMs and the pros far outweigh the cons. Outside of that it’s hard to discuss further without knowing exactly what types of issues you’re having with updates on Gen 2 VMs.

Q: Can you please share with us what do you use to get a VM report from a 2012 R2 failover cluster? A report which shows how much RAM, CPUs, disks the VMs are using? I’m referring to a used resource report. I tried to use ops manager, but it’s quite complicated to setup. 

I would have a look at both of the below resources. Between the two, you should be able to achieve some of what you’re looking for.

SCRIPT from Technet Gallery

PowerShell Based Hyper-V Health Check

Q: How about management of MAC addresses? Should this be done by hyper-v hosts separately in a cluster or should SCVMM do that? We have seen MAC address changes on switches when the VM is Live Migrated to another host in the cluster.

That’s normal. The dynamic MAC address of the VM doesn’t change with a live migration. But the VM lands on a different vSwitch and port, quite possibly attached to different physical switches in different racks and so on. To handle this, a gratuitous ARP is sent to inform the network of that change, so the switches update their MAC address tables and network traffic to and from the VM uses the correct switch port.

Whether you use SCVMM or not comes down to having it or not and what use cases you have. If you have SCVMM you might as well leverage it. Personally I would not buy it to “just” manage the static MAC addresses of the VMs on my Hyper-V cluster unless that cluster is so big it becomes a must-have and you have a need for static MAC addresses. It really depends on the environment, use case, etc.

Q: Would jumbo frames on a storage network improve anything?

In most cases, yes. Normally the storage vendor will even have specific instructions for that. Your mileage may vary, but it pays to test this when you’re not sure. You might be missing out on some extra throughput without them. If they don’t do any good or cause issues, you can disable them again.
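To get a feel for where that extra throughput comes from, here is a minimal back-of-the-envelope sketch. It assumes plain TCP over IPv4 on standard Ethernet with no header options, and it ignores iSCSI or SMB protocol headers, so treat the numbers as a rough upper bound rather than a prediction for your storage network. The function names are made up for this illustration.

```python
# Assumed per-frame overheads: standard Ethernet plus IPv4 and TCP without options.
ETHERNET_OVERHEAD = 38  # preamble/SFD 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40     # IPv4 header 20 + TCP header 20


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire time that carries TCP payload when every frame is full."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frames_per_gigabyte(mtu: int) -> int:
    """Number of full-size frames needed to move 10**9 bytes of payload."""
    payload = mtu - IP_TCP_HEADERS
    return -(-10**9 // payload)  # ceiling division


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(payload_efficiency(mtu), 4), frames_per_gigabyte(mtu))
```

The efficiency gain on the wire is only a few percent. The bigger difference is that roughly six times fewer frames have to be processed for the same amount of data, which is why the benefit depends so much on the hardware in play and why testing is worthwhile.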

Q: Do you recommend switching off VMQ on machines where it is not needed?

If you don’t need it, chances are you won’t configure it, and VMQ that is left enabled but unconfigured will cause issues. So yes, if you don’t need or want it, disable it. If you leave it on, always configure it. Do you need it on 1Gbps? Probably not. Can you get it to work? Yes. Will it make a huge difference? Probably not, but I do have one host with teamed NICs (4*1Gbps) where I use it, as I like to see how it behaves and what results we get in real-life scenarios over time.

Q: Any ideas if Microsoft will implement ODX on their Windows 2012-R2 Storage Server or above?

ODX is fully supported in Windows Server 2016 with storage arrays that support it, but Microsoft did not make ODX a feature of Storage Spaces, if that’s what you’re referring to with “storage server”.

Q: For each host in cluster it has its own mac pool. after LM the mac won’t change. after reboot it will. Then DHCP reservations are non-existing. Static MACs with DHCP or static IPs are the only way then for certain scenarios.

Correct. Do note that when the VM is started or restored from a saved state, it checks whether its MAC address is within the host’s pool range and not yet in use. If either check fails, the address is regenerated. So a reboot does not mean a MAC address change by default, but it does after rebooting a VM that’s been live migrated to another host with a different MAC address pool.

If you have a hard need for DHCP reservations in your use case, you’ll need static MAC addresses. Another possible fix is the use of static IP addresses on the VMs themselves, taking DHCP out of the picture. For software licenses we do this when required: a static MAC address and a static IP address.
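The start-up check described above can be sketched in a few lines. This is only a model of the logic as explained in this answer, not Hyper-V’s actual implementation, and the function names and pool ranges are hypothetical values chosen for illustration.

```python
def parse_mac(mac: str) -> int:
    """Convert a MAC address such as 00-15-5D-0A-01-10 into an integer."""
    hex_digits = mac.replace("-", "").replace(":", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return int(hex_digits, 16)


def keeps_dynamic_mac(mac: str, pool_min: str, pool_max: str, in_use: list) -> bool:
    """True if a starting VM would keep its dynamic MAC address on this host.

    Models the rule from the text: the address must be inside the host's
    pool range and must not already be in use; otherwise it is regenerated.
    """
    value = parse_mac(mac)
    in_pool = parse_mac(pool_min) <= value <= parse_mac(pool_max)
    taken = value in {parse_mac(other) for other in in_use}
    return in_pool and not taken


if __name__ == "__main__":
    # Hypothetical pools for two cluster nodes.
    host_a = ("00-15-5D-0A-01-00", "00-15-5D-0A-01-FF")
    host_b = ("00-15-5D-0A-02-00", "00-15-5D-0A-02-FF")
    vm_mac = "00-15-5D-0A-01-10"  # handed out by host A

    print(keeps_dynamic_mac(vm_mac, *host_a, in_use=[]))  # restart on host A
    print(keeps_dynamic_mac(vm_mac, *host_b, in_use=[]))  # restart on host B
```

A restart on the original host keeps the address, while a restart after a live migration to the second host does not, which is exactly the case that breaks a DHCP reservation.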

Q: After setting Vmq the maxprocessors it’s not displayed correct. For example, I have 2 CPUs with 12 Cores and maxprocessors is displaying 16 maxproc. Any idea why?

When you look at Get-NetAdapterVMQ on a cleanly installed server, you’ll see a default value for “MaxProcessors”. This has nothing to do with the actual number of cores on the host. It’s a number chosen by the vendor, and 16 is actually the maximum number of cores the Intel NIC can use for VMQ. Intel also only allows specific values such as 1, 2, 4 or 8. Some vendors have 8 as the default but allow any value between 1 and 64, like the Mellanox card in this example.

[Image: Mellanox]

It goes without saying you can’t use more cores than are available on the host, no matter what you configure. Be realistic with your settings.
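As a rough mental model, and purely as an assumption for illustration rather than documented driver behavior, the number of cores VMQ can really spread across is capped by the smallest of three values: what you configured, what the NIC supports, and what the host has. The function name below is made up for this sketch.

```python
def effective_vmq_processors(configured: int, nic_maximum: int, host_cores: int) -> int:
    """Upper bound on the cores VMQ can use, given the three limits involved."""
    limit = min(configured, nic_maximum, host_cores)
    if limit < 1:
        raise ValueError("all values must be at least 1")
    return limit


if __name__ == "__main__":
    # The host from the question: 2 CPUs with 12 cores each, NIC default of 16.
    print(effective_vmq_processors(configured=16, nic_maximum=16, host_cores=24))
    # A smaller host with the same NIC default.
    print(effective_vmq_processors(configured=16, nic_maximum=16, host_cores=8))
```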

Q: Every time I create a new VM with a PS script, I set up a static MAC address (the one given by the host after the first boot), so I will not worry if the machine migrates to another node and takes another IP address. That happened already, and that’s why I decided to have static MAC addresses. Is that so bad?

It’s not “bad” if you do it well and for all VMs without exception. Having static MAC addresses that fall within the pool for dynamic ones could lead to conflicts when you also have VMs with dynamic MAC addresses in the environment. If your use case warrants this, it’s not bad, but you have to figure out whether it’s needed. Why go through the trouble when the majority of use cases don’t need it? Why not do this only for the VMs, environments, or use cases that really require it?

You say “takes a different IP address”. Are you referring to DHCP, or did you mean it gets another MAC address? Preventing IP address changes can be handled by configuring a static IP address without the need for a static MAC address. When you want or need to use DHCP reservations for your VMs, then a static MAC address is needed, but perhaps just using static IP addresses is a better solution? The benefit of DHCP reservations is that you have the DHCP server as a central reference for your network configuration.

When you live migrate a VM, the dynamic MAC address of the VM doesn’t change. That can only happen when you restart it. What does happen after a live migration is that an ARP is sent by the vSwitch to inform the network which switch port your VM is now attached to, so communications are not interrupted. It’s with such gratuitous ARPs (which the network has to allow) that we’ve seen some bugs on certain switch configurations and firmware. Most of those bugs were not specific to Hyper-V networking, but live migrations made them more likely to show up.

Q: Is there a technology like ODX for SMB 3 and SOFS ?

ODX is a capability of the storage array. When you build a SOFS with a storage array that supports ODX, you get ODX. When your array doesn’t, there’s no ODX. Storage Spaces doesn’t support ODX, so if you use that to build a SOFS you don’t have it.

Q: Do you need to deploy DCB settings on a switch and/or on a Hyper-V host where the switch is exclusive to storage and the only machines it serves are Hyper-V hosts and a Scale-Out File Server? All machines have iWARP (Chelsio) RDMA NICs.

With iWARP you can get away with not leveraging, and as such not configuring, DCB. This is because iWARP offloading leverages TCP/IP, which can handle packet loss by itself, so dropped packets will not cause issues. Do note that it can and will benefit from DCB under heavy load. When you use RoCE cards, DCB is mandatory.

Q: VM on WS 2012 R2 host: “stopping – critical” and status “service” what is this? I cannot kill the process, even not with sysinternal tool. I can only restart the host and then the story repeats. The original problem (after live migration) was that the VM was running but with no connectivity although it was connected to a virtual switch (working)

“Stopping-Critical” means that the Virtual Machine Management Service has a record for the virtual machine’s existence but cannot locate the matching files.

If the virtual machine’s files are intact:

  1. Relocate the files to keep them safe.
  2. Delete the virtual machine from Failover Cluster Manager and then from Hyper-V Manager.
  3. Import the virtual machine.
  4. Add it back into Failover Cluster Manager.

If the files are not intact, deleting it from both of the tools (step 2) and then restoring it is your only option.

Q: Please, how can I resolve this error: The IO operation at logical block address 85d for Disk 1 (PDO name: \Device\MPIODisk0) was retried.

This error message indicates that the storage subsystem is showing unreliable behavior. If it is remote storage (for example, iSCSI or fiber channel), then it may be a transient problem that you can ignore. If it is local storage, this is typical warning behavior for a hard drive that is beginning to fail.

Q: What would be the cause when on a 2012 R2 Hyper-V host the NIC teaming mgmt NIC goes to an unidentified network after each reboot? We have deleted and recreated the NIC team and have set the NLA service to delayed start. Are there other things to check on?

There are a number of causes for this, all related to negotiation with the physical network. For a team, it is typically related to the build-up of the team between the physical switch and your NICs. Read up on your physical switch’s settings and make sure that it is using its fastest trunk discovery method. Doing so will usually involve a modification to spanning-tree protocol.

Q: Hello, Thank you for this great knowledge. What can stop SOME of the virtual machines from communicating with each other? This is on a cluster, where all validation tests have passed.

Unfortunately, there isn’t enough information about the problem to make anything more than vague generalizations. Common causes include:

  • VLANs not properly allowed on switches.
  • Virtual machines in the wrong VLANs.
  • General TCP/IP misconfigurations.
  • Overly restrictive physical switch security settings.
  • Misconfigured firewalls.

There are no network settings in Hyper-V specifically designed to prevent virtual machines from being able to establish network connectivity.

Q: What about iscsi with 1GB nics (broadcom especially) and disabling offloading?

With modern computing hardware, it is unlikely that any offloading technology on a 1GbE card will noticeably reduce the load on your physical CPUs. Some Broadcom cards include specialized hardware that allows them to operate as iSCSI HBAs; this technology is fine to use, but the card can’t be used for anything else. All other offloading technologies are not likely to produce any positive benefits, although most are harmless.

While it is not an offloading technology and is technically outside the scope of this question, it is recommended to disable VMQ on all 1GbE cards. VMQ is implemented improperly on most gigabit cards and, even if it worked correctly, would not be likely to improve performance.

Q: We have tried to do a P2V for 2 VMs and it happens successfully, but when we start the VM it is not able to boot.

The question does not include enough information about the problem to provide any specific guidance. There are many reasons that a virtual machine might not boot, and even many ways for a boot to fail. Some general guidance:

  • Physical to virtual conversions should be an option of last resort. Build a new virtual machine and migrate the application if it is at all possible.
  • Use a different P2V tool. For instance, if you were using Disk2VHD, try Microsoft Virtual Machine Converter 3.0.
  • Preinstall the current Hyper-V Integration Services in the physical environment before converting.
  • Do not try to convert a BIOS installation into a Generation 2 virtual machine and do not try to convert a UEFI installation into a Generation 1 virtual machine.
  • Try an operating system repair.

Q: Time Synchronization service on guests – ON or OFF? There are some reputable articles suggesting to leave it enabled and make sure the domain hierarchy synchronization is set inside guests, which btw didnt work for me.

Leaving time synchronization on along with domain hierarchy synchronization effectively means that the virtual machine will always use the Hyper-V Time Synchronization service unless it happens to be broken. This is the preferred configuration, but it requires you to make certain that the Hyper-V host is receiving time from a valid source. Because domain-joined hosts normally get their time from the domain, all virtualized domain controllers must have the Hyper-V Time Synchronization service disabled so that they never take time from a host that depends on them. The domain controller that hosts the PDC Emulator FSMO role must be set to retrieve time from a known valid source. We have a more thorough article available on this subject: https://www.altaro.com/hyper-v/hyper-v-time-synchronization/.

Q: How do you resolve time sync in a virtualized environment when DC’s are virtual and takes time from Hyper-V host, but Hyper-V host is joined to a domain which takes time from the virtual DC?  Thank you.

Virtualized domain controllers should have the Hyper-V Time Synchronization service disabled and pull from the standard domain hierarchy. The domain controller that holds the PDC Emulator FSMO role, whether physical or virtual, should pull its time from a reliable source. More details are available at our article on the subject: https://www.altaro.com/hyper-v/hyper-v-time-synchronization/.
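The recommendations above boil down to a small decision table. The sketch below only encodes what this article recommends for guests; the class and function names are invented for illustration and are not part of any Hyper-V or Windows API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestTimeConfig:
    hyperv_time_sync_enabled: bool  # the Hyper-V Time Synchronization service
    time_source: str                # where the guest should get its time


def recommended_guest_time_config(is_domain_controller: bool,
                                  holds_pdc_emulator: bool = False) -> GuestTimeConfig:
    """Return the time configuration this article recommends for a guest."""
    if holds_pdc_emulator and not is_domain_controller:
        raise ValueError("only a domain controller can hold the PDC Emulator role")
    if holds_pdc_emulator:
        # Top of the domain time hierarchy: needs a reliable outside source.
        return GuestTimeConfig(False, "reliable external time source")
    if is_domain_controller:
        # Other virtualized DCs follow the standard domain hierarchy.
        return GuestTimeConfig(False, "domain hierarchy")
    # Ordinary guests keep the Hyper-V service on.
    return GuestTimeConfig(True, "Hyper-V host, with domain hierarchy configured")


if __name__ == "__main__":
    print(recommended_guest_time_config(is_domain_controller=False))
    print(recommended_guest_time_config(is_domain_controller=True))
    print(recommended_guest_time_config(is_domain_controller=True, holds_pdc_emulator=True))
```

Laid out this way, the loop from the question disappears: the hosts follow the domain, the domain follows the PDC Emulator, and the PDC Emulator follows an outside source.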

Q: Have you seen a cluster losing the boot on SAN? Our provider and Microsoft said at the time that it shouldn’t happen… and it did.

It’s difficult to tell from this question whether it is about the management operating system running in a boot-from-SAN configuration or Hyper-V guests booting from a SAN LUN, which is important because the troubleshooting steps are different. I don’t know of any particular reason that makes a boot-from-SAN failure completely unexpected. The hardware-to-software hand-off of a boot operation has several expectations that might sometimes not be met by even a correctly-configured boot-from-SAN system.

If both Microsoft and your provider have looked at your particular configuration and cannot find a problem, we probably can’t improve on what they said. We can make some general recommendations.

If the problem is with virtual machines starting from the SAN:

  • Make sure that fiber channel zoning masks are set as tightly as possible. Loose masks can cause timeouts that break Live Migration and boot.
  • Make sure that iSCSI networks are properly isolated. An iSCSI system that runs might not boot because the Windows operating system will tolerate read/write delays that would cause the boot process to fail.
  • Ensure that any security settings, especially with iSCSI, are not too restrictive.
  • Check all of the system, storage, and clustering events on the nodes to verify that there are no disconnect issues that might indicate fabric problems.

Most of the same problems for guest boot-from-SAN would apply to host-boot-from-SAN, although you’ll also need to verify that your stub boot system is configured correctly and that nothing in the back-end configuration has changed since that system was built.

Wrap-Up

That wraps up all of our questions. Hopefully that answers all of the pending questions left over from our troubleshooting webinar. If you have follow-up questions on a response, or you have a question that you don’t see answered above, be sure to reach out using the comments section below!

 

 
