Why Your Hyper-V PowerShell Commands Don’t Work (and how to fix them)

I occasionally receive questions about Hyper-V-related PowerShell cmdlets not working as expected. Sometimes these problems arise with the module that Microsoft provides; other times, they manifest with third-party tools. Even my own tools show these symptoms. Most GUI tools are developed to avoid the problems that plague the command line, but the solutions aren’t always perfect.

The WMI Foundation

All tools, graphical or command-line, eventually work their way back to the only external interface that Hyper-V provides: its WMI/CIM provider. CIM stands for “Common Information Model”. The Distributed Management Task Force (DMTF) maintains the CIM standard. CIM defines a number of interfaces pertaining to management. Anyone can write CIM-conforming modules to work with their systems. These modules allow users, applications, and services to retrieve information and/or send commands to the managed system. By leveraging CIM, software and hardware manufacturers can provide APIs and controls with predictable, standardized behavior.

Traditionally, Microsoft has implemented CIM via Windows Management Instrumentation (WMI). Many WMI instructions involved VBS or WMIC. As PowerShell gained popularity, WMI also gained popularity due to the relative ease of Get-WmiObject. Depending on where you look in Microsoft’s vast documentation, you might see pushes away from the Microsoft-specific WMI implementation toward the more standard CIM corollaries. Get-CimInstance provides something of an analog to Get-WmiObject, but they are not interchangeable.
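To illustrate the difference, both of the following retrieve virtual machine objects from the Hyper-V provider that I describe below (treat this as a sketch; the namespace applies to 2012 and later):

    # Legacy WMI style; uses DCOM for remote connections
    Get-WmiObject -Namespace root\virtualization\v2 -Class Msvm_ComputerSystem

    # CIM style; returns inert CIM objects and uses WinRM for remote connections
    Get-CimInstance -Namespace root\virtualization\v2 -ClassName Msvm_ComputerSystem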

For any of this to ever make any sense, you need to understand one thing: anyone can write a CIM/WMI provider. The object definitions and syntax of a provider all descend from the common standard, but they do nothing more than establish the way an interface should look. The provider’s developer determines how it all functions behind the scenes.

Why Hyper-V PowerShell Cmdlets May Not Work

Beyond minor things like incorrect syntax and environmental things like failed hardware, two common reasons prevent these tools from functioning as expected.

The Hyper-V Security Model

I told you all that about WMI so that this part would be easier to follow. The developers behind the Hyper-V WMI provider decide how it will react to any given WMI/CIM command that it receives. Sometimes, it chooses to have no reaction at all.

Before I go too far, I want to make it clear that no documentation exists for the security model in Hyper-V’s WMI provider. I ran into some issues with WMI commands not working the way that I expected. I opened a case with Microsoft, and it wound up going all the way to the developers. The answer that came back pointed to the internal security coding of the module. In other words, I was experiencing a side effect of designed behavior. So, I asked if they would give me the documentation on that — basically, anything on what caused that behavior. I was told that it doesn’t exist. They obviously don’t have any externally-facing documentation, but they don’t have anything internal, either. So, everything that you’re going to see in this article originates from experienced (and repeatable) behavior. No insider secrets or pilfered knowledge were used in the creation of this material.

Seeing Effects of the Hyper-V Security Model in Action

Think about any “Get” PowerShell cmdlet. What happens when you run a “Get” against objects that don’t exist? For example, what happens when I run Get-Job when no jobs are present?

[Screenshot: Get-Job returning no output when no jobs exist]

Nothing! That’s what happens. You get nothing. So, you learn to interpret “I got nothing” to mean “no objects of that type exist”.

So, if I run Get-VM and get nothing (2012/R2):

[Screenshot: Get-VM returning no output]

That means that the host has no virtual machines, right?

But wait:

[Screenshot: two PowerShell sessions side by side; one lists virtual machines, the other shows nothing]

What happened? A surprise Live Migration?

Look at the title bars carefully. The session on the left was started normally. The session on the right was started by using Run as administrator.

The PowerShell behavior has changed in 2016:

[Screenshot: Get-VM in an unelevated session on 2016 showing an error message]

The PowerShell cmdlets that I tried now show an appropriate error message. However, only the PowerShell module has been changed. The WMI provider behaves as it always has:

[Screenshot: WMI query and elevated Get-VM output on 2016]

To clarify that messy output, I ran gwmi -Namespace root\virtualization\v2 -Class Msvm_ComputerSystem -Filter 'Caption="Virtual Machine"' as a non-privileged user and the system gave no output. That window overlaps another window that contains the output from Get-VM in an elevated session.

Understanding the Effects of the Hyper-V Security Model

When we don’t have permissions to do something, we expect that the system will alert us. If we try to open a file, we get a helpful error message explaining why the system can’t allow it. We’ve all had that experience enough times that we’ve been trained to expect a red flag. The Hyper-V WMI provider does not exhibit that expected behavior. I have never attempted to program a WMI provider myself, so I don’t want to pass any judgment. I noticed that the MSCluster namespace acts the same way, so it may be something inherent to CIM/WMI that the provider authors have no control over.

In order for a WMI query to work against Hyper-V’s provider, you must be running with administrative privileges. Confusingly, “being a member of the Administrators group” and “running with administrative privileges” are not always the same thing. When working with the Hyper-V provider on the local system, you must always ensure that you run with elevated privileges (Run as administrator) — even if you log on with an administrative account. Remote processes don’t have that problem.
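If you want to verify elevation before digging further, this well-known check returns True only in an elevated session:

    # Test whether the current process token includes the Administrators role
    $identity  = [Security.Principal.WindowsIdentity]::GetCurrent()
    $principal = New-Object Security.Principal.WindowsPrincipal($identity)
    $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)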

The administrative requirement presents another stumbling block: you cannot create a permanent WMI event watcher for anything in the Hyper-V provider. Permanent WMI registration operates anonymously; the Hyper-V provider requires confirmed administrative privileges. As with everything else, no errors are thrown. Permanent WMI watchers simply do not function.

The takeaway: when you unexpectedly get no output from a Hyper-V-related PowerShell command, you most likely do not have sufficient permissions. Because the behavior bubbles up from the bottom-most layer (CIM/WMI), the problem can manifest in any tool.

The Struggle for Scripters and Application Developers

People sometimes report that my tools don’t work. For example, I’ve been told that my KVP processing stack doesn’t do anything. Of course, the tool works perfectly well — as long as it has the necessary privileges. So, why didn’t I write that, and all of my other scripts, to check their privilege? Because it’s really hard, that’s why.

With a bit of searching, you’ll discover that I could just insert #requires -RunAsAdministrator at the top of all my scripts. Problem solved, right? Well, no. Sure, it would “fix” the problem when you run the script locally. But, sometimes you’ll run the script remotely. What happens if:

  • … you run the script with an account that has administrative privileges on the target host but not on the local system?
  • … you run the script with an account that has local administrative privileges but only user privileges on the target host?

The answer to both: the actual outcome will not match your desired outcome.

I would need to write a solution that:

  • Checks to see if you’re running locally (harder than you might think!)
  • Checks that you’re a member of the local administrators
  • If you’re running locally, checks if your process token has administrative privileges

That’s not too tough, right? No, it’s not awful. Unfortunately, that’s not the end of it. What if you’re running locally, but invoke PowerShell Remoting with -ComputerName or Enter-PSSession or Invoke-Command? Then the entire dynamic changes yet again, because you’re not exactly remote but you’re not exactly local, either.

I’ve only attempted to fully solve this problem one time. My advanced VM settings editor includes layers of checks to try to detect all of these conditions. I spent quite a bit of time devising what I hoped would be a foolproof way to ensure that my application would warn you of insufficient privileges. I still get messages telling me that it doesn’t show any virtual machines.

I get better mileage by asking you to run my tools properly.

How to Handle the Hyper-V WMI Provider’s Security

Simply put, always ensure that you are running with the necessary privileges. If you are working locally, open PowerShell with elevated permissions:

[Screenshot: launching PowerShell with Run as administrator]

If running remotely, always ensure that the account that you use has the necessary permissions. If your current local administrator account does not have the necessary permissions on the target system, invoke PowerShell (or whatever tool you’re using) by [Shift]+right-clicking the icon and selecting Run as different user:

[Screenshot: launching PowerShell with Run as different user]

What About the “Hyper-V Administrators” Group?

Honestly, I do not deal with this group often. I don’t understand why anyone would be a Hyper-V Administrator but not a host administrator. I believe that a Hyper-V host should not perform any other function. Trying to distinguish between the two administrative levels gives off a strong “bad plan” odor.

That said, I’ve seen more than a few reports that membership in Hyper-V Administrators does not work as expected. I have not tested it extensively, but my experiences corroborate those reports.

The Provider Might Not Be Present

All this talk about WMI mostly covers instances when you have little or no output. What happens when you have permissions, yet the system throws completely unexpected errors? Well, many things could cause that. I can’t make this article into a comprehensive troubleshooting guide, unfortunately. However, you can be certain of one thing: you cannot tell Hyper-V to carry out an action if Hyper-V is not running!

Let’s start with an obvious example. I ran Get-VM on a Windows 10 system without Hyper-V:

[Screenshot: Get-VM error on a Windows 10 system without Hyper-V]

Nice, clear error, right? 2012 R2/Win 8.1 have a slightly different message.

Things change a bit when using the VHD cmdlets. I don’t have any current screenshots to show you because the behavior changed somewhere along the way… perhaps with Update 1 for Windows Server 2012 R2. Windows 7/Server 2008 R2 and later include a native driver for mounting and reading/writing VHD files. Windows 8/Server 2012 and later include a native driver for mounting and reading/writing VHDX files. However, only Hyper-V can process any of the VHD cmdlets. Get-VHD, New-VHD, Optimize-VHD, Resize-VHD, and Set-VHD require a functioning installation of Hyper-V. Just installing the Hyper-V PowerShell module won’t do it.

Currently, all of these cmdlets will show the same or a similar message to the one above. However, older versions of the cmdlets give a very cryptic message that you can’t do much with.

How to Handle a Missing Provider

This seems straightforward enough: only run cmdlets from the Hyper-V module against a system with a functioning installation of Hyper-V. You can determine which functions the module owns with:
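    # One way to list everything that the Hyper-V module exposes
    Get-Command -Module Hyper-V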

When running them from a system that doesn’t have Hyper-V installed, use the ComputerName parameter.
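For example (the host name and path here are hypothetical):

    # The query runs on SVHV1, which has a functioning Hyper-V installation
    Get-VHD -Path 'C:\VMs\demo.vhdx' -ComputerName SVHV1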

Further Troubleshooting

With this article, I wanted to knock out two very simple reasons that Hyper-V PowerShell cmdlets (and some other tools) might not work. Of course, I realize that any given cmdlet might error for a wide variety of reasons. I am currently only addressing issues that block all Hyper-V cmdlets from running.

For troubleshooting a failure of a specific cmdlet, make sure to pay careful attention to the error message. They’re not always perfect, but they do usually point you toward a solution. Sometimes they display explicit text messages. Sometimes they include the hexadecimal error code. If they’re not clear enough to understand immediately, you can use these things in Internet searches to guide you toward an answer. You must read the error, though. Far too many times, I see “administrators” go to a forum and explain what they tried to do, but then end with, “I got an error” or “it didn’t work”. If the error message had no value, the authors wouldn’t have bothered to write it. Use it.


How to Avoid NTFS Permissions Problems During Hyper-V Live Migration

The title of this article describes the symptoms fairly well. You Live Migrate a virtual machine that’s backed by SMB storage, and the permissions shift in a way that prevents the virtual machine from being used. You’d have to be fairly sharp-eyed to notice before it causes problems, though. I didn’t catch on until virtual machines started failing because the hosts didn’t have sufficient permissions to start them. I don’t have a true fix, meaning that I can’t prevent the permissions from changing. However, I can show you how to eliminate the problem.

The root problem also affects local and Cluster Shared Volume locations, although the default permissions generally prevent blocking problems from manifesting.

I have experienced the problem on both 2012 R2 and 2016. The Hyper-V host causes the problem, so the operating system running on the SMB system doesn’t matter.

Symptom of Broken NTFS Permissions for Hyper-V

I discovered the problem when one of my nodes went down for maintenance and all of its virtual machines crashed. It only affected my test cluster, which I don’t keep a close eye on. That means that I can’t tell you when this became a problem. I do know that this behavior is fairly new (sometime in late 2016 or 1Q/2Q 2017).

Symptom 1: Cluster event logs will fill up with the generic access denied (0x80070005) message.

For example, Hyper-V-VMMS event ID 20100 and Hyper-V-High-Availability event ID 21502 both carry that code.

You will also have several of the more generic FailoverClustering IDs 1069, 1205, and 1254 and Hyper-V-High-Availability IDs 21102 and 21111 as the cluster service desperately tries to sort out the problem.

Symptom 2: Virtual machines disappear from Hyper-V Manager on all nodes while still appearing in Failover Cluster Manager.

Because the cluster can’t register the virtual machine ID on the target Hyper-V host, you won’t see it in Hyper-V Manager. The cluster still knows about it though. Remember that, even if they’re named the same, the objects that you see as Roles in Failover Cluster Manager are different objects than what you see in Hyper-V Manager. Don’t panic! As long as the cluster still knows about the objects, it can still attempt to register them once you’ve addressed the underlying problem.

What Happened?

I’m guessing that “helper” behavior gone awry has caused unintentional problems. When you Live Migrate a virtual machine, Hyper-V tries to “fix” permissions, even when they’re not broken. It adjusts the NTFS permissions for the host.

The GUI ACL looks like this:

[Screenshot: folder Security tab showing reduced permissions for the host]

The permission level that I set, and that I counsel everyone to set, is Full Control. As you can see, it’s been reduced. We click Advanced as the first investigative step and see:

[Screenshot: Advanced Security Settings showing the Special access entry]

The Access column still only tells us Special, but we can see that inheritance did not cause this. Whatever is changing the permissions is making the changes directly on this folder. This is the same folder that’s shared via SMB. Double-clicking the entry and then clicking the Show advanced permissions link at the right shows us the new permission set:

[Screenshot: the advanced permissions list after a Live Migration]

When I first found the permissions in this condition, I thought, “Huh, I wonder why/when I did that?” Then I set Full Control again. After the very next Live Migration, these permissions were back! Once I discovered that behavior, I tested other Live Migration types, such as using Cluster Shared Volumes. It does occur on those as well. However, the default permissions on CSVs have other entries that ensure that this particular issue does not prevent virtual machines from functioning. VMs on SMB shares don’t automatically have that kind of luck — but they can benefit from a similar configuration.

Permanently Correcting Live Migration NTFS Permission Problems

I don’t know why Hyper-V selects these particular permissions. I don’t know precisely which of those unchecked boxes cause these problems.

I do know how to prevent the problem from adversely affecting your virtual machines. In fact, even in the absence of the problem, I would label this as a “best practice” because it reduces overall administrative effort.

  1. In Active Directory (I’ll use Active Directory Users and Computers; you could also use PowerShell), create a new security group. For my test environment, I call mine “Hyper-V Hosts”. In a larger domain, you’ll likely want more granular groups.
    [Screenshot: creating the new group in Active Directory Users and Computers]
  2. Select all of the Hyper-V hosts that you want in that new group. Right-click them and click Add to group.
    [Screenshot: selecting the hosts and clicking Add to group]
  3. In the Select Groups dialog, enter or browse to the group that you just created. Click OK to add them.
    [Screenshot: the Select Groups dialog]
  4. Restart the Workstation service on each of the Hyper-V hosts.
  5. On the target SMB system, add the new group to the ACL of the folder at the root of the share. I personally recommend that you change both SMB and NTFS permissions, although the problem only manifests on NTFS. Grant the group Full Control (a scripted example follows this list).
    [Screenshot: the folder ACL with the new group granted Full Control]
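If you prefer to script step 5, something along these lines works; the path, share, and group names are stand-ins for your own:

    # NTFS: grant the group Full Control, inherited by all child files and folders
    icacls 'D:\Shares\VirtualMachines' /grant 'DOMAIN\Hyper-V Hosts:(OI)(CI)F'

    # SMB: grant the same group Full Control on the share itself
    Grant-SmbShareAccess -Name 'VirtualMachines' -AccountName 'DOMAIN\Hyper-V Hosts' -AccessRight Full -Force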

You will now be able to Live Migrate and start virtual machines from this SMB share. If your virtual machines disappeared from Hyper-V Manager, use Failover Cluster Manager to start and/or Live Migrate them. It will take care of any missing registrations.

Why Does this Work?

Through group permissions, the same object can effectively appear multiple times in a single NTFS ACL (access control list). When that happens, NTFS grants the least restrictive set of permissions. So, while SVHV1’s specific ACE (access control entry) excludes Write attributes, the Hyper-V Hosts group’s ACE includes it. When NTFS accumulates all possible permissions that could apply to SVHV1, it will find an Allow entry for the Write attributes property (and the others not set on the ACE specific to SVHV1). If it found a Deny anywhere, that would override any conflicting Allow. However, there are no Deny settings, so that single Allow wins.

Do remember that when a computer accesses an NTFS folder through an SMB share, the permissions on that share must be at least as permissive as NTFS in order for access to work as expected. So, if the SMB permission only allows Read, then it won’t matter that the NTFS allows Full Control. When NTFS permissions and SMB permissions must be evaluated together, the most restrictive cumulative effect applies. I’m mostly telling you this for completeness; Hyper-V will not modify SMB permissions. If they worked before, they’ll continue to work. However, I do recommend that you add the same group with Full Control permissions to the share.

As I mentioned before, I recommend that you adopt the group membership tactic whether you need it or not. When you commission new Hyper-V hosts, you’ll only need to add them to the appropriate groups for SMB access to work automatically. When you decommission servers, you won’t need to go around cleaning up broken SID ACEs.

Extending Hyper-V’s Guest Grace Period on Host Shutdown

When a Hyper-V host shuts down, how long does it wait for its virtual machines to shut down or save? Did you say five minutes? That’s what I said too! Well, we’re wrong.

Where that Five Minute Answer Comes From

When you tell Hyper-V Manager to shut down a guest, it waits five minutes before giving up. I’m not sure exactly where I heard that first, but I know it was from an authoritative source, like one of the Hyper-V program managers. I, and no doubt others, then extrapolated that to mean that five minutes is the timeout period for virtual machine shut down.

However, there’s nothing to support that. If you look at the API, you’ll find that the Hyper-V management service ignores any supplied timeout value. That means that any program that calls it can wait as long as it wants, or it can just issue the command and let it run forever. Hyper-V Manager waits five minutes because its developers coded it to do that. The host shut down process does not use Hyper-V Manager, though.

I didn’t put all that together until I saw some reports on the forums about virtual machines suffering hard shut downs during host shut down. Responders tried to help with the bit about five minutes. That has never seemed to help anyone.

Changing the VM Shutdown Timeout Via Regedit

On a Hyper-V host (tested on 2012 R2 and 2016), open up Regedit.exe (works on Hyper-V Server and Windows Server Core, too). Navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization. Find the ShutdownTimeout value.

[Screenshot: the ShutdownTimeout value in Regedit]
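If you’d rather not click through Regedit, you can read and change the value from an elevated PowerShell prompt. The 600 below is only an example; see the Testing section for the discussion of units:

    $path = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization'

    # Read the current timeout
    Get-ItemProperty -Path $path -Name ShutdownTimeout

    # Set it to 10 minutes, assuming the value really is expressed in seconds
    Set-ItemProperty -Path $path -Name ShutdownTimeout -Value 600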

Changing the VM Shutdown Timeout Via Group Policy

I highly recommend that you spend some time tinkering with a test host or a single guinea pig production host before pushing this out to all of your hosts.

Follow these steps to modify the registry key in Group Policy:

  1. On a system with the necessary console installed, open Group Policy Management Console.
  2. Unless your Hyper-V hosts are already processing a great many policies, I recommend creating a new policy. Right-click Group Policy Objects and select New.
    [Screenshot: the New Group Policy Object menu]
  3. Give your new GPO a descriptive name and click OK. If you’ll be using different values for different host categories, then remember to use a unique name.
    [Screenshot: naming the new GPO]
  4. Right-click on your new GPO and click Edit.
    [Screenshot: the Edit option for the new GPO]
  5. Drill down to Computer Configuration, Preferences, Windows Settings, Registry. Right-click Registry, hover over New, then click Registry Item.
    [Screenshot: the New Registry Item menu]
  6. Use these settings in the New Registry Properties dialog:
    1. Action: Update
    2. Hive: HKEY_LOCAL_MACHINE
    3. Key path: SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization
    4. Value name (Default unchecked): ShutdownTimeout
    5. Value type: REG_DWORD
    6. Value data: <some reasonable number of seconds>
    7. Base: Decimal
      [Screenshot: the completed New Registry Properties dialog]
  7. Make sure that your new setting appears in the registry key list. Close the Group Policy Management Editor window to return to the main console.
  8. Under your domain, drill down to the OU that contains your Hyper-V host(s). Right-click on it and click Link an Existing GPO. If you want to use multiple settings, you’ll need to use separate GPOs or implement WMI Filtering.
    [Screenshot: the Link an Existing GPO menu]
  9. Select the GPO that you just created:
    [Screenshot: selecting the new GPO]
  10. Now, just wait for your next GPO push. Alternatively, log on to one or more hosts in the affected OU and run gpupdate /force in an elevated command/PowerShell prompt.

Testing

I looked high and low, and couldn’t find any documentation on this key. I don’t have the resources to do a great deal of testing either. Still, I think it’s safe to say that this number represents the timeout in seconds: the default data of 120 can’t plausibly mean minutes (hosts clearly do not wait two hours to shut down), and a 120 millisecond timeout would just be ridiculous.

What I can’t be sure of is whether or not this applies to each virtual machine individually or in aggregate. However, given how long modern physical systems need to reboot, I’m also having a hard time thinking of a good reason for such a short timeout. If you’ve got a VM that takes a long time to shut down (Exchange guests, we’re all looking at you), then feel free to kick this number up.

I don’t know if you need to do anything else (undocumented, remember?), but I chose to restart VMMS (Virtual Machine Management Service). When I did that, my Regedit screen flickered as VMMS was starting back up. That makes sense, as many of the settings found in this branch belong to VMMS, but it doesn’t prove anything about this particular key. So, I used procmon.exe to watch a VMMS service startup. It doesn’t care at all about that key, so restarting VMMS doesn’t do anything in this case.

For testing, I set my timeout to 10 minutes (decimal 600) and it shut down fairly quickly. That means that setting a number higher than you need won’t needlessly extend the shut down process.

Unfortunately, I’ve never been negatively affected by this problem, which makes true testing difficult. So, I’m asking you, dear reader, to try this out and report back on your findings.

Problems This Hopefully Fixes

I have a few goals for this setting, and they double as things to look for in your own tests.

1) Service Shutdown

Primarily, I want to see if those systems that need extra time to shut down receive that extra time. By default, Windows Server will only wait 50 seconds for any given service to shut down (HKLM\SYSTEM\CurrentControlSet\Control\WaitToKillServiceTimeout; expressed in milliseconds). Some applications might adjust this number upward. We do know that Windows applies this setting to each service individually. We don’t know how long it takes Windows to issue the shut down command to each service. I assume that all of them receive the shutdown command in very tight sequence. If we fudge it, then maybe a total of one minute for shutdown.

However, we’ve all seen services hang on shutdown for much longer than that. Some adjust that registry key because they know that they’ll need more time. I do know that the API allows a service to ask for more time when it’s told to shut down; I’m not sure if it can do that during system shut down as well, but I know I’ve seen services hang a host for a long time on shut down.

I also don’t know how Linux handles a long shut down cycle. I do have a couple of non-HA Ubuntu Server Linux VMs hosted on the local C: drive that periodically suffer catastrophic damage to their boot volume and need to be restored from backup (or, more probably, require a repair that I can’t figure out). I’m starting to suspect that they need a hair longer to shut down than they’re being given.

These are my primary interests with this setting.

2) Application Shutdown

We all know that no one should stay logged on to servers. That’s mostly because every logged-in user’s password sits in LSASS’s memory space in clear text. It’s also because logged-in sessions use memory that could be put to better use. Unfortunately, we also have software vendors stuck in the 90s that don’t care about their customers’ security or resources and develop faux-service applications that run in Session 0 or a persistent desktop session. These are not subject to the service shutdown timeout.

When you manually shut down, any application that doesn’t respond to the shut down is given a timeout, and then a logged-in user is prompted to deal with it. For almost all shut down APIs, an automated shut down command can issue some variant of a “force” parameter that simply kills these problem children and continues the shut down.

In my testing, a host does not run down the shutdown timer waiting on these applications. It’s probable that the very short WaitToKillAppTimeout inside the guest takes precedence.

Let me know about your experiences.

3) Cumulative Effects

We have some numbers on individual timeouts, but the aggregates might change the equations. For example, if a guest’s service shut down is set to 50 seconds and the host’s shut down grace period is 120 seconds, that should work out, right? Well, I can think of one place where it wouldn’t. A guest takes 50 seconds to kill its services. Then, it starts applying four months’ worth of Windows Update patches. It’s got 70 seconds to get it done. Are you laughing? You’re laughing. Or grimacing. Either response is appropriate.

Basically, we’re trying to figure out how multiple VMs with multiple applications and multiple services and multiple OS shut down tasks figure within the host’s grace shut down periods. Aggregate data from multiple sources (that’s all of you) can help us to come up with some guidelines.

Let’s Hear It

We’ve now entered the audience participation section of this article. If this setting clears up your shut down problem, I most definitely want to hear from you. If it doesn’t fix your problem but seems to make some sort of difference, that’s good information as well. Whatever you can share with us, please do so.

 

Troubleshooting After a Hyper-V Cluster Node Replacement

It’s simple to add and remove nodes in a Microsoft Failover Cluster. That ease can hide a number of problems that can spell doom for your virtual machines, though. I’ve put together a quick guide/checklist for you to check if your last node addition/replacement didn’t go as smoothly as expected.

1. Did You Validate Your Cluster?

Cluster validation can be annoying, and it can suck up time that you don’t feel you have to spare, but it must be done.

[Screenshot: the cluster validation wizard]

Cluster validation runs far more tests in far less time than you could ever hope to do on your own. It’s not perfect by any means, and it can seem obnoxiously overbearing in some regards, but it can also point out issues before they break your cluster.

Cluster validation can be performed while the cluster is in production, with one exception: storage. None of the storage tests can be safely run against online virtual machines unless they’re on SMB storage. If you can move them to a suitable file share, do so. If you can’t, then schedule a time when you can save all of the virtual machines for a few minutes. If you can’t do that either, then run validation without the storage tests. It’s better to have a partial test than none at all.

2. Do You Need to Clear the Node?

Sometimes, you can’t even re-add a node. You get a message that the computer is already in a cluster, and the wizard blocks you from proceeding! From any node that’s still in the cluster, run this from an elevated PowerShell prompt:
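    # SVHV2 stands in for the name of the node that you're trying to re-add
    Clear-ClusterNode -Name SVHV2 -Force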

I’ve had to take that step every time, regardless of what else I did in advance.

3. Did You Fix DNS?

Hyper-V cluster nodes typically use at least two IP addresses: Management and Live Migration. You might well be using at least one other for cluster communications. If you’re connected via iSCSI, there will be at least one more IP address there. Many of those IPs may reside on isolated IP networks that don’t utilize a router. That makes those IPs unreachable from other IP networks. If those IPs are registered in DNS, then it’s only a matter of time before they cause problems.

I typically design a custom script to assign IPs to a host and ensure that it only registers the management address in DNS. You can make those changes manually, if you prefer. Just remember to get it done.
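As a rough sketch of the DNS piece (the adapter aliases are hypothetical and will differ in your environment):

    # Register only the management adapter in DNS
    Set-DnsClient -InterfaceAlias 'Management' -RegisterThisConnectionsAddress $true
    Set-DnsClient -InterfaceAlias 'LiveMigration', 'Cluster', 'iSCSI' -RegisterThisConnectionsAddress $false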

4. Did You Match All Windows Roles and Features?

On a recent rebuild, I moved quickly due to a great deal of urgency. All seemed well, but then I tested my first Live Migration to the rebuilt node, and it crashed the VM! The error code was: 0x80070780 (The file cannot be accessed by the system). It didn’t say which file (because, you know, why would that be useful information?), so I began by verifying that all of the VM’s files were in the same highly available location.

I’ll spare you the details of my fairly frantic searching, but it turned out that I had neglected to update my deployment script and had missed one very critical component: this cluster hosted virtual desktops, so each node ran the Data Deduplication role… except the one that I had newly rebuilt. I quickly whipped out a role/feature deployment script that I keep on hand, and all was well.
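A quick way to catch this kind of mismatch is to compare a known-good node against the rebuilt one. A sketch, with hypothetical node names:

    # Show roles/features that differ between SVHV1 (reference) and SVHV2 (rebuilt)
    $reference = (Get-WindowsFeature -ComputerName SVHV1 | Where-Object Installed).Name
    $rebuilt   = (Get-WindowsFeature -ComputerName SVHV2 | Where-Object Installed).Name
    Compare-Object -ReferenceObject $reference -DifferenceObject $rebuilt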

5. Do You Need to Fix Permissions?

Always be on the lookout for the lovely 0x80070005 — otherwise known as “access denied”. When you rebuild a cluster node using the same name, it should slide right back into Active Directory without any fuss. Deleting the Active Directory object before re-adding the node doesn’t really help things, so I’d avoid that. Either way, you might need to rebuild permissions. I would pay special attention to delegation. I wouldn’t spend a great deal of time guessing at it. If you think delegation might be an issue, then apply the fix and test.

Usually, you do not need to re-apply file level permissions after a node add/rebuild. If you feel that it’s necessary, I would work at the containing folder level as much as possible. It can be maddening trying to set ACLs on individual virtual machine locations.

6. Are You Having an SPN Issue?

Look in the Event Viewer on other nodes for event ID 4 from Security-Kerberos regarding failures around Kerberos tickets and SPNs (service principal names). This can happen whether or not you deleted the Active Directory object beforehand, although it seems to sort itself out more easily when you re-use the existing object.

If you continue having trouble with this message, you’ll find many references and fix suggestions by searching on the event ID and text. Everywhere that I went, I saw different answers. No one seemed to have gathered a nice little list of things to try.

7. Did You Set Your PowerShell Execution and Remoting Policies?

I have a long list of “issues” that I solve by group policy. If you’re not doing that, then you could miss a number of small things. For instance, if you have built up a decent repertoire of PowerShell scripts to handle automation, you might suddenly find that they don’t work after a node replacement. This should help (if run from an elevated PowerShell prompt):
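    # A reasonable starting point; adjust the policy to your organization's standard
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force
    Enable-PSRemoting -Force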

Of course, the battle over what constitutes the “best” PowerShell execution policy continues to rage on, and will likely do so for as long as people like to argue. “RemoteSigned” has served me well. Use what you like. Just remember that the more restrictive policies (“Restricted” and “AllSigned”) will block your unsigned scripts.

8. Do You Just Need To Rebuild the Cluster?

I have never once needed to completely destroy and rebuild a cluster outside of my test lab. I wouldn’t take such a “nuclear” option off of the table, however. Each cluster node maintains its own small database about the cluster configuration. If you’re using a quorum mode that includes disk storage, a copy of the database exists there as well. Like any other database, I’m forced to accept the possibility that the database could become damaged beyond repair. If that happens, a complete rebuild might be your best bet.

Try all other options first.

If you must destroy a cluster to rebuild it, remember this:

  • The contents of CSVs and cluster disks will not be altered, but you won’t be able to keep them online.
  • If a cluster’s virtual machines are kept on SMB shares, they can remain online during the rebuild through careful adding and removing. You can add and remove HA features to/from a virtual machine without affecting its running state.
  • You must run the Clear-ClusterNode command against each node.
  • Delete the HKLM\Cluster key before re-adding a node.
  • Format any witness disks before re-adding to a cluster.
  • Delete any cluster-related data from witness shares before re-adding to a cluster.

Fortunately, Microsoft made their clustering technology simple enough that such drastic measures should never be necessary.

 

Critical Status in Hyper-V Manager

I’m an admitted technophile. I like blinky lights and sleek chassis and that new stuff smell and APIs and clicking through interfaces. I wouldn’t be in this field otherwise. However, if I were to compile a list of my least favorite things about unfamiliar technology, that panicked feeling when something breaks would claim the #1 slot. I often feel that systems administration sits diametrically opposite medical care. We tend to be comfortable learning by poking and prodding at things while they’re alive. When they’re dead, we’re sweating — worried that anything we do will only make the situation worse. For many of us in the Hyper-V world, that feeling first hits with the sight of a virtual machine in “Critical” status.

If you’re there, I can’t promise that the world hasn’t ended. I can help you to discover what it means and how to get back on the road to recovery.

The Various “Critical” States in Hyper-V Manager

If you ever look at the underlying WMI API for Hyper-V, you’ll learn that virtual machines have a long list of “sick” and “dead” states. Hyper-V Manager distills these into a much smaller list for its display. If you have a virtual machine in a “Critical” state, you’re only given two control options: Connect and Delete:

[Screenshot: a virtual machine in Saved-Critical status with only Connect and Delete available]

We’re fortunate enough in this case that the Status column gives some indication as to the underlying problem. That’s not always the case. That tiny bit of information might not be enough to get you to the root of the problem.

For starters, be aware that any state that includes the word “Critical” typically means that the virtual machine’s storage location has a problem. The storage device might have failed. The host may not be able to connect to storage. If you’re using SMB 3, the host might be unable to authenticate.

You’ll notice that there’s a hyphen in the state display. Before the hyphen will be another word that indicates the current or last known power state of the virtual machine. In this case, it’s Saved. I’ve only ever seen three states:

  • Off-Critical: The virtual machine was off last time the host was able to connect to it.
  • Saved-Critical: The virtual machine was in a saved state the last time the host was able to connect to it.
  • Paused-Critical: The paused state typically isn’t a past condition. This one usually means that the host can still talk to the storage location, but it has run out of free space.

There may be other states that I have not discovered. However, if you see the word “Critical” in a state, assume a storage issue.

Learning More About the Problem

If you have a small installation, you probably already know enough at this point to go find out what’s wrong. If you have a larger system, you might only be getting started. With only Connect and Delete, you can’t find out what’s wrong. You need to start by discovering the storage location that’s behind all of the fuss. Since Hyper-V Manager won’t help you, it’s PowerShell to the rescue:
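    # 'svtest' is the demonstration virtual machine; substitute your own
    Get-VM -Name svtest | Format-List *
    Get-VM -Name svtest | Format-List Path, ConfigurationLocation, SnapshotFileLocation, SmartPagingFilePath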

Remember to use your own virtual machine name for best results. The first of those two lines will show you all of the virtual machine’s properties. It’s easier to remember in a pinch, but it also displays a lot of fields that you don’t care about. The second one pares the output list down to show only the storage-related fields. My output:

[Screenshot: Get-VM output showing the storage-related fields]

The Status field specifically mentioned the configuration location. As you can see, the same storage location holds all of the components of this particular virtual machine. We are not looking at anything related to the virtual hard disks, though. For that, we need a different cmdlet:
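    Get-VMHardDiskDrive -VMName svtest
    Get-VMHardDiskDrive -VMName svtest | Select-Object -ExpandProperty Path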

Again, I recommend that you use the name of your virtual machine instead of mine. The first cmdlet will show a table display that includes the path of the virtual hard disk file, but it will likely be truncated. There’s probably enough to get you started. If not, the second shows the entire path.

[Screenshot: Get-VMHardDiskDrive output showing the full VHDX path]

Everything that makes up this virtual machine happens to be on the same SMB 3 share. If yours is on an iSCSI target, use iscsicpl.exe to check the status of connected disks. If you’re using Fibre Channel, your vendor’s software should be able to assist you.

Correcting the Problem

In my case, the Server service was stopped on the system that I use to host SMB 3 shares. It got that way because I needed to set up a scenario for this article. To return the virtual machine to a healthy state, I only needed to start that service and wait a few moments.

Your situation will likely be different from mine, of course. Your first goal is to rectify the root of the problem. If the storage is offline, bring it up. If there’s a disconnect, reconnect. After that, simply wait. Everything should take care of itself.

When I power down my test cluster, I tend to encounter this issue upon turning everything back on. I could start my storage unit first, but the domain controllers are on the Hyper-V hosts so nothing can authenticate to the storage unit even if it’s on. I could start the Hyper-V hosts first, but then the storage unit isn’t there to challenge authentication. So, I just power the boxes up in whatever order I come to them. All I need to do is wait — the Hyper-V hosts will continually try to reach storage, and they’ll eventually be successful.

If the state does not automatically return to a normal condition, restart the “Hyper-V Virtual Machine Management” service. You’ll find it by that name in the Services control panel applet. In an elevated PowerShell session:
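    Restart-Service -Name vmms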

At an administrative command prompt:
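    net stop vmms
    net start vmms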

That should clear up any remaining status issues. If it doesn’t, there is still an issue communicating with storage. Or, in the case of the Paused condition, it still doesn’t believe that the location has sufficient space to safely run the virtual machine(s).

Less Common Corrections

If you’re certain that the target storage location does not have issues and the state remains Critical, then I would move on to repairs. Try chkdsk. Try resetting/rebooting the storage system. It’s highly unlikely that the Hyper-V host is at fault, but you can also try rebooting that.

Sometimes, the constituent files are damaged or simply gone. Make sure that you can find the actual .xml (2012 R2 and earlier) or .vmcx (2016 and later) file that represents the virtual machine. Remember that it’s named with the virtual machine’s unique identifier. You can find that with PowerShell:
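    # Returns the GUID that names the virtual machine's configuration file
    (Get-VM -Name svtest).VMId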

If the files are misplaced or damaged, your best option is restore. If that’s not an option, then Delete might be your only choice. Delete will remove any remainders of the virtual machine’s configuration files, but will not touch any virtual hard disks that belong to the virtual machine. You can create a new one and reattach those disk files.

Best of luck to you.

Cannot Delete a Virtual Hard Disk from a Cluster Shared Volume

When you use the built-in Hyper-V tools (Hyper-V Manager and PowerShell) to delete a virtual machine, all of its virtual hard disks are left behind. This is by design and is logically sound. The configuration files are components of the virtual machine and either define it or have no purposeful existence without it; the virtual hard disks are simply attached to the virtual machine and could just as easily be attached to another. After you delete the virtual machine, you can manually delete the virtual hard disk files. Usually. Sometimes, when the VHD is placed on a cluster shared volume (CSV), you might have some troubles deleting it. The fix is simple.

Symptoms

There are a few ways that this problem will manifest. All of these conditions will be applicable, but the way that you encounter them is different.

Symptom 1: Cannot Delete a VHDX on a CSV Using Windows Explorer on the CSV’s Owner Node

When using Windows Explorer to try to delete the file from the node that owns the CSV, you receive the error: The action can’t be completed because the file is open in System. Close the file and try again.

[Screenshot: “file is open in System” error dialog]

Note: this message does sometimes appear on non-owner nodes.

Symptom 2: Cannot Delete a VHDX on a CSV Using Windows Explorer on a Non-Owning Node

When using Windows Explorer to try to delete the file from a node other than the CSV’s owner, you receive the error: The action can’t be completed because the file is open in another program. Close the file and try again.

[Screenshot: “file is open in another program” error dialog]

Note: non-owning nodes do sometimes receive the “System” message from symptom 1.

Symptom 3: Cannot Delete a VHDX on a CSV Using PowerShell

The error is always the same from PowerShell whether you are on the owning node or not: Cannot remove item virtualharddisk.vhdx: The process cannot access the file ‘virtualharddisk.vhdx’ because it is being used by another process.

[Screenshot: PowerShell “being used by another process” error]

Symptom 4: Cannot Delete a VHDX on a CSV Using the Command Prompt

The error message from a standard command prompt is almost identical to the message that you receive in PowerShell: The process cannot access the file because it is being used by another process.

[Screenshot: command prompt “being used by another process” error]

Solution

Clean up is very simple, but it comes with a serious warning: I do not know what would happen if you ran this against a VHD that was truly in use by a live virtual machine, but you should expect the outcome to be very bad.

Open an elevated PowerShell prompt on the owner node and issue Dismount-DiskImage:
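    # The path is an example; fully qualify the path to your stuck VHDX
    Dismount-DiskImage -ImagePath 'C:\ClusterStorage\Volume1\virtualharddisk.vhdx'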

You do not need to type out the -ImagePath parameter name but you must fully qualify the path! If you try to use a relative path with Dismount-DiskImage or any of the other disk image manipulation cmdlets, you will be told that The system cannot find the file specified:

[Screenshot: Dismount-DiskImage failing on a relative path]

Once the cmdlet returns, you should be able to delete the file without any further problems.

If you’re not sure which node owns the CSV, you can ask PowerShell, assuming that the Failover Cluster cmdlets are installed:
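    Get-ClusterSharedVolume | Select-Object -Property Name, OwnerNode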

[Screenshot: CSV owner shown in PowerShell]

You can also use Failover Cluster Manager, on the Storage/Disks node:

[Screenshot: CSV owner in Failover Cluster Manager]

Other Cleanup

Make sure to use Failover Cluster Manager to remove any other mention of the virtual machine. Look on the Roles node. These resources are not automatically cleaned up when the virtual machine is deleted. Try to be in the habit of removing cluster resources before deleting the related virtual machine, if possible. I assume that for most of us, deleting a virtual machine is a rare enough occurrence that it’s easy to overlook things like this. I do know that the problem can occur even if the objects are deleted in the “proper” order, so this is not the root cause.

Alternative Cleanup Approaches

The above solution has worked for me every time, but it’s a very rare event without a known cause, so it’s impossible for me to test every possibility. There are two other things that you can try.

Move the CSV to Another Node

Moving the CSV to another node might break the lock from the owning node. This has worked for me occasionally, but not as reliably as the primary method described above.

In Failover Cluster Manager, right-click the CSV, expand Move, then click one of the two options. Best Possible Node will choose where to place the CSV for you; Select Node will give you a dialog for you to choose the target.

[Screenshot: the Move menu for a CSV in Failover Cluster Manager]

Alternatively, you can use PowerShell:
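    # 'Cluster Disk 1' and SVHV2 stand in for your CSV name and target node
    Move-ClusterSharedVolume -Name 'Cluster Disk 1' -Node SVHV2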

If you don’t want to tell it which node to place the CSV on, simply omit the Node parameter:
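    Move-ClusterSharedVolume -Name 'Cluster Disk 1'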

[Screenshot: Move-ClusterSharedVolume output in PowerShell]

Rolling Cluster Reboot

The “nuclear” option is to reboot each node in the cluster, starting with the original owner node. If this does not work, then the disk image is truly in use somewhere and you need to determine where.

Your Definitive Guide to Troubleshooting Hyper-V Live Migration

As we saw in our earlier article explaining how Live Migration works and how to get it going in 2012 and 2012 R2, Live Migration requires a great deal of cooperation between the source and target computers. It’s also a balancing act involving a number of security concerns. Failures within a cluster are uncommon, especially if the cluster has passed the validation wizard. Failures in Shared Nothing Live Migrations are more likely. Most issues are simple to isolate and troubleshoot.

General Troubleshooting

There are several known problem-causers that I can give you direct advice on. Some are less common. If you can’t find exactly what you’re looking for in this post, I can at least give you a starting point.

Migration-Related Event Log Entries

If you’re moving clustered virtual machines, the Cluster Events node in Failover Cluster Manager usually collects all the relevant events. If they’ve been reset or expired from that display, you can still use Event Viewer at these paths:

  • Applications and Services Logs\Microsoft\Windows\Hyper-V-High-Availability\Admin
  • Applications and Services Logs\Microsoft\Windows\Hyper-V-VMMS\Admin

The “Hyper-V-High-Availability” tree usually has the better messages, although it has a few nearly useless ones, such as event ID 21111, “Live migration of ‘Virtual Machine VMName’ failed.” Most Live Migration errors come with one of three statements:

  • Migration operation for VMName failed
  • Live Migration did not succeed at the source
  • Live Migration did not succeed at the destination

These will usually, but not always, be accompanied by supporting text that further describes the problem. “Source” messages often mean that the problem is so bad and obvious that Hyper-V can’t even attempt to move the virtual machine. These usually have the most helpful accompanying text. “Destination” messages usually mean that either there is a configuration mismatch that prevents migration or the problem did not surface until the migration was either underway or nearly completed. You might find that these have no additional information or that what is given is not very helpful. In that case, specifically check for permissions issues and that the destination host isn’t having problems accessing the virtual machine’s storage.

Inability to Create Symbolic Links

As we talked about each virtual machine migration method in our explanatory article, part of the process is for the target host to create a symbolic link in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines. This occurs under a special built-in credential named NT Virtual Machine\Virtual Machines, which has a “well-known” (read as: always the same) security identifier (SID) of S-1-5-83-0.

Some attempts to harden Hyper-V result in a domain policy that grants the Computer Configuration\Windows Settings\Security Settings\Local Policies\User Rights Assignment\Create symbolic links right only to the built-in Administrators group. Doing so will certainly cause Live Migrations to fail and can sometimes cause other virtual machine creation events to fail.

Your best option is to just not tinker with this branch of group policy. I haven’t ever even heard of an attack mitigated by trying to improve on the contents of this area. If you simply must override it from the domain, then add in an entry in your group policy for it. You can just type in the full name as shown in the first paragraph of this section.

[Screenshot: the Create symbolic links user right in Group Policy]

Note: The “Log on as a service” right must also be assigned to the same account. Not having that right usually causes more severe problems than Live Migration issues, but it’s mentioned here for completeness.

Inability to Perform Actions on Remote Computers

Live Migration and Shared Nothing Live Migration invariably involves two computers at minimum. If you’re sitting at your personal computer with a Failover Cluster Manager or PowerShell open, telling Host A to migrate a virtual machine to Host B, that’s three computers that are involved. Most Access Denied errors during Live Migrations involve this multi-computer problem.

Solution 1: CredSSP

CredSSP is kind of a terrible thing. It allows one computer to store the credentials for a user or computer and then use them on a second computer. It’s sort of like cached credentials, only transmitted over the network. It’s not overly insecure, but it’s also not something that security officers are overly fond of. You can set this option on the Authentication protocol section of the Advanced Features section of the Live Migration configuration area on the Hyper-V Settings dialog.

[Screenshot: Live Migration Advanced Features showing the authentication protocol options]

CredSSP has the following cons:

  • Not as secure as Kerberos
  • Only works when logged in directly to the source host

CredSSP has only one pro: You don’t have to configure delegation. My preference? Configure delegation.

Solution 2: Delegation

Delegation can be a bit of a pain to configure, but in the long-term it is worth it. Delegation allows one computer to pass on Kerberos tickets to other computers. It doesn’t have CredSSP’s hop limit; computers can continue passing credentials on to any computer that they’re allowed to delegate to as far as is necessary.

Delegation has the following cons:

  • It can be tedious to configure.
  • If not done thoughtfully, can needlessly expose your environment to security risks.

Delegation’s major pro is that as long as you can successfully authenticate to one host that can delegate, you can use it to Live Migrate to or from any host it has delegation authority for.

As far as the “thoughtfulness”, the first step is to use Constrained Delegation. It is possible to allow a computer to pass on credentials for any purpose, but it’s unnecessary.

Delegation is done using Active Directory Users and Computers or PowerShell. I have written an article that explains both ways and includes a full PowerShell script to make this much easier for multiple machines.

Be aware that delegation is not necessary for Quick or Live Migrations between nodes of the same cluster.

Mismatched Physical CPUs

Since you’re taking active threads and relocating them to a CPU in another computer, it seems only reasonable that the target CPU must have the same instruction set as the source. If it doesn’t, the migration will fail. There are hard and soft versions of this story. If the CPUs are from different manufacturers, that’s a hard stop. Live Migration is not possible. If the CPUs are from the same manufacturer, that could be a soft problem. Use CPU compatibility mode:

[Screenshot: the processor compatibility setting; the virtual machine must be off to change it]

As shown in the screenshot, the virtual machine needs to be turned off to change this setting.

A very common question is: What are the effects of CPU compatibility mode? For almost every standard server usage, the answer is: none. Every CPU from a manufacturer has a common core set of available instructions and usually a few common extensions. Then, they have extra function sets. Applications can query the CPU for its CPUID information, which contains information about its available function sets. When the compatibility mode box is checked, all of those extra sets are hidden; the virtual machine and its applications can only see the common instruction sets. These extensions are usually related to graphics processing and are almost never used by any server software. So, VDI installations might have trouble when enabling this setting, but virtualized server environments usually will not.

This screenshot was taken from a virtual machine with compatibility disabled using CPU-Z software:

[Screenshot: CPU-Z with CPU compatibility mode disabled]

The following screen shot shows the same virtual machine with no change made except the enabling of compatibility mode:

[Screenshot: CPU-Z with CPU compatibility mode enabled]

Notice how many things are the same and what is missing from the Instructions section.

Insufficient Resources

If the target system cannot satisfy the memory or disk space requirements of the virtual machine, any migration type will fail. These errors are usually very specific about what isn’t available.

Virtual Switch Name Mismatch

The virtual machine must be able to connect its virtual adapters to virtual switches with the same names on the target host. Furthermore, a clustered virtual machine cannot move if it uses an internal or private virtual switch, even if the target host has a switch with the same name.

If it’s a simple problem with a name mismatch, you can use Compare-VM to overcome the problem while still performing a Live Migration. The basic process is to use Compare-VM to generate a report, then pass that report to Move-VM. Luke Orellan has written an article explaining the basics of using Compare-VM. If you need to make other changes, such as where the files are stored, notice that Compare-VM has all of those parameters. If you use a report from Compare-VM with Move-VM, you cannot supply any other parameters to Move-VM.
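A minimal sketch of that flow, with hypothetical names; the commented line shows one way that a switch-name incompatibility might be repaired before committing the move:

    # Generate the compatibility report for the proposed move
    $report = Compare-VM -Name svtest -DestinationHost SVHV2

    # Review what blocks the migration
    $report.Incompatibilities | Format-List

    # Example repair for a missing switch: reconnect the adapter named in the report
    # $report.Incompatibilities[0].Source | Connect-VMNetworkAdapter -SwitchName 'External'

    # Hand the corrected report to Move-VM; no other parameters at this point
    Move-VM -CompatibilityReport $report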

Live Migration Error 0x8007274C

This is a very common Live Migration error, and it always traces back to network problems. If the source and destination hosts are in the same cluster, start by running the Cluster Validation Wizard, specifying only the network tests. That might tell you right away what the problem is. Other possibilities:

  • Broken or not completely attached cables
  • Physical adapter failures
  • Physical switch failures
  • Switches with different configurations
  • Teams with different configurations
  • VMQ errors
  • Jumbo frame misconfiguration

If the problem is intermittent, check teaming configurations first; one pathway might be clear while another has a problem.

Storage Connectivity Problems

A maddening cause for some “Live Migration did not succeed at the destination” messages is a problem with storage connectivity. These aren’t always obvious, because everything might appear to be in order. Do any independent testing that you can. Specifically:

  • If the virtual machine is placed on SMB 3 storage, double-check that the target host has Full Control on the SMB 3 share and the backing NTFS locations; if possible, remove and re-add the host. A quick way to check the share permissions is sketched after this list.
  • If the virtual machine is placed on iSCSI or remote Fibre Channel storage, double-check that it can work with files in that location. iSCSI connections sometimes fail silently. Try disconnecting and reconnecting to the target LUN. A reboot might be required.
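
For the SMB 3 case, you can review and, if necessary, grant the host's computer account rights on the share like this (a sketch, run on the file server; the share, domain, and host names are placeholders). Remember that the NTFS permissions on the backing folder must grant Full Control as well.

    # Review who can access the share
    Get-SmbShareAccess -Name 'VMShare'
    # Grant the Hyper-V host's computer account Full Control
    Grant-SmbShareAccess -Name 'VMShare' -AccountName 'DOMAIN\HV01$' -AccessRight Full -Force
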
Hyper-V’s Relationship with Virtual Machine XML Files

Through version 2012 R2, all Hyper-V virtual machines are defined by an XML file. It conforms to the XML specification and can be viewed in any text editor. Microsoft has never supported editing this file, but being able to read it has its uses. For instance, if you find an orphaned virtual machine file set, you just open its XML and navigate to configuration/properties/name to find the name of the virtual machine that the XML file describes. You could also create a custom XML reader to quickly poll virtual machine(s) for other information that you find relevant. Out of all the files that belong to a virtual machine, the XML file is the most important. The purpose of this article is to investigate the importance of these XML files and how Hyper-V utilizes them.

Changes in Client Hyper-V in Windows 10 and Hyper-V Server 2016

The content of this article is only applicable up through version 2012 R2. Microsoft has elected to replace its open, comprehensible system with a new model that utilizes a larger, undocumented file format that causes problems for text-based file readers. I have begun investigating the differences and hope to return to this subject for versions 2016+ once I have a better understanding. The only thing that I’ve learned so far is that the hoopla around the format being "binary" is mostly meaningless. The new format contains the same data as the old; it just takes a bit more effort to look at it:

Old vs. New Hyper-V VM Definition Files

As you can see, the same information in the original file type on the left is more or less human-readable in a hex dump of the new file format on the right. The format doesn’t seem like it’s overly complicated across the board. For example, I saw fairly quickly that it appears that each device leads in with four bytes of sequential numbering. While I didn’t line up the highlighting perfectly, look at offset 70B000 (right at the beginning of the screen capture) and you’ll see the byte pattern 00 00 00 07. The next device appears to start at offset 722005 (very near the end of the highlighted area) and has the byte pattern 00 00 00 08. I don’t really see how this format is “more efficient” since it is larger, has nearly the same layout, and adds a lot of empty padding at the beginning, but it is the format we’ll have to get used to. I’m certain that someone will come out with a parser in fairly short order, if it hasn’t been done already.

The other change in 2016 is the way that Hyper-V keeps tabs on the virtual machines via the VMCX files. I don’t understand that at all, yet. From this point forward, I doubt that much, if any, of the content of this article will apply to 2016 or later.

How Hyper-V Works with Virtual Machine XML Files

By default, all virtual machines are created in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines. If you do not change the default, then the XML file that represents each virtual machine created on the host is placed in this folder. This includes not only virtual machines created by Hyper-V Manager, Failover Cluster Manager, and New-VM, but also guests that are introduced to the host by Quick Migration or any kind of Live Migration.

Here’s a quick, simple rule: if an XML file exists in this folder that contains a GUID in the file name and can be parsed as a virtual machine definition, Hyper-V will automatically treat it as a valid virtual machine.

Any virtual machine defined by a file that fits the above rule will appear in Hyper-V Manager and Get-VM. No other file is required to make this happen — not VHDs, not BINs, not VSVs, nothing. If the contents of the file are incorrect, then the virtual machine will be inoperable, but it will still appear. The XML file is the centerpiece for all the rest of the components. From the XML file, Hyper-V will be able to find its way to all of the other files that make up a virtual machine. The VHDs and some of the other files will be specifically indicated in the XML while others will be placed in locations that are relative to the XML file.

XML Files in Other Locations

Of course, most people do not leave the default virtual machine placement unchanged, as it’s generally accepted that you don’t want your virtual machines to be in the same place as the hypervisor files, for a variety of reasons. However, the C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines folder retains its significance. Here is a view of mine on my primary host:

Root Virtual Machines Folder with Relocated VMs

Every single one of the virtual machines currently running on this host is represented here, even though some are on SMB 3 storage and some are in Cluster Shared Volumes.

This may be a bit confusing at this point in the explanation, but here’s another solid rule: only XML files that are present in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines can be detected by Hyper-V as virtual machines.

You might be wanting to tell me that all of your virtual machines’ XML files are elsewhere and they are working just fine. That’s also true. However, they are also present in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines. If you look at the Type column of my screenshot (and your own folder), you’ll see that these are all of type .symlink, not of type XML Document. Hyper-V looks in this folder, and only in this folder, for XML files. Because these .symlink objects exist in that location, it retrieves exactly what it’s looking for. Hyper-V does not really know that the XML is elsewhere. That might not be technically true under the covers, but conceptually, this is how it works.
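
You can see this for yourself from PowerShell (recent PowerShell versions expose the LinkType and Target properties on file objects):

    # List the registration files Hyper-V scans, showing which are symbolic links
    Get-ChildItem 'C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines' -Filter '*.xml' |
        Select-Object Name, LinkType, Target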

Symbolic Links

To grasp how Hyper-V functions when the XML files are not physically present in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines, you must understand the concept of the symbolic link. Many people think of these as shortcut files, but they are not. A shortcut file (identifiable by the .lnk extension) is almost completely different.

This is how the traditional shortcut file functions:

  1. The shortcut file (.lnk) is activated.
  2. The operating system retrieves the location of the .lnk from the volume’s file allocation table.
  3. The operating system parses the contents of the .lnk file to find information on the target file.
  4. The operating system retrieves the location of the target file from the indicated volume’s file allocation table.
  5. The operating system opens the target file.

A visual representation of accessing a file on a D: volume via a shortcut file on the C: volume:

Shortcut Example

Symbolic links have a similar purpose, but quite different functionality. As you can see from the screenshot of my directory listing, they appear as the actual target file, not as a separate shortcut file. The operating system handles all of the functionality much more smoothly:

Symbolic Link

The first major difference between a symbolic link and a shortcut is that a symbolic link has the exact same file name as its target. This makes it far easier for an automated system, like Hyper-V, to use a single search to find what it is looking for. In the case of Hyper-V, the only thing it does is scan C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines for files that match the pattern GUID.xml. It’s up to the operating system to deliver the actual files that it asks for, even if they are in other locations. The second difference is that a symbolic link exists entirely in the file allocation table. Where the process of translating a shortcut takes five steps, following a symbolic link requires only one extra hop over directly accessing a file.
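
For illustration, here is what creating such a link would look like by hand (a sketch only; the GUID and paths are made-up examples, this requires elevation and PowerShell 5.0 or later (older hosts can use mklink), and Hyper-V normally maintains these links itself):

    # Hypothetical example: a registration link pointing at relocated VM files
    New-Item -ItemType SymbolicLink `
        -Path 'C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines\1A2B3C4D-0000-0000-0000-000000000001.xml' `
        -Target 'D:\VMs\Virtual Machines\1A2B3C4D-0000-0000-0000-000000000001.xml'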

Broken or Lost XML Files

Prior to 2012, Hyper-V’s XML parser was very fragile. Simply having a well-formed XML file wasn’t enough. It was very easy to make very minor modifications to the XML that would not have broken a traditional parser but would completely throw Hyper-V for a loop. Beginning in 2012, the XML parser became much more resilient to the point that we should all be sad to see it go. Of course, it’s neither perfect nor omniscient. However, it does follow rules that are easily understood.

Rule: If a VM does not have a GUID.xml file in C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines, then it does not exist.

Hyper-V does not look anywhere else for these files. They could be deleted; antivirus software has been known to do this in the past, although I believe most now understand to leave these files alone. They also might not be created when they should. I’ve had a few instances where a Live Migration had some sort of problem and the XML file wound up in a sort of limbo where it wasn’t on either system. I’m not entirely certain how it gets bound up, but usually, restarting the Hyper-V Virtual Machine Management (vmms.exe) service on one or both hosts sorts this particular issue out.
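
That restart is a one-liner; running virtual machines are not affected, though management connections are briefly interrupted:

    Restart-Service -Name vmms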

Rule: If Hyper-V cannot parse the XML file, the virtual machine might as well not exist.

If the XML file is present but damaged to the point that Hyper-V’s parser can’t untangle it, it might as well not exist. It won’t appear in Hyper-V Manager or any other Hyper-V tool. However, you’ll find error event 16030 recorded in the Hyper-V-VMMS log: “Cannot load a virtual machine configuration because it is corrupt. (Virtual machine ID GUID) Delete the virtual machine configuration file (.XML file) and recreate the virtual machine”. If this happens to an XML file that is represented by a symbolic link, be aware that Hyper-V will delete the symbolic link. If you are able to repair the target XML file, you can use Hyper-V’s import feature to register the fixed XML in place, and the virtual machine will be ready to use.
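
The register-in-place import can also be done from PowerShell (a sketch; the path is a placeholder pointing at the repaired XML file):

    # Re-register the repaired definition without copying any files
    Import-VM -Path 'D:\VMs\Virtual Machines\1A2B3C4D-0000-0000-0000-000000000001.xml' -Register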

Rule: If Hyper-V’s local folder contains a symbolic link to a location that does not exist, the virtual machine will be set to a “Critical” state.

Most of us that use Hyper-V with remote storage have encountered this issue at least once. If the target location for a virtual machine’s symbolic link goes dead, Hyper-V will retain a memory of its name and state but will lose everything else.

Virtual Machine in Off Critical State

Recovering from Critical States

In the previous screenshot, the recovery method was very simple: I took the CSV out of maintenance mode and waited for VMMS to catch on that something had changed. That worked out well for me because I already knew how Hyper-V would respond to a CSV going into maintenance mode and how it would act when maintenance mode ended. What you need to do to fix a virtual machine in a critical state depends on how it got there in the first place.

  • Temporarily Unreachable Storage
    If a VM is in a critical state because the back-end storage is offline for a short time, you have two options. The first is to simply wait. VMMS will periodically check on VMs in a critical state, and if the storage becomes reachable, it will take appropriate steps. I prefer this option because it requires no effort on the part of the administrator. If the VM was previously running, it will be started automatically. This is what happens to me when I bring my entire test cluster up from an off state and don’t give my storage host sufficient time to start before turning on my Hyper-V hosts. The second option is to just start interacting with the VM(s). That could mean manually turning them on, refreshing the Hyper-V Manager screen, or even restarting the VMMS service. Changing their state in Failover Cluster Manager often sorts a lot of issues out with VMMS.
  • Collision Due to Clustering
    Every once in a while, clustering causes an issue (or shows symptoms of some other underlying issue). A virtual machine might continue to appear on one node when it’s not really there, but in a critical state. First, make sure you’ve checked for any issue that you can address. A common culprit is a group policy that affects user rights assignments, for instance, one that controls the “Create symbolic links” right. If you’re reasonably certain everything is OK, attempt to Quick Migrate or Live Migrate the actual instance of the problematic machine to any other cluster node. This action should cause all the nodes to sync up and remove any invalid registrations.
  • Permanently Unreachable Storage
    If the storage has crashed and you’re replacing it and restoring the data at the file level (as opposed to performing a complete virtual machine restore), how you proceed depends on the situation. If the name of the target storage volume is the same, then the wait method, the interaction method, or the Quick/Live Migration method mentioned in the prior two bullets should sort you out. If one technique doesn’t work, try another. If you changed something about the storage volume, then your best bet is to delete the item in a Critical state and import the virtual machine, choosing to register it in place. The nice thing is, if Failover Clustering had marked a virtual machine as one of its own, re-creating the virtual machine with the same GUID via import will allow Failover Clustering to automatically recognize it.

Leave the XML Alone

The point of this article was to demystify the XML file. It’s not to give you a false sense of security. I’ve taken you through the usage of the XML file and how to keep it in Hyper-V’s good graces, but I didn’t talk about the contents. That’s because it’s really not a place you should ever need to be. Even though I don’t like that the “binary” format is replacing the friendly XML format, I do like that it will discourage tinkering. If you absolutely must go into the file, do so in a read-only fashion.

Troubleshooting Hyper-V Webinar – Q & A Follow Up

A couple of weeks ago, Didier Van Hoye and I conducted a webinar on troubleshooting Hyper-V. We promised that we would follow up with a blog post on any of the questions that we were not able to get to during the webinar. Well, here it is! If your question was not answered during the webinar, it should be in the list below. Hopefully this will answer all the remaining questions, and don’t be afraid to use the comments section below for follow-up questions as needed!

Also note that Didier and I, along with fellow Altaro blogger Eric Siron, have split up the questions. Collectively, we’ll be answering all of the remaining questions below.

Revisit the Webinar

You can watch the webinar recording below, then proceed to the questions section underneath.

The Questions

Q: Hi, we are seeing a number of “CPU Wait Time per Dispatch” errors during our backup window (we use Altaro), have you experienced this at all? 

Q: On our Windows Server 2012 Hyper-V Hosts, I have to keep stopping the Volume Shadow Copy service manually every week or so because the SSD array keeps running out of space. How do I get around this situation?

Q: Where is the best place to get assistance in the best way to configure your VMs on Hyper-V

This is a very generic question, so it’s difficult to provide a one-size-fits-all answer. If you want to learn some more general information about configuring Hyper-V and the associated VMs, be sure to check out our Hyper-V blog HERE

Q: Hey guys – common problem: Linux (Debian) within VM on 2012 R2 core: During backup, hard disk within linux is mounted read-only, but after the backup is finished, there are times it remains in read-only mode. Thoughts?

This isn’t something that I’ve run into myself, and I run quite a few Debian boxes. I would be curious to know what version of Debian and what version of the Linux kernel is in play here. Without that info, I would first make sure that all packages on the system are at their latest and greatest using APT, and then check to make sure you’re running a newer version of the Linux kernel. It could also be that the particular file system in use has issues during a backup operation, so that is an angle that I would investigate further as well.

Q: Can I Backup to a Linux Server?

Q: I’ve Seen Applications failing during Live Migration (connection issues). I’ve never seen this in a VMware Environment. How can I troubleshoot this?

Q: We have a multi-node Hyper-V Cluster that has recently developed an issue with intermittent failure of live migrations.

We noticed this when one of our CAU runs failed because it could not place the hosts into maintenance mode or successfully drain all the roles from them. 

Scenario:

Place any node into maintenance mode/drain roles.

Most VMs will drain and live migrate across onto other nodes. Randomly, one or a few will refuse to move (it always varies as to which VM and which node it is moving to or from). The live migration ends with a failure generating event IDs 21502, 22038, 21111, and 21024. If you run the process again (drain roles), it will migrate the VMs, or if you manually live migrate them, they will move just fine. Manually live migrating a VM can result in the same intermittent error, but re-running the process will succeed after one or two tries, or after just waiting for a couple of minutes.

This occurs on all nodes in the cluster and can occur with seemingly any VM in the private cloud. Ideas?

Didier actually has a post that may help HERE. Usually this type of behavior is caused by some network misconfiguration, so be sure to test all the links between the two end points as needed.

Q: Any good sources for PowerShell scripts or examples of the steps to create and set up Hyper-V VHDXs instead of trying to do it via the GUI? I know there are MS Virtual Academy videos and stuff, but any good books or websites that would give scripting help?

Check out Jeff Hicks’s article HERE for help with the VHDX question. Other than that, we have a PowerShell section on our blog HERE. Also, PowerShell.org is a great resource as well.

Q: What are your thoughts on SCVMM? Would you recommend it versus the FCM and Hyper-V Manager?

Q: What is the difference between Hyper-V backup and VM Backup?

If you’re talking about Altaro products, Hyper-V backup is the name of the older version of our flagship backup application that only provided protection to Hyper-V based workloads. Altaro VM Backup is the new renamed version and now supports protecting VMware based workloads as well.

Q: Does Altaro support backing up VMs housed on SMB shares? Are there any special considerations for doing so?

We fully support backing up VMs housed on SMB shares. No special considerations are needed as long as our software can talk with the Hyper-V host. That’s all we require for backing up the target virtual machines.

Q: Can you tell me if a VM Backup can be restored to another Hyper-V host, assuming the two hosts have the same Windows and Hyper-V Versions installed?

Yes! Our software supports restoring to another host in the environment. The only limitation is we cannot restore cross-platform. Meaning, we cannot restore a Hyper-V VM to a VMware ESXi host and vice-versa.

Q: Should Hotfixes be removed after a new patch comes out? For example, a new update rollup?

You’re going to want to look at this on a case by case basis with each hotfix. The documentation should state whether it is superseded by a new patch and what steps will need to be taken.

Q: We’ve had ongoing VM data network connectivity issues with onboard NICs on Hyper-V hosts. The hardware is HP Gen8 DL380s with Broadcom NICs. Is this a common issue with this vendor that might be fixed by using Intel NICs?

Q: Problems with updates and Gen 2 systems with secure boot. Is it worth just not enabling secure boot on Gen 2 VMS?

I would always use secure boot and generation 2 VMs whenever possible. All of the new advanced Hyper-V features are being developed for Gen 2 VMs and the pros far outweigh the cons. Outside of that it’s hard to discuss further without knowing exactly what types of issues you’re having with updates on Gen 2 VMs.

Q: Can you please share with us what you use to get a VM report from a 2012 R2 failover cluster? A report which shows how much RAM, CPU, and disk the VMs are using? I’m referring to a used-resource report. I tried to use Ops Manager, but it’s quite complicated to set up.

I would have a look at both of the below resources. Between the two, you should be able to achieve some of what you’re looking for.

SCRIPT from Technet Gallery

PowerShell Based Hyper-V Health Check

Q: How about management of MAC addresses? Should this be done by Hyper-V hosts separately in a cluster, or should SCVMM do that? We have seen MAC address changes on switches when the VM is Live Migrated to another host in the cluster.

That’s normal. The dynamic MAC address of the VM doesn’t change with a live migration. But the VM lands on a different vSwitch and port, quite possibly attached to different physical switches in different racks and so on. To handle this, a gratuitous ARP is sent to inform the network of that change and update the switches’ forwarding tables so that network traffic to and from the VM uses the correct switch port.

Whether you use SCVMM comes down to whether you already have it and what your use cases are. If you have SCVMM, you might as well leverage it. Personally, I would not buy it “just” to manage the static MAC addresses of the VMs on my Hyper-V cluster, unless that cluster is so big it becomes a must-have and you have a need for static MAC addresses. It really depends on the environment, use case, etc.

Q: Would jumbo frames on a storage network improve anything?

In most cases, yes. Normally, the storage vendor will even have specific instructions for that. Your mileage may vary, but it pays to test this when you’re not sure; you might be missing out on some extra throughput without them. If they don’t do any good, or if they cause issues, you can disable them again.

Q: Do you recommend switching off VMQ on machines where it is not needed?

If you leave VMQ enabled but never configure it, you can run into issues. So yes, if you don’t need or want it, disable it. If you leave it on, always configure it. Do you need it on 1Gbps? Probably not. Can you get it to work? Yes. Will it make a huge difference? Probably not, but I do have one host with a teamed (4 x 1Gbps) set of NICs where I use it, because I like to see how it behaves and what results we get in real-life scenarios over time.

Q: Any ideas if Microsoft will implement ODX on their Windows 2012-R2 Storage Server or above?

ODX is fully supported in Windows Server 2016 with storage arrays that support it, but they did not make ODX a feature of Storage Spaces, if that is what you’re referring to with “storage server”.

Q: Each host in a cluster has its own MAC pool. After LM, the MAC won’t change; after a reboot, it will. Then DHCP reservations are non-existent. Static MACs with DHCP or static IPs are the only way then for certain scenarios.

Correct. Do note that when a VM is started or restored from saved state, the host checks whether its dynamic MAC address falls within the host’s pool range and is not already in use; if not, the address is regenerated. So a reboot does not mean a new MAC address by default, but it does after a VM has been Live Migrated to another host with a different MAC address pool.

If you have a hard need for DHCP reservations in your use case, you’ll need static addresses. Another possible fix for that is the use of static IP addresses on the VMs themselves, taking DHCP out of the picture. For software licenses we do this when required: a static MAC address and a static IP address.

Q: After setting the VMQ MaxProcessors value, it’s not displayed correctly. For example, I have 2 CPUs with 12 cores, and MaxProcessors displays 16. Any idea why?

When you look at Get-NetAdapterVMQ on a cleanly installed server, you’ll see a default MaxProcessors value; this has nothing to do with the actual number of cores on the host. It’s a number chosen by the vendor, and 16 is actually the maximum number of cores that the Intel NIC can use for VMQ. Intel also only allows this to be set to 1, 2, 4, 8, or 16. Some vendors have 8 as the default but allow any value between 1 and 64, like the Mellanox card in this example.

Mellanox

It goes without saying you can’t use more cores than are available on the host, no matter what you configure. Be realistic with your settings.
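
If you do want to line the setting up with your actual topology, it looks something like this (a sketch; the adapter name and values are placeholders to adapt to your hardware):

    # Inspect the current VMQ settings
    Get-NetAdapterVmq
    # Example: give one NIC 8 queue processors starting at core 2
    Set-NetAdapterVmq -Name 'Ethernet 2' -BaseProcessorNumber 2 -MaxProcessors 8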

Q: Every time I create a new VM with a PS script, I set up a static MAC address (the one given by the host after the first boot), so I will not worry if the machine migrates to another node and takes another IP address. That happened already, and that’s why I decided to have static MAC addresses. Is that so bad?

It’s not “bad” if you do it well and for all VMs without exception. Having static MAC addresses that are in the pool for dynamic ones could lead to issues when you have VMs with dynamic MAC addresses in the environment. If your use case warrants this, it’s not bad, but you just have to figure out whether it’s needed. Why go through the trouble when the majority of use cases don’t need it? Why not do this only for the VMs or environments/use cases that really require it?

You say “takes a different IP address”: are you referring to DHCP, or did you mean it gets another MAC address? Preventing IP address changes can be handled by configuring a static IP address without the need for a static MAC address. When you want or need to use DHCP reservations for your VMs, then a static MAC address is needed, but perhaps just using static IP addresses is a better solution? The benefit of DHCP reservations is that you have the DHCP server as a central reference for your network configuration.

When you live migrate a VM, the dynamic MAC address of the VM doesn’t change. That only happens when you restart it. What does happen after a live migration is that an ARP is sent by the vSwitch to inform the network as to what switch port your VM is now attached to, so communications are not interrupted. It’s with such gratuitous ARPs (which the network has to allow) that we’ve seen some bugs on certain switch configurations and firmware. Most of those bugs were not Hyper-V networking specific, but live migrations made them more likely to show up.

Q: Is there a technology like ODX for SMB 3 and SOFS?

ODX is a capability of the storage array. When you build a SOFS with a storage array that supports ODX, you get ODX. When your array doesn’t, there’s no ODX. Storage Spaces doesn’t support ODX, so if you use that to build a SOFS, you don’t have it.

Q: Do you need to deploy DCB settings on a switch and/or on a Hyper-V host where the switch is exclusive to storage only and the only machines it serves are Hyper-V hosts and a Scale-Out File Server? All machines have iWARP (Chelsio) RDMA NICs.

With iWARP, you can get away with not leveraging, and as such not configuring, DCB. This is because iWARP offload leverages TCP/IP, which by itself can handle packet loss without causing issues. Do note that it can and will benefit from DCB under heavy load. When you use RoCE cards, DCB is mandatory.

Q: VM on WS 2012 R2 host: “stopping – critical” and status “service”. What is this? I cannot kill the process, not even with a Sysinternals tool. I can only restart the host, and then the story repeats. The original problem (after live migration) was that the VM was running but with no connectivity, although it was connected to a (working) virtual switch.

“Stopping-Critical” means that the Virtual Machine Management Service has a record for the virtual machine’s existence but cannot locate the matching files.

If the virtual machine’s files are intact:

  1. Relocate the files to keep them safe.
  2. Delete the virtual machine from Failover Cluster Manager and then from Hyper-V Manager.
  3. Import the virtual machine.
  4. Add it back into Failover Cluster Manager.

If the files are not intact, deleting the virtual machine from both of the tools (step 2) and then restoring it from backup is your only option.

Q: Please, how can I resolve this error: The IO operation at logical block address 85d for Disk 1 (PDO name: \Device\MPIODisk0) was retried.

This error message indicates that the storage subsystem is showing unreliable behavior. If it is remote storage (for example, iSCSI or fiber channel), then it may be a transient problem that you can ignore. If it is local storage, this is typical warning behavior for a hard drive that is beginning to fail.

Q: What would be the cause when, on a 2012 R2 Hyper-V host, the NIC teaming management NIC goes to an unidentified network after each reboot? I have deleted and recreated the NIC team and have set the NLA service to delayed start. Are there other things to check on?

There are a number of causes for this, all related to negotiation with the physical network. For a team, it is typically related to the build-up of the team between the physical switch and your NICs. Read up on your physical switch’s settings and make sure that it is using its fastest trunk discovery method. Doing so will usually involve a modification to spanning-tree protocol.

Q: Hello, Thank you for this great knowledge. What can stop SOME of the virtual machines from communicating with each other? This is on a cluster, where all validation tests have passed.

Unfortunately, there isn’t enough information about the problem to make anything more than vague generalizations.

  • VLANs not properly allowed on switches.
  • Virtual machines in the wrong VLANs.
  • General TCP/IP misconfigurations.
  • Overly restrictive physical switch security settings.
  • Misconfigured firewalls.

There are no network settings in Hyper-V specifically designed to prevent virtual machines from being able to establish network connectivity.

Q: What about iscsi with 1GB nics (broadcom especially) and disabling offloading?

With modern computing hardware, it is unlikely that any offloading technology on a 1GbE card will noticeably reduce the load on your physical CPUs. Some Broadcom cards include specialized hardware that allows them to operate as iSCSI HBAs; this technology is fine to use, but then the card can’t be used for anything else. All other offloading technologies are not likely to produce any positive benefits, although most are harmless.

While it is not an offloading technology and is technically outside the scope of this question, it is recommended to disable VMQ on all 1GbE cards. VMQ is implemented improperly on most gigabit cards and, even if it worked correctly, would not be likely to improve performance.
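
Disabling it is a per-adapter one-liner (the adapter name is a placeholder):

    Disable-NetAdapterVmq -Name 'Ethernet 1'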

Q: We have tried to do a P2V for 2 VMs and it happens successfully, but when we start the VM it is not able to boot.

The question does not include enough information about the problem to provide any specific guidance. There are many reasons that a virtual machine might not boot, and even many ways for a boot to fail. Some general guidance:

  • Physical to virtual conversions should be an option of last resort. Build a new virtual machine and migrate the application if it is at all possible.
  • Use a different P2V tool. For instance, if you were using Disk2VHD, try Microsoft Virtual Machine Converter 3.0.
  • Preinstall the current Hyper-V Integration Services in the physical environment before converting.
  • Do not try to convert a BIOS installation into a Generation 2 virtual machine and do not try to convert a UEFI installation into a Generation 1 virtual machine.
  • Try an operating system repair.

Q: Time Synchronization service on guests – ON or OFF? There are some reputable articles suggesting to leave it enabled and make sure the domain hierarchy synchronization is set inside guests, which, by the way, didn’t work for me.

Leaving time synchronization on along with domain hierarchy synchronization effectively means that the virtual machine will always use the Hyper-V Time Synchronization service unless it happens to be broken. This is the preferred configuration, but it requires you to make certain that the Hyper-V host is receiving time from a valid source. For that reason, all virtualized domain controllers must have their time synchronization service disabled. The domain controller that hosts the PDC Emulator FSMO role must be set to retrieve time from a known valid source. We have a more thorough article available on this subject: https://www.altaro.com/hyper-v/hyper-v-time-synchronization/.

Q: How do you resolve time sync in a virtualized environment when DC’s are virtual and takes time from Hyper-V host, but Hyper-V host is joined to a domain which takes time from the virtual DC?  Thank you.

Virtualized domain controllers should have the Hyper-V Time Synchronization service disabled and pull from the standard domain hierarchy. The domain controller that holds the PDC Emulator FSMO role, whether physical or virtual, should pull its time from a reliable source. More details are available at our article on the subject: https://www.altaro.com/hyper-v/hyper-v-time-synchronization/.
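
For reference, the integration service can be toggled per guest like this (a sketch, run on the host; the VM name is a placeholder):

    # Turn off host time sync for a virtualized domain controller
    Disable-VMIntegrationService -VMName 'DC01' -Name 'Time Synchronization'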

Q: Have you seen a cluster losing the boot on SAN? Our provider and Microsoft said at the time that shouldn’t happen… and it did.

It’s difficult to tell from this question whether it is about the management operating system running in a boot-from-SAN configuration or Hyper-V guests booting from a SAN LUN, which is important because the troubleshooting steps are different. I don’t know of any particular reason that makes a boot-from-SAN failure completely unexpected. The hardware-to-software hand-off of a boot operation has several expectations that might sometimes not be met by even a correctly-configured boot-from-SAN system.

If both Microsoft and your provider have looked at your particular configuration and cannot find a problem, we probably can’t improve on what they said. We can make some general recommendations.

If the problem is with virtual machines starting from the SAN:

  • Make sure that fiber channel zoning masks are set as tightly as possible. Loose masks can cause timeouts that break Live Migration and boot.
  • Make sure that iSCSI networks are properly isolated. An iSCSI system that runs fine once started might still fail to boot, because a running Windows operating system will tolerate read/write delays that would cause the boot process to fail.
  • Ensure that any security settings, especially with iSCSI, are not too restrictive.
  • Check all of the system, storage, and clustering events on the nodes to verify that there are no disconnect issues that might indicate fabric problems.

Most of the same problems for guest boot-from-SAN would apply to host-boot-from-SAN, although you’ll also need to verify that your stub boot system is configured correctly and that nothing in the back-end configuration has changed since that system was built.

Wrap-Up

That wraps up all of our questions. Hopefully that answers all of the remaining questions from our troubleshooting webinar. If you have follow-up questions on a response, or you have a question that you don’t see answered above, be sure to reach out using the comments section below!

Ubuntu Linux Server on Hyper-V: Guest Will Not Shut Down or Restart

Linux doesn’t always run smoothly on Hyper-V, although it’s undeniable that the situation has gotten substantially better as Microsoft has put more effort into it.

Sometimes, the way that Linux manifests a problem befuddles administrators that are used to the Windows platform. When issues crop up that we’re unfamiliar with, there is a strong temptation to wonder if it’s because there’s something wrong with Linux or if it’s a Hyper-V issue. From experience, I know that most people tend to blame Hyper-V first.

One such scenario: a Linux-based virtual machine is instructed to shut down or restart, but never does. The assumption seems to be that something has gone wrong in the guest-to-host communications process. One “fix” that I saw was to disable Dynamic Memory. Sorry, but that isn’t a fix; at best, it’s a work-around.

This article was written using Ubuntu Linux, but should be applicable to any Debian-based distribution. Other distributions might have similar symptoms and fixes, although you’ll have to look elsewhere for a translation. Depending on your confidence in your Linux system and skills, you might want to take a backup before trying the steps shown here.

Unfortunately, there isn’t a single reason that a Linux system might refuse to shut down. It’s very verbose about it, though, so you might be able to just watch the process and find out that it’s hanging up in a particular location and search on that for a fix. The case that I’m going to present here is one that can happen to anyone, is fairly simple to reproduce, and (at least for me), has a fix that is not at all obvious.

The symptom is quite simple: the guest goes through all the motions of shutting down and reaches the point where it should stop or restart, and then simply hangs:

Linux Hanging on Shut Down

The final line states “Reached target Shutdown.” When all is well, this is the very last thing that’s displayed before the guest shuts down or restarts, as instructed.

I stumbled on the fix for this by accident. If you’re having the same problem that I did, there’s a very simple way to find out. Run the following at any prompt:
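
df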

In the output, check the Mounted on column for the /boot mount point. If it is all, or mostly, consumed, this is likely to be your problem. Compare to the following:

Boot Mount Point Full

What’s likely happened is that you’ve updated the system kernel a few times. Unlike other packages, old kernels are not automatically cleaned up by the normal processes.

How to Clean Up Old Kernels on Ubuntu Linux

  1. The very first thing to do is find out which kernel you’re currently running. The reason is simple: don’t delete that one. To determine your current kernel version, type the following:
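
    uname -r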

    On my system, that produced:

    Kernel Version Check

    From that, I know that I do not want to delete anything about kernel version 3.19.0-33. Why would Linux allow me to delete that version? I don’t know. Maybe it won’t; I didn’t try. Why doesn’t Windows stop me from running “deltree %systemroot%”? The point is, take a few seconds to protect yourself from yourself.

  2. The next thing you need is a list of old kernels that are hanging around. That’s discovered with:
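
    dpkg --list | grep linux-image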

    Mine produced the following output:

    Linux Kernel List

    I don’t know what the story is behind the items marked as deinstall. They take up zero space. I deleted them anyway.

  3. To remove an item from the list, just enter the following:
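
    sudo apt-get remove --purge linux-image-3.19.0-25-generic

    The kernel version shown here is only an example; substitute one of the old versions from your own list (never the one from step 1).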

    You do not need to specifically remove the items marked with the word “extra”. They will be automatically removed along with the kernel of the same name.

  4. You’ll then be taken through the normal apt-get process of confirming the removal and watching the results as they happen.
  5. Repeat the above steps for any remaining kernels… except the one from step 1. In case you’re wondering, I worked from oldest to newest but I don’t believe that it matters.

Note: It is possible to issue multiple items after --purge, but I recommend that you not do so in this case. I actually caused my test system to hang when I tried that. It recovered just fine, but I don’t know that you’d want to take that kind of gamble on a production system.

If you’re running through PuTTY, there is a way to make this process faster. First, type out the remove line:
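
sudo apt-get remove --purge 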

Make sure you put in a space after purge. Then, scroll back up to the list of kernels. Highlight the oldest:

Kernel Text Selection

Then just right-click. PuTTY will automatically append the highlighted text wherever the cursor is, automatically producing this line (or whatever you highlighted on your system):
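
sudo apt-get remove --purge linux-image-3.19.0-25-generic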

That’s it! Just press [Enter] and finish up as normal from step 4.

Confirmation

Once all the old kernels are removed, run df again. You should have much different results:

Cleaned Linux Boot Partition

If your Linux system is in a place where a bit of downtime won’t hurt anyone, verify your results the easiest way:
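
sudo reboot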

If all went well, your Linux guest should restart as expected. Unless you deleted your active kernel, the worst-case scenario is that you have a much cleaner boot partition.

Notes

  • I’m not promising that this will allow your Linux guest to properly shut down or restart every time. Linux under Hyper-V is improving; it’s not bullet-proof. However, the condition described in this article consistently causes this issue, and applying this fix will dramatically improve the odds of avoiding it.
  • While not nearly as urgent, you can remove the header files for those old kernels as well. Start at step 2 with “linux-headers” as your search criteria.
  • If you followed my earlier article or the Microsoft recommended article, you’ve also got a number of “linux-virtual” items. Leave those alone.

 
