VMware Cloud Foundation: Domain Failed State
Environment Upgrade Blocked
The Issue?
Upgrading VMware Cloud Foundation (VCF) 4.5.1 to VCF 5 can leave a host in a failed or error state during the host upgrade phase, which blocks the upgrade from completing. If you think you may be facing this issue, refer to the image below. If you see this message, continue reading.
The video below is a demonstration of the upgrade process.
Notice that host1.region2.shank.com has a configuration status of Error. The error message you saw earlier is a result of this host being in an errored state. Keep in mind that you must complete and pass the environment precheck before starting the upgrade, so you may be asking yourself how the upgrade proceeded at all.
The environment was in a healthy state. However, once the upgrade entered host remediation, a race condition occurred: the host upgrade did complete, but only after the SDDC Manager workflow task had timed out. As far as SDDC Manager was concerned, the host was now in an errored state.
Notice in the image below that the host is reporting the correct version (8.0.1).
You can verify the version has been upgraded by checking vCenter.
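As an alternative to checking vCenter, the version can also be read on the host itself with `vmware -v`, assuming SSH is enabled on the ESXi host. The helper below is a minimal sketch: the function name `is_expected_version` is hypothetical, and it assumes the output line has the form "VMware ESXi <version> build-<number>".

```shell
# Hypothetical helper (not part of any VMware tooling): succeeds when a
# `vmware -v` output line reports the expected ESXi version.
# Assumption: the line looks like "VMware ESXi <version> build-<number>".
is_expected_version() {
  version_line="$1"   # output of `vmware -v`
  expected="$2"       # version you expect, e.g. 8.0.1
  case "$version_line" in
    "VMware ESXi $expected "*) return 0 ;;   # exact version followed by a space
    *) return 1 ;;
  esac
}

# Usage on the ESXi host:
#   is_expected_version "$(vmware -v)" "8.0.1" && echo "host is upgraded"
```

The trailing space in the pattern keeps 8.0.1 from matching a longer version string such as 8.0.10.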
The Fix?
If you are certain that the host has not errored in any way and you have hit this race condition, follow the steps below.
- SSH onto SDDC Manager as the vcf account.
- Issue the command in the code block below as the vcf user, replacing <hostID> with the ID of your host. To retrieve the host ID, select the host in the host tab; the ID will be visible in the URL.
curl localhost/inventory/entities/<hostID> -X PATCH -d '{"type" : "ESXI","status":"ACTIVE"}' -H 'Content-Type:application/json'
This curl command returns no response, so don't be concerned if you don't see one. The error state in the UI should clear within approximately 10 minutes, possibly sooner. You will not need to restart any services.
Once your host is in an active state, you will be able to proceed with the domain upgrade.
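If you would rather not hand-edit the command, the steps above can be wrapped in a small script. This is only a sketch: the function names and the `DRY_RUN` variable are hypothetical, and it assumes the host ID shown in the URL is a UUID. If your ID looks different, adjust or remove the check. The PATCH request itself is the same one shown above.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not VMware tooling.
# Assumption: the host ID is a UUID (8-4-4-4-12 hex digits).

UUID_RE='[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}'

# Print the first UUID found in a string, e.g. the URL copied from the browser.
extract_host_id() {
  printf '%s\n' "$1" | grep -Eo "$UUID_RE" | head -n 1
}

# Succeed only if the whole argument is a UUID.
is_uuid() {
  printf '%s\n' "$1" | grep -Eq "^${UUID_RE}\$"
}

# Send the PATCH from the article, or just print it when DRY_RUN=1.
mark_host_active() {
  host_id="$1"
  if ! is_uuid "$host_id"; then
    echo "Refusing to run: '$host_id' does not look like a host ID" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl localhost/inventory/entities/$host_id -X PATCH"
    return 0
  fi
  curl "localhost/inventory/entities/$host_id" -X PATCH \
    -d '{"type" : "ESXI","status":"ACTIVE"}' \
    -H 'Content-Type:application/json'
}

# Usage on SDDC Manager, as the vcf user:
#   host_id=$(extract_host_id "<URL copied from the host page>")
#   DRY_RUN=1 mark_host_active "$host_id"   # check the target first
#   mark_host_active "$host_id"             # then send the request
```

Validating the ID before sending the request is a design choice to avoid patching the wrong inventory entity because of a copy-and-paste slip.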
Summary
When upgrading a VMware Cloud Foundation domain, the domain that is being upgraded may move into an errored state. This can happen for multiple reasons; in this case, it was due to a host moving into an errored state. As long as the host has not actually failed, a simple curl command can clear the error and allow you to complete the upgrade.