Why is my Managed/Teams controller deployed in AKS reporting License/Certificate/PVC issues?

Issue

In a CloudBees CI environment deployed on Azure Kubernetes Service (AKS), one of your controllers continuously reports issues and never finishes its provisioning operation.

The pod logs contain entries like the following:

[DATE][Warning][Pod][teams-name-0][FailedAttachVolume] AttachVolume.Attach failed for volume "pvc-XXX" : failed to get azure instance id for node "aks-XXX" (azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://localhost:7788/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/virtualMachines/aks-XXX?%!!(MISSING)e(MISSING)xpand=instanceView&api-version=2018-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS7000222: The provided client secret keys are expired. Visit the Azure Portal to create new keys for your app, or consider using certificate credentials for added security: https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials\r\nTrace ID: XXX0\r\nCorrelation ID: XXXX\r\nTimestamp: XXX","error_codes":[7000222],"timestamp":"XXX","trace_id":"XXXXX","correlation_id":"XXXX","error_uri":"https://login.microsoftonline.com/error?code=7000222"})

Resolution

The Microsoft error code explains what is happening. The URL included in the error message, https://login.microsoftonline.com/error?code=7000222, confirms that the application (here, the identity used to attach the controller pod's volume) is trying to sign in without valid authentication parameters: as the error description states, the provided client secret keys are expired.
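
As a minimal sketch of the lookup described above, the following Python snippet extracts the first AADSTS code from an event message and builds the corresponding error page URL. The function name `aad_error_lookup_url` and the shortened sample message are illustrative assumptions, not part of any CloudBees or Microsoft tooling.

```python
import re
from typing import Optional

# Matches Azure AD error identifiers such as "AADSTS7000222".
AADSTS_PATTERN = re.compile(r"AADSTS(\d+)")


def aad_error_lookup_url(event_message: str) -> Optional[str]:
    """Return the Microsoft error lookup URL for the first AADSTS code found, or None."""
    match = AADSTS_PATTERN.search(event_message)
    if match is None:
        return None
    return f"https://login.microsoftonline.com/error?code={match.group(1)}"


# Hypothetical, shortened excerpt of the event shown in the Issue section.
sample = (
    'AttachVolume.Attach failed for volume "pvc-XXX" : '
    '"error_description":"AADSTS7000222: The provided client secret keys are expired."'
)
print(aad_error_lookup_url(sample))
```

Running it against the excerpt prints the same URL that appears in the `error_uri` field of the original message.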

The additional link on that error page leads to Microsoft's documentation on how the Microsoft identity platform lets applications authenticate with their own credentials.

If you use a service principal to define or manage application access to your AKS resources, the error shown above means that the service principal's credentials are no longer valid, and you need to renew them or recreate the service principal.

One way to validate this is to open your AKS Resource Health dashboard, check for any reported service principal problems, and then follow the steps in the AKS service principal documentation to fix them.
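
As a complementary check, the expiry dates of the service principal's credentials can be inspected directly. The sketch below assumes you have exported the credential list as JSON, for example with `az ad sp credential list --id <client-id>` (an assumption, this command is not mentioned in the article). The field names `endDateTime`, `endDate`, and `keyId` are also assumptions about that output and may differ between Azure CLI versions.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def expired_credentials(credentials_json: str, now: Optional[datetime] = None) -> List[str]:
    """Return the keyId of every credential whose end date is not in the future."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for cred in json.loads(credentials_json):
        # Assumed field names: "endDateTime" (newer CLI output) or "endDate" (older output).
        raw_end = cred.get("endDateTime") or cred.get("endDate")
        if raw_end is None:
            continue
        # Older Python versions do not parse a trailing "Z", so normalise it first.
        end = datetime.fromisoformat(raw_end.replace("Z", "+00:00"))
        if end.tzinfo is None:
            end = end.replace(tzinfo=timezone.utc)
        if end <= now:
            expired.append(cred.get("keyId", "<unknown>"))
    return expired


# Hypothetical sample data standing in for the exported credential list.
sample_credentials = json.dumps([
    {"keyId": "old-key", "endDateTime": "2021-06-01T00:00:00Z"},
    {"keyId": "new-key", "endDateTime": "2999-01-01T00:00:00Z"},
])
reference_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expired_credentials(sample_credentials, now=reference_time))
```

If every credential is listed as expired, that is consistent with the AADSTS7000222 error, and renewing the credentials as described in the AKS service principal documentation is the next step.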