Knative intermittently failing to create deployments

I've been running into an issue where, every once in a while, Knative becomes unable to create new Deployments, then spontaneously recovers within a few hours and creates them. Until then, the errors below keep playing out within the Serving components. It feels like the requests to the kubernetes service (the API server) are timing out, but I cannot tell why.

Expected Behavior

On making updates to a Service, I expect the deployment of the new revision to succeed.

Actual Behavior

Occasionally, while making valid changes (e.g. changing the value of an annotation), Knative becomes unable to deploy the new revision and gets stuck constantly trying to reconcile it for hours before spontaneously recovering.

$ kn revision list -A
NAMESPACE   NAME            SERVICE   TRAFFIC   TAGS      GENERATION   AGE         CONDITIONS   READY     REASON
knative     service-00033   service                       33           <invalid>   0 OK / 3     Unknown   Deploying
knative     service-00032   service   100%      primary   32           <invalid>   4 OK / 4     True

In the controller logs I see the following "context deadline exceeded" error on a POST to the Kubernetes service IP:

{
  "insertId": "plhs429mzmf9nh5f",
  "jsonPayload": {
    "logger": "controller.event-broadcaster",
    "caller": "record/event.go:285",
    "knative.dev/pod": "controller-8c6b99cb7-7zg6n",
    "commit": "484e848",
    "message": "Event(v1.ObjectReference{Kind:\"Revision\", Namespace:\"knative\", Name:\"service-00033\", UID:\"8a09a3ff-655e-4e5f-b8d4-1a4886ab0678\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"1844291799\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to create deployment \"service-api-00033-deployment\": Post \"https://10.123.20.1:443/apis/apps/v1/namespaces/knative/deployments\": context deadline exceeded",
    "timestamp": "2023-06-30T09:57:08.7332053Z"
  }
}

and right before it, the following in the webhook logs:

{
  "insertId": "k078pd2dmx16qrr7",
  "jsonPayload": {
    "knative.dev/pod": "webhook-d44b476b8-89gbx",
    "message": "Failed the resource specific validation",
    "knative.dev/operation": "UPDATE",
    "logger": "webhook",
    "knative.dev/name": "service",
    "knative.dev/subresource": "",
    "knative.dev/namespace": "knative",
    "knative.dev/kind": "serving.knative.dev/v1, Kind=Service",
    "knative.dev/resource": "serving.knative.dev/v1, Resource=services",
    "commit": "484e848",
    "knative.dev/userinfo": "system:serviceaccount:service:default",
    "timestamp": "2023-06-30T09:56:38.327880939Z",
    "caller": "validation/validation_admit.go:183",
    "stacktrace": "knative.dev/pkg/webhook/resourcesemantics/validation.validate\n\tknative.dev/[email protected]/webhook/resourcesemantics/validation/validation_admit.go:183\nknative.dev/pkg/webhook/resourcesemantics/validation.(*reconciler).Admit\n\tknative.dev/[email protected]/webhook/resourcesemantics/validation/validation_admit.go:79\nknative.dev/pkg/webhook.admissionHandler.func1\n\tknative.dev/[email protected]/webhook/admission.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nknative.dev/pkg/webhook.(*Webhook).ServeHTTP\n\tknative.dev/[email protected]/webhook/webhook.go:263\nknative.dev/pkg/network/handlers.(*Drainer).ServeHTTP\n\tknative.dev/[email protected]/network/handlers/drain.go:113\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"
  }
}

I'm at a complete loss at this point.

Steps to Reproduce the Problem

Unknown

There is 1 answer below.

E. Anderson

I haven't looked at your Service YAML, but my hypothesis is that this might be related to slow tag-to-digest resolution. You can try the following:

  1. Monitor latency for registry operations, particularly GET operations (first sketch below).

  2. Use image digests when referencing images. Digests look like @sha256:... rather than :latest, and they ensure that the image does not change after deployment (second sketch below).

  3. Disable tag-to-digest resolution (third sketch below). Note that this can lead to unpredictable behavior if a referenced tag is moved: some instances may pick up the new image, while others may keep using an earlier one.
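
For the first point, one rough way to measure registry latency is to time the manifest GET that tag-to-digest resolution performs. This is only a sketch: myorg/myimage is a placeholder Docker Hub repository, so substitute your own registry and image.

# Fetch an anonymous pull token for the placeholder repository (Docker Hub specific).
$ TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:myorg/myimage:pull" | jq -r .token)

# Time the manifest GET for a tag; time_total is roughly what the digest resolver has to wait for.
$ curl -s -o /dev/null \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
    -w "HTTP %{http_code} in %{time_total}s\n" \
    "https://registry-1.docker.io/v2/myorg/myimage/manifests/latest"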
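
For the second point, a minimal sketch using kn and a placeholder image reference; crane is just one tool for looking up the digest (skopeo or docker manifest inspect work as well).

# Look up the digest the tag currently points to (placeholder image reference).
$ crane digest gcr.io/my-project/my-image:latest
sha256:...

# Pin the Knative Service to that digest so the image cannot change under the revision.
$ kn service update service \
    --image gcr.io/my-project/my-image@sha256:<digest-from-above>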
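
For the third point, tag-to-digest resolution is controlled by the config-deployment ConfigMap: registries listed under registries-skipping-tag-resolving are not resolved. A sketch, assuming Serving is installed in the knative-serving namespace and that the two registries listed are only examples:

$ kubectl -n knative-serving patch configmap config-deployment \
    --type merge \
    -p '{"data":{"registries-skipping-tag-resolving":"index.docker.io,ghcr.io"}}'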

If this is tag-to-digest resolution and you're using public Docker Hub images, adding pull credentials to the service account that runs the Knative Service might give you higher rate limits (sketch below).
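
A sketch of that last suggestion, assuming the Service runs as the default service account in the knative namespace and that <user> and <access-token> are your Docker Hub credentials; adjust the namespace and account names to match your setup.

# Create a registry credential in the namespace of the Knative Service.
$ kubectl -n knative create secret docker-registry dockerhub-pull \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username=<user> \
    --docker-password=<access-token>

# Attach it to the service account; Knative's tag resolver also uses the account's imagePullSecrets.
$ kubectl -n knative patch serviceaccount default \
    -p '{"imagePullSecrets":[{"name":"dockerhub-pull"}]}'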