How to integrate Argo CD with DataDog to query the deployed resources status for auto promotion (B/G)?

I'm trying to integrate Argo with DataDog so that a metric query decides whether a blue/green (B/G) deployment is promoted automatically. The problem is that Argo fails to evaluate the DataDog query passed via the AnalysisTemplate.

Kubernetes version: v1.20 (EKS), Argo CD version: v2.2.2, Argo Rollouts version: v1.1.1

The Analysis template I'm using:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: gateway-uat-pat
spec:
  args:
  - name: service-name
  metrics:
  - name: gateway-uat-pat
    interval: 5m
    successCondition: default(result, 0) <= 10
    failureLimit: 3
    provider:
      datadog:
        interval: 5m
        query: |
          sum:trace.http.request.errors{service:{{args.service-name}}}

The secret object I'm creating:

apiVersion: v1
kind: Secret
metadata:
  name: datadog
type: Opaque
stringData:
  address: https://api.datadoghq.com
  api-key: '***'
  app-key: '***' 

Both the AnalysisTemplate and the Secret were created outside of Argo. I then tried deploying the application with Argo Rollouts, with the following strategy in my Rollout spec:

  strategy:
    blueGreen:
      activeService: gateway
      previewService: gateway-preview
      postPromotionAnalysis:
        templates:
        - templateName: gateway-uat-pat
        args:
        - name: service-name
          value: gateway-qa

The error I keep getting:

InvalidSpec: The Rollout "gateway-rollouts" is invalid: spec.strategy.blueGreen.postPromotionAnalysis.templates: Invalid value: "gateway-uat-pat": AnalysisTemplate gateway-uat-pat has metric gateway-uat-pat which runs indefinitely. Invalid value for count:

I dug into the Argo analysis docs but couldn't find any information on how to evaluate DataDog queries successfully. Have I misconfigured the args in the AnalysisTemplate, or is something else wrong? Thanks.

There is 1 best solution below

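I found the solution, @naveen: the "count" attribute should be added to the metric in the AnalysisTemplate. Without it, the analysis loops forever and times out, which is what the "runs indefinitely" part of the InvalidSpec error is reporting.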

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: loq-error-rate
spec:
  metrics:
  - name: error-rate
    interval: 30s
    count: 2
    successCondition: result < 1
    failureLimit: 3
    provider:
      datadog:
        interval: 5m
        query: |
          sum:system.cpu.user
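
Applied to the template from the question, the same fix might look like the sketch below. The count value of 3 is an assumption chosen for illustration, not something stated in the question. As far as I can tell, Argo Rollouts also rejects a count lower than failureLimit, so the sketch keeps count at or above it.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: gateway-uat-pat
spec:
  args:
  - name: service-name
  metrics:
  - name: gateway-uat-pat
    interval: 5m
    # Assumed value: a finite count makes the metric stop after 3 measurements
    # instead of running indefinitely.
    count: 3
    successCondition: default(result, 0) <= 10
    failureLimit: 3
    provider:
      datadog:
        interval: 5m
        query: |
          sum:trace.http.request.errors{service:{{args.service-name}}}
```

With interval: 5m and count: 3, the measurements are spread over roughly ten minutes after promotion, so both values are worth tuning to how long the new version should be observed.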