Autoscaling Fargate cluster based on custom metric


I'm trying to set up an autoscaling Fargate cluster for GitHub self-hosted runners. The high-level design looks like this:

  1. A GitHub app will send a webhook event to a Lambda behind an API gateway.
  2. The Lambda will put a custom COUNT metric with a value of 1 if the request is for a new workflow, and -1 for a completed or cancelled workflow. The metric will include the repo owner (REPO_OWNER), the repo name (REPO_NAME), the event type (EVENT_TYPE, which I know will always be workflow_job), and the workflow run ID (ID) as dimensions (see the sketch after this list).
  3. Two Application Auto Scaling policies (up and down) will change the ecs:service:DesiredCount dimension based on the value of the custom metric.
  4. Two CloudWatch metric alarms (up and down) will trigger the corresponding policy whenever the scaling thresholds are breached.
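
For step 2, the metric publish from the Lambda looks roughly like this (a sketch assuming the AWS SDK v3 CloudWatch client; the namespace, metric name, and function wiring are placeholders, not my exact code):

import { CloudWatchClient, PutMetricDataCommand, StandardUnit } from '@aws-sdk/client-cloudwatch'

const cloudwatch = new CloudWatchClient({})

// delta is +1 when a workflow job is queued and -1 when it completes or is cancelled
async function publishRunnerMetric(delta: 1 | -1, repoOwner: string, repoName: string, runId: string): Promise<void> {
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: 'GitHubRunners',            // placeholder for the real namespace
    MetricData: [{
      MetricName: 'PendingWorkflowJobs',   // placeholder for the real metric name
      Unit: StandardUnit.Count,
      Value: delta,
      Dimensions: [
        { Name: 'EVENT_TYPE', Value: 'workflow_job' },
        { Name: 'REPO_OWNER', Value: repoOwner },
        { Name: 'REPO_NAME', Value: repoName },
        { Name: 'ID', Value: runId },
      ],
    }],
  }))
}

The scaling target, policies, and alarms (steps 3 and 4) are set up with CDKTF (Terraform CDK for TypeScript) like this:
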
const autoscalingTarget = new AppautoscalingTarget(this, `appautoscaling-target-${environment}`, {
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  minCapacity: 0,
  maxCapacity: options.maxClusterSize,
})

const scaleUpPolicy = new AppautoscalingPolicy(this, `autoscale-up-policy-${environment}`, {
  dependsOn: [autoscalingTarget],
  name: `autoscale-up-policy-${environment}`,
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  stepScalingPolicyConfiguration: {
    adjustmentType: 'ChangeInCapacity',
    cooldown: 30,
    metricAggregationType: 'Maximum',
    stepAdjustment: [{
      metricIntervalLowerBound: '1',
      scalingAdjustment: 1,
    }]
  },
})

const scaleDownPolicy = new AppautoscalingPolicy(this, `autoscale-down-policy-${environment}`, {
  dependsOn: [autoscalingTarget],
  name: `autoscale-down-policy-${environment}`,
  serviceNamespace: 'ecs',
  resourceId: `service/${ecsCluster.awsEcsClusterClusterNameOutput}/${ecsService.awsEcsServiceServiceNameOutput}`,
  scalableDimension: 'ecs:service:DesiredCount',
  stepScalingPolicyConfiguration: {
    adjustmentType: 'ChangeInCapacity',
    cooldown: 30,
    metricAggregationType: 'Maximum',
    stepAdjustment: [{
      metricIntervalUpperBound: '0',
      scalingAdjustment: -1,
    }]
  }
})

const alarmPeriod = 120 as const

new CloudwatchMetricAlarm(this, `autoscale-up-alarm-${environment}`, {
  alarmName: `fargate-cluster-scale-up-alarm-${environment}`,
  alarmDescription: `Scales up the Fargate cluster based on the ${options.customCloudWatchMetricNamespace}.${options.customCloudWatchMetricName} metric`,
  comparisonOperator: 'GreaterThanThreshold',
  threshold: 0,
  evaluationPeriods: 1,
  metricQuery: [{
        id: 'm1',
        metric: {
          metricName: options.customCloudWatchMetricName,
          namespace: options.customCloudWatchMetricNamespace,
          period: alarmPeriod,
          stat: 'Sum',
          unit: 'Count',
          dimensions:
          {
            // Note: this is the only dimension I can know in advance
            EVENT_TYPE: 'workflow_job',
          },
        },
      }, {
        id: 'm2',
        metric: {
          metricName: options.customCloudWatchMetricName,
          namespace: options.customCloudWatchMetricNamespace,
          period: alarmPeriod,
          stat: 'Sum',
          unit: 'Count',
          dimensions:
          {
            // Note: this is the only dimension I can know in advance
            EVENT_TYPE: 'workflow_job',
          },
        },
      }, {
        id: 'e1',
        expression: 'SUM(METRICS())',
        label: 'Sum of Actions Runner Requests',
        returnData: true,
  }],
  alarmActions: [
    scaleUpPolicy.arn,
  ],
  actionsEnabled: true,
})

new CloudwatchMetricAlarm(this, `autoscale-down-alarm-${environment}`, {
  alarmName: `fargate-cluster-scale-down-alarm-${environment}`,
  alarmDescription: `Scales down the Fargate cluster based on the ${options.customCloudWatchMetricNamespace}.${options.customCloudWatchMetricName} metric`,
  comparisonOperator: 'LessThanThreshold',
  threshold: 1,
  evaluationPeriods: 1,
  metricQuery: [{
        id: 'm1',
        metric: {
          metricName: options.customCloudWatchMetricName,
          namespace: options.customCloudWatchMetricNamespace,
          period: alarmPeriod,
          stat: 'Sum',
          unit: 'Count',
          dimensions: {
            // Note: this is the only dimension I can know in advance
            EVENT_TYPE: 'workflow_job',
          }
        },
      }, {
        id: 'm2',
        metric: {
          metricName: options.customCloudWatchMetricName,
          namespace: options.customCloudWatchMetricNamespace,
          period: alarmPeriod,
          stat: 'Sum',
          unit: 'Count',
          dimensions: {
            // Note: this is the only dimension I can know in advance
            EVENT_TYPE: 'workflow_job',
          }
        },
      }, {
        id: 'e1',
        expression: 'SUM(METRICS())',
        label: 'Sum of Actions Runner Requests',
        returnData: true,
  }],
  alarmActions: [
    scaleDownPolicy.arn,
  ],
  actionsEnabled: true,
})

I do not see the metric reporting any data, nor the alarms changing state, until I add all 4 dimensions to the alarms' metric queries. Adding only 1 dimension (EVENT_TYPE, the only one that is static) gives me no data, but adding all 4 does.

How do I model my metrics so I can continue adding more dynamic metadata as dimensions but still set up working alarms based on well-known static dimensions?

Accepted answer

I was able to solve this by removing all dimensions from the custom CloudWatch metric. CloudWatch treats every unique combination of dimensions as a separate metric and does not aggregate custom metrics across dimensions, so an alarm that specifies only EVENT_TYPE never matches data points that were published with all four dimensions; once the metric is published (and queried) without dimensions, the alarms see the data.
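
For illustration, this is roughly what the alarm's metric query looks like once the dimensions block is dropped (a sketch based on the configuration in the question, not my exact code):

metricQuery: [{
  id: 'm1',
  metric: {
    metricName: options.customCloudWatchMetricName,
    namespace: options.customCloudWatchMetricNamespace,
    period: alarmPeriod,
    stat: 'Sum',
    unit: 'Count',
    // no dimensions: the query now matches the metric exactly as the Lambda publishes it
  },
  returnData: true,
}],

With a single dimensionless metric, the m1/m2 pair and the SUM(METRICS()) expression are no longer needed; one query with returnData: true is enough.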