Add glue status-based alarms for FAILURE and TIMEOUT #653

@landon912

Description

Feature scope

glue

Describe your suggested feature

The current alarms are based on the numFailedTasks and numKilledTasks metrics, which are not always emitted during a job run, even when the job fails.

An example is a Glue job whose script is gibberish:

job.py:

i'm confused on how to use python

This job does not emit any of the above metrics because the failure occurs very early in the job, before the full glue_context starts up.

Glue emits events such as Glue Job State Change, which can be used to pick up the high-level status of a job, such as FAILED and TIMEOUT.

Alarms should generally be added for these status changes.
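To make the matching concrete, here is a rough, dependency-free sketch of what such an event looks like and which events should raise an alarm. The interface is an abridged assumption about the event shape (real events carry more fields), and `shouldAlarm`, `ALARM_STATES`, and the sample values are hypothetical names used only for illustration.

```typescript
// Abridged, assumed shape of a "Glue Job State Change" event.
// Only the fields needed for matching are modelled here.
interface GlueJobStateChangeEvent {
  source: string;
  'detail-type': string;
  detail: {
    jobName: string;
    state: string;
    message?: string;
  };
}

// States that should raise an alarm, per this request.
const ALARM_STATES: string[] = ['FAILED', 'TIMEOUT'];

// Hypothetical helper: mirrors what an event rule pattern on source,
// detail-type, detail.jobName and detail.state would match.
function shouldAlarm(event: GlueJobStateChangeEvent, jobName: string): boolean {
  return (
    event.source === 'aws.glue' &&
    event['detail-type'] === 'Glue Job State Change' &&
    event.detail.jobName === jobName &&
    ALARM_STATES.indexOf(event.detail.state) !== -1
  );
}

// Illustrative sample event; the values are made up.
const sampleEvent: GlueJobStateChangeEvent = {
  source: 'aws.glue',
  'detail-type': 'Glue Job State Change',
  detail: { jobName: 'my-job', state: 'FAILED', message: 'job script failed to parse' },
};

console.log(shouldAlarm(sampleEvent, 'my-job'));
```

The point of the sketch is that the match depends only on the job name and the terminal state, so it works even when the job dies before any task-level metrics are published.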

Please add functionality similar to the following:

```typescript
import { Duration } from 'aws-cdk-lib';
import { Alarm, ComparisonOperator, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { Rule } from 'aws-cdk-lib/aws-events';
import { Construct } from 'constructs';

export interface SimpleGlueAlarmProps {
  readonly glueJobName: string;
}

export class SimpleGlueAlarms extends Construct {
  constructor(scope: Construct, id: string, props: SimpleGlueAlarmProps) {
    super(scope, id);

    // One rule and one alarm per terminal state of interest.
    this.addStateChangeAlarm(props.glueJobName, 'FAILED');
    this.addStateChangeAlarm(props.glueJobName, 'TIMEOUT');
  }

  private addStateChangeAlarm(glueJobName: string, state: string): Alarm {
    const ruleName = `${glueJobName}-${state}-glueRule`;
    const alarmName = `${glueJobName}-${state}-glueAlarm`;

    // Match state-change events for this job and this state only.
    const stateChangeRule = new Rule(this, ruleName, {
      description: `Event rule for catching ${glueJobName} ${state}`,
      ruleName: ruleName,
      eventPattern: {
        source: ['aws.glue'],
        detailType: ['Glue Job State Change'],
        detail: {
          jobName: [glueJobName],
          state: [state],
        },
      },
      enabled: true,
    });

    // Count how often the rule above fires.
    const stateChangeMetric = new Metric({
      namespace: 'AWS/Events',
      metricName: 'TriggeredRules',
      dimensionsMap: {
        RuleName: stateChangeRule.ruleName,
      },
      statistic: 'Sum',
      period: Duration.minutes(5), // Metric expects a Duration, not a number
    });

    // Alarm as soon as the rule fires at least once within a period.
    return new Alarm(this, alarmName, {
      alarmName: alarmName,
      alarmDescription: `${state} alarm for ${glueJobName}`,
      metric: stateChangeMetric,
      threshold: 1,
      evaluationPeriods: 1,
      comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      treatMissingData: TreatMissingData.NOT_BREACHING,
    });
  }
}
```
