Unable to create AWS Data Pipeline through serverless YAML template


I was creating a Data Pipeline for a DynamoDB export to S3. The template given for serverless YAML does not work with the "PAY_PER_REQUEST" billing mode.
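For context, the table (referenced as "UrlReport" in the template below) uses on-demand billing. Its definition is not part of the snippet, but simplified it looks roughly like this (the key schema here is just a placeholder, not the real one):

    UrlReport:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ***tablename***
        BillingMode: PAY_PER_REQUEST       # on-demand, no provisioned throughput
        AttributeDefinitions:
          - AttributeName: "id"            # placeholder key attribute
            AttributeType: "S"
        KeySchema:
          - AttributeName: "id"
            KeyType: "HASH"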

I created one using the AWS console and it worked fine, exported its definition, and tried to create it with the same definition in serverless, but it gives me the following error:

ServerlessError: An error occurred: UrlReportDataPipeline - Pipeline Definition failed to validate because of following Errors: [{ObjectId = 'TableBackupActivity', errors = [Object references invalid id: 's3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}']}] and Warnings: [].
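The error points at the step field of TableBackupActivity, which in my template (pasted in full below) is declared like this:

              - Key: "step"
                RefValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"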

Can anyone help me with this? The pipeline created using the console works perfectly with the same value for the step in the table backup activity.

The pipeline template is pasted below:

UrlReportDataPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties: 
        Name: ***pipeline name***
        Activate: true
        ParameterObjects: 
          - Id: "myDDBReadThroughputRatio"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB read throughput ratio"
              - Key: "type"
                StringValue: "Double"
              - Key: "default"
                StringValue: "0.9"
          - Id: "myOutputS3Loc"
            Attributes: 
              - Key: "description"
                StringValue: "S3 output bucket"
              - Key: "type"
                StringValue: "AWS::S3::ObjectKey"
              - Key: "default"
                StringValue: 
                  !Join [ "", [ "s3://", Ref: "UrlReportBucket" ] ]
          - Id: "myDDBTableName"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB Table Name"
              - Key: "type"
                StringValue: "String"
          - Id: "myDDBRegion"
            Attributes:
              - Key: "description"
                StringValue: "DynamoDB region"
        ParameterValues: 
          - Id: "myDDBTableName"
            StringValue: 
              Ref: "UrlReport"
          - Id: "myDDBRegion"
            StringValue: "eu-west-1"
        PipelineObjects: 
          - Id: "S3BackupLocation"
            Name: "Copy data to this S3 location"
            Fields: 
              - Key: "type"
                StringValue: "S3DataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "directoryPath"
                StringValue: "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
          - Id: "DDBSourceTable"
            Name: "DDBSourceTable"
            Fields: 
              - Key: "tableName"
                StringValue: "#{myDDBTableName}"
              - Key: "type"
                StringValue: "DynamoDBDataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "readThroughputPercent"
                StringValue: "#{myDDBReadThroughputRatio}"
          - Id: "DDBExportFormat"
            Name: "DDBExportFormat"
            Fields: 
              - Key: "type"
                StringValue: "DynamoDBExportDataFormat"
          - Id: "TableBackupActivity"
            Name: "TableBackupActivity"
            Fields: 
              - Key: "resizeClusterBeforeRunning"
                StringValue: "true"
              - Key: "type"
                StringValue: "EmrActivity"
              - Key: "input"
                RefValue: "DDBSourceTable"
              - Key: "runsOn"
                RefValue: "EmrClusterForBackup"
              - Key: "output"
                RefValue: "S3BackupLocation"
              - Key: "step"
                RefValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
          - Id: "DefaultSchedule"
            Name: "Every 1 day"
            Fields: 
              - Key: "occurrences"
                StringValue: "1"
              - Key: "startDateTime"
                StringValue: "2020-09-17T1:00:00"
              - Key: "type"
                StringValue: "Schedule"
              - Key: "period"
                StringValue: "1 Day"
          - Id: "Default"
            Name: "Default"
            Fields: 
              - Key: "type"
                StringValue: "Default"
              - Key: "scheduleType"
                StringValue: "cron"
              - Key: "failureAndRerunMode"
                StringValue: "CASCADE"
              - Key: "role"
                StringValue: "DatapipelineDefaultRole"
              - Key: "resourceRole"
                StringValue: "DatapipelineDefaultResourceRole"
              - Key: "schedule"
                RefValue: "DefaultSchedule"
          - Id: "EmrClusterForBackup"
            Name: "EmrClusterForBackup"
            Fields: 
              - Key: "terminateAfter"
                StringValue: "2 Hours"
              - Key: "masterInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceCount"
                StringValue: "1"
              - Key: "type"
                StringValue: "EmrCluster"
              - Key: "releaseLabel"
                StringValue: "emr-5.23.0"
              - Key: "region"
                StringValue: "#{myDDBRegion}"

There are 2 answers below.

BEST ANSWER

Guys, I solved it with the AWS support team. As of today, the following is the YAML code which creates a Data Pipeline for on-demand (PAY_PER_REQUEST) DynamoDB tables.

You can also convert this to JSON if you want.

    UrlReportBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: ***bucketname***

    UrlReportDataPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties: 
        Name: ***pipelinename***
        Activate: true
        ParameterObjects: 
          - Id: "myDDBReadThroughputRatio"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB read throughput ratio"
              - Key: "type"
                StringValue: "Double"
              - Key: "default"
                StringValue: "0.9"
          - Id: "myOutputS3Loc"
            Attributes: 
              - Key: "description"
                StringValue: "S3 output bucket"
              - Key: "type"
                StringValue: "AWS::S3::ObjectKey"
              - Key: "default"
                StringValue: 
                  !Join [ "", [ "s3://", Ref: "UrlReportBucket" ] ]
          - Id: "myDDBTableName"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB Table Name"
              - Key: "type"
                StringValue: "String"
          - Id: "myDDBRegion"
            Attributes:
              - Key: "description"
                StringValue: "DynamoDB region"
        ParameterValues: 
          - Id: "myDDBTableName"
            StringValue: 
              Ref: "UrlReport"
          - Id: "myDDBRegion"
            StringValue: "eu-west-1"
        PipelineObjects: 
          - Id: "S3BackupLocation"
            Name: "Copy data to this S3 location"
            Fields: 
              - Key: "type"
                StringValue: "S3DataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "directoryPath"
                StringValue: "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
          - Id: "DDBSourceTable"
            Name: "DDBSourceTable"
            Fields: 
              - Key: "tableName"
                StringValue: "#{myDDBTableName}"
              - Key: "type"
                StringValue: "DynamoDBDataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "readThroughputPercent"
                StringValue: "#{myDDBReadThroughputRatio}"
          - Id: "DDBExportFormat"
            Name: "DDBExportFormat"
            Fields: 
              - Key: "type"
                StringValue: "DynamoDBExportDataFormat"
          - Id: "TableBackupActivity"
            Name: "TableBackupActivity"
            Fields: 
              - Key: "resizeClusterBeforeRunning"
                StringValue: "true"
              - Key: "type"
                StringValue: "EmrActivity"
              - Key: "input"
                RefValue: "DDBSourceTable"
              - Key: "runsOn"
                RefValue: "EmrClusterForBackup"
              - Key: "output"
                RefValue: "S3BackupLocation"
              - Key: "step"
                StringValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{myDDBTableName},#{myDDBReadThroughputRatio}"
          - Id: "DefaultSchedule"
            Name: "Every 1 day"
            Fields: 
              - Key: "occurrences"
                StringValue: "1"
              - Key: "startDateTime"
                StringValue: "2020-09-23T1:00:00"
              - Key: "type"
                StringValue: "Schedule"
              - Key: "period"
                StringValue: "1 Day"
          - Id: "Default"
            Name: "Default"
            Fields: 
              - Key: "type"
                StringValue: "Default"
              - Key: "scheduleType"
                StringValue: "cron"
              - Key: "failureAndRerunMode"
                StringValue: "CASCADE"
              - Key: "role"
                StringValue: "DatapipelineDefaultRole"
              - Key: "resourceRole"
                StringValue: "DatapipelineDefaultResourceRole"
              - Key: "schedule"
                RefValue: "DefaultSchedule"
          - Id: "EmrClusterForBackup"
            Name: "EmrClusterForBackup"
            Fields: 
              - Key: "terminateAfter"
                StringValue: "2 Hours"
              - Key: "masterInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceCount"
                StringValue: "1"
              - Key: "type"
                StringValue: "EmrCluster"
              - Key: "releaseLabel"
                StringValue: "emr-5.23.0"
              - Key: "region"
                StringValue: "#{myDDBRegion}"
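The key difference from the template in the question is the step field of TableBackupActivity: it uses StringValue instead of RefValue, and it references the parameters (#{myDDBTableName}, #{myDDBReadThroughputRatio}) directly instead of the input node's runtime fields:

              - Key: "step"
                StringValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{myDDBTableName},#{myDDBReadThroughputRatio}"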
SECOND ANSWER

Step has a RefValue that points to multiple resources, and they also look like they are specified as a string. According to the documentation, a RefValue is:

A field value that you specify as an identifier of another object in the same pipeline definition.

If you look at where S3BackupLocation is used, it is created under PipelineObjects and then referenced by its Id.
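For example (condensed from your template), S3BackupLocation is defined as its own pipeline object and only then referenced by its Id:

          - Id: "S3BackupLocation"
            Name: "Copy data to this S3 location"
            Fields:
              - Key: "type"
                StringValue: "S3DataNode"
          - Id: "TableBackupActivity"
            Name: "TableBackupActivity"
            Fields:
              - Key: "output"
                RefValue: "S3BackupLocation"   # valid: refers to the pipeline object Id above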

For step, you have a RefValue using a string as its value; that string then contains commas, so it looks like it is specifying multiple objects.

I am not sure what step is meant to be, but if you want to use RefValue, create the object somewhere else in the template and use its Id here.

You could also try using StringValue here instead of RefValue.