CSV data load from S3 - Apache Pinot


I am trying to load a CSV file from S3 into an Apache Pinot table. One column in the CSV file contains a semicolon in its value, for example: TestCSV; displayType

I get the following error while loading this data into the Pinot table: java.lang.IllegalArgumentException: Cannot read single-value from Object[] : [TestCSV, displayType]

From the error, it looks like the value is being split at the semicolon into two elements ([TestCSV, displayType]) instead of being read as a single value, which is why the exception is thrown.

Here is the sample CSV data for reference:

column1,column2,column3,column4,column5
925aa-1,00d925,TestCSV; displayType,testbox,sample.com

This is what I have in the jobSpec.yml file:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://********/******/******/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://********/******/******/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-east-1
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    fileFormat: 'csv' 
    delimiter: ','
tableSpec:
  tableName: 'testload'
  schemaURI: 'http://localhost:9000/tables/testload/schema'
  tableConfigURI: 'http://localhost:9000/tables/testload'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  # pushAttempts: number of attempts for push job, default is 1, which means no retry.
  pushAttempts: 2

  # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
  pushRetryIntervalMillis: 1000

I want to load the data with the semicolon preserved in the value. Can anyone help me with this?

Note: The data loads without issues once the semicolon is removed.
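
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Pinot's CSVRecordReaderConfig has a multiValueDelimiter setting that defaults to ';', so a field containing a semicolon is read as a multi-value array, which a single-value column cannot accept. A minimal sketch of the recordReaderSpec with the multi-value delimiter moved to a different character follows. The '|' is an arbitrary choice and assumes that character never appears in the data.

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # Column separator, unchanged from the original job spec.
    delimiter: ','
    # Assumption: '|' does not occur in any field. The default is ';',
    # which would split "TestCSV; displayType" into two values.
    multiValueDelimiter: '|'
```

If this is the cause, the rest of the job spec can stay as it is; only the character used to split multi-value fields changes.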
