Pinot batch ingestion removing old data


I am playing with Pinot and have set it up locally using ./bin/pinot-admin.sh QuickStart -type batch. I have also added a table with a single multi-value column (named values).
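For reference, a table with a single multi-value string column could be declared with a schema along these lines (a sketch; the schema name and STRING data type are assumptions based on the description above):

```json
{
  "schemaName": "exp",
  "dimensionFieldSpecs": [
    {
      "name": "values",
      "dataType": "STRING",
      "singleValueField": false
    }
  ]
}
```

Setting singleValueField to false is what makes the column multi-value.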

I then created a CSV file with the following data (NOTE: I am using '-' as the delimiter for multi-values):

values
a-b
a
b

and ingested it using standalone batch ingestion with the following job spec:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'

# Recommended to set jobType to SegmentCreationAndMetadataPush for production environment where Pinot Deep Store is configured  
jobType: SegmentCreationAndTarPush

inputDirURI: '.'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: './csv/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
     multiValueDelimiter: '-'
tableSpec:
  tableName: 'exp'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

The first time I ingest the data using ./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ingestion-job.yaml, I see all three values in the table. I then run the same job again, but instead of 6 rows I still see only 3. Next I changed the CSV file to contain a single row with value x; after launching the job, the table shows just that single row. It seems that every time I run the ingestion job, the previous data is deleted and the newly ingested data is the only data left.

I expected the batch ingestion job to append the data. Am I missing something somewhere?


2 Answers

Answer from nandevers:

A huge maybe here, but have you tried setting the following config to APPEND?

 "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
Answer from Robert Głowacki:

This config (batch ingestion in APPEND mode) should add a TIMESTAMP to the segment name, but at least for me it is not working on Pinot 1.0.0.

However, there is another solution which, when included in the job spec YAML file, resolves the issue:

"segmentNameGeneratorSpec": {
   "type": "inputFile",
   "configs": {
      "file.path.pattern": ".+/(.+)/parquet",
      "segment.name.template": "TABLE_NAME_OFFLINE_${filePathPattern:\\1}_${date}
   }
}

The segment name would then look like: TABLE_NAME_OFFLINE_FILE_NAME_WITHOUT_EXTENSION_DATE
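Adapted to the CSV setup from the question, the ingestion job YAML could carry something along these lines (a sketch; the file-path pattern and segment-name template are assumptions and would need to match your actual directory layout):

```yaml
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    # Capture the input file name (without extension) from its path.
    file.path.pattern: '.+/(.+)\.csv'
    segment.name.template: 'exp_OFFLINE_${filePathPattern:\1}'
```

Because each input file then maps to a distinct segment name, ingesting a new file adds a new segment rather than replacing the existing one; re-ingesting the same file still overwrites its segment, since the generated name is identical.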