Parquet StreamReader gives blank values for some columns but correct values for another?


This is how I am populating the Parquet file, following the example given in the documentation. There are three columns: day, month and year.

  arrow::Int8Builder int8builder;
  int8_t days_raw[15] = {1, 12, 17, 23, 28, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 15));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[15] = {1, 3, 5, 7, 1, 2, 12, 4, 5, 6, 7, 8, 9, 10, 11};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 15));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[15] = {1990, 2000, 1995, 2000, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
                           2002, 2003, 2004, 2015};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 15));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  /* Get a vector of our Arrays */
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  /* Make a schema to initialize the Table with */
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  /* With the schema and data, create a Table */
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  /* Write the table out as a Parquet file for the example to use. */
  std::shared_ptr<arrow::io::FileOutputStream> outfile;

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 10));

This is how I read the file back and print it to check that it was populated correctly. The last field, 'year', is displayed correctly, but day and month come out as blanks or spaces. What am I doing wrong? I searched for this but did not find anything specific.

std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("test_in.parquet"));

parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};

int8_t day;
int8_t month;
int16_t year;

while (!stream.eof())
{
  stream >> day >> month >> year >> parquet::EndRow;

  std::cout << "Day=" << day << ", month=" << month << ", year=" << year << std::endl;
}

!!!readDisplayParquetFile: opening file test_out.parquet
Day=, month=, year=1990
Day= , month=, year=2000
Day=, month=, year=1995
Day=, month=, year=2000
Day=, month=, year=1995
Day= , month=, year=1996
Day= , month= , year=1997
. .

I am trying to decode every column's data for printing, but only the last column is decoded correctly.

amitfreeman On

Never mind, I got the answer.

For Day and Month an int8_t is being inserted, and it appears that Parquet is not accepting it: the value gets written but seems to be corrupted, hence the reading issue.

If I convert them to int16_t when inserting, and use the same type when displaying, just as for the variable 'year', then it works.

Their page mentions that the Arrow type int8 is converted to the Parquet INT32 physical type, but that is probably not happening, or it needs some extra steps. https://arrow.apache.org/docs/cpp/parquet.html#parquet-writer-properties
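One thing worth ruling out before blaming the file itself is how the values are printed. On mainstream platforms int8_t is an alias for signed char (an assumption, though a common one), and streaming a signed char to std::cout writes it as a character, not as a number. Values such as 1, 12 or 17 are control characters, which would show up as blanks, while int16_t prints as a number. That would also explain why switching to int16_t makes the output look right. The sketch below uses two hypothetical helpers, format_row and format_row_raw, which are not part of Arrow or Parquet, to show the difference.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: formats one row like the loop in the question, but
// promotes the 8-bit fields to int so they are written as numbers.
std::string format_row(int8_t day, int8_t month, int16_t year) {
  std::ostringstream out;
  out << "Day=" << static_cast<int>(day)
      << ", month=" << static_cast<int>(month)
      << ", year=" << year;
  return out.str();
}

// Hypothetical helper: same row without the cast. Assuming int8_t is
// signed char, day and month are written as raw characters.
std::string format_row_raw(int8_t day, int8_t month, int16_t year) {
  std::ostringstream out;
  out << "Day=" << day << ", month=" << month << ", year=" << year;
  return out.str();
}
```

If this is the cause, the reading loop in the question can keep int8_t for day and month and only change the print statement to use static_cast<int>(day) and static_cast<int>(month).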
