Modify column type in Parquet file with Ruby (using parquet gem)


I have a number of Parquet files in our data warehouse. In about 700 of the earlier files, one column has its schema type set to string when it should have been int32. I understand that Parquet files are immutable, so I'm looking for the best way to rewrite these files with the correct column type. I am using Ruby with the red-parquet gem.

I have tried casting the column to int32 and then saving the file to a new location. The code runs without raising an error, but it doesn't work. I've outlined the method I'm using below. Any help would be much appreciated.

def castCol(col = nil)
  filesWritten = 0
  getParquets.each do |file|
    table = Arrow::Table.load(file)
    if table.heading.data_type == "string"
      newFileLoc = @saveDir + File.path(file)
      puts newFileLoc

      # Create Dir if Required
      unless File.directory?(File.dirname(newFileLoc))
        FileUtils.mkdir_p(File.dirname(newFileLoc))
      end

      table.heading.cast('int32')
      table.save(newFileLoc)

      filesWritten += 1
    end
  end
  puts "Number of Files Written: #{filesWritten}"
end

There is 1 best solution below

Smithy On

I've written the conversion in Python. The PyArrow libraries are better documented, which is probably to be expected.