Reading a Parquet file using Vaex

I'm trying to read some data into Python from a Parquet file, using Vaex.

This is the output I get using the vaex.open function.

>>> import vaex
>>> trade = vaex.open('trade.parquet')
>>> trade
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 3703, in __repr__
    return self._head_and_tail_table(format='plain')
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 3464, in _head_and_tail_table
    return self._as_table(0, n, N - n, N, format=format)
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 3599, in _as_table
    parts = table_part(i1, i2, parts)
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 3573, in table_part
    df = self[k1:k2]
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 4626, in __getitem__
    df = self.trim()
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 3859, in trim
    df = self if inplace else self.copy()
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 5036, in copy
    df.add_column(name, column, dtype=self._dtypes_override.get(name))
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 6053, in add_column
    super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
  File "/home/userman/.local/lib/python3.6/site-packages/vaex/dataframe.py", line 2942, in add_column
    raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
ValueError: array is of length 1048576, while the length of the DataFrame is 34421587

The length of the DataFrame is correct, but I don't understand what 1048576 relates to. I've found a previous answer concerning the reading of HDF5 files, but it doesn't seem to relate to my issue. The data was initially read from a CSV file, then exported to Parquet using pyarrow.
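
One observation that may help narrow this down: 1048576 is exactly 2**20, a round binary number of the kind often used as a chunk or batch size. Whether it matches the row groups in this particular file is an assumption, not something the traceback confirms. A minimal sketch for checking it follows; `split_into_chunks` and `row_group_sizes` are hypothetical helper names, and the second one needs pyarrow installed.

```python
TOTAL_ROWS = 34421587   # DataFrame length reported in the error
ARRAY_LEN = 1048576     # array length reported in the error (2**20)


def split_into_chunks(total_rows, chunk_len):
    """Return (number of full chunks, rows left over in a final partial chunk)."""
    return divmod(total_rows, chunk_len)


def row_group_sizes(path):
    """Return the row count of each row group in a Parquet file.

    Hypothetical helper; pyarrow is imported lazily so the arithmetic
    above can be used without it.
    """
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    return [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]


if __name__ == "__main__":
    full, remainder = split_into_chunks(TOTAL_ROWS, ARRAY_LEN)
    print(f"{full} full chunks of {ARRAY_LEN} rows, plus {remainder} rows")
    # To inspect the actual file (path taken from the question):
    # print(row_group_sizes("trade.parquet"))
```

If the row groups other than the last each report 1048576 rows, that would be consistent with only a single chunk of each column being picked up, though this remains a guess about the cause.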

Can anyone elaborate on what the issue is and how to solve it?

There is 1 best solution below


I had the same issue, so I assume you're using vaex 3.x. Try the latest alpha, 4.0.0a13, ideally in a fresh virtual environment.

pip install vaex==4.0.0a13

Update

As of March 9th, 2021, vaex 4 is out and is the default version on PyPI, so specifying the version is no longer required.

pip install vaex
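
To confirm which major version ended up in the environment, something like the following sketch could be used. It relies only on the standard library (`importlib.metadata`, Python 3.8+); `major_version` and `installed_major` are hypothetical helper names, and it assumes the distribution is registered under the name `vaex`, as in the pip command.

```python
from importlib import metadata


def major_version(version):
    """Extract the leading major number from a version string such as '4.0.0a13'."""
    return int(version.split(".")[0])


def installed_major(package="vaex"):
    """Return the installed major version of a package, or None if it is not installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = installed_major()
    if major is None:
        print("vaex is not installed in this environment")
    elif major < 4:
        print(f"vaex {major}.x found; upgrading to 4.x is what the answer suggests")
    else:
        print(f"vaex {major}.x found")
```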