FileNotFoundError when using pd.read_parquet to read a file on an FTP server


I'm trying to read a binary file (.parquet) located on an FTP server using pandas' read_parquet:

import pandas as pd
df = pd.read_parquet('ftp://ftp.hostname.com/binary/filename.parquet',engine='fastparquet')

I get the following error message:

FileNotFoundError: ftp://ftp.hostname.com/binary/filename.parquet

The file definitely exists at that path, and I've double-checked the path name.

Extra Info:

When accessing .csv files on the same FTP server, there are no errors:

pd.read_csv('ftp://ftp.hostname.com/csv/filename.csv') 

The error only occurs when using pd.read_parquet to read binary files on the FTP server. I've also tried engine='pyarrow', with the same result.

When I download the file, save it locally, and open it with pd.read_parquet, it works fine.
Download using Python's urllib:

import urllib.request
urllib.request.urlretrieve('ftp://ftp.hostname.com/binary/filename.parquet', 'file')
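The download-then-read workaround described above could be wrapped up roughly as follows. This is only a sketch: the helper names `download_to_temp` and `read_remote_parquet` are made up for illustration, and it assumes pandas plus a Parquet engine are installed and that the server allows anonymous access through the URL.

```python
import shutil
import tempfile
import urllib.request


def download_to_temp(url: str, suffix: str = ".parquet") -> str:
    """Stream the resource at `url` into a temporary file and return its path.

    Hypothetical helper, not part of pandas or urllib.
    """
    with urllib.request.urlopen(url) as response:
        # delete=False keeps the file on disk after the handle is closed,
        # so it can be reopened by name (this matters on Windows).
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def read_remote_parquet(url: str, engine: str = "fastparquet"):
    """Download `url` to a local temp file, then read it with pandas."""
    import pandas as pd  # imported lazily; needs pandas and a Parquet engine

    local_path = download_to_temp(url)
    return pd.read_parquet(local_path, engine=engine)
```

With the URL from the question, the call would look like `read_remote_parquet('ftp://ftp.hostname.com/binary/filename.parquet')`. The temporary file is not cleaned up automatically, so it would need to be removed once the DataFrame is loaded.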

When opening the file with urllib.request.urlopen instead:

from urllib import request
req = request.urlopen('ftp://ftp.hostname.com/binary/filename.parquet')
df = req.read()

I get the following result:

df = '\x00\x11...'

I'm not sure whether this is an issue with the file encoding.
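One way to narrow this down might be to inspect the raw bytes directly. Parquet is a binary format with no text encoding, and a standard Parquet file (one without an encrypted footer) starts and ends with the 4-byte marker `PAR1`. The sketch below checks for that marker and, if it is present, hands the bytes to pandas through an in-memory buffer. The function names are hypothetical, and the `pyarrow` default is just an assumption about which engine is installed.

```python
import io

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of a standard Parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Return True if `data` starts and ends with the Parquet magic bytes."""
    return (
        len(data) >= 2 * len(PARQUET_MAGIC)
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )


def parquet_bytes_to_frame(data: bytes, engine: str = "pyarrow"):
    """Read already-downloaded Parquet bytes into a DataFrame."""
    if not looks_like_parquet(data):
        raise ValueError("data does not carry the PAR1 marker at both ends")

    import pandas as pd  # imported lazily; needs pandas and a Parquet engine

    # read_parquet accepts file-like objects, so no temp file is needed.
    return pd.read_parquet(io.BytesIO(data), engine=engine)
```

Applied to the snippet above, this would be `parquet_bytes_to_frame(req.read())`. If the check fails, the download itself is probably incomplete or not a Parquet file, which would be a separate problem from how pandas resolves the ftp:// URL.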

UPDATE:
Full traceback from read_parquet:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\Python312\Lib\site-packages\pandas\io\parquet.py", line 670, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "...\Python312\Lib\site-packages\pandas\io\parquet.py", line 400, in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python312\Lib\site-packages\fastparquet\api.py", line 178, in __init__
    raise FileNotFoundError(fn)
FileNotFoundError: ftp://ftp.hostname.com/binary/filename.parquet

Traceback when attempting to read the same .parquet file with read_csv:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\Python312\Lib\site-packages\pandas\io\parsers\readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python312\Lib\site-packages\pandas\io\parsers\readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python312\Lib\site-packages\pandas\io\parsers\readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python312\Lib\site-packages\pandas\io\parsers\readers.py", line 1723, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\Python312\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 579, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 668, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2050, in pandas._libs.parsers.raise_parser_error
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 7-8: invalid continuation byte