XLSX file extraction in Palantir Foundry


Below is the code I'm using to extract an XLSX file. Please let me know if there are any other methods to extract XLSX files in Palantir Foundry Code Repositories.

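import openpyxl

# The @transform decorator with its Input/Output paths is not shown in the post.
def compute(source_df, output_df, ctx):
    filestatus = list(source_df.filesystem().ls(glob='**/*.xlsx'))
    assert len(filestatus) == 1
    latest_file = filestatus[0]
    print(latest_file)
    with source_df.filesystem().open(latest_file.path, 'rb') as f:
        wb = openpyxl.load_workbook(f, read_only=True)
        ws = wb['Sheet1']
        # openpyxl has no row_values(), and ws.rows is a generator that cannot
        # be sliced; iter_rows(values_only=True) yields one tuple per row.
        row_iter = ws.iter_rows(values_only=True)
        headers = [str(h) for h in next(row_iter)]
        # Read every data row while the file is still open.
        rows = [tuple(row) for row in row_iter]
    # `schema` was undefined in the post; passing the header names instead
    # lets Spark infer the column types from the data.
    df = ctx.spark_session.createDataFrame(rows, headers)
    output_df.write_dataframe(df)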

I'm getting the error below. How can I resolve it?

[module version: 1.913.0]

zipfile.BadZipFile: File is not a zip file

There is 1 solution below

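First, I'd recommend taking a look at the following YouTube video published by Palantir. It parses the file with xlrd, which may be a better method. Keep in mind that xlrd 2.0 and later read only the legacy .xls format, so for .xlsx input that route needs an xlrd 1.x release: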

Code Repositories | How to Parse Excel Files into a Usable Dataset in Palantir Foundry

Regarding the error from openpyxl in your existing code: an XLSX file is a zip archive, and openpyxl opens it with Python's zipfile module, so "File is not a zip file" means the bytes it received are not a valid zip. That usually points to a problem with the XLSX file itself. It could be empty, password protected, or corrupted in some way. Open it in Microsoft Excel to check that the spreadsheet looks normal.
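
One way to narrow down which of these cases applies is to look at the raw bytes before handing them to openpyxl. The following is only a rough sketch using the standard library. The function name classify_excel_bytes and its labels are made up for illustration and are not part of Foundry or openpyxl. It relies on the fact that a normal .xlsx starts with the zip signature and contains a [Content_Types].xml entry, while legacy .xls files and password-protected workbooks use the OLE2 compound-file container instead.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                         # local file header of a zip archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound-file signature


def classify_excel_bytes(data: bytes) -> str:
    """Return a rough label for the container format of `data` (hypothetical helper)."""
    if not data:
        return "empty"
    if data.startswith(OLE2_MAGIC):
        # Legacy .xls workbooks and password-protected .xlsx files both use
        # this container, which zipfile cannot open.
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        if not zipfile.is_zipfile(io.BytesIO(data)):
            # Starts like a zip but has no valid end-of-archive record.
            return "truncated-zip"
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Office Open XML packages carry this manifest entry.
            if "[Content_Types].xml" in zf.namelist():
                return "xlsx"
        return "zip"
    return "unknown"
```

Inside the transform from the question, this could be called as classify_excel_bytes(f.read()) within the existing `with source_df.filesystem().open(latest_file.path, 'rb') as f:` block, printing the result before load_workbook is attempted. A result of "ole2" would suggest a renamed .xls or an encrypted workbook, and "unknown" often means the upload is really a CSV or HTML export with an .xlsx extension.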