Below is the code I'm using to try to extract an XLSX file. Please let me know if there are any other methods to extract XLSX files in Palantir Foundry Code Repos.
import openpyxl

def compute(source_df, output_df, ctx):
    # Expect exactly one .xlsx file in the input dataset.
    filestatus = list(source_df.filesystem().ls(glob='**/*.xlsx'))
    assert len(filestatus) == 1
    latest_file = filestatus[0]
    print(latest_file)
    with source_df.filesystem().open(latest_file.path, 'rb') as f:
        wb = openpyxl.load_workbook(f, read_only=True)
        ws = wb['Sheet1']
        # iter_rows(values_only=True) yields one tuple of cell values per row;
        # the first row holds the column headers.
        row_iter = ws.iter_rows(values_only=True)
        headers = next(row_iter)
        rows = [dict(zip(headers, row)) for row in row_iter]
    # `schema` is assumed to be defined elsewhere in this module.
    df = ctx.spark_session.createDataFrame(rows, schema)
    output_df.write_dataframe(df)
I'm getting the error below. How can I resolve it?
[module version: 1.913.0]
zipfile.BadZipFile: File is not a zip file
First, I'd recommend taking a look at the following YouTube video published by Palantir, which may offer a better method using xlrd: Code Repositories | How to Parse Excel Files into a Usable Dataset in Palantir Foundry
Regarding the error from openpyxl in your existing code: an XLSX file is a zip archive, so "File is not a zip file" indicates that there is something wrong with the file itself. It could be that your XLSX file is empty, password protected, or corrupted in some way. Be sure to open it in Microsoft Excel to check that the spreadsheet looks normal.
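To narrow down which of those cases applies without leaving the repository, one option is to inspect the first bytes of the file before handing it to openpyxl. The sketch below is only an illustration: describe_excel_bytes and the two constants are hypothetical names, not part of openpyxl or Foundry. It relies on the fact that a normal XLSX file starts with the zip signature, while legacy .xls files and password-protected workbooks are stored in an OLE2 container with a different signature.

```python
import io
import zipfile

# Signature at the start of a zip archive that contains at least one entry.
ZIP_MAGIC = b"PK\x03\x04"
# Signature of an OLE2 compound file (legacy .xls, or an encrypted workbook).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def describe_excel_bytes(data: bytes) -> str:
    """Hypothetical helper: roughly classify the raw bytes of an Excel upload."""
    if len(data) == 0:
        return "empty"
    if data.startswith(OLE2_MAGIC):
        return "ole2"  # legacy .xls or password-protected workbook
    if data.startswith(ZIP_MAGIC) and zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"  # structurally a zip, so plausibly a real .xlsx
    return "unknown"  # e.g. CSV or HTML saved with an .xlsx extension


# Possible use inside the transform from the question (names taken from it):
# with source_df.filesystem().open(latest_file.path, 'rb') as f:
#     print(describe_excel_bytes(f.read()))
```

A result of "zip" does not prove the workbook is valid, only that the BadZipFile error would have to come from somewhere other than the file's outer container.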