PySpark failing to decode 'cp1047' when processing mainframe input


One of my requirements is to decode a byte array using the cp1047 code page. A sample of my code:

ebcdic_str = input_bytes.decode('cp1047')

The code above works correctly in plain Python, but when I execute the same operation as part of PySpark code (by wrapping it in a UDF), I get the following error:

    ebcdic_str = input_bytes.decode('cp1047')
LookupError: unknown encoding: cp1047

I have successfully done the same operation in PySpark using code page cp037, but ran into other issues with that code page. Per a suggestion from IBM, the code page was changed to cp1047. Unfortunately, the code now fails.

Can anybody suggest a reason for the failure?


There are 2 answers below.

Kaushik Ghosh (BEST ANSWER)

The issue was happening because we were not importing the Python ebcdic package in our code. Importing it registers the extra EBCDIC codecs (including cp1047) with Python's codec machinery, and once that package was imported the issue was resolved.
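
A minimal sketch of how this might look in a PySpark job (the app name, column name, and sample bytes are illustrative, not from the original code). Importing ebcdic inside the UDF body ensures the codec gets registered on each executor, not just on the driver:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("ebcdic-decode").getOrCreate()

    def decode_cp1047(input_bytes):
        import ebcdic  # noqa: F401 - side-effect import: registers cp1047
        if input_bytes is None:
            return None
        return input_bytes.decode('cp1047')

    decode_udf = udf(decode_cp1047, StringType())

    # 'Hello' encoded in cp1047
    df = spark.createDataFrame([(bytearray(b'\xc8\x85\x93\x93\x96'),)], ['raw'])
    df.select(decode_udf('raw').alias('text')).show()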

As a side note, since ebcdic is not a widely used package, it might not be pre-installed on all of your worker/edge nodes. You might want to validate that the package is available everywhere; otherwise you may receive a ModuleNotFoundError ("No module named 'ebcdic'") from the executors.
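
For example, a quick check one might run on each node (assuming, as above, that the ebcdic package is what provides the codec):

    import codecs

    try:
        import ebcdic  # side-effect import: registers the extra EBCDIC codecs
        codecs.lookup('cp1047')
        print('cp1047 codec is available')
    except (ImportError, LookupError) as exc:
        print('cp1047 codec is not available:', exc)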

jabostian

Note that the ebcdic package has shipped in site-packages since the first release of the Python SDK on z/OS, so you should be able to import it in your code on z/OS from Python 3.8 onward. As usual, the more current the Python version, the better.