PySpark failing to decode 'cp1047' when processing mainframe input


One of my requirements is to decode a byte array using the cp1047 code page. A sample of my code:

ebcdic_str = input_bytes.decode('cp1047')

The code above works correctly in plain Python, but when I execute the same operation as part of PySpark code (by wrapping it in a UDF), I get the following error:

    ebcdic_str = input_bytes.decode('cp1047')
LookupError: unknown encoding: cp1047

I have successfully done the same operation in PySpark using code page cp037, but ran into other issues with that code page. Per a suggestion from IBM, the code page was changed to cp1047. Unfortunately, the code now fails.

Can anybody suggest a reason for the failure?


There are 2 answers below.

Kaushik Ghosh (BEST ANSWER)

The issue was happening because we were not importing the Python ebcdic package in our code. Importing it registers the extra EBCDIC codecs (including cp1047) with Python's codec machinery, and once that package was imported the issue was resolved.
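
A minimal sketch of how this might look in a PySpark job (the app name, column name, and sample bytes are illustrative, not from the original code). Importing ebcdic inside the UDF body ensures the codec gets registered on each executor, not just on the driver:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("ebcdic-decode").getOrCreate()

    def decode_cp1047(input_bytes):
        import ebcdic  # noqa: F401 - side-effect import: registers cp1047
        if input_bytes is None:
            return None
        return input_bytes.decode('cp1047')

    decode_udf = udf(decode_cp1047, StringType())

    # 'Hello' encoded in cp1047
    df = spark.createDataFrame([(bytearray(b'\xc8\x85\x93\x93\x96'),)], ['raw'])
    df.select(decode_udf('raw').alias('text')).show()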

As a side note, since ebcdic is not a widely used package, it might not be pre-installed on all of your worker/edge nodes. You might want to validate that the package is available everywhere; otherwise you may receive a ModuleNotFoundError ("No module named 'ebcdic'") from the executors.
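
For example, a quick check one might run on each node (assuming, as above, that the ebcdic package is what provides the codec):

    import codecs

    try:
        import ebcdic  # side-effect import: registers the extra EBCDIC codecs
        codecs.lookup('cp1047')
        print('cp1047 codec is available')
    except (ImportError, LookupError) as exc:
        print('cp1047 codec is not available:', exc)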

jabostian

Note that the ebcdic package has shipped in site-packages since the first release of the Python SDK on z/OS, so you should be able to import it in your code on z/OS from Python 3.8 onward. As usual, the more current the Python version, the better.