Convert EBCDIC to ASCII in Apache Beam

I am trying to convert an EBCDIC file to ASCII using the CobolIoProvider class from JRecord in Apache Beam.

Code that I'm using:

CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME, CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);

The code reads and converts the file as required, but only when cobolfilename and copybookname are local paths (to the EBCDIC file and the copybook respectively). When I try to read the files from GCS, it fails with FileNotFoundException: “The filename, directory name, or volume label syntax is incorrect”.

Is there a way to read a Cobol file (EBCDIC) from GCS using the CobolIoProvider class?

If not, is there any other class that can convert a Cobol file (EBCDIC) to ASCII and allows the files to be read from GCS?

Using ICobolIOBuilder:-

Code that I’m using:

ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl");

AbstractLineReader reader = iob.newReader(bs); //bs is an InputStream object of my Cobol file

However, here are a few concerns:-

1) I have to keep my copybook.cbl locally. Is there any way to read the copybook file from GCS? I tried the code below, which reads my copybook from GCS into a stream and passes the stream to loadCopyBook(), but it didn’t work.

Sample code below:

InputStream  bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
                     .loadCopyBook(   bs, " copybook.cbl",
                         CopybookLoader.SPLIT_NONE, 0, "",
                         Convert.FMT_INTEL, 0, new TextLog())

AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);

2) Reading the EBCDIC file from a stream using newReader didn’t convert my file to ASCII.



There are 2 best solutions below


I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the code to use JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.

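String encoding = "cp037"; // cp037/IBM037 - US EBCDIC; cp273 - German EBCDIC
ICobolIOBuilder iob = JRecordInterface1.COBOL
            .newIOBuilder("copybook.cbl")                    // copybook describing the record layout
            .setFileOrganization(Constants.IO_FIXED_LENGTH)  // fixed-length records, as in the question
            .setFont(encoding);  // should set encoding if you can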

AbstractLineReader reader = iob.newReader(datastream);
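
To illustrate what the encoding name selects, here is a minimal JDK-only sketch (it does not use JRecord, and the class and method names are made up for the example). It assumes the JDK in use ships the IBM037 charset, which full JDK builds normally do. Note that a plain charset decode like this is only valid for character fields; binary and packed-decimal (COMP-3) fields need the copybook-driven handling that JRecord provides.

```java
import java.nio.charset.Charset;

class EbcdicTextDemo {
    // Decodes bytes as EBCDIC code page 037 text.
    // Only meaningful for character (PIC X) data, not binary or packed fields.
    static String decodeCp037(byte[] ebcdic) {
        return new String(ebcdic, Charset.forName("IBM037"));
    }

    public static void main(String[] args) {
        // "Hello" encoded in EBCDIC cp037
        byte[] hello = {(byte) 0xC8, (byte) 0x85, (byte) 0x93, (byte) 0x93, (byte) 0x96};
        System.out.println(decodeCp037(hello)); // prints "Hello"
    }
}
```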

With the IO-Builder interface you can use streams. The question "Stream file from Google Cloud Storage", which is about creating a stream from GCS, may be useful. Hopefully someone with more knowledge of GCS can help.
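
Storage clients often hand back a ReadableByteChannel rather than an InputStream. As a rough sketch, the JDK's Channels.newInputStream can bridge the two, so the result could be passed to newReader. The in-memory channel below is only a stand-in for a real GCS channel, and the class and method names are hypothetical; readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

class ChannelToStream {
    // Wraps a channel as an InputStream, the type the IO-Builder readers accept.
    static InputStream asStream(ReadableByteChannel channel) {
        return Channels.newInputStream(channel);
    }

    // Drains the channel through the stream view; used here only to show the bridge works.
    static byte[] readAll(ReadableByteChannel channel) {
        try (InputStream in = asStream(channel)) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a channel opened on a remote object
        ReadableByteChannel channel =
                Channels.newChannel(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(readAll(channel).length); // prints 3
    }
}
```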

Alternatively, you could read from GCS directly and create data lines (data records) using the newLine method of a JRecord IO-Builder:

     AbstractLine l = iob.newLine(byteArray);
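
For a fixed-length file, the byte arrays can be cut straight from the stream. The following is a minimal JDK-only sketch under that assumption (the record length would come from the copybook; the class and method names are hypothetical, and readNBytes(int) needs Java 11 or later). Each returned array could then be handed to iob.newLine as shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FixedLengthRecords {
    // Splits a stream into records of exactly recordLength bytes each.
    static List<byte[]> readRecords(InputStream in, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        List<byte[]> records = new ArrayList<>();
        try {
            while (true) {
                byte[] record = in.readNBytes(recordLength);
                if (record.length == 0) {
                    break; // clean end of stream
                }
                if (record.length < recordLength) {
                    throw new IOException("Truncated final record: " + record.length + " bytes");
                }
                records.add(record); // e.g. pass to iob.newLine(record)
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Six bytes of dummy data split into 3-byte records
        InputStream data = new ByteArrayInputStream(new byte[6]);
        System.out.println(readRecords(data, 3).size()); // prints 2
    }
}
```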

I will look at creating a basic Read/Write interface to JRecord so JRecord users can write their own interface to GCS or IBM's Mainframe Access (ZFile) etc. But this will take time.


The easiest way to use Beam/Dataflow with new kinds of file-based sources is to first use FileIO to get a PCollection<ReadableFile> and then use a DoFn to read that file. This will require implementing the code to read from a given channel. Something like the following:

Pipeline p = ...
 .apply(new DoFn<ReadableFile, String>() {
   public void processElement(ProcessContext c) {
     try (ReadableByteChannel channel = c.element().open()) {
       // Use CobolIO to read from the byte channel