Convert EBCDIC to ASCII in Apache Beam


I am trying to convert an EBCDIC file to ASCII using the CobolIoProvider class from JRecord in Apache Beam.

Code that I'm using:

CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME, CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);

Here cobolfilename and copybookname are the paths of the EBCDIC file and the copybook, respectively. The code reads and converts the file as required, but only when both paths point to the local file system. When I try to read the files from GCS, it fails with a FileNotFoundException: "The filename, directory name, or volume label syntax is incorrect".

Is there a way to read a COBOL (EBCDIC) file from GCS using the CobolIoProvider class?

If not, is there any other class that can convert a COBOL (EBCDIC) file to ASCII and allows the files to be read from GCS?

Using ICobolIOBuilder:

Code that I’m using:

ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl")
        .setFileOrganization(Constants.IO_FIXED_LENGTH)
        .setSplitCopybook(CopybookLoader.SPLIT_NONE);

AbstractLineReader reader = iob.newReader(bs); //bs is an InputStream object of my Cobol file

However, I have a few concerns:

1) I have to keep my copybook.cbl locally. Is there any way to read the copybook file from GCS? I tried the code below, which reads my copybook from GCS into a stream and passes that stream to loadCopyBook(), but it didn't work.

Sample code below:

InputStream bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
        .loadCopyBook(bs, " copybook.cbl",
                CopybookLoader.SPLIT_NONE, 0, "",
                Constants.USE_STANDARD_COLUMNS,
                Convert.FMT_INTEL, 0, new TextLog())
        .asLayoutDetail();

AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);

reader.open(inputStream, schema);

2) Reading the EBCDIC file from a stream using newReader didn't convert my file to ASCII.

Thanks.


There are 2 best solutions below

Best answer

I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the code to use JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.

String encoding = "cp037"; // cp037/IBM037: US EBCDIC; cp273: German EBCDIC
ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder("CopybookFile.cbl")
        .setFileOrganization(Constants.IO_FIXED_LENGTH)
        .setFont(encoding); // you should set the encoding if you can

AbstractLineReader reader = iob.newReader(datastream);
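
To illustrate what the encoding name passed to setFont means for text fields, here is a minimal sketch that uses only the JDK's charset support, not JRecord itself. The class name EbcdicTextDemo and the sample bytes are made up for illustration, and it assumes the JDK in use ships the IBM037 charset. It only applies to character (PIC X) data; binary and packed-decimal (COMP-3) fields cannot be converted with a plain charset and have to be decoded field by field according to the copybook.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes EBCDIC text bytes with the JDK charset IBM037 (cp037).
class EbcdicTextDemo {

    // Decode EBCDIC (cp037) bytes into a Java String.
    static String decode(byte[] ebcdic) {
        return new String(ebcdic, Charset.forName("IBM037"));
    }

    // Re-encode EBCDIC text bytes as US-ASCII bytes.
    static byte[] toAscii(byte[] ebcdic) {
        return decode(ebcdic).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "HELLO" in cp037: H=0xC8, E=0xC5, L=0xD3, L=0xD3, O=0xD6
        byte[] ebcdic = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        System.out.println(decode(ebcdic)); // prints HELLO
    }
}
```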

With the IO-Builder interface you can use streams. The question "Stream file from Google Cloud Storage" is about creating a stream from GCS and may be useful. Hopefully someone with more knowledge of GCS can help.

Alternatively, you could read the bytes from GCS directly and create data lines (records) using the newLine method of a JRecord IO-Builder:

AbstractLine l = iob.newLine(byteArray);
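
For a fixed-length file, the byte arrays passed to newLine could be produced by cutting the input stream into record-sized chunks. The following is a rough sketch using only the JDK (Java 11+ for readNBytes); the class FixedLengthRecordSplitter and its method are hypothetical helpers, not part of JRecord, and the record length is assumed to be known from the copybook layout.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a stream of fixed-length records into byte arrays.
class FixedLengthRecordSplitter {

    // Reads records of exactly recordLength bytes until the stream is exhausted.
    static List<byte[]> readRecords(InputStream in, int recordLength) throws IOException {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // readNBytes returns fewer than recordLength bytes only at end of stream
            byte[] record = in.readNBytes(recordLength);
            if (record.length == 0) {
                break; // clean end of input
            }
            if (record.length < recordLength) {
                throw new IOException("Truncated record: got " + record.length
                        + " bytes, expected " + recordLength);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a stream opened on the EBCDIC file
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5, 6});
        List<byte[]> records = readRecords(in, 3);
        System.out.println(records.size()); // prints 2
    }
}
```

Each returned array would then be handed to iob.newLine(...) as in the line above.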

I will look at creating a basic Read/Write interface to JRecord so JRecord users can write their own interface to GCS, IBM's Mainframe Access (ZFile), etc. But this will take time.

Second answer

The easiest way to use Beam/Dataflow with new kinds of file-based sources is to first use FileIO to get a PCollection<ReadableFile> and then use a DoFn to read that file. This will require implementing the code to read from a given channel. Something like the following:

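import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline p = ...
p.apply(FileIO.match().filepattern("..."))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
   @ProcessElement
   public void processElement(ProcessContext c) throws IOException {
     try (ReadableByteChannel channel = c.element().open()) {
       // Use JRecord to read from the byte channel and emit results with c.output(...)
     }
   }
 }));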
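
Since the DoFn above receives a ReadableByteChannel while the IO-Builder's newReader in the first answer takes a stream, one way to bridge the two is the JDK's Channels.newInputStream. The sketch below shows only that bridging step with standard library classes; ChannelToStreamDemo is a made-up name, and the in-memory channel stands in for the one returned by c.element().open().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical demo class: adapts a ReadableByteChannel to an InputStream.
class ChannelToStreamDemo {

    // Wraps the channel in an InputStream and reads it fully (Java 9+ for readAllBytes).
    // Closing the stream also closes the underlying channel.
    static byte[] readAll(ReadableByteChannel channel) throws IOException {
        try (InputStream in = Channels.newInputStream(channel)) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the channel a Beam ReadableFile would provide
        ReadableByteChannel channel =
                Channels.newChannel(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(readAll(channel).length); // prints 3
    }
}
```

In the pipeline, the InputStream obtained this way could be passed to a stream-based reader instead of being read fully into memory.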