Reading a huge fixed-width file

I have a requirement to read a huge flat file without keeping the entire file in memory. The file consists of multiple segments: each record set starts with a header record, identified by an 'H' at the beginning of the line, followed by many detail lines. Then the next header record follows, and the pattern repeats. For example:

HXYZ CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R1 234 qweewwqewewq wqewe
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
HABC LTD  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
HDRE CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
R2 344 dfgdfgdf gfd  df g 

I want to read one record set at a time, for example:

HDRE CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
R2 344 dfgdfgdf gfd  df g 

How can I achieve this? Keep in mind that I do not want to keep the entire file in memory. Is there any standard library that I can use for this purpose? I have tried some implementations without much success. I have used Apache's LineIterator, but that reads line by line.

Any help or suggestions will be much appreciated.

There are 6 best solutions below

1
On

In Java 8, you can use the NIO Files.lines() method together with Stream.map() and PrintWriter.

I updated the code so that it writes line by line to a new file, appending the current date to each header line.

import java.util.stream.Stream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.io.IOException;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;    

public class Main {

    public static void main(String[] args) {

        String input =  "C://data.txt";
        String output = "C://data1.txt";
        String date = getDate();

        addDate(input,output,date);

    }

    // Copies the file line by line, appending the date to every header line.
    // Files.lines() reads lazily, so the whole file is never held in memory.
    public static void addDate(String in, String out, String date)
    {
        try (Stream<String> stream = Files.lines(Paths.get(in));
             PrintWriter output = new PrintWriter(out, "UTF-8"))
        {
            stream.map(x -> x.startsWith("H") ? x + " " + date : x)
                  .forEach(output::println);
        }
        catch (IOException e) { e.printStackTrace(); }
    }

    public static String getDate(){
        DateTimeFormatter dtf = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        LocalDate localDate = LocalDate.now();
        return dtf.format(localDate);
    }
}
1
On

A library for that purpose is BeanIO.

There are a lot of unsupported libraries for fixed-width file formats out there.

Flatpack is more recent, but I haven't tried it.

0
On

The data is stored by line, and you don't know the record has ended until you read the header line of the next record. You need to read line-by-line. Something like this should work:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// try-with-resources closes the reader when the loop is done
try ( BufferedReader br = new BufferedReader( new FileReader( file ) ) )
{
    List<String> record = new ArrayList<>();
    String line;

    // loop is explicitly broken when file ends
    for ( ;; )
    {
        line = br.readLine();

        // no more lines - process what's in record (if anything) and break the loop
        if ( null == line )
        {
            if ( !record.isEmpty() ) ProcessRecord( record );
            break;
        }

        // new header line - process the previous record, if there is one,
        // and clear it for the new record
        if ( line.startsWith( "H" ) && !record.isEmpty() )
        {
            ProcessRecord( record );
            record.clear();
        }

        // add the current line to the current record
        record.add( line );
    }
}
0
On

You should aim to achieve your goal using line-by-line reading (such as the Apache iterator you used, or Java 8's Files.lines()).

Use two loops: an outer loop that runs until EOF is reached, and an inner loop that reads one record set at a time. Once you have processed a whole record set, you can drop the lines you have read and leave them to the garbage collector. The outer loop then moves on to the next record set.

If you use lambdas and Java 8's Files.lines(...), you may want to group (collect) the lines that belong to the same record set and then process these grouped objects.
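
The two-loop idea can be sketched roughly as follows. This is only an illustration under some assumptions: every record set starts with a line beginning with "H", no detail line begins with "H", and the names RecordSetReader and forEachRecordSet are made up for this example rather than taken from any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of any library.
final class RecordSetReader {

    // Reads one record set at a time and hands it to the handler.
    // Only the lines of the current record set are held in memory.
    static void forEachRecordSet(BufferedReader reader, Consumer<List<String>> handler) {
        try {
            String line = reader.readLine();
            // Outer loop: one iteration per record set, until EOF.
            while (line != null) {
                List<String> recordSet = new ArrayList<>();
                recordSet.add(line); // the header line of this record set
                // Inner loop: collect detail lines until the next header or EOF.
                while ((line = reader.readLine()) != null && !line.startsWith("H")) {
                    recordSet.add(line);
                }
                // The list becomes garbage once the handler is done with it.
                handler.accept(recordSet);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

It could be called with a reader from Files.newBufferedReader(Paths.get(...)) inside a try-with-resources block, passing a lambda that processes each record set.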

0
On

I would just go with the built-in BufferedReader and read it line-by-line.

I don't know what you mean by a fixed-width file, because in your comment you mention that:

R1,R2,R3 all are optional,repeatable and are of varying width's.

In any case, based on your description, your format is structured, so you can:

1. Read the first character to get the TOKEN
2. Check if TOKEN equals "H" or "R"
3. Split the line and parse it based on what type of TOKEN it is.

If R1, R2, and R3 are separate tokens, then you would need to check whether it's an R-entry, and then check the next character as needed.

For step 3, you may consider splitting on spaces if each field in the line is separated by a space. Or, if each record has a fixed-width, it may be acceptable to use substring to extract each segment.
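
As a rough illustration of the steps above, the sketch below shows both variants of step 3. The class and method names (LineParser, tokenOf, splitOnSpaces, fixedField) are hypothetical, and any column positions passed to fixedField are assumptions, since the real field layout is not given in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for illustration; the real field layout is unknown.
final class LineParser {

    // Step 1: the first character of the line is the token ('H' or 'R').
    // Returns '\0' for an empty line.
    static char tokenOf(String line) {
        return line.isEmpty() ? '\0' : line.charAt(0);
    }

    // Step 3, variant A: fields are separated by one or more whitespace characters.
    static List<String> splitOnSpaces(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Step 3, variant B: a fixed-width field occupying columns [start, end).
    // The bounds are clamped so that short lines do not throw.
    static String fixedField(String line, int start, int end) {
        int from = Math.min(start, line.length());
        int to = Math.min(end, line.length());
        return line.substring(from, to).trim();
    }
}
```

For an R line, the first element returned by splitOnSpaces would be the sub-token (R1, R2, ...), which can then drive any further per-type parsing.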

I'm not sure what you mean by

My use-case requires to read a entire record set at a time.

0
On

As per @firephil's suggestion, I used the Java 8 Stream API for this requirement. I use a StringBuilder as a buffer to store the lines between one header record and the next. Finally, I get an iterator from the stream so that I can fetch one full record set (H+R1+R2+R3) from the file at a time. There is a problem with the last record set: it is only emitted when the next header arrives, so it would get lost. To work around that, I had to concatenate a fake header record to the end of the original stream. This will do for now, but I am sure there is a better way to do this.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.stream.Stream;

// Buffer holding the lines of the record set currently being read.
public static StringBuilder sbTemp;

public static Iterator<String> process(String in) throws IOException
{
    sbTemp = new StringBuilder();
    // Fake trailing header so that the last real record set gets flushed.
    Stream<String> fakeRecordStream = Stream.of("H Fake Line");
    // Note: the stream is not closed here, because the returned iterator reads from it lazily.
    Stream<String> stream = Files.lines(Paths.get(in)).sequential();
    Stream<String> finalStream = Stream.concat(stream, fakeRecordStream);

    return finalStream.map(x -> {
        if (x.startsWith("H")) {
            // New header: emit the buffered record set and start a new buffer.
            String s = sbTemp.toString();
            sbTemp = new StringBuilder();
            sbTemp.append(x);
            return s;
        }
        else {
            // Detail line: buffer it and emit an empty placeholder.
            sbTemp.append("\n").append(x);
            return "";
        }
    }).filter(line -> line.startsWith("H")).iterator(); // drops the empty placeholders
}
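
One possible way to avoid the fake header record is to wrap the line iterator in a small adapter that keeps the already-read header of the next record set in a field, so the last record set is returned when the input simply runs out. The following is only a sketch: RecordSetIterator is a made-up name, and it assumes that only header lines start with "H".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical adapter for illustration; not a library class.
final class RecordSetIterator implements Iterator<List<String>> {
    private final Iterator<String> lines;
    // First line of the next record set, already read; null when input is exhausted.
    private String pending;

    RecordSetIterator(Iterator<String> lines) {
        this.lines = lines;
        this.pending = lines.hasNext() ? lines.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public List<String> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        List<String> recordSet = new ArrayList<>();
        recordSet.add(pending);
        pending = null;
        // Collect detail lines until the next header or the end of input.
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("H")) {
                pending = line; // keep the header for the following call
                break;
            }
            recordSet.add(line);
        }
        return recordSet;
    }
}
```

With Files.lines(...), the adapter could be created from stream.iterator() inside a try-with-resources block, so that the file is closed once iteration is finished.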