Reading large files and filtering lines using contains() in Java


I am reading large log files using BufferedReader in Java. I have to filter the contents of the file and store the data in a database. For example:

BufferedReader br = new BufferedReader(new FileReader("test.log"));
String line;
while ((line = br.readLine()) != null) {
    if (line.contains("filter1") || line.contains("filter2") ||
        line.contains("filter3") || line.contains("filter4")...) {
        // create an object and store it using Hibernate
    }
}

I have more than 50 such filters and the problem occurs in reading files over 100 MB. A lot of time is wasted in matching these filter strings.

I cannot use Collection.contains(line) because the filters in the if condition are substrings of the line being read. The time is spent not on I/O but on filtering the contents and creating the objects to store.

Edit 1: filter1, filter2 are used only for simplicity. In actual cases, the filters would be strings like "new file", "report", "removed from folder", "schema", "move", "copy", "added to queue", "unique id", etc. These are the specific keywords that I check to see whether the line contains relevant data for storing.

Please suggest a better way for achieving the same.


There are 2 best solutions below

4
On

It depends on what your filters look like. If they really were filter1, filter2, etc., then you could use a regex like

private static final Pattern pattern = Pattern.compile("filter[0-9]");

... // in a loop
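// find() looks for the pattern anywhere in the line;
// matches() would require the whole line to match.
if (pattern.matcher(line).find()) {...}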

(You could also avoid allocating a new Matcher per line by reusing a single one via reset().) You don't need an exact filter here, just something that excludes non-matching lines with high probability while never excluding a matching line.
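As a rough illustration of avoiding the per-line allocation, here is a minimal sketch that keeps one Matcher and re-targets it with reset(). The class and method names are hypothetical, and the pattern is the same illustrative one as above, not a real filter set.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReusedMatcherFilter {
    // Illustrative pattern only; real filters need their own expression.
    private static final Pattern PATTERN = Pattern.compile("filter[0-9]");

    /** Counts the lines that contain a match anywhere in the line. */
    static int countMatches(List<String> lines) {
        // One Matcher for the whole run; reset(line) points it at the next
        // input instead of creating a new Matcher per line.
        Matcher matcher = PATTERN.matcher("");
        int count = 0;
        for (String line : lines) {
            if (matcher.reset(line).find()) {
                count++;
            }
        }
        return count;
    }
}
```

A Matcher is not safe for use by several threads at once, so this only fits a single-threaded loop like the one in the question.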

For example, you can use 4-grams or something similar, together with a rolling hash like

/// Initialization
Set<Integer> hashesOf4grams = new HashSet<>();
for (String s : filters) {
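    if (s.length() < 4) {
        // Handling for short strings is omitted here, as it is probably not needed.
        // Throwing is only a placeholder so the loop below never reads past the end.
        throw new IllegalArgumentException("filter shorter than 4 characters: " + s);
    }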
    int hash = 0;
    for (int i = 0; i < 4; ++i) {
        hash = (hash << 8) + s.charAt(i);
    }
    hashesOf4grams.add(hash);
}


/// Loop.
for (String line : lines) {
    boolean maybeMatching = false;
    int hash = 0;
    for (int i = 0; i < line.length() && !maybeMatching; ++i) {
       hash = (hash << 8) + line.charAt(i);
       maybeMatching = hashesOf4grams.contains(hash);
    }
    if (!maybeMatching) {
        continue;
    }

    // Slow test.
    boolean surelyMatching = false;
    for (String s : filters) {
        if (line.contains(s)) {
            surelyMatching = true;
            break;
        }
    }
    if (surelyMatching) {...}
}

The shifting above ensures that only the last 4 characters matter: an int has 32 bits and every step shifts the hash left by 8, so older characters fall off the top. Instead of Set.contains (with boxing), you could use some primitive collection.
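One way to get a primitive collection without an extra library is a sorted int[] searched with Arrays.binarySearch. The following is a minimal sketch under that assumption; the class name FourGramPrefilter is hypothetical. It also checks the hash only once a full window of 4 characters has been read, which is a small deviation from the loop above.

```java
import java.util.Arrays;
import java.util.List;

final class FourGramPrefilter {
    private final int[] sortedHashes;

    FourGramPrefilter(List<String> filters) {
        int[] hashes = new int[filters.size()];
        int n = 0;
        for (String s : filters) {
            if (s.length() < 4) {
                // Same assumption as the answer: every filter has at least 4 characters.
                throw new IllegalArgumentException("filter shorter than 4 characters: " + s);
            }
            int hash = 0;
            for (int i = 0; i < 4; i++) {
                hash = (hash << 8) + s.charAt(i);
            }
            hashes[n++] = hash;
        }
        Arrays.sort(hashes); // binarySearch requires a sorted array
        this.sortedHashes = hashes;
    }

    /** Returns false only if the line cannot contain any filter. */
    boolean mightMatch(String line) {
        int hash = 0;
        for (int i = 0; i < line.length(); i++) {
            hash = (hash << 8) + line.charAt(i);
            // From index 3 on, the hash covers exactly the last 4 characters.
            if (i >= 3 && Arrays.binarySearch(sortedHashes, hash) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

Lines for which mightMatch returns true still need the slow contains() test, since only the first 4 characters of each filter are compared.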

You could use tries...
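To make the trie idea a bit more concrete, here is a minimal sketch: all filters are inserted into one character trie, and each line is scanned by walking the trie from every start position. The class FilterTrie is hypothetical and assumes non-empty filters; an Aho-Corasick automaton would avoid restarting at each position, but that is beyond this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FilterTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a filter ends at this node
    }

    private final Node root = new Node();

    FilterTrie(List<String> filters) {
        for (String filter : filters) {
            Node node = root;
            for (int i = 0; i < filter.length(); i++) {
                node = node.children.computeIfAbsent(filter.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }
    }

    /** Returns true if any filter occurs as a substring of the line. */
    boolean containsAny(String line) {
        for (int start = 0; start < line.length(); start++) {
            Node node = root;
            for (int i = start; i < line.length(); i++) {
                node = node.children.get(line.charAt(i));
                if (node == null) {
                    break; // no filter continues with this character
                }
                if (node.terminal) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

All filters are checked in a single pass per start position, so the cost per line no longer grows with the number of filters in the way the chained contains() calls do.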

You could also factor out common substrings. Your example is still too short for anything useful, but something like

private static final Pattern pattern = Pattern.compile("new file|re(port|moved from folder)");

could work better than testing everything separately. I guess tries would be best, but the N-grams are simpler and can work well enough.
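If the filters live in a list, a single alternation can be built from them instead of being written by hand. This is a minimal sketch with a hypothetical helper class; Pattern.quote makes every filter match literally. It does not factor out common prefixes as in the hand-written pattern, and whether it beats repeated contains() calls is something to measure on real logs.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AlternationFilter {
    /** Builds one pattern that matches any of the given literal filters. */
    static Pattern compile(List<String> filters) {
        String regex = filters.stream()
                .map(Pattern::quote) // treat regex metacharacters as plain text
                .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }
}
```

The resulting pattern is then used with find(), for example `AlternationFilter.compile(filters).matcher(line).find()`.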

The implementation above assumes that all filters are at least 4 characters long.


3
On

In Java 8, you can use Files.lines to read a file as a Stream.

This example shows you how to use Stream to filter content, convert the entire content to upper case and return it as a List.

c://lines.txt – A simple text file for testing
line1
line2
line3
line4
line5

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TestReadFile {

    public static void main(String args[]) {

        String fileName = "c://lines.txt";
        List<String> list = new ArrayList<>();

        try (Stream<String> stream = Files.lines(Paths.get(fileName))) {

            //1. filter line 3
            //2. convert all content to upper case
            //3. convert it into a List
            list = stream
                    .filter(line -> !line.startsWith("line3"))
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

        } catch (IOException e) {
            e.printStackTrace();
        }

        list.forEach(System.out::println);

    }

}
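Applied to the keyword filters from the question, the same approach might look like the sketch below. The class and method names are hypothetical. Note that this still calls contains() once per filter, so by itself it changes how the loop is written rather than how much matching work is done.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StreamKeywordFilter {
    /** Returns the lines of the file that contain at least one filter string. */
    static List<String> matchingLines(Path file, List<String> filters) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .filter(line -> filters.stream().anyMatch(line::contains))
                    .collect(Collectors.toList());
        }
    }
}
```

Collecting into a List holds every matching line in memory; for very large logs, handling each line in forEach instead would avoid that.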