I am reading large log files using BufferedReader in Java. I have to filter the contents of the file and store the data in a database. Example:
    BufferedReader br = new BufferedReader(new FileReader("test.log"));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("filter1") || line.contains("filter2") ||
                line.contains("filter3") || line.contains("filter4") ...) {
            // creating object and storing using hibernate
        }
    }
I have more than 50 such filters, and the problem shows up with files over 100 MB: a lot of time is spent matching these filter strings.
I cannot use Collection.contains(line), because the filters in the if condition are substrings of the line read, not the whole line. The time taken is not due to I/O but to filtering the contents and creating the objects for storing.
Edit 1: filter1, filter2, etc. are placeholders for simplicity only. The actual filters are strings like "new file", "report", "removed from folder", "schema", "move", "copy", "added to queue", "unique id", etc. These are the specific keywords I check for to see whether the line contains relevant data for storing.
Please suggest a better way of achieving this.
It depends on what your filters look like. If they really were filter1, filter2, etc., then you could fold them all into one precompiled regex (and by reusing the Matcher you could also avoid a per-line allocation).
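A minimal sketch of that idea, using the numbered filter names from the question (the class name and the exact pattern are only illustrative):

    import java.util.regex.Pattern;

    class RegexFilter {
        // One pattern covering filter1, filter2, ..., filterN; compiled once and reused.
        private static final Pattern FILTERS = Pattern.compile("filter\\d+");

        static boolean matches(String line) {
            return FILTERS.matcher(line).find();
        }
    }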
You don't need an exact filter here, just something that excludes non-matching lines with high probability (and never excludes a matching line). For example, you can use 4-grams or the like, with a rolling hash.
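One possible shape of such a pre-filter (a sketch; the class and method names are mine, and it assumes every filter is at least 4 characters long):

    import java.util.HashSet;
    import java.util.Set;

    class FourGramPrefilter {
        // Rolling hashes of every 4-character window of every filter string.
        private final Set<Integer> fourGramHashes = new HashSet<>();

        FourGramPrefilter(String... filters) {
            for (String f : filters) {
                int hash = 0;
                for (int i = 0; i < f.length(); i++) {
                    hash = (hash << 8) ^ f.charAt(i); // older characters get shifted out of the 32-bit int
                    if (i >= 3) {
                        fourGramHashes.add(hash);
                    }
                }
            }
        }

        // false means the line cannot contain any filter; true means "maybe",
        // so the caller still runs the exact contains() checks.
        boolean mightMatch(String line) {
            int hash = 0;
            for (int i = 0; i < line.length(); i++) {
                hash = (hash << 8) ^ line.charAt(i);
                if (i >= 3 && fourGramHashes.contains(hash)) {
                    return true;
                }
            }
            return false;
        }
    }

Most lines are rejected after a single pass over their characters; only the rare candidates pay for the 50+ contains() calls.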
The shifting above ensures that only the last 4 characters matter. Instead of Set.contains (with its Integer boxing), you could use a primitive int collection.
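For instance, assuming a third-party primitive-collections library such as Trove (its TIntHashSet is not part of the JDK; Eclipse Collections or HPPC offer equivalents), the 4-gram set stays unboxed:

    import gnu.trove.set.hash.TIntHashSet;

    class PrimitiveFourGramSet {
        // Same hashes as above, but add() and contains() take plain ints,
        // so no Integer objects are created per lookup.
        private final TIntHashSet fourGramHashes = new TIntHashSet();

        void add(int hash)         { fourGramHashes.add(hash); }
        boolean contains(int hash) { return fourGramHashes.contains(hash); }
    }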
You could also use tries. Or you could use common substrings: the filter1/filter2 example is too short for anything really useful, but a cheap check on a substring shared by several filters (see the sketch below) could work better than testing everything separately. I guess tries should be best, but the N-grams are simpler and can work well enough.
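A sketch of the common-substring idea for the simplified filter1, filter2, ... names (the real filters would need a different shared part, or several):

    // Cheap pre-check on a substring shared by all filters; only candidate
    // lines pay for the 50+ exact contains() checks.
    if (line.contains("filter")) {
        if (line.contains("filter1") || line.contains("filter2") /* ... remaining filters ... */) {
            // create the object and store it via Hibernate
        }
    }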
In the 4-gram implementation above, I'm assuming that all filters are at least 4 characters long.