Search in txt file with more than 3 million lines in C#


I have several txt files, each containing more than 3 million lines. Each line describes a customer connection and includes the Customer ID, IP address....

I need to find a specific IP address and get the Customer ID related to it.

I read the file, split it into an array, and search each line in a loop, but because there are so many lines, the error below occurs.

Exception of type 'System.OutOfMemoryException' was thrown.

The txt files are compressed, so I have to decompress them first. I use the code below:

string decompressTxt = this.Decompress(new FileInfo(filePath));
char[] delRow = { '\n' };
string[] rows = decompressTxt.Split(delRow);
for (int i = 0; i < rows.Length; i++){
   if(rows[i].Contains(ip)){
    
   }
}

string Decompress(FileInfo fileToDecompress)
{
   string newFileName = "";
   string newFIleText = "";
   using (FileStream originalFileStream =fileToDecompress.OpenRead())
   {
        string currentFileName = fileToDecompress.FullName;
        newFileName = currentFileName.Remove(currentFileName.Length - fileToDecompress.Extension.Length);
    
        using (FileStream decompressedFileStream = File.Create(newFileName))
        {
               using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
               {             
                  decompressionStream.CopyTo(decompressedFileStream);
               }
         }
    
         newFIleText = File.ReadAllText(newFileName);
         File.Delete(newFileName);
    }
    return newFIleText;
}

There are 3 best solutions below

Etienne de Martel On BEST ANSWER

Okay, so there are a lot of things you're doing that aren't necessary, even before we get to how you're running out of memory.

First off, you don't need an intermediate file for decompression; just read from the GZipStream directly. But wait, did you think that you had to use File.ReadAllText to read text, and is that why you decompress the file to disk first?

That's unnecessary. When you want to read text from a stream, you can just use a StreamReader to do it (this is what File.ReadAllText uses underneath).

The reader can also be used to read line by line without having to fit the entire file in memory, just each individual line, one at a time. Just call ReadLine() until it returns null.

Putting it all together, here's code that decompresses the data and reads it one line at a time, without having to split anything. Not only does it scale with very large files, it's also much faster.

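using System.IO;
using System.IO.Compression;

// GZipStream decompresses on the fly; disposing it also closes the underlying file stream.
using var stream = new GZipStream(fileToDecompress.OpenRead(), CompressionMode.Decompress);
using var reader = new StreamReader(stream);

string? line;
// ReadLine() returns null once the end of the stream is reached.
while ((line = reader.ReadLine()) != null)
{
    if (line.Contains(ip))
    {
        // etc.
    }
}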
Charlieface On

You need to process your file as an enumerable of lines. Do not copy it to another MemoryStream or copy it into a string. StreamReader can process it line by line.

You definitely don't need another file to decompress it and then read it back.

Altogether, you never need to hold the whole file in memory at once.

foreach (var decompressTxt in this.Decompress(new FileInfo(filePath)))
{
    if (decompressTxt.Contains(ip))
    {
        // do stuff
    }
}

// Lazily yields one decompressed line at a time; the streams are disposed when enumeration ends.
IEnumerable<string> Decompress(FileInfo fileToDecompress)
{
    using var originalFileStream = fileToDecompress.OpenRead();
    using var decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress);
    using var reader = new StreamReader(decompressionStream);
    string? s;
    while ((s = reader.ReadLine()) != null)
    {
        yield return s;
    }
}
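
Because an iterator like this is lazy, it also composes with LINQ, and enumeration stops as soon as a match is found. Here is a minimal sketch that searches several compressed files in turn. The names LazySearch, ReadGzipLines and FirstMatch are made up for illustration and are not part of the code above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class LazySearch
{
    // Same idea as Decompress above, but taking a path (hypothetical helper).
    private static IEnumerable<string> ReadGzipLines(string path)
    {
        using var gzip = new GZipStream(File.OpenRead(path), CompressionMode.Decompress);
        using var reader = new StreamReader(gzip);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }

    // Returns the first line containing ip, or null if no file has one.
    // FirstOrDefault disposes the enumerator, so the open file is closed
    // as soon as a match is found and later files are never opened.
    public static string? FirstMatch(IEnumerable<string> paths, string ip) =>
        paths.SelectMany(p => ReadGzipLines(p))
             .FirstOrDefault(line => line.Contains(ip));
}
```

Usage would be something like LazySearch.FirstMatch(Directory.EnumerateFiles(folder, "*.gz"), ip), assuming the files sit in one folder and use a .gz extension.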

There are even more efficient methods, such as splitting the data into blocks of bytes and searching those directly, but this is significantly more complex.

You should probably also think about async, and about encoding.
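
To make the async and encoding points concrete, here is a rough sketch of an asynchronous variant that takes the encoding explicitly. It assumes .NET Core 3.0 or later (for IAsyncEnumerable and await using); the class and method names are invented for this example, and you would have to check which encoding your files actually use.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipLineReader
{
    // Streams decompressed lines asynchronously; only the current line is held in memory.
    public static async IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding)
    {
        // useAsync: true asks the OS for asynchronous file I/O.
        await using var file = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true);
        await using var gzip = new GZipStream(file, CompressionMode.Decompress);
        // Without an explicit encoding, StreamReader assumes UTF-8.
        using var reader = new StreamReader(gzip, encoding);

        string? line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            yield return line;
        }
    }
}
```

You would consume it with await foreach (var line in GzipLineReader.ReadLinesAsync(filePath, Encoding.UTF8)) { ... }, where UTF-8 is only a guess at the file's encoding.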

Mahesh Kumar On

You don't need to store the whole file in memory, or even in a new file.

string FindCustomer(FileInfo fileToDecompress, string ip)
{
    using var originalFileStream = fileToDecompress.OpenRead();
    using var decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress);
    using var reader = new StreamReader(decompressionStream);
    string row;
    while ((row = reader.ReadLine()) != null)
    {
        if (row.Contains(ip))
        {
            return row;
        }
    }
    return "";
}
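
One thing to watch: Contains(ip) also matches longer addresses, so searching for 10.0.0.1 would hit a line with 10.0.0.12. Below is a sketch that compares the whole IP field and returns only the Customer ID. The question doesn't show the line format, so this assumes comma-separated lines with the Customer ID first and the IP address second; adjust the delimiter and indexes to the real layout. CustomerLookup and FindCustomerId are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;

static class CustomerLookup
{
    // Assumed layout: "CustomerId,IpAddress,..." (the real delimiter and column order may differ).
    public static string? FindCustomerId(TextReader reader, string ip)
    {
        string? row;
        while ((row = reader.ReadLine()) != null)
        {
            string[] fields = row.Split(',');
            // Compare the whole field so that "10.0.0.1" does not match "10.0.0.12".
            if (fields.Length > 1 && fields[1].Trim() == ip)
            {
                return fields[0].Trim();
            }
        }
        return null; // no line had this IP
    }

    // Convenience overload that reads straight from a gzip-compressed file.
    public static string? FindCustomerId(FileInfo gzFile, string ip)
    {
        using var gzip = new GZipStream(gzFile.OpenRead(), CompressionMode.Decompress);
        using var reader = new StreamReader(gzip);
        return FindCustomerId(reader, ip);
    }
}
```

Taking a TextReader in the first overload keeps the matching logic separate from the decompression, which also makes it easy to try out on a small in-memory sample with a StringReader.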