Searching for structures in a continuous, unstructured file stream


I am trying to figure out a (hopefully easy) way to read a large, unstructured file without bumping into the edge of a buffer. An example is helpful here.

Imagine you are trying to do some data-recovery of a 16GB flash-drive and have saved a dump of the drive to a 16GB file. You want to scan through the image, looking for certain items of interest. If the file were smaller, you could read the entire thing into a memory buffer (let’s say 1MB) and do a simple scan through the buffer. However, because it is too big to read in all at once, you need to read it in chunks. The problem is that an item of interest may not be perfectly aligned so as to fall within a single 1MB buffer. In other words, it may end up straddling the edge of the buffer so that it starts at the end of the buffer during one read, and ends in the next one (or even further).

At one time in the past, I dealt with this by using two buffers and copying the tail of the second into the first to create a sort of sliding window. However, I imagine this is a common enough scenario that better, existing solutions must exist. I looked into memory-mapped files, thinking they would let me read the file by simply advancing an array index/pointer, but I ended up in the exact same situation as before because of the limit on the map-view size. I tried to find practical examples of using MapViewOfFile with offsets, but all I could find were contrived examples that skipped that part.
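For reference, the sliding-window idea can be done with a single buffer and an overlap: before each new read, keep the last (item length - 1) bytes of the previous chunk at the front of the buffer, so an item straddling a chunk boundary is always seen whole at least once. A minimal portable sketch (the signature `MAGIC`, the chunk size, and the function name are illustrative, not from the question):

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Scan a file for every occurrence of `needle`, reading it in fixed-size
// chunks. The last needle.size()-1 bytes of each chunk are carried over to
// the front of the buffer before the next read, so matches straddling a
// chunk boundary are still found exactly once. Returns absolute offsets.
std::vector<long long> scanFile(const char* path, const std::string& needle,
                                size_t chunkSize = 1 << 20)
{
    std::vector<long long> hits;
    if (needle.empty()) return hits;
    FILE* f = std::fopen(path, "rb");
    if (!f) return hits;

    const size_t overlap = needle.size() - 1;
    std::vector<char> buf(overlap + chunkSize);
    size_t have = 0;          // bytes currently valid in buf
    long long bufStart = 0;   // absolute file offset of buf[0]

    for (;;) {
        size_t got = std::fread(buf.data() + have, 1, chunkSize, f);
        if (got == 0) break;
        have += got;

        // search the valid region; a match can only start where the whole
        // needle fits, so nothing in the carried-over tail is counted twice
        for (size_t i = 0; i + needle.size() <= have; ++i)
            if (std::memcmp(buf.data() + i, needle.data(), needle.size()) == 0)
                hits.push_back(bufStart + (long long)i);

        // slide: keep only the last `overlap` bytes for the next round
        size_t keep = have < overlap ? have : overlap;
        std::memmove(buf.data(), buf.data() + have - keep, keep);
        bufStart += (long long)(have - keep);
        have = keep;
    }
    std::fclose(f);
    return hits;
}
```

The point of the `overlap` carry-over is that an item of length N can straddle a boundary by at most N-1 bytes, so keeping N-1 bytes is always sufficient regardless of chunk size.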

How is this situation normally handled?


If you are running in a 64-bit environment, I would just use memory-mapped files. A 64-bit process has no (reasonable) limit on virtual address space, so you can map the whole file, read it, even jump around in it, and the OS will page data in and out of physical memory for you.

Here's some basic information:

http://msdn.microsoft.com/en-us/library/ms810613.aspx

And an example of a file viewer here:

http://www.catch22.net/tuts/memory-techniques-part-1

This example works on a 2.8GB file when built for x64, but fails in a Win32 build because a 32-bit process cannot map more than 2GB of address space. It is very fast since it touches only the first and last bytes of the pBuf array. Modifying it to traverse the buffer and count, say, the number of zero bytes works as expected; you can watch the process's memory footprint grow as it scans, but that memory is only virtually allocated.

#include "stdafx.h"
#include <string>
#include <Windows.h>

TCHAR  szName[] = TEXT( "pathToFile" );   // placeholder: replace with the path to the file to map

int _tmain(int argc, _TCHAR* argv[])
{
   HANDLE hMapFile;
   char* pBuf;

   HANDLE file = CreateFile( szName, GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
   if ( file == INVALID_HANDLE_VALUE )   // CreateFile returns INVALID_HANDLE_VALUE, not NULL, on failure
   {
      _tprintf(TEXT("Could not open file object (%d).\n"), GetLastError());
      return 1;
   }

   // GetFileSize returns only the low 32 bits; for files of 4GB or more use GetFileSizeEx
   unsigned int length = GetFileSize(file, 0);

   printf( "Length = %u\n", length );


   hMapFile = CreateFileMapping( file, 0, PAGE_READONLY, 0, 0, 0 );

   if (hMapFile == NULL)
   {
      _tprintf(TEXT("Could not create file mapping object (%d).\n"), GetLastError());
      CloseHandle(file);
      return 1;
   }

   pBuf = (char*) MapViewOfFile(hMapFile,  FILE_MAP_READ, 0,0, length);

   if (pBuf == NULL)
   {
      _tprintf(TEXT("Could not map view of file (%d).\n"), GetLastError());

      CloseHandle(hMapFile);
      CloseHandle(file);

      return 1;
   }

   printf("First Byte: 0x%02x\n", (unsigned char) pBuf[0] );
   printf("Last Byte: 0x%02x\n", (unsigned char) pBuf[length-1] );
   UnmapViewOfFile(pBuf);

   CloseHandle(hMapFile);
   CloseHandle(file);

   return 0;
}
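On the question's follow-up about MapViewOfFile with offsets: the catch is that the file offset you pass (dwFileOffsetHigh/dwFileOffsetLow) must be a multiple of the system allocation granularity (dwAllocationGranularity from GetSystemInfo, typically 64KB on Windows). To view data around an arbitrary offset, round the offset down to the granularity and remember the difference. The arithmetic is plain and portable; a sketch, with the Win32 calls only indicated in comments (the helper name `alignForMapView` is mine, not a Win32 API):

```cpp
#include <cstdint>
#include <utility>

// Given an arbitrary file offset and the system allocation granularity,
// return {aligned offset to pass to MapViewOfFile,
//         bytes to skip inside the returned view to reach the original offset}.
std::pair<uint64_t, uint64_t> alignForMapView(uint64_t offset, uint64_t granularity)
{
    uint64_t aligned = offset - (offset % granularity); // round down
    return { aligned, offset - aligned };
}

// Usage on Windows (not compiled here):
//   SYSTEM_INFO si;
//   GetSystemInfo(&si);
//   auto a = alignForMapView(wantedOffset, si.dwAllocationGranularity);
//   char* view = (char*) MapViewOfFile(hMapFile, FILE_MAP_READ,
//                    (DWORD)(a.first >> 32), (DWORD)(a.first & 0xFFFFFFFF),
//                    viewSize + a.second);
//   char* p = view + a.second;   // p now points at wantedOffset
```

When scanning a huge image this way, you unmap each view and map the next one a bit before where the previous view ended, which gives the same overlap guarantee as the two-buffer approach without any copying.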