fastest way to read large text file

I am looking to pull certain groups of lines from large (~870,000,000 lines, ~4 GB) text files. As a small example, in a 50-line file I might want lines 3-6, 18-27, and 39-45. Starting from answers on SO and benchmarking a few programs against my data, Fortran 90 has given me the best results (compared with Python, bash shell commands, etc.).

My current scheme is simply to open the file, use a series of loops to move the read pointer to where I need it, and write the wanted lines to an output file.

With the above small example this would look like:

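    open(unit=1, file=fileName, status='old', action='read')
    open(unit=2, file=outFile, status='replace', action='write')

    do i=1,2
      read(1,*)                  ! skip lines 1-2
    end do
    do i=3,6
      read(1,'(A)') line         ! '(A)' reads the whole line, blanks included
      write(2,'(A)') trim(line)
    end do
    do i=7,17
      read(1,*)                  ! skip lines 7-17
    end do
    do i=18,27
      read(1,'(A)') line
      write(2,'(A)') trim(line)
    end do
    do i=28,38
      read(1,*)                  ! skip lines 28-38
    end do
    do i=39,45
      read(1,'(A)') line
      write(2,'(A)') trim(line)
    end do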

*Note: I am assuming buffered I/O is enabled when compiling, although it seems to speed things up only minimally.

I am curious whether this is the most efficient way to accomplish my task. If the above is in fact the best way to do this in Fortran 90, is there another language better suited to the task?

*Update: Made sure I was using buffered i/o, manually finding the most efficient blocksize/blockcount. That increased speed by about 7%. I should note that the files I am working with do not have a fixed record length.

There are 2 best solutions below

2
On

One should be able to do this in almost any language, so, sticking with the theme, here is something that should be close to working once you fix up any typos. (If I had a Fortran compiler on my iPad, this would be more useful.)

    PROGRAM AA
    IMPLICIT NONE
    INTEGER :: In_Unit, Out_Unit, I, ios
    LOGICAL, DIMENSION(1000) :: doIt
    CHARACTER(LEN=20) :: FileName = 'in.txt'
    CHARACTER(LEN=20) :: OutFile = 'out.txt'
    CHARACTER(LEN=80) :: line   ! lines longer than 80 characters are truncated

    open(NEWUNIT=In_Unit,  FILE=FileName, STATUS='old',     ACTION='read')
    open(NEWUNIT=Out_Unit, FILE=OutFile,  STATUS='replace', ACTION='write')

    ! Mark the line numbers to copy
    doIt        = .FALSE.
    doIt(3:6)   = .TRUE.
    doIt(18:27) = .TRUE.
    doIt(39:45) = .TRUE.

    do I = 1, 1000
      read(In_Unit, '(A)', IOSTAT=ios) line
      IF (ios /= 0) EXIT          ! stop at end of file or on a read error
      IF (doIt(I)) write(Out_Unit, '(A)') TRIM(line)
    end do

    CLOSE(In_Unit)
    CLOSE(Out_Unit)

    END PROGRAM AA
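
For comparison, the same single-pass idea can be sketched in Python. This is only an illustration and has not been benchmarked against the Fortran version. The names `extract_ranges` and `copy_ranges` are made up for this example, the ranges are assumed to be inclusive, 1-based, and non-overlapping, and instead of a fixed-size mask it walks a sorted list of ranges so that it can stop reading as soon as the last wanted line has been copied.

```python
from typing import AnyStr, Iterable, Iterator, Sequence, Tuple


def extract_ranges(lines: Iterable[AnyStr],
                   ranges: Sequence[Tuple[int, int]]) -> Iterator[AnyStr]:
    """Yield the lines whose 1-based numbers fall in any (start, end) range.

    Assumption: ranges are inclusive and do not overlap.
    """
    pending = sorted(ranges)
    if not pending:
        return
    idx = 0                          # index of the range being matched
    start, end = pending[idx]
    for lineno, line in enumerate(lines, start=1):
        if lineno < start:
            continue                 # still skipping toward the next range
        yield line
        if lineno == end:
            idx += 1
            if idx == len(pending):
                return               # last range done: stop reading early
            start, end = pending[idx]


def copy_ranges(in_path: str, out_path: str,
                ranges: Sequence[Tuple[int, int]]) -> None:
    """Copy the selected lines from in_path to out_path in one pass."""
    # Binary mode avoids decoding every skipped line to text.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        dst.writelines(extract_ranges(src, ranges))
```

With the small example from the question, the call would be `copy_ranges('in.txt', 'out.txt', [(3, 6), (18, 27), (39, 45)])`. Because the lines have no fixed length, every line before the last wanted one still has to be read; the only saving is that nothing after line 45 is touched.
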
8
On

You can also try the sed utility. Each command below prints one range of lines:

    sed '3,6!d' yourfile.txt
    sed '18,27!d' yourfile.txt

Unix utilities tend to be highly optimized and handle simple tasks like this very quickly.
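
Each of the commands above reads the file to the end, so several ranges mean several full passes. As a rough sketch (not benchmarked), all ranges can be put into a single sed program that prints each range and quits after the last wanted line. The helper names `build_sed_script` and `extract_with_sed` are hypothetical, and the sketch assumes a sed that accepts semicolon-separated commands (GNU sed does) and is available on the PATH.

```python
import subprocess
from typing import Sequence, Tuple


def build_sed_script(ranges: Sequence[Tuple[int, int]]) -> str:
    """Build a sed program printing the given inclusive, 1-based line ranges.

    Meant to be run with `sed -n`, which turns off sed's default printing.
    """
    ordered = sorted(ranges)
    if not ordered:
        raise ValueError("at least one line range is required")
    parts = [f"{start},{end}p" for start, end in ordered]
    last_line = max(end for _, end in ordered)
    parts.append(f"{last_line}q")    # quit once the last wanted line is printed
    return ";".join(parts)


def extract_with_sed(in_path: str, out_path: str,
                     ranges: Sequence[Tuple[int, int]]) -> None:
    """Run one sed pass over in_path, writing the selected lines to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(["sed", "-n", build_sed_script(ranges), in_path],
                       stdout=out, check=True)
```

For the ranges in the question this produces the program `3,6p;18,27p;39,45p;45q`, which is equivalent to running `sed -n '3,6p;18,27p;39,45p;45q' yourfile.txt` by hand.
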