Is there a way to seek the "\n" character that is faster than looping through char one at a time?


Looking at the sample implementation of wc.c: when counting lines, it loops through the file one character at a time, incrementing a counter for each '\n' it sees:

#define COUNT(c)       \
      ccount++;        \
      if ((c) == '\n') \
        lcount++;
  • Is there a way to just seek the file for '\n' and keep jumping to the newline characters and do a count?

  • Would seeking for '\n' be the same as just reading characters one at a time until we see '\n' and count it?

There are 4 best solutions below

viraltaco_:

Well, most characters are not '\n'. A branch-less algorithm is likely to be faster.
Have you tried std::count, though?

#include <string>
#include <algorithm>

int main() {
  const auto s = std::string("Hello, World!\nfoo\nbar\nbaz");
  const auto lines_in_s = std::count(s.cbegin(), s.cend(), '\n');
  return lines_in_s;
}


Or with a file:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    if (std::ifstream is("filename.txt"); is) {
        const auto lines_in_file =
            std::count(std::istreambuf_iterator<char>(is),
                       std::istreambuf_iterator<char>{}, '\n');

        std::cout << lines_in_file << '\n';
    }
}


Christian Severin:

The only way you could skip looking at every character would be if you had domain knowledge about the string you're currently looking at:

If you knew that you're handling a text with continuous paragraphs of at least 50 words or so, you could, after each '\n', advance by 100 or 200 chars, thus saving some time. You'd need to test and refine that jump length, of course, but then you wouldn't need to check every single char.

For a general-purpose counting function, you're stuck with looking at every single char.
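A sketch of that heuristic in C (the function name and the min_gap parameter are illustrative assumptions): if you can guarantee that any two newlines are at least min_gap bytes apart, then every byte skipped right after a newline is a byte that provably cannot hold the next one.

```c
#include <stddef.h>

/* Count newlines in buf[0..len), assuming the domain knowledge that
 * any two '\n' bytes are at least min_gap positions apart.  After each
 * newline we jump min_gap bytes ahead, since the intervening bytes
 * cannot contain another newline under that assumption. */
static size_t count_lines_jump(const char *buf, size_t len, size_t min_gap) {
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        if (buf[i] == '\n') {
            count++;
            i += min_gap;   /* safe only if the gap assumption holds */
        } else {
            i++;
        }
    }
    return count;
}
```

With min_gap = 1 this degenerates into the ordinary scan; the larger the guaranteed gap, the more bytes are skipped, which is exactly the trade-off described above.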

Konstantin Svintsov:

You can use the strchr function to "jump" to the next '\n' in a string. It will be faster on some platforms, because strchr is usually implemented in assembly and uses processor instructions that can scan memory quickly where such instructions are available. Something like this:

#include <string.h>

unsigned int count_newlines(const char *str) {
   unsigned int result = 0;
   const char *s = str;
   while ((s = strchr(s, '\n')) != NULL) {
      ++result; // found one '\n'
      ++s;      // and start searching again from the next character
   }
   return result;
}
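One caveat with strchr is that it stops at the first '\0', so it only works on NUL-terminated strings. For a raw buffer of known length, memchr is the analogous call, typically vectorized in the same way in common libc implementations; a sketch:

```c
#include <string.h>
#include <stddef.h>

/* Count '\n' bytes in buf[0..len) using memchr, which takes an
 * explicit length and therefore also works on buffers containing
 * embedded '\0' bytes. */
static size_t count_newlines_mem(const char *buf, size_t len) {
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        count++;
        p++;  /* resume the search just past the newline found */
    }
    return count;
}
```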
Simon Goater:

Q: Is there a faster way to count the number of lines in a file than reading one character at a time?
A: The quick answer is no, but one can parallelize the counting, which might shorten the runtime; the program would still have to run through every byte once. Such a program may be I/O bound, so how useful parallelization is in this case depends on the hardware involved.
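A sketch of that idea with POSIX threads, assuming the whole file is already in memory (the function names are made up here): each thread counts '\n' bytes in its own chunk, and the partial counts are summed. Every byte is still examined exactly once; only the wall-clock time can shrink, and only when the work is not I/O bound.

```c
#include <pthread.h>
#include <string.h>
#include <stddef.h>

struct chunk { const char *buf; size_t len; size_t count; };

/* Thread body: count '\n' bytes in one chunk. */
static void *count_chunk(void *arg) {
    struct chunk *c = arg;
    const char *p = c->buf, *end = c->buf + c->len;
    c->count = 0;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        c->count++;
        p++;
    }
    return NULL;
}

/* Split buf into nthreads chunks, count each in its own thread,
 * and sum the partial results. */
static size_t count_lines_parallel(const char *buf, size_t len, int nthreads) {
    pthread_t tid[16];
    struct chunk chunks[16];
    if (nthreads > 16) nthreads = 16;
    size_t per = len / (size_t)nthreads;
    size_t total = 0;
    for (int i = 0; i < nthreads; i++) {
        chunks[i].buf = buf + (size_t)i * per;
        chunks[i].len = (i == nthreads - 1) ? len - (size_t)i * per : per;
        pthread_create(&tid[i], NULL, count_chunk, &chunks[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += chunks[i].count;
    }
    return total;
}
```

Because counting is a pure reduction, the chunks need no synchronization beyond the final join; each thread writes only its own partial count.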
Q: Is there a way to skip from one newline character to the next without having to read through all the bytes in between?
A: The quick answer is no, but for a really large text file one could build an 'index' file of offsets. One pass over the file is still needed to generate the index, but once it exists, the nth line can be found by reading the nth offset from the index and then 'seek'-ing to it. The index would have to be maintained or regenerated every time the file changed. With fixed-width offsets, simple arithmetic gives the position of the nth record: seek straight to it in the index, read the offset, then seek to that position in the text file. A line count can be obtained while generating the index, and once the index exists, a line count can be derived quickly from the size of the index file if it has to be computed again.
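A sketch of that index scheme in C, using fixed-width 64-bit offsets (the function names and record format are illustrative assumptions, not a standard layout): one pass writes each line-start offset to the index, after which finding line n is two seeks.

```c
#include <stdio.h>
#include <stdint.h>

/* One pass over the text file: write each line-start offset as a
 * fixed-width record to the index file.  Returns the line count. */
static long build_index(FILE *text, FILE *index) {
    uint64_t offset = 0, line_start = 0;
    long lines = 0;
    int c;
    fwrite(&line_start, sizeof line_start, 1, index); /* line 0 starts at 0 */
    while ((c = getc(text)) != EOF) {
        offset++;
        if (c == '\n') {
            lines++;
            line_start = offset;
            fwrite(&line_start, sizeof line_start, 1, index);
        }
    }
    return lines;
}

/* Seek `text` to the start of line n (0-based): fixed-width records
 * mean record n sits at byte n * sizeof(uint64_t) in the index. */
static int seek_line(FILE *text, FILE *index, long n) {
    uint64_t off;
    if (fseek(index, n * (long)sizeof off, SEEK_SET) != 0) return -1;
    if (fread(&off, sizeof off, 1, index) != 1) return -1;
    return fseek(text, (long)off, SEEK_SET);
}
```

Since every record is the same width, the index file's size divided by sizeof(uint64_t) also gives the line count (plus the final sentinel record), which is the quick recount mentioned above.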

It probably should be mentioned that the number of lines in a text file cannot always be derived from the number of '\n' bytes, because of multi-byte character encodings (UTF-8 is safe here, since the byte 0x0A never occurs inside a multi-byte sequence, but encodings such as UTF-16 are not). To count lines in the general case, one needs to scan the file character by character rather than byte by byte, and to do that, one needs to know which character encoding scheme is being used.