Rope data structure & Lines

651 Views Asked by At

I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.

The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.

However, how would I get where a specific line (\n for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.

  1. Don't have any extra data. Whenever you want say the 15th line, you iterate through all the characters in the Rope, find the newlines, and when you reach the 15th newline, then you stop.
  2. Store the start and length of each line in a vector. So you would have your Rope data structure containing all the characters, and then a separate std::vector<line>. The line structure would just consist of 2 fields; start and length. Start represents where the line starts inside of the Rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start

Problems:

#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.

#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line, you have to move all the lines ahead of it, which is O(N). Also, storing this means that for every line you have, it takes up an extra 16 bytes of data. (assuming start and length are 8 bytes each). That means if you have 13,000,000 lines, it would take up 200MB of extra memory. You could use a linked list, but it just makes the access slow.

Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n)) for inserting & accessing lines)

I was thinking of using a BST, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode do this but with a PieceTable instead.

Any help would be greatly appreciated.

EDIT:

The answer that @interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?

I also noticed ropey, which is a rust library for the Rope. I was wondering if there was something similar but for C++.

1

There are 1 best solutions below

6
On

In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also put the total number of newlines contained in the subtree.

Then finding a specific newline will work exactly the same way as finding the node holding a specific character index. You would look at the "number of newlines" field instead of the "number of characters" field.

All rope operations will work mostly the same. When creating a new internal node, you just need to add its children's number of newlines. Complexity of all operations is the same.