Extract data lines from txt for each ID if number of lines for each varies

148 Views Asked by At

Suppose I have data in txt format structed as this:

|-------------------------|
| ID 1                    |
| Category A              |
| Class ABC               |
| Type 1                  |
| Subcat A                |
| positional entry 1      |
| positional entry 2      |
|-------------------------|
| ID 2                    |
| Category B              |
| Class ABC               |
| Type 2                  |
| Subcat B                |
| positional entry 1      |
| positional entry 2      |
| positional entry 3      |
|-------------------------|
| ID 3                    |
| Category A              |
| Class E                 |
| Type 4                  |
| Subcat A                |
| positional entry 1      |
|-------------------------|

The data is stored in one large txt file (approx. 7 GB) with more than 100 million rows. I want to extract only those IDs that fall into Category A and Subcat A. However, I also want the positional entries. These are not fixed, it is unknown how many lines there are. So for each ID it can vary.

I tried to go through it with opening it as txt file and going through each line. My problem here is that each time the file pointer enters a new line the information so to say is lost, although I could try to set flags that are retained.

Second approach was to extract the beginning and end of each ID in a list first. Then check the position where an ID starts that has category A and Subcat A. However, I have many rows, so storing these information in ranges with lists that have so many elements is not possible. I wanted to check then for each ID to which range it falls.

Expected output:

|-------------------------|
| ID 1                    |
| Category A              |
| Class ABC               |
| Type 1                  |
| Subcat A                |
| positional entry 1      |
| positional entry 2      |
|-------------------------|
| ID 3                    |
| Category A              |
| Class E                 |
| Type 4                  |
| Subcat A                |
| positional entry 1      |
|-------------------------|

How can I do this extraction?

Edit: positional entry 1,2 and so on just means this can be some varying entries. So these are lines with for example text entries which I need for later analysis.

Edit 2 according to Zach Young's answer:

When I adopt the code to the following:

import csv
DIVIDER = "-------------------------"

f_in = open(r"C:\myfile\testfile.txt")
block: list[str] = []

with open(r"C:\myfile\output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for line in f_in:
        line = line.replace("|", "").strip()
        if line == DIVIDER:
            if len(block) > 4 and block[1] == "Category A" and block[4] == "Subcat A":
                print("check")
                print(block)
                writer.writerow(block)

            block = []
            continue

        block.append(line)

Then I get the ouptut as in the answer below, however I do not want transposing, I would like to have the expected output as I wrote it here.

2

There are 2 best solutions below

10
Zach Young On

I see this problem as:

  1. accumulate lines of a complete block
  2. evaluate the complete block, and then conditionally print the block

...and, printing dividers at beginning and end of file, and between printed blocks.

I've created two files from the unmodified samples you shared. Both have no extraneous whitespace at the beginning; both have a final linebreak (an "empty line" at the bottom):

  • input.txt:

    |-------------------------|
    | ID 1                    |
    | Category A              |
    | Class ABC               |
    | Type 1                  |
    | Subcat A                |
    | positional entry 1      |
    | positional entry 2      |
    |-------------------------|
    | ID 2                    |
    | Category B              |
    | Class ABC               |
    | Type 2                  |
    | Subcat B                |
    | positional entry 1      |
    | positional entry 2      |
    | positional entry 3      |
    |-------------------------|
    | ID 3                    |
    | Category A              |
    | Class E                 |
    | Type 4                  |
    | Subcat A                |
    | positional entry 1      |
    |-------------------------|
    
  • target.txt:

    |-------------------------|
    | ID 1                    |
    | Category A              |
    | Class ABC               |
    | Type 1                  |
    | Subcat A                |
    | positional entry 1      |
    | positional entry 2      |
    |-------------------------|
    | ID 3                    |
    | Category A              |
    | Class E                 |
    | Type 4                  |
    | Subcat A                |
    | positional entry 1      |
    |-------------------------|
    

To start off, I want to see what capturing and printing every block looks like. Consider this simple program that just reads a block, and prints a block, but only after a complete block has been read:

f = open("input.txt")

print(next(f), end="")          # print first line (assumed to be divider)

block: list[str] = []           # create empty list to accumulate lines of a block

for line in f:
    if line.startswith("|--"):  # look for block's terminating divider
        for x in block:         # print the accumulated lines of the block
            print(x, end="")
        print(line, end="")     # print terminal divider

        block = []              # reset block
        continue                # skip to next line

    block.append(line)          # otherwise, accumulate line in block

I expect to get out exactly what I put in. I test that:

% python3 main1.py > output.txt
% diff input.txt output.txt

diff doesn't complain, so I've got at least that much correct.

Next, I'll add in the evaluation step:

f = open("input.txt")
print(next(f), end="")
block: list[str] = []

for line in f:
    if line.startswith("|--"):
        if len(block) > 4 and "Category A" in block[1] and "Subcat A" in block[4]:
            for x in block:
                print(x, end="")
            print(line, end="")

        block = []
        continue

    block.append(line)

I expect to get an output that matches target.txt. I test that:

% python3 main2.py > output.txt
% diff target.txt output.txt

Please, recreate the conditions for those tests exactly and try those two sample programs, without any modifications... not even the to the input path.

That will mean making sure the programs can access a file named input.txt. A straightforward way to do that would be to:

  1. put the scripts and input.txt in the same directory
  2. cd to that directory
  3. run the scripts with a redirect (">") to an output file, like Z:\your\path\to\python3 main1.py > output.txt, and the same for main2.py

Do not change anything/add variables to the test, yet.

Once you've verified the second script exactly produces target.txt, you can move on to trying your input text.

Because of the size of your file, to start, I recommend adding a simple counter and then halting after the program has read enough blocks that should qualify; after, say... 100 blocks?:

i = 0
for line in f:
    if line.startswith("|--"):
        i += 1
        ...
        block = []
        if i == 100:
            break
        continue
    ...

After that, maybe add in a file to write to, to avoid the prints and redirect. You don't need to do with open(...) for these programs: once the program finishes (even from an unhandled exeception), Python will close all open files.

3
Jacek Błocki On

Below example should be self contained. Replace StringIO with files (see How can I open multiple files using "with open" in Python?) and you can filter your data.

s = """
|-------------------------|
| ID 1                    |
| Category A              |
| Class ABC               |
| Type 1                  |
| Subcat B                |
| positional entry 1      |
| positional entry 2      |
|-------------------------|
| ID 2                    |
| Category B              |
| Class ABC               |
| Type 2                  |
| Subcat B                |
| positional entry 1      |
| positional entry 2      |
| positional entry 3      |
|-------------------------|
| ID 3                    |
| Category A              |
| Class E                 |
| Type 4                  |
| Subcat A                |
| positional entry 1      |
|-------------------------|
| ID 4                    |
| Category A              |
| Class Foo               |
| Type 999999             |
| Subcat A                |
| positional entry 1      |
| positional entry 2      |
| positional entry 3      |
| positional entry 4      |
| positional entry 5      |
| positional entry 6      |
|-------------------------|
"""

import re
import io
f=io.StringIO(s) # your data file
selected=io.StringIO() # filtered data

def take(data): # returns True if block selected
    c1=c2=False
    for s in data:
        if s.startswith('| Category A'): c1=True
        if s.startswith('| Subcat A'): c2=True
    return c1 and c2

    
data=[]
sep1=sep2=''
for line in f:
    if re.match('[|]-*[|]', line): # data record separator?
        sep1,sep2 = sep2,line
        if not data: continue # we are at data block begin
        if take(data):
            print('take')
            if selected.tell() == 0: # write separator before first selected record
                selected.write(sep1)
            selected.write(''.join(data))
            selected.write(sep2)
        else:
            print('toss')
        data=[]
        continue # separator not added to data
    data.append(line) # line is data line 
         

selected.seek(0)
print(selected.read())