Take text file and create csv file


I have a larger Python 3 program that processes OCR output and does some bubble detection, and I have it mostly worked out. One function, which I got from Stack Overflow, works but has a weird side effect. Since I do not understand the code very well, I would like some help coming up with something that works the way I want.

Here is the code I am using now: Link

How it works: I have a text file we can call address.txt that looks like this:

First Name,
Address,
City State Zip,
Second Name,
Second Address,
Second City State zip,

I would like to convert that to this:

First Name, Address, City State Zip,
Second Name, Second Address, Second City State zip,

Ideally, I would write address.txt in the format I want from the start, rather than creating the file and then editing it afterwards with the above function I picked up from Stack Overflow. Here is my function that reads the images, creates the file, and adds a comma at the end of each line. If I could get it to put every three lines on one line, I would not need the above code at all.

def tess_address():
    files = os.listdir("address")
    sorted_files = sorted(files)
    for image in sorted_files:
        # read image
        output = "address/" + image
        # Pass the image through pytesseract
        text = pytesseract.image_to_string(output)
        #remove all commas
        no_comma_text = re.sub(",", "", text)
        for line in no_comma_text.splitlines():
            #print to file
            print(line + ",", file=open("address" + '.txt', 'a', encoding='utf8'))

Zach Young On

Python makes it quite easy to read a consistent number of lines per logical group.

Start by reading the whole file line-by-line, taking care to strip away the trailing linebreak; you can also replace the extraneous commas:

with open("input.txt") as f:
    lines = [x.strip() for x in f.readlines()]
    lines = [x.replace(",", "") for x in lines]
    # lines = [x.rstrip(",") for x in lines]  # alternative: remove only the trailing comma, preserving commas inside the string

print(lines)

and lines now looks like:

[
    "First Name",
    "Address",
    "City State Zip",
    "Second Name",
    "Second Address",
    "Second City State zip",
]

You can now do a simple assertion to make sure you have groups of three lines:

assert len(lines) % 3 == 0, f"len(lines)={len(lines)}; expected a multiple of 3"
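One caveat with assert is that Python skips assertions entirely when run with the -O flag. If the check should always run, an explicit exception is an option. The sketch below is only an illustration; the helper name check_groups is hypothetical and not part of the original answer:

```python
def check_groups(lines: list[str], size: int = 3) -> None:
    """Raise ValueError unless the lines divide evenly into groups of `size`."""
    remainder = len(lines) % size
    if remainder:
        raise ValueError(
            f"len(lines)={len(lines)}; expected a multiple of {size} "
            f"({remainder} line(s) left over)"
        )


# Three lines form exactly one group, so this call passes silently.
check_groups(["First Name", "Address", "City State Zip"])
```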

Then create a loop that increments an index three-at-a-time to turn each chunk of three lines into three fields in a (CSV) row:

rows: list[list[str]] = []

for i in range(0, len(lines), 3):
    rows.append(lines[i : i + 3])

print(rows)

and rows now looks like:

[
    ["First Name",  "Address",        "City State Zip"       ],
    ["Second Name", "Second Address", "Second City State zip"],
]
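The same grouping can also be wrapped in a small reusable helper. This is a minimal sketch; the name chunk is made up for illustration, and the sample data is the list from the question:

```python
def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive sublists of length `size`."""
    # Step through the list `size` items at a time and slice out each group.
    return [items[i : i + size] for i in range(0, len(items), size)]


lines = [
    "First Name", "Address", "City State Zip",
    "Second Name", "Second Address", "Second City State zip",
]
rows = chunk(lines, 3)
print(rows)
```

Note that slicing never raises on a short final group: if the input were not a multiple of three, the last row would simply have fewer fields, which is why the length check beforehand is worth keeping.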

Finally, use the csv module to write those rows to a new CSV file:

import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
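To see what the writer actually emits without creating a file, the rows can be written to an in-memory buffer instead. This is only a sketch: the second address below is a made-up value containing a comma, included to show how the csv module quotes such a field:

```python
import csv
import io

rows = [
    ["First Name", "Address", "City State Zip"],
    # Made-up address with an embedded comma, to show the quoting behavior.
    ["Second Name", "12 Main St, Apt 4", "Second City State zip"],
]

buffer = io.StringIO()  # in-memory stand-in for the output file
csv.writer(buffer).writerows(rows)
output = buffer.getvalue()
print(output)
```

The fields are joined with a bare comma (no space after it, no trailing comma), and a field that contains a comma is wrapped in double quotes so it still reads back as a single field.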

All together, without the print statements:

import csv

with open("input.txt") as f:
    lines = [x.strip() for x in f.readlines()]
    lines = [x.replace(",", "") for x in lines]


assert len(lines) % 3 == 0, f"len(lines)={len(lines)}; expected a multiple of 3"


rows: list[list[str]] = []

for i in range(0, len(lines), 3):
    rows.append(lines[i : i + 3])


with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

Following Adesoji_Alu's suggestion, you can skip the intermediate file and process the text variable directly:

lines = [x.replace(",", "") for x in text.splitlines()]

assert len(lines) % 3 == 0, f"len(lines)={len(lines)}; expected a multiple of 3"

...
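One possible way to put the direct approach together is sketched below. The helper names text_to_rows and append_rows are hypothetical, and the sample string stands in for the text that pytesseract.image_to_string returns in the question's function. Dropping blank lines is an assumption about what OCR output may contain, not something the question shows:

```python
import csv


def text_to_rows(text: str, size: int = 3) -> list[list[str]]:
    """Turn OCR text into rows of `size` fields, dropping commas and blank lines."""
    lines = [x.replace(",", "").strip() for x in text.splitlines()]
    lines = [x for x in lines if x]  # assumption: blank lines are noise, not data
    if len(lines) % size:
        raise ValueError(f"len(lines)={len(lines)}; expected a multiple of {size}")
    return [lines[i : i + size] for i in range(0, len(lines), size)]


def append_rows(rows: list[list[str]], path: str) -> None:
    """Append rows to a CSV file, opening the file only once per call."""
    with open(path, "a", newline="", encoding="utf8") as f:
        csv.writer(f).writerows(rows)


# Stand-in for the OCR result of one image (not real pytesseract output).
sample_text = (
    "First Name\nAddress\nCity State Zip\n\n"
    "Second Name\nSecond Address\nSecond City State zip\n"
)
rows = text_to_rows(sample_text)
print(rows)
```

Inside the loop of tess_address, the print-to-file line could then be replaced with something like append_rows(text_to_rows(text), "address.csv"), which also avoids reopening the output file for every line.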