Formatting .txt documents with Python

I am trying to brush up on Python & I have come to a halt with figuring out how to deal with .txt documents. I am challenging myself to make a program that can calculate GPA.

I decided to make my program a little more complicated and instead of a list I want to read a .txt and scrape the relevant information out of it and turn it into a list. I want the list in the format: (list = [Grades, Credits] list = [97.0, 3.0])

How do I target only the information I want? I have tried a couple of things, but each has its own minute issue, so I will share the code that got me closest to my end point. My code, its output, and the .txt file are as follows:

Please let me know if any other information is needed to make better sense of what I am attempting, and thank you!

Code:

digits = []
n = 0
digit_str = ""
with open("Grades.txt", "r") as g_c_txt:
    lines = g_c_txt.readlines()
    print(lines)

for line in lines:
    for i in line:
        if (i.isdecimal() or i == "." or i == "|") == True:            
            if (n == 0):
                i = ""
                n += 1
            digit_str += i

digits = digit_str.split("|")

print()
print(digits)

Output:

['Government | 97.0% | 3.0 Credits |\n', 'History | 87.0% | 3.0 Credits |\n', 'College Algebra | 76.0% | 4.0 Credits |\n', 'Trignometry | 93.0% | 3.0 Credits |\n', 'Pre-Calculus | 83.0% | 4.0 Credits |\n', 'English 1 | 75.0% | 3.0 Credits |\n', 'Calculus 1 | 56.6% | 4.0 Credits |\n', 'English 2 | 89.4% | 3.0 Credits |']

['97.0', '3.0', '', '87.0', '3.0', '', '76.04.0', '', '93.0', '3.0', '', '83.0', '4.0', '1', '75.0', '3.0', '1', '56.6', '4.0', '2', '89.4', '3.0', '']

.txt:

Government | 97.0% | 3.0 Credits |
History | 87.0% | 3.0 Credits |
College Algebra | 76.0% | 4.0 Credits |
Trignometry | 93.0% | 3.0 Credits |
Pre-Calculus | 83.0% | 4.0 Credits |
English 1 | 75.0% | 3.0 Credits |
Calculus 1 | 56.6% | 4.0 Credits |
English 2 | 89.4% | 3.0 Credits |

If I get the text using txt_text.readlines() and iterate over it with a for loop, checking each character with i.isdecimal(), it nearly works, except I don't understand how to get rid of the extra "|" separators. It also grabs "1", "1", and "2" from "English 1", "Calculus 1", and "English 2". I know I could make this easier by simplifying the .txt, but I won't always be able to choose the format I get my data in, and I want to learn how to process slightly more complicated documents.

There are 3 best solutions below

2
Suramuthu R On

Please note: line 3 of the 'a.txt' you've given reads College Algebra | 76.0% - 4.0 Credits |, which does not match the format of the other lines. Correct the typo so it reads College Algebra | 76.0% | 4.0 Credits |.

The required output format is not clear from your question, so I'm returning a multidimensional list.

with open('a.txt', 'r') as f:
    lst = f.readlines()

for i in range(len(lst)):

    # Add an extra '|' before 'Credits'
    lst[i] = lst[i].replace(' Credits', ' | Credits')

    # Convert each line into a list
    lst[i] = lst[i].split('|')

    # Remove unwanted characters
    for j in range(len(lst[i])):
        lst[i][j] = lst[i][j].replace(' ', '').replace('\n', '').replace('%', '')

l = []

#iterate thru list
for x in lst:
    # a new sublist containing 2 lists, in each list the 2nd element is value
    sub = [[x[0], int(float(x[1]))], [x[3], int(float(x[2]))]]
    l.append(sub)

print(l)

'''Output:
[[['Government', 97], ['Credits', 3]], [['History', 87], ['Credits', 3]], [['CollegeAlgebra', 76], ['Credits', 4]], [['Trignometry', 93], ['Credits', 3]], [['Pre-Calculus', 83], ['Credits', 4]], [['English1', 75], ['Credits', 3]], [['Calculus1', 56], ['Credits', 4]], [['English2', 89], ['Credits', 3]]]
'''

If you want the keys in one list and the values in another within each entry of the output, change the line that builds sub to sub = [[x[0], x[3]], [int(float(x[1])), int(float(x[2]))]]
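One thing to be aware of: int(float(...)) drops the fractional part, so 56.6 becomes 56. If the decimals matter for the GPA, a variant along the following lines keeps them and produces the flat [grade, credits] pairs mentioned in the question. This is only a sketch: two sample lines are inlined instead of being read from a.txt, and the name pairs is made up for the example.

```python
# Two sample lines inlined so the sketch runs without a.txt.
lst = [
    "Calculus 1 | 56.6% | 4.0 Credits |\n",
    "English 2 | 89.4% | 3.0 Credits |",
]

pairs = []
for line in lst:
    # Split on '|' and trim the whitespace around each cell.
    cells = [c.strip() for c in line.split("|")]
    # Drop the '%' sign and the word 'Credits', then keep the values as floats.
    grade = float(cells[1].replace("%", ""))
    credit = float(cells[2].replace("Credits", ""))
    pairs.append([grade, credit])

print(pairs)  # [[56.6, 4.0], [89.4, 3.0]]
```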

0
juanpa.arrivillaga On

When you approach any problem in programming, split your abstractions into modular units.

We know that we want to process the text file line by line. Your format is basically a CSV that uses "|" as a delimiter. We can use str.split to extract each "cell" into a list.

Then, from the second and third cells on each line, we want to extract the decimal string. We could use re to do this, but if your input is regular, a simple approach is also possible, which keeps the spirit of your original attempt. We put the allowed characters into a set, although you could just as well use a str in this particular case.
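To make the splitting step concrete, here is a small sketch of what str.split gives for one line of the sample file. The variable names line and cells are just for illustration.

```python
# One line from the sample file; the trailing "|" yields a final empty cell.
line = "College Algebra | 76.0% | 4.0 Credits |\n"

# Split on the delimiter, then trim whitespace (including the newline) from each cell.
cells = [cell.strip() for cell in line.split("|")]

print(cells)  # ['College Algebra', '76.0%', '4.0 Credits', '']
```

Because the course name lives in its own cell, the "1" in names such as "English 1" never gets mixed in with the grade or the credits.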

So we want to process each line as a delimited list, and extract decimal numbers we assume will be in the second and third cells.

DECIMAL_CHARS = set('0123456789.')
GRADE_POS = 1
CREDIT_POS = 2

def extract_number(item: str) -> str:
    """
    filters out any character that isn't a decimal digit or decimal point
    """
    result = []
    for c in item:
        if c in DECIMAL_CHARS:
            result.append(c)
    return ''.join(result)

def process_line(line: str) -> tuple[float, float]:
    cells = [cell.strip() for cell in line.split('|')]
    grade = float(extract_number(cells[GRADE_POS]))
    credits = float(extract_number(cells[CREDIT_POS]))
    return grade, credits

with open("Grades.txt") as f:
    records = [process_line(line) for line in f]

print(records)

If I were actually faced with this task, I would probably use a combination of csv and re.
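A rough sketch of what that combination might look like is below. The regular expression, the helper name first_number, and the in-memory SAMPLE standing in for the real file are all assumptions made for the example, not part of the approach above.

```python
import csv
import io
import re

# Matches an unsigned decimal such as "97.0" or "4"; the exact pattern is an assumption.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Stand-in for the real file so the example runs on its own.
SAMPLE = io.StringIO(
    "English 1 | 75.0% | 3.0 Credits |\n"
    "Calculus 1 | 56.6% | 4.0 Credits |\n"
)

def first_number(cell: str) -> float:
    """Return the first decimal number found in a cell."""
    match = NUMBER.search(cell)
    if match is None:
        raise ValueError(f"no number in {cell!r}")
    return float(match.group())

records = []
for row in csv.reader(SAMPLE, delimiter="|"):
    if len(row) < 3:
        continue  # skip blank or malformed lines
    # Only cells 1 and 2 are searched, so the "1" in "English 1" is never seen.
    records.append((first_number(row[1]), first_number(row[2])))

print(records)  # [(75.0, 3.0), (56.6, 4.0)]
```

With a real file, SAMPLE would be replaced by the object returned from open("Grades.txt", newline="").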

0
Hai Vu On

We can use the csv library to split up the fields:

import csv
import pprint

with open("Grades.txt") as stream:
    reader = csv.reader(stream, delimiter="|")
    rows = list(reader)
    pprint.pprint(rows)

Output:

[['Government ', ' 97.0% ', ' 3.0 Credits ', ''],
 ['History ', ' 87.0% ', ' 3.0 Credits ', ''],
 ['College Algebra ', ' 76.0% ', ' 4.0 Credits ', ''],
 ['Trignometry ', ' 93.0% ', ' 3.0 Credits ', ''],
 ['Pre-Calculus ', ' 83.0% ', ' 4.0 Credits ', ''],
 ['English 1 ', ' 75.0% ', ' 3.0 Credits ', ''],
 ['Calculus 1 ', ' 56.6% ', ' 4.0 Credits ', ''],
 ['English 2 ', ' 89.4% ', ' 3.0 Credits ', '']]

For each row, we can write a short function to parse it. The goal is to convert a row such as

['Government ', ' 97.0% ', ' 3.0 Credits ', '']

to

[97.0, 3.0]

That function is

def parse(row: list[str]):
    # Convert ' 97.0% ' --> 97.0
    grade = float(row[1].strip().removesuffix("%"))

    # Convert ' 3.0 Credits ' -> 3.0
    credit = float(row[2].split()[0])

    return [grade, credit]

Putting it together:

import csv
import pprint


def parse(row: list[str]):
    # Convert ' 97.0% ' --> 97.0
    grade = float(row[1].strip().removesuffix("%"))

    # Convert ' 3.0 Credits ' -> 3.0
    credit = float(row[2].split()[0])

    return [grade, credit]


with open("Grades.txt") as stream:
    reader = csv.reader(stream, delimiter="|")
    rows = list(reader)
    pprint.pprint(rows)

    output = [parse(row) for row in rows]
    pprint.pprint(output)

Output:

[['Government ', ' 97.0% ', ' 3.0 Credits ', ''],
 ['History ', ' 87.0% ', ' 3.0 Credits ', ''],
 ['College Algebra ', ' 76.0% ', ' 4.0 Credits ', ''],
 ['Trignometry ', ' 93.0% ', ' 3.0 Credits ', ''],
 ['Pre-Calculus ', ' 83.0% ', ' 4.0 Credits ', ''],
 ['English 1 ', ' 75.0% ', ' 3.0 Credits ', ''],
 ['Calculus 1 ', ' 56.6% ', ' 4.0 Credits ', ''],
 ['English 2 ', ' 89.4% ', ' 3.0 Credits ', '']]
[[97.0, 3.0],
 [87.0, 3.0],
 [76.0, 4.0],
 [93.0, 3.0],
 [83.0, 4.0],
 [75.0, 3.0],
 [56.6, 4.0],
 [89.4, 3.0]]

Notes

  • Using the csv library, we can break each line up into a row, which is a list of strings.
  • The function parse should be easy to understand: it strips away everything except the digits and the decimal point, then converts the two fields we are interested in to float.
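Once the rows are in [grade, credit] form, a credit-weighted average is a short step away. The sketch below is only an assumption about what the GPA step could look like: it averages the percentages and does not convert them to grade points, since that mapping differs between schools. The name weighted_average is hypothetical.

```python
def weighted_average(records: list[list[float]]) -> float:
    """Credit-weighted mean of the grades in [grade, credit] pairs."""
    total_credits = sum(credit for _, credit in records)
    if total_credits == 0:
        raise ValueError("no credits to average over")
    return sum(grade * credit for grade, credit in records) / total_credits

# The parsed output shown above.
output = [
    [97.0, 3.0], [87.0, 3.0], [76.0, 4.0], [93.0, 3.0],
    [83.0, 4.0], [75.0, 3.0], [56.6, 4.0], [89.4, 3.0],
]

print(round(weighted_average(output), 2))  # 80.99
```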