Write lines starting with a given substring to a new file with Python

I have a txt file that I want to clean. The aim is to read the file line by line and write to a new file only the lines that start with one of a set of previously defined letter combinations (keywords), dropping all other lines.

Here is a sample of my original document (the one to be cleaned):

agigolón. (Tb. ajigolón). m. 1. El Salv., Guat., Méx. y Nic. Prisa, ajetreo. U.
m. en pl. 112. El Salv., Guat., Hond., Méx. y Nic. Apuro, aprieto. U. m. en
pl. 113. Guat., Méx. y Nic. Fatiga, cansancio.

agigotar. tr. desus. hacer gigote.
ágil. (Del lat. agilis). adj. 1. Que se mueve con soltura y rapidez. Estuvo
muy ágil y esquivó el golpe. 12. Dicho de un movimiento: Hábil y rápido.
Camina con paso ágil. 1 3. Que actúa o se desarrolla con rapidez o
prontitud. Tiene una prosa ágil.

agílibus. m. coloq. agibílibus.
agilidad. (Del lat. agilítas, -atis). f. 1. Cualidad de ágil. 12. Rel. Una de las
cuatro dotes de los cuerpos gloriosos, que consiste en la facultad de
trasladarse de un lugar a otro instantáneamente, por grande que sea la
distancia.

And here is my code:

from itertools import product

path = r'C:\Users\Usuario\Desktop'

spanish_alphabet = 'aábcdeéfghiíjklmnñoópqrstuúvwxyz'
keywords = [''.join(i) for i in product(spanish_alphabet, repeat = 2)]
Keywords = [i.capitalize() for i in keywords]
keywords = keywords + Keywords

A_keywords = [i for i in keywords if i.startswith(('A', 'a', 'Á', 'á'))]

with open(path + '\raw_text.txt', 'r', encoding='utf-8') as input_file:
    with open(path + '\clean_text.txt', 'w', encoding ='utf-8') as output_file:
        for line in input_file:
            # If line begins with given keyword, then write it in clean_text file
            if line.strip("\n").startswith(tuple(A_keywords)):
                output_file.write(line + '\n')

I do not understand why my code is not working and the resulting file is totally empty. All lines starting with 'ag' should be written in the new file. Can you help me?

There is 1 solution below

Try this (input_file.txt contains your text from the question):

from itertools import product


spanish_alphabet = "aábcdeéfghiíjklmnñoópqrstuúvwxyz"
keywords = ["".join(i) for i in product(spanish_alphabet, repeat=2)]
Keywords = [i.capitalize() for i in keywords]
keywords = keywords + Keywords

A_keywords = tuple(i for i in keywords if i.startswith(("A", "a", "Á", "á")))

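# Open both files as UTF-8 so accented letters such as "á" are read and written correctly
with open("input_file.txt", "r", encoding="utf-8") as f_in, open("output_file.txt", "w", encoding="utf-8") as f_out: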
    for line in map(str.strip, f_in):
        if line.startswith(A_keywords):
            print(line, file=f_out)

output_file.txt will contain:

agigolón. (Tb. ajigolón). m. 1. El Salv., Guat., Méx. y Nic. Prisa, ajetreo. U.
agigotar. tr. desus. hacer gigote.
ágil. (Del lat. agilis). adj. 1. Que se mueve con soltura y rapidez. Estuvo
agílibus. m. coloq. agibílibus.
agilidad. (Del lat. agilítas, -atis). f. 1. Cualidad de ágil. 12. Rel. Una de las
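
One detail in the question's code that may explain the trouble: the file names are appended as '\raw_text.txt' and '\clean_text.txt' in ordinary (non-raw) string literals. In Python, '\r' inside such a literal is a carriage return, so the first path probably does not point at raw_text.txt at all. The sketch below only builds the strings and opens no files; the folder is the one from the question, and using pathlib here is a suggestion, not something taken from the original code.

```python
from pathlib import PureWindowsPath

# Folder from the question, written as a raw string so backslashes stay literal.
folder = r'C:\Users\Usuario\Desktop'

# Non-raw literal: '\r' is interpreted as a carriage return character,
# so the resulting path does not end in "raw_text.txt".
broken = folder + '\raw_text.txt'
print(repr(broken))

# Option 1: make the file name a raw string as well.
fixed = folder + r'\raw_text.txt'
print(repr(fixed))

# Option 2: let pathlib insert the separator (PureWindowsPath works on any OS).
joined = str(PureWindowsPath(folder) / 'raw_text.txt')
print(repr(joined))
```

Separately, each line read from a file still ends with its own newline, so output_file.write(line + '\n') in the question would put an empty line after every kept line; writing line unchanged, or stripping it first as in the code above, avoids that.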