Python: combine two loops into one to avoid creating intermediate lists

I have the following list:

list1 = ['aberración..', '.', 'aberrante. (Del ant. part. act. de aberrar). adj. Dicho de una cosa: Que se.', 'desvía o aparta de lo normal o usual..', 'aberrar. (Del lat. aberrare). intr. p. us. Desviarse, extraviarse, apartarse de.', 'lo normal o usual..', "abertal. (Der. del lat. apertus 'abierto'). adj. 1. Dicho de una finca rústica o.", 'de un campo: Que no está cerrado con tapia, vallado ni de otra manera. 11 2..', 'Dicho de un terreno: Que con la sequía se agrieta..', 'abertura. (Del lat. apertura). f. 1. Acción de abrir o abrirse. 11 2. Boca,.', 'hendidura, agujero. 11 3. grieta (11 hendidura en la tierra). 11 4. Terreno ancho.', 'y abierto que media entre dos montañas. 11 5. ensenada (11 parte de mar que.', 'entra en la tierra). 11 6. Der. apertura (11 acto de dar publicidad a un.', 'testamento). 11 7. Fon. Amplitud que los órganos articulatorios dejan al.']

And when I apply my code:

import os
from re import findall
from itertools import product

spanish_alphabet = 'aábcdeéfghiíjklmnñoópqrstuúvwxyz'
spanish_characters = spanish_alphabet + spanish_alphabet.upper() + ' '
characters = [''.join(i) for i in spanish_characters]
keywords = [''.join(i) for i in product(spanish_alphabet, repeat = 2)]
Keywords = [i.capitalize() for i in keywords]
keywords = keywords + Keywords

A_keywords = [i for i in keywords if i.startswith(('A', 'a', 'Á', 'á'))]

list1 = ['aberración..', '.', 'aberrante. (Del ant. part. act. de aberrar). adj. Dicho de una cosa: Que se.', 'desvía o aparta de lo normal o usual..', 'aberrar. (Del lat. aberrare). intr. p. us. Desviarse, extraviarse, apartarse de.', 'lo normal o usual..', "abertal. (Der. del lat. apertus 'abierto'). adj. 1. Dicho de una finca rústica o.", 'de un campo: Que no está cerrado con tapia, vallado ni de otra manera. 11 2..', 'Dicho de un terreno: Que con la sequía se agrieta..', 'abertura. (Del lat. apertura). f. 1. Acción de abrir o abrirse. 11 2. Boca,.', 'hendidura, agujero. 11 3. grieta (11 hendidura en la tierra). 11 4. Terreno ancho.', 'y abierto que media entre dos montañas. 11 5. ensenada (11 parte de mar que.', 'entra en la tierra). 11 6. Der. apertura (11 acto de dar publicidad a un.', 'testamento). 11 7. Fon. Amplitud que los órganos articulatorios dejan al.']

list2 = []
list3 = []

# Step 1
for i in list1:
    if i.startswith(tuple(A_keywords)):
        list2.append(i)

# Step 2           
for i in list2:
    regex = findall(r'^[^[.|,|-]+(?=[.|,])', i)
    for matching_i in regex:
        list3.append(matching_i)

print(list3)

I get the list cleaned the way I want, which is like this:

['aberración', 'aberrante', 'aberrar', 'abertal', 'abertura']

However, I would like to merge Step 1 and Step 2 and get the same result with a single for loop, without creating list2 and list3. To do that, I combined both steps like this:

for i in list1[:]:
    if not i.startswith(tuple(A_keywords)):
        list1.remove(i)

    regex = findall(r'^[^[.|,|-]+(?=[.|,])', i)
    for matching_i in regex:
        i = i.replace(i, matching_i)

However, the result is not what I want: Step 1 is carried out correctly, but Step 2 is not. Note that I am trying to modify list1 itself, which is why I iterate over list1[:], so that I don't have to create list2 and list3. This is the faulty output I get:

['aberración..', 'aberrante. (Del ant. part. act. de aberrar). adj. Dicho de una cosa: Que se.', 'aberrar. (Del lat. aberrare). intr. p. us. Desviarse, extraviarse, apartarse de.', "abertal. (Der. del lat. apertus 'abierto'). adj. 1. Dicho de una finca rústica o.", 'abertura. (Del lat. apertura). f. 1. Acción de abrir o abrirse. 11 2. Boca,.'] 

Can you help me?

There are 2 best solutions below


You can try to join list1 with a newline (\n) and use only one regex:

import re

# One pattern: an allowed two-letter prefix, then text up to the first '.' or ','
pattern = "^((?:" + "|".join(A_keywords) + ")[^[.|,|-]+(?=[.|,]))"
list1 = re.findall(pattern, "\n".join(list1), flags=re.M)

print(list1)

Prints:

['aberración', 'aberrante', 'aberrar', 'abertal', 'abertura']
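
To illustrate why the re.M flag matters in the snippet above, here is a minimal sketch. The shortened sample lines and the two-item prefix list are made up for the example and only stand in for list1 and the generated A_keywords:

```python
import re

# Hypothetical, shortened sample data (not the full list from the question)
lines = ['aberración..', 'lo normal o usual..', 'aberrar. (Del lat. aberrare).']
prefixes = ['ab', 'ac']  # stand-in for the generated A_keywords

pattern = "^((?:" + "|".join(prefixes) + ")[^[.|,|-]+(?=[.|,]))"
joined = "\n".join(lines)

# With re.M, ^ matches at the start of every line of the joined text
with_flag = re.findall(pattern, joined, flags=re.M)

# Without re.M, ^ matches only at the very start of the whole string
without_flag = re.findall(pattern, joined)

print(with_flag)     # ['aberración', 'aberrar']
print(without_flag)  # ['aberración']
```

Because the pattern contains one capturing group, findall returns the captured text rather than the whole match, which here happens to be the same thing.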

In this code:

# Step 1
for i in list1:
    if i.startswith(tuple(A_keywords)):
        list2.append(i)

# Step 2           
for i in list2:
    regex = findall(r'^[^[.|,|-]+(?=[.|,])', i)
    for matching_i in regex:
        list3.append(matching_i)

Two things happen:

  • list1 is filtered into list2, only including the elements that start with some substring from a list of options.
  • from the elements in that resulting list, matching substrings are selected.

However, in this code:

for i in list1[:]:
    if not i.startswith(tuple(A_keywords)):
        list1.remove(i)

    regex = findall(r'^[^[.|,|-]+(?=[.|,])', i)
    for matching_i in regex:
        i = i.replace(i, matching_i)

Something else happens:

  • you loop over a copy of list1
  • you remove elements from the original list if they don't meet the startswith condition
  • you then apply the same matching to that element (whether it was removed or not) and you change the value of i - which promptly gets discarded at the end of the loop.

It seems you're misunderstanding a few things:

  • reassigning a loop variable (i = ...) only rebinds the name i; it doesn't change the element stored in the list.
  • working through a loop variable only affects the original element if that element is mutable and you modify it in place; strings are immutable, so i.replace(i, matching_i) can never do that.
  • i.replace(..) doesn't change i, it only returns a new string; assigning that result back to i just rebinds the name, so neither list1 nor the copy you are looping over is updated.

Any one of these issues on its own is enough to stop your second example from working.
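
A minimal sketch of the difference between rebinding and in-place mutation, using made-up sample data (the names words and nested are just for illustration):

```python
words = ['aberración..', 'aberrar. (Del lat.']

# Rebinding the loop variable: the list keeps its old strings
for word in words:
    word = word.replace('.', '')
after_rebinding = list(words)

# Ignoring the return value of str.replace: nothing changes either
for word in words:
    word.replace('.', '')
after_ignored = list(words)

# Assigning back by index is what actually updates the list
for index, word in enumerate(words):
    words[index] = word.replace('.', '')
after_index = list(words)

# Mutable elements, by contrast, can be changed in place via the loop variable
nested = [['a'], ['b']]
for inner in nested:
    inner.append('!')

print(after_rebinding)  # ['aberración..', 'aberrar. (Del lat.']
print(after_ignored)    # ['aberración..', 'aberrar. (Del lat.']
print(after_index)      # ['aberración', 'aberrar (Del lat']
print(nested)           # [['a', '!'], ['b', '!']]
```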

What you were likely going for:

for i in list1:
    if i.startswith(tuple(A_keywords)):
        regex = findall(r'^[^[.|,|-]+(?=[.|,])', i)
        for matching_i in regex:
            list2.append(matching_i)

  • this loops over the elements of list1.
  • if an element meets the condition, the regex is applied.
  • any matches from the regex are added to the result.

However, it still has a few remaining problems:

  • calling a looping variable i when it's not really an index, but the actual value or item, is a bit confusing.
  • the operation is so straightforward that a list comprehension would be more readable and efficient.
  • A_keywords starts with a capital, which makes it look like a class instead of a variable.
  • list1 and list2 are not very descriptive, which can be confusing.

So, I'd do something like this:

from re import findall

text = ['aberración..', '.', 'aberrante. (Del ant. part. act. de aberrar). adj. Dicho de una cosa: Que se.', 'desvía o aparta de lo normal o usual..', 'aberrar. (Del lat. aberrare). intr. p. us. Desviarse, extraviarse, apartarse de.', 'lo normal o usual..', "abertal. (Der. del lat. apertus 'abierto'). adj. 1. Dicho de una finca rústica o.", 'de un campo: Que no está cerrado con tapia, vallado ni de otra manera. 11 2..', 'Dicho de un terreno: Que con la sequía se agrieta..', 'abertura. (Del lat. apertura). f. 1. Acción de abrir o abrirse. 11 2. Boca,.', 'hendidura, agujero. 11 3. grieta (11 hendidura en la tierra). 11 4. Terreno ancho.', 'y abierto que media entre dos montañas. 11 5. ensenada (11 parte de mar que.', 'entra en la tierra). 11 6. Der. apertura (11 acto de dar publicidad a un.', 'testamento). 11 7. Fon. Amplitud que los órganos articulatorios dejan al.']
a_keywords = ['aberración', 'aberrante', 'aberrar', 'abertal', 'abertura']

results = [
    match for item in text if item.startswith(tuple(a_keywords)) 
    for match in findall(r'^[^[.|,|-]+(?=[.|,])', item)
]

print(results)

Output:

['aberración', 'aberrante', 'aberrar', 'abertal', 'abertura']
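
If the goal is specifically to update the original list object rather than create a new one, one option is slice assignment. This is only a sketch: the shortened entries list and the single-item prefixes tuple are hypothetical stand-ins for the data and keywords above.

```python
from re import findall

# Hypothetical, shortened sample data
entries = ['aberración..', '.', 'aberrar. (Del lat. aberrare). intr.', 'lo normal o usual..']
prefixes = ('ab',)  # stand-in for tuple(a_keywords)

same_object = entries  # second reference, to show the list object is reused

# The comprehension is fully built first; slice assignment then replaces
# the contents of the existing list instead of binding a new list
entries[:] = [
    match
    for item in entries
    if item.startswith(prefixes)
    for match in findall(r'^[^[.|,|-]+(?=[.|,])', item)
]

print(entries)                 # ['aberración', 'aberrar']
print(same_object is entries)  # True
```

Any other variable that refers to the same list sees the cleaned contents as well, which a plain entries = [...] assignment would not give you.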