How to replace a string in Python if it occurs in more than one list?


At one stage in my code, I have a function that receives problems consisting of two or three nested lists. Each nested list holds entries of the form [word_form : word_tag]. An example of a two-sentence problem:

P1(input):

['italian:JJ', ['an:DT'], ['became:VB', ['world:NN', ['the:DT'], 
['s:PO', ['tenor:NN', ['greatest:JJ']]]]], ['.:.']]

H(input):

['was:VX', ['there:EX'], ['an:DT'], ['italian:JJ'], ['became:VB', 
['who:WP'], ['world:NN', ['the:DT'], ['s:PO', ['tenor:NN', 
['greatest:JJ']]]]], ['.:.']]

For each nested list whose tag is in ['NN', 'VB', 'JJ'], I would like to replace the word form with a variable such as X, Y, Z, etc.

If the sentence H has a word in common with P1 or P2, both occurrences must take the same variable name. For instance, if ['italian' : 'JJ'] in H becomes ['X' : 'JJ'], then 'italian' must also become 'X' in P1 or P2 (if it occurs there).

What I do at the moment is only change the forms into variables, and my variables are not X, Y, Z. I just did:

if tag in ['NN', 'VB', 'JJ']:
    form = form.upper()+'-0'

This turns the form 'italian' into 'ITALIAN-0', but I would prefer variables like [X, Y, Z, ...].

So the wanted output is something like:

P1(output):

['X:JJ', ['an:DT'], ['Y:VB', ['Z:NN', ['the:DT'], 
['s:PO', ['A:NN', ['B:JJ']]]]], ['.:.']]

H(output):

['was:VX', ['there:EX'], ['an:DT'], ['X:JJ'], ['Y:VB', 
['who:WP'], ['Z:NN', ['the:DT'], ['s:PO', ['A:NN', 
['B:JJ']]]]], ['.:.']]

Similarly for three-sentence problems, such as:

P1(input):

['want:VB', ['men:NN', ['every:DT'], ['italian:JJ']], ['be:VX', 
['to:TO'], ['a:DT'], ['great:JJ', ['tenor:VB']]]]

P2(input):

['are:VX', ['men:NN', ['some:DT'], ['italian:JJ']], ['great:JJ'], 
['tenor:VB']]

H(input):

['are:VX', ['there:EX'], ['italian:JJ'], ['men:NN', ['want:VB', 
['who:WP'], ['be:VX', ['to:TO'], ['a:DT'], ['great:JJ', ['tenor:VB']]]]]]

Becomes:

P1(output):

['Z:VB', ['Y:NN', ['every:DT'], ['X:JJ']], ['be:VX', 
['to:TO'], ['a:DT'], ['A:JJ', ['B:VB']]]]

P2(output):

['are:VX', ['men:NN', ['some:DT'], ['X:JJ']], ['A:JJ'], 
['B:VB']]

H(output):

['are:VX', ['there:EX'], ['X:JJ'], ['Y:NN', ['Z:VB', 
['who:WP'], ['be:VX', ['to:TO'], ['a:DT'], ['A:JJ', ['B:VB']]]]]]

There are 2 best solutions below


Assuming I understand your question, you could write a function like this:

next_var = 'A'
var_dict = {}

def var_map(s):
    global next_var
    if s in var_dict:
        return var_dict[s]
    var_dict[s] = next_var
    next_var = chr(ord(next_var) + 1)  # advance from the current letter, not always from 'A'
    return var_dict[s]

This maps each distinct string to a unique variable name. next_var advances each time var_map sees a new string.

You could then call this on every instance of your string. The way that I update next_var can be changed, in case you have more than 26 variables.
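
As a rough sketch of how that could look on the nested 'form:tag' lists from the question, including names beyond 26 variables: variable_names, make_var_map and replace_forms are hypothetical helper names, and the 'A'..'Z', 'AA', 'AB', ... scheme is just one possible choice. Note that this relabels every word with a matching tag, not only the words shared between sentences.

```python
import itertools
import string

def variable_names():
    """Yield 'A'..'Z', then 'AA', 'AB', ... (an illustrative naming scheme)."""
    for size in itertools.count(1):
        for letters in itertools.product(string.ascii_uppercase, repeat=size):
            yield "".join(letters)

def make_var_map():
    """Return a function that maps each distinct string to a stable name."""
    names = variable_names()
    seen = {}
    def var_map(s):
        if s not in seen:
            seen[s] = next(names)
        return seen[s]
    return var_map

def replace_forms(tree, var_map, tags=("NN", "VB", "JJ")):
    """Recursively rewrite 'form:tag' strings whose tag is in `tags`."""
    out = []
    for node in tree:
        if isinstance(node, list):
            out.append(replace_forms(node, var_map, tags))
        else:
            # Assumes the tag follows the last colon in the string.
            form, tag = node.rsplit(":", 1)
            out.append((var_map(form) if tag in tags else form) + ":" + tag)
    return out

# Shortened versions of the sentences from the question.
var_map = make_var_map()
h = ['was:VX', ['there:EX'], ['italian:JJ'], ['became:VB']]
p1 = ['italian:JJ', ['an:DT'], ['became:VB']]
h_out = replace_forms(h, var_map)
p1_out = replace_forms(p1, var_map)
```

Because the same var_map is shared across both calls, 'italian' and 'became' receive the same names in h_out and p1_out.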


For the purposes of answering the question, I'll rewrite your WORD instances with tuples. Your first example becomes:

p1 = [('italian', 'JJ'),
      [('an', 'DT')],
      [('became', 'VB'),
       [('world', 'NN'),
        [('the', 'DT')],
        [('s', 'PO'), [('tenor', 'NN'), [('greatest', 'JJ')]]]]],
      [('.', '.')]]

h = [('was', 'VX'),
     [('there', 'EX')],
     [('an', 'DT')],
     [('italian', 'JJ')],
     [('became', 'VB'),
      [('who', 'WP')],
      [('world', 'NN'),
       [('the', 'DT')],
       [('s', 'PO'), [('tenor', 'NN'), [('greatest', 'JJ')]]]]],
     [('.', '.')]]
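
If your data is still in the 'form:tag' string format from the question, a small converter along these lines might get you to the tuple form and back. This is only a sketch: to_tuples and to_strings are hypothetical helper names, and it assumes every leaf is a string with the tag after the last colon.

```python
def to_tuples(tree):
    """Convert nested lists of 'form:tag' strings into (form, tag) tuples."""
    out = []
    for node in tree:
        if isinstance(node, list):
            out.append(to_tuples(node))
        else:
            form, tag = node.rsplit(":", 1)
            out.append((form, tag))
    return out

def to_strings(tree):
    """Inverse of to_tuples: turn (form, tag) tuples back into 'form:tag'."""
    out = []
    for node in tree:
        if isinstance(node, tuple):
            out.append(node[0] + ":" + node[1])
        else:
            out.append(to_strings(node))
    return out

raw = ['italian:JJ', ['an:DT'], ['became:VB', ['world:NN']], ['.:.']]
as_tuples = to_tuples(raw)
round_trip = to_strings(as_tuples)
```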

Let's pull out the list of word forms which are common to both p1 and h. We'll define a very simple recursive depth-first tree traversal generator:

def flatten(l):
    for x in l:
        if isinstance(x, tuple):
            yield x
        else:
            for y in flatten(x):
                yield y

NB: Change tuple to WORD.

We can use this to get the words which are common to both p1 and h:

>>> common_words = set(x[0] for x in flatten(h)) & set(x[0] for x in flatten(p1))
>>> common_words
{'.', 'an', 'became', 'greatest', 'italian', 's', 'tenor', 'the', 'world'}

NB: Change x[0] here to x.form. This can be extended to get word forms in, e.g., (p1 | p2) & h. Filtering, e.g., on POS-tags, can be done inside the generator expressions: set(x[0] for x in flatten(h) if x[1] in ['NN', 'VB', 'JJ']).

Label these words with some kind of unique string value:

>>> import itertools
>>> labels = dict((x, chr(y)) for x, y in
...               zip(common_words, itertools.count(ord('A'))))
>>> labels
{'.': 'H',
 'an': 'A',
 'became': 'C',
 'greatest': 'D',
 'italian': 'I',
 's': 'B',
 'tenor': 'E',
 'the': 'G',
 'world': 'F'}
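
Since sets of strings have no guaranteed iteration order, which word gets which letter can differ from run to run. If you want a reproducible mapping, one option (an assumption on my part, not something the question asks for) is to sort the words first:

```python
import string

common_words = {'.', 'an', 'became', 'greatest', 'italian', 's',
                'tenor', 'the', 'world'}

# Sorting fixes the order, so the same word always gets the same letter.
# zip stops at the shorter input, so this silently caps at 26 labels.
labels = dict(zip(sorted(common_words), string.ascii_uppercase))
```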

And now we just need to replace instances of these words in h and p1.

We'll build another simple recursive function:

def apply_labels(l, labels):
    rv = []
    for x in l:
        if isinstance(x, tuple):
            if x[0] in labels:
                rv.append((labels[x[0]], x[1]))
            else:
                rv.append(x)
        else:
            rv.append(apply_labels(x, labels))
    return rv

And then:

>>> apply_labels(h, labels)
[('was', 'VX'),
 [('there', 'EX')],
 [('A', 'DT')],
 [('I', 'JJ')],
 [('C', 'VB'),
  [('who', 'WP')],
  [('F', 'NN'), [('G', 'DT')], [('B', 'PO'), [('E', 'NN'), [('D', 'JJ')]]]]],
 [('H', '.')]]

Rinse and repeat with p1:

>>> apply_labels(p1, labels)
[('I', 'JJ'),
 [('A', 'DT')],
 [('C', 'VB'),
  [('F', 'NN'), [('G', 'DT')], [('B', 'PO'), [('E', 'NN'), [('D', 'JJ')]]]]],
 [('H', '.')]]

Here's your second example, again expressing things as tuples:

p1 = [('want', 'VB'),
      [('men', 'NN'), [('every', 'DT')], [('italian', 'JJ')]],
      [('be', 'VX'),
       [('to', 'TO')],
       [('a', 'DT')],
       [('great', 'JJ'), [('tenor', 'VB')]]]]
p2 = [('are', 'VX'),
      [('men', 'NN'), [('some', 'DT')], [('italian', 'JJ')]],
      [('great', 'JJ')],
      [('tenor', 'VB')]]
h = [('are', 'VX'),
     [('there', 'EX')],
     [('italian', 'JJ')],
     [('men', 'NN'),
      [('want', 'VB'),
       [('who', 'WP')],
       [('be', 'VX'),
        [('to', 'TO')],
        [('a', 'DT')],
        [('great', 'JJ'), [('tenor', 'VB')]]]]]]

We do:

>>> def wordset(l):
...     return set(x[0] for x in flatten(l) if x[1] in ['NN', 'VB', 'JJ'])

>>> common_words = wordset(h) & (wordset(p1) | wordset(p2))
>>> common_words
{'great', 'italian', 'men', 'tenor', 'want'}

>>> labels = dict(zip(common_words,
...                   (chr(x) for x in itertools.count(ord('Z'), -1))))
>>> labels
{'great': 'Y', 'italian': 'Z', 'men': 'X', 'tenor': 'V', 'want': 'W'}

>>> apply_labels(p1, labels)
[('W', 'VB'),
 [('X', 'NN'), [('every', 'DT')], [('Z', 'JJ')]],
 [('be', 'VX'), [('to', 'TO')], [('a', 'DT')], [('Y', 'JJ'), [('V', 'VB')]]]]
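
Putting the pieces together, here is a minimal end-to-end sketch that names variables in the order the words first appear in h, so the result is the same on every run. label_problem and TAGS are hypothetical names, this apply_labels variant also checks the tag before replacing (unlike the one above), and it uses `yield from`, so it assumes Python 3.

```python
import string

TAGS = {"NN", "VB", "JJ"}

def flatten(tree):
    """Yield every (form, tag) tuple in depth-first order."""
    for node in tree:
        if isinstance(node, tuple):
            yield node
        else:
            yield from flatten(node)

def label_problem(hypothesis, *premises):
    """Map forms shared by the hypothesis and any premise to 'A', 'B', ..."""
    premise_forms = {form for p in premises
                     for form, tag in flatten(p) if tag in TAGS}
    labels = {}
    names = iter(string.ascii_uppercase)  # raises StopIteration past 26 names
    for form, tag in flatten(hypothesis):
        if tag in TAGS and form in premise_forms and form not in labels:
            labels[form] = next(names)
    return labels

def apply_labels(tree, labels):
    """Return a copy of the tree with labelled forms replaced."""
    out = []
    for node in tree:
        if isinstance(node, tuple):
            form, tag = node
            if tag in TAGS and form in labels:
                node = (labels[form], tag)
            out.append(node)
        else:
            out.append(apply_labels(node, labels))
    return out

# Shortened versions of the three-sentence example.
p1 = [('want', 'VB'), [('men', 'NN'), [('every', 'DT')], [('italian', 'JJ')]]]
p2 = [('are', 'VX'), [('men', 'NN'), [('some', 'DT')], [('italian', 'JJ')]],
      [('great', 'JJ')]]
h = [('are', 'VX'), [('there', 'EX')], [('italian', 'JJ')],
     [('men', 'NN'), [('want', 'VB'), [('who', 'WP')]]]]

labels = label_problem(h, p1, p2)
p1_out = apply_labels(p1, labels)
p2_out = apply_labels(p2, labels)
h_out = apply_labels(h, labels)
```

Here 'great' stays as it is in p2_out, because it does not occur in the shortened h.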