Select elements from a matrix with numpy


I am trying to use numpy for fast text analysis, specifically collocation analysis. Suppose I have the following string, which I converted into a numpy array:

text = np.array(['a', 'b', 'c', 'd', 'e', 'b', 'f', 'g'])

Suppose I want to take from that array the left and right contexts of the letter 'b': say, 1 element to the left and 2 elements to the right. So I want to get something like this:

['a', 'c', 'd'] +  ['e', 'f', 'g']

Is it possible to do this with numpy, broadcasting all the operations? I did it by simply looping over the text, but that is very time consuming.

I tried np.select, np.where and np.mask

Thanks for your help :)



One possible way is to find the indices of the 'b' values (with np.where(arr == 'b')) and use them to index the adjacent values:

import numpy as np

arr = np.array(['a', 'b', 'c', 'd', 'e', 'b', 'f', 'g'])
# For each index i of 'b', take 1 element to the left and 2 to the right.
lr_contexts = [arr[[i-1, i+1, i+2]] for i in np.where(arr == 'b')[0]]
print(lr_contexts)

[array(['a', 'c', 'd'], dtype='<U1'), array(['e', 'f', 'g'], dtype='<U1')]
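
Since the question asks about broadcasting, here is a minimal sketch of the same idea without the Python-level loop. The names hits, offsets, pos and inside are just illustrative, and dropping matches whose context would fall outside the array is my assumption about the desired edge behaviour, not something the question specifies:

```python
import numpy as np

text = np.array(['a', 'b', 'c', 'd', 'e', 'b', 'f', 'g'])

hits = np.flatnonzero(text == 'b')   # positions of every 'b'
offsets = np.array([-1, 1, 2])       # 1 to the left, 2 to the right
pos = hits[:, None] + offsets        # broadcast to shape (n_hits, 3)

# Assumption: discard matches whose context would run past either end,
# instead of letting a negative index wrap around or raising IndexError.
inside = ((pos >= 0) & (pos < text.size)).all(axis=1)
contexts = text[pos[inside]]         # fancy indexing, one row per match
print(contexts)
```

This returns a single 2-D array rather than a list of 1-D arrays, which is usually more convenient if you want to keep processing the contexts with numpy.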

I believe the previous answer is the way to go if you really want to use numpy. But if it is applicable, I would suggest trying regular expressions for this kind of text pattern task. The following function solves it using the re package.

import re

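def get_text_around_char(text, char, n_left, n_right):
    matches = []
    # re.escape makes characters such as '.' or '*' match literally
    for match in re.finditer(re.escape(char), text):
        s, e = match.start(), match.end()
        # max(..., 0) stops a match near the start from producing a negative slice index
        matches.append(text[max(s - n_left, 0):s] + text[e:e + n_right])
    return matches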

print(get_text_around_char("abcdebfg", "b", 1, 2))

['acd', 'efg']


Maybe you could consider every window of 4 letters? (sliding_window_view needs NumPy 1.20 or later.)

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv

text = np.array(['a', 'b', 'c', 'd', 'e', 'b', 'f', 'g'])

arr = swv(text, 4)
out = arr[ np.ix_(      # Take from the array,
    arr[:, 1] == 'b',   # for each row where the 2nd value is a b,
    [0, 2, 3]           # the 1st, 3rd and 4th column.
)]

out:

array([['a', 'c', 'd'],
       ['e', 'f', 'g']], dtype='<U1')

arr:

array([['a', 'b', 'c', 'd'],
       ['b', 'c', 'd', 'e'],
       ['c', 'd', 'e', 'b'],
       ['d', 'e', 'b', 'f'],
       ['e', 'b', 'f', 'g']], dtype='<U1')
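
If you need other context sizes, the same window trick can be wrapped in a small helper. This is only a sketch: contexts_around and its parameters are names I made up for illustration, and note that a match closer to either end than n_left / n_right never lands in the target column of a full window, so it is silently dropped.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def contexts_around(tokens, target, n_left, n_right):
    """Return one row per occurrence of target with its left/right context.

    Hypothetical helper, not part of numpy.
    """
    width = n_left + 1 + n_right
    windows = sliding_window_view(tokens, width)   # shape (len - width + 1, width)
    rows = windows[:, n_left] == target            # target sits in column n_left
    cols = np.delete(np.arange(width), n_left)     # every column except the target
    return windows[np.ix_(rows, cols)]


text = np.array(['a', 'b', 'c', 'd', 'e', 'b', 'f', 'g'])
print(contexts_around(text, 'b', 1, 2))
```

sliding_window_view raises a ValueError if the input is shorter than the window, so very short texts need a check beforehand.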