Split several sentences in pandas dataframe


I have a pandas dataframe with a column that looks like this.

sentences
['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']
['This is the same in another row.', 'Another row another text.', 'Text in second row.', 'Last text in second row.']

Every row holds 10 sentences, each quoted with ' ' or " " and separated by commas. The column type is "str", and I was not able to convert it to a list of strings.

I want to transform the values of this dataframe so that they look like this:

[['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]

I tried something like this:

    new_splits = []
    for num in range(len(refs)):
      komma = refs[num].replace(" ", "\', \'")#regex=True)
      new_splits.append(komma)

and this:

    new_splits = []
    for num in range(len(refs)):
      splitted = refs[num].split("', '")
      new_splits.append(splitted)

Disclaimer: I need this for evaluating the BLEU score and haven't found a way to do it for this kind of dataset. Thanks in advance!


There are 2 best solutions below

1
On BEST ANSWER

You can use np.char.split in one line:

import numpy as np
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()

@Kata if the sentences column type is str, meaning the element in each row is a string instead of a list, e.g. "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']", then you need to convert the strings into lists first. One way is to use ast.literal_eval.

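from ast import literal_eval
# Parse each string cell into a real Python list of sentences.
df['sentences'] = df['sentences'].apply(literal_eval)
# Split every sentence on whitespace (np is numpy).
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()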
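
For reference, here is a minimal end-to-end sketch of the two steps above. The sample dataframe is an assumption modeled on the question (string cells that look like lists), and the intermediate names `parsed` and `separated` are only illustrative. Two things to keep in mind: the tokens keep their trailing period, because the split is on whitespace only, and the call relies on every row holding the same number of sentences so that the input forms a regular 2-D array.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Assumed sample data: each cell is a string that looks like a list.
df = pd.DataFrame({
    "sentences": [
        "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']",
        "['This is the same in another row.', 'Another row another text.', "
        "'Text in second row.', 'Last text in second row.']",
    ]
})

# Parse each string into a real Python list of sentences.
parsed = df["sentences"].apply(literal_eval)

# Split every sentence on whitespace; the result is nested as
# rows -> sentences -> words.
separated = np.char.split(parsed.tolist()).tolist()
df["separated"] = separated

print(df["separated"][0])
```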

NOTE on data: Storing a list as its string representation is not a recommended way of storing data. If possible, fix the source the data comes from. Preferably each cell should hold a plain string; failing that, an actual list, but not a string that represents a list.
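
If the source cannot be changed, one option is to reshape right after loading so that each cell holds a single sentence string. This is only a sketch under assumptions: the sample data, the name `long_df`, and the `row` column (which records the original row each sentence came from) are all illustrative and not part of the question.

```python
from ast import literal_eval

import pandas as pd

# Assumed sample data: a string cell that looks like a list.
df = pd.DataFrame({"sentences": ["['This is text.', 'Even more text.']"]})

long_df = (
    df.assign(sentences=df["sentences"].apply(literal_eval))  # str -> list
    .explode("sentences")   # one sentence per row, original index repeated
    .rename_axis("row")     # name the index so it becomes a column below
    .reset_index()
)

print(long_df)
```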

0
On

With df as your dataframe, you could try the following:

df["splitted"] = (
    df["sentences"]
    .str.strip("[]\'\"").str.split("\'. \'|\'. \"|\". \'|\". \"")
    .explode()
    .str.findall(r"\b([^ ]+?)\b")
    .groupby(level=0).agg(list)
)
  • First .strip the [, ], ', and " from the beginning and end of each row.
  • Then .split each row into a list of sentences.
  • .explode the resulting column so that every sentence gets its own row, and extract the words of each sentence into a list via .findall.
  • Finally, group the corresponding word lists back together into one list per original row.
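
To make the steps above easier to follow, here is a sketch that breaks the same chain into intermediate variables on a single short row. The sample row and the variable names (`stripped`, `sentences`, `exploded`, `words`, `result`) are assumptions for illustration; the patterns are the ones used in the chain above.

```python
import pandas as pd

# Assumed sample row mixing both quote styles.
df = pd.DataFrame({"sentences": ["""['This is text.', "Even more text."]"""]})

# Step 1: strip brackets and quotes from both ends of the string.
stripped = df["sentences"].str.strip("[]'\"")

# Step 2: split on quote + any character + space + quote between sentences.
sentences = stripped.str.split("'. '|'. \"|\". '|\". \"")

# Step 3: one sentence per row; the original row index is repeated.
exploded = sentences.explode()

# Step 4: collect the words of each sentence (the final period is dropped).
words = exploded.str.findall(r"\b([^ ]+?)\b")

# Step 5: regroup the word lists by original row index.
result = words.groupby(level=0).agg(list)

print(result.iloc[0])
```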

The result in df["splitted"] for

df = pd.DataFrame({
    "sentences": [
        """['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']""",
        """["This is the same in another row.", 'Another row another text.', 'Text in second row.', 'Last text in second row.']"""
    ]
})

is

0  [['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]
1  [['This', 'is', 'the', 'same', 'in', 'another', 'row'], ['Another', 'row', 'another', 'text'], ['Text', 'in', 'second', 'row'], ['Last', 'text', 'in', 'second', 'row']]