Most common sentences extractions with count using Python

Question

Asked by DJKarma on 19 November 2018 at 12:57 (303 views)

I want to write a Python script that searches all rows of an Excel file and returns the top 10 most common sentences. So far I have written a basic n-gram count for a txt file.

The file contains CSV text in which dj is best appears 4 times and gd is cool appears 3 times.

import nltk
import pandas as pd

file = open('dj.txt', encoding="utf8")
text= file.read()
length = [3]
ngrams_count = {}
for n in length:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})
ngrams_count
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())), 
                  columns=['Ngramm', 'Count']).sort_values(['Count'], 
                                                           ascending=False)
df

Output -

   Ngramm  Count
1                      is best,dj is      4
3                      is cool,gd is      2
21                     is best,gd is      2
25                best,dj is Best,dj      1
19                    not cool,dj is      1
20                cool,dj is best,gd      1
22                best,gd is cool,dj      1
23                     is cool,dj is      1
24                cool,dj is best,dj      1
0                      dj is best,dj      1
18                    is not cool,dj      1
27                Best,dj is best,dj      1
28                best,dj is best,dj      1
29                best,dj is best,gd      1
30                best,gd is cool,gd      1
31                cool,gd is COOL,gd      1
32                     is COOL,gd is      1
26                     is Best,dj is      1
17                    good,dj is not      1
16                    not good,dj is      1
15                    is not good,dj      1
14                  better,dj is not      1
13                   is better,dj is      1
12         good,sandeep is better,dj      1
11                is good,sandeep is      1
10    excellent,prem is good,sandeep      1
9               is excellent,prem is      1
8   superb,sandeep is excellent,prem      1
7               is superb,sandeep is      1
6        best,prem is superb,sandeep      1
5                    is best,prem is      1
4               cool,gd is best,prem      1
2                 best,dj is cool,gd      1
33                   COOL,gd is cool      1

First, it shows a count of 2 for gd is cool, and I can't figure out why. Second, I want to sort this output so that it looks something like this:

Ngramm  Count
dj is best   4
gd is cool   3
....and so on....

Then I want to do the same for an Excel file, row by row.

I am really new at this. Can anyone point me in the right direction?

There are 2 best solutions below

**tevemadar** · Answer 1 · 2018-11-19T13:25:13.723000

As you can see, text.split(' ') does not split on punctuation such as commas.
A quick and dirty fix for this particular data (where the only punctuation seems to be commas, none of them followed by whitespace) would be to write:

text.replace(',',' ').split(' ')

>>> "a b,c".split(' ')
['a', 'b,c']                                 # <--- 2 elements
>>> "a b,c".replace(',',' ').split(' ')
['a', 'b', 'c']                              # <--- 3 elements

In the long run you may want to learn about regular expressions, which can be a painful experience, but this case is easy:

>>> import re
>>> re.split("[ ,]+","a b,c")
['a', 'b', 'c']
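
To see what this changes in the trigram counts, here is a minimal sketch. The sample string is made up to stand in for the contents of dj.txt, and trigram_counts is a hypothetical helper, not part of the question's code.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the contents of dj.txt.
text = "dj is best,dj is best,gd is cool,gd is cool"


def trigram_counts(tokens):
    """Count every run of three consecutive tokens."""
    return Counter(" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:]))


# Splitting on spaces only leaves tokens such as 'best,dj' glued together.
naive = trigram_counts(text.split(" "))

# Splitting on runs of spaces and commas separates every word.
fixed = trigram_counts(re.split("[ ,]+", text))

print(naive.most_common(3))
print(fixed.most_common(3))
```

With the space-only split, "dj is best" never appears as a trigram because its last word is fused with the next cell. After the fix it is counted, but note that trigrams spanning two sentences (such as "is best dj") are still counted too.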

**alexis** · Answer 2 · 2018-11-19T20:32:51.383000

Since this is a csv file, please do yourself a favor and parse the csv first! Then take the contents and process them any way you want. But your data seems to contain one "sentence" per cell, so if your goal is to find the most common sentence, why are you throwing tokenization and ngrams at this task?

import csv
from collections import Counter
with open('dj.txt', encoding="utf8") as handle:
    sentcounts = Counter(cell for row in csv.reader(handle) for cell in row)

print("Frequency  Sentence")
for sent, freq in sentcounts.most_common(5):
    print("%9d"%freq, sent)

If you did want the tokens you could just use split() in this simple case, but for more realistic text use nltk.word_tokenize(), which knows all about punctuation.
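
The output in the question also contains variants such as Best and COOL, which are counted as separate strings and may be why gd is cool shows up fewer times than expected. A minimal sketch of the same per-cell count, assuming that case and surrounding whitespace should be ignored; the in-memory sample is made up and stands in for dj.txt:

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory stand-in for dj.txt; a real file would be
# opened with open('dj.txt', encoding="utf8") instead.
sample = io.StringIO(
    "dj is best,dj is Best,gd is cool\n"
    "gd is COOL,dj is best,gd is cool\n"
)

# Normalize each cell (assumption: case and outer whitespace do not matter)
# and skip empty cells before counting.
counts = Counter(
    cell.strip().lower()
    for row in csv.reader(sample)
    for cell in row
    if cell.strip()
)

print("Frequency  Sentence")
for sent, freq in counts.most_common(10):
    print("%9d" % freq, sent)
```

For an Excel file the counting step would stay the same; only the source of the rows would change, for example to the rows of a sheet loaded with pandas.read_excel.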
