I'm writing a function in Python that takes the name of a text file (as a string) as input. The function should first determine how many times each word appears in the file. Later I will make a bar chart showing the frequencies of the ten most common words in the file, with a second bar next to each whose height is the frequency predicted by Zipf's Law. I already have some code for the graph, but I need help finding the most common words in the text file.
def zipf_graph(text_file):
    import string
    file = open(text_file, encoding='utf8')
    text = file.read()
    file.close()
    # the following strips out punctuation and makes the words lowercase
    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char, '')
    new_text = new_text.lower()
    text_split = new_text.split()
I'm stuck here: I'm trying to find the most common strings in a list, but I'm not sure where to go next. The following is what I tried:
    words = text_split
    most_common = max(words, key=words.count)  # only returns a single word, not a ranking
    # print(most_common)
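To count every word in one pass instead of rescanning the list with `words.count`, the standard library's `collections.Counter` is one common approach. A minimal sketch, using a made-up word list in place of the real `text_split`:

```python
from collections import Counter

# Hypothetical word list standing in for text_split from the function above
words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']

counts = Counter(words)            # maps each word to its frequency
top_ten = counts.most_common(10)   # list of (word, count), most frequent first

print(top_ten)  # [('the', 3), ('and', 2), ('cat', 1), ('dog', 1), ('bird', 1)]
```

`most_common(10)` already returns the `(word, frequency)` tuples in descending order, which is exactly the shape the sorting snippets below assume.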
I also want to add the following code, as it was suggested to help:
# Sorting a list by frequency
# Assumes you have your elements as (word, frequency) tuples
# (Useful for the zipf function)
words = [('the', 1), ('and', 1), ('test', 2)]
sorted(words, key=lambda x: x[1], reverse=True)

# "Sorting" a dictionary by frequency
# Assumes you have your elements as word:frequency
# (Useful for the zipf function)
words = dict()
words['the'] = 1
words['and'] = 1
words['test'] = 2

# This returns a list of just the most common words, without their frequencies
most_common_words = sorted(words, key=words.get, reverse=True)
# print(most_common_words)

# We can go back to the dictionary to get the frequencies
for word in most_common_words:
    print(word, words[word])
zipf_graph('fortune.txt') #name of the file I chose to use
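For the second set of bars, Zipf's Law predicts that the word of rank k appears roughly f1 / k times, where f1 is the frequency of the most common word. A sketch of how the predicted heights could be computed, with made-up counts in place of the real top ten:

```python
# Hypothetical (word, frequency) pairs standing in for the file's real top ten
top_ten = [('the', 100), ('of', 60), ('and', 45)]

f1 = top_ten[0][1]  # frequency of the most common word

# Zipf's Law: predicted frequency of the rank-k word is f1 / k (k starts at 1)
predicted = [f1 / k for k in range(1, len(top_ten) + 1)]

print(predicted)  # first two values: 100.0, 50.0
```

These predicted values can then be plotted alongside the observed frequencies as the paired bars described above.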
You can also use the nltk library: its FreqDist class counts the words for you and will give values in the format [('word', count), ...].
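One answer suggested nltk; its `FreqDist` class (a subclass of `collections.Counter`) does the counting directly. A minimal sketch, assuming nltk is installed and using a made-up word list in place of the real file's contents:

```python
from nltk import FreqDist

# Hypothetical word list standing in for text_split
words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the']

freq = FreqDist(words)
print(freq.most_common(3))  # [('the', 3), ('and', 2), ('cat', 1)]
```

Because `FreqDist` behaves like a `Counter`, `most_common(10)` gives the same `(word, count)` tuples that the plain-Counter approach would.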