Count every word in a text, rank the words by frequency, and remove unwanted symbols before printing the counts


How can I count all the words in my Wikipedia output, rank them to get the 10 most frequent words, and print them without any symbols?

import wikipedia

wikipedia.set_lang("en")
a = wikipedia.page("bitcoin")
words = a.content

print(words)

There is 1 best solution below


Since the variable words is a string, you can use the nltk library to split it into a list of words and then work on that list. Something like this:

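import nltk
from nltk.probability import FreqDist

# word_tokenize needs NLTK's tokenizer data; if it is missing, download it once
# with nltk.download('punkt') (newer NLTK versions may ask for 'punkt_tab' instead)

words_list = nltk.word_tokenize(words)      # split the text into tokens
words_frequency = FreqDist(words_list)      # maps each token to its number of occurrences
words_count = len(words_list)               # total number of tokens
words_unique_count = len(set(words_list))   # number of distinct tokens

print(words_frequency.most_common(10))      # the 10 most frequent tokens with their counts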
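
One thing to watch for: word_tokenize also returns punctuation marks such as "," and "." as separate tokens, so they tend to dominate a top 10. A minimal sketch of filtering the token list before ranking is shown here. It uses collections.Counter from the standard library as a stand-in for FreqDist, and the names rank_tokens and sample_tokens are made up for this illustration:

```python
from collections import Counter

def rank_tokens(tokens, n=10):
    """Return the n most frequent alphabetic tokens as (word, count) pairs."""
    # Keep only purely alphabetic tokens and lowercase them, so that
    # "Bitcoin" and "bitcoin" are counted as the same word.
    words_only = [tok.lower() for tok in tokens if tok.isalpha()]
    return Counter(words_only).most_common(n)

# Hypothetical token list, shaped like what a tokenizer might return.
sample_tokens = ["Bitcoin", "is", "a", "currency", ",", "and",
                 "bitcoin", "is", "digital", "."]

for word, count in rank_tokens(sample_tokens, n=3):
    print(word, count)
```

Note that isalpha() also drops numbers such as "2009" and hyphenated tokens; if those matter, a looser test (for example tok.isalnum()) may fit better.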

Now, to remove undesired words or symbols, apply a cleaning function to your string before you tokenize it. Try this:

import re

def normalize(string):
    # Delete every symbol listed in the pattern; add '|your symbol' to remove more symbols
    clean_string = re.sub(r'Ø|\+', '', string)

    return clean_string
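
Listing symbols one by one gets tedious on a long article. As a rough alternative, here is a sketch that removes every character that is not a word character or whitespace in one pass, then counts and prints the top words. It relies only on the standard library (re and collections.Counter rather than nltk), and clean_and_count and sample are hypothetical names used for the example:

```python
import re
from collections import Counter

def clean_and_count(text, n=10):
    """Return (total word count, n most frequent words) for a text."""
    # Lowercase so that capitalized and lowercase forms are merged.
    text = text.lower()
    # Replace anything that is not a word character (letter, digit,
    # underscore) or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    return len(tokens), Counter(tokens).most_common(n)

# Hypothetical sample; in the question this would be a.content.
sample = "Bitcoin (BTC) is a currency; Bitcoin is decentralized!"

total, ranking = clean_and_count(sample, n=3)
print("total words:", total)
for word, count in ranking:
    print(f"{word}: {count}")
```

Replacing symbols with a space instead of deleting them keeps words such as "peer-to-peer" from being glued together into one token.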