Using the output of one function as the input to the next to count words in a text file


I'm trying to count the number of words in a text file after "cleaning" the text so that it contains only letters and single spaces. My first function cleans up the text, and my second function should return the word count of the first function's result (the cleaned text). Here is the first function.

def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha

So words is the text with hyphens, double spaces and line feeds replaced by single spaces. Then I drop every character that is not a letter or a space (numbers, punctuation) and return the cleaned-up text as onlyAlpha. If I put return len(onlyAlpha.split()) instead of just return onlyAlpha, it gives me the correct number of words in the file (I know because I have the answer). But if I split the work into two functions, the word count comes out wrong. Here is my word-counting function:

def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())

I define newWords in main() as newWords = cleanUpWords(harper). harper is a variable that holds the result of another function that reads the file (beside the point).

def main():
    harper = readFile("Harper's Speech.txt")    #readFile function reads
    newWords = cleanUpWords(harper)
    print(numWords(harper), "Words.")

Given all of this, please tell me why it gives a different answer when I split it into two functions.

For reference, here is the version that counts the words correctly but does not separate cleaning from counting. numWords now does both, which isn't preferred.

def numWords(file):
    '''Function finds the amount of words in the text file by returning
the length of the cleaned up version of words from cleanUpWords().'''
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return len(onlyAlpha.split())

def main():
    harper = readFile("Harper's Speech.txt")
    print(numWords(harper), "Words.")

Hope I gave enough info.

On BEST ANSWER

The problem is quite simple: you split it into two functions, but you completely ignore the result of the first function and instead count the words in the original text, before the cleanup!

Change your main function to this, then it should work.

def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numWords(newWords), "Words.") # use newWords here!
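To see the difference on a small input, here is a minimal sketch. The two functions are copied from your question; the sample string is made up for illustration and stands in for whatever readFile returns.

```python
def cleanUpWords(file):
    # same cleanup as in the question
    words = file.replace("-", " ").replace("  ", " ").replace("\n", " ")
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha

def numWords(newWords):
    return len(newWords.split())

# made-up sample text, not taken from the speech file
sample = "well-known 42 facts - and more"

raw_count = numWords(sample)                   # counts "42" and "-" as words
clean_count = numWords(cleanUpWords(sample))   # counts only the cleaned words

print(raw_count, clean_count)  # 6 5
```

Passing harper to numWords gives the first number; passing newWords gives the second.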

Also, your cleanUpWords function could be improved a bit: it can still leave double or triple spaces in the text, and it could be shorter. One option is to use regular expressions:

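import re

def cleanUpWords(string):
    # replace every character that is not a letter with a space
    only_alpha = re.sub(r"[^a-zA-Z]", " ", string)
    # collapse each run of whitespace into one space (raw string keeps \s intact)
    single_spaces = re.sub(r"\s+", " ", only_alpha)
    return single_spaces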

Or you could first filter out all the unwanted characters, then split the text into words and join them back together with a single space.

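def cleanUpWords(string):
    # keep letters and whitespace; isspace() also keeps "\n" and "\t",
    # so words on adjacent lines are not glued together
    only_alpha = ''.join(c for c in string if c.isalpha() or c.isspace())
    single_spaces = ' '.join(only_alpha.split())
    return single_spaces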

Here is an example for which your original function would leave some double spaces:

>>> s = "text with    triple spaces and other \n sorts \t of strange ,.-#+ stuff and 123 numbers"
>>> cleanUpWords(s)
'text with triple spaces and other sorts of strange stuff and numbers'

(Of course, if you intend to split the words anyway, double spaces are not a problem.)
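One thing to watch when you compare against your known answer: replacing unwanted characters with a space and deleting them can give different word counts, for example around apostrophes. The sketch below illustrates this with two hypothetical helpers, clean_replace and clean_delete, written in the spirit of the two variants above; the sample string is made up.

```python
import re

def clean_replace(text):
    # non-letters become spaces, then whitespace runs are collapsed
    only_alpha = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", only_alpha)

def clean_delete(text):
    # non-letters are dropped, whitespace is kept and then normalised
    kept = "".join(c for c in text if c.isalpha() or c.isspace())
    return " ".join(kept.split())

sample = "Harper's speech"  # made-up sample

replaced = clean_replace(sample)  # apostrophe turns into a word boundary
deleted = clean_delete(sample)    # apostrophe simply disappears

print(len(replaced.split()), len(deleted.split()))  # 3 2
```

Which count is "correct" depends on how your reference answer treats such words, so pick the variant that matches it.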