Check Values of A Dictionary for Repeating Numbers

I am trying to take a text file, pick out all the words longer than three letters, and print them in a column. I then want to match them with the line numbers they appear on, in a second column, e.g.:

Chicken 8,7
Beef    9,4,1
....

The problem is I don't want to have duplicates. Right now I have the word kings which appears in a line twice, and I only want it to print once. I am thoroughly stumped and am in need of the assistance of a wise individual.

My Code:

storyFile=open('StoryTime.txt', 'r')

def indexMaker(inputFile):
    ''
    # Will scan in each word at a time and either place in index as a key or
    # add to value.
    index = {}
    lineImOn = 0
    for line in inputFile:
        individualWord = line[:-1].split(' ')
        lineImOn+=1
        placeInList=0
        for word in individualWord:
            index.get(individualWord[placeInList])
            if( len(word) > 3): #Makes sure all words are longer then 3 letters
                if(not individualWord[placeInList] in index):
                    index[individualWord[placeInList]] = [lineImOn]

                elif(not index.get(individualWord[placeInList]) == str(lineImOn)):
                    type(index.get(individualWord[placeInList]))
                    index[individualWord[placeInList]].append(lineImOn)
            placeInList+=1

    return(index)

print(indexMaker(storyFile))

Also if anyone knows anything about making columns you would be a huge help and my new best friend.

There are 2 best solutions below

2
On BEST ANSWER

I would do this using a dictionary of sets to keep track of the line numbers. Actually, to simplify things a bit, I'd use a collections.defaultdict whose values are of type set. As mentioned in another answer, it's probably best to parse out the words using a regular expression via the re module.

from collections import defaultdict
import re

# Only process words at least a minimum number of letters long.
MIN_WORD_LEN = 3
WORD_RE = re.compile('[a-zA-Z]{%s,}' % MIN_WORD_LEN)

def make_index(input_file):
    index = defaultdict(set)

    for line_num, line in enumerate(input_file, start=1):
        for word in WORD_RE.findall(line.lower()):
            index[word].add(line_num)  # Make sure line number is in word's set.

    # Convert result into a regular dictionary of sorted tuples (sets are unordered).
    return {word: tuple(sorted(line_nums)) for word, line_nums in index.items()}

Alternative not using the re module:

from collections import defaultdict
import string

# Only process words at least a minimum number of letters long.
MIN_WORD_LEN = 3

def find_words(line, min_word_len=MIN_WORD_LEN):
    # Remove punctuation and all whitespace characters other than spaces.
    # Python 3: str.translate() takes a deletion table built with str.maketrans().
    line = line.translate(str.maketrans('', '', string.punctuation + '\t\r\n'))
    return (word for word in line.split(' ') if len(word) >= min_word_len)

def make_index(input_file):
    index = defaultdict(set)

    for line_num, line in enumerate(input_file, start=1):
        for word in find_words(line.lower()):
            index[word].add(line_num)  # Ensure line number is in word's set.

    # Convert result into a regular dictionary of sorted tuples (sets are unordered).
    return {word: tuple(sorted(line_nums)) for word, line_nums in index.items()}

Either way, the make_index() function could be used and the results output in two columns like this:

with open('StoryTime.txt', 'rt') as story_file:
    index = make_index(story_file)

longest_word = max(len(word) for word in index)
for word, line_nums in sorted(index.items()):
    print('{:<{}} {}'.format(word, longest_word, line_nums))

As a test case I used the following passage (notice the word "die" is in the last line twice):

Now the serpent was more subtle than any beast of the field which
the LORD God had made. And he said unto the woman, Yea, hath God said,
Ye shall not eat of every tree of the garden?  And the woman said
unto the serpent, We may eat of the fruit of the trees of the garden:
But of the fruit of the tree which is in the midst of the garden,
God hath said, Ye shall not eat of it, neither shall ye touch it, lest
ye die, or we all die.

And get the following results:

all     (7,)
and     (2, 3)
any     (1,)
beast   (1,)
but     (5,)
die     (7,)
eat     (3, 4, 6)
every   (3,)
field   (1,)
fruit   (4, 5)
garden  (3, 4, 5)
god     (2, 6)
had     (2,)
hath    (2, 6)
lest    (6,)
lord    (2,)
made    (2,)
may     (4,)
midst   (5,)
more    (1,)
neither (6,)
not     (3, 6)
now     (1,)
said    (2, 3, 6)
serpent (1, 4)
shall   (3, 6)
subtle  (1,)
than    (1,)
the     (1, 2, 3, 4, 5)
touch   (6,)
tree    (3, 5)
trees   (4,)
unto    (2, 4)
was     (1,)
which   (1, 5)
woman   (2, 3)
yea     (2,)
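
If the comma-separated layout from the question (e.g. "Beef    9,4,1") is preferred over printed tuples, the same column trick can be adapted. The following is only a sketch: format_index is a hypothetical helper name, not part of the code above, and it assumes the index maps each word to any iterable of line numbers.

```python
def format_index(index):
    """Return the index as aligned two-column text, one word per row."""
    if not index:
        return ''
    # Pad every word to the width of the longest one.
    width = max(len(word) for word in index)
    rows = []
    for word in sorted(index):
        # Sort the line numbers so the output is stable even if they come from a set.
        nums = ','.join(str(n) for n in sorted(index[word]))
        rows.append('{:<{}} {}'.format(word, width, nums))
    return '\n'.join(rows)

# Small hand-made sample (hypothetical data, taken from the results above).
sample = {'die': (7,), 'eat': (3, 4, 6), 'garden': (3, 4, 5)}
print(format_index(sample))
```

Returning a string instead of printing directly keeps the formatting easy to check or to write to a file.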
1
On

First of all, I would use a regex to find the words. To remove repeated line numbers, simply build a set() from the list (or use a set from the start). "Pretty" formatting is possible with str.format(), available from Python 2.6+ (other options: tabulate, clint, ..., column -t).

import re
data = {}

word_re = re.compile('[a-zA-Z]{4,}')


with open('/tmp/txt', 'r') as f:
    current_line = 1
    for line in f:
        words = re.findall(word_re, line)
        for word in words:
            if word in data:
                data[word].append(current_line)
            else:
                data[word] = [current_line]
        current_line += 1


for word, lines in data.items():
    # sorted(set(...)) drops repeated line numbers and keeps them in ascending order.
    print("{: >20} {: >20}".format(word, ", ".join(str(n) for n in sorted(set(lines)))))
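
For comparison, the duplicate check in the question's own loop can also be repaired with a small change. There the stored value is a list, so comparing it with str(lineImOn) is never equal and the line number is appended every time; a membership test on the list avoids that. This is a minimal sketch assuming Python 3; index_lines and its min_len parameter are hypothetical names used only for illustration, and words are split on whitespace without stripping punctuation.

```python
def index_lines(lines, min_len=4):
    """Map each word of at least min_len characters to the lines it appears on."""
    index = {}
    for line_num, line in enumerate(lines, start=1):
        for word in line.split():
            if len(word) < min_len:
                continue
            line_nums = index.setdefault(word, [])
            # Membership test on the list itself, not a comparison with a string.
            if line_num not in line_nums:
                line_nums.append(line_num)
    return index

# Hypothetical two-line input; "kings" appears twice on the first line.
story = ['three kings met three kings', 'the kings left']
print(index_lines(story))
```

With lists the membership test is linear in the number of stored line numbers, which is why the set-based versions scale better for long files.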