I am working on generating text from The Hobbit with Kneser-Ney smoothing. My model is generating sentences, but I believe there is room for improvement.
Currently, I am not using symbols to mark the beginning and end of each sentence. When I try to insert them with the code below, the beginning-of-sentence symbols appear only before the very first sentence; for the rest of the sentences, no symbols are inserted. It is almost as if the end of a sentence is never detected.
I tried not converting the text to lowercase but it hasn't changed anything.
Could you please advise how I can insert the end of sentence symbols?
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams

with open("hobbit.txt") as f:
    hobbit_text = f.read()

# tokenize the whole book into one flat list of lowercase tokens
hobbit_text = word_tokenize(hobbit_text.lower())
stop_words = stopwords.words('english')
personal_names = ['legolas', 'gimli', 'boromir', 'frodo', 'thorin', 'thror', 'gandalf', 'smeagol', 'gollum', 'balin', 'elrond', 'aragorn', 'bilbo', 'sauron']
signs = ['”', '“', '!', '?', '’', '`', "'", '``', ',', ";", "(", ")"]
use_stop_words = True
use_punctuation = False

# get rid of stop words, punctuation (if necessary)
if not use_stop_words:
    hobbit_text = [x for x in hobbit_text if x not in stop_words]
if not use_punctuation:
    hobbit_text = [x for x in hobbit_text if x not in signs]

vocab = set(hobbit_text)
counter = 0
hobbit_trigram = ngrams(hobbit_text, 3, pad_left=True, pad_right=True, left_pad_symbol='BOS', right_pad_symbol='EOS')

# print the first 100 trigrams
for a in hobbit_trigram:
    print(a)
    counter += 1
    if counter == 100:
        break
The output for the first sentence is shown below. I was expecting an end-of-sentence symbol after the word "gold".
('BOS', 'BOS', 'the')
('BOS', 'the', 'hobbit')
('the', 'hobbit', 'or')
('hobbit', 'or', 'there')
('or', 'there', 'and')
('there', 'and', 'back')
('and', 'back', 'again')
('back', 'again', 'j.r.r')
('again', 'j.r.r', '.')
('j.r.r', '.', 'tolkien')
('.', 'tolkien', 'the')
('tolkien', 'the', 'hobbit')
('the', 'hobbit', 'is')
('hobbit', 'is', 'a')
('is', 'a', 'tale')
('a', 'tale', 'of')
('tale', 'of', 'high')
('of', 'high', 'adventure')
('high', 'adventure', 'undertaken')
('adventure', 'undertaken', 'by')
('undertaken', 'by', 'a')
('by', 'a', 'company')
('a', 'company', 'of')
('company', 'of', 'dwarves')
('of', 'dwarves', 'in')
('dwarves', 'in', 'search')
('in', 'search', 'of')
('search', 'of', 'dragon-guarded')
('of', 'dragon-guarded', 'gold')
('dragon-guarded', 'gold', '.')
('gold', '.', 'a')
Try doing it the following way:
I tried it this way and it gave me a convenient format, the one you were asking for in the question.
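For context on why the symbols appear only once: `ngrams` pads the sequence it is given exactly once, and in the question the whole book is passed in as one flat token list, so `BOS` shows up only at the very start and `EOS` only at the very end. Padding has to be applied per sentence instead. Below is a minimal sketch of that idea, not necessarily the code this answer originally had in mind. It uses only the standard library so it runs without any NLTK data: `split_sentences` and `sentence_ngrams` are hypothetical helper names, the regex sentence splitter is a naive stand-in, and the sample text is made up for illustration.

```python
import re

BOS, EOS = "BOS", "EOS"  # same pad symbols as in the question


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer (assumption):
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_ngrams(text, n=3):
    """Yield n-grams that are padded per sentence, not once for the whole text."""
    for sentence in split_sentences(text):
        # keep words (including hyphens and apostrophes), drop punctuation
        tokens = re.findall(r"[\w'-]+", sentence.lower())
        if not tokens:
            continue
        # n - 1 pad symbols on each side of every sentence
        padded = [BOS] * (n - 1) + tokens + [EOS] * (n - 1)
        for i in range(len(padded) - n + 1):
            yield tuple(padded[i:i + n])


# made-up sample text, only to show the shape of the output
sample = "The hobbit is a tale of high adventure. A company of dwarves searches for gold."
for gram in sentence_ngrams(sample):
    print(gram)
```

With NLTK, the same structure should work by splitting the raw text with `sent_tokenize`, tokenizing each sentence with `word_tokenize`, and calling `ngrams(...)` with the padding arguments from the question on each sentence's token list. A trained sentence tokenizer also tends to handle abbreviations such as "J.R.R." better than the regex above, which would wrongly split there.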