I am trying to generate bigrams with nltk.ngrams, but I get RuntimeError: generator raised StopIteration. How can I fix this error for my data?
My dataframe df has several columns, of which only two matter here: FonctionsStagiaire and ExigencesParticulieres. Both hold lists of tokens. This is what they look like before the unigrams and bigrams are generated:
| FonctionsStagiaire | ExigencesParticulieres |
|---|---|
| [fall, 2022, human, resources, training, inter... | [required, skills, what, you, need, to, succee... |
| [specifically, seeking, software, engineers, r... | [for, game, mechanics, core, engine, tools, li... |
| [what, do, you, value, in, a, career, at, agni... | [bachelor's, degree, student, electrical, engi... |
| [your, contribution, reporting, reliability, s... | [fluent, english, speaking, skills, able, swim... |
Here is the output of df[['FonctionsStagiaire', 'ExigencesParticulieres']].head(4).to_dict():
{'FonctionsStagiaire': {21: ['fall',
'2022',
'human',
'resources',
'training',
'internship',
'remote',
'mea01229',
'what',
'do',
'you',
'value',
'in',
'a',
'career',
'at',
'agnico',
'eagle',
'values',
'never',
'waver',
'we',
'believe',
'trust',
'respect',
'equality',
'family',
'responsibility',
'why',
'because',
'express',
'helped',
'us',
'succeed',
'business',
'60',
'years',
'about',
'meadowbank',
'our',
'nunavut',
'operations',
'agnico',
'eagle',
'always',
'looking',
'new',
'talented',
'team',
'members',
'join',
'nunavut',
'mining',
'operations',
'we',
'operating',
'meadowbank',
'first',
'low',
'arctic',
'mine',
'near',
'baker',
'lake',
'nine',
'years',
'the',
'mine',
'produced',
'three',
'millionth',
'ounce',
'gold',
'2018',
'2019',
'marked',
'last',
'year',
'production',
'meadowbank',
'mine',
'since',
'transitioned',
'process',
'ore',
'amaruq',
'satellite',
'deposit',
'with',
'official',
'opening',
'amaruq',
'whale',
'tail',
'project',
'august',
'2019',
'project',
'referred',
'meadowbank',
'complex',
'your',
'contribution',
'reporting',
'training',
'coordinator',
'training',
'intern',
'part',
'people',
'development',
'department',
'collaborates',
'departments',
'mine',
'shehe',
'ensure',
'goals',
'objectives',
'achieved',
'promoting',
'respecting',
'agnico',
"eagle's",
'culture',
'health',
'safety',
'code',
'conduct',
'environment',
'coordinate',
'manage',
'projects',
'related',
'training',
'northern',
'mining',
'environment',
'develop',
'modify',
'training',
'content',
'assist',
'planning',
'tracking',
'training',
'activities',
'develop',
'maintain',
'effective',
'training',
'materials',
'assist',
'management',
'elearning',
'training',
'platform',
'participate',
'creation',
'training',
'practices',
'procedures',
'your',
'work',
'schedule',
'schedule',
'14',
'days',
'work',
'12',
'hour',
'shifts',
'followed',
'14',
'days',
'transportation',
'rest',
'flights',
'departing',
'communities',
'kivalliq',
'region',
'mirabel',
"vald'or",
'quebec',
'travel',
'room',
'board',
'provided',
'agnico',
'eagle',
'to',
'apply',
'position',
'please',
'use',
'following',
'url',
'httpsars2equestcomresponse_id1cf099f2f50a501d123d332ae1931084'],
22: ['fall',
'2022',
'mechanical',
'engineering',
'mobile',
'maintenance',
'planner',
'intern',
'remote',
'mea01233',
'your',
'contribution',
'reporting',
'reliability',
'specialist',
'part',
'maintenance',
'department',
'collaborates',
'departments',
'mine',
'heshe',
'ensure',
'goals',
'objectives',
'achieved',
'promoting',
'respecting',
'agnico',
"eagle's",
'culture',
'health',
'safety',
'code',
'conduct',
'environment',
'your',
'task',
'be',
'optimize',
'implement',
'preventive',
'maintenance',
'plans',
'monitor',
'oil',
'samples',
'fleet',
'schedule',
'preventive',
'replacement',
'benchmarked',
'components',
'update',
'prediction',
'report',
'be',
'responsible',
'various',
'optimization',
'projects',
'maintenance',
'department',
'monitor',
'23',
'new',
'long',
'haul',
'trucks',
'mine',
'acquired',
'support',
'reliability',
'specialist',
'different',
'tasks',
'your',
'work',
'schedule',
'schedule',
'14',
'days',
'work',
'followed',
'14',
'days',
'transportation',
'rest',
'flights',
'departing',
'communities',
'kivalliq',
'region',
'mirabel',
"vald'or",
'quebec',
'to',
'apply',
'position',
'please',
'use',
'following',
'url',
'httpsars2equestcomresponse_ide5baa1f06dae8d5dcf05e2b228680085'],
23: ['fall',
'2022',
'mine',
'engineering',
'drill',
'blast',
'internship',
'remote',
'mea01235',
'your',
'contribution',
'reporting',
'production',
'engineering',
'coordinator',
'production',
'engineering',
'intern',
'part',
'engineering',
'department',
'collaborates',
'departments',
'mine',
'shehe',
'ensure',
'goals',
'objectives',
'achieved',
'promoting',
'respecting',
'agnico',
"eagle's",
'culture',
'health',
'safety',
'code',
'conduct',
'environment',
'there',
'two',
'primary',
'tasks',
'engineering',
'intern',
'the',
'first',
'task',
'position',
'fulfilling',
'quality',
'assurancequality',
'control',
'qaqc',
'duties',
'drill',
'blast',
'by',
'performing',
'qaqc',
'drill',
'blast',
'patterns',
'field',
'henceforth',
'referred',
'qaqc',
'drilling',
'loading',
'collecting',
'compiling',
'qaqc',
'data',
'communicating',
'qaqc',
'data',
'engineering',
'team',
'the',
'second',
'task',
'position',
'performing',
'fragmentation',
'analysis',
'drill',
'blast',
'engineers',
'by',
'taking',
'pictures',
'muck',
'faces',
'field',
'performing',
'split',
'desktop',
'analysis',
'pictures',
'communicating',
'fragmentation',
'results',
'engineering',
'team',
'primary',
'duties',
'tracking',
'progression',
'drilling',
'mucking',
'morning',
'meeting',
'ensure',
'priorities',
'qaqc',
'fragmentation',
'analysis',
'met',
'quality',
'assurancequality',
'control',
'drilling',
'loading',
'practices',
'field',
'fragmentation',
'analysis',
'active',
'mucking',
'faces',
'promote',
'health',
'safety',
'participating',
'monthly',
'departmental',
'hs',
'meeting',
'secondary',
'duties',
'provide',
'technical',
'support',
'drill',
'blast',
'engineer',
'blast',
'optimization',
'project',
'proposals',
'or',
'needsinterests',
'floor',
'analysis',
'muck',
'floor',
'water',
'presence',
'drill',
'patterns',
'loading',
'statistics',
'powder',
'factor',
'analysis',
'provide',
'relief',
'support',
'mine',
'clerk',
'vacations',
'special',
'projects',
'according',
'needs',
'engineering',
'mine',
'department',
'providing',
'relief',
'support',
'engineering',
'team',
'vacations',
'drill',
'pattern',
'design',
'blast',
'timing',
'design',
'your',
'work',
'schedule',
'schedule',
'14',
'days',
'work',
'followed',
'14',
'days',
'transportation',
'rest',
'flights',
'departing',
'communities',
'kivalliq',
'region',
'mirabel',
"vald'or",
'quebec',
'to',
'apply',
'position',
'please',
'use',
'following',
'url',
'httpsars2equestcomresponse_id0fc30561bcc0715fe84e7690663f1bc8'],
24: ['what',
'do',
'you',
'value',
'in',
'a',
'career',
'at',
'agnico',
'eagle',
'values',
'never',
'waver',
'we',
'believe',
'trust',
'respect',
'equality',
'family',
'responsibility',
'why',
'because',
'express',
'helped',
'us',
'succeed',
'business',
'60',
'years',
'about',
'meadowbank',
'our',
'nunavut',
'operations',
'agnico',
'eagle',
'always',
'looking',
'new',
'talented',
'team',
'members',
'join',
'nunavut',
'mining',
'operations',
'we',
'operating',
'meadowbank',
'first',
'low',
'arctic',
'mine',
'near',
'baker',
'lake',
'nine',
'years',
'the',
'mine',
'produced',
'three',
'millionth',
'ounce',
'gold',
'2018',
'2019',
'marked',
'last',
'year',
'production',
'meadowbank',
'mine',
'since',
'transitioned',
'process',
'ore',
'amaruq',
'satellite',
'deposit',
'with',
'official',
'opening',
'amaruq',
'whale',
'tail',
'project',
'august',
'2019',
'project',
'referred',
'meadowbank',
'complex',
'your',
'contribution',
'reporting',
'senior',
'grade',
'control',
'technician',
'geology',
'intern',
'part',
'mine',
'geology',
'department',
'collaborates',
'departments',
'mine',
'shehe',
'ensure',
'goals',
'objectives',
'achieved',
'promoting',
'respecting',
'agnico',
"eagle's",
'culture',
'health',
'safety',
'code',
'conduct',
'environment',
'drill',
'blast',
'excavation',
'monitoring',
'regards',
'mine',
'geology',
'grade',
'control',
'standards',
'qaqc',
'field',
'regular',
'audits',
'quality',
'sampling',
'layout',
'ore',
'packets',
'blasted',
'muck',
'define',
'different',
'ore',
'zones',
'mining',
'daily',
'sample',
'collection',
'mine',
'shipment',
'lab',
'monitoring',
'any',
'tasks',
'senior',
'grade',
'control',
'andor',
'production',
'geologist',
'might',
'identify',
'position',
'approximately',
'90',
'field',
'work',
'10',
'office',
'work',
'your',
'work',
'schedule',
'schedule',
'14',
'days',
'work',
'followed',
'14',
'days',
'transportation',
'rest',
'flights',
'departing',
'communities',
'kivalliq',
'region',
'mirabel',
"vald'or",
'quebec',
'travel',
'room',
'board',
'provided',
'agnico',
'eagle',
'to',
'apply',
'position',
'please',
'use',
'following',
'url',
'httpsars2equestcomresponse_idb79a50f3edc987d26ffdf6568a7c1604']},
'ExigencesParticulieres': {21: ['required',
'skills',
'what',
'you',
'need',
'to',
'succeed',
'enrolled',
'graduated',
"bachelor's",
'degree',
'human',
'resources',
'administration',
'management',
'industrial',
'relations',
'related',
'field',
'mining',
'experience',
'asset',
'strong',
'sense',
'organization',
'quick',
'learner',
'experience',
'working',
'multicultural',
'environment',
'asset',
'excellent',
'communication',
'skills',
'english',
'written',
'spoken',
'must',
'strong',
'interpersonal',
'communication',
'team',
'building',
'skills',
'strong',
'computer',
'skills',
'including',
'use',
'word',
'excel',
'powerpoint'],
22: ['required',
'skills',
'what',
'you',
'need',
'to',
'succeed',
'being',
'autonomous',
'proactive',
'valuable',
'must',
'quick',
'learner',
'organizational',
'skills',
'required',
'enrolled',
'graduated',
"bachelor's",
'degree',
'mechanical',
'mining',
'engineering',
'related',
'field',
'mining',
'experience',
'asset',
'underground',
'experience',
'asset',
'experience',
'working',
'multicultural',
'environment',
'asset',
'excellent',
'communication',
'skills',
'english',
'written',
'spoken',
'must',
'strong',
'interpersonal',
'communication',
'team',
'building',
'skills'],
23: ['required',
'skills',
'what',
'you',
'need',
'to',
'succeed',
'valid',
"driver's",
'license',
'enrolled',
'graduated',
"bachelor's",
'degree',
'mining',
'engineering',
'related',
'field',
'mining',
'experience',
'asset',
'experience',
'working',
'multicultural',
'environment',
'asset',
'excellent',
'communication',
'skills',
'english',
'written',
'spoken',
'must',
'strong',
'interpersonal',
'communication',
'team',
'building',
'skills'],
24: ['required',
'skills',
'what',
'you',
'need',
'to',
'succeed',
'enrolled',
'graduated',
"bachelor's",
'degree',
'kind',
'geosciences',
'related',
'fields',
'mining',
'experience',
'asset',
'experience',
'working',
'multicultural',
'environment',
'asset',
'excellent',
'communication',
'skills',
'english',
'written',
'spoken',
'must',
'strong',
'interpersonal',
'communication',
'team',
'building',
'skills']}}
This is the code that generates the unigrams, the bigrams, and their counts:
import nltk
from nltk.util import ngrams
from nltk import FreqDist
from collections import Counter
col_list = ['FonctionsStagiaire', 'ExigencesParticulieres']
for col in col_list:
    df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1)))
    #try:
    df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
    #except RuntimeError:
    #    for i in df.index:
    #        print(df.index)
    df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))
    df[col+'_bigrams_freq_dist'] = df[col+'_bigrams'].apply(lambda row: list(nltk.FreqDist(row)))
    df[col+'_unigrams_counts'] = df[col+'_unigrams_freq_dist'].apply(lambda row: list(Counter(row).most_common()))
    df[col+'_bigrams_counts'] = df[col+'_bigrams_freq_dist'].apply(lambda row: list(Counter(row).most_common()))
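A side note on the counting step, separate from the error: FreqDist is a Counter subclass, so wrapping it in list() keeps only the distinct n-grams and drops their frequencies, and counting that list again reports 1 for every n-gram. The sketch below illustrates this with collections.Counter alone. The token list sample_tokens is made up for illustration, and zip stands in for nltk.ngrams.

```python
from collections import Counter

# Hypothetical token list, for illustration only (not taken from the dataframe).
sample_tokens = ['drill', 'blast', 'drill', 'blast', 'field']

# Pair each token with its successor to form bigrams.
bigrams = list(zip(sample_tokens, sample_tokens[1:]))

# list(Counter(...)) keeps only the distinct keys, so the frequencies are lost ...
distinct_only = list(Counter(bigrams))
# ... and counting those distinct keys again gives 1 for every bigram.
flattened_counts = Counter(distinct_only).most_common()

# Counting the bigram list directly keeps the real frequencies.
real_counts = Counter(bigrams).most_common()
```

If real frequencies are wanted, calling Counter(row).most_common() on the n-gram list itself may be closer to the intent than going through list(FreqDist(row)) first.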
The full traceback:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/nltk/util.py:468, in ngrams(sequence, n, pad_left, pad_right, left_pad_symbol, right_pad_symbol)
467 while n > 1:
--> 468 history.append(next(sequence))
469 n -= 1
StopIteration:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[100], line 12
10 df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1)))
11 #try:
---> 12 df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
13 #except RuntimeError:
14 # for i in df.index:
15 # print(df.index)
16 df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))
File /opt/conda/lib/python3.10/site-packages/pandas/core/series.py:4771, in Series.apply(self, func, convert_dtype, args, **kwargs)
4661 def apply(
4662 self,
4663 func: AggFuncType,
(...)
4666 **kwargs,
4667 ) -> DataFrame | Series:
4668 """
4669 Invoke function on values of Series.
4670
(...)
4769 dtype: float64
4770 """
-> 4771 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1123, in SeriesApply.apply(self)
1120 return self.apply_str()
1122 # self.f is Callable
-> 1123 return self.apply_standard()
File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1174, in SeriesApply.apply_standard(self)
1172 else:
1173 values = obj.astype(object)._values
-> 1174 mapped = lib.map_infer(
1175 values,
1176 f,
1177 convert=self.convert_dtype,
1178 )
1180 if len(mapped) and isinstance(mapped[0], ABCSeries):
1181 # GH#43986 Need to do list(mapped) in order to get treated as nested
1182 # See also GH#25959 regarding EA support
1183 return obj._constructor_expanddim(list(mapped), index=obj.index)
File /opt/conda/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()
Cell In[100], line 12, in <lambda>(row)
10 df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1)))
11 #try:
---> 12 df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
13 #except RuntimeError:
14 # for i in df.index:
15 # print(df.index)
16 df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))
RuntimeError: generator raised StopIteration
I don't understand why this error is raised. The same code has run without problems on several other datasets.
Any help would be appreciated.
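One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the failing frame is next(sequence) inside a generator, and since Python 3.7 (PEP 479) a StopIteration that escapes a generator body is converted into RuntimeError: generator raised StopIteration. For n = 2 that would happen if some row holds an empty token list. The code below reproduces the mechanism without NLTK and shows one way to guard against it. The names toy_bigrams and safe_bigrams and the sample rows are hypothetical, and zip is swapped in for nltk.ngrams.

```python
def toy_bigrams(sequence):
    """Generator that primes its history with next(), like the frame in the traceback."""
    iterator = iter(sequence)
    history = [next(iterator)]  # StopIteration is raised here if the sequence is empty
    for item in iterator:
        history.append(item)
        yield tuple(history)
        del history[0]


def safe_bigrams(tokens):
    """Return bigrams as a list; a list with fewer than two tokens gives []."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical rows shaped like one column of df[...].to_dict(): index -> token list.
rows = {21: ['fall', '2022', 'human'], 22: [], 23: ['remote']}

# Index labels whose token list is too short to form a bigram.
short_rows = [idx for idx, tokens in rows.items() if len(tokens) < 2]

# The empty row triggers the same message as in the traceback.
try:
    list(toy_bigrams(rows[22]))
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

# The guarded version handles every row, including the empty one.
bigrams_by_row = {idx: safe_bigrams(tokens) for idx, tokens in rows.items()}
```

On the dataframe this would amount to something like df[col].apply(safe_bigrams), assuming every cell really is a list. Checking df[col].apply(len) for values below 2 would show whether such rows exist in this dataset but not in the others.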