I am trying to write data from several documents (processed in a for loop) to a CSV file in Python 3. However, the column gets overwritten every time. How can I write the data from each individual document to the CSV in the rows below the previous ones, without overwriting?
import glob

import pandas as pd
from pdfminer.high_level import extract_text

for selectedfile in glob.glob(r'C:\Users\...\*.pdf'):
    text = extract_text(selectedfile)
Y = set(text)
Z = []
Znew = []
for val in Y:
    occurrences = wordlist2.count(val)
    if occurrences > 50:  # define min. no. of occurrences
        # print(val, ':', occurrences)
        Z.append(val)
        Znew.append(occurrences)
dict = {'Stem': Z, 'Count': Znew}
df = pd.DataFrame(dict)
df.to_csv('Exported list.csv', header=True, index=True, encoding='utf-8')
The problem is in that first `for` loop. You keep replacing `text` with newly extracted text and only process the final extraction. You could move the processing into the `for` loop to work on each extraction. In this example, I've opened the file beforehand and written the header once. Then it's a question of making sure the index is correct for each write.