I'm trying to turn a *.docx file with questions into a Python dictionary.
The questions have this format:
- Question
a. first answer
b. second answer
c. third answer
d. fourth answer
e. fifth answer
In the file, the correct answer is the one in bold, in this case the third. The Word file is built with MS Word's automatic list numbering (1., 2., and so on for questions; a., b., and so on for answers).
The resulting dictionary should be like:
{
    '1': {
        'question': 'the question text',
        'answer': ['first answer', 'second answer', 'third answer', 'fourth answer', 'fifth answer'],
        'correct_answer': 2
    },
    # Other questions...
}
I tried this code:
from docx import Document

def is_bold(run):
    return run.bold

# Open the document
doc = Document('sample.docx')

# Create an empty dictionary for questions and answers
questions_and_answers = {}

# Iterate only through paragraphs
for paragraph in doc.paragraphs:
    text = paragraph.text.strip()

    # Check if the paragraph starts with a number and a dot
    if text and text[0].isdigit() and text[1] == '.':
        question_number, question = text.split(' ', 1)
        answer_choices = []
        correct_answer_index = None

        # Continue to the next paragraph that will contain the answers
        next_paragraph = paragraph
        while True:
            next_paragraph = next_paragraph.next_paragraph

            # If there are no more paragraphs or it starts with a number, we've reached the end of the answers
            if not next_paragraph or (next_paragraph.text.strip() and next_paragraph.text.strip()[0].isdigit()):
                break

            next_text = next_paragraph.text.strip()

            # If it starts with a letter and a period, consider it as an answer
            if next_text and next_text[0].isalpha() and next_text[1] == '.':
                answer_run = next_paragraph.runs[0]  # Consider only the first "run" to check the style
                answer_text = next_text[3:]  # Remove the answer format (a., b., c., ...)
                answer_choices.append(answer_text)

                # Check if the answer is bold (hence, correct)
                if is_bold(answer_run):
                    correct_answer_index = len(answer_choices) - 1  # Save the index of the correct answer

        # Add the question and answers to the dictionary
        questions_and_answers[question_number] = {
            'question': question,
            'answers': answer_choices,
            'correct_answer_index': correct_answer_index
        }

# Print the resulting dictionary
for number, data in questions_and_answers.items():
    print(f"{number}: {data['question']}")
    print("Answers:")
    for answer in data['answers']:
        print(f"- {answer}")
    print(f"Index of the correct answer: {data['correct_answer_index']}")
    print()
Unfortunately, I'm getting an empty dictionary. How do I fix this?
Related question
But, by merging these solutions, we can do something like this:
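The dictionary most likely comes out empty because python-docx's `paragraph.text` contains only the text you typed: the "1." and "a." labels are generated by Word's list numbering and are not part of the text, so the `text[0].isdigit()` check never matches. (A `Paragraph` also has no `next_paragraph` attribute, but that line is never reached.) Below is a minimal sketch of one way around this, not a definitive implementation. It assumes that questions and answers belong to the same multilevel list, with questions at level 0 and answers at level 1, and that the numbering is applied directly to each paragraph rather than inherited from a style. The helper names `build_questions` and `iter_list_items` are hypothetical, and `paragraph._p.pPr.numPr` goes through python-docx's internal XML objects rather than its public API.

```python
def build_questions(items):
    """Group (level, text, is_bold) tuples into the target dictionary.

    Assumption: level 0 is a question and any deeper level is an answer.
    """
    questions = {}
    current = None
    for level, text, bold in items:
        if level == 0:
            # A new question starts; number the keys '1', '2', ...
            current = {'question': text, 'answer': [], 'correct_answer': None}
            questions[str(len(questions) + 1)] = current
        elif current is not None:
            current['answer'].append(text)
            if bold:
                # Index of the bold (correct) answer within this question
                current['correct_answer'] = len(current['answer']) - 1
    return questions


def iter_list_items(path):
    """Yield (level, text, is_bold) for each numbered paragraph in a .docx file."""
    # python-docx is imported here so build_questions can be used without it.
    from docx import Document

    for paragraph in Document(path).paragraphs:
        text = paragraph.text.strip()
        p_pr = paragraph._p.pPr  # internal XML element, may be None
        num_pr = p_pr.numPr if p_pr is not None else None
        if not text or num_pr is None:
            continue  # empty paragraph, or not a directly numbered list item
        level = num_pr.ilvl.val if num_pr.ilvl is not None else 0
        # Treat the paragraph as bold only if every non-empty run is explicitly bold.
        # run.bold is None when bold is inherited from a style, so that case is missed.
        runs = [run for run in paragraph.runs if run.text.strip()]
        bold = bool(runs) and all(run.bold for run in runs)
        yield level, text, bold
```

With python-docx installed, `build_questions(iter_list_items('sample.docx'))` should return the dictionary in the format shown in the question. If the questions and answers were created as two separate lists, both at level 0, the level check will not tell them apart; in that case `num_pr.numId.val` might be used to distinguish the two lists instead.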