Dataset to use for question formation from any text


I am trying to create an improved quiz generator that accepts a text as input and forms questions from its sentences. I want to create a machine learning model that splits a sentence into parts so that it can form several different questions from the same sentence. For example, from the sentence "Amazon river is the longest river in South America." it should form questions such as: What is the longest river in South America? Is Amazon river the longest river in South America? Where is Amazon river located? If possible, I would also like it to take context from multiple sentences and form a single question that combines their information. I want it to perform well on any text, not just a specific topic. How should I build my dataset, or which existing dataset should I use?

I don't have much prior knowledge of the topic, so I was thinking of somehow using nltk.pos_tag(), which tags every word in a sentence with its part of speech. I am just not sure how to use it in my model and dataset.


There is 1 best solution below


What you're attempting is non-trivial. It falls under Automatic Question Generation (AQG), the task of converting structured or unstructured declarative natural-language sentences into valid interrogative forms. Both linguistic (rule-based) and statistical methods have been used for it. I'd recommend reading Blšták & Rozinajová [1], particularly Section 2, which summarises some of the available datasets and methods. The survey by Lu & Lu [2] gives a recent overview of the field. The most common approach seems to be to leverage existing QA datasets (e.g. SQuAD, HotpotQA; see Table 5 of [2]).

For a quicker, more practical start that doesn't require training your own ML/DL model, you could use an existing Transformer-based model from HuggingFace such as iarfmoose/t5-base-question-generator, which takes the answer and the context concatenated into a single input sequence, e.g.:

<answer> answer text here <context> context text here

and generates a full question (an interrogative sentence) as its output sequence. The model's author recommends generating a large number of sequences and then filtering them with iarfmoose/bert-base-cased-qa-evaluator.
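To make the input format concrete, here is a minimal sketch of how it could be wired up in Python. The helper names (format_qg_input, squad_record_to_pair, generate_questions) are my own and hypothetical, the sampling settings are arbitrary assumptions rather than the model author's recommendations, and the record layout assumes the SQuAD fields as published on HuggingFace ('context', 'question', 'answers' with a 'text' list). The generation step assumes transformers, torch and sentencepiece are installed; filtering with the evaluator model is left out.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "iarfmoose/t5-base-question-generator"


def format_qg_input(answer: str, context: str) -> str:
    """Build the '<answer> ... <context> ...' input sequence described above."""
    return f"<answer> {answer.strip()} <context> {context.strip()}"


def squad_record_to_pair(record: Dict) -> Tuple[str, str]:
    """Turn one SQuAD-style record into a (model input, target question) pair.

    Assumption: the record has 'context', 'question' and
    'answers' -> {'text': [...]}; only the first listed answer is used.
    """
    answer = record["answers"]["text"][0]
    source = format_qg_input(answer, record["context"])
    return source, record["question"].strip()


def generate_questions(answer: str, context: str, num_candidates: int = 5) -> List[str]:
    """Sample several candidate questions for one (answer, context) pair."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    inputs = tokenizer(
        format_qg_input(answer, context),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Sampling settings here are illustrative assumptions, not tuned values.
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        max_new_tokens=64,
        num_return_sequences=num_candidates,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    # A hand-written record in the assumed SQuAD layout (no download needed).
    example = {
        "context": "Amazon river is the longest river in South America.",
        "question": "What is the longest river in South America?",
        "answers": {"text": ["Amazon river"]},
    }
    print(squad_record_to_pair(example))
```

The same squad_record_to_pair idea is one way to build your own training set: each existing QA record becomes an input sequence plus the question the model should learn to produce. To get several questions from one sentence, you would call generate_questions once per candidate answer span (e.g. "Amazon river", "South America").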

References

[1] Blšták, M. and Rozinajová, V., 2022. Automatic question generation based on sentence structure analysis using machine learning approach. Natural Language Engineering, 28(4), pp.487-517.

[2] Lu, C.Y. and Lu, S.E., 2021, October. A Survey of Approaches to Automatic Question Generation: from 2019 to Early 2021. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) (pp. 151-162).