Split a string into groups of n consecutive sentences (or small paragraphs) to iterate a function over them


I need to run a function over a large string that consists of more than 500 sentences. I would like to apply the function to every group of 20 sentences in turn.

I tried many different ways but couldn't get the expected result.

For example:

    text = """ Asia is Earth's largest and most populous continent, located primarily in the Eastern and Northern Hemispheres. It shares the continental landmass of Eurasia with the continent of Europe and the continental landmass of Afro-Eurasia with both Europe and Africa. Asia covers an area of 44,579,000 square kilometres (17,212,000 sq mi), about 30% of Earth's total land area and 8.7% of the Earth's total surface area. The continent, which has long been home to the majority of the human population,[5] was the site of many of the first civilizations. Asia is notable for not only its overall large size and population, but also dense and large settlements, as well as vast barely populated regions. Its 4.5 billion people (as of June 2019) constitute roughly 60% of the world's population.[6]

In general terms, Asia is bounded on the east by the Pacific Ocean, on the south by the Indian Ocean, and on the north by the Arctic Ocean. The border of Asia with Europe is a historical and cultural construct, as there is no clear physical and geographical separation between them. It is somewhat arbitrary and has moved since its first conception in classical antiquity. The division of Eurasia into two continents reflects East–West cultural, linguistic, and ethnic differences, some of which vary on a spectrum rather than with a sharp dividing line. The most commonly accepted boundaries place Asia to the east of the Suez Canal separating it from Africa; and to the east of the Turkish Straits, the Ural Mountains and Ural River, and to the south of the Caucasus Mountains and the Caspian and Black Seas, separating it from Europe.[7]

China and India alternated in being the largest economies in the world from 1 to 1800 CE. China was a major economic power and attracted many to the east,[8][9][10] and for many the legendary wealth and prosperity of the ancient culture of India personified Asia,[11] attracting European commerce, exploration and colonialism. The accidental discovery of a trans-Atlantic route from Europe to America by Columbus while in search for a route to India demonstrates this deep fascination. The Silk Road became the main east–west trading route in the Asian hinterlands while the Straits of Malacca stood as a major sea route. Asia has exhibited economic dynamism (particularly East Asia) as well as robust population growth during the 20th century, but overall population growth has since fallen.[12] Asia was the birthplace of most of the world's mainstream religions including Hinduism, Zoroastrianism, Judaism, Jainism, Buddhism, Confucianism, Taoism, Christianity, Islam, Sikhism, as well as many other religions.
"""

The expected group of sentences (or paragraph) would be something like:

    ' '.join(text.split()[300:1000])

Output: attracted many to the east,[8][9][10] and for many the legendary wealth and prosperity of the ancient culture of India personified Asia,[11] attracting European commerce, exploration and colonialism. The accidental discovery of a trans-Atlantic route from Europe to America by Columbus while in search for a route to India demonstrates this deep fascination. The Silk Road became the main east–west trading route in the Asian hinterlands while the Straits of Malacca stood as a major sea route. Asia has exhibited economic dynamism (particularly East Asia) as well as robust population growth during the 20th century, but overall population growth has since fallen.[12] Asia was the birthplace of most of the world's mainstream religions including Hinduism, Zoroastrianism, Judaism, Jainism, Buddhism, Confucianism, Taoism, Christianity, Islam, Sikhism, as well as many other religions.

Now, if I have a function find_the_answer(x), how can I apply it to the given text (in the case above, words 300 to 700) in groups of 5 consecutive sentences?

As a side note: I am working on BERT Question-Answering.

There is 1 solution below

I'm not sure I fully understand your question. Could you clarify whether you are dealing with one string containing many sentences or with a list of strings, each containing many sentences?

However, you should be able to use sentences = text.split(".") to break the text into sentences, although this naive split also cuts at decimal points such as the one in "8.7%". You can then access the first 5 sentences with sentences[:5].
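If splitting on every period turns out to be too crude for text like the example, a regular expression may work better. The following is only a minimal sketch: split_sentences is a hypothetical helper name, and it assumes that bracketed citation markers such as [6] can be dropped and that a sentence ends with ".", "!" or "?" followed by whitespace. Abbreviations like "Dr. Smith" would still be split incorrectly.

```python
import re


def split_sentences(text):
    """Heuristic sentence splitter (hypothetical helper, not from the question)."""
    # Assumption: citation markers like [5] or [12] are not needed downstream.
    cleaned = re.sub(r"\[\d+\]", "", text)
    # Split on whitespace that directly follows '.', '!' or '?'.
    # A decimal point such as the one in "8.7%" has no whitespace after it,
    # so it is left alone.
    parts = re.split(r"(?<=[.!?])\s+", cleaned.strip())
    return [part for part in parts if part]


sample = "Asia covers 8.7% of the surface.[6] It is large. Is it big? Yes!"
sentences = split_sentences(sample)
```

For anything beyond a quick experiment, a dedicated sentence tokenizer from an NLP library is likely to be more robust than this pattern.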

So, if we want to skip the first 5 sentences of the text in our iteration, we can do something like:

    for sentence in sentences[5:]:
        find_the_answer(sentence)