I successfully wrote the following code to get the titles of the pages in a Wikipedia category. The category contains more than 404 titles, but my output file only has 200 of them. How do I extend my code to follow the category's "next page" link and collect the titles from every page?
Command: python3 getCATpages.py
Code of getCATpages.py:
from bs4 import BeautifulSoup
import requests
import csv
#getting all the contents of a url
url = 'https://en.wikipedia.org/wiki/Category:Free software'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
#showing the category-pages Summary
catPageSummaryTag = soup.find(id='mw-pages')
catPageSummary = catPageSummaryTag.find('p')
print(catPageSummary.text)
#showing the category-pages only
tag = soup.find(id='mw-pages')
links = tag.find_all('a')
# numbering the output and limiting the printout to the first three links
counter = 1
for link in links[:3]:
    print(' ' + str(counter) + " " + link.text)
    counter = counter + 1
#getting the category pages
catpages = soup.find(id='mw-pages')
whatlinksherelist = catpages.find_all('li')
things_to_write = []
for titles in whatlinksherelist:
    things_to_write.append(titles.find('a').get('title'))
#writing the category pages to an output file
with open('001-catPages.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="\n")
    writer.writerow(things_to_write)
The idea is to follow the "next page" link until there is no such link left on the page. We'll maintain a web-scraping session while making multiple requests, collecting the desired link titles in a list:
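Here is a sketch of that approach. It reuses the #mw-pages container and the per-list-item title extraction from the question; the assumption (true for Wikipedia category pages) is that the pagination link is an anchor whose visible text is "next page" and whose href is relative:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'https://en.wikipedia.org/wiki/Category:Free software'

def get_next_link(soup):
    # the pagination link is an anchor whose visible text is "next page";
    # on the last page "next page" is plain text, so find() returns None
    return soup.find('a', string='next page')

def extract_links(soup):
    # same extraction as in the question: one title per list item under #mw-pages
    return [li.find('a').get('title')
            for li in soup.find(id='mw-pages').find_all('li')]

with requests.Session() as session:
    soup = BeautifulSoup(session.get(base_url).content, 'lxml')
    titles = extract_links(soup)
    next_link = get_next_link(soup)
    while next_link is not None:
        # the "next page" href is relative, so resolve it against the base URL
        response = session.get(urljoin(base_url, next_link['href']))
        soup = BeautifulSoup(response.content, 'lxml')
        titles += extract_links(soup)
        next_link = get_next_link(soup)

print(len(titles))
print(titles)

Using requests.Session() here reuses the underlying connection across the repeated requests, which is faster than calling requests.get() once per page.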
This prints the complete list of the category's page titles; the long output is omitted here, as is the irrelevant CSV-writing part, which stays the same as in the question.