I am trying to scrape a list of all the scholarships on https://bigfuture.collegeboard.org/scholarships/. I was able to scrape all of the links and store them in a list using Selenium, but Selenium doesn't scale well for scraping the data at each address. I am now trying Scrapy with Splash, but neither my XPath nor my CSS selectors seem to work. This is my first time web scraping, so I am very lost. I would greatly appreciate any help!
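For reference, this is roughly how I collected the links and wrote them to links.txt with Selenium (a simplified sketch; the CSS selector for the scholarship links is only illustrative, not the exact one I used):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://bigfuture.collegeboard.org/scholarships/")
    # ... scroll / click "load more" until every scholarship card is on the page ...

    # Illustrative selector: grab every link that points at an individual scholarship page
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/scholarships/']")]

    with open("links.txt", "w") as f:
        f.write("\n".join(links))
    driver.quit()

That part worked. Here is the Scrapy spider I'm stuck on: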
import scrapy
import pandas as pd
from scrapy_splash import SplashRequest


class ScholarshipSpider(scrapy.Spider):
    name = 'scholarship'
    start_urls = [line.strip() for line in open("links.txt")]

    def __init__(self, *args, **kwargs):
        super(ScholarshipSpider, self).__init__(*args, **kwargs)
        self.items_list = []

    def start_requests(self):
        # Render each scholarship page in Splash before parsing
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 7, 'html': 1, 'png': 1})

    def parse(self, response):
        item = {
            'name': response.xpath('//*[@id="main-content"]/div/div[2]/div/div/div[1]/section[1]/div/div[1]/h1/text()').get()
            # other items here
        }
        self.logger.info(item)
        self.items_list.append(item)
        print(f"Name: {item['name']}")

    def closed(self, reason):
        # Dump everything collected during the crawl to a CSV
        df = pd.DataFrame(self.items_list)
        df.to_csv('scraped_data.csv', index=False)
When I used Selenium, the XPaths worked, but my code stopped working after a while. Scrapy seems like the best alternative, but no matter what I try, I can't get it to work.
I am using Jupyter Notebook btw.
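I run the spider from the notebook roughly like this (simplified sketch; the scrapy-splash settings are copied from its README, and I'm assuming Splash is running locally on port 8050):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        # scrapy-splash wiring
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    })
    process.crawl(ScholarshipSpider)
    process.start()

(Scrapy's reactor can't be restarted, so I restart the kernel between runs.)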