I used the scholarly package and passed the author names generated in question 3 to its search-by-author-name method to get the author profiles, including the citation information, for all the professors. I was able to load the data into a final dataframe, with NA values for those who do not have a Google Scholar profile. However, there is an issue: for approximately 8 authors, the citation information does not match what is shown on the Google Scholar website, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is: how do we get the author IDs of all the professors in the first place?
Any help would be appreciated.
Cheers, Yash
This solution may not be suitable for the `scholarly` package; `beautifulsoup` will be used instead.
Author IDs are located under the author-name element, in the `href` attribute of its `<a>` tag. Here's how we can grab their IDs:
Code that goes a "bit" beyond the scope of your question (full example in the online IDE under the bs4 folder -> `get_profiles.py`):
Output:
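As a rough sketch of the ID-extraction step described above (not the full `get_profiles.py` example), the snippet below pulls the `user` query parameter out of each profile link. The `.gs_ai_name a` selector is an assumption about Google Scholar's markup and may need adjusting, and the function names are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import urlparse, parse_qs


def author_id_from_href(href):
    """Return the value of the `user` query parameter in a profile link, or None."""
    query = parse_qs(urlparse(href).query)
    values = query.get("user")
    return values[0] if values else None


def extract_author_ids(html, selector=".gs_ai_name a"):
    """Map author name -> author ID for every profile link found in `html`.

    `selector` is an assumed CSS selector for the name link on a
    Google Scholar author-search results page.
    """
    # Imported here so the URL helper above can be used without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    ids = {}
    for link in soup.select(selector):
        author_id = author_id_from_href(link.get("href", ""))
        if author_id:
            ids[link.get_text(strip=True)] = author_id
    return ids
```

Each ID obtained this way could then be passed to `search_author_id` from `scholarly`, which avoids the same-name mismatches that come from searching by name.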
Alternatively, you can do the same thing with the Google Scholar Profiles API from SerpApi, without having to think about solving CAPTCHAs, finding proxies, or maintaining the parser over time.
It's a paid API with a free plan.
Code to integrate:
Part of the output:
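Since several profiles can share one name, a small filter on the returned profiles may help pick the right ID. The sketch below is only illustrative: it assumes the response contains a `profiles` list whose items carry `author_id` and a string `affiliations` field, that the `google-search-results` package exposes `GoogleSearch`, and that the `google_scholar_profiles` engine is available on your plan. Both function names are hypothetical.

```python
def pick_author_id(profiles, affiliation_keyword):
    """Return the author_id of the first profile whose affiliation mentions the keyword."""
    keyword = affiliation_keyword.lower()
    for profile in profiles:
        # `affiliations` is assumed to be a plain string in each profile dict.
        if keyword in profile.get("affiliations", "").lower():
            return profile.get("author_id")
    return None


def fetch_profiles(name, api_key):
    """Query SerpApi for profiles matching `name` (needs network access and an API key)."""
    # Imported here so pick_author_id works without the client installed
    # (assumed install: pip install google-search-results).
    from serpapi import GoogleSearch

    params = {
        "engine": "google_scholar_profiles",  # assumed engine name
        "mauthors": name,                     # author name to search for
        "api_key": api_key,
    }
    return GoogleSearch(params).get_dict().get("profiles", [])
```

For example, `pick_author_id(fetch_profiles("Jane Doe", api_key), "Example University")` would return the ID of the first "Jane Doe" affiliated with that institution, or `None` if there is no match, which maps naturally onto the NA values in the dataframe.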