Given a company ticker or name, I would like to get its sector using Python.
I have already tried several potential solutions, but none has worked successfully.
The two most promising are:
1) Using the script from: https://gist.github.com/pratapvardhan/9b57634d57f21cf3874c
    from urllib import urlopen
    from lxml.html import parse

    '''
    Returns a tuple (Sector, Industry)
    Usage: GFinSectorIndustry('IBM')
    '''
    def GFinSectorIndustry(name):
        tree = parse(urlopen('http://www.google.com/finance?&q='+name))
        return tree.xpath("//a[@id='sector']")[0].text, tree.xpath("//a[@id='sector']")[0].getnext().text
However, I am using Python 3.8.
I have been able to tweak this solution, but the last line is not working. I am completely new to scraping web pages, so I would appreciate any suggestions.
Here is my current code:
    from urllib.request import Request, urlopen
    from lxml.html import parse

    name = "IBM"
    req = Request('http://www.google.com/finance?&q='+name, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req)
    tree = parse(webpage)
But then the last part is not working, and I am very new to this xpath syntax:

    tree.xpath("//a[@id='sector']")[0].text, tree.xpath("//a[@id='sector']")[0].getnext().text
2) The other option was embedding R's TTR package, as shown here: Find which sector a stock belongs to
However, I want to run it within my Jupyter notebook, and ss <- stockSymbols() is just taking ages to run.
Following your comment, for marketwatch.com/investing/stock specifically, the xpath that is likely to work is

    "//div[@class='intraday__sector']/span[@class='label']"

meaning that evaluating it with tree.xpath(...) and reading the text of the first match should return the desired information.
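As a minimal sketch of how that xpath could be applied from Python 3 with lxml: the helper names extract_sector and fetch_sector are hypothetical, the URL layout (ticker appended to marketwatch.com/investing/stock/) is an assumption, and the sample fragment is hand-written to mimic the structure the xpath expects rather than copied from the real page. Returning None when nothing matches avoids the IndexError that [0] raises on an empty result.

```python
from urllib.request import Request, urlopen

from lxml import html

# XPath quoted in the answer; assumed (not verified) to match the live page.
SECTOR_XPATH = "//div[@class='intraday__sector']/span[@class='label']"


def extract_sector(page_source, xpath=SECTOR_XPATH):
    """Return the text of the first node matching xpath, or None if no node matches."""
    tree = html.fromstring(page_source)
    matches = tree.xpath(xpath)
    if not matches:
        return None
    return matches[0].text_content().strip()


def fetch_sector(ticker):
    """Download the page for ticker and extract the sector (assumed URL layout)."""
    url = "https://www.marketwatch.com/investing/stock/" + ticker
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as response:
        return extract_sector(response.read())


# Offline check on a hand-written fragment with the assumed structure.
SAMPLE = """
<html><body>
  <div class="intraday__sector"><span class="label">Technology</span></div>
</body></html>
"""
sector = extract_sector(SAMPLE)
print(sector)
```

Calling fetch_sector("IBM") would run the same extraction against the live page. If it returns None, the page layout has probably changed and the xpath needs to be updated.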
Some clarifications:
- You will not find "//a[@id='sector']" in the page you mention in comments, since this xpath (now outdated) was google-finance specific. Put differently, you first need to "study" the page you are interested in to know where the information you want is located.
- To test an xpath, you can open your browser's developer console on the page and run $x(<your-xpath-of-interest>), where the function $x is documented here (with examples!).
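The same kind of check can be done from Python before indexing into a result. The sketch below uses a hypothetical helper, count_matches, and a hand-written fragment standing in for a downloaded page (an assumption, not the real markup), and reports how many nodes each candidate xpath selects. A count of zero means that [0] on that result would raise an IndexError.

```python
from lxml import html


def count_matches(page_source, xpaths):
    """Map each candidate xpath to the number of nodes it selects in page_source."""
    tree = html.fromstring(page_source)
    return {xp: len(tree.xpath(xp)) for xp in xpaths}


# Hand-written fragment standing in for a downloaded page (assumed structure).
SAMPLE = """
<html><body>
  <div class="intraday__sector"><span class="label">Technology</span></div>
</body></html>
"""

counts = count_matches(SAMPLE, [
    "//a[@id='sector']",  # the old google-finance xpath
    "//div[@class='intraday__sector']/span[@class='label']",
])
for xp, n in counts.items():
    print(n, xp)
```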