I have been given the task of extracting the title and meta_description from a list of URLs. I am using Goose. Am I doing it correctly?
from goose import Goose
import urlparse
import numpy as np
import os
import pandas
os.chdir(r"C:\Users\EDAWES01\Desktop\Cookie profiling")
data = pandas.read_csv('activity_url.csv', delimiter=';')
data_read=np.array(data)
quantity = data_read[0:, 2]
url_data = data_read[quantity==1][0:3,1]
user_id = data_read[quantity==1][0:3,0]
url_data
#remove '~oref='
clean_url_data=[] #initialize
for i in xrange(0,len(url_data)):
    # keep the part of the URL path that follows the first '='
    clean_url_data.append(urlparse.urlparse(url_data[i])[2].split("=")[1])
# build a flat array with one cleaned URL per element
clean_url_data=np.array(clean_url_data)
#store title
website_title=[]
#store meta_description
website_meta_description=[]
g=Goose()
for urlt in xrange(0, len(clean_url_data)):
    # extract each page once and read both fields from the same article
    article=g.extract(url=clean_url_data[urlt])
    website_title.append(article.title)
    website_meta_description.append(article.meta_description)
website_title=np.array(website_title)
website_meta_description=np.array(website_meta_description)
You can open the URL and read the response into a variable. What you get back is the page source, with all of its HTML tags and values. You can then fetch the information you need from that page using a regular expression that matches your search criteria. You can do something like this:
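A minimal sketch of that idea, not a definitive implementation. The helper names `fetch_page` and `extract_title_and_description` are made up for illustration, and the meta pattern assumes the `name` attribute comes directly before `content`, which real pages do not always follow. The import falls back to `urllib2` so the same code runs on Python 2.

```python
import re

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_page(url):
    """Download a URL and return its HTML source as text (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read().decode("utf-8", "replace")
    finally:
        response.close()


# Text between <title> and </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Assumption: attributes appear as name="description" content="...".
META_RE = re.compile(
    r"<meta\s+name=[\"']description[\"']\s+content=([\"'])(.*?)\1",
    re.IGNORECASE | re.DOTALL,
)


def extract_title_and_description(page):
    """Return (title, meta description); either is None when not found."""
    title = TITLE_RE.search(page)
    description = META_RE.search(page)
    return (
        title.group(1).strip() if title else None,
        description.group(2).strip() if description else None,
    )


# Sample source so the sketch runs offline; for a live page use
# page = fetch_page(some_url) instead.
page = (
    "<html><head><title> Example page </title>"
    '<meta name="description" content="A short summary.">'
    "</head><body></body></html>"
)
print(extract_title_and_description(page))
```

For a list of URLs, you would call `fetch_page` on each one and pass the result to `extract_title_and_description`.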
The variable page holds the full HTML source of the page, tags and structure included. You can write any regular expression to fetch the details you need. For example, re.findall(r'https?://.*?/', page) returns the scheme and host of every URL in the page. You can fetch the other details you need from the page in the same way.
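Here is a small sketch of that pattern run against a made-up HTML string (the sample `page` and its links are assumptions for illustration). The non-greedy `.*?/` stops at the first slash after the host, so a second pattern is shown for the case where you want the whole link.

```python
import re

# Hypothetical page source; in practice this would be the downloaded HTML.
page = (
    '<a href="http://example.com/news/today.html">News</a> '
    '<a href="https://example.org/about/">About</a>'
)

# Stops at the first '/' after the host name.
site_roots = re.findall(r'https?://.*?/', page)
print(site_roots)  # ['http://example.com/', 'https://example.org/']

# Matching up to the closing quote keeps the full link instead.
full_links = re.findall(r'https?://[^"\']+', page)
print(full_links)  # ['http://example.com/news/today.html', 'https://example.org/about/']
```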