How do I store extracted titles from URLs using Python?

104 Views Asked by Technologic27 At 28 July 2025 at 00:58

I have been given the task of extracting the title and meta_description from a list of URLs. I have used goose. Am I doing it correctly?

from goose import Goose
import urlparse
import numpy as np
import os
import pandas

os.chdir(r"C:\Users\EDAWES01\Desktop\Cookie profiling")
data = pandas.read_csv('activity_url.csv', delimiter=';')
data_read=np.array(data)
quantity = data_read[0:, 2]
url_data = data_read[quantity==1][0:3,1] 
user_id = data_read[quantity==1][0:3,0] 
url_data 

# Remove the '~oref=' prefix: keep what follows the first '=' in the URL path
clean_url_data = []
for url in url_data:
    path = urlparse.urlparse(url)[2]
    clean_url_data.append(path.split("=")[1])

# Store titles and meta descriptions
website_title = []
website_meta_description = []

g = Goose()

# Extract each page once and read both fields from the same article
for url in clean_url_data:
    article = g.extract(url=url)
    website_title.append(article.title)
    website_meta_description.append(article.meta_description)

# Convert the flat lists to arrays (no extra [] wrapper, which would add a dimension)
website_title = np.array(website_title)
website_meta_description = np.array(website_meta_description)

There is 1 best solution below

Deca On 23 June 2016 at 07:48

You can open the URL and read the response into a variable. What you get is the page source, including the HTML tags and their values. You can then fetch the information you need from that page with a regular expression that matches your search criteria. You can do something like this:

import re
import urllib2
sock = urllib2.urlopen('http://www.google.co.in')
page = sock.read()
sock.close()
# Non-greedy match: each URL from its scheme up to the next '/'
listOfUrls = re.findall(r'https?://.*?/', page)

The variable page holds the full HTML source of the page, tags and structure included. You can write any regular expression to fetch the details you need. For example, re.findall(r'https?://.*?/', page) returns the URLs found in the page, each cut off at the first slash after the host. You can fetch other details from the page in the same way.
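
As a rough sketch of how the same regex idea could be applied to the title and meta description from the question, and how the results might be stored one row per URL: the code below is written for Python 3 and uses only the standard library. The names extract_fields, rows_to_csv and sample_pages, the example.com URLs and the two patterns are all made up for illustration, and the sample strings stand in for what sock.read() would return. The meta pattern assumes the name attribute comes before content, so treat it as a starting point only; regular expressions are fragile on real-world HTML.

```python
import csv
import io
import re

# Hypothetical patterns for illustration; they will miss unusual markup.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_DESC_RE = re.compile(
    r"""<meta\s+name=["']description["']\s+content=["'](.*?)["']""",
    re.IGNORECASE | re.DOTALL,
)


def extract_fields(page):
    """Return (title, meta_description) from an HTML string; None for a missing field."""
    title = TITLE_RE.search(page)
    desc = META_DESC_RE.search(page)
    return (
        title.group(1).strip() if title else None,
        desc.group(1).strip() if desc else None,
    )


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title", "meta_description"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample page sources standing in for downloaded pages (assumed data).
sample_pages = {
    "http://example.com/a": (
        "<html><head><title> Example A </title>"
        '<meta name="description" content="First sample page">'
        "</head><body></body></html>"
    ),
    "http://example.com/b": "<html><head></head><body>No title here</body></html>",
}

# Keep each URL together with the fields extracted from it.
rows = []
for url, page in sample_pages.items():
    title, description = extract_fields(page)
    rows.append({"url": url, "title": title, "meta_description": description})

csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the URL, title and description in one record per page means the values cannot drift out of step the way separate lists can, and the CSV text can be written to a file if you need it on disk.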
