
        How to get text and href value in anchor tag with scrapy, xpath, python

        Asked by Claire Duong on 12 June 2020 at 08:02 (827 views)

        I have an HTML file like this:

        <div class="jokes-nav">
          <ul>
            <li><a href="http://link_1">Link 1</a></li>
            <li><a href="http://link_2">Link 2</a></li>
          </ul>
        </div>
        

        In the folder spiders, I have a file jokes.py like this:

        import scrapy
        from demo_project.items import JokeItem
        from scrapy.loader import ItemLoader
        
        class JokesSpider(scrapy.Spider):
            name = 'jokes'
        
            start_urls = [
                'http://www.laughfactory.com/jokes/'
            ]
        
            def parse(self, response):
                for joke in response.xpath("//div[@class='jokes-nav']/ul"):
                    l = ItemLoader(item = JokeItem(), selector = joke)
                    l.add_xpath('joke_title', ".//li/a/text()")
        
                    """ yield {
                        'joke_text': joke.xpath(".//div[@class='joke-text']/p").extract_first()
                    } """
        
                    yield l.load_item()
        

        I run the JokesSpider class from main.py (this file is at the project root) with this code:

        from scrapy.crawler import CrawlerProcess
        from demo_project.spiders.jokes import JokesSpider
        
        process = CrawlerProcess(settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        })
        
        process.crawl(JokesSpider)
        process.start() # the script will block here until the crawling is finished
        

        I want to write the data to items.json, but when I run this code, items.json is empty. How can I solve this problem? Thank you very much.

        python web-scraping scrapy web-mining
        Answer by Patrick Klein on 13 June 2020 at 07:17 (best answer)

        You can set the FEED_FORMAT and FEED_URI settings to save the data in a JSON file:

        process = CrawlerProcess(settings={
            'FEED_FORMAT': 'json',
            'FEED_URI': 'items.json'
        })
        
