Is it possible to crawl ajax-based web sites using Heritrix-3.2.0?
Is Heritrix3.2.0 able to crawl ajax-based web sites?
457 Views Asked by T.Sh At
1
There are 1 best solutions below
Related Questions in JAVA
- I need the BIRT.war that is compatible with Java 17 and Tomcat 10
- Creating global Class holder
- No method found for class java.lang.String in Kafka
- Issue edit a jtable with a pictures
- getting error when trying to launch kotlin jar file that use supabase "java.lang.NoClassDefFoundError"
- Does the && (logical AND) operator have a higher precedence than || (logical OR) operator in Java?
- Mixed color rendering in a JTable
- HTTPS configuration in Spring Boot, server returning timeout
- How to use Layout to create textfields which dont increase in size?
- Function for making the code wait in javafx
- How to create beans of the same class for multiple template parameters in Spring
- How could you print a specific String from an array with the values of an array from a double array on the same line, using iteration to print all?
- org.telegram.telegrambots.meta.exceptions.TelegramApiException: Bot token and username can't be empty
- Accessing Secret Variables in Classic Pipelines through Java app in Azure DevOps
- Postgres && statement Error in Mybatis Mapper?
Related Questions in WEB-CRAWLER
- How do i get the newly opened page after a form submission using puppeteer
- How to crawl 5000 different URLs to find certain links
- Selenium cannot load a page
- FaceBook-Scraper (without API) works nicely - but Login Process failes some how
- Why scrapy shell did not return an output?
- Highcharts Spider Chart with different scale for each category
- Chrome for Testing crashes soon after launching chrome driver in script
- Permission denied When deploy Splash in OpenShift
- scrape( n ′ gcontent−serverapp ′ , ′ How to scrape HTML elements with a specific attribute using Python ′ )
- Puppeteer recognized by BET365 during crawler
- Python requests.get(url) returns empty content in Colab
- I want some of the content in my page to be crawlable but should not be indexed
- Selenium crawler had no problems starting up locally, but it always failed to start up on Linux,org.openqa.selenium.interactions.Coordinates
- Website Branch address not updating in Google search engine even after 1 month
- How can I execute javasript function before page load for search engine crawlers?
Related Questions in HERITRIX
- Getting 401 error when trying to make a teardown request to Heritrix via Node.js http module
- Crawling rules in heritrix, how to load embedded content?
- Which block represents a WARC-Block-Digest?
- How can i rightly configure my crawling program crawl-beans.cxml
- Heritrix 3.2.0 can't find files and won't execute
- Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode
- How to write a cron job for Heritrix3 web crawling?
- Heritrix 3.2.x , how to read content from warc files ?
- How do we know when Heritrix completes a crawl job?
- Is Heritrix Crawl Deterministic?
- find web trace to a web list in heritrix
- Increasing number of threads
- Heritrix Content Filtering
- Heritrix not finding CSS files in conditional comment blocks
- Heritrix: Ignoring robots.txt for one site only
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular # Hahtags
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
If you intend to make a "copy" of an ajax website, clearly no.
If you want to grab some data by analysing the content of the website, you can customize the crawler with an Extractor that would determine which URLs to follow. On most website you can easily guess the urls that are interesting for your case without having to interpret the javascript. Then the ajax callbacks would be crawled and given to the Processor chain. By default this would store the ajax callback answers in the archive files.
Making your own Extractor looks like that:
Compile, put the jar file in the heritrix/lib directory and insert a bean refering to MyExtractor in the fetchProcessors chain : basically, duplicate the extractorHtml line in the crawl job cxml file.