I am trying to analyze all existing URLs of a website that share a certain path. For example, the URL pattern looks like this:
https://www.example.com/users/john
and I am trying to get a list of existing URLs starting with "https://www.example.com/users/".
So the desired output would be something like this:
https://www.example.com/users/john
https://www.example.com/users/alice
https://www.example.com/users/bob
https://www.example.com/users/jeff
https://www.example.com/users/sarah
...
There's no sitemap. How do I get such a list?
To generate a list of existing URLs following a specific pattern without a sitemap, you can use web scraping techniques. Here's a general approach using Python with the BeautifulSoup library:
1. Send HTTP requests to the website and retrieve its HTML content.
2. Parse the HTML content to extract URLs matching the desired pattern.
3. Store the extracted URLs in a list.
Here's sample Python code demonstrating this approach:
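A minimal sketch of these steps, under a few stated assumptions: it uses the standard library's html.parser and urllib in place of BeautifulSoup and an HTTP client library so that it runs without extra installs, and the names LinkParser, extract_links, fetch_html, crawl, the max_pages limit and the User-Agent string are all illustrative choices, not part of any existing tool.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, page_url):
    """Return absolute, fragment-free URLs for all links on a page."""
    parser = LinkParser()
    parser.feed(html)
    return {urldefrag(urljoin(page_url, href)).url for href in parser.hrefs}


def fetch_html(url):
    """Download a page; return "" on any error or non-HTML response."""
    try:
        # The User-Agent value is a placeholder (assumption).
        req = Request(url, headers={"User-Agent": "url-lister-sketch"})
        with urlopen(req, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except Exception:  # network errors, bad URLs, HTTP errors
        return ""


def crawl(base_url, start_url=None, fetch=fetch_html, max_pages=200):
    """Breadth-first crawl of one host; return sorted URLs under base_url."""
    start_url = start_url or base_url
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    found = set()
    pages = 0
    while queue and pages < max_pages:  # max_pages caps the work (assumption)
        url = queue.popleft()
        pages += 1
        for link in extract_links(fetch(url), url):
            # Step 2 and 3: keep links that match the desired prefix.
            if link.startswith(base_url) and link != base_url:
                found.add(link)
            # Only follow links that stay on the same host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(found)
```

Calling crawl("https://www.example.com/users/") returns the matching URLs as a sorted list, which can then be printed one per line. If the user pages are mainly linked from elsewhere on the site, passing start_url="https://www.example.com/" starts the crawl from the home page instead.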
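Replace "https://www.example.com/users/" with the actual base URL of the website you want to scrape. This script will recursively crawl through the website starting from the base URL, collect every linked URL that matches the specified pattern, and print out the list of URLs found. Note that a crawler can only discover pages that are linked from a page it visits, so a user page with no inbound links will not appear in the list.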