How to find the sitemap of each domain and subdomain using Python


I want to know how to find the sitemap of each domain and subdomain using Python. Some examples:

abcd.com/sitemap.xml
abcd.com/sitemap.html
sub.abcd.com/sitemap.xml

And so on.

What are the most probable sitemap names, locations, and extensions?

There are 3 answers below.


You should try the urllib.robotparser module:

import urllib.robotparser

# Point the parser at the site's robots.txt (replace the placeholder with your domain).
robots = "https://example.com/robots.txt"
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots)
rp.read()
print(rp.site_maps())  # list of sitemap URLs declared in robots.txt, or None

This will give you all the sitemaps listed in robots.txt.

Most sites have their sitemaps listed there.
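Note that site_maps() was only added in Python 3.8 and returns None when robots.txt declares no sitemaps. As a rough fallback sketch (the domain below is a placeholder), you can fetch robots.txt yourself and pick out the Sitemap: directives:

import requests

def sitemaps_from_robots(domain):
    # Read robots.txt directly and collect its "Sitemap:" directives.
    # "domain" is assumed to include the scheme, e.g. "https://example.com".
    resp = requests.get(f"{domain}/robots.txt", timeout=10)
    resp.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in resp.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(sitemaps_from_robots("https://example.com"))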


I've used a small function that tries the most common sitemap names.

Sitemap naming stats: https://dret.typepad.com/dretblog/2009/02/sitemap-names.html

import requests

def get_sitemap_bruto_force(website):
    # Candidate paths, ordered by how common they are (see the naming stats above).
    potential_sitemaps = [
        "sitemap.xml",
        "feeds/posts/default?orderby=updated",
        "sitemap.xml.gz",
        "sitemap_index.xml",
        "s2/sitemaps/profiles-sitemap.xml",
        "sitemap.php",
        "sitemap_index.xml.gz",
        "vb/sitemap_index.xml.gz",
        "sitemapindex.xml",
        "sitemap.gz",
    ]

    for sitemap in potential_sitemaps:
        try:
            sitemap_response = requests.get(f"{website}/{sitemap}", timeout=10)
            if sitemap_response.status_code == 200:
                return [sitemap_response.url]
        except requests.RequestException:
            continue
    return None  # none of the common names responded with 200
Once I retrieve the sitemap index, I pass it to a recursive function that collects the links from all nested sitemaps.

import itertools
import re

def dig_up_all_sitemaps(website):
    sitemaps = []
    index_sitemap = get_sitemap_paths_for_domain(website)

    def recursive(sitemaps_to_crawl=index_sitemap):
        current_sitemaps = []

        for sitemap in sitemaps_to_crawl:
            try:
                child_sitemap = get_sitemap_links(sitemap)
                # Keep only URLs that look like (possibly gzipped) sitemap files.
                current_sitemaps.append(
                    [x for x in child_sitemap if re.search(r"\.xml|\.xml\.gz|\.gz$", x)]
                )
            except Exception:
                continue

        current_sitemaps = list(itertools.chain.from_iterable(current_sitemaps))
        sitemaps.extend(current_sitemaps)
        if len(current_sitemaps) == 0:
            return sitemaps  # no deeper sitemaps left, stop recursing
        return recursive(current_sitemaps)

    return recursive()

get_sitemap_paths_for_domain returns the list of sitemap URLs to start from, and get_sitemap_links returns the links contained in a single sitemap.
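Neither helper is shown in the answer. As a hedged sketch under those assumptions, get_sitemap_links could fetch the sitemap with requests and pull every <loc> URL out with the standard-library XML parser (no gzip handling here):

import requests
import xml.etree.ElementTree as ET

def get_sitemap_links(sitemap_url):
    # Fetch a sitemap (or sitemap index) and return every <loc> URL it contains.
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Sitemap tags carry the sitemaps.org namespace, so match any tag ending in "loc".
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]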


Please take a look at the robots.txt file first. That's what I usually do.

Some domains offer more than one sitemap, and there are cases with more than 200 XML files.

Please remember that, according to the FAQ on sitemaps.org, a sitemap file can be gzipped. Consequently, you might want to check for sitemap.xml.gz too!
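For the gzipped case, a minimal sketch (the URL is a placeholder): fetch the .xml.gz file and decompress it before parsing. Some servers transparently decompress it for you, so it's safer to check the gzip magic bytes first.

import gzip
import requests

url = "https://example.com/sitemap.xml.gz"  # placeholder URL
resp = requests.get(url, timeout=10)
resp.raise_for_status()
# Only gunzip a real gzip payload; otherwise the body is already plain XML.
if resp.content[:2] == b"\x1f\x8b":
    xml_text = gzip.decompress(resp.content).decode("utf-8")
else:
    xml_text = resp.text
print(xml_text[:200])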