How to find the sitemap of each domain and subdomain using Python


I want to know how to find the sitemap of each domain and subdomain using Python. Some examples:

abcd.com/sitemap.xml
abcd.com/sitemap.html
sub.abcd.com/sitemap.xml

And so on.

What are the most probable sitemap names, locations, and extensions?

There are 3 answers below.


You should try the urllib.robotparser module:

import urllib.robotparser

# Point the parser at the site's robots.txt (replace the placeholder with your domain).
robots = "https://example.com/robots.txt"
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots)
rp.read()
print(rp.site_maps())  # list of sitemap URLs declared in robots.txt, or None

This will give you all the sitemaps listed in robots.txt.

Most sites have their sitemaps listed there.
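Note that site_maps() was only added in Python 3.8 and returns None when robots.txt declares no sitemaps. As a rough fallback sketch (the domain below is a placeholder), you can fetch robots.txt yourself and pick out the Sitemap: directives:

import requests

def sitemaps_from_robots(domain):
    # Read robots.txt directly and collect its "Sitemap:" directives.
    # "domain" is assumed to include the scheme, e.g. "https://example.com".
    resp = requests.get(f"{domain}/robots.txt", timeout=10)
    resp.raise_for_status()
    return [
        line.split(":", 1)[1].strip()
        for line in resp.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(sitemaps_from_robots("https://example.com"))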


I've used a small function that tries the most common sitemap names.

Sitemap naming stats: https://dret.typepad.com/dretblog/2009/02/sitemap-names.html

import requests

def get_sitemap_bruto_force(website):
    # Candidate paths, ordered by how common they are (see the naming stats above).
    potential_sitemaps = [
        "sitemap.xml",
        "feeds/posts/default?orderby=updated",
        "sitemap.xml.gz",
        "sitemap_index.xml",
        "s2/sitemaps/profiles-sitemap.xml",
        "sitemap.php",
        "sitemap_index.xml.gz",
        "vb/sitemap_index.xml.gz",
        "sitemapindex.xml",
        "sitemap.gz",
    ]

    for sitemap in potential_sitemaps:
        try:
            sitemap_response = requests.get(f"{website}/{sitemap}", timeout=10)
            if sitemap_response.status_code == 200:
                return [sitemap_response.url]
        except requests.RequestException:
            continue
    return None  # none of the common names responded with 200
Once I retrieve the sitemap index, I pass it to a recursive function that collects the links from all nested sitemaps.

import itertools
import re

def dig_up_all_sitemaps(website):
    sitemaps = []
    index_sitemap = get_sitemap_paths_for_domain(website)

    def recursive(sitemaps_to_crawl=index_sitemap):
        current_sitemaps = []

        for sitemap in sitemaps_to_crawl:
            try:
                child_sitemap = get_sitemap_links(sitemap)
                # Keep only URLs that look like (possibly gzipped) sitemap files.
                current_sitemaps.append(
                    [x for x in child_sitemap if re.search(r"\.xml|\.xml\.gz|\.gz$", x)]
                )
            except Exception:
                continue

        current_sitemaps = list(itertools.chain.from_iterable(current_sitemaps))
        sitemaps.extend(current_sitemaps)
        if len(current_sitemaps) == 0:
            return sitemaps  # no deeper sitemaps left, stop recursing
        return recursive(current_sitemaps)

    return recursive()

get_sitemap_paths_for_domain returns the list of sitemap URLs to start from, and get_sitemap_links returns the links contained in a single sitemap.
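Neither helper is shown in the answer. As a hedged sketch under those assumptions, get_sitemap_links could fetch the sitemap with requests and pull every <loc> URL out with the standard-library XML parser (no gzip handling here):

import requests
import xml.etree.ElementTree as ET

def get_sitemap_links(sitemap_url):
    # Fetch a sitemap (or sitemap index) and return every <loc> URL it contains.
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Sitemap tags carry the sitemaps.org namespace, so match any tag ending in "loc".
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]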


Please take a look at the robots.txt file first. That's what I usually do.

Some domains offer more than one sitemap, and there are cases with more than 200 XML files.

Please remember that, according to the FAQ on sitemaps.org, a sitemap file can be gzipped. Consequently, you might want to check for sitemap.xml.gz too!
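For the gzipped case, a minimal sketch (the URL is a placeholder): fetch the .xml.gz file and decompress it before parsing. Some servers transparently decompress it for you, so it's safer to check the gzip magic bytes first.

import gzip
import requests

url = "https://example.com/sitemap.xml.gz"  # placeholder URL
resp = requests.get(url, timeout=10)
resp.raise_for_status()
# Only gunzip a real gzip payload; otherwise the body is already plain XML.
if resp.content[:2] == b"\x1f\x8b":
    xml_text = gzip.decompress(resp.content).decode("utf-8")
else:
    xml_text = resp.text
print(xml_text[:200])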