How can I get download links of files from a Webpage without scraping the document itself?

1k Views Asked by Alex At 07 May 2019 at 19:19

I want to write a download manager in Python, similar to JDownloader, that downloads simple files for you. But not every file has a download URL in the document. How can I get the download URLs if the files are effectively "invisible" in the document? I read online that network sniffing might work, but it doesn't seem to be what I need. JDownloader checks a page for a second and directly finds what you need. How does this work? For example: https://speed.hetzner.de/

I am a beginner btw.


mbhargav294 On 07 May 2019 at 21:48

Looking at your example page, it has 3 hrefs that point to a file. When you look at an href, you can sometimes tell it is a file from the extension. But in general, a website can do some server-side processing and then return a file, and some URLs are not files at all; they point to another page.

So you have two things to do:

1. Retrieve all anchor tags and their hrefs from the webpage. (You can use BeautifulSoup for this step.)
2. Separate the file URLs from the URLs of HTML pages. (This is the tricky part. You can also come across static assets such as .js, .css, or image files.)
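The first step can be sketched as follows. This is only a minimal illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and the names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (value is None or empty).
            if name == "href" and value:
                # Turn relative links such as "100MB.bin" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all anchors found in the HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

You would pass in the page source you fetched yourself, for example extract_links(page_html, 'https://speed.hetzner.de/'), and then check each returned URL in the second step.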

To perform the second part, you can use the Python requests library to get the content type. Here is a small example:

In [3]: import requests

In [4]: response = requests.head('https://speed.hetzner.de/100MB.bin', allow_redirects=True)

In [5]: response
Out[5]: <Response [200]>

In [6]: response.content
Out[6]: b''

In [7]: response.headers
Out[7]: {'Server': 'nginx', 'Date': 'Tue, 07 May 2019 21:21:28 GMT',
         'Content-Type': 'application/octet-stream', 'Content-Length': '104857600',
         'Last-Modified': 'Tue, 08 Oct 2013 11:48:13 GMT', 'Connection': 'keep-alive',
         'ETag': '"5253f0fd-6400000"',
         'Strict-Transport-Security': 'max-age=15768000; includeSubDomains',
         'Accept-Ranges': 'bytes'}

If you look at response.headers here, you can see the 'Content-Type', which is set to 'application/octet-stream'. This is the field to use for telling files apart from pages. There are other content types you have to look for in order to decide whether a URL is downloadable or not. Once you have filtered the links this way, what remains is the list of downloadable files on this webpage.
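A rough sketch of such a filter is shown here. The rule it applies is an assumption for illustration, not a complete answer: anything served as an HTML page counts as "not a file", anything sent with a Content-Disposition of attachment counts as a file, and the names PAGE_TYPES and looks_like_download are made up for this example.

```python
# Content types treated as ordinary web pages (an assumed, incomplete list).
PAGE_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_download(headers):
    """Guess from response headers whether a URL serves a downloadable file."""
    # Plain dicts are case-sensitive, so normalise the header names first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # "attachment" tells the browser to save the response instead of showing it.
    disposition = lowered.get("content-disposition", "").strip().lower()
    if disposition.startswith("attachment"):
        return True

    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if not content_type:
        return False  # no information, so do not guess
    return content_type not in PAGE_TYPES
```

With the session above, looks_like_download(response.headers) would be called once per link. Static assets such as .js or .css also pass this check, so you may want to extend PAGE_TYPES or add your own rules.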

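Notice that I am using requests.head to get the content type. A HEAD request returns only the meta information (the headers) for a URL, without the body. If you do a GET instead, the whole file is downloaded (100 MB in this example), which is slow and might time out.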
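If you would rather not install requests, a HEAD request can also be sent with the standard library. This is a small sketch under that assumption; head_request and fetch_headers are hypothetical helper names, and the 10 second timeout is an arbitrary choice.

```python
import urllib.request


def head_request(url):
    """Build a HEAD request, which asks the server for headers only."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url, timeout=10.0):
    """Send the HEAD request and return the response headers as a dict."""
    # The timeout keeps a slow or unresponsive server from blocking forever.
    with urllib.request.urlopen(head_request(url), timeout=timeout) as response:
        return dict(response.headers.items())
```

Calling fetch_headers('https://speed.hetzner.de/100MB.bin') needs network access and would return a dict of headers similar to the one in the session above.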
