I'm working on a Python project and am having difficulty extracting PDF links from a public Google Drive folder. I wrote this code:
import re

import requests

googleDrive = "https://drive.google.com/drive/folders/1my5S5mOPaIk7jkQOv-P62_dpLwAmFpnm"

# Download the folder page and keep a copy on disk for inspection.
response = requests.get(googleDrive).content
with open("drive.html", "wb") as f:
    f.write(response)

links = []


def extract_links_from_response(text):
    # Match file "view" links embedded in the page source.
    drive_link_pattern = re.compile(r"https://drive\.google\.com/file/d/\w+/view")
    return drive_link_pattern.findall(text)


drive_links = extract_links_from_response(response.decode("utf-8"))
for link in drive_links:
    print(link + "?usp=drive_web")
    links.append(link + "?usp=drive_web")
print(len(links))
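One thing worth checking in the pattern above: `\w` matches letters, digits, and underscores but not hyphens, and Drive IDs can contain hyphens (the folder ID in the URL above does). The sketch below uses made-up file IDs and a hypothetical `WIDER_PATTERN` to show how a link with a hyphenated ID is skipped by `\w+` but picked up by `[\w-]+`. Whether this accounts for any of the missing files depends on the actual IDs in the folder, which I have not verified.

```python
import re

# Same pattern as in the script (dots escaped, which matches the same links in practice).
ORIGINAL_PATTERN = re.compile(r"https://drive\.google\.com/file/d/\w+/view")
# Hypothetical wider pattern: also allows hyphens inside the file ID.
WIDER_PATTERN = re.compile(r"https://drive\.google\.com/file/d/[\w-]+/view")

# Made-up sample text; the second ID contains a hyphen.
sample = (
    "https://drive.google.com/file/d/1abcDEF_123/view "
    "https://drive.google.com/file/d/1abc-DEF_456/view"
)

original_matches = ORIGINAL_PATTERN.findall(sample)
wider_matches = WIDER_PATTERN.findall(sample)

print(len(original_matches))  # the hyphenated ID is not matched
print(len(wider_matches))     # both links are matched
```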
After looking at the response, I realised that the resource links I am looking for can be found easily with this code.
The script ran successfully and produced this output:
(.venv) deepesh@devdas:~/development/help/automaticCertificateFinder$ python drive.py
https://drive.google.com/file/d/1qNUNXMA1eu0CabROqTJlTc2OX5dXx_rZ/view?usp=drive_web
https://drive.google.com/file/d/1KQvWBYlPLbHVW8H3Pqa_bX5Sladw5v_9/view?usp=drive_web
https://drive.google.com/file/d/1VO92_pTwhDlP7c_qAgoKLUjew3EwGYFo/view?usp=drive_web
https://drive.google.com/file/d/1lq8ZXWf7P35QevRG1A7W0UOVnzGTALr4/view?usp=drive_web
https://drive.google.com/file/d/14YNYhztnqg5MpatjqDULslCKiU8UDkTR/view?usp=drive_web
https://drive.google.com/file/d/1pUp_bUQeReRjm6TcIEacbgSRyh0Gtxc6/view?usp=drive_web
https://drive.google.com/file/d/1Oo7b4qGxqvmlQmcrW0HrjejU4jnAVuyW/view?usp=drive_web
https://drive.google.com/file/d/1tfEhmTvhW_QpG6pLBSH0ZB7T1Lu7XFpN/view?usp=drive_web
https://drive.google.com/file/d/1a45jWWSZS6bhlmNa4tEvKbAdfyzJcsSb/view?usp=drive_web
https://drive.google.com/file/d/1KK9Ng84oCKwbT6ttzUN7qU3PKFYazTRX/view?usp=drive_web
https://drive.google.com/file/d/1oTjjb42wO46zIcKP35qWx6amfHsN_8bH/view?usp=drive_web
https://drive.google.com/file/d/1Ba_tu7ndWxoDz4Qy7xbMcthHZe0ihfzz/view?usp=drive_web
https://drive.google.com/file/d/1OJLzTmof09a_BUFubrkJIFQ_9PQq2Bcw/view?usp=drive_web
https://drive.google.com/file/d/1i9xa9q4HjiVWqQkKJmUaagN3eSkEEVe3/view?usp=drive_web
https://drive.google.com/file/d/1Bemuv1lYmTBzjRpEVzsA3q1m0F7j5Evt/view?usp=drive_web
https://drive.google.com/file/d/1E3Ycg2_B3kmJAn60LKz8udjNn_ubgadJ/view?usp=drive_web
https://drive.google.com/file/d/1uyFzHXBoQbur4gQ95p9HUWgiE9HaeiIT/view?usp=drive_web
https://drive.google.com/file/d/1lsduOJLa2WrRFSzNkAVpaimTiKqsufMv/view?usp=drive_web
https://drive.google.com/file/d/13cNclWzkobdJB0rAtd2gHhgcXoBsv6AT/view?usp=drive_web
https://drive.google.com/file/d/1FERvbIRiN56BZJMAOFnhhmdOTdtkDGld/view?usp=drive_web
https://drive.google.com/file/d/1OlD3B4xm6I5EBO2keNAlRE5rJKK9YQ5Q/view?usp=drive_web
https://drive.google.com/file/d/19PE0SCCnlSNFGS8r7rxmuqlADbSw6Qeh/view?usp=drive_web
https://drive.google.com/file/d/1SxV1a4CzpQMMnmld8tm2iHnYCcBTi6lL/view?usp=drive_web
https://drive.google.com/file/d/1z6pL1tLqankY9e2Uxjv2LCsGVvMLMXY5/view?usp=drive_web
https://drive.google.com/file/d/17AJFq4QFRrO5oFRo7XjpWBvXXRWJpQ63/view?usp=drive_web
https://drive.google.com/file/d/17qK8pT6YdRk8F6sZTcU7UjDQUBExz_rr/view?usp=drive_web
https://drive.google.com/file/d/1X1xnK3y9jeLxdv7N_Wi7urYVOEK0up8C/view?usp=drive_web
https://drive.google.com/file/d/1RLXiJwh7L3RySSoH10WwMRGzmrp3qgL5/view?usp=drive_web
https://drive.google.com/file/d/1XtCzDYeATw5uprSGIT8eu2bEv1qS_Km9/view?usp=drive_web
https://drive.google.com/file/d/1tIlqq7FP4brQL9BJ1U47x3VT1ocPL7JB/view?usp=drive_web
https://drive.google.com/file/d/1qNUNXMA1eu0CabROqTJlTc2OX5dXx_rZ/view?usp=drive_web
https://drive.google.com/file/d/1KQvWBYlPLbHVW8H3Pqa_bX5Sladw5v_9/view?usp=drive_web
https://drive.google.com/file/d/1VO92_pTwhDlP7c_qAgoKLUjew3EwGYFo/view?usp=drive_web
https://drive.google.com/file/d/1lq8ZXWf7P35QevRG1A7W0UOVnzGTALr4/view?usp=drive_web
https://drive.google.com/file/d/14YNYhztnqg5MpatjqDULslCKiU8UDkTR/view?usp=drive_web
https://drive.google.com/file/d/1pUp_bUQeReRjm6TcIEacbgSRyh0Gtxc6/view?usp=drive_web
https://drive.google.com/file/d/1Oo7b4qGxqvmlQmcrW0HrjejU4jnAVuyW/view?usp=drive_web
https://drive.google.com/file/d/1tfEhmTvhW_QpG6pLBSH0ZB7T1Lu7XFpN/view?usp=drive_web
https://drive.google.com/file/d/1a45jWWSZS6bhlmNa4tEvKbAdfyzJcsSb/view?usp=drive_web
https://drive.google.com/file/d/1KK9Ng84oCKwbT6ttzUN7qU3PKFYazTRX/view?usp=drive_web
https://drive.google.com/file/d/1oTjjb42wO46zIcKP35qWx6amfHsN_8bH/view?usp=drive_web
https://drive.google.com/file/d/1Ba_tu7ndWxoDz4Qy7xbMcthHZe0ihfzz/view?usp=drive_web
https://drive.google.com/file/d/1OJLzTmof09a_BUFubrkJIFQ_9PQq2Bcw/view?usp=drive_web
https://drive.google.com/file/d/1i9xa9q4HjiVWqQkKJmUaagN3eSkEEVe3/view?usp=drive_web
https://drive.google.com/file/d/1Bemuv1lYmTBzjRpEVzsA3q1m0F7j5Evt/view?usp=drive_web
https://drive.google.com/file/d/1E3Ycg2_B3kmJAn60LKz8udjNn_ubgadJ/view?usp=drive_web
https://drive.google.com/file/d/1uyFzHXBoQbur4gQ95p9HUWgiE9HaeiIT/view?usp=drive_web
https://drive.google.com/file/d/1lsduOJLa2WrRFSzNkAVpaimTiKqsufMv/view?usp=drive_web
https://drive.google.com/file/d/13cNclWzkobdJB0rAtd2gHhgcXoBsv6AT/view?usp=drive_web
https://drive.google.com/file/d/1FERvbIRiN56BZJMAOFnhhmdOTdtkDGld/view?usp=drive_web
https://drive.google.com/file/d/1OlD3B4xm6I5EBO2keNAlRE5rJKK9YQ5Q/view?usp=drive_web
https://drive.google.com/file/d/19PE0SCCnlSNFGS8r7rxmuqlADbSw6Qeh/view?usp=drive_web
https://drive.google.com/file/d/1SxV1a4CzpQMMnmld8tm2iHnYCcBTi6lL/view?usp=drive_web
https://drive.google.com/file/d/1z6pL1tLqankY9e2Uxjv2LCsGVvMLMXY5/view?usp=drive_web
https://drive.google.com/file/d/17AJFq4QFRrO5oFRo7XjpWBvXXRWJpQ63/view?usp=drive_web
https://drive.google.com/file/d/17qK8pT6YdRk8F6sZTcU7UjDQUBExz_rr/view?usp=drive_web
https://drive.google.com/file/d/1X1xnK3y9jeLxdv7N_Wi7urYVOEK0up8C/view?usp=drive_web
https://drive.google.com/file/d/1RLXiJwh7L3RySSoH10WwMRGzmrp3qgL5/view?usp=drive_web
https://drive.google.com/file/d/1XtCzDYeATw5uprSGIT8eu2bEv1qS_Km9/view?usp=drive_web
https://drive.google.com/file/d/1tIlqq7FP4brQL9BJ1U47x3VT1ocPL7JB/view?usp=drive_web
60
As you can see, 60 resource links are printed in the terminal. It's good that there is output, but there are two major issues:
1. The folder contains 500+ PDFs, but I got only 60 links.
2. I checked the 60 links twice and found that not all files are present and some are repeated. For example, I found PDF number 5, then 6, 7, and so on, but not PDFs 1, 2, 3, and 4. Within those 60 links, after 49, the sequence starts over. In short, some files from the middle are missing, and the 60 links contain repetitions.
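The repetition described in the second point can at least be measured and removed on the client side. This is a minimal sketch, assuming the links are already collected in a Python list; `unique_in_order` is a hypothetical helper name, and the sample list reuses two links from the output above with the first one repeated. It does not recover the missing files, it only drops repeats.

```python
def unique_in_order(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


# Two links copied from the output above, with the first one repeated.
links = [
    "https://drive.google.com/file/d/1qNUNXMA1eu0CabROqTJlTc2OX5dXx_rZ/view?usp=drive_web",
    "https://drive.google.com/file/d/1KQvWBYlPLbHVW8H3Pqa_bX5Sladw5v_9/view?usp=drive_web",
    "https://drive.google.com/file/d/1qNUNXMA1eu0CabROqTJlTc2OX5dXx_rZ/view?usp=drive_web",
]

unique_links = unique_in_order(links)
print(len(links), len(unique_links))
```

The block relies on the built-in `list`, so it should not be pasted into a script that rebinds that name.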
I expected to receive 500+ links, ideally all unique. If there is a way to achieve this, I would greatly appreciate the help.
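One direction that might give the full list is the Google Drive API v3 `files.list` endpoint, which returns results in pages linked by `nextPageToken`. The sketch below is untested against this folder and rests on several assumptions: that an API key with the Drive API enabled is available, that a publicly shared folder can be listed with only that key, and that the PDFs report the MIME type `application/pdf`. The names `list_folder_files`, `view_links`, and `fetch_json` are hypothetical helpers, not part of any library.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/drive/v3/files"


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def list_folder_files(folder_id, api_key, get_json=fetch_json):
    """Collect every file entry in a folder by following nextPageToken."""
    files = []
    page_token = None
    while True:
        params = {
            "q": f"'{folder_id}' in parents and trashed = false",
            "fields": "nextPageToken, files(id, name, mimeType)",
            "pageSize": 1000,  # documented maximum per page
            "key": api_key,    # assumption: a key is enough for a public folder
        }
        if page_token:
            params["pageToken"] = page_token
        page = get_json(API_URL + "?" + urllib.parse.urlencode(params))
        files.extend(page.get("files", []))
        page_token = page.get("nextPageToken")
        if not page_token:
            return files


def view_links(files):
    """Build view links for the entries that are PDFs."""
    return [
        f"https://drive.google.com/file/d/{f['id']}/view"
        for f in files
        if f.get("mimeType") == "application/pdf"
    ]


# Example call (needs a real API key, so it is left commented out):
# files = list_folder_files("1my5S5mOPaIk7jkQOv-P62_dpLwAmFpnm", "YOUR_API_KEY")
# print(len(view_links(files)))
```

Because each file ID is returned once by the API, this approach would also avoid the repeated links, provided the assumptions above hold.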