Cookie error when crawling a website that uses PHP sessions


I want to crawl the following page: https://db.aa419.org/fakebankslist.php with the search word "sites".

I'm using the requests package in Python. I have no plan to try Selenium, because there is no JavaScript on this page and I don't need to click any buttons, so requests alone should be able to crawl it.

As for the website itself, I guess it handles the query words through PHP. So I created a PHP session using requests.post(), retrieved the cookies from response.cookies, and fed those cookies back to the site in the following POST requests. The code structure is below:

import requests

#crawl 1st page, with the search word passed in the url
url = 'https://db.aa419.org/fakebankslist.php?psearch=sites&Submit=GO&psearchtype='
response = requests.post(url)
cookies = response.cookies  # cookies returned by the server
print(cookies)

#crawl pages 2-4
for i in range(2, 5):
    url = 'https://db.aa419.org/fakebankslist.php?start={}'.format(str(1 + 20*(i-1)))
    response = requests.post(url, cookies=cookies)
    cookies = response.cookies  # update cookies for each page
    print(cookies)

However, this only works for the first two pages. Once the loop starts crawling page 3, the cookie jar comes back empty: <RequestsCookieJar[]>. I checked the response for page 3 and found it's some random page irrelevant to my query word "sites".
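To see what is going on, one thing worth checking (a small debugging sketch, not part of the original code) is whether the server actually issues a new session cookie on each response; 'response' here is assumed to be the requests.Response from the loop above:

print(response.status_code)
print(response.headers.get('Set-Cookie'))  # None if the server set no cookie on this response
print(response.cookies.get_dict())         # the cookies requests parsed from this response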

Could anyone explain what's going on in this situation? How can I keep crawling the following pages? Thanks in advance!


There is 1 answer below.

BEST ANSWER

I am not entirely sure what you are trying to obtain from that website, but I will try to help. The first page of results can be obtained through this URL:

https://db.aa419.org/fakebankslist.php?psearch=essa&Submit=GO&start=1

The value 1 for the start key indicates the first result that appears on the page. Since each page shows 20 results (start through start+19), to view the second page you need to switch '1' to '21':

https://db.aa419.org/fakebankslist.php?psearch=essa&Submit=GO&start=21
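In other words, the start value for page n is 1 + 20*(n - 1). A tiny helper (hypothetical, just to make the arithmetic explicit) could look like this:

def start_for_page(page, per_page=20):
    """Return the 'start' query value for a 1-indexed page number."""
    return 1 + per_page * (page - 1)

# start_for_page(1) -> 1, start_for_page(2) -> 21, start_for_page(3) -> 41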

The second thing is that your requests should be made using the GET method.
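With requests you also don't have to build the query string by hand: passing a params dict to requests.get() produces the same URL. A minimal sketch, using the same search word as the code below:

import requests

params = {'psearch': 'essa', 'Submit': 'GO', 'start': 1}
response = requests.get('https://db.aa419.org/fakebankslist.php', params=params)
print(response.url)  # https://db.aa419.org/fakebankslist.php?psearch=essa&Submit=GO&start=1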

I checked the response for page 3 and found it's some random page irrelevant to my query word "sites"

I believe this is related to the website's broken search engine.

I hope this code helps:

import requests

#crawl pages 1-5
s = requests.Session()  # a Session stores cookies and resends them automatically
for i in range(0, 5):
    url = 'https://db.aa419.org/fakebankslist.php?psearch=essa&Submit=GO&start=' + str(1 + i*20)
    response = s.get(url)
    cookies = s.cookies  # the session's cookie jar, updated after each response
    print('For page', i+1, 'with results from', 1+i*20, 'to', i*20+20, ', cookies are:', str(cookies))
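The key difference from the original code is that requests.Session() keeps the PHP session cookie (typically named PHPSESSID) across requests, so there is no need to capture response.cookies and pass them back manually; even when a response sets no new cookie, the session keeps sending the one it already holds.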