I am running parsechecker for the URL https://www.nicobuyscars.com and get this output: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com

What is the issue and how can I solve it? I tried changing the agent name, but that did not work. Please help.

1 answer:

It looks like the server is blocking requests based on the User-Agent request header. This is reproducible with another HTTP client (wget):

$> wget --header='User-Agent: mycrawler/Nutch-1.17' https://www.nicobuyscars.com
--2020-09-25 11:08:19--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-09-25 11:08:19 ERROR 403: Forbidden.

$> wget https://www.nicobuyscars.com
--2020-09-25 11:08:27--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

In any case, use polite settings for Nutch: a large fetcher.server.delay, keep respecting robots.txt, etc. It is very likely that the server implements other heuristics to detect and block bots as well.
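For reference, such polite-crawl settings go in Nutch's conf/nutch-site.xml, which overrides nutch-default.xml. The sketch below uses real Nutch property names, but the agent name and delay values are illustrative assumptions, not recommendations for this particular site:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: illustrative polite-crawl overrides -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- identify your crawler honestly; an empty agent name makes Nutch refuse to fetch -->
    <value>mycrawler</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between successive requests to the same server -->
    <value>5.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- one fetch thread per host queue keeps the crawl polite -->
    <value>1</value>
  </property>
</configuration>
```

Note that polite settings reduce the chance of being rate-limited or banned, but they cannot guarantee access if the server deliberately blocks a given User-Agent, as the wget test above demonstrates.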