Why doesn't my Scrapy spider use all the URLs in its start_urls list?


I have almost 300 URLs in my start_urls list, but Scrapy only crawls about 200 of them, not the whole list. I don't know why, or how to deal with it. I need to crawl more items from the website.

Another thing I don't understand: how can I see the error log once Scrapy finishes? From the terminal, or do I have to write code to see it? I believe logging is enabled by default.

Thanks for your answers.


Update:

The output is below. I don't know why only 2829 items were scraped. There are actually 600 URLs in my start_urls.

But when I give only 400 URLs in start_urls, it scrapes 6000 items. I expect to scrape almost the whole website, www.yhd.com. Could anyone offer more suggestions?

2014-12-08 12:11:03-0600 [yhd2] INFO: Closing spider (finished)
2014-12-08 12:11:03-0600 [yhd2] INFO: Stored csv feed (2829 items) in myinfoDec.csv        
2014-12-08 12:11:03-0600 [yhd2] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 142586,
'downloader/request_count': 476,
'downloader/request_method_count/GET': 476,
'downloader/response_bytes': 2043856,
'downloader/response_count': 475,
'downloader/response_status_count/200': 474,
'downloader/response_status_count/504': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 8, 18, 11, 3, 607101),
'item_scraped_count': 2829,
'log_count/DEBUG': 3371,
'log_count/ERROR': 1,
'log_count/INFO': 14,
'response_received_count': 474,
'scheduler/dequeued': 476,
'scheduler/dequeued/memory': 476,
'scheduler/enqueued': 476,
'scheduler/enqueued/memory': 476,
'start_time': datetime.datetime(2014, 12, 8, 18, 4, 19, 698727)}
2014-12-08 12:11:03-0600 [yhd2] INFO: Spider closed (finished)

1 Answer


I finally solved the problem.

First, the reason it didn't crawl all the URLs listed in start_urls is that I had a typo in one of them: an "http://..." was mistakenly written as "ttp://...", with the leading 'h' missing. The spider then seems to have stopped looking at the URLs listed after it. Horrifying.
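In hindsight, a small sanity check over start_urls would have caught this before the run. A minimal sketch (the URL list here is a placeholder, not my real one):

    # Hypothetical pre-crawl check: flag any start URL whose scheme is not http/https
    from urllib.parse import urlparse

    start_urls = [
        "http://www.yhd.com/",   # fine
        "ttp://www.yhd.com/",    # the kind of typo that broke my run
    ]

    for url in start_urls:
        scheme = urlparse(url).scheme
        if scheme not in ("http", "https"):
            print(f"Bad URL (scheme {scheme!r}): {url}")

Running this prints the offending entry up front instead of the spider quietly stopping partway through the list.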

Second, I solved the log file problem through PyCharm's run configuration panel, which can show a log file in a dedicated tab. By the way, my Scrapy project runs inside the PyCharm IDE, which works great for me. Not an advertisement.
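For anyone who prefers staying in the terminal, Scrapy can also write its log to a file via project settings. A minimal sketch (the file name is my choice, not a requirement):

    # settings.py -- persist the run log, including ERROR entries, to a file
    LOG_ENABLED = True           # on by default, shown here for clarity
    LOG_FILE = "scrapy_run.log"  # log goes to this file instead of stderr
    LOG_LEVEL = "DEBUG"          # capture everything; raise to ERROR to see only errors

After the run, searching that file for ERROR shows exactly what went wrong.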

Thanks for all the comments and suggestions.