Scrapy: How to clean the response?


Here is my code snippet. I am trying to scrape a website using Scrapy and then store the data in Elasticsearch for indexing.

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': news.xpath('//meta[@name="pagetype"]/@content').extract(),
            'description': news.xpath('//div[@class="module__content"]/*/node()/text()').extract(),
        }

Now my issue is with the value that gets saved in the 'description' field:

    [u'\n              \n              ', u'"For\n              many of us what we eat on Christmas day isn\'t what we would usually consume and\n              that\u2019s perfectly ok," Dr said.', u'"However\n              it is not uncommon for festive season celebrations to begin in November and\n              continue well in to the New Year.', u'"So\n              if health is on the agenda, being mindful about what we put into our bodies\n              with a balanced approach, throughout the whole festive season, is important."', u"Dr\n              , a lecturer at School\n              Sciences, said balancing fresh, healthy food with being physically active was a\n              good start.", u'"Whatever\n              the celebration, try to limit processed foods, often high in fat, sugar and\n              salt," she said.', u'"Taking\n              time during holidays to prepare food and make the most of fresh ingredients is\n              often a much healthier option than relying on convenience foods and take away.', u'"Being\n              mindful about going back for seconds is important too.\xa0 We don\u2019t need to eat until we feel\n              uncomfortable and eating the foods we enjoy doesn\'t necessarily mean we need to\n              eat copious amounts."', u"Dr\n             own healthy tips and substitutes for the Christmas season\n              include:", u'But\n              just because Dr  is a dietitian, doesn\u2019t mean she doesn\u2019t enjoy a\n              Christmas treat or two.', u'"I\n              would have to say my sister in law\'s homemade rocky road is my favourite\n              festive treat. She makes it every Christmas day and it gets better each year," she\n              said.', u'"I\n              also enjoy a summer cocktail every so often during the festive season and a\n              mojito would be one of my favourites on Christmas day. We make it with extra\n              mint from the garden which is a nice, fresh addition.', u'"Rather\n              than focusing on food avoidance, moderation is the best approach.', u'"There\n              are definitely some more healthy choices and some less healthy options when it\n              comes to the typical Christmas day menu, but it\'s more important to be mindful\n              of a healthy, balanced diet throughout the festive period, rather than avoiding\n              specific foods on one day of the year."', u'\n                ', u'\n              \n                ', u'\n                ', u'\n              \n                ', u'\n              ', u'\n                ', u'\n                        ', u'\n                        ', u'\n                        ', u'\n                    ', u'\n            ', u'Related News', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'Search for related news']

There is a lot of extra whitespace, along with newline escapes (\n) and 'u' prefixes....

How do I further process this output so that it contains only plain text, free of the extra whitespace, the newline (\n) escapes and the 'u' prefixes?

I read that BeautifulSoup works well with Scrapy, but I couldn't find any examples of how to integrate the two. I am open to using any other method as well. Any help is appreciated.

Thanks


There is 1 answer below.


You can strip the spaces and newlines from the strings in the list using, for example, the method shown in this answer:

[' '.join(item.split()) for item in list_of_strings]

where list_of_strings is the list of strings you gave as an example.
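If you want to do the clean-up inside the spider itself, here is a minimal sketch of how it could look in your parse() callback. The clean() helper is my own illustration, not part of your code; the XPaths and field names are taken from your snippet:

def clean(strings):
    # Collapse runs of whitespace (including newlines) inside each string
    # and drop entries that are empty after stripping.
    cleaned = [' '.join(s.split()) for s in strings]
    return [s for s in cleaned if s]

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': clean(news.xpath('//meta[@name="pagetype"]/@content').extract()),
            'description': clean(news.xpath('//div[@class="module__content"]/*/node()/text()').extract()),
        }

Dropping the empty entries also gets rid of the items that were nothing but indentation, such as u'\n              \n              '.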

Regarding the "u letters", you shouldn't really worry about them. They simply mean that the string is in unicode encoding. See e.g. this question on the matter.