How to generate word cloud for a group of URLs?


I have an array called "urls" containing several URLs. I want a crawler to fetch the title and body of each web page, store them all together in one TXT file, and then generate a word cloud for this group of pages. This is the first file (urls.py):

def urlsgetword(url):
    import os
    from urllib import request
    from bs4 import BeautifulSoup

    response = request.urlopen(url)  # request the page
    content = response.read().decode('utf-8')  # read the page and decode it as UTF-8

    soup = BeautifulSoup(content, 'lxml')
    title = soup.title  # page title tag
    article = soup.find('div', class_='wp_articlecontent')  # page body
    title = ''.join(title.text.split())  # title text, whitespace removed
    article = article.get_text(strip=True)  # body text; strip=True trims leading/trailing whitespace
    article = ''.join(article.split())

    info = title + '\n' + article

    if not os.path.exists("F:/python-file/"):
        os.mkdir("F:/python-file/")

    # write to the same path that wcloud() reads from, and open in
    # append mode so each URL's text is added to the same file
    # (mode 'w' would overwrite the file on every call)
    with open("F:/python-file/urls.txt", 'a', encoding='utf-8') as f:
        f.write(info + '\n')
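One pitfall in this function is the file mode: opening with 'w' truncates the file on every call, so when it runs in a loop only the last page's text survives, while 'a' appends. A minimal stdlib-only demonstration (using a temporary file rather than the F: path):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "mode_demo.txt")

# mode 'w' truncates the file on every open, so in a loop
# only the last write survives
for word in ["first", "second", "third"]:
    with open(path, "w", encoding="utf-8") as f:
        f.write(word + "\n")
with open(path, encoding="utf-8") as f:
    overwritten = f.read().split()   # only the last word remains

# mode 'a' appends, so every write survives
os.remove(path)
for word in ["first", "second", "third"]:
    with open(path, "a", encoding="utf-8") as f:
        f.write(word + "\n")
with open(path, encoding="utf-8") as f:
    appended = f.read().split()      # all three words remain

os.remove(path)
print(overwritten, appended)
```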

This is the second file (wordcloud.py):

def wcloud():
    import matplotlib.pyplot as plt
    import wordcloud
    import jieba

    # read the file with the same encoding it was written with
    text = open('F:/python-file/urls.txt', encoding='utf-8').read()

    # full-mode segmentation, then join the tokens with spaces
    # so WordCloud can count them as separate words
    wordlist_after_jieba = jieba.cut(text, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)

    my_wordcloud = wordcloud.WordCloud().generate(wl_space_split)

    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
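On Windows, open() without an explicit encoding uses the locale codec (often GBK), while the crawler wrote the file as UTF-8, so the read in wcloud() needs encoding='utf-8'. A small demonstration of the mismatch, using a short Chinese string as a stand-in for the crawled text:

```python
text = "词云测试"           # stand-in for the crawled Chinese text
raw = text.encode("utf-8")  # the bytes as written to urls.txt

# reading back with the matching codec round-trips cleanly
roundtrip = raw.decode("utf-8")

# reading back with a locale codec such as GBK garbles the text
garbled = raw.decode("gbk", errors="replace")

print(roundtrip == text, garbled == text)
```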

Finally, I want to produce the word cloud from the main file, so I write this:

from urls import urlsgetword
# note: a local file named wordcloud.py shadows the installed
# wordcloud package, which can break `import wordcloud` inside it
from wordcloud import wcloud

for u in urls:
    urlsgetword(u)
wcloud()
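A slightly more defensive version of this driver loop skips pages that fail to download instead of crashing the whole run. A sketch, where `crawl_all` is a hypothetical helper and the fetch function is injected so it can be faked without network access:

```python
def crawl_all(urls, fetch):
    """Collect text from each URL, skipping any that fail.

    fetch: a callable that takes a URL and returns its text;
    injected so the network layer can be swapped out or faked.
    """
    texts = []
    for u in urls:
        try:
            texts.append(fetch(u))
        except Exception as e:
            print(f"skipping {u}: {e}")
    return texts

# fake fetcher standing in for the real crawler
def fake_fetch(url):
    if "bad" in url:
        raise IOError("404")
    return f"title of {url}"

texts = crawl_all(["http://a", "http://bad", "http://b"], fake_fetch)
print(texts)  # ['title of http://a', 'title of http://b']
```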

As a result, the program failed. Which file is wrong?
