Python: Appending to a dictionary with a for loop to output as JSON


I'm a bit new to Python. I'm trying to scrape a page using Beautiful Soup and output the results in JSON format using SimpleJson.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

print simplejson.dumps(my_dict,indent=4)

I'm only getting the results of the last page. Can someone tell me where I'm going wrong?


There are 4 best solutions below

5
On BEST ANSWER

Indentation matters in Python: only the last line needed to be indented so that it runs inside the for loop. Each page's dictionary is then printed before the next iteration overwrites it.

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

or if you really want all the data in one dictionary, then you could concatenate the values as strings inside the loop:

my_dict['title'] = my_dict.get("title","")+","+title
my_dict['body']= my_dict.get("body","")+","+str(body)

Or, to collect the values in lists instead, the code may look like:

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))

print simplejson.dumps(my_dict,indent=4)
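
One detail worth checking in the list-based variant: `list.append` returns `None`, so assigning its result back to the dictionary would discard the list. Below is a minimal sketch of the `setdefault` idiom, using made-up page data in place of the parsed HTML (the `pages` list is purely illustrative and not part of the original code):

```python
import json

# Hypothetical stand-ins for the (title, body) pairs parsed from each page.
pages = [("Page 1", "<div>one</div>"), ("Page 2", "<div>two</div>")]

my_dict = {}
for title, body in pages:
    # setdefault returns the list stored under the key, creating it on first use,
    # so append mutates the list that the dictionary actually holds.
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(body)

output = json.dumps(my_dict, indent=4, sort_keys=True)
print(output)
```

Each key ends up mapping to a list with one entry per page, which serialises to a JSON array.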
3
On

You are overwriting your dictionary each time through the loop. Tab the print statement over so it is included in the for loop:

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)
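
To see why only the last page survives, here is a small sketch with plain strings instead of parsed pages (the `titles` list is an invented example). Assigning to an existing key replaces its value, so the dictionary never holds more than one title at a time:

```python
import json

titles = ["Page 1", "Page 2", "Page 3"]  # hypothetical page titles

my_dict = {}
snapshots = []
for title in titles:
    my_dict["title"] = title               # replaces the value from the previous iteration
    snapshots.append(json.dumps(my_dict))  # serialising inside the loop captures each state

final = json.dumps(my_dict)  # serialising after the loop only sees the last value
print(snapshots)
print(final)
```

Serialising inside the loop gives one JSON document per page, while serialising after the loop gives a single document containing only the final page.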
0
On
results = [] # you need a list to collect all dictionaries

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    this_dict['body'] = str(soup.find(id="bodyText"))  # a Tag object is not JSON serializable
    results.append(this_dict)

print simplejson.dumps(results, indent=4)

I have a feeling, however, that what you want is a dictionary where the keys are page titles and the values are bodies:

results = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    results[soup.title.string] = str(soup.find(id='bodyText'))

print simplejson.dumps(results, indent=4)

Or using comprehensions:

soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)

PS. Please forgive my mistakes, if there are any; I am writing from a phone...
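
A note on serialising the body: `soup.find()` returns a Beautiful Soup `Tag` object (or `None`), while `json.dumps` only handles basic types such as strings, numbers, lists and dictionaries. Here is a hedged sketch of two ways around that, using a hypothetical `FakeTag` class in place of a real `Tag` so that the example runs without Beautiful Soup installed:

```python
import json


class FakeTag(object):
    """Hypothetical stand-in for an HTML element object; not part of any library."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


body = FakeTag('<div id="bodyText">Hello</div>')

# Option 1: convert explicitly before storing the value.
explicit = json.dumps({"body": str(body)})

# Option 2: let json.dumps call str() on anything it cannot serialise itself.
fallback = json.dumps({"body": body}, default=str)

try:
    json.dumps({"body": body})
    raised = False
except TypeError:
    raised = True  # plain json.dumps rejects unknown object types

print(explicit == fallback, raised)
```

The `default` parameter is part of the standard `json.dumps` signature. Whether to use it or an explicit `str()` is mostly a matter of taste, though the explicit conversion makes it more obvious what ends up in the output.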

0
On

Since you are overwriting title and body in each iteration, there are two ways of handling it:

  1. Create a list of all dictionaries as:

    all_dict=[]
    for webpage in webpages:
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict = {}  # new dictionary for each page
        my_dict['title'] = title
        my_dict['body']= str(body)
        all_dict.append(my_dict)
    
    for my_dict in all_dict:
        print simplejson.dumps(my_dict,indent=4)
    
  2. Use the iteration index from enumerate() to create distinct key names such as title0, body0, title1, body1, etc. This way each title and body is preserved in the same dictionary:

    for i,webpage in enumerate(webpages):
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict['title'+str(i)] = title
        my_dict['body'+str(i)]= str(body)
    
    print simplejson.dumps(my_dict,indent=4)
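
One pitfall with the list approach in option 1: if the same dictionary object is appended on every iteration, the list ends up holding several references to one dictionary, all showing the last page. Here is a small sketch with invented titles (the `titles` list and the variable names are illustrative only), contrasting a shared dictionary with a fresh one per iteration:

```python
import json

titles = ["Page 1", "Page 2", "Page 3"]  # hypothetical page titles

shared = {}
aliased = []
for title in titles:
    shared["title"] = title
    aliased.append(shared)        # appends a reference to the same dict each time

separate = []
for title in titles:
    page_dict = {"title": title}  # a new dict per page keeps the entries independent
    separate.append(page_dict)

print(json.dumps(aliased))
print(json.dumps(separate))
```

Creating the dictionary inside the loop is what keeps each page's data distinct in the final list.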