Ch 4:Chapter Title

I am parsing the following HTML with BeautifulSoup.

from bs4 import BeautifulSoup
import re

html = '''
<h1 class="ch">Ch 4:Chapter Title</h1>
<h2 class="subchapter">subch. 1:Subchapter Title 1</h2>

<div>
    <div>
        <div class="left">sec 1.</div>
        <div class="right">belongs to left sec 1</div>
    </div>

    <div>
        <div class="left">sec 2.</div>
        <div class="right">belongs to left sec 2</div>
    </div>

    <div>
        <div class="left">sec 3.</div>
        <div class="right">belongs to left sec 3</div>
    </div>
</div>

<h2 class="subchapter">subch. 2:Subchapter Title 2</h2>

<div>
    <div>
        <div class="left">sec 4.</div>
        <div class="right">belongs to left sec 4</div>
    </div>

    <div>
        <div class="left">sec 5.</div>
        <div class="right">belongs to left sec 5</div>
    </div>

    <div>
        <div class="left">sec 6.</div>
        <div class="right">belongs to left sec 6</div>
    </div>
</div>
'''

lst = []
s = BeautifulSoup(html, 'html.parser')

chs = s.find_all('h1', attrs={'class':'ch'})
subchs = s.find_all('h2', attrs={'class': 'subchapter'})
secns = s.find_all('div', attrs={'class':'left'})
sectxts = s.find_all('div', attrs={'class':'right'})

if chs:
   for ch in chs:
      chapter = ch.text
      ch_citation = re.search(r".*:", chapter).group()
      ch_title = re.sub(r".*:","",chapter)

if subchs:
   for subch in subchs:
      subchapter = subch.text
      subch_citation = re.search(r".*:", subchapter).group()
      subch_title = re.sub(r".*:","",subchapter)

      for secn in secns:
         section_citation = secn.text
      for sectxt in sectxts:
         section_txt = sectxt.text
         lst.append([
                    ch_citation,
                    subch_citation,
                    section_citation,
                    ch_title, 
                    subch_title,
                    section_txt
                    ])

print(lst)

Outputs:

Subchapter 1:
[['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 1'], 
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 2'], 
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 3'], 
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 4'], 
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 5'], 
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 6'], 

Subchapter 2:
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 1'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 2'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 3'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 4'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 5'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 6']]

The Problem

  1. As you can see, the section text (section_txt) is repeated under both subchapter 1 and subchapter 2: each subchapter gets all six sections.
  2. section_citation only ever holds the last citation (sec 6.).
  3. In some cases there won't be any subchapters, only the chapter and its sections, so ideally the section loop should not sit inside the subchapter for loop. Is there any list manipulation that produces the output below and makes the following code work?

Ideal code for future projects

lst = []
s = BeautifulSoup(html, 'html.parser')

chs = s.find_all('h1', attrs={'class':'ch'})
subchs = s.find_all('h2', attrs={'class': 'subchapter'})
secns = s.find_all('div', attrs={'class':'left'})
sectxts = s.find_all('div', attrs={'class':'right'})

if chs:
   for ch in chs:
      chapter = ch.text
      ch_citation = re.search(r".*:", chapter).group()
      ch_title = re.sub(r".*:","",chapter)

if subchs:
   for subch in subchs:
      subchapter = subch.text
      subch_citation = re.search(r".*:", subchapter).group()
      subch_title = re.sub(r".*:","",subchapter)
      
if secns and sectxts:
   for secn in secns:
      section_citation = secn.text
   for sectxt in sectxts:
      section_txt = sectxt.text
      
lst.append([ch_citation,subch_citation,section_citation,ch_title,subch_title,section_txt])

Ideal output

Subchapter 1:
[['Ch 4:', 'subch. 1:', 'sec 1.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 1'], 
['Ch 4:', 'subch. 1:', 'sec 2.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 2'], 
['Ch 4:', 'subch. 1:', 'sec 3.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 3']

Subchapter 2: 
['Ch 4:', 'subch. 2:', 'sec 4.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 4'], 
['Ch 4:', 'subch. 2:', 'sec 5.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 5'], 
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 6']]

Any advice, ideas, or help is greatly appreciated!

Answer by Andrej Kesely:

I'm not exactly sure what is needed, but you can use .find_previous to check which chapter/subchapter etc. each section belongs to. For example:

import pandas as pd
from bs4 import BeautifulSoup


html = """
... your html code from question ...
"""

soup = BeautifulSoup(html, "html.parser")

out = []
for c in soup.select(".left"):
    citation = c.text.strip()
    text = c.find_next(class_="right").text.strip()
    subchapter = c.find_previous(class_="subchapter").text.strip()
    chapter = c.find_previous(class_="ch").text.strip()

    out.append(
        [
            chapter.split(":")[0],
            subchapter.split(":")[0],
            citation,
            chapter.split(":")[1],
            subchapter.split(":")[1],
            text,
        ]
    )

df = pd.DataFrame(
    out,
    columns=[
        "chapter_no",
        "subchapter_no",
        "citation",
        "chapter_text",
        "subchapter_text",
        "text",
    ],
)
print(df)

Prints:

  chapter_no subchapter_no citation   chapter_text     subchapter_text                   text
0       Ch 4      subch. 1   sec 1.  Chapter Title  Subchapter Title 1  belongs to left sec 1
1       Ch 4      subch. 1   sec 2.  Chapter Title  Subchapter Title 1  belongs to left sec 2
2       Ch 4      subch. 1   sec 3.  Chapter Title  Subchapter Title 1  belongs to left sec 3
3       Ch 4      subch. 2   sec 4.  Chapter Title  Subchapter Title 2  belongs to left sec 4
4       Ch 4      subch. 2   sec 5.  Chapter Title  Subchapter Title 2  belongs to left sec 5
5       Ch 4      subch. 2   sec 6.  Chapter Title  Subchapter Title 2  belongs to left sec 6
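
For point 3 in the question (chapters without subchapters), .find_previous(class_="subchapter") would return None, or pick up a subchapter from an earlier chapter. One possible alternative is a single pass over the tags in document order that remembers the current chapter and subchapter and resets the subchapter at each new chapter. This is only a rough sketch: the helper names split_heading and extract_rows and the "Ch 5" sample chapter are made up for illustration, and it assumes each .right div is a sibling of its .left div, as in the question's HTML.

```python
from bs4 import BeautifulSoup


def split_heading(text):
    # 'Ch 4:Chapter Title' -> ('Ch 4:', 'Chapter Title').
    # Without a colon, the whole text ends up in the citation part.
    citation, sep, title = text.strip().partition(":")
    return citation + sep, title.strip()


def extract_rows(html):
    # Hypothetical helper: walk headings and section divs in document order.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    ch = ("", "")   # (citation, title) of the current chapter
    sub = ("", "")  # (citation, title) of the current subchapter
    for tag in soup.find_all(["h1", "h2", "div"]):
        classes = tag.get("class") or []
        if tag.name == "h1" and "ch" in classes:
            ch = split_heading(tag.get_text())
            sub = ("", "")  # a new chapter starts without a subchapter
        elif tag.name == "h2" and "subchapter" in classes:
            sub = split_heading(tag.get_text())
        elif tag.name == "div" and "left" in classes:
            # Assumption: the matching text is the next sibling .right div.
            right = tag.find_next_sibling("div", class_="right")
            text = right.get_text(strip=True) if right else ""
            rows.append(
                [ch[0], sub[0], tag.get_text(strip=True), ch[1], sub[1], text]
            )
    return rows


# Made-up sample: "Ch 5" has no subchapter.
sample = """
<h1 class="ch">Ch 4:Chapter Title</h1>
<h2 class="subchapter">subch. 1:Subchapter Title 1</h2>
<div>
    <div>
        <div class="left">sec 1.</div>
        <div class="right">belongs to left sec 1</div>
    </div>
</div>
<h1 class="ch">Ch 5:Another Chapter</h1>
<div>
    <div>
        <div class="left">sec 2.</div>
        <div class="right">belongs to left sec 2</div>
    </div>
</div>
"""

rows = extract_rows(sample)
for row in rows:
    print(row)
```

Here the section under "Ch 5" gets empty strings in the subchapter columns instead of inheriting "subch. 1:". The rows keep the column order and trailing colons from the question's ideal output, so they can be passed to pd.DataFrame the same way as out above.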