I have the following HTML, which I am parsing with BeautifulSoup:
from bs4 import BeautifulSoup
import re
html = '''
<h1 class="ch">Ch 4:Chapter Title</h1>
<h2 class="subchapter">subch. 1:Subchapter Title 1</h2>
<div>
<div>
<div class="left">sec 1.</div>
<div class="right">belongs to left sec 1</div>
</div>
<div>
<div class="left">sec 2.</div>
<div class="right">belongs to left sec 2</div>
</div>
<div>
<div class="left">sec 3.</div>
<div class="right">belongs to left sec 3</div>
</div>
</div>
<h2 class="subchapter">subch. 2:Subchapter Title 2</h2>
<div>
<div>
<div class="left">sec 4.</div>
<div class="right">belongs to left sec 4</div>
</div>
<div>
<div class="left">sec 5.</div>
<div class="right">belongs to left sec 5</div>
</div>
<div>
<div class="left">sec 6.</div>
<div class="right">belongs to left sec 6</div>
</div>
</div>
'''
lst = []
s = BeautifulSoup(html, 'html.parser')
chs = s.find_all('h1', attrs={'class':'ch'})
subchs = s.find_all('h2', attrs={'class': 'subchapter'})
secns = s.find_all('div', attrs={'class':'left'})
sectxts = s.find_all('div', attrs={'class':'right'})
if chs:
    for ch in chs:
        chapter = ch.text
        ch_citation = re.search(r".*:", chapter).group()
        ch_title = re.sub(r".*:","",chapter)
        if subchs:
            for subch in subchs:
                subchapter = subch.text
                subch_citation = re.search(r".*:", subchapter).group()
                subch_title = re.sub(r".*:","",subchapter)
                for secn in secns:
                    section_citation = secn.text
                for sectxt in sectxts:
                    section_txt = sectxt.text
                    lst.append([
                        ch_citation,
                        subch_citation,
                        section_citation,
                        ch_title,
                        subch_title,
                        section_txt
                    ])
print(lst)
Outputs:
Subchapter 1:
[['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 1'],
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 2'],
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 3'],
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 4'],
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 5'],
['Ch 4:', 'subch. 1:', 'sec 6.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 6'],
Subchapter 2:
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 1'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 2'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 3'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 4'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 5'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 6']]
The Problem
- As you can see, the section text (section_txt) is repeated under both subchapter 1 and subchapter 2.
- Also, section_citation only ever holds the last citation (sec 6.).
- In some cases there won't be any subchapters, only the chapter and its sections, so ideally the section loop would not sit inside the subchapter for loop. Is there any list manipulation that can produce the output below and make the following code work?
Ideal code for future projects
lst = []
s = BeautifulSoup(html, 'html.parser')
chs = s.find_all('h1', attrs={'class':'ch'})
subchs = s.find_all('h2', attrs={'class': 'subchapter'})
secns = s.find_all('div', attrs={'class':'left'})
sectxts = s.find_all('div', attrs={'class':'right'})
if chs:
    for ch in chs:
        chapter = ch.text
        ch_citation = re.search(r".*:", chapter).group()
        ch_title = re.sub(r".*:","",chapter)
if subchs:
    for subch in subchs:
        subchapter = subch.text
        subch_citation = re.search(r".*:", subchapter).group()
        subch_title = re.sub(r".*:","",subchapter)
if secns and sectxts:
    for secn in secns:
        section_citation = secn.text
    for sectxt in sectxts:
        section_txt = sectxt.text
        lst.append([ch_citation,subch_citation,section_citation,ch_title,subch_title,section_txt])
!Ideal Output!
Subchapter 1:
[['Ch 4:', 'subch. 1:', 'sec 1.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 1'],
['Ch 4:', 'subch. 1:', 'sec 2.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 2'],
['Ch 4:', 'subch. 1:', 'sec 3.', 'Chapter Title', 'Subchapter Title 1', 'belongs to left sec 3'],
Subchapter 2:
['Ch 4:', 'subch. 2:', 'sec 4.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 4'],
['Ch 4:', 'subch. 2:', 'sec 5.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 5'],
['Ch 4:', 'subch. 2:', 'sec 6.', 'Chapter Title', 'Subchapter Title 2', 'belongs to left sec 6']]
Any advice, ideas, or help is greatly appreciated!
I'm not exactly sure what is needed, but you can use .find_previous to check which chapter/subchapter etc. you're currently in.
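A minimal sketch of that idea, under a few assumptions: each "left" div has its matching "right" div as a sibling (as in the sample HTML), and a heading uses its first colon to separate the citation from the title. The helper names split_heading and collect_rows are made up for illustration, and the HTML is a condensed copy of the sample from the question. It iterates over the sections once and looks backwards for the enclosing headings, so there are no nested loops to get out of sync:

```python
from bs4 import BeautifulSoup

html = """
<h1 class="ch">Ch 4:Chapter Title</h1>
<h2 class="subchapter">subch. 1:Subchapter Title 1</h2>
<div>
  <div><div class="left">sec 1.</div><div class="right">belongs to left sec 1</div></div>
  <div><div class="left">sec 2.</div><div class="right">belongs to left sec 2</div></div>
  <div><div class="left">sec 3.</div><div class="right">belongs to left sec 3</div></div>
</div>
<h2 class="subchapter">subch. 2:Subchapter Title 2</h2>
<div>
  <div><div class="left">sec 4.</div><div class="right">belongs to left sec 4</div></div>
  <div><div class="left">sec 5.</div><div class="right">belongs to left sec 5</div></div>
  <div><div class="left">sec 6.</div><div class="right">belongs to left sec 6</div></div>
</div>
"""


def split_heading(tag):
    """Return (citation, title) for a heading such as 'Ch 4:Chapter Title'.

    Hypothetical helper: returns two empty strings when the heading is missing.
    """
    if tag is None:
        return "", ""
    citation, sep, title = tag.get_text(strip=True).partition(":")
    return citation + sep, title


def collect_rows(markup):
    """Build one row per section by looking backwards for its headings."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for left in soup.find_all("div", class_="left"):
        # The section text is assumed to be the next sibling with class "right".
        right = left.find_next_sibling("div", class_="right")
        ch = left.find_previous("h1", class_="ch")
        subch = left.find_previous("h2", class_="subchapter")
        # Ignore a subchapter that belongs to an earlier chapter.
        if subch is not None and subch.find_previous("h1", class_="ch") is not ch:
            subch = None
        ch_citation, ch_title = split_heading(ch)
        subch_citation, subch_title = split_heading(subch)
        rows.append([
            ch_citation,
            subch_citation,
            left.get_text(strip=True),
            ch_title,
            subch_title,
            right.get_text(strip=True) if right is not None else "",
        ])
    return rows


rows = collect_rows(html)
for row in rows:
    print(row)
```

For the sample HTML this should give six rows, one per section, each paired with its own subchapter. When a chapter has no subchapters, the subchapter columns are simply left as empty strings.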