How to get text which has no HTML tag | Add multiple delimiters in split


The following code (a BeautifulSoup CSS selector rather than true XPath) selects the div element with class ajaxcourseindentfix, splits its text on "Prerequisite: ", and gives me all the content after it.

div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]

My div can contain not only Prerequisite but also any of the following split points:

Prerequisites
Corequisite
Corequisites

Whenever the div contains Prerequisite, the code above works fine, but whenever one of the other three appears, the split fails and gives me the whole text.

Is there a way to split on multiple delimiters? If not, how else can I solve this?

Sample pages:

Corequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show

Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show

Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show

[Old Thread] - How to get text which has no HTML tag

Best answer

This code solves your problem, unless you specifically need XPath. I would also suggest reviewing the BeautifulSoup documentation for the methods I've used.

.next_element and .next_sibling can be very useful in these cases. With .next_elements we get a generator, which we either have to convert to a list or consume in a way that suits a generator (for example, by looping over it).

from bs4 import BeautifulSoup
import requests


url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
makereq = requests.get(url).text

soup = BeautifulSoup(makereq, 'lxml')

whole = soup.find('td', {'class': 'custompad_10'})
# select the enclosing table cell (td); not strictly needed in this case
thedivs = whole.find_all('div')
# list of all div elements nested inside that cell

title_h3 = thedivs[2]
# pick the third div (index 2) from the list and save it in a variable

mytitle = title_h3.h3
# using .h3 we can traverse (go to the child <h3> element)

mylist = list(mytitle.next_elements)
# title_h3.h3 is still part of the tree, so we save every element that follows it in document order

the_text = mylist[3]
# we can then select specific elements 
# from a generator that we've converted into a list (i.e. list(...))

prerequisite = mylist[6]

which_cpsc = mylist[8]

other_text = mylist[11]

print(the_text, ' is the text')
print(which_cpsc, other_text, ' is the cpsc and othertext ')
# this is for testing purposes

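This solves both issues: we don't have to use CSS selectors or those awkward list manipulations, and everything comes straight from the parsed tree. Keep in mind that the list indices (3, 6, 8, 11) are tied to this particular page's markup, so other course pages may need different positions.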
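If you would rather keep the original join-and-split approach, the multiple-delimiter part of the question can also be handled with a regular expression. The following is only a minimal sketch: the sample string is made up to resemble what " ".join(div.stripped_strings) might return, and split_requirements is a hypothetical helper name, not part of BeautifulSoup.

```python
import re

# Matches "Prerequisite", "Prerequisites", "Corequisite" or "Corequisites",
# followed by a colon. The label itself is captured so we know which one matched.
DELIMITER = re.compile(r"\b((?:Pre|Co)requisites?)\s*:\s*", re.IGNORECASE)


def split_requirements(text):
    """Return a dict mapping 'prerequisite' / 'corequisite' to the text after each label."""
    # With a capturing group, re.split yields:
    # [text_before, label_1, body_1, label_2, body_2, ...]
    parts = DELIMITER.split(text)
    result = {}
    for label, body in zip(parts[1::2], parts[2::2]):
        key = label.lower().rstrip("s")  # fold the plural forms into the singular
        result[key] = body.strip()
    return result


# Made-up text standing in for " ".join(div.stripped_strings)
sample_text = (
    "CPSC 999 - Sample Course Description text. "
    "Prerequisites: CPSC 131 . Corequisite: MATH 270A ."
)

print(split_requirements(sample_text))
```

If none of the labels is present, the function returns an empty dict instead of the whole text, which makes the "no match" case easy to detect.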