I'm searching through an HTML file with BeautifulSoup's find_all function, and I'm running into a couple of problems. First, since I only want the <script> tags, I have to call soup.find_all('script'), because find_all() won't let me include the < and > characters. Is there a way around this? When I search for just script, I'm getting parts of the HTML file that are not script tags but text that uses the word script in a URL or paragraph.
Second, when I use soup.find_all('script'), there are certain HTML files where not all script tags are returned. In some files the missing ones are <script> tags in the <head> of the file; in others, they are the scripts that deal with the page parameters. Is there a way to force all script tags to be returned?
For example, one of the ignored <script> tags looks like this:
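To check the first problem in isolation, here is a minimal sketch. The HTML string, the file name app.js and the variable names are made up for illustration and are not from my real file. As far as I understand it, find_all('script') matches on the tag name only, so a paragraph or URL that merely contains the word script should not be returned. The text inside a matched <script> tag is part of that tag, though, so it does show up when the tag is printed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: two real <script> tags, plus a paragraph
# and a link that only mention the word "script".
html = """
<html><head><script src="app.js"></script></head>
<body>
<p>This paragraph mentions the word script.</p>
<a href="/script/page.html">script link</a>
<script>var description = "inline";</script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The string argument is compared against tag names, not against text.
tags = soup.find_all('script')
names = [tag.name for tag in tags]
print(names)  # ['script', 'script']
```

If output like this still looks wrong on a real file, printing tag.name for each result may help confirm whether the extra text is really a separate match or just the contents of a script tag.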
<!--[if lte IE 7]>
<script src="//www.webiste.com" type="text/javascript" ></script>
<![endif]-->
My code is:
from bs4 import BeautifulSoup
# `file` holds the path to the HTML file
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')
tags = soup.find_all('script')
I'm trying to grab every <script>...</script> section out of the HTML file. This has been the easiest way I've found to do it, but if anyone knows of an easier way that will also fix my other problems I'm open to changing my code.
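One possible direction for the second problem, as a rough sketch rather than a confirmed fix: the ignored example above sits inside an HTML comment (a conditional comment), and the parser appears to keep the whole comment as a single Comment string instead of parsing the tags inside it. Under that assumption, each comment can be parsed again and searched separately. The function name find_all_scripts and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, Comment

def find_all_scripts(markup):
    """Return <script> tags from the markup, including ones inside comments."""
    soup = BeautifulSoup(markup, 'html.parser')
    scripts = list(soup.find_all('script'))

    # Comments are stored as Comment strings, so tags inside them are
    # not part of the tree. Re-parse each comment and search it too.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        inner = BeautifulSoup(str(comment), 'html.parser')
        scripts.extend(inner.find_all('script'))
    return scripts

# Hypothetical sample built around the ignored snippet shown above.
sample = """<html><head>
<script src="app.js"></script>
<!--[if lte IE 7]>
<script src="//www.webiste.com" type="text/javascript" ></script>
<![endif]-->
</head><body></body></html>"""

srcs = [tag.get('src') for tag in find_all_scripts(sample)]
print(srcs)
```

Note that scripts found inside comments are appended after the regular ones, so the result is not in document order. This sketch also does not look for comments nested inside other comments.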