How do I extract only specific body text from a public HTML page in PHP?


I want to extract only specific body text from a particular HTML page. By specific, I mean only the inner text that is related to the title of the page.

The thing is that I am scraping data from a public website, and it's full of advertisements, signup forms, etc., which are irrelevant to the title of the page.

While searching online I found a library called SimpleHTMLDOM, so I added it to my project. This is how I am fetching the body text from the website:

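include('simple_html_dom.php');

// file_get_html() returns false if the page could not be loaded.
$html = file_get_html("URL of the website here");

if ($html) {
    // find() returns an array, so an empty result simply skips the loop.
    foreach ($html->find('p') as $element) {
        echo $element->plaintext . '<br>';
    }
}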

As I feared, it also fetches all the unnecessary text that sits inside <p> tags. So my question is: how do I separate the unnecessary text (from ads, etc.) from the main body text? Or is there another library that would help me do so? Please guide me.
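One way to narrow the result is to select only the <p> elements inside the container that holds the article, rather than every <p> on the page. The following is a minimal sketch, not a definitive solution: it uses PHP's built-in DOMDocument and DOMXPath instead of SimpleHTMLDOM, and the sample markup, the "article-body" class, and the extractParagraphs() helper are all assumptions made up for illustration. The real class or id has to be found by inspecting the target page.

```php
<?php
// Sample markup standing in for a fetched page. The structure and the
// "article-body" class are hypothetical, for illustration only.
$page = <<<HTML
<html><body>
  <div class="ad"><p>Buy now!</p></div>
  <div class="article-body">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
  <form class="signup"><p>Sign up for our newsletter.</p></form>
</body></html>
HTML;

// Hypothetical helper: returns the text of every <p> nested inside an
// element that carries the given class name.
function extractParagraphs(string $page, string $containerClass): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($page);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word within the class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]//p',
        $containerClass
    );

    $texts = [];
    foreach ($xpath->query($query) as $p) {
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

$mainText = extractParagraphs($page, 'article-body');
foreach ($mainText as $text) {
    echo $text, PHP_EOL;
}
```

With the sample markup above, only the two article paragraphs are printed; the ad and the signup form are skipped because they sit outside the chosen container.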

EDIT: As of now I am trying to fetch the necessary body text (excluding ads, etc.) from this URL:

click to view
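When the container of the main text is not known in advance, a rough heuristic is to group the <p> elements by their parent and keep the group with the most text, on the assumption that the article is the largest block of paragraphs on the page. This is only a sketch of that idea under that assumption: it uses PHP's built-in DOMDocument, the sample markup is invented, and densestParagraphGroup() is a hypothetical helper name. It can pick the wrong block on pages where comments or sidebars are longer than the article.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical structure).
$page = <<<HTML
<html><body>
  <div><p>Buy now!</p></div>
  <div>
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
  <form><p>Sign up for our newsletter.</p></form>
</body></html>
HTML;

// Hypothetical helper: returns the paragraph texts that share the parent
// element holding the largest total amount of paragraph text.
function densestParagraphGroup(string $page): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world HTML
    $doc->loadHTML($page);
    libxml_clear_errors();

    $groups = [];  // parent node path => list of paragraph texts
    $lengths = []; // parent node path => total text length in bytes
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if ($text === '') {
            continue; // ignore empty paragraphs
        }
        $key = $p->parentNode->getNodePath();
        $groups[$key][] = $text;
        $lengths[$key] = ($lengths[$key] ?? 0) + strlen($text);
    }

    if ($lengths === []) {
        return []; // the page contains no non-empty paragraphs
    }

    arsort($lengths); // largest total first, keys preserved
    return $groups[array_key_first($lengths)];
}

$mainText = densestParagraphGroup($page);
foreach ($mainText as $text) {
    echo $text, PHP_EOL;
}
```

array_key_first() requires PHP 7.3 or later. If the site wraps each paragraph in its own element, grouping by a higher ancestor instead of the direct parent may work better.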
