Scraping with Invoke-WebRequest


We are migrating an ASP.NET intranet to SharePoint and automating the conversion via PowerShell.

We only want to scrape the links inside the DIV tag with the class name 'topnav', not all the links on the page.

$url = "http://intranet.company.com"
$page = Invoke-WebRequest -Uri $url
$div_topnav = $page.ParsedHtml.getElementsByTagName('div') | ? {$_.className -match 'topnav'}

This gets us the HTML of the topnav, but how can we best extract just the application links under the Applications node? We do not want the links under the Home or Documents nodes.

<div class="topnav" >
<ul class="lev1 clearfix" >
    <li class="lev1 pos1 first lev1_first">
        <a href="index.html">Home</a>
    </li>
    <li class="lev1 pos2 haschildren lev1_haschildren">
        <a href="index.html">Applications</a>
        <ul>
            <li class="lev2 pos1 first lev2_first">
                <a href="http://someurl.com">App 1</a>
            </li>
            <li class="lev2 pos2 haschildren lev2_haschildren">
                <a href="index.html">Training</a>
                <ul class="lev3">
                    <li class="lev3 pos1 lev3_pos1 first lev3_first">
                        <a href="http://someurl.com">App 3</a>
                    </li>
                    <li class="lev3 pos2 lev3_pos2 last lev3_last">
                        <a href="http://someurl.com">App 4</a>
                    </li>
                </ul>
            </li>
        </ul>
    <li class="lev1 pos3 haschildren lev1_haschildren">
        <a href="index.html">Documents</a>
        <ul>
            <li class="lev2 pos1 first lev2_first">
                <a href="http://someurl.com">Doc 1</a>
            </li>
            <li class="lev2 pos2 haschildren lev2_haschildren">
                <a href="index.html">Training</a>
                <ul class="lev3">
                    <li class="lev3 pos1 lev3_pos1 first lev3_first">
                        <a href="http://someurl.com">Doc 3</a>
                    </li>
                    <li class="lev3 pos2 lev3_pos2 last lev3_last">
                        <a href="http://someurl.com">Doc 4</a>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
</div>

There is 1 solution below

PatM0 On

I think this is what you want:

[xml]$div_topnav = @"
<div class="topnav" >
    <ul class="lev1 clearfix" >
    <li class="lev1 pos1 first lev1_first">
        <a href="index.html">Home</a>
    </li>
    <li class="lev1 pos2 haschildren lev1_haschildren">
        <a href="index.html">Applications</a>
        <ul>
            <li class="lev2 pos1 first lev2_first">
                <a href="http://someurl.com">App 1</a>
            </li>
            <li class="lev2 pos2 haschildren lev2_haschildren">
                <a href="index.html">Training</a>
                <ul class="lev3">
                    <li class="lev3 pos1 lev3_pos1 first lev3_first">
                        <a href="http://someurl.com">App 3</a>
                    </li>
                    <li class="lev3 pos2 lev3_pos2 last lev3_last">
                        <a href="http://someurl.com">App 4</a>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
        <li class="lev1 pos3 haschildren lev1_haschildren">
            <a href="index.html">Documents</a>
            <ul>
                <li class="lev2 pos1 first lev2_first">
                    <a href="http://someurl.com">Doc 1</a>
                </li>
                <li class="lev2 pos2 haschildren lev2_haschildren">
                    <a href="index.html">Training</a>
                    <ul class="lev3">
                        <li class="lev3 pos1 lev3_pos1 first lev3_first">
                            <a href="http://someurl.com">Doc 3</a>
                        </li>
                        <li class="lev3 pos2 lev3_pos2 last lev3_last">
                            <a href="http://someurl.com">Doc 4</a>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
</div>
"@
($div_topnav.GetElementsByTagName("a") | ? "#Text" -Like "App *").href

The output is the href of every link whose text starts with "App ", which gives you the links of all of your apps.
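
Matching on the link text only works as long as every application link is named "App ...". A minimal sketch of an alternative, assuming the markup is well-formed XML, is to select by structure with XPath: take the top-level li whose own link reads "Applications" and collect the absolute links beneath it. The trimmed menu, the distinct URLs (made up so the results are distinguishable) and the names $nav, $xpath and $appLinks are illustrative assumptions, not part of the original markup.

```powershell
# Trimmed-down, well-formed copy of the menu so the [xml] cast succeeds.
# The URLs are hypothetical and made distinct for illustration only.
[xml]$nav = @'
<div class="topnav">
  <ul class="lev1 clearfix">
    <li class="lev1 pos1"><a href="index.html">Home</a></li>
    <li class="lev1 pos2 haschildren">
      <a href="index.html">Applications</a>
      <ul>
        <li class="lev2 pos1"><a href="http://someurl.com/app1">App 1</a></li>
        <li class="lev2 pos2 haschildren">
          <a href="index.html">Training</a>
          <ul class="lev3">
            <li class="lev3 pos1"><a href="http://someurl.com/app3">App 3</a></li>
          </ul>
        </li>
      </ul>
    </li>
    <li class="lev1 pos3 haschildren">
      <a href="index.html">Documents</a>
      <ul>
        <li class="lev2 pos1"><a href="http://someurl.com/doc1">Doc 1</a></li>
      </ul>
    </li>
  </ul>
</div>
'@

# Top-level <li> whose direct <a> child has the text "Applications",
# then every <a> below its nested <ul> whose href starts with "http".
$xpath = "/div/ul/li[a='Applications']/ul//a[starts-with(@href, 'http')]"
$appLinks = $nav.SelectNodes($xpath) | ForEach-Object { $_.GetAttribute('href') }
$appLinks
```

The starts-with filter drops intermediate menu entries such as "Training" that point at index.html; remove it if those links are wanted too.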

PowerShell couldn't parse the $div_topnav content as you posted it, because the closing </li> tag is missing for the <li> tag in line 6 (I fixed that in my code snippet).
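
To see this kind of failure instead of having the script stop, the cast can be wrapped in try/catch. This is only a sketch, assuming a shortened fragment with the same sort of unclosed li; the names $broken, $parsed and $isWellFormed are illustrative.

```powershell
# Fragment with an unclosed <li>, mirroring the problem in the posted markup.
$broken = @'
<ul>
  <li><a href="index.html">Applications</a>
  <li><a href="index.html">Documents</a></li>
</ul>
'@

$parsed = $null
try {
    # The [xml] cast throws when the text is not well-formed XML.
    $parsed = [xml]$broken
}
catch {
    Write-Warning "Markup is not well-formed XML: $($_.Exception.Message)"
}

# $true only when the cast succeeded.
$isWellFormed = $null -ne $parsed
```

The warning text includes the parser's message, which normally names the mismatched tags and their position, so it helps locate the tag that needs closing.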