Getting IPs from a .html file


There is a site that lists SOCKS4 proxies, which I use with proxychains. Instead of manually entering new IPs, I am trying to automate the process. I used wget to save the page as an .html file in my home directory; this is some of the output if I cat the file:

</font></a></td><td colspan=1><font class=spy1>111.230.138.177</font> <font class=spy14>(Shenzhen Tencent Computer Systems Company Limited)</font></td><td colspan=1><font class=spy1>6.531</font></td><td colspan=1><TABLE width='13' height='8' CELLPADDING=0 CELLSPACING=0><TR  BGCOLOR=blue><TD  width=1></TD></TR></TABLE></td><td colspan=1><font class=spy1><acronym title='311 of 436 - last check status=OK'>71% <font class=spy1>(311)</font> <font class=spy5>-</font></acronym></font></td><td colspan=1><font class=spy1><font class=spy14>05-jun-2020</font> 23:06 <font class=spy5>(4 mins ago)</font></font></td></tr><tr class=spy1x onmouseover="this.style.background='#002424'" onmouseout="this.style.background='#19373A'"><td colspan=1><font class=spy14>139.99.104.233<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(a1j0e5^q7p6)+(m3f6f6^r8c3)+(a1j0e5^q7p6)+(t0b2s9^y5m3)+(w3c3m3^z6j0))</script></font></td><td colspan=1>SOCKS5</td><td colspan=1><a href='/en/anonymous-proxy-list/'><font class=spy1>HIA</font></a></td><td colspan=1><a href='/free-proxy-list/CA/'><font class=spy14>Canada</

As you can see, the IP is usually preceded by a class=spyN> attribute (for example spy1> or spy14>). I tried to parse out the actual IPs with awk using the following code:

awk '/^spy/{FS=">";  print $2 }' file-name.html

This is problematic because there would be a bunch of other stuff trailing after the IP; also, I guess the ^ anchor only works at the beginning of a line? Anyway, I was wondering if anyone could give me ideas on how to parse out the IP addresses with awk. I just started learning awk, so sorry for the noob question. Thanks


There are 3 best solutions below


Using a proper XML/HTML parser and an XPath expression:

xidel -se '(//td[@colspan=1]/font[@class="spy1"])[1]/text()' file.html

Output:

111.230.138.177

Or, if the IP is not always the first XPath match:

xidel -se '//td[@colspan=1]/font[@class="spy1"]/text()' file.html |
   perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'

AWK is well suited to extracting IP addresses:

gawk -v RS="spy[0-9]*" '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); if (ip) {print ip}}' file.html

Result:

111.230.138.177
139.99.104.233

Explanation:

Use gawk here: treating the record separator (RS) as a regular expression is an extension that POSIX awk does not guarantee.

  1. We split the file into records by setting RS to a regex, so that for this input each record contains at most one IP address.

  2. The match function searches the whole record for the second regex: four groups of one to three digits, separated by dots (the IP address).

  3. The substr function then extracts from the whole record ($0) a fragment of length RLENGTH starting at RSTART (the position where the match begins).

  4. The if checks whether ip is non-empty and prints it only in that case. This prevents empty lines in the output.

This method of extracting IP addresses does not depend on the file being well-formed; it does not even have to be HTML.
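
The same match/substr idea can also be sketched without a regex record separator, by looping over every match on each line. This is a minimal sketch under a few assumptions, not a replacement for the one-liner above: the function name extract_ips is hypothetical, the input is read from stdin, and the pattern uses [0-9]+ with an explicit octet check instead of {1,3}, because some awk builds do not support interval expressions.

```shell
# extract_ips: hypothetical helper name, not part of the original answer.
# Reads text on stdin and prints every dotted quad whose four octets
# are each at most three digits long and no greater than 255.
extract_ips() {
  awk '{
    line = $0
    # Loop so that every match on the line is found, not only the first.
    while (match(line, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
      ip = substr(line, RSTART, RLENGTH)      # the matched text
      line = substr(line, RSTART + RLENGTH)   # keep scanning after it
      n = split(ip, octet, "[.]")
      ok = (n == 4)
      for (i = 1; i <= n; i++)
        if (length(octet[i]) > 3 || octet[i] + 0 > 255) ok = 0
      if (ok) print ip
    }
  }'
}
```

Usage would be something like extract_ips < file.html. The octet check rejects strings such as 999.1.1.1, which the plain digit pattern would otherwise accept.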


Solutions have already been provided here; I'm adding a different one for future readers, using the egrep utility.

egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html
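
If the page lists the same host more than once, the matches can be de-duplicated. The following is a small sketch along those lines, assuming a grep that supports the -o option (GNU and BSD grep do); the function name unique_ips is hypothetical. It uses grep -E, which GNU grep documents as the replacement for the deprecated egrep command. Like the command above, it does not check that each octet is 255 or less.

```shell
# unique_ips: hypothetical helper name, not part of the original answer.
# Prints each candidate IPv4 address found in the file given as $1 once.
unique_ips() {
  # -E enables extended regex syntax, -o prints only the matched text.
  grep -Eo '[0-9]{1,3}(\.[0-9]{1,3}){3}' "$1" | sort -u
}
```

Usage would be something like unique_ips file.html > ips.txt, where ips.txt is just an example output name.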