How to extract specific text from HTML table?


Here is my HTML file. I want to extract the text "Pending", "Next Listing Date (Likely):", and "10/01/2014". I am using Jaunt and Jsoup.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
      <meta http-equiv="Content-Language" content="en-us"/>
      <meta http-equiv="Content-Type" content="text/html;url=http://allahabadhighcourt.in/casestatus/utf-8"/>
      <title>Case Status Result</title>
      <link REL="StyleSheet" href="http://allahabadhighcourt.in/alldhc.css" TYPE="text/css"/>
      <script src="http://allahabadhighcourt.in/alldhc.js" LANGUAGE="JavaScript" TYPE="text/javascript">
      <!--
      -->
      </script>
   </head>
   <body onLoad="bodyOnLoad()">
      <div CLASS="heading">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/titleEN.gif" WIDTH="532" HEIGHT="30" ALT="HIGH COURT OF JUDICATURE AT ALLAHABAD"/>
      </div>
      <h4 CLASS="subheading" ALIGN="center" STYLE="margin-top: 6pt; margin-bottom: 0pt">Case Status - Allahabad</h4>
      <p ALIGN="center" STYLE="margin-top: 0; margin-bottom: 6pt">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/blueline.gif" WIDTH="210" HEIGHT="1"/></p>
<table ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">
<tr><td VALIGN='top' COLSPAN='2' ALIGN='right' STYLE='font-size: 18pt'>Pending</td></tr><tr><td VALIGN='top' ALIGN='center' COLSPAN='2' STYLE='font-size: 16pt'>Criminal Misc. Bail Application : 12898 of 2013 [Etah]</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Petitioner:</td><td STYLE='font-size: 14pt'>AVANISH</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Respondent:</td><td STYLE='font-size: 14pt'>STATE OF U.P.</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Pet.):</td><td STYLE='font-size: 14pt'>SANJEEV MISHRA</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Res.):</td><td STYLE='font-size: 14pt'>GOVT. ADVOCATE</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Category:</td><td VALIGN='top'>Criminal Jurisdiction Application-U/s 439, Cr.p.c., For Bail (major)</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Date of Filing:</td><td VALIGN='top' STYLE='font-size: 14pt'>08/05/2013</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Last Listed on:</td><td STYLE='font-size: 14pt'>03/01/2014 in Court No. 48</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Next Listing Date (Likely):</td><td STYLE='font-size: 14pt'>10/01/2014</td></tr><tr><td COLSPAN='2'></td></tr></table><p STYLE="text-align: justify; margin-top: 16pt; margin-left: 90pt; margin-right: 90pt; font-size: 10pt">This is not an authentic/certified copy of the information regarding status of a case. Authentic/certified information may be obtained under Chapter VIII Rule 30 of Allahabad High Court Rules. Mistake, if any, may be brought to the notice of OSD (Computer).</p>
      <table ALIGN="center" WIDTH="80%" COLS="1" RULES="NONE" BORDER="0" STYLE="margin-top: 16pt">
         <tbody>
            <tr ALIGN="center" VALIGN="TOP">
               <td VALIGN="TOP" ALIGN="center">
                  <img ALT="Back" src="http://allahabadhighcourt.in/image/back.gif" WIDTH="30" HEIGHT="25" BORDER="0" onClick="location.href='indexA.html'" STYLE="cursor:pointer"/>
               </td>
            </tr>
         </tbody>
      </table>
   </body>
</html>

There are 3 best solutions below

2
On BEST ANSWER

As already pointed out in some comments, it is hard to select specific elements because the tags have no distinguishing attributes. However, if your table always keeps the same structure, perhaps with some values left blank, you can use Jsoup's CSS selectors to pick elements by index.

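Document doc = Jsoup.parse(html); // html: the page source as a String

// td:eq(n) matches <td> cells whose sibling index within their row is n
Element pending = doc.select("table td:eq(0)").first();   // first cell of the first row
Element nextDate = doc.select("table td:eq(0)").get(9);   // first cell of the tenth row
Element date = doc.select("table td:eq(1)").last();       // second cell of the last two-cell row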

System.out.println(pending.text() + "\n" + nextDate.text() + "\n" + date.text());

which will output

Pending
Next Listing Date (Likely):
10/01/2014

Note the use of the pseudo-selector td:eq(0), which matches by sibling index: the cell's zero-based position among its siblings in the row.

If each element had its own distinct attributes, you could select it with an attribute selector such as [attr=value], which here would look like [VALIGN=top]. That would not work in your case, because many cells share the same attribute values.

I strongly suggest reading more about how to use the selector syntax to parse an HTML document; the Jsoup documentation on selector syntax covers it.

0
On

There are no hooks in your HTML from which you can start parsing. I suggest you add an "id" attribute to the table tag, like this:

<table id="data-table" ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">

and then use Jsoup to parse the content like this:

String html = "The entire html page read as a Java String";
Document doc = Jsoup.parse(html);
Element tableElement = doc.select("#data-table").first(); // select() returns Elements; first() gives the single match

Then traverse tableElement using Jsoup's Element API, for example by calling select("tr") on it to walk the rows.

1
On

You can also use a regular expression in Java to do this.

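import com.jaunt.UserAgent;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

UserAgent userAgent = new UserAgent();          // create a new UserAgent (headless browser)
userAgent.visit(your_site_link);                // visit the URL (your_site_link is a placeholder)
String siteText = userAgent.doc.innerHTML();    // page markup as a String

// Match the text directly before each </td>. [^<>]* stays within one cell;
// a greedy .* would span the whole table when it sits on a single line.
String REGEX = "(?<=>)[^<>]*(?=</td>)";
Pattern pattern = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(siteText);
while (matcher.find()) {
    String cell = matcher.group().trim();
    if (!cell.isEmpty()) {                      // skip empty cells
        System.out.println("TD data: " + cell);
    }
}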
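
If you go the regular-expression route, it may be more useful to pair each label with its value than to print every cell. The following is only a minimal sketch using java.util.regex from the standard library, assuming the table keeps the label cell / value cell layout shown in the question. The class name CaseStatusRegex and the methods parseRows and firstCell are made up for illustration, and the HTML string is a shortened copy of the question's table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseStatusRegex {

    // One two-cell row: <tr><td ...>label</td><td ...>value</td></tr>.
    // [^<>]* keeps each capture inside a single cell.
    private static final Pattern ROW = Pattern.compile(
        "<tr[^>]*>\\s*<td[^>]*>([^<>]*)</td>\\s*<td[^>]*>([^<>]*)</td>\\s*</tr>",
        Pattern.CASE_INSENSITIVE);

    // Any single cell: <td ...>text</td>.
    private static final Pattern CELL = Pattern.compile(
        "<td[^>]*>([^<>]*)</td>", Pattern.CASE_INSENSITIVE);

    // Returns label -> value for every two-cell row, in document order.
    public static Map<String, String> parseRows(String html) {
        Map<String, String> rows = new LinkedHashMap<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.put(m.group(1).trim(), m.group(2).trim());
        }
        return rows;
    }

    // Returns the text of the first <td> in the markup, or null if there is none.
    public static String firstCell(String html) {
        Matcher m = CELL.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Shortened copy of the table from the question (assumed layout).
        String html =
            "<tr><td VALIGN='top' COLSPAN='2' ALIGN='right' STYLE='font-size: 18pt'>Pending</td></tr>"
          + "<tr><td WIDTH='35%' STYLE='font-size: 14pt'>Last Listed on:</td>"
          + "<td STYLE='font-size: 14pt'>03/01/2014 in Court No. 48</td></tr>"
          + "<tr><td WIDTH='35%' STYLE='font-size: 14pt'>Next Listing Date (Likely):</td>"
          + "<td STYLE='font-size: 14pt'>10/01/2014</td></tr>";

        Map<String, String> rows = parseRows(html);
        System.out.println(firstCell(html));                      // status cell
        System.out.println(rows.get("Next Listing Date (Likely):"));
    }
}
```

Looking values up by label means the code keeps working if rows are added or reordered, which an index-based lookup would not. The single-cell status row ("Pending") is not a label/value pair, so the sketch reads it separately with firstCell. Regular expressions remain fragile against nested tags or markup changes, so an HTML parser is usually the safer choice.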