How to retrieve all the user comments from a site?


I want to collect all the user comments from this site: http://www.consumercomplaints.in/?search=chevrolet

The problem is that the comments are only displayed partially; to see the complete comment I have to click on the title above it, and this has to be repeated for every comment.

The other problem is that there are many pages of comments.

So I want to store all the complete comments from the site above in an Excel sheet. Is this possible? I am thinking of using crawler4j and Jericho along with Eclipse.

My code for the visitPage method:

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();

            String html = htmlParseData.getHtml();

            // Set<WebURL> links = htmlParseData.getOutgoingUrls();
            // String text = htmlParseData.getText();

            try {
                // Note: this should point to a file, not a directory;
                // createNewFile() will fail on a path ending in "/".
                String crawlerOutputPath = "/DA Project/HTML Source/";
                File outputfile = new File(crawlerOutputPath);

                // If the file doesn't exist, create it
                if (!outputfile.exists()) {
                    outputfile.createNewFile();
                }

                FileWriter fw = new FileWriter(outputfile, true); // true = append
                BufferedWriter bufferWritter = new BufferedWriter(fw);
                // Write the html once; closing the BufferedWriter also
                // flushes and closes the underlying FileWriter, so the
                // second fw.write()/fw.close() pair was removed, since
                // writing to a closed stream throws an IOException.
                bufferWritter.write(html);
                bufferWritter.close();
            } catch (IOException e) {
                System.out.println("IOException : " + e.getMessage());
                e.printStackTrace();
            }

            System.out.println("Html length: " + html.length());
        }
    }

Thanks in advance. Any help would be appreciated.

Best answer:

Yes, it is possible.

  • Start crawling on your search page (http://www.consumercomplaints.in/?search=chevrolet).
  • Use the visitPage method of crawler4j to follow only the comment links and the pagination pages.
  • Take the HTML content from crawler4j and feed it to Jericho.
  • Filter out the content you want to store and write it to some kind of .csv or .xls file (I would prefer .csv).
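The last step, writing the filtered comments to a .csv file, can be sketched in plain Java. This is only an illustration of the export step: the class name `CsvExport`, the `csvEscape` helper, and the sample row are made up here, and in the real crawler each row would come from the text Jericho extracts out of the HTML that crawler4j hands to `visit()`. The escaping rule is standard CSV: wrap fields containing commas, quotes, or newlines in double quotes, and double any embedded quotes (otherwise a comment containing a comma would break the columns in Excel).

```java
import java.util.List;

public class CsvExport {

    // Quote a field if it contains a delimiter, a quote, or a newline,
    // doubling any embedded quotes, per the usual CSV convention.
    static String csvEscape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Build one CSV row per comment: title, full text.
    static String toCsv(List<String[]> rows) {
        StringBuilder out = new StringBuilder("title,comment\n");
        for (String[] row : rows) {
            out.append(csvEscape(row[0])).append(',')
               .append(csvEscape(row[1])).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical row standing in for a comment extracted by Jericho.
        List<String[]> rows = List.of(
            new String[] {"Engine trouble", "The car stalls, \"randomly\", at idle."}
        );
        System.out.print(toCsv(rows));
    }
}
```

The resulting file opens directly in Excel, which sidesteps generating a real .xls workbook.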

Hope this helps you