I'm trying to scrape a site using Jaunt ( https://ravit.is.fi/hevoset/1 ) and I'm having trouble finding the correct table element to parse this table (marked in red: https://i.stack.imgur.com/Fnaep.png ).
From the HTML, I assumed the correct element would be <table border="0" cellpadding="3" cellspacing="1">, but the table marked in green uses the same element, so how do I "choose" the correct table? I've tried many things to no avail, but as I am pretty new to Java, HTML and coding in general, I'm most likely missing something obvious.
Also, I tried putting the data from the other table into the .xls sheet, but everything went into the same cell. What do I need to do so that it looks like this: https://i.stack.imgur.com/sMkxs.png ?
Thank you in advance.
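import com.jaunt.*;
import com.jaunt.component.Table;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import java.io.FileOutputStream;
public class JauntTesti{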
public static void main(String[] args){
int sivu = 1;
while (true) {
try{
UserAgent userAgent = new UserAgent();
if (sivu <= 1) {
userAgent.visit("https://ravit.is.fi/hevoset/" + sivu);
String title = userAgent.doc.findFirst("<title>").getChildText(); //fetches the text of the first <title> found into the string title
System.out.println("\n" + sivu);
Element body = userAgent.doc.findFirst("<body>");
Elements strong = body.findEach("<strong>"); //findEach returns Elements, not a single Element
Elements strong2 = userAgent.doc.findEach("<td>");
Element strong3 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(1);
Element strong4 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(1).getElement(1);
Element strong5 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(2).getElement(1);
Element strong6 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(3).getElement(1);
Element strong7 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(4).getElement(1);
Element strong8 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(5).getElement(1);
Element test1 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0);
Element test2 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(1).getElement(0);
Element test3 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(2).getElement(0);
Element test4 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(3).getElement(0);
Element test5 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(4).getElement(0);
Element test6 = strong2.getElement(0).getElement(0).getElement(1).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(0).getElement(5).getElement(0);
String nimi = strong3.innerText();
String laji = strong4.innerText();
String sukupuoli = strong5.innerText();
String ika = strong6.innerText();
String valmentaja = strong7.innerText();
String omistaja = strong8.innerText();
//the IKÄ, VALMENTAJA and OMISTAJA rows can shift position, so check the label before reading the value
if (test4.innerHTML().equals("<strong>IKÄ:</strong> ")){
ika = strong6.innerText();
} else {
ika = " ";
}
if (test4.innerHTML().equals("<strong>VALMENTAJA:</strong> ")){
valmentaja = strong6.innerText();
} else if (test5.innerHTML().equals("<strong>VALMENTAJA:</strong> ")){
valmentaja = strong7.innerText();
} else {
valmentaja = "-1";
}
if (test4.innerHTML().equals("<strong>OMISTAJA:</strong> ")){
omistaja = strong6.innerText();
} else if (test5.innerHTML().equals("<strong>OMISTAJA:</strong> ")){
omistaja = strong7.innerText();
} else if (test6.innerHTML().equals("<strong>OMISTAJA:</strong> ")){
omistaja = strong8.innerText();
} else {
omistaja = "-1";
}
Table taulukko2 = userAgent.doc.getTable("<table border=\"0\" cellpadding=\"3\" cellspacing=\"1\">");
Elements taul1 = taulukko2.getCol(0);
for(Element element : taul1) System.out.println(element.innerText()); //print each cell of the column, not the whole collection
ika = ika.replace(" v","");
//int ikav = Integer.parseInt(ika);
System.out.println("Nimi: " + nimi);
System.out.println("Laji: " + laji);
System.out.println("Sukupuoli: " + sukupuoli);
System.out.println("Ikä: " + ika);
System.out.println("Valmentaja: " + valmentaja);
System.out.println("Omistaja: " + omistaja);
try {
String filename = "C:/sheets/" + sivu + ".xls";
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("FirstSheet");
sheet.setColumnWidth(0, 5000);
sheet.setColumnWidth(1, 5000);
sheet.setColumnWidth(2, 3000);
sheet.setColumnWidth(3, 2000);
sheet.setColumnWidth(4, 4000);
sheet.setColumnWidth(5, 8000);
HSSFRow rowhead = sheet.createRow((short)0);
rowhead.createCell(0).setCellValue("NIMI");
rowhead.createCell(1).setCellValue("LAJI");
rowhead.createCell(2).setCellValue("SUKUPUOLI");
rowhead.createCell(3).setCellValue("IKÄ");
rowhead.createCell(4).setCellValue("VALMENTAJA");
rowhead.createCell(5).setCellValue("OMISTAJA");
//rowhead.createCell(6).setCellValue(taul1.innerText());
HSSFRow row = sheet.createRow((short)1);
row.createCell(0).setCellValue(nimi);
row.createCell(1).setCellValue(laji);
row.createCell(2).setCellValue(sukupuoli);
row.createCell(3).setCellValue(ika);
row.createCell(4).setCellValue(valmentaja);
row.createCell(5).setCellValue(omistaja);
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();
workbook.close();
} catch ( Exception ex ) {
System.out.println(ex);
}
sivu++;
} else {
break;
}
}
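catch(JauntException e){
System.err.println(e);
break; //stop here; sivu is not incremented on failure, so the loop would retry the same page forever
}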
}
}
}
With the univocity-html-parser, you can get all details from all tables. Not sure how you need to organize your data, but this should give you some guidance:
Which uses the following methods:
The output of this code is:
Hope this can be useful to you.
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.
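Independently of any particular library, the two steps asked about in the question can be sketched in plain Java: picking one table out of several that share the same attributes, and laying its cells out so each value gets its own spreadsheet cell. This is only a minimal sketch under assumptions: the class `TableSketch`, its two methods, and the sample header labels are hypothetical, and the tables are assumed to have already been read into lists of cell texts (with Jaunt that would presumably mean looping over the matching table elements and reading each cell's text). The idea is to tell the tables apart by what their header row contains rather than by their attributes.

```java
import java.util.Arrays;
import java.util.List;

class TableSketch {

    // Returns the first table whose header row (row 0) contains the given label,
    // or null if no table matches. A table is a list of rows; a row is a list of cell texts.
    static List<List<String>> findTableWithHeader(List<List<List<String>>> tables, String label) {
        for (List<List<String>> table : tables) {
            if (!table.isEmpty() && table.get(0).contains(label)) {
                return table;
            }
        }
        return null;
    }

    // Copies the table into a rectangular grid so that grid[r][c] is the text for
    // spreadsheet row r, column c. Rows shorter than the widest row are padded with "".
    static String[][] toGrid(List<List<String>> table) {
        int cols = 0;
        for (List<String> row : table) {
            cols = Math.max(cols, row.size());
        }
        String[][] grid = new String[table.size()][cols];
        for (int r = 0; r < table.size(); r++) {
            List<String> row = table.get(r);
            for (int c = 0; c < cols; c++) {
                grid[r][c] = c < row.size() ? row.get(c) : "";
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for two tables with identical attributes
        List<List<String>> first = Arrays.asList(
                Arrays.asList("NAME", "TYPE"),
                Arrays.asList("Example Horse", "Example Type"));
        List<List<String>> second = Arrays.asList(
                Arrays.asList("YEAR", "STARTS"),
                Arrays.asList("2016", "12"),
                Arrays.asList("2017", "9"));

        List<List<String>> wanted = findTableWithHeader(Arrays.asList(first, second), "STARTS");
        String[][] grid = toGrid(wanted);

        // One spreadsheet cell per grid entry. With Apache POI the inner statement would be
        // sheet.createRow(r) once per row, then row.createCell(c).setCellValue(grid[r][c]).
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                System.out.println("row " + r + ", cell " + c + ": " + grid[r][c]);
            }
        }
    }
}
```

Everything ending up in one cell usually means the whole column's text was written with a single setCellValue call; creating one row per table row and one cell per table cell, as in the nested loop above, avoids that.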