Connecting to Swedish Wikipedia to extract information


I need to connect to Wikipedia and extract the text in the area above the table of contents and, if possible, the information shown in the picture below. There are some options: JWPL, Bliki, and JSoup.

I tried Bliki, but I couldn't get the information shown in the picture and couldn't switch it to Swedish. JSoup seems fairly easy, but since it isn't built specifically for Wikipedia, there is no easy way to get the part of the page I'm after.

Using JSoup I can get the HTML document very easily, but I can't figure out how to get only the part I want as plain text.

Document doc = Jsoup.connect("http://sv.wikipedia.org/wiki/Stockholm").get();
Element contentDiv = doc.select("div[id=content]").first();
System.out.println(contentDiv.toString());

This Bliki code returns the document formatted as plain text, which is great, but it doesn't include the information from the picture below. And, MOST importantly, it isn't in Swedish, because I don't know how to change that.

String[] listOfTitleStrings = { "Stockholm" };
User user = new User("", "", "http://en.wikipedia.org/w/api.php");
user.login();
List<Page> listOfPages = user.queryContent(listOfTitleStrings);
PlainTextConverter p = new PlainTextConverter();
for (Page page : listOfPages) {
    WikiModel wikiModel = new WikiModel("${image}", "${title}");
    String text = wikiModel.render(p, page.toString());
    System.out.println(text);
}

Will be running on Android.

Edit: Maybe I wasn't clear enough that this must work on all Wikipedia pages.

[Picture: Information I want]

Best answer

I doubt you'll find a ready-made solution that you can just copy and paste. JSoup is an HTML parser, so you'll have to look up the elements and write matching selectors to get their content.

If you're using Chrome, right-click the element (text) and select Inspect element. Once the HTML source opens, right-click the corresponding element and select Copy CSS Path.

For Country (Land), you'll get something like this:

#mw-content-text > table.infobox.geography > tbody > tr:nth-child(5) > td > span > a

Of course, this can be shortened, but that doesn't improve performance much, and it will be a pain if you don't know CSS well enough.

Luckily, JSoup supports CSS selectors, so once you have the selector for the element, you can do this:

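// Selector copied from Chrome for the Country (Land) row
String countrySelector = "#mw-content-text > table.infobox.geography > tbody > tr:nth-child(5) > td > span > a";

// Load the page and select the first matching element
Document doc = Jsoup.connect("http://sv.wikipedia.org/wiki/Stockholm").get();
Element countryEl = doc.select(countrySelector).first();
// Prints the element's HTML; use countryEl.text() to get only the plain text
System.out.println(countryEl.toString());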

(I'm assuming that the code you've provided works correctly.)
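
Since the URL above is hard-coded, here is a rough sketch of how it could be built for any language edition and page title, which matters if this has to work on all Wikipedia pages. WikiUrls and articleUrl are made-up names (not part of JSoup), and I'm assuming Java 10 or newer for the URLEncoder overload that takes a Charset.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of JSoup or Bliki.
final class WikiUrls {
    private WikiUrls() {}

    // Builds the article URL for a language edition, e.g. "sv" for Swedish.
    static String articleUrl(String lang, String title) {
        // Wikipedia page titles use underscores in place of spaces.
        String withUnderscores = title.trim().replace(' ', '_');
        // Percent-encode everything else (for example the "ö" in Göteborg).
        String path = URLEncoder.encode(withUnderscores, StandardCharsets.UTF_8);
        return "https://" + lang + ".wikipedia.org/wiki/" + path;
    }

    public static void main(String[] args) {
        System.out.println(articleUrl("sv", "Stockholm"));
        System.out.println(articleUrl("en", "New York City"));
    }
}
```

The returned string can be passed to Jsoup.connect(...) in place of the literal URL.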

If you want to test the selector more quickly, you can do it directly in Chrome: once you have the selector copied, switch to the Console tab, type $("selector"), and hit Enter. For example:

$("#mw-content-text > table.infobox.geography > tbody > tr:nth-child(5) > td > span > a")

If you need the text content of the element, you can use $("selector").text().

(You might have noticed that this is just simple jQuery.)

But beware: this can easily break if Wikipedia decides to update its DOM layout.


Edit: (adding this after additional explanation in comments)

For the selectors to work on multiple pages, you might want to make them more general.

The first thing to select is the infobox on the right. It's best to use table.infobox, but this might still select more than one element. The information you're after is usually in the first infobox, so it's easy to select with .first(). If that doesn't work and you don't find the element you're after, you could create a fallback that looks for the info in all of the infobox elements.

I'm still not sure what exactly you're after, so here's the code you get when putting the above together:

// Set infobox selector (content on the right side of Wiki page)
String tableSelector = "table.infobox";
// Load document
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Gothenburg").get();
// Select infobox element
Element infoboxEl = doc.select(tableSelector).first();
// Select all table rows inside infobox
Elements tableRows = infoboxEl.select("tr");
for (Element row: tableRows) {
    // Output the title of each row
    System.out.print(row.select("th").text() + ": ");
    // Output content for that title
    System.out.println(row.select("td").text());
}

Here is example output:

Gothenburg, Sweden Göteborg: 
_: From left to right: View over Gothenburg and the Göta älv, Götaplatsen, Svenska Mässan, Gothenburg heritage tram, Elfsborg Fortress, Ullevi.
_: Nickname(s): Little London Little Amsterdam,
_: Gothenburg, Sweden
_: Coordinates: 57°42′N 11°58′E / 57.700°N 11.967°E / 57.700; 11.967Coordinates: 57°42′N 11°58′E / 57.700°N 11.967°E / 57.700; 11.967
Country: Sweden
Province: Västergötland and Bohuslän
County: Västra Götaland County
Municipality: Gothenburg Municipality, Härryda Municipality, Partille Municipality and Mölndal Municipality
Charter: 1621
Area[1]: 
 • City: 447.76 km2 (172.88 sq mi)
 • Water: 14.5 km2 (5.6 sq mi)  3.2%
 • Urban: 203.67 km2 (78.64 sq mi)
 • Metro: 3,694.86 km2 (1,426.59 sq mi)
Elevation: 12 m (39 ft)
Population (2013 (urban: 2010))[1][2]: 
 • City: 533,260
 • Density: 1,200/km2 (3,100/sq mi)
 • Urban: 549,839
 • Urban density: 2,700/km2 (7,000/sq mi)
 • Metro: 956,118
 • Metro density: 260/km2 (670/sq mi)
Demonym: Gothenburger (Göteborgare)
Time zone: CET (UTC+1)
 • Summer (DST): CEST (UTC+2)
Postal code: 40xxx - 41xxx - 421xx - 427xx
Area code(s): (+46) 31
Website: www.goteborg.se

This outputs everything you can see in the infobox on the wiki page, which might not be exactly what you want, since titles are missing in some cases (marked as _:). But I think you get the idea of how this works, and you can use it to filter out what you're looking for.
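
As a rough sketch of that filtering, assuming the th and td texts from the loop were first collected into a list instead of printed: the InfoboxFilter and Row names are made up for this example, and dropping rows with an empty title or empty content is just one possible rule (it also removes header-only rows such as "Area[1]").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter for illustration; plain Java, no JSoup needed.
final class InfoboxFilter {
    // Holds the text of one infobox row: the th text and the td text.
    static final class Row {
        final String title;
        final String content;

        Row(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    // Keeps only the rows that have both a title and some content.
    static List<Row> labeledRows(List<Row> rows) {
        List<Row> result = new ArrayList<>();
        for (Row row : rows) {
            if (!row.title.trim().isEmpty() && !row.content.trim().isEmpty()) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("", "Gothenburg, Sweden"),   // missing title
                new Row("Country", "Sweden"),
                new Row("Area[1]", ""),              // header-only row
                new Row("Charter", "1621"));
        for (Row row : labeledRows(rows)) {
            System.out.println(row.title + ": " + row.content);
        }
    }
}
```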

I would recommend using a class to store this data and display it in your application later. That way you can easily apply logic that checks whether you got all the correct data and, if not, create a fallback to fix it.
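
A minimal sketch of such a class, assuming the rows are stored by title. InfoboxData, put, get and missing are hypothetical names, and which titles count as required is up to your application.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for scraped infobox rows; plain Java, no JSoup needed.
final class InfoboxData {
    // LinkedHashMap keeps the rows in the order they appeared on the page.
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Stores one row, trimming surrounding whitespace from both parts.
    void put(String title, String content) {
        fields.put(title.trim(), content.trim());
    }

    // Returns the stored content, or null if the title was never stored.
    String get(String title) {
        return fields.get(title);
    }

    // Lists the required titles that are absent or empty, so the caller
    // can decide whether to run a fallback.
    List<String> missing(String... required) {
        List<String> result = new ArrayList<>();
        for (String title : required) {
            String value = fields.get(title);
            if (value == null || value.isEmpty()) {
                result.add(title);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InfoboxData data = new InfoboxData();
        data.put("Country", "Sweden");
        data.put("Charter", "1621");
        System.out.println(data.get("Country"));
        System.out.println(data.missing("Country", "Website"));
    }
}
```

If missing(...) returns a non-empty list, that is the point where a fallback (for example, trying the other infobox elements) could kick in.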