I am trying to use the DEiXTo tool to extract a sequence of moves from a chess website. The extraction tool works fine and pulls out the data I need from a single page. However, I cannot get the tool to extract a sequence of moves using the multipage facility, which has two ways to extract links to follow.
One is to enter the text of the node which contains the link to be followed. If I enter the text for one move, say '1. e4', then DEiXTo will follow the link. But that isn't satisfactory for following the sequence of moves, of course, as each move has different text.
The second way is to give a name, or 'title' in DEiXTo parlance, to the link, which I take to mean the node in the DOM tree that corresponds to an <a> tag in the page HTML, and to enter that title in the multipage search box. That is the bit I can't get to work: the extraction process doesn't get beyond the first page.
I must be missing something. Part of the problem may be the English version of the manual, which is a translation from the original Greek and occasionally is a little quirky. Can anyone help please?
Here is a screenshot of the DEiXTo GUI to illustrate the situation:
The web page is in the top left pane, the DOM tree is in top right - only a small section of each is shown:
The section of the DOM tree that forms the extraction pattern is in the small window at right, with a larger section shown in the pane at bottom left. This shows some of the titles, in particular the title 'nextmove' for the <a> tag. This title is entered in the link-to-follow field below the checked Multi Page Crawling box in the bottom right pane.
Running the search by clicking either the [Go!] or [!] button produces:
The result is shown in the bottom right pane, which correctly lists all the moves in the web page but goes no further. As can be seen, the title 'nextmove' that was entered in the link field extracts the links that should be followed in the next stage of the extraction process. They are only partially visible here, but I have had a closer look at them and they are correct. They are just not followed. However, if I insert the text '1. e4' in the link field (the first item in the 'moveText' field), I get an additional 13 records as a result of following the corresponding link under 'nextmove'.
First of all, the UI text "Text or title of the HTML link to follow" does not refer to the name YOU have given to the node in the pattern. It refers to the inner text of the hyperlink that leads to the next page (the text between the opening and closing 'a' HTML tags). Since this text is not fixed but depends on the game, you cannot use this feature to go to the next page.
However, IT IS possible to accomplish what you want, but it requires programming.
1. Make the A node in the pattern optional, because the next page does not have a link there; otherwise the pattern will fail on the content of the next page.
2. Run GUI DEiXTo to get the first page's moves in a TXT file.
3. Open the file programmatically and get the URL from the last line (this is the URL of the next page).
4. Open the wpf file (it's an XML file), locate the following XML element, which appears very early in the file: <TargetUrls> <URL Address="http://www.chessgames.com/perl/explorer"/> </TargetUrls> and replace the value of the Address attribute with the URL you got at step #3.
5. Save the new wpf file.
6. Run DEiXTo with the new wpf file (you can run GUI DEiXTo as a command line app by passing the wpf file as a parameter).
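The wpf-editing steps above (replace the Address attribute, then save the file) could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name set_target_url and the file paths are made up for the example, the wpf file is assumed to be plain XML without namespaces, and writing it back as UTF-8 is assumed to be acceptable to DEiXTo. I have not verified it against a real wpf file.

```python
import xml.etree.ElementTree as ET


def set_target_url(wpf_in: str, wpf_out: str, new_url: str) -> None:
    """Copy a wpf file, pointing its first <TargetUrls>/<URL> at new_url.

    Hypothetical helper: assumes the wpf file is namespace-free XML that
    contains <TargetUrls><URL Address="..."/></TargetUrls> below the root.
    """
    tree = ET.parse(wpf_in)
    url_elem = tree.getroot().find(".//TargetUrls/URL")
    if url_elem is None:
        raise ValueError("no <TargetUrls>/<URL> element found in " + wpf_in)
    # Replace the start URL with the URL of the next page.
    url_elem.set("Address", new_url)
    # Assumption: DEiXTo accepts a UTF-8 file with an XML declaration.
    tree.write(wpf_out, encoding="utf-8", xml_declaration=True)
```

Writing to a separate output file rather than overwriting the original keeps the first-page wpf intact, so the run can be restarted from the beginning if something goes wrong.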
Another, easier approach is the following (I haven't tried it, though):
DEiXTo CLE is a command line executor of wpf files. It supports many command line parameters and is used in complex situations where a single wpf does not do the job. Step 3 is the one that requires programming. You must open the results file and read the URL at the beginning of the last line, if one exists. If no such URL exists, then there is no 'next page'.
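As a rough sketch of step 3, something like the following Python function could read the results file and return the next-page URL, or None when there is no next page. The function name next_page_url is made up for the example, and it assumes that the results file is UTF-8 text, that fields on a line are tab-separated, and that the URL starts with http:// or https://. Adjust these to match what your output file actually looks like.

```python
from pathlib import Path
from typing import Optional


def next_page_url(results_path: str) -> Optional[str]:
    """Return the URL at the start of the last non-empty line, if any.

    Hypothetical helper: assumes tab-separated fields and a UTF-8 file.
    """
    text = Path(results_path).read_text(encoding="utf-8", errors="replace")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    # The URL, when present, is expected to be the first field of the last line.
    first_field = lines[-1].split("\t", 1)[0].strip()
    if first_field.startswith(("http://", "https://")):
        return first_field
    return None  # no URL on the last line means there is no next page
```

A driver script would call this after each run and stop as soon as it returns None.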