I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.
So my problem is: using Linux command-line tools (e.g. Perl is acceptable, but JavaScript is not), how can I sort a source file like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite><a href="http://example.org/">Online Book 1</a></cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
to look like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<span class="ref-author">Example, Sandy</span>
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
</li>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite><a href="http://example.org/">Online Book 1</a></cite>
<span class="ref-year">2011</span>
</li>
</ul>
</div><!-- references -->
The criteria being:
1. The <li> elements containing the entries are sorted alphabetically by author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
2. Within each entry, the elements are in the following order:
   - line matches class="ref-author"
   - line matches <cite>
   - line matches class="ref-year"
   There are more elements (e.g. class="publisher") that I omitted from the example for clarity. Also, I run across this sorting problem very often, so it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
3. The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
4. The result file should have each line unchanged except for its position in the file (this point added because the XML parsers I tried broke my indentation).
I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.
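The second criterion can be handled in the same line-oriented spirit, without an XML parser. The following is a minimal sketch, written in Python purely for illustration. It assumes that every element of an entry sits on its own line, that the <li id=" and </li> lines contain nothing else, and that no other lines sit between entries. The names FIELD_ORDER, SORT_KEY, field_rank, entry_key and sort_references are made up for this sketch.

```python
import re

# Patterns in the order their lines should appear inside an entry.
# Extend freely, e.g. with r'class="publisher"'.
FIELD_ORDER = [r'class="ref-author"', r"<cite>", r'class="ref-year"']
# Captures the text of the author element; whole entries are sorted by it.
SORT_KEY = r'class="ref-author"[^>]*>([^<]*)<'


def field_rank(line):
    """Position of the first matching FIELD_ORDER pattern; unmatched lines sort last."""
    for rank, pattern in enumerate(FIELD_ORDER):
        if re.search(pattern, line):
            return rank
    return len(FIELD_ORDER)


def entry_key(entry):
    """Lower-cased author text of an entry (empty string if none is found)."""
    for line in entry:
        match = re.search(SORT_KEY, line)
        if match:
            return match.group(1).strip().lower()
    return ""


def sort_references(lines):
    """Return the lines with the references block sorted; lines are moved, never edited."""
    out, entries, entry, in_refs = [], [], None, False
    for line in lines:
        if entry is not None:
            # Inside an <li> block: collect lines until the closing tag.
            if "</li>" in line:
                entries.append([entry[0], *sorted(entry[1:], key=field_rank), line])
                entry = None
            else:
                entry.append(line)
        elif in_refs and '<li id="' in line:
            entry = [line]
        else:
            # Any other line ends a run of entries: emit them sorted by author.
            for block in sorted(entries, key=entry_key):
                out.extend(block)
            entries = []
            out.append(line)
            if 'id="references"' in line:
                in_refs = True
            elif "-- references --" in line:
                in_refs = False
    # Flush whatever is left if the input ended inside the list.
    for block in sorted(entries, key=entry_key):
        out.extend(block)
    out.extend(entry or [])
    return out
```

Used as a filter, this could be driven by something like sys.stdout.writelines(sort_references(sys.stdin.readlines())). Python's sort is stable, so lines that match none of the patterns keep their relative order at the end of the entry, and because lines are only moved, the indentation stays as it was.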
I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out. Here's an example of how it might work.
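The parse, reorder and serialize idea can be sketched roughly as follows. This is a hedged illustration and not the Perl code of this answer. It swaps in Python's standard xml.etree.ElementTree for XML::LibXML, so it assumes the references fragment is well-formed XML, which an HTML parser would not require. CHILD_ORDER, child_rank, author_of and sort_references_dom are hypothetical names. Note that re-serializing drops the trailing comment and is not guaranteed to keep every line byte-for-byte.

```python
import xml.etree.ElementTree as ET

# Assumed order of child elements inside an entry, as (tag, class) pairs;
# a class of None means "any element with this tag".
CHILD_ORDER = [("span", "ref-author"), ("cite", None), ("span", "ref-year")]


def child_rank(element):
    """Position of the element in CHILD_ORDER; unknown elements sort last."""
    for rank, (tag, css_class) in enumerate(CHILD_ORDER):
        if element.tag == tag and (css_class is None or element.get("class") == css_class):
            return rank
    return len(CHILD_ORDER)


def author_of(entry):
    """Lower-cased text of the entry's author span (empty string if missing)."""
    span = entry.find("span[@class='ref-author']")
    return (span.text or "").strip().lower() if span is not None else ""


def sort_references_dom(fragment):
    """Parse the fragment, sort entries by author and children by CHILD_ORDER."""
    root = ET.fromstring(fragment)
    for ul in list(root.iter("ul")):
        entries = sorted(ul, key=author_of)
        for li in entries:
            li[:] = sorted(li, key=child_rank)  # reorder children in place
        ul[:] = entries                         # reorder the entries themselves
    return ET.tostring(root, encoding="unicode")
```

Whitespace between elements travels with the element before it, so with indented input the result may need re-indenting. That is the trade-off of a DOM approach against the requirement that lines stay unchanged.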