Sorting a structured text file


I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.

So my problem is: using Linux command-line tools (e.g. Perl is acceptable, but JavaScript is not), how can I sort a source file like this:

<div id="references">
    <h2>References</h2>

    <ul>
        <li id="reference-to-book-1">
            <span class="ref-author">Sample, Peter</span>
            <cite><a href="http://example.org/">Online Book 1</a></cite>
            <span class="ref-year">2011</span>
        </li>
        <li id="reference-to-book-2">
            <cite>Physical Book 2</cite>
            <span class="ref-year">2012</span>
            <span class="ref-author">Example, Sandy</span>
        </li>
    </ul>
</div><!-- references -->

to look like this:

<div id="references">
    <h2>References</h2>

    <ul>
        <li id="reference-to-book-2">
            <span class="ref-author">Example, Sandy</span>
            <cite>Physical Book 2</cite>
            <span class="ref-year">2012</span>
        </li>
        <li id="reference-to-book-1">
            <span class="ref-author">Sample, Peter</span>
            <cite><a href="http://example.org/">Online Book 1</a></cite>
            <span class="ref-year">2011</span>
        </li>
    </ul>
</div><!-- references -->

The criteria being:

  1. The <li> elements containing the entries are sorted alphabetically according to author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
  2. Within each entry, the elements are in the following order:
    1. line matches class="ref-author"
    2. line matches <cite>
    3. line matches class="ref-year"
    4. There are more elements (e.g. class="publisher") that I omitted from the example for clarity. I also run into this kind of sorting problem very often, so it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
  3. The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
  4. The result file should have each line unchanged except for its position in the file (this point was added because the XML parsers I tried broke my indentation).

I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.
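
For reference, the four criteria above can be sketched as a purely line-based filter in core Perl, so that no line is rewritten and only its position changes. This is a minimal sketch, not a definitive solution: the names `sort_references`, `arrange`, `rank` and `author_key` are made up for illustration, and it assumes that every element sits on its own line, that the `<li id="...">` entries follow each other without other lines in between, and that authors should be compared case-insensitively. Lines matching none of the patterns in `@order` stay at the end of their entry in their original order.

```perl
use strict;
use warnings;

# Patterns for the lines inside an entry, in the desired order.
# Extend freely, e.g. with qr/class="publisher"/.
my @order = (qr/class="ref-author"/, qr/<cite>/, qr/class="ref-year"/);

# Position of a line in @order; unmatched lines rank last.
sub rank {
    my ($line) = @_;
    for my $i (0 .. $#order) {
        return $i if $line =~ $order[$i];
    }
    return scalar @order;
}

# Sort key of an entry: the author text, lower-cased (an assumption).
sub author_key {
    my ($entry) = @_;
    for my $line (@$entry) {
        return lc $1 if $line =~ /class="ref-author">([^<]*)</;
    }
    return '';
}

# Reorder the lines between the opening <li ...> and the closing </li>.
sub arrange {
    my ($entry) = @_;
    return $entry if @$entry < 3;
    my @lines = @$entry;
    my $open  = shift @lines;
    my $close = pop @lines;
    my @ranked = map { [ rank($lines[$_]), $_ ] } 0 .. $#lines;
    my @sorted = map { $lines[ $_->[1] ] }
                 sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @ranked;
    return [ $open, @sorted, $close ];
}

# Takes the lines of the file and returns them in their new order.
sub sort_references {
    my (@out, @entries, $current);
    my $in_refs = 0;

    # Emit the collected entries, sorted by author (ties keep their order).
    my $flush = sub {
        my @keyed = map { [ author_key($entries[$_]), $_ ] } 0 .. $#entries;
        push @out, map { @{ $entries[ $_->[1] ] } }
                   sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @keyed;
        @entries = ();
    };

    for my $line (@_) {
        $in_refs = 1 if $line =~ /id="references"/;
        $current = [] if !$current && $in_refs && $line =~ /<li id="/;
        if ($current) {
            push @$current, $line;
            if ($line =~ m{</li>}) {
                push @entries, arrange($current);
                undef $current;
            }
        }
        else {
            $flush->();
            push @out, $line;
        }
        $in_refs = 0 if $line =~ /-- references --/;
    }
    $flush->();
    push @out, @$current if $current;    # unterminated entry: pass through
    return @out;
}

# Act as a filter only when a file name is given on the command line.
if (@ARGV) {
    my @sorted = sort_references(<>);
    print @sorted;
}
```

Saved under a hypothetical name such as sort-refs.pl, it would be run as `perl sort-refs.pl input.html > sorted.html`. Sorting on an explicit index as the second key keeps both sorts stable without relying on the behaviour of Perl's built-in sort.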

2 Answers

I'd use Mojo::DOM (part of Mojolicious) for this. You might need to tidy up the markup afterwards, because the document is re-serialized on output.

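use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::File 'path';    # Mojo::Util::slurp is gone from current Mojolicious releases

my $file = shift or die "I need a file\n";
my $dom  = Mojo::DOM->new(path($file)->slurp);

# Work only on the entries of the reference list, not on every <li> in the file
my $list = $dom->at('#references ul') or die "No #references list found\n";
my $refs = $list->children('li');

# Detach the entries, then sort them by author
$refs->each(sub { $_->remove });
$refs = $refs->sort( sub { $a->at('.ref-author')->text cmp $b->at('.ref-author')->text } );

# Selectors for the child elements, IN THE ORDER YOU WANT THEM
my @order = ('.ref-author', 'cite', '.ref-year');

for my $ref ( @{ $refs } ) {

    # Fresh <li> that keeps the id of the original entry
    my $new = Mojo::DOM->new('<li></li>')->at('li');
    $new->attr(id => $ref->attr('id'));

    for my $selector (@order) {
        my $el = $ref->at($selector) or next;
        $new->append_content($el);
    }

    $list->append_content($new);

}

say $dom;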

I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out.

Here's an example of how it might work. It sorts the entries by author but leaves the order of the elements within each entry as it is:

use strict;
use warnings;

use XML::LibXML;

my $dom = XML::LibXML->load_html(IO  => \*DATA);

my ($refs) = $dom->findnodes('/html/body//div[@id="references"]/ul');

my @refs = $refs->findnodes('li');

$refs->removeChild($_) for @refs;

$refs->appendChild($_) for sort {
  my ($aa, $bb) = map { $_->findvalue('span[@class="ref-author"]') } $a, $b;
  $aa cmp $bb;
} @refs;

print $dom, "\n";


__DATA__
<html>
  <head>
  <title>Title</title>
  </head>
  <body>
    <div id="references">
        <h2>References</h2>

        <ul>
            <li id="reference-to-book-1">
                <span class="ref-author">Sample, Peter</span>
                <cite><a href="http://example.org/">Online Book 1</a></cite>
                <span class="ref-year">2011</span>
            </li>
            <li id="reference-to-book-2">
                <cite>Physical Book 2</cite>
                <span class="ref-year">2012</span>
                <span class="ref-author">Example, Sandy</span>
            </li>
        </ul>
    </div><!-- references -->
  </body>
</html>

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Title</title></head><body>
    <div id="references">
        <h2>References</h2>

        <ul>

        <li id="reference-to-book-2">
                <cite>Physical Book 2</cite>
                <span class="ref-year">2012</span>
                <span class="ref-author">Example, Sandy</span>
            </li><li id="reference-to-book-1">
                <span class="ref-author">Sample, Peter</span>
                <cite><a href="http://example.org/">Online Book 1</a></cite>
                <span class="ref-year">2011</span>
            </li></ul></div><!-- references -->
  </body></html>