How to merge content from multiple HTML files into a single one?


I have more than 100 HTML files with the following structure:

<html>
<head>
<body>
    <TABLE>
      ...
    </TABLE>
    <TABLE>
        <TR>
            <td rowspan=2><img src="http://www.example.com" width=10></td>
            <TD width=609 valign=top>
                <!-- Content of file1 -->
                <p>abc</p>
                ...
                ...
                ...
                <p>xyz</p>
            </TD>
        </TR>
        <TR>
            <TD align="center" ...alt="top"></a></TD>
        </TR>
    </TABLE>        
</body>
</html>

I'd like to merge into a single HTML file the content of column 2 of the first row of the second table (TABLE[2]ROW[1]COLUMN[2]) of each file, to get output like this:

<html>
<head>
<body>
    <!-- Content of file1 -->
    <p>abc</p>
    ...
    ...
    ...
    <p>xyz</p>

    <!-- Content of file2 -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>

    ..
    ..
    ..
    <!-- Content of fileN -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>
</body>
</html>

I'm new to Perl, and I'd appreciate some pointers on how to do this. Thanks in advance.

Below is the beginning of my attempt for file1, but I'm not sure whether I'm on the right track.

use HTML::TableExtract;

open (my $html,"<","file1.html");

my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 2, br_translate => 0 );
$table->parse($html);

foreach my $row ($table->rows) {
    print join("\t", @$row), "\n";
}

There are 2 answers below.

Best answer

The HTML::TableExtract documentation states that depth, count, row, and col all start from 0.

The following is a skeleton of the code, assuming that all HTML files are stored in one directory.

glob gives us the names of the HTML files.

The subroutine extract_table_cell takes a filename plus the parameters depth, count, row, and col, and extracts the data located at that position.

For each filename we call extract_table_cell and store the returned data in the array @data.

The subroutine gen_html takes a reference to the @data array and returns HTML code representing that data.

Finally, we call say with the result of gen_html to print the output.

NOTE: you will need to adapt the subroutine extract_table_cell to get the cell data in the desired format.

use strict;
use warnings;
use feature 'say';

use HTML::TableExtract;

my($depth,$table,$row,$col) = (0,1,0,1);
my @data;

for (glob("*.html")) {
    push @data, extract_table_cell($_,$depth,$table,$row,$col);
}

say gen_html(\@data);

sub gen_html {
    my $data = shift;

    my($html,$block);

    for ( @{$data} ) {
        $block .= "\t\t$_\n";
    }

    $html =
"
<html>
    <head>
    </head>
    <body>
    $block
    </body>
</html>
";

    return $html;
}

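sub extract_table_cell {
    my($file,$depth,$count,$row,$col) = @_;

    # depth and count select the table; both are zero-based
    my $te = HTML::TableExtract->new( depth => $depth, count => $count );

    $te->parse_file($file);

    # skip files that contain no table at this depth/count
    my $table = $te->first_table_found or return;

    # cell() is the documented accessor for a single cell
    return $table->cell($row,$col);
}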

Output

<html>
    <head>
    </head>
    <body>
        B 1.2
        D 1.2

    </body>
</html>

Test data files:

table_1.html

<html>
    <head>
    </head>
    <body>
        <table>
            <tr><td>A 1.1</td><td>A 1.2</td><td>A 1.3</td></tr>
            <tr><td>A 2.1</td><td>A 2.2</td><td>A 2.3</td></tr>
            <tr><td>A 3.1</td><td>A 3.2</td><td>A 3.3</td></tr>
            <tr><td>A 4.1</td><td>A 4.2</td><td>A 4.3</td></tr>
        </table>

        <table>
            <tr><td>B 1.1</td><td>B 1.2</td><td>B 1.3</td></tr>
            <tr><td>B 2.1</td><td>B 2.2</td><td>B 2.3</td></tr>
            <tr><td>B 3.1</td><td>B 3.2</td><td>B 3.3</td></tr>
            <tr><td>B 4.1</td><td>B 4.2</td><td>B 4.3</td></tr>
        </table>
    </body>
</html>

table_2.html

<html>
    <head>
    </head>
    <body>
        <table>
            <tr><td>C 1.1</td><td>C 1.2</td><td>C 1.3</td></tr>
            <tr><td>C 2.1</td><td>C 2.2</td><td>C 2.3</td></tr>
            <tr><td>C 3.1</td><td>C 3.2</td><td>C 3.3</td></tr>
            <tr><td>C 4.1</td><td>C 4.2</td><td>C 4.3</td></tr>
        </table>

        <table>
            <tr><td>D 1.1</td><td>D 1.2</td><td>D 1.3</td></tr>
            <tr><td>D 2.1</td><td>D 2.2</td><td>D 2.3</td></tr>
            <tr><td>D 3.1</td><td>D 3.2</td><td>D 3.3</td></tr>
            <tr><td>D 4.1</td><td>D 4.2</td><td>D 4.3</td></tr>
        </table>
    </body>
</html>
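
The NOTE above says extract_table_cell has to be adapted to get the desired cell format. A minimal sketch of one such adaptation, assuming the goal is to keep the markup (the <p> elements) of each cell as in the question: it switches on the keep_html option, which the question's own attempt already mentions, and reads the cell through the cell() accessor. The name extract_cell_html and the way the output is printed are hypothetical choices for illustration, not part of the answer above.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return the raw HTML of one table cell.
# All four indices are zero-based.
sub extract_cell_html {
    my ($file, $depth, $count, $row, $col) = @_;

    # keep_html => 1 asks the module to keep the markup inside each cell
    # instead of reducing it to plain text.
    my $te = HTML::TableExtract->new(
        depth     => $depth,
        count     => $count,
        keep_html => 1,
    );
    $te->parse_file($file);

    # Skip files that have no table at this depth/count.
    my $table = $te->first_table_found or return;
    return $table->cell($row, $col);
}

# Second top-level table (depth 0, count 1), first row (0), second column (1).
my @cells = map { extract_cell_html($_, 0, 1, 0, 1) } sort glob('*.html');

print "<html>\n<head>\n</head>\n<body>\n";
print "$_\n" for grep { defined } @cells;
print "</body>\n</html>\n";
```

If the merged output is saved with an .html extension, it is probably safest to write it to a different directory, so that a later run does not pick it up through glob.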

Second answer

Polar Bear's answer is probably the best one. I just want to add a different way of getting TABLE[2]ROW[1]COLUMN[2] without using HTML::TableExtract. You said you are new to Perl, so I think this idea will be interesting to you. The idea is to use regular expressions. For example:

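$/ = "</html>";    # read one whole HTML document per record
my ($table2, $row1, $col2);
while (<STDIN>) {
    # second table: the <table> that directly follows the first </table>
    ($table2) = /<\/table>\s*<table>(.*?)<\/table>/is or next;
    # first row of that table
    ($row1) = $table2 =~ /<tr>(.*?)<\/tr>/is or next;
    # second column: the <td> that directly follows the first </td>
    ($col2) = $row1 =~ /<\/td>\s*<td>(.*?)<\/td>/is or next;
}
print $col2 if defined $col2;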

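For input shaped like the sample below (plain <table>, <tr>, and <td> tags without attributes, and no nested tables), this code gets TABLE[2]ROW[1]COLUMN[2].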

Sample input:

<html>
<table>

</table>
<table>
    <tr>
        <td>
          hello world
        </td>
        <td>
          corona 
        </td>
    </tr>
    <tr>
    </tr>
</table>
</html>

Output:

  corona
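
The tags in the question carry attributes (for example <TD width=609 valign=top>), which patterns that match a bare <td> would miss. A rough sketch of the same regex idea with attribute-tolerant patterns follows; the helper name second_table_cell and the sample string are made up for illustration, and the approach still assumes no nested tables and well-formed closing tags.

```perl
use strict;
use warnings;

# Hypothetical sample that mirrors the layout described in the question.
my $html = <<'HTML';
<html><body>
<TABLE><TR><TD>skip</TD></TR></TABLE>
<TABLE>
  <TR>
    <td rowspan=2>image</td>
    <TD width=609 valign=top><p>abc</p><p>xyz</p></TD>
  </TR>
</TABLE>
</body></html>
HTML

# Return the content of table 2, row 1, column 2, or nothing if any step fails.
sub second_table_cell {
    my ($doc) = @_;

    # [^>]* allows attributes inside the opening tag; /s lets . match newlines.
    my ($table2) = $doc    =~ m{</table>.*?<table\b[^>]*>(.*?)</table>}is or return;
    my ($row1)   = $table2 =~ m{<tr\b[^>]*>(.*?)</tr>}is                  or return;
    my ($col2)   = $row1   =~ m{</td>\s*<td\b[^>]*>(.*?)</td>}is          or return;
    return $col2;
}

print second_table_cell($html), "\n";    # prints <p>abc</p><p>xyz</p>
```

For real-world HTML a parser-based approach such as HTML::TableExtract is usually more robust than regular expressions.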