LWP::Simple runs very well: how do I store 6000+ records in a file and do some cleanup?


Good evening, dear community!

I want to process multiple web pages, much as a web spider/crawler would. I have some pieces in place, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

Update:

Thanks to two great comments I have gained a lot, and now the code runs very nicely. Last question: how do I store the data in a file, i.e. force the parser to write the results to a file? That is much more convenient than receiving more than 6000 records on the command line. And once the output is in a file, it needs some final cleanup: see the output below. If we compare the output with the target URL, it clearly needs some cleanup, don't you think? Again, see the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g 
97475       Zeil","09524/94995
09524/94997",,Volksschulen,"      www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367       Zeilar",,"08572/439
08572/920001",,Volksschulen,"      www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstraße 5
93197       Zeitlar",,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstraße 5
93197       Zeitlar",,,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799       Zeitlofs",,"09746/347
09746/347",,Volksschulen,"      grundschule-zeitlofs.de"

Thanks for any and all info! zero

Here is the old question: it seems to work fine as a one-shot function, but as soon as I include the function in a loop, it doesn't return anything. What's the deal?

To begin at the beginning: see the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50 This page has more than 6000 results! Well, how do I get all the results? I use the module LWP::Simple, and I need some improved arguments so that I can fetch all 6150 records. I have code that stems from the very supportive member tadmic (see this forum) and that basically runs very nicely. But after adding some lines, it (at the moment) spits out some errors.

Attempt: Here are the first 5 page URLs:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" attribute in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:

#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  

my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  

my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
    $html =~ tr/r//d;     # strip the carriage returns  
    $html =~ s/&nbsp;/ /g; # expand the spaces  

    my $te = new HTML::TableExtract();  
    $te->parse($html);  

    my $csv = Text::CSV->new({ binary => 1 });  

    foreach my $ts ($te->table_states) {  
        foreach my $row ($ts->rows) {  
            #trim leading/trailing whitespace from base fields  
            s/^s+//, s/\s+$// for @$row;  

            #load the fields into the hash using a "hash slice"  
            my %h;  
            @h{@cols} = @$row;  

            #derive some fields from base fields, again using a hash slice  
            @h{qw/name street postal town/} = split /n+/, $h{name};  
            @h{qw/phone fax/} = split /n+/, $h{phone};  

            #trim leading/trailing whitespace from derived fields  
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

            $csv->combine(@h{@fields});  
            print $csv->string, "\n";  
        }  
    } 
}

I tested the code and got the following results.

By the way, here are lines 57 and 58 — the command line says I have errors there:

    #trim leading/trailing whitespace from derived fields  
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

What do you think? Are some backslashes missing? How do I fix and test-run the code so that the results are correct?

I look forward to hearing from you. zero

See the errors that I get:

    Ot",,,Telefo,Fax,Schulat,Webseite
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    "lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame


Answer 1 (accepted):

If you're trying to extract links from the pages, use WWW::Mechanize, which is a wrapper around LWP and properly parses the HTML to get the links for you, as well as a zillion other convenience things for people scraping web pages.

Answer 2:

These warnings occur whenever $_ is undef and a substitution involves it; the s/// construct implicitly works on $_. The solution is to check definedness before attempting a substitution.
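As a minimal, self-contained sketch of that fix (the row values are made up for illustration; an undef entry stands in for a table row with fewer cells than the hash slice expects), guarding each substitution with defined skips undef entries without warnings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical row with a missing (undef) column.
my @row = ('  Volksschule Zeil  ', undef, "Zeil\n");

for (@row) {
    # The defined() guard prevents "Use of uninitialized value" warnings.
    defined and s/^\s+//, s/\s+$//;
}

print defined $_ ? "[$_]" : "[undef]", "\n" for @row;
```

Run under `use warnings`, this stays silent, whereas dropping the `defined and` guard would warn once for the undef element.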

Apart from that, and although it is unrelated to the warnings, you have a logical error in your regular expression:

s/^s+//, s/\s+$// for @h{qw/name street postal town/};

Note the missing \ before the s in the first substitution.

Removing the error and simplifying:

defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};

To write the output to a file, add the following before the for loop:

open my $fh, '>', '/path/to/output/file' or die $!;

Replace:

print $csv->string, "\n";

with:

print $fh $csv->string, "\n";

That is a syntactic change from print LIST to print FILEHANDLE LIST.
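Putting those two pieces together, here is a minimal sketch of the open/print/close pattern (the filename schulen.csv and the two sample rows are placeholders, not taken from the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Open the file once, before the loop, so all rows land in one file.
open my $fh, '>', 'schulen.csv' or die "Cannot open schulen.csv: $!";

for my $line ('6114,7754,"Volksschule Zeil"', '6115,7757,"Mittelschule Zeil"') {
    print $fh $line, "\n";    # print FILEHANDLE LIST - note: no comma after $fh
}

close $fh or die "Cannot close schulen.csv: $!";
```

The absence of a comma between the filehandle and the list is what distinguishes print FILEHANDLE LIST from print LIST.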


Answer 3:

This line won't remove carriage returns as you say:

$html =~ tr/r//d;     # strip the carriage returns  

You will need:

$html =~ tr/\r//d;     # strip the carriage returns

And possibly even, if you also want the newlines gone:

$html =~ tr/\r\n//d;     # strip the carriage returns
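To see the difference between the two tr/// calls (the sample string is made up), one deletes only carriage returns, the other deletes every line ending:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "row1\r\nrow2\r\n";

# Copy, then delete in the copy, leaving $html untouched.
(my $cr_only = $html) =~ tr/\r//d;    # "row1\nrow2\n" - newlines kept
(my $no_eol  = $html) =~ tr/\r\n//d;  # "row1row2"     - all line endings gone
```

Whether deleting the newlines too is safe depends on whether the later regexes (e.g. the split /\n+/ calls) still need them as field separators.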