Get the specific content from html and print to txt file in Perl


I have an HTML file that contains paper IDs and paper titles, and I want to print each ID with its title, in order. Here are the HTML file and an example of the output I want.

<P>[ Paper ID - Title (# Reviewers) ]</P>
<DD><SELECT multiple size=10 name=papers[]> <OPTION value=2>&nbsp;&nbsp;2 - 
Switchable Glass: A possible medium for Evolvable Hardware (4)</OPTION> 
<OPTION value=3>&nbsp;&nbsp;3 - An Efficient Multi-Objective Evolutionary 
Algorithm for Combinational Circuit Design (3)</OPTION> <OPTION 
value=4>&nbsp;&nbsp;4 - A Background Mismatch Calibration for Capacitive 
Digital-to-Analog Converters (3)</OPTION> <OPTION value=5>&nbsp;&nbsp;5 - 
Designing Electronic Circuits by Means of Gene Expression Programming 
(3)</OPTION> <OPTION value=6>&nbsp;&nbsp;6 - Coherence Based Fault Detection 
And Error Correction (3)</OPTION> <OPTION value=7>&nbsp;&nbsp;7 - Wormhole 
Routing with Virtual Channels using Dynamic Rate Control for Network-on... 
(2)</OPTION> <OPTION value=8>&nbsp;&nbsp;8 - Noise Analysis of Phase Locked 
Loops (3)</OPTION> <OPTION value=9>&nbsp;&nbsp;9 - Design and Analysis of a 
Second Order Phase Locked Loops (PLLs) (2)</OPTION> <OPTION value=10>10 - 
SW-HW Co-design and fault tolerant implementation for the LRID Wireless 
communication... (3)</OPTION> <OPTION value=11>11 - Adaptive PID Controller 
Using Parameter Optimization Algorithm (2)</OPTION> <OPTION value=12>12 - A 
Novel Self-organizing Hybrid Network Protocol (2)</OPTION> <OPTION 
value=13>13 - An Adaptive FPGA-Based Mechatronic Control System Supporting 
Partial Reconfiguration... (3)</OPTION> <OPTION value=14>14 - Generalized 
Disjunction Decomposition for the Evolution of Programmable Logic Array... 
(3)</OPTION> <OPTION value=15>15 - Woofer-Tweeter Adaptive Optics System 
(1)</OPTION> <OPTION value=16>16 - A Re-Programmable Platform for Dynamic 
Burn-in Test of Xilinx VirtexII 3000 FPGA... (3)</OPTION> <OPTION 
value=17>17 - Using hardware-based particle swarm method for dynamic 
optimization of adaptive ... (2)</OPTION> <OPTION value=18>18 - 
Hardware/software coevolution of genome programs and cellular processors 
(2)</OPTION> <OPTION value=19>19 - Systolic Array Based Adaptive Beamformer 
Modelling in SystemC Environment (2)</OPTION> <OPTION value=20>20 - A 
Reconfigurable Hardware Design Using FPGA (2)</OPTION> <OPTION value=21>21 - 
An FPGA Implemented Processor Architecture with Adaptive Resolution 
(2)</OPTION> <OPTION value=22>22 - Evolving Hardware with 
Self-reconfigurable connectivity in Xilinx FPGAs (2)</OPTION> <OPTION 
value=23>23 - Particle Swarm Optimization with Discrete Recombination: An 
Online Optimizer for... (2)</OPTION> <OPTION value=24>24 - Towards the 
Integration of Drive Control Loop Electronics of the JPL/Boeing Gyroscope... 
(2)</OPTION> <OPTION value=25>25 - An Incremental Evolutionary Strategy for 
the Design of FIR Filters Targeting Real... (2)</OPTION> <OPTION value=26>26 
- Adaptive Micro-Antenna on Silicon Substrate (3)</OPTION> <OPTION 
value=27>27 - Towards Fluent Sensor Networks: A Scalable and Robust 
Self-Deployment Approach (3)</OPTION> <OPTION value=28>28 - Comparison of 
Fuzzy-C Means, Hard C-Means and Differential Evolution Algorithm in... 
(2)</OPTION> <OPTION value=29>29 - Evolutionary Design of Digital Circuits: 
Where Are Current Limits? (2)</OPTION> <OPTION value=30>30 - GEZGİN &amp; 
GEZGİN-2: Adaptive Real-Time Image Processing Subsystems for Earth 
Observing... (3)</OPTION> <OPTION value=31>31 - A Multi-objective Genetic 
Algorithm for On-chip Real-time Adaptation of a Multi-... (2)</OPTION> 
<OPTION value=32>32 - An Efficient Technique for Preventing Single Event 
Disruptions in Synchronous and... (1)</OPTION> <OPTION value=33>33 - 
Architecture of a Dynamically Reconfigurable NoC for Adaptive Reconfigurable 
MPSoC (0)</OPTION> <OPTION value=34>34 - Embedded Reconfigurable Array 
Fabrics for Efficient Implementation of Image Compression... (1)</OPTION> 
<OPTION value=35>35 - Routing in Wireless Sensor Networks Using Ant Colony 
Optimization (2)</OPTION> <OPTION value=36>36 - Simulation of 
Multifunctional Combinational Modules Controlled by Vdd (3)</OPTION> <OPTION 
value=37>37 - Reconfigurable Parallel Computing Architecture for On-Board 
Data Processing (2)</OPTION> <OPTION value=38>38 - On comparison of Variable 
Length Representations by Means of Unconstrained Evolution... (3)</OPTION> 
<OPTION value=39>39 - VLSI Implementation of LMS Equaliser with Adaptive 
Length Selection for Wireless... (0)</OPTION> <OPTION value=41>41 - A 
Scalable Reconfigurable Analog to Digital Converter Architecture Targeting 
Low... (0)</OPTION> <OPTION value=42>42 - Linear Prediction with 
Differential Evolution Algorithm (2)</OPTION> <OPTION value=43>43 - Genetic 
Algorithm based Engine for Domain-Specific Reconfigurable Arrays 
(0)</OPTION> <OPTION value=44>44 - Non-Uniform Search Domain based Genetic 
Algorithm for the Synthesis and Continuous... (2)</OPTION> <OPTION 
value=45>45 - Design Concepts for a Dynamically Reconfigurable Wireless 
Sensor Node (2)</OPTION> <OPTION value=46>46 - On-Board Partial Run-Time 
Reconfiguration for Pico-Satellite Constellations (2)</OPTION> <OPTION 
value=47>47 - A Framework of Evolvable and Reconfigurable Sensor Networks 
for Aerospace –based... (0)</OPTION> <OPTION value=48>48 - Analytical 
Modelling of Power Attenuation under Parameter Fluctuations with 
Applications... (2)</OPTION> <OPTION value=49>49 - A New State Space 
Representation Method for Adaptive Log Domain Systems (2)</OPTION> <OPTION 
value=50>50 - Swarm Based Incremental Learning for Combinational Circuit 
Evolution (2)</OPTION> <OPTION value=51>51 - Gene Regulation Mechanisms 
introduced in the E valuation Criteria for a Hardware... (2)</OPTION> 
<OPTION value=52>52 - Automatic Hybrid Genetic Algorithm Based Printed 
Circuit Board Inspection (2)</OPTION> <OPTION value=53>53 - Population based 
FPGA solution to Mastermind game (2)</OPTION> <OPTION value=54>54 - A Large 
Scale Adaptable Multiplier for Cryptographic Applications (2)</OPTION> 
<OPTION value=55>55 - A Self-Tuning Analog Proportional-Integral-Derivative 
(PID) Controller (2)</OPTION> <OPTION value=56>56 - Self-Configurable Neural 
Network Processor for Adaptable FIR Filters (3)</OPTION> <OPTION value=57>57 
- On-Chip Evolution Using a Soft Processor Core Applied to Image Recognition 
(2)</OPTION> <OPTION value=58>58 - A Novel Adaptive Viterbi Algorithm and 
Its Implementation (2)</OPTION> <OPTION value=59>59 - An Efficient Hardware 
Architecture for H.264 Adaptive Deblocking Filter (2)</OPTION> <OPTION 
value=60>60 - A Low-Complexity Self-Calibrating Adaptive Quadrature Receiver 
(2)</OPTION> <OPTION value=61>61 - A Honeycomb Development Architecture for 
Robust Fault-Tolerant Design (2)</OPTION> <OPTION value=62>62 - Sate-Space 
based Analytical Modelling for Real-Time Fault Recovery and Self-Repair... 
(2)</OPTION> <OPTION value=63>63 - Strategies to On- Line Failure Recovery 
in Self- Adaptive Systems based on Dynamic... (2)</OPTION> <OPTION 
value=64>64 - A Platform for Digital Intrinsic Hardware Evolution 
(2)</OPTION> <OPTION value=65>65 - Face Recognition Using a Gabor Filter 
Bank Approach (2)</OPTION> <OPTION value=66>66 - Protecting Fingerprint Data 
using Watermarking (2)</OPTION> <OPTION value=67>67 - Debug Support for 
System-on-Chips, Considerations for Reconfigurable and Hybrid ... 
(2)</OPTION> <OPTION value=68>68 - Novel Techniques for Ensuring Secure 
Communications for Distributed Low Power Devices (2)</OPTION> <OPTION 
value=69>69 - A Modular Framework for the Evolution of Circuits on 
Configurable Transistor Array... (2)</OPTION> <OPTION value=70>70 - Power 
Driven Reconfigurable Complex Continuous Wavelet Transform Processor 
(2)</OPTION> <OPTION value=71>71 - A Tuning Technique for Switched-Capacitor 
Circuits (0)</OPTION> <OPTION value=72>72 - An Automatic Technique to 
Synthesize System-on-a-Chip to Adapt to Changing Environments (2)</OPTION> 
<OPTION value=73>73 - Picosatellite Constellations for Remote Sensing in LEO 
(2)</OPTION> <OPTION value=74>74 - Evolvable Hardware Applied to 
Nanotechnology (1)</OPTION> <OPTION value=75>75 - Gate-level Morphogenetic 
Evolvable Hardware for Scalability and Adaptation on FPGAs (2)</OPTION> 
<OPTION value=76>76 - Synthesis of MOS Analog Circuits by Evolutionary 
Methods (2)</OPTION> <OPTION value=77>77 - An Adaptive HDL Design 
Methodology for Hard IP and Soft IP Co-Protection (2)</OPTION> <OPTION 
value=78>78 - FSM and HSM watermarking: A Tutorial (3)</OPTION> <OPTION 
value=79>79 - Physics-based Model applied to Evolvable Hardware (2)</OPTION> 
<OPTION value=80>80 - A Generic On-Chip Debugger for Wireless Sensor 
Networks (goCDWSN) (2)</OPTION> <OPTION value=81>81 - The Gannet 
Service-based SoC: A Service-level Reconfigurable Architecture (2)</OPTION> 
<OPTION value=82>82 - A FPGA simulation using asexual genetic algorithms for 
integrated self-repair (2)</OPTION> <OPTION value=83>83 - USING THE 
(2)</OPTION> <OPTION value=84>84 - A Comparing Design of Satellite Attitude 
Control System Based on Reaction Wheel (0)</OPTION></SELECT> 

The text file that I want to create with Perl is:

2 - Switchable Glass: A possible medium for Evolvable Hardware (4)
3 - An Efficient Multi-Objective Evolutionary Algorithm for Combinational Circuit Design (3)
4 - A Background Mismatch Calibration for Capacitive Digital-to-Analog Converters (3)
5 - Designing Electronic Circuits by Means of Gene Expression Programming (3)
6 - Coherence Based Fault Detection And Error Correction (3)
7 - Wormhole Routing with Virtual Channels using Dynamic Rate Control for Network-on... (2)
8 - Noise Analysis of Phase Locked Loops (3)
9 - Design and Analysis of a Second Order Phase Locked Loops (PLLs) (2)
10 - SW-HW Co-design and fault tolerant implementation for the LRID Wireless communication... (3)
11 - Adaptive PID Controller Using Parameter Optimization Algorithm (2)
12 - A Novel Self-organizing Hybrid Network Protocol (2)
13 - An Adaptive FPGA-Based Mechatronic Control System Supporting Partial Reconfiguration... (3)
14 - Generalized Disjunction Decomposition for the Evolution of Programmable Logic Array... (3)
15 - Woofer-Tweeter Adaptive Optics System (1)
16 - A Re-Programmable Platform for Dynamic Burn-in Test of Xilinx VirtexII 3000 FPGA... (3)
17 - Using hardware-based particle swarm method for dynamic optimization of adaptive ... (2)
18 - Hardware/software coevolution of genome programs and cellular processors (2)
19 - Systolic Array Based Adaptive Beamformer Modelling in SystemC Environment (2)
20 - A Reconfigurable Hardware Design Using FPGA (2)
21 - An FPGA Implemented Processor Architecture with Adaptive Resolution (2)
22 - Evolving Hardware with Self-reconfigurable connectivity in Xilinx FPGAs (2)
23 - Particle Swarm Optimization with Discrete Recombination: An Online Optimizer for... (2)
24 - Towards the Integration of Drive Control Loop Electronics of the JPL/Boeing Gyroscope... (2)
25 - An Incremental Evolutionary Strategy for the Design of FIR Filters Targeting Real... (2)
26 - Adaptive Micro-Antenna on Silicon Substrate (3)
27 - Towards Fluent Sensor Networks: A Scalable and Robust Self-Deployment Approach (3)
28 - Comparison of Fuzzy-C Means, Hard C-Means and Differential Evolution Algorithm in... (2)
29 - Evolutionary Design of Digital Circuits: Where Are Current Limits? (2)

So far I've written this code, but I can't figure out why it doesn't work. It prints nothing to either the screen or the text file. Any help will be appreciated. Thank you!

use strict;
use warnings;  

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    do { local $/; <DATA> }
);
open(my $fh, '>', 'outputs.txt');
my $i = 2;
for ( $tree->look_down( 'name' => 'papers' ) ) {
    my $papers = $_->look_down( 'OPTION value' => 'i' )->as_trimmed_text;
    # my $comment  = $_->look_down( 'class' => 'content' )->as_trimmed_text;
    # my $name     = $_->look_down( '_tag'  => 'h3' )->as_trimmed_text;
    # $name =~ s/^Re:\s*//;
    # $name =~ s/\s*$location\s*$//;

    print "Paper: $papers\n";
    print $fh "Paper: $papers\n";
}

Try this:

#! /usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    do { local $/; <DATA> }    # slurp the HTML that follows __DATA__
);

# $tree->dump;

for ( $tree->look_down( 'name' => 'papers[]' ) ) {
    for my $p ( $_->look_down( '_tag' => 'option' ) ) {
        print "Paper: " . $p->as_trimmed_text( extra_chars => '\xA0' ) . "\n";
    }
}


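There are two problems with your code. First, the interesting element is named papers[], not papers (I used $tree->dump to find that out), so your outer look_down() matches nothing and the loop body never runs. Second, both the arguments and the return value of your second look_down() are wrong: 'OPTION value' is not an attribute name, 'i' is a literal string rather than the variable $i, and in scalar context look_down() returns only the first match, so it could never give you every option.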
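Since the question also asks for a text file, here is a minimal self-contained sketch that applies the same fix and writes the lines to outputs.txt. The helper name extract_papers and the shortened HTML in the heredoc are assumptions made for illustration, not part of the original code. It relies on the extra_chars option of as_trimmed_text to trim the non-breaking spaces that &nbsp; decodes to.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not from the question): returns one cleaned-up
# line of text per <option> inside the <select name="papers[]">.
sub extract_papers {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @lines;
    for my $select ( $tree->look_down( '_tag' => 'select', 'name' => 'papers[]' ) ) {
        for my $option ( $select->find('option') ) {
            # extra_chars => '\xA0' also trims the non-breaking spaces from &nbsp;
            push @lines, $option->as_trimmed_text( extra_chars => '\xA0' );
        }
    }
    $tree->delete;    # release the parse tree
    return @lines;
}

# Shortened stand-in for the HTML shown in the question.
my $html = <<'HTML';
<P>[ Paper ID - Title (# Reviewers) ]</P>
<DD><SELECT multiple size=10 name=papers[]> <OPTION value=2>&nbsp;&nbsp;2 -
Switchable Glass: A possible medium for Evolvable Hardware (4)</OPTION>
<OPTION value=8>&nbsp;&nbsp;8 - Noise Analysis of Phase Locked
Loops (3)</OPTION></SELECT>
HTML

my @papers = extract_papers($html);

# Write one paper per line, checking that the file could be opened.
open( my $fh, '>', 'outputs.txt' ) or die "Cannot open outputs.txt: $!";
print {$fh} "$_\n" for @papers;
close($fh) or die "Cannot close outputs.txt: $!";
```

To run it against the real page, read the file contents into $html instead of using the heredoc.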


You're overcomplicating things with the second look_down(), which is mainly meant for matching on attributes. Simply find() the <option> elements.

use feature 'say';    # needed for say()
foreach my $papers ( $tree->look_down( 'name' => 'papers[]' ) ) {
    foreach my $option ( $papers->find( 'option' ) ) {
        say $option->as_trimmed_text;
    }
}

Also note that the name attribute of the <select> is papers[], not papers. The [] are part of the name.


Or a bit closer to your original code:

use strict;
use warnings;  

use HTML::TreeBuilder;
use Data::Dumper;

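my $tree = HTML::TreeBuilder->new_from_content(
    do { local $/; <DATA> }
);
open(my $fh, '>', 'outputs.txt') or die "Cannot open outputs.txt: $!";
my $i = 2;
for my $select ( $tree->look_down( 'name' => 'papers[]' ) ) {
    # Stops at the first value that has no matching <option>.
    while (my $option = $select->look_down( 'value' => $i)) {
        my $papers = $option->as_trimmed_text;
        print "Paper: $papers\n";
        print $fh "Paper: $papers\n";
        $i++;    # advance to the next paper ID; without this the loop never ends
    }
}
close($fh);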

As stated already, the name is papers[], not papers, and you had the value pointing to the literal string 'i' rather than the variable $i.
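One caveat with looking options up by a running counter: it depends on the IDs being contiguous, and in the sample HTML ID 40 is absent (39 is followed by 41). A rough sketch that reads each option's own value attribute instead, so gaps do not matter, might look like this. The helper name papers_by_id and the two-option sample are assumptions for illustration only.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): maps each <option>'s value
# attribute to its trimmed text, regardless of gaps in the numbering.
sub papers_by_id {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %title_for;
    for my $option ( $tree->look_down( '_tag' => 'option' ) ) {
        my $id = $option->attr('value');
        next unless defined $id;    # skip options without a value attribute
        $title_for{$id} = $option->as_trimmed_text( extra_chars => '\xA0' );
    }
    $tree->delete;    # release the parse tree
    return %title_for;
}

# Two options from the question's HTML, on either side of the missing ID 40.
my $html = <<'HTML';
<SELECT multiple size=10 name=papers[]>
<OPTION value=39>39 - VLSI Implementation of LMS Equaliser with Adaptive
Length Selection for Wireless... (0)</OPTION> <OPTION value=41>41 - A
Scalable Reconfigurable Analog to Digital Converter Architecture Targeting
Low... (0)</OPTION></SELECT>
HTML

my %title_for = papers_by_id($html);

# Print in numeric ID order; missing IDs are simply not present in the hash.
print "$title_for{$_}\n" for sort { $a <=> $b } keys %title_for;
```

Sorting the keys numerically keeps the output in ID order even if the options appear in a different order in the page.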