How can I use hxselect to generate an array-ish result?

I'm using hxselect to process an HTML file in bash.

In this file there are multiple divs defined with the '.row' class.

In bash I want to extract these 'rows' into an array. (The divs span multiple lines, so simply reading the file line by line is not suitable.)

Is it possible to achieve this? (With basic tools, awk, grep, etc.)

After assigning the rows to an array, I want to process them further:

for row in "${ROWS_EXTRACTED[@]}"; do
    PROCESS1 "$row"
    PROCESS2 "$row"
done

Thank you!

There are 2 best solutions below

The following instructs hxselect to separate matches with a tab, deletes all newlines, and then translates the tab separators to newlines. This enables you to iterate over the divs as lines with read:

#!/bin/bash

divs=$(hxselect -s '\t' .row < "$1" | tr -d '\n' | tr '\t' '\n')

while read -r div; do
    echo "$div"
done <<< "$divs"

Given the following test input:

<div class="container">
  <div class="row">
    herp
    derp
  </div>
  <div class="row">
    derp
    herp
  </div>
</div>

Result:

$ ./test.sh test.html
<div class="row">    herp    derp  </div>
<div class="row">    derp    herp  </div>
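Since the question asks for an array, the same pipeline can feed mapfile instead of a while loop. This is only a sketch: it needs bash 4 or later, and the sample_output variable is a made-up stand-in for the real hxselect output so the snippet runs on its own.

```shell
#!/bin/bash
# Hypothetical stand-in for: hxselect -s '\t' .row < "$1"
# (two matches, each followed by the tab separator)
sample_output=$'<div class="row">\n    herp\n    derp\n  </div>\t<div class="row">\n    derp\n    herp\n  </div>\t'

# Delete the newlines inside each match, turn the tab separators into
# newlines, and let mapfile store one match per array element.
mapfile -t rows < <(printf '%s' "$sample_output" | tr -d '\n' | tr '\t' '\n')

for row in "${rows[@]}"; do
    printf 'row: %s\n' "$row"
done
```

With real input, replace the printf with the hxselect call from the script above. As before, this assumes the divs contain no tab characters.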

One possibility would be to put the content of the tags in an array with each item enclosed in quotes. For example:

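# Build a string of items with " " as separator
items=`cat file.html | hxselect -i -c -s '" "' 'div.row'`
# Add " to the beginning of the string and remove the last
items='"'${items%'"'}
# Turn the quoted items into a real array. eval also expands $ and
# backticks found in the content, so only use it on trusted HTML.
eval "array=($items)"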

Then process the items in a for loop:

for index in "${!array[@]}"; do printf "  %s\n\n" "${array[$index]}"; done

If the tags contain the quote character, another solution is to use a separator character that does not occur in the tags' content (§ in my example):

array=`cat file.html | hxselect -i -c -s '§' 'div.row'`

Then process the result with awk:

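# Keep only the separators to count them with ${#res}
res="${array//[^§]}"
for (( i=1; i<=${#res}; i++ ))
do
    # $array is left unquoted on purpose: word splitting joins its lines,
    # so awk sees a single record and $i is the i-th item
    echo $array | awk -v i="$i" -F § '{print $i}'
    echo "----------------------------------------"
done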
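The separator idea can also fill a real bash array without awk, by letting read -d stop at each separator. This is a rough sketch under two assumptions of mine: the ASCII unit separator (0x1f) is used instead of § and never occurs in the content, and fake_hxselect is a hypothetical stand-in for the real hxselect call so the snippet runs on its own.

```shell
#!/bin/bash
# Assumption: the ASCII unit separator (0x1f) never occurs in the HTML content.
sep=$'\x1f'

# Hypothetical stand-in for: hxselect -i -c -s "$sep" 'div.row' < file.html
fake_hxselect() {
    printf '\n    herp\n    derp\n  %s\n    derp\n    herp\n  %s' "$sep" "$sep"
}

rows=()
# read -d stops at each separator instead of at each newline, so the
# newlines inside an item are kept as part of that item.
while IFS= read -r -d "$sep" row; do
    rows+=("$row")
done < <(fake_hxselect)

for i in "${!rows[@]}"; do
    printf 'row %d: %s\n' "$i" "${rows[$i]}"
    echo "----------------------------------------"
done
```

Unlike the awk version, each item keeps its original line breaks, which may or may not be what you want for further processing.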