Performing a second replace in backreference for a regexp

I have lines from a web page of the form

<a href="url with spaces">description with spaces</a> 

which I want to convert to a CSV format

"url%20with%20spaces","description with spaces"

to feed into a MediaWiki page that expects external links to be of the form [url%20with%20spaces description with spaces] (and I don't want that page to be cluttered with #rreplace).

sed -Ee 's`.*href="(.*)">(.*)</a>.*`"\1","\2"`'

can split out the URL, but I can't see an easy way to do a further substitution of spaces with %20 in just \1 without affecting \2.

Best answer

You might consider using GNU awk, like this:

awk -F'href="|">|</a>' '{gsub(/ /, "%20",$2);print "\""$2"\",\""$3"\""}'

The field separator pattern here is href="|">|</a>. It matches either href=", or ">, or </a>, and splits the line into fields at each match.
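
To check how that separator splits a line, here is a small sketch that prints the second and third fields on their own. The sample line is the one from the question; the variable names line and fields are only illustrative.

```shell
# Sample input line taken from the question.
line='<a href="url with spaces">description with spaces</a>'

# Split on href=", "> or </a> and show fields 2 and 3 in brackets.
fields=$(printf '%s\n' "$line" |
  awk -F'href="|">|</a>' '{ print "2=[" $2 "]"; print "3=[" $3 "]" }')

printf '%s\n' "$fields"
# prints:
# 2=[url with spaces]
# 3=[description with spaces]
```

Field 1 holds whatever precedes href=" (here the "<a " prefix), which is why the URL lands in field 2 and the description in field 3.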

The second field needs additional processing, so gsub(/ /, "%20", $2) is used to replace each space in it with %20. The updated field 2 and the untouched field 3 are then used to form the output.
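
As a quick end-to-end check, the command can be run against the sample line from the question. This is a minimal sketch; the variable names line and result are only illustrative, and in practice the input would more likely come from a file or a pipe.

```shell
# Sample input line taken from the question.
line='<a href="url with spaces">description with spaces</a>'

# Escape spaces in the URL (field 2) only, then emit both fields as quoted CSV.
result=$(printf '%s\n' "$line" |
  awk -F'href="|">|</a>' '{ gsub(/ /, "%20", $2); print "\"" $2 "\",\"" $3 "\"" }')

printf '%s\n' "$result"
# prints: "url%20with%20spaces","description with spaces"
```

Because gsub is given $2 as its target, the spaces in the description (field 3) are left alone.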