gsub: remove till first occurrence instead of last occurrence of a given character in a line


I have an HTML file from which I am trying to remove the first occurrence of <...> on each line with awk's sub/gsub functions.

I used the awk regex operators . * + to match anything between < and >. However, the first occurrence of > is being escaped (?). I don't know if there is a workaround.

Sample input, file.txt (a trailing x is added so that the output lines are not empty):

<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x

code:

awk '{gsub(/^<.*>/,""); print}' file.txt

current output:

x
x
x

desired output:

fruit</div></td>x
banana</span>x
apple</td>x

There are 2 best solutions below

RavinderSingh13 On BEST ANSWER

With your shown samples, please try the following awk code. It uses awk's sub function to replace everything from the leading < up to and including the first > with an empty string ([^>]* matches any run of characters other than >, so the match stops at the first >). The trailing 1 then prints every line, edited or not.

awk '{sub(/^<[^>]*>/,"")} 1' Input_file
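As a quick check against the sample from the question, the same command can be fed inline input instead of a file. This is only a sketch: the function name strip_first_tag is made up for illustration, and the input lines are copied from the question.

```shell
# Hypothetical helper: remove only the leading <...> tag from each input line.
strip_first_tag() {
  awk '{ sub(/^<[^>]*>/, "") } 1'
}

# The three sample lines from the question, passed on stdin.
printf '%s\n' '<div>fruit</div></td>x' '<span>banana</span>x' '<br/>apple</td>x' | strip_first_tag
```

Because the regex is anchored with ^, sub and gsub behave the same here: there can be at most one match per line.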


2nd solution: use awk's match function to match from the leading < to the first >, then print the rest of the line.

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file

Or, in case you have lines that do not start with < and you want to print them too, use the following:

awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
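The difference between the two match variants only shows up on lines without a leading tag. A minimal sketch, assuming a made-up two-line input whose second line is plain text (the variable names are illustrative only):

```shell
# Hypothetical mixed input: the second line does not start with a tag.
input='<div>fruit</div>x
plain text'

# Variant without the trailing 1: lines that do not match are dropped.
only_matched=$(printf '%s\n' "$input" |
  awk 'match($0, /^<[^>]*>/) { print substr($0, RSTART + RLENGTH) }')

# Variant with next and 1: lines that do not match are printed unchanged.
all_lines=$(printf '%s\n' "$input" |
  awk 'match($0, /^<[^>]*>/) { print substr($0, RSTART + RLENGTH); next } 1')

printf '%s\n---\n%s\n' "$only_matched" "$all_lines"
```

The first variant prints only fruit</div>x, while the second prints fruit</div>x followed by plain text.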
Daweo On

However, the first occurrence of > is being escaped (?).

No, you got this result because, as the GNU AWK manual says,

awk(...)regular expressions always match the leftmost, longest sequence of input characters that can match

This is called greedy matching in other languages' regular expression usage. So, for

<div>fruit</div></td>x

/^<.*>/ matches

<div>fruit</div></td>

so you end up with x. In languages that support so-called non-greedy matching you can use it in such a case, for example in ECMAScript:

let str = "<div>fruit</div></td>x";
let out_str = str.replace(/^<.*?>/, "");
console.log(out_str);

output

fruit</div></td>x

As the GNU AWK manual says, in GNU AWK the match is always the longest (greedy), so you have to use [^>], i.e. any character except >, to prevent the match from spanning from the first < to the last >, which would contain > inside.
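To see the two behaviours side by side in awk itself, one can print what each pattern actually matches rather than what is left over. This is a small sketch using match with RSTART and RLENGTH on the first sample line; the variable names are illustrative only.

```shell
line='<div>fruit</div></td>x'

# Greedy .* runs to the last > on the line.
greedy=$(printf '%s\n' "$line" |
  awk 'match($0, /^<.*>/) { print substr($0, RSTART, RLENGTH) }')

# [^>]* cannot cross a >, so the match stops at the first one.
bounded=$(printf '%s\n' "$line" |
  awk 'match($0, /^<[^>]*>/) { print substr($0, RSTART, RLENGTH) }')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy pattern matches <div>fruit</div></td>, while the bounded one matches only <div>.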