Replace the values under some columns using awk or sed


I have a dataset that contains some strange values under the columns starting with "AS_".

It is a CSV file, but for readability I am showing it here space-separated:

sl no   AS_2f   AD_2f   SR_2f   SV_2f   AS_5f   AD_5f   SR_5f   SV_5f
1.0     3.0     5.0     35.0    9.0     11.0    13.0    15.0    17.0
2.0     3.0     4.0     5.0     6.0     107.0   8.0     204.0   10.0
3.0     3.4     34.0    4.2     4.6     5.0     5.4     5.8     6.2
4.0     3.0     2.0     1.0     0.0     -1.0    -2.0    204.0   -4.0
5.0     24.5    44.0    63.5    83.0    102.5   122.0   141.5   161.0
6.0     32.0    58.0    84.0    110.0   136.0   162.0   188.0   214.0 

cat << EOF > data.csv
sl no,AS_2f,AD_2f,SR_2f,SV_2f,AS_5f,AD_5f,SR_5f,SV_5f
1.0,3.0,5.0,35.0,9.0,11.0,13.0,15.0,17.0
2.0,3.0,4.0,5.0,6.0,107.0,8.0,204.0,10.0
3.0,3.4,34.0,4.2,4.6,5.0,5.4,5.8,6.2
4.0,3.0,2.0,1.0,0.0,-1.0,-2.0,204.0,-4.0
5.0,24.5,44.0,63.5,83.0,102.5,122.0,141.5,161.0
6.0,32.0,58.0,84.0,110.0,136.0,162.0,188.0,214.0
EOF

I would like to replace the values greater than 20 with "nan" in all "AS_" columns (AS_2f, AS_5f, AS_9f, etc.). There are many more AS_ columns.

I can do it for AS_2f only; I am not able to do it for the other AS_ columns. My script is:

awk -F',' 'NR==1 {print} NR>1 {for(i=2; i<=NF; i++) if ($i > 20 && i==2) $i="nan"; print}' data.csv

Desired output:

sl no,AS_2f,AD_2f,SR_2f,SV_2f,AS_5f,AD_5f,SR_5f,SV_5f
1.0,3.0,5.0,35.0,9.0,11.0,13.0,15.0,17.0
2.0,3.0,4.0,5.0,6.0,nan,8.0,204.0,10.0
3.0,3.4,34.0,4.2,4.6,5.0,5.4,5.8,6.2
4.0,3.0,2.0,1.0,0.0,-1.0,-2.0,204.0,-4.0
5.0,nan,44.0,63.5,83.0,nan,122.0,141.5,161.0
6.0,nan,58.0,84.0,110.0,nan,162.0,188.0,214.0

There are 4 answers below.

jhnc (accepted answer):
awk '
    BEGIN { FS = OFS = "," }

    {
        # perform any substitutions
        # (does nothing when NR==1 because c is empty)
        for (i in c)
            if ($i>20)
                $i="nan"
    
        # apply OFS (optional if FS==OFS)
        $1=$1

        # output
        print
    }

    # save column numbers to check
    NR==1 {
        for (i=1; i<=NF; i++)
            if ($i~/^AS_/)
                c[i]
    }
' data.csv
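If the prefix or the cut-off ever needs to change, the same idea can be parameterised with -v. A minimal sketch (the variable names pfx and lim are my own, not from the answer above), shown here on a trimmed-down sample piped in via printf:

```shell
printf 'sl no,AS_2f,AD_2f\n1.0,35.0,9.0\n2.0,3.0,4.0\n' |
awk -v pfx="AS_" -v lim=20 '
    BEGIN { FS = OFS = "," }

    # remember which header columns start with the prefix
    NR == 1 { for (i = 1; i <= NF; i++) if (index($i, pfx) == 1) c[i]; print; next }

    # replace over-limit values in just those columns
    { for (i in c) if ($i + 0 > lim) $i = "nan"; print }
'
```

This prints 1.0,nan,9.0 for the first data row while leaving the AD_2f column alone; the `$i + 0` forces a numeric comparison.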
potong:

This might work for you (GNU sed):

sed -E '1h;1b;G;s/$/\n/
        :a;s/^([^,\n]+,?)([^\n]*\n)([^,\n]+,?)([^\n]*\n.*)/\2\4\n\1,\3/;ta
        s/^([0-9]{3,}\..|[3-9].\..|2[^0]\..|20\.[^0]),+AS_.*/nan,/mg
        s/,.*/,/mg;s/\n//g;s/,$//' file

The main idea is to pair up each column with its heading and then substitute nan wherever the column's value is above 20 and its heading contains AS_.

Make a copy of the headings.

Append the copy to each line with an appended newline.

Pair up each column with its heading, separating value from heading by a comma.

Replace any value above 20 and whose heading contains AS_ with nan,.

Remove any headings, newlines or introduced commas and print the result.

N.B. This assumes the input file is a CSV file, not space-separated as the OP posted it.

Daweo:

I would improve

awk -F',' 'NR==1 {print} NR>1 {for(i=2; i<=NF; i++) if ($i > 20 && i==2) $i="nan"; print}' data.csv

in the following way. Let data.csv contain

sl no,AS_2f,AD_2f,SR_2f,SV_2f,AS_5f,AD_5f,SR_5f,SV_5f
1.0,3.0,5.0,35.0,9.0,11.0,13.0,15.0,17.0
2.0,3.0,4.0,5.0,6.0,107.0,8.0,204.0,10.0
3.0,3.4,34.0,4.2,4.6,5.0,5.4,5.8,6.2
4.0,3.0,2.0,1.0,0.0,-1.0,-2.0,204.0,-4.0
5.0,24.5,44.0,63.5,83.0,102.5,122.0,141.5,161.0
6.0,32.0,58.0,84.0,110.0,136.0,162.0,188.0,214.0

then

awk 'BEGIN {FS=OFS=","} NR==1 {print; split($0, names)} NR>1 {for(i=2; i<=NF; i++) if ($i > 20 && index(names[i],"AS_")==1) $i="nan"; print}' data.csv

gives output

sl no,AS_2f,AD_2f,SR_2f,SV_2f,AS_5f,AD_5f,SR_5f,SV_5f
1.0,3.0,5.0,35.0,9.0,11.0,13.0,15.0,17.0
2.0,3.0,4.0,5.0,6.0,nan,8.0,204.0,10.0
3.0,3.4,34.0,4.2,4.6,5.0,5.4,5.8,6.2
4.0,3.0,2.0,1.0,0.0,-1.0,-2.0,204.0,-4.0
5.0,nan,44.0,63.5,83.0,nan,122.0,141.5,161.0
6.0,nan,58.0,84.0,110.0,nan,162.0,188.0,214.0

Explanation of changes made: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I use the string functions split and index. On the 1st line I populate the array names, so names[1] is sl no, names[2] is AS_2f, names[3] is AD_2f, and so on. For each column in each data row I check whether its name starts with AS_ using the index function and checking that it returns 1, which means AS_ is at the start.

(tested in GNU Awk 5.1.0)
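As a quick illustration of that index test (my own toy example, not part of the answer), index returns the 1-based position of the first match, or 0 if there is none:

```shell
awk 'BEGIN {
    print index("AS_2f",  "AS_")   # 1: found at position 1, i.e. a prefix
    print index("SR_2f",  "AS_")   # 0: not found at all
    print index("XAS_2f", "AS_")   # 2: found, but not at the start
}' </dev/null
```

So comparing the result to 1, rather than just testing it for truthiness, is what makes this a prefix check.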

dawg:

Here is a Ruby solution using the built-in CSV parser to do that:

ruby -r csv -e 'BEGIN{opts={:headers=>true, :converters => :numeric} }

tbl=CSV.parse($<.read, **opts)

tbl.headers.select{|col| col[/^AS_/] }.
    each{|hdr| tbl.by_col![hdr]=tbl.by_col[hdr].map{|v| v>20 ? "nan" : v} }

puts tbl' file 

For an awk I would do:

awk 'BEGIN{FS=OFS=","}
FNR==1{ # get headers
    for(i=1;i<=NF;i++) if ($i~/^AS_/) c[i]
}

FNR>1 {for (e in c) if ($e>20) $e="nan"}
1' file 

Either prints:

sl no,AS_2f,AD_2f,SR_2f,SV_2f,AS_5f,AD_5f,SR_5f,SV_5f
1.0,3.0,5.0,35.0,9.0,11.0,13.0,15.0,17.0
2.0,3.0,4.0,5.0,6.0,nan,8.0,204.0,10.0
3.0,3.4,34.0,4.2,4.6,5.0,5.4,5.8,6.2
4.0,3.0,2.0,1.0,0.0,-1.0,-2.0,204.0,-4.0
5.0,nan,44.0,63.5,83.0,nan,122.0,141.5,161.0
6.0,nan,58.0,84.0,110.0,nan,162.0,188.0,214.0
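A side note on the bare 1 at the end of the awk script above: it is a pattern that is always true, and a pattern with no action defaults to printing the current record, so it is a common shorthand for { print }. For example:

```shell
# `1` is always true; with no action block, awk prints each record as-is
printf 'a\nb\n' | awk '1'
```

This simply echoes its input, which is why appending 1 to the script prints every (possibly modified) line.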