I have a tab seperated text file A.tsv of icd code matrix which the columns are patient id and icd codes, rows are observations for each patients. NA indicate the patient are not diagnosed as the icd codes
study_id 691.8 692.9 701.2 706.1
a1 1 NA NA 2
a2 NA NA NA NA
a3 NA NA 1 NA
and a icd_code file consisting of the icd codes of interest
691.8 ICD_9
706.1 ICD_10
For a patient, if any icd codes of interest have a value (not NA), the diagnosis is coded as 1. If all icd codes of interest are NA, the diagnosis is coded as 0.
For the above example, the output should be
study_id diagnosis
a1 1
a2 0
a3 0
For the above example, the output should be
study_id diagnosis
a1 1
a2 0
a3 0
I am new to bash script and I have no clue on where should I start. How could I write a bash script with awk to realize the above question?
Now I have a solution but it seems that it is summarizing all columns but not specific icd code of interest listed in the icd file
awk -F"\t" 'BEGIN { OFS="\t"; } NR==FNR { icd_codes[$1] = $2; next; } FNR > 1 { study_id = $1; diagnosis = 0; for (i = 2; i <= NF; i++) { if ($i != "NA" && icd_codes[$i] != "") { diagnosis = 1; break; } } print study_id, diagnosis; }' "$icd_file" "$input_file" > "$output_file"
The main problem in your script is that
icd_codes[$i] != ""
should be using the column header string, e.g.691.8
, as the array index but instead is using the current value in the cell for that column, e.g.1
orNA
. You need an additional array to map from column numbers to column header strings or, more efficiently as it uses fewer loop iterations per input line as I've done below, from column header strings to column numbers.Using any awk: