Using awk, how to replace one string with another?


I need to create an awk script that converts a glyph (https://en.wikipedia.org/wiki/Glyph) to its Unicode escape (JavaScript syntax) and, in reverse, a Unicode escape back to the glyph.

The source data is edited in Notepad++ and saved with UTF-8 encoding.

Here's my progress.

Use_case_1

Dictionary file (dict_1_.txt):

A \u0041
À \u00C0

Input file (input_1_.txt):

A
À

awk script for generating Unicode for equivalent glyph:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_1_.txt input_1_.txt

correctly producing:

\u0041
\u00C0

Use_case_2

Dictionary file (dict_2_.txt)

\u0041 A
\u00C0 À

Input file (input_2_.txt)

\u0041
\u00C0

awk script for generating glyphs for equivalent Unicode:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_2_.txt input_2_.txt

correctly producing:

A
À

So I can successfully "round-trip" a single symbol per line.

But how do I deal with a more comprehensive dictionary and more than one word per row?

Here is sample data.

Input file (input_3_.txt)

PUDÍN, ALMIDÓN

Dictionary file (dict_3_.txt)

,   \u002C
A   \u0041
D   \u0044
I   \u0049
Í   \u00CD
L   \u004C
M   \u004D
N   \u006E
Ó   \u00D3
P   \u0050
U   \u0055
<space> \u0020

The awk script should generate:

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Input file (input_4_.txt)

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Dictionary file (dict_4_.txt)

\u002C  ,
\u0041  A
\u0044  D
\u0049  I
\u00CD  Í
\u004C  L
\u004D  M
\u006E  N
\u00D3  Ó
\u0050  P
\u0055  U
\u0020  <space>

The awk script should generate:

PUDÍN, ALMIDÓN

Here is a more complicated set of input strings (one per row):

MONO Y DIACETIL ÉSTERES DEL ÁCIDO TARTÁRICO DE MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS AÑADIDOS
043 HUEVAS DE PESCADO (INCLUYENDO ESPERMA=HUEVAS BLANDAS) Y VÍSCERAS COMESTIBLES DE PESCADO
ACEITE DE SOJA OXIDADO TÉRMICAMENTE Y EN INTERACCIÓN CON MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS
BANDEJA PLÁSTICA O CAZUELA, CUBIERTA DE PAPEL DE ALUMINIO O ENVOLTURA

In the dictionary examples above, I have used <space> to stand for the space character that appears between words and after a comma. This probably means that a solution should use \t as FS in both the dictionary file and the input file. Currently FS is a single space character, and RS is \n.

Further, I need to do the same for HTML hexadecimal character references, so a solution needs to process a dictionary file like this:

Í   &#xcd;
Ó   &#xd3;

as compared to the Dictionary example above:

Í   \u00CD
Ó   \u00D3

How can I improve or replace my simple awk scripts with scripts that process longer strings on multiple lines?

karakfa On BEST ANSWER

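Here is one approach. Note that you don't need two different versions of the dictionary: a single file can be loaded into two lookup arrays, one per direction.

With little effort these two scripts can be combined into one, with the direction of conversion controlled by a parameter. I intentionally kept the dictionary-loading part the same in both. The first script sets FS to the empty string so that each character becomes its own field; gawk behaves this way, but POSIX leaves an empty FS unspecified, so other awks may differ.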

$ awk 'NR==FNR {if(NF<2) $2=" "; u2a[$1]=$2; a2u[$2]=$1; next}   # dictionary; a line with no glyph means space
               {for(i=1;i<=NF;i++) $i=a2u[$i]}1' dict FS='' OFS='' input

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E
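For the hexadecimal dictionary asked about in the question (Í mapped to &#xcd;), the table could be derived from the \u one instead of being typed by hand. This is only a minimal sketch: it assumes every code is \u followed by hex digits, that the glyph is the second whitespace-separated field (so the space entry would need separate handling), and the file names dict_u.txt and dict_hex.txt are made up for illustration.

```shell
# Hypothetical file names; the two entries are taken from the question.
work=$(mktemp -d)
printf '%s\n' '\u00CD Í' '\u00D3 Ó' > "$work/dict_u.txt"

# \u00CD -> &#xcd; : drop the "\u", lowercase the digits, strip leading zeros.
awk '{
    hex = tolower(substr($1, 3))     # "00cd"
    sub(/^0+/, "", hex)              # "cd"
    if (hex == "") hex = "0"         # keep one digit for \u0000
    print $2, "&#x" hex ";"          # glyph first, as in the question
}' "$work/dict_u.txt" > "$work/dict_hex.txt"

cat "$work/dict_hex.txt"
```

The resulting file has the glyph in the first column and the reference in the second, so the same two-array loading shown above applies once the columns are read in that order.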

Working with the encoded input now:

$ awk 'NR==FNR {if(NF<2) $2=" "; u2a[$1]=$2; a2u[$2]=$1; next}
               {enc=$0; gsub(/....../,"& ",enc); n=split(enc,a); line="";   # cut into 6-character codes; reset line for each record
                for(i=1;i<=n;i++) line=line u2a[a[i]]; print line}' dict encoded_input

PUDÍN, ALMIDÓN

Both scripts use your dict_4 as the dictionary.
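The combination into one script mentioned earlier might look roughly like the following. It is a sketch, not the scripts above: it replaces the empty-FS and fixed-width tricks with a longest-match scan using substr(), which should also cope with variable-length codes such as &#xcd; and copies unmapped characters through unchanged. The function name convert, the mode values enc and dec, and the tab-separated dict.txt are assumptions chosen for illustration; the sample entries are a subset of dict_4.

```shell
work=$(mktemp -d)
# Tab-separated "code<TAB>glyph" pairs, so that a space can itself be a glyph.
printf '%s\t%s\n' '\u002C' ',' '\u0041' 'A' '\u0044' 'D' '\u00CD' 'Í' \
    '\u004C' 'L' '\u0050' 'P' '\u0055' 'U' '\u0020' ' ' > "$work/dict.txt"

# convert MODE DICT: MODE is "enc" (glyph to code) or "dec" (code to glyph);
# the text to convert is read from standard input.
convert() {
    awk -F '\t' -v mode="$1" '
    NR == FNR {                                   # first file: build the lookup table
        if (mode == "dec") { from = $1; to = $2 } else { from = $2; to = $1 }
        map[from] = to
        if (length(from) > max) max = length(from)
        next
    }
    {
        out = ""
        for (pos = 1; pos <= length($0); pos += length(key)) {
            # try the longest possible key first, then shorter ones
            for (len = max; len > 1; len--)
                if (substr($0, pos, len) in map) break
            key = substr($0, pos, len)
            out = out ((key in map) ? map[key] : key)   # unmapped text passes through
        }
        print out
    }' "$2" -
}

encoded=$(printf '%s\n' 'DÍA, LUPA' | convert enc "$work/dict.txt")
decoded=$(printf '%s\n' "$encoded" | convert dec "$work/dict.txt")
printf '%s\n%s\n' "$encoded" "$decoded"
```

With this dictionary the two printed lines are \u0044\u00CD\u0041\u002C\u0020\u004C\u0055\u0050\u0041 and DÍA, LUPA. Because the scan compares substrings rather than splitting fields, it does not depend on how a particular awk treats an empty FS.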