I need to create an awk script to convert a glyph (https://en.wikipedia.org/wiki/Glyph) to Unicode (JavaScript syntax), and the reverse: Unicode to a glyph.
The source data is stored in Notepad++ with UTF-8 encoding.
Here's my progress.
Use_case_1
Dictionary file (dict_1_.txt):
A \u0041
À \u00C0
Input file (input_1_.txt):
A
À
awk script for generating the Unicode for the equivalent glyph:
awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_1_.txt input_1_.txt
correctly producing:
\u0041
\u00C0
Use_case_2
Dictionary file (dict_2_.txt)
\u0041 A
\u00C0 À
Input file (input_2_.txt)
\u0041
\u00C0
awk script for generating the glyph for the equivalent Unicode:
awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_2_.txt input_2_.txt
correctly producing:
A
À
So I can successfully "round-trip" a single symbol.
But how do I deal with a more comprehensive dictionary and more than one word per row?
Here is sample data.
Input file (input_3_.txt)
PUDÍN, ALMIDÓN
Dictionary file (dict_3_.txt)
, \u002C
A \u0041
D \u0044
I \u0049
Í \u00CD
L \u004C
M \u004D
N \u004E
Ó \u00D3
P \u0050
U \u0055
<space> \u0020
The awk script should generate:
\u0050\u0055\u0044\u00CD\u004E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u004E
Input file (input_4_.txt)
\u0050\u0055\u0044\u00CD\u004E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u004E
Dictionary file (dict_4_.txt)
\u002C ,
\u0041 A
\u0044 D
\u0049 I
\u00CD Í
\u004C L
\u004D M
\u004E N
\u00D3 Ó
\u0050 P
\u0055 U
\u0020 <space>
The awk script should generate:
PUDÍN, ALMIDÓN
Here is a more complicated set of input strings (one per row):
MONO Y DIACETIL ÉSTERES DEL ÁCIDO TARTÁRICO DE MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS AÑADIDOS
043 HUEVAS DE PESCADO (INCLUYENDO ESPERMA=HUEVAS BLANDAS) Y VÍSCERAS COMESTIBLES DE PESCADO
ACEITE DE SOJA OXIDADO TÉRMICAMENTE Y EN INTERACCIÓN CON MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS
BANDEJA PLÁSTICA O CAZUELA, CUBIERTA DE PAPEL DE ALUMINIO O ENVOLTURA
In the Dictionary examples above, I have used <space> to indicate the 'symbol' between words and after a comma. This probably means that a solution should use \t for FS in both the Dictionary file and the Input file. Currently the FS is a keyboard 'space'. Also, the RS is \n.
Further, I need to do the same for hexadecimal, so a solution needs to process a Dictionary file like this:
Í &#xCD;
Ó &#xD3;
as compared to the Dictionary example above:
Í \u00CD
Ó \u00D3
How can I improve or replace my simple awk scripts with scripts that process the longer strings on multiple lines?
Here is one approach. Note that you don't need two different versions of the dictionary.
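A minimal sketch of the glyph-to-code direction, assuming a dictionary in the dict_4 layout (code first, glyph second) that is inverted while it is loaded. The function name `encode` is made up for illustration, N is mapped to \u004E (its actual code point), and `<space>` is treated as a placeholder for a literal blank. The scan tries the longest dictionary key first, so it should not depend on whether a given awk counts characters or bytes.

```shell
# Sample files; the names follow the question, the contents are a sketch.
cat > dict_4_.txt <<'EOF'
\u002C ,
\u0041 A
\u0044 D
\u0049 I
\u00CD Í
\u004C L
\u004D M
\u004E N
\u00D3 Ó
\u0050 P
\u0055 U
\u0020 <space>
EOF

printf '%s\n' 'PUDÍN, ALMIDÓN' > input_3_.txt

encode() {   # usage: encode DICT INPUT
  awk '
    NR == FNR {                          # first file: the dictionary
      g = ($2 == "<space>") ? " " : $2   # placeholder -> literal blank
      code[g] = $1                       # invert: glyph -> code
      if (length(g) > max) max = length(g)
      next
    }
    {                                    # second file: text to encode
      out = ""
      i = 1
      while (i <= length($0)) {
        hit = 0
        for (n = max; n >= 1; n--) {     # longest key first
          k = substr($0, i, n)
          if (k in code) { out = out code[k]; i += length(k); hit = 1; break }
        }
        if (!hit) { out = out substr($0, i, 1); i++ }   # pass unknowns through
      }
      print out
    }
  ' "$1" "$2"
}

encode dict_4_.txt input_3_.txt
# prints \u0050\u0055\u0044\u00CD\u004E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u004E
```

Anything that is not in the dictionary is copied through unchanged, so gaps in the dictionary show up in the output instead of silently disappearing.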
With little effort these two can be combined into one script, and the from/to conversion can be controlled with a parameter. I intentionally kept the dictionary part the same.
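One way the combined version could look, as a sketch only: a single awk program whose direction is picked with `-v mode=enc` or `-v mode=dec`. The variable name `mode`, the wrapper function `convert`, and the shortened sample dictionary are assumptions for illustration. The dictionary stays in the code-then-glyph layout for both directions.

```shell
# Shortened sample dictionary (code first, glyph second).
cat > dict_4_.txt <<'EOF'
\u002C ,
\u0041 A
\u0044 D
\u00CD Í
\u0020 <space>
EOF

convert() {   # usage: convert enc|dec DICT INPUT   (INPUT may be - for stdin)
  awk -v mode="$1" '
    NR == FNR {                          # first file: the dictionary
      g = ($2 == "<space>") ? " " : $2   # placeholder -> literal blank
      if (mode == "enc") { k = g; v = $1 } else { k = $1; v = g }
      map[k] = v                         # direction decides key and value
      if (length(k) > max) max = length(k)
      next
    }
    {                                    # second file: text to convert
      out = ""
      i = 1
      while (i <= length($0)) {
        hit = 0
        for (n = max; n >= 1; n--) {     # longest key first
          k = substr($0, i, n)
          if (k in map) { out = out map[k]; i += length(k); hit = 1; break }
        }
        if (!hit) { out = out substr($0, i, 1); i++ }   # pass unknowns through
      }
      print out
    }
  ' "$2" "$3"
}

printf '%s\n' 'DÍA, DÍA' | convert enc dict_4_.txt -
# prints \u0044\u00CD\u0041\u002C\u0020\u0044\u00CD\u0041

printf '%s\n' '\u0044\u00CD\u0041\u002C\u0020\u0044\u00CD\u0041' | convert dec dict_4_.txt -
# prints DÍA, DÍA
```

The same longest-match loop serves both directions because only the lookup table changes; the input text `DÍA, DÍA` is just a short illustrative string.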
Working with the encoded input now, using your dict_4 as the dictionary for both scripts.
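A sketch of the code-to-glyph direction under the same assumptions: the dictionary is in the dict_4 layout, `<space>` stands for a literal blank, N is mapped to \u004E, and the function name `decode` is hypothetical. Since every token has the fixed form of a backslash, a `u`, and four hex digits, the scan only needs to look at six characters at a time.

```shell
# Sample files; the names follow the question, the contents are a sketch.
cat > dict_4_.txt <<'EOF'
\u002C ,
\u0041 A
\u0044 D
\u0049 I
\u00CD Í
\u004C L
\u004D M
\u004E N
\u00D3 Ó
\u0050 P
\u0055 U
\u0020 <space>
EOF

printf '%s\n' '\u0050\u0055\u0044\u00CD\u004E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u004E' > input_4_.txt

decode() {   # usage: decode DICT INPUT
  awk '
    NR == FNR {                          # first file: the dictionary
      glyph[$1] = ($2 == "<space>") ? " " : $2
      next
    }
    {                                    # second file: encoded text
      out = ""
      i = 1
      while (i <= length($0)) {
        k = substr($0, i, 6)             # one token is six characters long
        if (k in glyph) { out = out glyph[k]; i += 6 }
        else            { out = out substr($0, i, 1); i++ }
      }
      print out
    }
  ' "$1" "$2"
}

decode dict_4_.txt input_4_.txt
# prints PUDÍN, ALMIDÓN
```

Tokens that are missing from the dictionary are left in the output as they are, which makes it easy to see what still needs a dictionary entry.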