gsub for substituting translations not working


I have a dictionary file, dict, with records separated by ":" and fields separated by newlines, for example:

:one
1
:two
2
:three
3
:four
4

Now I want awk to replace every occurrence of each record's key in the input file with its value, e.g.

onetwotwotwoone
two
threetwoone
four

My first awk script looked like this and works just fine:

BEGIN { RS = ":" ; FS = "\n"}
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep)
gsub(key,rep[key])
print
}

giving me:

12221
2
321
4

Unfortunately, another dict file contains characters that are special in regular expressions, so I have to escape them in my script. After copying key and rep[key] into variables (so that the key can be escaped), the script only substitutes the second record in the dict. Why? And how can I solve this?

Here's the current second part of the script:

{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}

All scripts are run with: awk -f translate.awk dict input

Thanks in advance!

There are 2 best solutions below

Ed Morton On BEST ANSWER

Your fundamental problem is using strings in regexp and backreference contexts when you don't want them and then trying to escape the metacharacters in your strings to disable the characters that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.

You don't want this:

gsub(regexp,backreference-enabled-string)

You want something more like this:

index(...,string) substr(string)
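
To make the difference concrete, here is a minimal sketch using a made-up key "a.b" and a made-up replacement "x&y" (neither comes from the question's dict). In gsub() the "." is read as "any character" and the "&" as "the matched text", while index() and substr() treat both as plain characters:

```shell
# Hypothetical data for illustration: key "a.b", replacement "x&y".

# Regexp context: "." matches any character, "&" expands to the match.
regexp_out=$(printf 'a.b aXb\n' | awk '{ gsub("a.b", "x&y"); print }')

# String context: both key and replacement are taken literally.
string_out=$(printf 'a.b aXb\n' | awk '{
    key = "a.b"; new = "x&y"
    head = ""; tail = $0
    while ( start = index(tail, key) ) {
        head = head substr(tail, 1, start-1) new
        tail = substr(tail, start + length(key))
    }
    print head tail
}')

printf '%s\n' "$regexp_out"   # xa.by xaXby
printf '%s\n' "$string_out"   # x&y aXb
```

The first result shows both problems at once: "aXb" was matched although it does not contain the key, and the "&" was expanded instead of being printed.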

I think this is what you're trying to do:

$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
    if ( NR%2 ) {
        key = $2
    }
    else {
        rep[key] = $0
    }
    next
}
{
    for ( key in rep ) {
        head = ""
        tail = $0
        while ( start = index(tail,key) ) {
            head = head substr(tail,1,start-1) rep[key]
            tail = substr(tail,start+length(key))
        }
        $0 = head tail
    }
    print
}

$ awk -f tst.awk dict file
12221
2
321
4
Moeder On

Never mind my asking... Just some missing braces around the loop body?!

{
for (key in rep)
{
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
}
print
}

works like a charm.
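
A minimal sketch of why the braces matter, assuming a made-up two-entry array and a counter n that are not part of the original script: without braces, an awk for loop governs only the single statement that follows it. In the earlier version that was just orig=key, so the remaining statements ran once, after the loop, with whichever key the loop happened to visit last.

```shell
# Hypothetical two-entry array; n counts how often the second statement runs.
no_braces=$(awk 'BEGIN {
    rep["one"] = 1; rep["two"] = 2
    for (key in rep)
        orig = key      # only this statement is the loop body
    n++                 # runs once, after the loop has finished
    print n
}')

with_braces=$(awk 'BEGIN {
    rep["one"] = 1; rep["two"] = 2
    for (key in rep) {
        orig = key
        n++             # now inside the loop body: runs once per key
    }
    print n
}')

printf '%s %s\n' "$no_braces" "$with_braces"   # 1 2
```

Note that awk does not define the order in which for (key in rep) visits the keys, so which record ended up being substituted in the unbraced version can vary between awk implementations.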