About awk and integer to ASCII character conversion


Just to make sure: is it really the case that with awk (GNU awk at least) I can convert:

from octal to ASCII by:

print "\101"         # or a="\101"
A

from hex to ASCII:

print "\x42"         # or b="\x42"
B

but from decimal to ASCII I have to:

printf "%c\n", 67    # or c=sprintf("%c", 67)
C

Is there no secret print "\?67" in the manual that I missed?

I'm trying to get character frequencies from $0="aabccc" like:

for(i=141; i<=143; i++) a=a gsub("\\"i, ""); print a
213

but using decimals (instead of octals as in the example above). The decimalistic approach seems awfully long:

$ cat foo
aabccc
$ awk '{for(i=97;i<=99;i++){c=sprintf("%c",i);a=a gsub(c,"")} print a}' foo
213

It got used here.


4
On BEST ANSWER

No. \nnn is octal and \xnn is hex, and that's all there is for including characters you cannot include as-is in strings. You should always use the octal, not the hex, representation for robustness (see, for example, http://awk.freeshell.org/PrintASingleQuote); in some awks a \x escape keeps consuming any hex digits that follow it.
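
A minimal sketch of the two routes mentioned here, with sample values chosen only for illustration: an octal escape reads at most three octal digits, so the text after it stays literal, while a decimal code has to go through sprintf("%c", ...).

```shell
# Octal escape: at most three octal digits are read, so "BC" stays literal.
awk 'BEGIN { print "\101BC" }'                    # prints ABC

# Decimal code: there is no escape for it, so format the number with %c.
awk 'BEGIN { c = sprintf("%c", 67); print c }'    # prints C
```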

I don't understand the last part of your question where you state what you're trying to do with this - provide concise, testable sample input and expected output and I'm sure someone can help you do it the right way, whatever it is.

Is this what you're trying to do?

$ awk 'BEGIN{for (i=0141; i<0143; i++) print i}'
97
98
1
On

A lookup table is the only way to address this (directly convert CHAR to ASCII DECIMAL) within "AWK only".

You can simply use sprintf() to convert ASCII DECIMAL to CHAR.

  • You can create a lookup table by iterating through each of the known ascii chars and storing them in an array where the key is the character and the value is the ascii value of that char.

  • You can use sprintf() within AWK to get the char for each decimal.

  • Then you can pass the char to the array to get the corresponding decimal again.

The following example uses only awk:

  • We cycle through all 256 characters, printing out each one.
  • We split the resulting string into a series of lines where each line has a single character.
  • We build a table of the 256 characters in awk (in BEGIN), then feed in each input character to look it up.
  • Finally, we print out the code for each character of the input.
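awk 'BEGIN{
    # emit every character code 0-255, one per line
    for(n=0;n<256;n++)
        print sprintf("%c",n)
}' | awk '{
    # split each line into single characters, one per line
    for (i=1; i<=length($0); i++)
        print substr($0, i, 1)
}' | awk 'BEGIN{
    # lookup table: character -> decimal code
    for(n=0;n<256;n++)
        ord[sprintf("%c",n)]=n
}{
    # use $0 rather than $1 so that space and tab are looked up too
    print ord[$0]
}'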

The reverse can also be done, where we look up a list of character codes.

awk 'BEGIN{
    for(n=0;n<256;n++)
        print n
}' | awk 'BEGIN{
    for(n=0;n<256;n++)
        char[n]=sprintf("%c",n)
}{
    print char[$1]
}'

Note: The second example may print out a lot of garbage in the high ascii range (128 and above) depending on the character set you are using.
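
For comparison, both lookups can live in a single awk process. This is a minimal sketch rather than the pipeline above: the shell function name codes and the array names ord and chr are made up for illustration, and the tables only cover 1-127 to stay clear of locale-dependent behaviour for higher values.

```shell
codes() {
    awk 'BEGIN {
        # Build both tables once: character -> code and code -> character.
        for (n = 1; n < 128; n++) {
            c = sprintf("%c", n)
            ord[c] = n
            chr[n] = c
        }
    }
    {
        # Print each character of the line with its decimal code.
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            printf "%s=%d%s", chr[ord[c]], ord[c], (i < length($0) ? " " : "\n")
        }
    }'
}

printf 'abc\n' | codes    # prints a=97 b=98 c=99
```

Building the tables in BEGIN means the cost is paid once, however many input lines follow.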

0
On

If, as you say at the end of your question, you're simply looking to count the frequency of characters, I'd just assemble an array.

$ awk '{for(i=1;i<=length($0);i++) a[substr($0,i,1)]++} END{for(i in a) printf "%d %s\n",a[i],i}' <<<$'aabccc\ndaae'
1 d
1 e
4 a
1 b
3 c

Note that this also supports multi-line input.

We're stepping through each line of input, incrementing a counter that is an array subscript keyed with the character in question.

I would expect this approach to be more performant than applying a regex to count the replacements for every interesting character, but I haven't done any speed comparison tests (and of course it would depend on how large a set you're interested in).
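
Because awk does not specify the order in which for (i in a) visits subscripts, the listing order can differ between implementations. Here is a hedged sketch of the same counting loop with the result piped through sort for a stable listing; the function name charfreq is made up for illustration.

```shell
charfreq() {
    awk '{
        # Count every character of every input line.
        for (i = 1; i <= length($0); i++)
            count[substr($0, i, 1)]++
    }
    END {
        # Character first, so that sort orders the output by character.
        for (ch in count)
            printf "%s %d\n", ch, count[ch]
    }' "$@" | sort
}

printf 'aabccc\ndaae\n' | charfreq
```

For the sample input this lists a 4, b 1, c 3, d 1 and e 1, one pair per line.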

While this answer doesn't address your initial question, I hope it'll provide a better way to approach the problem.

(Thanks for including the final details in your question. XY problems are all too frequent here.)

0
On

Note: The second example may print out a lot of garbage in the high ascii range (128 and above) depending on the character set you are using.

This can be circumvented by using octal codes \200 - \377 for 128-255.
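
As an illustration of writing bytes above 127 with octal escapes, here is a small sketch. The byte pair \303\251 is just the UTF-8 encoding of é, picked as sample data, and od is used only to show which bytes were emitted.

```shell
# Emit the two bytes 0xC3 0xA9 via octal escapes and dump them in hex.
awk 'BEGIN { printf "\303\251" }' | od -An -tx1
```

The dump should show the bytes c3 and a9; the exact spacing depends on the od implementation.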

IIRC, the bytes C0, C1, and F5 through FF shouldn't exist within properly encoded UTF-8 documents (or are not yet spec'ed to). FE and FF may overlap with the UTF-16 byte order mark, but that should hardly be a concern today since the world has largely standardized on UTF-8.

0
On

If you need to encode bytes as octal escapes in awk, here's a fully self-contained, recursive, cross-awk-compatible octal encoder that I came up with earlier:

  • verified on gawk, mawk-1, mawk-2, and nawk,
  • benchmarked throughput rate of 39.2 MByte/sec


 out9: 1.82GiB 0:00:47 [39.2MiB/s] [39.2MiB/s] [   <=>            ]
  in0:  466MiB 0:00:00 [1.78GiB/s] [1.78GiB/s] [>] 100%            

( pvE 0.1 in0 < "${m3l}" | mawk2x ; )  

 39.91s user 6.94s system 98% cpu 47.656 total
 1  
 2  78b4c27659ae66e4c98796a60043f1fe  stdin
 3  
 echo "${data}" | awk '{

       print octencode_v7($0)
 }
 function octencode_v7(______,_,__,___,____,_____,_______) {
    if ( ( (_+=_+=_^=_<_\
         )^_*(_+_)*(_^_)^(!(("\1"~"\0")+\
        index(-+log(_<_),"+") ) ) )<=(___=\
        (_^=_<_)<length("\333\222")\
               ? length(______) : match(______,"$")-_))  {
        return \
        octencode_v7(substr(______,_^=_<_,_=int(___/(_+_)))) \
        octencode_v7(substr(______,++_))
    }
    _______=___
        ___="\36\6\2\24"
    gsub(/\\/,___,______)
    _______-=gsub("["(!_)"-"(_+(_+=++_+_))"]", "\\"(!_)(_)"&",______)
         _--;
    if (!+(_______-=gsub(___, "\\"(_--^--_+_*_),______) \
                  - gsub("[[]","\\" ((_^!_)(_)_),______) \
                  - gsub(/\^/,  "\\" ((_^!_)(_)(_+_)),______))) {
        return ______
    }
    ____=___=_+=_^=_<_
    _____=(___^=++____)-_^(____=!_)
    do { ___=_____
    do {  __=_____
    if (+____ || (_____-___)!=_^(_<_)) {
        do { _=(____)(___)__
        if (+____!=_^(_<_) || ! index(___^___,_____)    ||
              +__!~"^["(_____^___+___)"]$") {
            _="\\"(_)
            _______-=gsub(((!+____ && +_____<(___+___)) ||
                         (+____==_^(_<_)                &&
                         ( +___==+_____                 || 
                         (___+____+___)==+_____)))       \
                               ? "["(_)"]" : (_), _,______)
    } } while(__--)
    } } while(___--)
          if (!_______) {
            return ______
    } } while((++____+____)<_____)
    return ______
}'

It's basically a triple-nested do-while loop that cycles through all the octal codes, without needing any previously built lookup strings or arrays.
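
For contrast, here is a much plainer sketch that does rely on a lookup array. It is not the function above and makes no claim to match its output or throughput: the name octencode_simple is made up for illustration, and it escapes every byte of each line as \ooo, which may differ from what octencode_v7 chooses to escape.

```shell
octencode_simple() {
    # LC_ALL=C asks locale-aware awks to treat the input as single bytes.
    LC_ALL=C awk 'BEGIN {
        # Precompute the escape for every byte value except NUL.
        for (n = 1; n < 256; n++)
            oct[sprintf("%c", n)] = sprintf("\\%03o", n)
    }
    {
        # Replace each byte of the line with its three-digit octal escape.
        out = ""
        for (i = 1; i <= length($0); i++)
            out = out oct[substr($0, i, 1)]
        print out
    }'
}

printf 'Az\n' | octencode_simple    # prints \101\172
```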