US-ASCII string (de-)compression into/from a byte array (7 bits/character)

As we all know, ASCII uses 7 bits to encode characters, so the number of bytes needed to represent a text should always be less than the number of letters in it.

For example:

    StringBuilder text = new StringBuilder();
    IntStream.range(0, 160).forEach(x -> text.append("a")); // generate a 160-character text
    int letters = text.length();
    int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
    System.out.println(letters); // expected  160,  actual 160
    System.out.println(bytes); //   expected  140,  actual 160

The letters always equal the bytes, but I expected letters > bytes.

The main problem: in the SMPP protocol, an SMS body must be <= 140 bytes. With 7-bit ASCII encoding, 140 bytes could hold 160 letters (140 * 8 / 7), so I would like the text to be encoded as 7-bit packed ASCII. We are using the JSMPP library.

Can anyone explain this to me and point me in the right direction? Thanks in advance (:
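The arithmetic behind that 160-character expectation is simple bit counting; a minimal sketch (class and method names are illustrative):

```java
public class SmsCapacity {
  // n octets hold n * 8 bits; at 7 bits per character that is
  // floor(n * 8 / 7) characters
  static int maxChars(int bytes) {
    return bytes * 8 / 7;
  }

  public static void main(String[] args) {
    System.out.println(maxChars(140)); // prints "160"
  }
}
```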


There are 4 answers below.

kriegaex (BEST ANSWER)

Here is a quick & dirty solution without any libraries, i.e. using only JRE on-board means. It is not optimised for efficiency and does not check whether the message really is US-ASCII; it just assumes it is. It is just a proof of concept:

package de.scrum_master.stackoverflow;

import java.util.Arrays;
import java.util.BitSet;

public class ASCIIConverter {
  public byte[] compress(String message) {
    BitSet bits = new BitSet(message.length() * 7);
    int currentBit = 0;
    for (char character : message.toCharArray()) {
      // copy the 7 low bits of each character into the bit stream, LSB first
      for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
        if ((character & 1 << bitInCharacter) > 0)
          bits.set(currentBit);
        currentBit++;
      }
    }
    // BitSet.toByteArray() drops trailing all-zero bytes, so pad the array
    // to its full expected length, or trailing characters may get lost
    return Arrays.copyOf(bits.toByteArray(), (message.length() * 7 + 7) / 8);
  }

  public String decompress(byte[] compressedMessage) {
    BitSet bits = BitSet.valueOf(compressedMessage);
    // 8 * n - n % 7 == 7 * floor(8 * n / 7), i.e. the number of whole
    // 7-bit groups that fit into n bytes
    int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
    StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
    for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
      // the slice is empty if all 7 bits are zero (a NUL character)
      byte[] slice = bits.get(currentBit, currentBit + 7).toByteArray();
      decompressedMessage.append(slice.length == 0 ? '\0' : (char) slice[0]);
    }
    // a trailing NUL is an artefact of the byte padding in compress()
    int last = decompressedMessage.length() - 1;
    if (last >= 0 && decompressedMessage.charAt(last) == '\0')
      decompressedMessage.deleteCharAt(last);
    return decompressedMessage.toString();
  }

  public static void main(String[] args) {
    String[] messages = {
      "Hello world!",
      "This is my message.\n\tAnd this is indented!",
      " !\"#$%&'()*+,-./0123456789:;<=>?\n"
        + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
        + "`abcdefghijklmnopqrstuvwxyz{|}~",
      "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
    };

    ASCIIConverter asciiConverter = new ASCIIConverter();
    for (String message : messages) {
      System.out.println(message);
      System.out.println("--------------------------------");
      byte[] compressedMessage = asciiConverter.compress(message);
      System.out.println("Number of ASCII characters = " + message.length());
      System.out.println("Number of compressed bytes = " + compressedMessage.length);
      System.out.println("--------------------------------");
      System.out.println(asciiConverter.decompress(compressedMessage));
      System.out.println("\n");
    }
  }
}

The console log looks like this:

Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!


This is my message.
    And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
    And this is indented!


 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~


1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
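One edge case worth knowing about: `BitSet.toByteArray()` drops trailing all-zero bytes, so a message whose last packed byte happens to be all zeros comes back one byte (and hence one character) short unless the caller pads the array. A minimal self-contained demonstration, using the same LSB-first 7-bit packing as above (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.BitSet;

public class TrailingZeroDemo {
  // Pack the low 7 bits of each character, LSB first, and return
  // BitSet.toByteArray() as-is, i.e. without any padding
  static byte[] rawPacked(String message) {
    BitSet bits = new BitSet(message.length() * 7);
    int pos = 0;
    for (char c : message.toCharArray())
      for (int b = 0; b < 7; b++, pos++)
        if ((c >> b & 1) != 0) bits.set(pos);
    return bits.toByteArray();
  }

  public static void main(String[] args) {
    byte[] raw = rawPacked("1234567"); // 7 chars -> 49 bits -> 7 bytes expected
    System.out.println(raw.length);    // prints "6" -- the all-zero last byte was dropped
    // padding to the expected length restores it
    byte[] padded = Arrays.copyOf(raw, (7 * 7 + 7) / 8);
    System.out.println(padded.length); // prints "7"
  }
}
```

The dropped byte happens here because `'7'` (0x37) has bit 6 clear, so the highest set bit of the packed stream falls into the sixth byte and the seventh is all zeros.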
Rowi

The byte length differs depending on the character encoding. Check the example below:

String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"

byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length); 
// prints "10"

byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22" (2-byte BOM + 2 bytes per character)

byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"
kry

(160*8 - 160*7)/8 = 20, so you would expect 20 fewer bytes by the end of your script. However, the byte is the smallest addressable unit of memory: even if a value does not use all 8 of its bits, the unused bits cannot be shared with another value. So each ASCII code still occupies a full 8-bit byte, which is why you get the same number. For example, the lowercase "a" is 97 in ASCII:

01100001

Note that the leading zero is still there, even though it is not used. You can't just use it to store part of another value.

The conclusion: in pure ASCII, letters must always equal bytes.

(Or imagine putting size-7 objects into size-8 boxes. You can't chop the objects into pieces, so the number of boxes must equal the number of objects - at least in this case.)
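The 8-bit boxes are a property of how `getBytes` stores characters, though; nothing stops you from doing the bit shifting yourself and sharing bytes between adjacent characters. A minimal sketch of such manual packing (class and method names are illustrative):

```java
public class SevenBitPack {
  // Pack each character's low 7 bits consecutively into bytes, LSB first,
  // so 8 characters share 7 bytes instead of occupying 8
  static byte[] pack(String s) {
    byte[] out = new byte[(s.length() * 7 + 7) / 8]; // round up to whole bytes
    int bitPos = 0;
    for (char c : s.toCharArray())
      for (int b = 0; b < 7; b++, bitPos++)
        if ((c >> b & 1) != 0)
          out[bitPos / 8] |= 1 << (bitPos % 8);
    return out;
  }

  public static void main(String[] args) {
    String msg = "1234567890";
    System.out.println(msg.length());     // prints "10" (characters)
    System.out.println(pack(msg).length); // prints "9" (70 bits rounded up to bytes)
  }
}
```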

Tom Blodget

Nope. In "modern" environments (since 3 or 4 decades ago), the ASCII character encoding for the ASCII character set uses 8 bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.

You could say there was, used long ago, a 7-bit character encoding for the ASCII character set. Even then data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems, would zero it out in an attempt to increase interoperability but in the end hindered interoperability by not being "8-bit safe". With strong Internet standards, such systems are almost all in the past.