I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a function that will truncate my text at exactly 8 KB.
Zemanta should do this truncation on its own (i.e., if you send it a larger string), but I need to shuttle this text around a bit before making the API call, so I want to keep the payload as small as possible.
Is it safe to assume that 8 KB of text is 8,192 characters, and to truncate accordingly? (1 byte per character; 1,024 characters per KB; 8 KB = 8,192 bytes/characters) Or, is that inaccurate or only true given certain circumstances?
Is there a more elegant way to truncate a string based on its actual file size?
If you are using a single-byte encoding, yes, 8192 characters=8192 bytes. If you are using UTF-16, 8192 characters(*)=4096 bytes.
(Actually 8192 code-points, which is a slightly different thing in the face of surrogates, but let's not worry about that because JavaScript doesn't.)
If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:
Now you can truncate with:
(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.