Converting Text to HTML In D


I'm trying to figure out the best way of encoding text (either 8-bit ubyte[] or string) to its HTML counterpart.

My proposal so far is to use a lookup table to map the 8-bit characters that have special meaning in HTML

string[256] lutLatin1ToHTML;
lutLatin1ToHTML[0x22] = "&quot;";
lutLatin1ToHTML[0x26] = "&amp;";
...

using the function

pure string toHTML(in string src,
                   ref in string[256] lut) {
    return src.map!(a => (lut[a] ? lut[a] : new string(a))).reduce!((a, b) => a ~ b) ;
}

It almost works, except that I don't know how to create a string from a ubyte (the no-translation case).

I tried

writeln(new string('a'));

but it prints garbage and I don't know why.

For more details on HTML encoding see https://en.wikipedia.org/wiki/Character_entity_reference

There are 2 best solutions below

Best answer

You can make a string from a ubyte most easily by doing "" ~ b, for example:

ubyte b = 65;
string a = "" ~ b;
writeln(a); // prints A
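Applied to the lookup-table idea from the question, this could look roughly like the sketch below. It is only one possible shape, not a tested library routine: the helper name makeHtmlTable, the toHTML signature, and the use of std.array.appender are my assumptions, and the input is assumed to be Latin-1 bytes. Bytes without a table entry are appended as dchar, so values of 0x80 and above come out as valid UTF-8 rather than as raw bytes.

```d
import std.array : appender;
import std.stdio : writeln;

// Hypothetical helper (not from the question): fills in the entries
// for the characters that have special meaning in HTML.
string[256] makeHtmlTable()
{
    string[256] lut;      // all entries start out empty (null)
    lut[0x22] = "&quot;"; // "
    lut[0x26] = "&amp;";  // &
    lut[0x3C] = "&lt;";   // <
    lut[0x3E] = "&gt;";   // >
    return lut;
}

// Assumed signature: Latin-1 bytes in, UTF-8 HTML text out.
string toHTML(const(ubyte)[] src, ref const string[256] lut)
{
    auto result = appender!string();
    foreach (b; src)
    {
        string entity = lut[b];
        if (entity.length != 0)
            result.put(entity);        // character with special meaning
        else
            result.put(cast(dchar) b); // Latin-1 value == Unicode code point
    }
    return result.data;
}

void main()
{
    auto lut = makeHtmlTable();
    // "café < &" encoded as Latin-1
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x20, 0x3C, 0x20, 0x26];
    assert(toHTML(latin1, lut) == "caf\u00E9 &lt; &amp;");
    writeln(toHTML(latin1, lut));
}
```

An appender avoids allocating a new string for every character, which the map/reduce version with ~ would do.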

BTW, if you want to do a lot of html stuff, my dom.d and characterencodings.d might be useful: https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

It has an HTML parser, DOM manipulation functions similar to JavaScript (e.g. ele.querySelector(), getElementById, ele.innerHTML, ele.innerText, etc.), conversion from a few different character encodings, including Latin-1, and it outputs ASCII-safe HTML with all special and Unicode characters properly encoded.

assert(htmlEntitiesEncode("foo < bar") == "foo &lt; bar");

stuff like that.

Another answer

In this case Adam's solution works just fine, of course. (It takes advantage of the fact that ubyte is implicitly convertible to char, which is then appended to the immutable(char)[] array for which string is an alias.)
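The same implicit conversion is, as far as I can tell, why new string('a') in the question prints garbage: new string(n) allocates an array of n characters, so the 'a' is taken as the length 97, and every element is set to char.init (0xFF), which is not valid UTF-8. A small sketch to check this, assuming a reasonably recent compiler:

```d
import std.stdio : writeln;

void main()
{
    ubyte b = 65;
    string s = "" ~ b;         // b is converted to char and appended
    assert(s == "A");

    auto t = new string('a');  // 'a' is used as the length, not the content
    assert(t.length == 97);
    assert(t[0] == char.init); // 0xFF, hence the garbage on output

    writeln(s);
}
```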

In general the safe way of converting types is to use std.conv.

import std.stdio, std.conv;

void main() {
    // utf-8
    char cc = 'a';
    string s1 = text(cc);
    string s2 = to!string(cc);
    writefln("%c %s %s", cc, s1, s2);

    // utf-16
    wchar wc = 'a';
    wstring s3 = wtext(wc);
    wstring s4 = to!wstring(wc);
    writefln("%c %s %s", wc, s3, s4);    

    // utf-32
    dchar dc = 'a';
    dstring s5 = dtext(dc);
    dstring s6 = to!dstring(dc); 
    writefln("%c %s %s", dc, s5, s6);

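    ubyte b = 65;
    string a = to!string(b);            // numeric conversion: a == "65"
    string c = to!string(cast(char) b); // character conversion: c == "A"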
} 

NB. text() is actually intended for processing multiple arguments, but is conveniently short.
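A short illustration of the multiple-argument use, with made-up values: each argument is converted as if by to!string and the results are concatenated.

```d
import std.conv : text;
import std.stdio : writeln;

void main()
{
    // a string, an int, a char and a bool joined into one string
    string line = text("row ", 3, ':', true);
    assert(line == "row 3:true");
    writeln(line);
}
```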