We're looking to add languages and pseudo-languages to CLD2, mainly (but not only) to support Romanized forms such as Hinglish (Hindi written in Latin script) or translit (Cyrillic-script languages written in Latin script). (Yes, we know CLD3 supports these; it's not applicable in our case.)
It seems that we need to add a set of strings mapped to probabilities. The files cld_generated_cjk_delta_bi_32.cc and cld_generated_cjk_uni_prop_80.cc seem to contain some sort of mapping, but it's unclear what exactly they map.
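To make the question concrete, here is roughly the kind of "strings mapped to probabilities" table we have in mind. This is only a sketch of our mental model, not CLD2's actual table format: every name in it (ToyLang, ToyScores, ToyTable, ToyDetect) is hypothetical and the scores are made up.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical language ids for illustration; NOT CLD2's Language enum.
enum ToyLang { kHindiLatin = 0, kEnglish = 1, kNumToyLangs = 2 };

// One log-probability-like score per toy language.
using ToyScores = std::array<double, kNumToyLangs>;

// Hypothetical table: character bigram -> score per language.
// The values are invented; a real table would be trained from corpora.
inline const std::unordered_map<std::string, ToyScores>& ToyTable() {
  static const std::unordered_map<std::string, ToyScores> table = {
      {"aa", ToyScores{-1.0, -5.0}},
      {"ai", ToyScores{-1.5, -4.0}},
      {"th", ToyScores{-4.5, -1.0}},
      {"he", ToyScores{-4.0, -1.2}},
  };
  return table;
}

// Sums the per-language scores of every bigram found in the table and
// returns the language with the higher total. Bigrams missing from the
// table are skipped; ties (including empty input) fall back to kHindiLatin.
inline ToyLang ToyDetect(const std::string& text) {
  ToyScores total = {0.0, 0.0};
  for (std::size_t i = 0; i + 2 <= text.size(); ++i) {
    auto it = ToyTable().find(text.substr(i, 2));
    if (it == ToyTable().end()) continue;
    for (std::size_t lang = 0; lang < total.size(); ++lang) {
      total[lang] += it->second[lang];
    }
  }
  return total[kHindiLatin] >= total[kEnglish] ? kHindiLatin : kEnglish;
}
```

With this toy table, ToyDetect("kya haal hai") picks kHindiLatin and ToyDetect("the weather") picks kEnglish. What we don't know is whether the generated .cc files encode something equivalent to this, and if so in what layout.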
Any ideas?