I have a huge set of arbitrary natural language strings. For my tool to analyze them, I need to convert each string to a unique color value (RGB or other). The color contrast should depend on string similarity: the more two strings differ, the more their respective colors should differ. Ideally, the same string would always produce the same color value.
Any advice on how to approach this problem?
Update on distance between strings
I probably need "similarity" defined as a Levenshtein-like distance. No natural language parsing is required.
That is:
"I am going to the store" and
"We are going to the store"
Similar.
"I am going to the store" and
"I am going to the store today"
Similar as well (but slightly less).
"I am going to the store" and
"J bn hpjoh up uif tupsf"
Not similar at all.
(Thanks, Welbog!)
I will probably know exactly which distance function I need only once I see the program's output. So let's start with simpler things.
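To make the three example pairs above concrete, here is a minimal sketch of the plain Levenshtein distance (insertions, deletions, and substitutions each cost 1). The function name `levenshtein` is just a label chosen for this illustration, and a "Levenshtein-like" measure could weight the operations differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


base = "I am going to the store"
print(levenshtein(base, "We are going to the store"))      # 4
print(levenshtein(base, "I am going to the store today"))  # 6
print(levenshtein(base, "J bn hpjoh up uif tupsf"))        # much larger
```

The ordering matches the intuition in the examples: the first pair is closest, the second slightly less so, and the third is far away.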
Update on task simplification
I've removed my own suggestion to split the task into two parts: absolute distance calculation and color distribution. That would not work well, because we would first reduce the information to a single dimension and then try to expand it back up to three dimensions.
I would perhaps define some delta between two strings. I don't know what you count as the difference (or "inequality") of two strings, but the most obvious measures I can think of are string length and the number of occurrences of particular letters (and their positions in the string). It should not be tricky to implement this so that equal strings return the same color code: check for equality first and return before any further comparison.
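One possible reading of that delta, as a rough sketch: add the difference in length to the per-character differences in occurrence counts, and return early for equal strings. The name `letter_delta` and the equal weighting of the two terms are assumptions made for illustration, and character positions are ignored here, so two anagrams of each other would get a delta of 0.

```python
from collections import Counter


def letter_delta(a: str, b: str) -> int:
    """Crude difference score based on length and character counts only."""
    # Equal strings are handled first, as suggested above.
    if a == b:
        return 0
    counts_a, counts_b = Counter(a), Counter(b)
    # Sum of |count in a - count in b| over every character seen in either.
    count_diff = sum(
        abs(counts_a[ch] - counts_b[ch])
        for ch in counts_a.keys() | counts_b.keys()
    )
    return abs(len(a) - len(b)) + count_diff


base = "I am going to the store"
print(letter_delta(base, "We are going to the store"))      # 8
print(letter_delta(base, "I am going to the store today"))  # 12
print(letter_delta(base, "J bn hpjoh up uif tupsf"))        # 24
```

On the question's own examples this keeps the expected ordering, although it is a much weaker measure than an edit distance.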
As for the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use RGB. I don't know whether every string would fit into them (that may be language-specific).
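As a very rough sketch of squeezing a string into 3 bytes, the letters can be split into three groups and each channel set to the share of letters falling into its group. The grouping by `ord(ch) % 3` and the name `string_to_rgb` are arbitrary assumptions for illustration. The result is deterministic and a small edit shifts the color only slightly, but it cannot be unique: there are only 2^24 RGB values, so unrelated strings can land on the same or nearby colors.

```python
from collections import Counter
from typing import Tuple


def string_to_rgb(text: str) -> Tuple[int, int, int]:
    """Map a string to an RGB triple from its letter distribution."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return (0, 0, 0)  # no letters at all: fall back to black
    channels = [0, 0, 0]
    for ch, n in counts.items():
        # Arbitrary assumption: assign each letter to a channel by code point.
        channels[ord(ch) % 3] += n
    # Scale each group's share of the letters to the 0..255 range.
    r, g, b = (round(255 * c / total) for c in channels)
    return (r, g, b)


print(string_to_rgb("I am going to the store"))  # (85, 71, 99)
```

Because only proportions are used, the colors tend to cluster around mid-gray, so in practice the values would probably need stretching before the contrast becomes visible.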