I recall seeing a couple of string-intensive programs that do a lot of string comparison but relatively little string manipulation, and that use a separate table to map strings to identifiers for efficient equality checks and a lower memory footprint, e.g.:
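    import java.util.HashMap;
    import java.util.Map;

    public class Name {
        // Any Map implementation works here; HashMap stands in for SomeMap.
        public static Map<String, Name> names = new HashMap<String, Name>();

        // Returns the canonical Name for s, creating and caching it on first use.
        public static Name from(String s) {
            Name n = names.get(s);
            if (n == null) {
                n = new Name(s);
                names.put(s, n);
            }
            return n;
        }

        private final String str;
        private Name(String str) { this.str = str; }
        @Override public String toString() { return str; }
        // equals() and hashCode() are not overridden!
    }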
I'm pretty sure one of these programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (and I think it also implemented CharSequence), but you get the idea - the entire program was littered with Name in any location where you would expect String, and on the rare occasions where string manipulation was needed, it converted to strings and then cached them again, conceptually like:
Name newName = Name.from(name.toString().substring(5));
I think I understand the point of this - especially when there are a lot of identical strings all around and a lot of comparisons - but couldn't the same be achieved by just using regular strings and interning them? The documentation for String.intern() explicitly says:
...
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.
It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...
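As a quick check of the quoted guarantee, here is a minimal sketch. The class and method names (InternDemo, build, sameAfterIntern) are made up for illustration; only String.intern() and StringBuilder come from the JDK.

```java
class InternDemo {
    // Builds the string at run time, so the result is a fresh object
    // rather than a compile-time constant taken from the pool.
    static String build(String a, String b) {
        return new StringBuilder(a).append(b).toString();
    }

    // Identity comparison of the pooled representatives of s and t.
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String s = build("fo", "o");
        String t = build("f", "oo");
        System.out.println(s == t);                // false: two distinct objects
        System.out.println(s.equals(t));           // true: same characters
        System.out.println(sameAfterIntern(s, t)); // true: same pooled instance
    }
}
```

Note that == is only reliable here because both sides were interned; comparing s == t directly still fails.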
So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?
What I've thought about so far was:
- Manually managing the map means using the regular heap, whereas intern() uses the permgen (on JVMs before Java 7; later versions keep interned strings in the regular heap).
- When manually managing the map you enjoy type-checking that can verify something is a Name, while an interned string and a non-interned string share the same type, so it's possible to forget interning in some places.
- Relying on intern() means reusing an existing, optimized, tried-and-tested mechanism without coding any extra classes.
- Manually managing the map results in code that is more confusing to new users, and string operations become more cumbersome.
... but I feel like I'm missing something else here.
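The type-checking point in the list above can be sketched roughly as follows. This is an illustration only: TypeCheckDemo, CLASS_KW, CLASS_NAME and both isClassKeyword variants are hypothetical names, and the Name class is a trimmed-down, non-thread-safe copy of the one in the question.

```java
import java.util.HashMap;
import java.util.Map;

final class Name {
    private static final Map<String, Name> NAMES = new HashMap<>();

    // Returns the single Name instance for s (not thread-safe in this sketch).
    static Name from(String s) { return NAMES.computeIfAbsent(s, Name::new); }

    private final String str;
    private Name(String str) { this.str = str; }
    @Override public String toString() { return str; }
}

class TypeCheckDemo {
    static final String CLASS_KW = "class".intern();
    static final Name CLASS_NAME = Name.from("class");

    // Relies on the caller having interned s; the compiler cannot verify that.
    static boolean isClassKeywordInterned(String s) { return s == CLASS_KW; }

    // A raw String cannot be passed here, so "forgot to canonicalize"
    // becomes a compile-time error instead of a silent false.
    static boolean isClassKeyword(Name n) { return n == CLASS_NAME; }

    public static void main(String[] args) {
        String raw = new StringBuilder("cla").append("ss").toString();
        System.out.println(isClassKeywordInterned(raw));          // false: intern() was forgotten
        System.out.println(isClassKeywordInterned(raw.intern())); // true
        System.out.println(isClassKeyword(Name.from(raw)));       // true
    }
}
```

With plain strings the first call compiles and silently returns the wrong answer; with Name, the equivalent mistake (passing raw) does not compile.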
Type checking is a major concern, but invariant preservation is also a significant concern.
Adding a simple check to the Name constructor can ensure* that there exist no Name instances corresponding to invalid names like "12#blue,," which means that methods that take Names as arguments and that consume Names returned by other methods don't need to worry about where invalid Names might creep in.
To generalize this argument, imagine your code is a castle with walls designed to protect it from invalid inputs. You want some inputs to get through so you install gates with guards that check inputs as they come through. The Name constructor is an example of a guard.
The difference between String and Name is that Strings can't be guarded against. Any piece of code, malicious or naive, inside or outside the perimeter, can create any string value. Buggy String manipulation code is analogous to a zombie outbreak inside the castle. The guards can't protect the invariants because the zombies don't need to get past them. The zombies just spread and corrupt data as they go.
That a value "is a" String satisfies fewer useful invariants than that a value "is a" Name.
See stringly typed for another way to look at the same topic.
* - usual caveat re deserializing of Serializable allowing bypass of the constructor.
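A constructor acting as the "guard" described in this answer might look like the sketch below. The class name ValidatedName and the validity rule (Java identifier syntax, ignoring reserved words) are assumptions chosen for illustration; a real program would apply whatever rule its names must satisfy.

```java
import java.util.HashMap;
import java.util.Map;

final class ValidatedName {
    private static final Map<String, ValidatedName> NAMES = new HashMap<>();

    // The only way to obtain an instance; synchronized so the cache stays consistent.
    static synchronized ValidatedName from(String s) {
        ValidatedName n = NAMES.get(s);
        if (n == null) {
            n = new ValidatedName(s); // throws before anything is cached
            NAMES.put(s, n);
        }
        return n;
    }

    private final String str;

    // The guard: no instance can exist for a string that fails the check.
    private ValidatedName(String str) {
        if (!isValid(str)) {
            throw new IllegalArgumentException("invalid name: " + str);
        }
        this.str = str;
    }

    // Assumed rule for this sketch: non-empty and shaped like a Java identifier.
    static boolean isValid(String s) {
        if (s == null || s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    @Override public String toString() { return str; }
}
```

Under this rule, ValidatedName.from("blue") returns the cached instance, while ValidatedName.from("12#blue,,") throws IllegalArgumentException, so code that receives a ValidatedName never has to re-check it.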