Retained heap size of a string in Java


This is a question that we have had trouble understanding. It's tricky to describe it using text but I hope that the gist will be understood.

I understand that a string's actual content is held in an internal char array. Normally the retained heap size of the string is 40 bytes plus the size of the character array. This is explained here. When substring is called, the new string keeps a reference to the original string's character array, so the retained size of the character array can be a lot bigger than the string itself.

However, when profiling memory usage with YourKit or MAT, something strange seems to happen: the retained size of a string that references the char array does not include the retained size of the character array.

An example could be as follows (semi pseudo-code):

String date = "2011-11-33"; (24 bytes)
date.value = char[1172]; (2360 bytes)

The string's retained size is defined as 24 bytes without including the character array's retained size. This could make sense if there are a lot of references to the character array due to many substring operations.
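The 2360-byte figure can be checked with some rough arithmetic. The sketch below assumes a 64-bit HotSpot JVM with compressed references (16-byte array header, 8-byte alignment); those constants and the class name are assumptions for illustration, not values taken from the profiler.

```java
// Rough size estimate for a char[]; the constants are assumptions for a
// 64-bit HotSpot JVM with compressed references and will differ elsewhere.
class CharArraySizeEstimate {
    static final int ARRAY_HEADER_BYTES = 16; // assumed: 12-byte object header + 4-byte length
    static final int ALIGNMENT_BYTES = 8;     // assumed: objects are padded to 8-byte boundaries

    /** Estimated shallow size in bytes of a char[] with the given length. */
    static long charArrayBytes(int length) {
        long raw = ARRAY_HEADER_BYTES + 2L * length; // each char takes 2 bytes
        // Round up to the next multiple of the alignment.
        return (raw + ALIGNMENT_BYTES - 1) / ALIGNMENT_BYTES * ALIGNMENT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(charArrayBytes(1172)); // the shared array from the example
        System.out.println(charArrayBytes(10));   // an array holding only the 10 visible chars
    }
}
```

Under these assumptions a char[1172] comes to 2360 bytes, while an array holding only the ten characters of the date would need about 40, so almost all of the array's size is there only because the array is shared with a larger original string.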

Now when this string is included in some type of collection such as an array or list then the retained size of this array will include the retained size of all the strings including the character array's retained size.

We then have a situation like this:

Array's retained size = 300 bytes
array[0] = String 40 bytes;
array[1] = String 40 bytes;
array[1].value = char[] (220 bytes)

You therefore have to look into each array entry to try to work out where the retained size comes from.

Again this can be explained in that the array holds all the strings that hold references to the same character array and therefore altogether the array's retained size is correct.

Now we get to the problem.

I keep in a separate object a reference to the array that I discussed above as well as a different array with the same strings. In both arrays the strings refer to the same character array. This is expected - after all we are talking about the same string. However the retained size of this character array is counted for both arrays in this new object. In other words the retained size seems to be double. If I delete the first array then the second array will still hold a reference to the character array and vice versa. This causes confusion, because it looks as if Java is holding two separate copies of the same character array. How can this be? Is this a problem with Java's memory management or is it just the way that the profilers display information?

This problem caused a lot of headaches for us in trying to track down huge memory usage in our application.

Again - I hope that someone out there will be able to understand the question and explain it.

Thanks for your help


There are 4 best solutions below

4
On BEST ANSWER

I keep in a separate object a reference to the array that I discussed above as well as a different array with the same strings. In both arrays the strings refer to the same character array. This is expected - after all we are talking about the same string. However the retained size of this character array is counted for both arrays in this new object. In other words the retained size seems to be double.

What you have here is a transitive reference in a dominator tree:

enter image description here

The character array should not show up in the retained size of either array. If the profiler displays it that way, then that's misleading.
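To make the retained-size rule concrete, here is a minimal sketch that computes it on a toy object graph. The node names and shallow sizes are hypothetical, and the code uses a plain reachability difference rather than any profiler's actual algorithm: the retained size of an object is the total shallow size of everything that becomes unreachable once that object is removed.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RetainedSizeSketch {
    // Hypothetical object graph: each name maps to the objects it references.
    static final Map<String, List<String>> REFS = new HashMap<>();
    static final Map<String, Integer> SHALLOW = new HashMap<>();

    static void node(String name, int shallowBytes, String... refs) {
        SHALLOW.put(name, shallowBytes);
        REFS.put(name, Arrays.asList(refs));
    }

    static {
        node("holder", 24, "arrayA", "arrayB"); // the object that keeps both arrays
        node("arrayA", 24, "string");
        node("arrayB", 24, "string");           // same String instance as in arrayA
        node("string", 24, "chars");
        node("chars", 2360);
    }

    /** Objects reachable from root when every path through 'excluded' is cut. */
    static Set<String> reachable(String root, String excluded) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        if (!root.equals(excluded)) stack.push(root);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (!seen.add(current)) continue; // already visited
            for (String next : REFS.get(current)) {
                if (!next.equals(excluded)) stack.push(next);
            }
        }
        return seen;
    }

    /** Bytes that would be freed if 'target' were removed from the graph. */
    static int retainedSize(String root, String target) {
        Set<String> freed = reachable(root, null);
        freed.removeAll(reachable(root, target));
        int total = 0;
        for (String name : freed) total += SHALLOW.get(name);
        return total;
    }

    public static void main(String[] args) {
        System.out.println("arrayA: " + retainedSize("holder", "arrayA"));
        System.out.println("arrayB: " + retainedSize("holder", "arrayB"));
        System.out.println("string: " + retainedSize("holder", "string"));
        System.out.println("holder: " + retainedSize("holder", "holder"));
    }
}
```

In this toy graph each array retains only its own 24 bytes, because the string stays reachable through the other array. The string and its character array are counted once, under the holder, which is the first object that dominates both paths.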

This is how JProfiler shows this situation in the biggest objects view:

enter image description here

The string instance that is contained in both arrays is shown outside the array instances, with a [transitive reference] label. If you want to explore the actual paths, you can add the array holder and the string to the graph and find all paths between them:

enter image description here

Disclaimer: My company develops JProfiler.

5
On

Unless the strings are interned, they can be equals() but not ==. When constructing a String object from a char array, the constructor will make a copy of the char array. (This is the only way to shield the immutable String from later changes in the char array values.)
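Both points can be seen in a few lines. This is only an illustrative sketch; the class and method names are made up for the example.

```java
class StringCopyDemo {
    /** The String keeps its own copy, so changing the source array does not affect it. */
    static boolean survivesMutation() {
        char[] chars = {'2', '0', '1', '1'};
        String s = new String(chars);
        chars[0] = 'X'; // mutate the source array after construction
        return s.equals("2011");
    }

    /** Two Strings built from the same array are equal but are distinct objects. */
    static boolean equalButNotSame() {
        char[] chars = {'2', '0', '1', '1'};
        String first = new String(chars);
        String second = new String(chars);
        return first.equals(second) && first != second;
    }

    /** After interning, both references point at the same pooled instance. */
    static boolean sameAfterIntern() {
        char[] chars = {'2', '0', '1', '1'};
        return new String(chars).intern() == new String(chars).intern();
    }

    public static void main(String[] args) {
        System.out.println(survivesMutation());
        System.out.println(equalButNotSame());
        System.out.println(sameAfterIntern());
    }
}
```

If the strings in the two arrays were built separately like this, each would have its own character array and the profiler would be right to count both.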

4
On

I'd say it is just the way the profiler displays the information. It has no idea that the two arrays should be considered for "deduplication". How about you wrap the two arrays into some kind of dummy holder object, and run your profiler against that? Then, it should be able to take care of the "double-counting".

4
On

If you run the following with -XX:-UseTLAB:

public class SubstringMemory {
    public static void main(String... args) throws Exception {
        StringBuilder text = new StringBuilder();
        text.append(new char[1024]);
        long free1 = free();
        // toString() copies the builder's 1024 chars into a new String.
        String str = text.toString();
        long free2 = free();
        // Two substrings of the same String, held in one array.
        String[] array = { str.substring(0, 100), str.substring(101, 200) };
        long free3 = free();
        // With TLABs enabled, freeMemory() may not change after small allocations.
        if (free3 == free2)
            System.err.println("You must use -XX:-UseTLAB");
        System.out.println("To create String with 1024 chars "+(free1-free2)+" bytes\nand to create an array with two sub-string was "+(free2-free3));
    }

    private static long free() {
        return Runtime.getRuntime().freeMemory();
    }
}

prints

To create String with 1024 chars 2096 bytes
and to create an array with two sub-string was 88

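You can see it's consuming far less memory than it would if each substring had its own copy of the characters: 88 bytes covers the array and the two String objects, so the substrings must share the same backing store.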

If you look at the code in the String class:

public String substring(int beginIndex, int endIndex) {
    // checks.
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}

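You can see that substring doesn't take a copy of the underlying value array; the new String points at the same array with a different offset and count. (This is the implementation up to Java 7u5. From Java 7u6 onwards, substring copies the characters it needs.)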
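The sharing behaviour shown above can be modelled with a small stand-in class. SharedSlice below is hypothetical and only mirrors the value/offset/count layout from the quoted source; it is not the JDK's code. The compact() method shows the usual way out: copy just the characters you need so the large array can be collected.

```java
import java.util.Arrays;

// Hypothetical model of the old String layout: a view onto a shared char[].
class SharedSlice {
    final char[] value;
    final int offset;
    final int count;

    SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Like the old substring: a new view that shares the same backing array. */
    SharedSlice slice(int beginIndex, int endIndex) {
        return new SharedSlice(value, offset + beginIndex, endIndex - beginIndex);
    }

    /** Copies only the visible characters, dropping the reference to the big array. */
    SharedSlice compact() {
        return new SharedSlice(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedSlice big = new SharedSlice(new char[1024], 0, 1024);
        SharedSlice small = big.slice(0, 10);
        System.out.println(small.value == big.value);      // same backing array
        System.out.println(small.value.length);            // still the full array
        System.out.println(small.compact().value.length);  // only the visible chars
    }
}
```

On JVMs where substring shares the array, wrapping the result as new String(str.substring(...)) has a similar effect to compact() here.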


Another thing to consider is -XX:+UseCompressedStrings, which is available in some Java 6 update releases. It encourages the JVM to use byte[] instead of char[] where possible. (Java 9 and later do something similar by default with compact strings.)

The size of the headers for the String and array objects differs between 32-bit JVMs, 64-bit JVMs with 32-bit references and 64-bit JVMs with 64-bit references.