Don’t get tangled

The state of String in Java

AttilaBalazs
cat-string

Over the past year, how Strings are represented in Java has changed. Attila Balazs outlines what’s new, and shows you how you can perfect your code for it.

Over
the past year there have been a couple of changes in how Strings
are represented in Java. In this post, I will summarize these
changes, help you understand how it can influence the performance
of your code, and give you a couple of ideas for tuning your
code
.

 Before
Java 7 update 6 (1.7.0_06)

 Representation

Up until
Java 7u6 the String class had four instance fields:

  • value – this was a reference to a character
    array (char[]) containing the actual characters which make up the
    String (remember that char in Java is 16 bit/two bytes wide
    unsigned integer type).
  • offset – an integer specifying the index of the
    first character belonging to the string in the array. This was 0
    most of the time.
  • count – an integer specifying the number of
    characters belonging to the string (the length of the
    string)
  • hash – an integer representing the hash code of
    the String. This was being lazily populated and had some tricky
    logic applied to it (it wasn’t volatile but still was thread safe
    because the Java Memory Model guarantees the atomicity of 32 bit
    reads / writes. It also had an issue with strings that had hashCode
    zero – in this case it was being recalculated needlessly because 0
    was being used as a marker for “not yet calculated”)

This representation meant that creating a substring of an
existing string was a rather cheap operation (since the actual
characters weren’t being copied – just a new reference to the
existing backing array was created). In fact this was the reason
why this implementation was chosen – for Java to perform well in
situations where a lot of substrings of larger strings were created
(for example when parsing source code, CSV files or other text
files).
The danger of this representation is that you
can retain more memory than you actually need:

String str = get100MBFile();
str = str.substring(0, 10);

In
the above code we’re retaining 100MB of memory, even though we only
need ten characters. The solution is to force the copying of the
underlying char array by using the String(String)
constructor:

str = new String(str);

You
can pretty quickly diagnose this situation if you know what you’re
looking for. Also, some profilers have dedicated diagnostic tools
which detects such problems.

Deduplication
Another
issue which arises frequently in practice when working with strings
is Java are duplicate strings – distinct instances of String which
contain the same sequence of characters. While strings are
immutable and instances can be freely interchanged (or deduplicated
to reduce memory consumption), Java only does deduplication for
String constants.

This is wholly understandable since doing deduplication
for all strings would be very time consuming and it would slow down
the program without much benefit (in the general case).

Still, if we know that our dataset contains a lot of
repeating strings, we can reduce the memory footprint by
deduplicating our strings manually.

One way to do this is to call intern on the String
instance. 
This returns an identical string from
the internal string pool (if one was already present there) or adds
the current string to the internal string pool and returns it.
There were a couple of issues with intern however in the pre-Java 7
days which discouraged its use:

  • it was slow. Very slow.
  • the string pool was located in a special memory
    area (the PermGen – permanent generation – space) which wasn’t
    being affected by the “maximum memory” (-Xmx) command line switch
    and had to be tuned separately using the -XX:MaxPermSize=… switch
    (or risk running out of it).

In conclusion if you needed deduplication under Java
1.7.0_05 and earlier, it was best if you rolled your own solution
using WeakHashMap<String, WeakReference<String>>. Even
better, if you knew all the possible values you can have before
hand, you could use an Enum to represent your data.

After
Java 7 update 6

Representation

 

Starting in Java 7u6 Oracle went back to the
simpler representation of String. It now contains only 3 instance
fields:

  • value - as before, the actual
    characters

  • hash - the cached hash code of the
    String (calculated lazily, still with the same issues for Strings
    with a zero hash code)

  • hash32 - a new hash code based on
    Murmur hash, which gives a better dispersion of hash
    values

A couple more words about the alternative hash
code:

  • it isn’t exposed publicly through the String
    class. You can access it using the (unofficial)

    sun.misc.Hashing.stringHash32 method

  • unlike the original hash code,
    hash32 for two strings containing the same
    characters but running in different JVMs (on the same machine or on
    different machines) isn’t guaranteed to be the same (in fact most
    likely it won’t be, since a
    HASHING_SEED” value is included in the
    calculation which is initialized on JVM startup using the current
    time)

  • the purpose of alternative hash code is to give
    better performance for HashMap and related classes with String keys
    and to thwart hash-collision denial of service attacks

  • Its usage isn’t enabled by default. You need to
    set the “
    jdk.map.althashing.threshold
    property to enable it. If you set this to a value X, then HashMap
    and related classes with a capacity at least X will use the
    alternative hashing algorithm.

  • A word of caution if you want to enable
    alternative hashing: prior to Java 7u40 (ie. all versions between
    Java 7u6 and Java 7u39) had a performance issue which meant that
    HashMap creation while alternative hashing was enabled was slower
    than needed to be. Thus if you want to enable alternative hashing,
    ensure that you have the latest Java 7 runtime.

As you can see substring works now “as
expected”, creating a completely independent copy of the requested
characters. This eliminates the possible memory retention issues
from the previous versions.

This new behavior can cause performance
regressions in some special cases when we have a large string and
extract many substrings from it. A canonical example for this is
parsing of source code (or any other text for that matter) where we
create a substring for each token. You can find a possible solution
for this a little bit later on in this post.

Yet another change which was made in Java 7 was
the removal of compressed strings. The compressed string option
allowed for a more compact representation of strings (using

byte[] instead of char[]
- 50% less memory needed) which contained only ASCII
characters (as is the case in English environments).

In practice the memory savings were 10-30% and
the additional code complexity was not deemed justifiable. Again,
read on for a possible solution.

Deduplication

In Java 7 the string pool used by the
intern call was relocated to the heap. This
means that setting the maximum memory size will also make it
possible to have a larger string pool. Its performance was also
improved which means that we can get rid of our manual string
pools.

If you have an application which uses
intern for string deduplication, take a look at
the
-XX:StringTableSize=… tuning
parameter. This determines the number of “buckets” the hash table
containing the strings has. By default this is

1009 in Java 7, which is insufficient when we
have many interned strings. You might want to tune it to something
like
60013 (come Java 8 this will be the
new default value).

You can also use the
-XX:+PrintStringTableStatistics option which
prints out a statistic about the string cache to the console during
shutdown and the
jmap tool which shows
the total number of strings in the string cache.

Tuning for many substrings

If you are in one of those special situations
where you need a lot of relatively long-lived substrings from a
single source (like parsing source code or other large text files),
you can create a custom
CharSequence
implementation which mimics the old substring behavior (and
also uses
byte[] if you know that you’re
only dealing with ASCII data for example).

CharSequence is an interface
introduced in Java 1.4 to work-around the fact that the String
class is final and we can’t have alternative implementations.
String itself implements
CharSequence (as
does
StringBuilder) and if you wish to
have a flexible API, you should expect

CharSequence for parameters rather than
Strings.

In our benchmark code we provide an alternative
implementation for CharSequence (BBCS – which stands for
BlobBackedCharSequence) that uses a byte array to store characters
(thus being space efficient if we only use ASCII characters) and
uses the “cheap substring” trick from Java 6. You can find the
source code under my GitHub account.

Lets see what this would look like:

 

The benchmark does a pseudo-tokenizing of the JDK 8
source code (you can get early access builds of JDK 8 here to
test your code now) and reports the runtime and the consumed
memory.

The benchmark supports different implementations
for extracting the String tokens:

  • null – this is used to create a baseline (to see
    how much the Token instances occupy)

  • String – uses String.substring

  • for Java 6 we also test String with compressed
    strings enabled

  • cached String – uses a simple opportunistic
    cache (ie. it doesn’t guarantee that all the duplicates are
    detected, but it’s fast)

  • interned String – uses String.intern as a
    cache

  • Blob – uses the custom CharSequence
    implementation

  • cached Blob – the custom CharSequence
    implementation combined with opportunistic caching

Tests were run with both Java 6 and Java 7. The results
are:

Java Version Tokenizer
implementation
Consumed
memory (MB)
Runtime
(seconds)
Java
6u38
null (baseline) 285  

Java
6u38

String 639
(+354)
5
Java
6u38
Chached String
461
(+176)
4
Java
6u38
Interned String 291
(+6)
11
Java
6u38
Compressed String 461
(+176)
5
Java
6u38
Blob 508
(+223)
4
Java
6u38
Cached Blob 375
(+90)
4
Java
7u45
null (baseline) 279  
Java
7u45
String 618
(+339)
8
Java
7u45
Cached String 328
(+49)
5
Java
7u45
Interned String 304
(+25)
6
Java
7u45
Blob 507
(+228)
5
Java
7u45
Cached Blob 375
(+96)
5

Conclusion

As
usual, the most important advice is: use an up-to-date version of
the JVM. Version 6 is end of lifed and you should be using Java 7,
and testing with the pre-release versions of Java 8. There can be
performance regressions but they are very rare and as a general
rule of thumb you should see a 10% improvement in CPU and memory
usage by upgrading from version 6 to version 7.

As for the String representations:

  • to see performance regressions you would need to
    have some very particular data and processing requirements (lots of
    data with very varied parts – which makes caching ineffective – and
    which needs to be processed in one chunk – ie. you can’t process it
    line-by-line for some reason)

  • if the performance / memory usage is not within
    expected bounds, the first step should be to see if the right data
    types are use (ie. don’t keep numbers / dates as strings, if a
    field can have only a limited number of values, use an
    Enum)

  • String interning has improved dramatically in
    Java 7. Use it if you have a lot of duplicate strings. Also, look
    at the tuning options / statistics exposed related to string table
    and try adjusting them.

  • more complex solutions (like
    BlobBackedCharSequence) can be implemented, however the gains might
    not be as great as expected. Use this approach only if the other
    options are exhausted and they didn’t provide satisfactory gains,
    and only after careful benchmarking, to avoid making the situation
    worse on a whim.

You can read more from Attila
at: 

GitHub: https;//github.com/cdman
Author
Comments
comments powered by Disqus