Guest post

High-speed, multi-threaded virtual memory in Java

Alexander Turner

Alexander Turner explores how the 64-bit address space of modern JVMs allows for petabytes of virtual memory.

Would you like a JVM with
terabytes or even petabytes of memory, but still running on
commodity hardware? Would you like to write to files just as you
interact with memory; without having all that fuss about opening,
closing, reading and writing?

With the 64-bit address space of modern JVMs, both of these are
actually possible. The key is to stop thinking about memory and
disk as being separate things and instead combine them with
memory-mapped files.

In the old 32-bit days, memory-mapped files were handy for high
speed disk access, but not for massive virtual memory or for large
files because of the limited address space of the virtual machine.
Now that the JVM is 64-bit and runs on 64-bit operating systems,
memory-mapped files can map tera- or even petabytes of memory into
a process address space. The process does not need to bother itself
about whether the memory is in RAM or on disk; the operating system
takes care of that, and it does so very fast indeed.

We can access memory-mapped files from Java using the
MappedByteBuffer class. Instances of this appear to be normal
ByteBuffers, but the memory they contain is purely virtual; at any
moment it might be on disk, or it might be in Random Access Memory,
but either way it is the job of the operating system to care about
these things. Because of the Integer.MAX_VALUE maximum size of
ByteBuffer, we need a few of them to map large amounts of memory.
In this example I have mapped 40 gigabytes. As my Mac only has 16
gigs of RAM, this proves the point!  

      public MapperCore(String prefix, long size) throws IOException {
                coreFile = new File(prefix + getUniqueId() + ".mem");
                // This is for testing - delete on exit to avoid the disk filling up
                coreFile.deleteOnExit();
                // Now create the actual file
                coreFileAccessor = new RandomAccessFile(coreFile, "rw");
                FileChannel channelMapper = coreFileAccessor.getChannel();
                long nChunks = size / TWOGIG;
                if (nChunks > Integer.MAX_VALUE)
                        throw new ArithmeticException("Requested File Size Too Large");
                long countDown = size;
                long from = 0;
                while (countDown > 0) {
                        long len = Math.min(TWOGIG, countDown);
                        ByteBuffer chunk = channelMapper.map(FileChannel.MapMode.READ_WRITE, from, len);
                        chunks.add(chunk);
                        from += len;
                        countDown -= len;
                }
        }

The above code creates a list of MappedByteBuffer objects which
cover the 40 gigabytes of virtual memory. Reading and writing to
this memory only requires a little care in handling accesses that
straddle two chunks. The full listing of the code accompanies this
article.

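The straddle handling can be sketched as follows. This is illustrative only: the 2 GB TWOGIG constant is shrunk to a tiny CHUNK, and plain heap ByteBuffers stand in for the mapped chunks, so the chunk arithmetic is easy to follow and test; the class and field names are my own, not the article's full listing.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch of chunked access: get() and put() keep working even when the
// requested range straddles the boundary between two chunks.
public class ChunkedStore {
    static final long CHUNK = 16; // stands in for the 2 GB TWOGIG constant
    private final List<ByteBuffer> chunks = new ArrayList<>();

    public ChunkedStore(long size) {
        for (long done = 0; done < size; done += CHUNK) {
            chunks.add(ByteBuffer.allocate((int) Math.min(CHUNK, size - done)));
        }
    }

    public void put(long pos, byte[] data) {
        for (int i = 0; i < data.length; ) {
            // duplicate() gives a private position marker over shared memory
            ByteBuffer chunk = chunks.get((int) ((pos + i) / CHUNK)).duplicate();
            chunk.position((int) ((pos + i) % CHUNK));
            int n = Math.min(chunk.remaining(), data.length - i);
            chunk.put(data, i, n); // copy as much as fits in this chunk
            i += n;                // any remainder goes into the next chunk
        }
    }

    public byte[] get(long pos, int size) {
        byte[] out = new byte[size];
        for (int i = 0; i < size; ) {
            ByteBuffer chunk = chunks.get((int) ((pos + i) / CHUNK)).duplicate();
            chunk.position((int) ((pos + i) % CHUNK));
            int n = Math.min(chunk.remaining(), size - i);
            chunk.get(out, i, n);
            i += n;
        }
        return out;
    }
}
```

With real MappedByteBuffer chunks the loop bodies are identical; only the chunk construction differs.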

One amazingly powerful, simple and easy aspect of this approach
is threading. Normal IO in Java is a threading nightmare: two
threads cannot access the same stream or RandomAccessFile at the
same time without causing chaos, and while we can use non-blocking
IO, it is complex and invasive into one's code.

However, as memory-mapped files are handled by the OS just like
any other memory, threading is a non-issue. Yes, really: we can use
as many threads as we want, reading and writing at the same time,
with no consequences at all; I ran the test code with 128 threads
and it worked just fine (though the machine got kind of hot). The
only trick required is to use duplicates of the MappedByteBuffer
objects, to avoid threads sharing their position and limit markers.
Now we can test it:

        public void readWriteCycleThreaded() throws IOException {
                final MapperCore mapper = new MapperCore("/tmp/MemoryMap", BIG_SIZE);
                final AtomicInteger fails = new AtomicInteger();
                final AtomicInteger done = new AtomicInteger();
                Runnable r = new Runnable() {
                        public void run() {
                                try {
                                        // Set to 0 for sequential test
                                        long off = (long) ((BIG_SIZE - 1024) * Math.random());
                                        System.out.println("Running new thread");
                                        byte[] bOut = new byte[1024];
                                        double counts = 10000000;
                                        for (long i = 0; i < counts; ++i) {
                                                ByteBuffer buf = ByteBuffer.wrap(bOut);
                                                long pos = (long) (((BIG_SIZE - 1024) * (i / counts)) + off) % (BIG_SIZE - 1024);
                                                // Align with 8 byte boundary
                                                pos = (pos / 8) * 8;
                                                for (int j = 0; j < 128; ++j) {
                                                        buf.putLong(pos + j * 8);
                                                }
                                                mapper.put(pos, bOut);
                                                byte[] bIn = mapper.get(pos, 1024);
                                                buf = ByteBuffer.wrap(bIn);
                                                for (int j = 0; j < 128; ++j) {
                                                        long val = buf.getLong();
                                                        if (val != pos + j * 8) {
                                                                throw new RuntimeException("Error at " + (pos + j * 8) + " was " + val);
                                                        }
                                                }
                                        }
                                        System.out.println("Thread Complete");
                                } catch (Throwable e) {
                                        fails.incrementAndGet();
                                        e.printStackTrace();
                                } finally {
                                        done.incrementAndGet();
                                }
                        }
                };
                int nThreads = 128;
                for (int i = 0; i < nThreads; ++i) {
                        new Thread(r).start();
                }
                while (done.intValue() != nThreads) {
                        try {
                                Thread.sleep(100);
                        } catch (InterruptedException e) {
                                // ignore
                        }
                }
                if (fails.intValue() != 0) {
                        throw new RuntimeException("It failed " + fails.intValue());
                }
        }

With 128 threads, any other form of IO I have tried for this
sort of work would simply collapse to less than single-threaded
performance. I have tried this on a quad-core, hyperthreaded i7
Retina MacBook Pro; this code uses all 128 threads and maxes out
the CPU (800%) until the OS detects that the process is running low
on RAM. At that point, it starts paging the mapped memory file in
and out; to achieve this, the kernel starts to take up some CPU and
the Java process drops to between 650 and 750%. The Java program
never needs to care one jot about reading, writing, synchronizing
or any of that stuff – the OS does it all.

Results May Vary

If the read and write points are randomised rather than
sequential, performance drops away somewhat (to 750% without
swapping and 250% with it). This leads me to believe this approach
might be more appropriate for working with small numbers of large
objects rather than large numbers of small objects. An alternative
would be to use a pre-load cache if large numbers of small objects
are mapped into the virtual memory.


So far I have illustrated the use of this technology as a
flexible virtual memory system. In my example, the underlying file
is deleted once we are done with the virtual memory. However, this
approach could just as easily be used for persistent data.

For example, video editing is a very challenging engineering
problem. In general, there are two competing approaches: storing
the entire video in a loss-free format and editing that stored
information, or regenerating the video as required. The latter
approach is more common because of RAM constraints. However, video,
being linear, makes an ideal data type to store in very large
mapped virtual memory. As the algorithms progress over the video,
they can access it as raw byte arrays. The operating system will
page the virtual memory behind these buffers to and from disk as
required.
A completely different, but equally valid use, is as a
replacement for the over-engineered RAM cache solutions often used
in document serving. Consider that we have a medium-sized document
repository of a few terabytes. It may contain such objects as
pictures, short videos and PDFs. A common approach to accelerating
serving these from disk is to use weak or soft references within a
RAM cache of the documents. This has major implications for the JVM
garbage collector and is hard to engineer correctly. It would be
much simpler, and achieve the same thing, to map the entire
document store into virtual memory. The operating system will take
care of paging data into RAM as it is required; furthermore, and
more importantly, the operating system will try to keep memory
pages which have just been accessed in RAM. This means that the
memory-mapped file approach acts as a RAM cache without any
engineering on the part of Java or any implications for the JVM
garbage collector.
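A sketch of that idea for a single document (illustrative only; a real store of several terabytes would need the chunked mapper from earlier, since one FileChannel.map call is limited to Integer.MAX_VALUE bytes, and the class name here is my own):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Map an existing document read-only and serve bytes straight from the
// mapping. Repeated reads come from the OS page cache, with no Java-side
// cache, no weak/soft references and no extra GC pressure.
public class MappedDocument {
    private final MappedByteBuffer map;

    public MappedDocument(Path path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            map = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
        } // the mapping stays valid after the channel is closed
    }

    // Copy out a range of the document, e.g. to write to a socket
    public byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = map.duplicate(); // private position for each reader
        view.position(offset);
        view.get(out);
        return out;
    }
}
```

Concurrent readers are safe for the same reason as before: each call works on a duplicate, so only the shared, immutable mapping itself is touched from multiple threads.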

Finally, memory-mapped files may be useful for scientific
computing and other modelling applications. Where computational
models are used to represent real-world systems, they frequently
require enormous amounts of data to function correctly. In my audio
processing system Sonic Field, individual sound wave forms are
mixed and processed to
model real-world audio effects. For example, to model reverberation
the program creates copies of the original audio as if they had
reflected off hard surfaces. It then mixes these reflections back
with the original. Such a methodology requires enormous amounts of
storage; placing audio signal objects in virtual memory makes a lot
of sense (and was the original motivation for this work).

Alex has worked in Java for 15 years and has recently helped
develop the COBOL and Magik languages for the JVM. Right now, he is
working on Java performance testing tooling for Micro Focus.

