Guest post

High-speed, multi-threaded virtual memory in Java

Would you like a JVM with terabytes or even petabytes of memory, but still running on commodity hardware? Would you like to write to files just as you interact with memory; without having all that fuss about opening, closing, reading and writing?

With the 64-bit address space of modern JVMs, both of these are actually possible. The key is to stop thinking about memory and disk as being separate things and instead combine them with memory-mapped files.

In the old 32-bit days, memory-mapped files were handy for high speed disk access, but not for massive virtual memory or for large files because of the limited address space of the virtual machine. Now that the JVM is 64-bit and runs on 64-bit operating systems, memory-mapped files can map tera- or even petabytes of memory into a process address space. The process does not need to bother itself about whether the memory is in RAM or on disk; the operating system takes care of that, and it does so very fast indeed.

We can access memory-mapped files from Java using the MappedByteBuffer class. Instances of this appear to be normal ByteBuffers, but the memory they contain is purely virtual; at any moment it might be on disk, or it might be in Random Access Memory, but either way it is the job of the operating system to care about these things. Because of the Intger.MAX_VALUE maximum size of ByteBuffer, we need a few of them to map large amounts of memory. In this example I have mapped 40 gigabytes. As my Mac only has 16 gigs of RAM, this proves the point!  

      public MapperCore(String prefix,long size) throws IOException {
                coreFile=new File(prefix+getUniqueId() + ".mem");
                // This is a for testing - to avoid the disk filling up
                coreFile.deleteOnExit();
                // Now create the actual file
                coreFileAccessor = new RandomAccessFile(coreFile, "rw");
                FileChannel channelMapper=coreFileAccessor.getChannel();
                long nChunks=size/TWOGIG;
                if(nChunks>Integer.MAX_VALUE)throw new ArithmeticException("Requested File Size Too Large");
                length=size;
                long countDown=size;
                long from=0;
                while(countDown>0) {
                        long len=Math.min(TWOGIG, countDown);
                        ByteBuffer chunk=channelMapper.map(MapMode.READ_WRITE, from, len);
                        chunks.add(chunk);
                        from+=len;
                        countDown-=len;
                }
        }

The above code creates a list of MappedByteBuffer objects which cover the 40 gigabytes of virtual memory. Reading and writing the to memory only requires a little care handling access straddling two chunks. The full listing of the code can be found in this gist.  

Threading

One amazingly powerful, simple and easy aspect of this approach is threading. Normal IO with Java is a threading nightmare. Two threads cannot access the same stream or RandomAccessFile at the same time without causing chaos, and while we can use non-blocking IO this is complex and invasive into ones code.

However, as memory-mapped files are handled by the OS just like any other memory threading is a non-issue. Yes, really: we can use as many threads as we want and read and write from it at the same time with no consequences at all; I ran the test code with 128 threads and it worked just fine (though the machine got kinda of hot ). The only trick required is to use duplicates of the MappedByteBuffer objects to avoid issues with their position state.

Now we can test it:

@Test 
        public void readWriteCycleThreaded() throws IOException {
                final MapperCore mapper=new MapperCore("/tmp/MemoryMap",BIG_SIZE);
                final AtomicInteger fails=new AtomicInteger();
                final AtomicInteger done=new AtomicInteger();
                Runnable r=new Runnable() {
                        public  void run() {
                                try {
                                        // Set to 0 for sequential test
                                        long off=(long) ((BIG_SIZE-1024) * Math.random());
                                        System.out.println("Running new thread");
                                        byte[] bOut=new byte[1024];
                                        double counts=10000000;
                                        for(long i=0;i<counts;++i) {
                                                ByteBuffer buf=ByteBuffer.wrap(bOut);
                                                long pos=(long) (((BIG_SIZE-1024) * (i/counts))+off)%(BIG_SIZE-1024);
                                                // Align with 8 byte boundary
                                                pos=pos/8;
                                                pos=pos*8;
                                                for(int j=0;j<128;++j) {
                                                        buf.putLong(pos + j*8);
                                                }
                                                mapper.put(pos,bOut);
                                                byte[] bIn= mapper.get(pos, 1024);
                                                buf=ByteBuffer.wrap(bIn);
                                                for(int j=0;j<128;++j) {
                                                        long val=buf.getLong();
                                                        if(val!=pos+j*8) {
                                                                throw new RuntimeException("Error at " + (pos+j*8) + " was " + val);
                                                        }
                                                }                       
                                        }
                                        System.out.println("Thread Complete");
                                }catch(Throwable e)
                                {
                                        e.printStackTrace();
                                        fails.incrementAndGet();
                                }finally {
                                        done.incrementAndGet();
                                }
                        }
                };
                int nThreads=128;
                for(int i=0;i<nThreads;++i) {
                        new Thread(r).start();
                }
                while(done.intValue()!=nThreads) {
                        try {
                                Thread.sleep(1000);
                        } catch (InterruptedException e) {
                                // ignore
                        }
                }
                if(fails.intValue()!=0) {
                        throw new RuntimeException("It failed " + fails.intValue());
                }
        }

With 128 threads, any other form of IO I have tried to perform stuff like this with would simply collapse to less than signal threaded performance. I have tried this on a quad-core, hyperthreaded I7” Retina MacBook Pro; this code uses all 128 threads and maxes out the CPU (800%) until the OS detects that the process is running low on RAM. At that point, it starts paging in and out of the mapped memory file; to achieve this, the kernel starts to take up some CPU and the Java process drops to between 650 and 750%. The Java program never needs to care one jot about reading, writing, synchronizing or any of that stuff - the OS does it all.  

Results May Vary

If the read and write points are randomised rather than sequential, performance drops away somewhat (to 750% and 250% with and without swapping respectively). This leads me to believe this approach might be more appropriate for working with small numbers of large objects rather than large numbers of small objects. An alternative would be to use a pre-load cache if large numbers of small are mapped into the virtual memory.

Applications

So far I have illustrated the use of this technology as flexible virtual memory system. In my example, the underlying file is deleted once we are done with the virtual memory. However, this approach could just as easily be used for persistent data.

For example, video editing is a very challenging engineering problem. In general, there are two competing approaches; storing the entire video in a loss-free format and edit that stored information, or regenerate the video as required. The latter approach is more common because of RAM constraints. However, video, being linear, makes an ideal data type to store in very large mapped virtual memory. As the algorithms progress over the video, they can access it as raw byte arrays. The operating system will page the virtual memory behind these buffers to and from disk as required.

A completely different, but equally valid use, is as a replacement for the over-engineered RAM cache solutions often used in document serving. Consider that we have a medium-sized document repository of a few terabytes. It may contain such objects as pictures, short videos and PDFs. A common approach to accelerating serving these from disk is to use weak or soft references within a RAM cache of the documents. This has major implications for the JVM garbage collector and is hard to engineer correctly. It would be much simpler, and achieve the same thing, to map the entire document store into virtual memory. The operating system will take care of paging data into RAM as it is required; furthermore, and more importantly, the operating system will try to keep memory pages which having just been accessed in RAM. This means that the memory-mapped file approach acts as a RAM cache without any engineering on the part of Java or any implications for the JVM garbage collector.

Finally, memory-mapped files may be useful for scientific computing and other modelling applications. Where computational models are used to represent real-world systems, they frequently require enormous amounts of data to function correctly. In my audio processing system Sonic Field, individual sound wave forms are mixed and processed to model real-world audio effects. For example, to model reverberation the program creates copies of the original audio as if they had reflected off hard surfaces. It then mixes these reflections back with the original. Such a methodology requires enormous amounts of storage; placing audio signal objects in virtual memory makes a lot of sense (and was the original motivation for this work).

Alex has worked in Java for 15 years and had recently helped develop the COBOL and Magik languages for the JVM. Right now, he is working on Java performance testing tooling for Micro Focus.

Photo by Jacqui 1686.

Alexander Turner
Alexander Turner

What do you think?

JAX Magazine - 2014 - 06 Exclucively for iPad users JAX Magazine on Android

Comments

Latest opinions