’coz Concurrency is Groovy!

Tutorial - GPars, making parallel systems Groovy and Java-friendly

GPars is an open-source library for high-level concurrency in Groovy (and Java). If you’ve ever heard terms like actors, dataflow, or parallel collections and wanted to try these concepts in a Java-friendly language, now you have a chance. In this article, I plan to give you a quick overview of the abstractions available in GPars. We’ll then look in more detail at flow-based concurrent data processing, parallelizing collection operations, and running composed functions asynchronously.

As multi-core chips are becoming the norm in mainstream computers, tablets and phones, concurrent programming is gaining importance. Unfortunately, the widespread thread-based concurrency model that we know from Java doesn’t match well with how the human brain works. Threads and locks introduce too much non-determinism into the code, which frequently results in subtle bugs that are hard to track down and fix. Such code can’t be reliably tested or analysed. Inevitably, for concurrent programming to be effective, we need to use more natural mental models.

Concurrency can be Groovy

Bringing intuitive concurrency models to the mainstream is the challenge that GPars has ambitions to help with. We took well-known concurrency concepts, such as actors, CSP and dataflow, and implemented them in Java with a tasty Groovy DSL topping, giving the library a fluent and easy-to-use flavour. Although targeting Groovy primarily, some of the GPars abstractions can be used from Java directly. Thanks to the Groovy community’s interest in concurrency and its support for the project, GPars is currently a standard part of Groovy distributions. No additional setup is required to get up and running with concurrency in Groovy.

A loop considered deadly

I’d like you to stop for a moment and think about the trivial exercises that computer science students are typically assigned when learning to program. For example, one such task is to find the maximum of a collection. A complex comparison metric may be thrown in to make the solution more computationally heavy. What algorithm comes to your mind first?

Chances are high you’d propose an iteration that would go through the collection and remember the biggest element found so far. Once we hit the end of the collection, the element we remember as the so-far biggest must be the global maximum of the whole collection. Clear, simple and – well, wrong! Keep on reading if you want to find out why.
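
In Groovy, that textbook solution might look something like this (a minimal sketch with made-up sample data):

def collection = [8, 3, 41, 17, 29]

// The classic sequential approach: a single thread walks the whole
// collection and remembers the biggest element seen so far
def currentMax = collection.first()
for (item in collection) {
    if (item > currentMax) currentMax = item
}
println "Maximum: $currentMax"    // prints 41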

Make your choice

There doesn’t seem to be a single one-size-fits-all solution in the concurrency space. Multiple paradigms have gradually emerged and, despite some overlap, each is suited to a different type of problem. GPars itself started in 2008 as a small Groovy library for parallel collection processing and added support for the actor model shortly after. Over time, other concurrency abstractions have been integrated. Here’s a quick list of what is currently available in GPars version 0.12:

  • Parallel collections provide intuitive ways to parallelize processing of Java/Groovy collections, maps and all geometrically decomposable problems in general

  • Asynchronous functions enable Groovy closures to be run asynchronously and at the same time to orchestrate their mutual communication with little effort

  • Fork/Join empowers you to process recursive divide-and-conquer algorithms concurrently

  • Dataflow variables (aka Promises) offer a lightweight mechanism for inter-thread communication (see the short sketch after this list)

  • Dataflow channels and operators let you organize active data-transforming elements into highly concurrent data-processing pipelines and networks

  • CSP is a well-known concurrency model, grounded in theoretical mathematics, which uses an abstraction of independent concurrently-run processes communicating through synchronous channels

  • Actors/Active objects give you an abstraction of low-ceremony event-driven active components that asynchronously exchange data

  • Agents, as the name suggests, protect data that multiple threads need to access concurrently
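
To give you a first taste of one of these models, here is a minimal dataflow variable sketch. A background task binds the value exactly once, and any reader blocks until the value becomes available (class names as in recent GPars releases):

import groovyx.gpars.dataflow.DataflowVariable
import static groovyx.gpars.dataflow.Dataflow.task

final result = new DataflowVariable()

// One task computes the value and binds it exactly once
task {
    result << 20 + 22
}

// Reading blocks until the variable has been bound
println "The answer is ${result.val}"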

You may check out the details of each of the models in the GPars user guide, where they are compared side-by-side together with their typical areas of applicability. Also, the upcoming book “Groovy in Action”, 2nd edition, by Dierk König covers GPars in great detail.

For this article, I chose three of the abstractions that are most likely to show you the benefits of intuitive concurrency – parallel collections, asynchronous functions, and dataflow operators. Let’s dive right in!

Geometrical decomposition

Now, this is a good place to explain why the sequential algorithm for finding the maximum that we described earlier is a bad choice. It is not that the solution is incorrect; obviously, it gives you valid answers. Where it fails is efficiency: it cannot scale up with an increasing number of workers, and it totally ignores the possibility that the system may be able to put several processors on the problem.

Supermarkets solved this challenge decades ago. When a queue at the cash desk gets too long, they call in an additional cashier to serve the customers, and so the workload gets distributed and throughput increases.

Back to our problem of finding the maximum: Leveraging the Groovy functional collection API, GPars adds parallel versions of each of the popular iterative methods such as eachParallel(), collectParallel(), findAllParallel(), maxParallel() and others. These methods hide the actual implementation from the user. Behind the scenes, the collection gets divided into smaller chunks, possibly organized hierarchically, and each of these chunks will be processed by a different thread (Figure 1).

The actual work is performed by threads from a thread pool that the user must provide. GPars comes with two types of thread pools:

  • GParsExecutorsPool uses straight Java 5 Executors

  • GParsPool uses a Fork/Join thread pool
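
Here is a quick sketch of both factory classes in action; the parallel collection methods look the same with either pool, only the underlying scheduling differs:

import groovyx.gpars.GParsPool
import groovyx.gpars.GParsExecutorsPool

// A Fork/Join based pool, well suited to CPU-bound workloads
GParsPool.withPool {
    assert [1, 2, 3, 4].collectParallel { it * 2 } == [2, 4, 6, 8]
}

// A pool backed by a plain Java executor service
GParsExecutorsPool.withPool {
    [1, 2, 3, 4].eachParallel { println it }
}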

Figure 1: Geometric Decomposition

Parallel collections in use:
import groovyx.gpars.GParsPool

GParsPool.withPool(16) {
    // Grab the spec of each language in parallel, then pick the one
    // whose spec mentions the word "parallel" most often
    def myFavorite = programmingLanguages.collectParallel { it.grabSpec() }.
                         maxParallel { it.spec.count(/parallel/) }
}


Within the withPool code block, the parallel collection methods automatically distribute work among the threads of the surrounding thread pool. The more threads you add to the pool, the more parallelism you get. Without an explicit pool size requirement, the pool would create a thread for each available processor detected at runtime, giving you the maximum computing power. That way there will be no artificial upper boundaries limiting parallelism inside your algorithm. The code will run at full speed no matter whether it is being run on an old single-processor machine or on a future hundred-core chip.
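
With that in mind, the maximum-finding exercise from the beginning of the article shrinks to a one-liner inside a withPool block; a minimal sketch with made-up sample data and the default pool size:

import groovyx.gpars.GParsPool

GParsPool.withPool {
    // The pool defaults to one thread per available processor
    def numbers = (1..100000).collect { (it * 31) % 1234567 }
    println "Maximum: ${numbers.maxParallel()}"
}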

The GPars parallel collection API offers solutions to what are frequently called loop parallelism problems or, more generally, geometric decomposition problems. There are also other types of challenges that require more creative approaches to concurrency. We’re going to discuss two of them in the next part of the article.


Václav Pech