This is not just science - this is big data science
Using QuantCell, the polyglot big data spreadsheet, to deploy MapReduce algorithms
There's a lot of talk at the moment about how SMEs can get meaningful benefits from Big Data. By enabling non-developers to build complex analyses, models and applications, QuantCell endeavours to bring the capabilities of major programming languages to the spreadsheet user. In this article, originally published in JAX Magazine, Agust Egilsson and Kris Tholeiffsson take us under the hood.
QuantCell is a polyglot big data spreadsheet and an end-user programming environment. It is used by domain experts and developers to quickly prototype, develop and deploy production-ready code. To support the data scientist or analyst, QuantCell models and applications can be created using a variety of languages such as SQL, R and Python, the principal languages used by analysts and data scientists. Regular users can simply write formulas using the well-understood spreadsheet notation. Last but not least, QuantCell supports and is built upon Java, arguably the most commonly used programming language today.
The QuantCell environment takes on the challenge of facilitating end-user programming through a number of approaches: access to data, access to algorithms, access to compute power, access to domain expertise, and paths for deployment into production. This tutorial explains in some detail how QuantCell uses the Tools language to deploy models and applications created in the system as production-ready Java APIs. This matters greatly, for example, when performing analysis against Hadoop server environments. We end the discussion with a short overview of the other pillars that support end-user programming in QuantCell.
Deployment and Packaging
QuantCell allows the developer to create models and applications using code snippets from a variety of programming languages. Currently the spreadsheet supports expressions written as traditional spreadsheet formulas, Java code snippets and SQL, with R and Python support a few weeks away. In addition to these languages the system supports expressions from a so-called “Tools” language that is QuantCell specific and is used to support deployment of solutions created in QuantCell directly into production environments or to cloud infrastructures. An example of such a sheet, combining SQL, code snippets and spreadsheet syntax into an application is shown as Figure 1.
Figure 1: SQL and other expressions
The user application created is interactive just like a regular spreadsheet, but nonetheless every piece of it is represented as Java code and then translated into byte code by techniques from the Compiler API (JSR 199), before being further optimized by the JVM.
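To make the compilation step concrete, here is a minimal sketch of turning a Java snippet held in memory into byte code with the standard Compiler API (javax.tools). It is not QuantCell's internal code; the class names SnippetCompiler, StringSource and CellD7, and the snippet itself, are assumptions chosen for illustration. A JDK is required, since a plain JRE does not ship the system compiler.

```java
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SnippetCompiler {

    /** Keeps Java source in memory so the compiler can read it without a source file. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                  Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles the source into outputDir (hypothetical helper) and loads the resulting class. */
    public static Class<?> compileAndLoad(String className, String source, Path outputDir)
            throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }
        // "-d" tells the compiler where to write the generated .class files.
        List<String> options = Arrays.asList("-d", outputDir.toString());
        List<JavaFileObject> units =
                Collections.<JavaFileObject>singletonList(new StringSource(className, source));
        boolean ok = compiler.getTask(null, null, null, options, null, units).call();
        if (!ok) {
            throw new IllegalStateException("Compilation failed for " + className);
        }
        // Load the freshly compiled class from the output directory.
        URLClassLoader loader = new URLClassLoader(new URL[] { outputDir.toUri().toURL() });
        return loader.loadClass(className);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical cell content represented as a small Java class.
        String source = "public class CellD7 { public static int value() { return 6 * 7; } }";
        Path out = Files.createTempDirectory("cells");
        Class<?> cell = compileAndLoad("CellD7", source, out);
        System.out.println(cell.getMethod("value").invoke(null));
    }
}
```

Once loaded, the compiled class is ordinary byte code, so the JVM's just-in-time compiler can optimize it like any other class.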
Introducing the Tools language
For applications created in QuantCell to be immediately useful outside the spreadsheet, we have built deployment paths that take a model or application created in QuantCell and represent it as a service, as Java APIs packaged in a jar file, or as a particular implementation of a class. These are just a few of the deployment options that need to be available to the developer and the end-user. Deployment is implemented using the Tools language in QuantCell.
It is the only language in the QuantCell environment that is specifically designed to address deployment from QuantCell and to make models and applications independent of QuantCell. The features and commands in the Tools language will grow as we add new deployment paths. Currently, for example, the Tools language supports deployment of QuantCell models as Java archive files (.jar files). When using the Tools language to deploy anonymous classes from models, as compiled byte code or source code stored in Java archives, one can use the package command from the language as follows:
Listing 1
(Anonymously defined cell variables, …) = [tools] package -jar <preferred location of jar containing byte code> -java <preferred location of jar containing source code> -source <source code version> -target <preferred byte code version> -override
A typical usage of the Tools language in QuantCell, which we will come back to later, is shown as Figure 2.
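For readers curious about what producing such an archive involves at the Java level, the following is a minimal sketch that writes named entries into a jar with the standard java.util.jar classes. It is an illustration only, not the implementation behind the package command; the class JarPackager, its methods and the placeholder entry are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class JarPackager {

    /** Writes each (entry name, bytes) pair into a jar, replacing any existing file at that path. */
    public static void writeJar(Path jar, Map<String, byte[]> entries) throws IOException {
        Manifest manifest = new Manifest();
        // A manifest is only written out if it declares a version.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // Files.newOutputStream creates the file or truncates an existing one by default.
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, manifest)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                jos.putNextEntry(new JarEntry(e.getKey()));
                jos.write(e.getValue());
                jos.closeEntry();
            }
        }
    }

    /** Returns the names of all entries in the jar, including the manifest. */
    public static List<String> listEntries(Path jar) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jf = new JarFile(jar.toFile())) {
            jf.stream().forEach(entry -> names.add(entry.getName()));
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        // Placeholder content; a real deployment would store compiled .class bytes here.
        entries.put("notes/readme.txt", "placeholder".getBytes(StandardCharsets.UTF_8));
        Path jar = Files.createTempFile("demo", ".jar");
        writeJar(jar, entries);
        System.out.println(listEntries(jar));
    }
}
```

In this sketch the archive is rewritten from scratch on every call, which is one simple way to keep a deployed jar in step with its source.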
Figure 2: Tools language command
This is useful when writing Mappers, Reducers and Combiners for Hadoop jobs running from QuantCell. But let’s start with a less complicated example.
Assume you create an icon/path using JavaFX from QuantCell, as shown in cell d7 in Figure 3.
Figure 3: Simple deployment example
By issuing the package command seen in cell d9, i.e., “(d7) = [tools] package -jar c:\temp\q-icon-code.jar -java c:\temp\q-icon-javafx.jar -overwrite”, QuantCell returns the application created in this sheet as a Java program and API written to the “temp” directory. The actual code created by the package command is shown as Figure 4, as it appears when opened and formatted in an IDE.
Figure 4: Deployed code
The class name, shown in Figure 4, looks a little mechanical, but then again, it is mechanically created.
Hadoop and MapReduce
The above example explains a feature of the Tools language used in QuantCell for deployment. On the other hand, the example is not necessarily practical. Let’s look at a more comprehensive example where we build mappers and reducers in QuantCell and send the analysis to a Rackspace Hadoop cluster for evaluation on a particular data set. In this example, the packaging command becomes necessary since the analysis is run on an outside cluster requiring us to deploy our code onto the cluster.
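For readers new to the programming model, the roles of a mapper and a reducer can be sketched in plain Java without any Hadoop dependency. This is an in-memory illustration of the map, group and reduce phases, not the traffic analysis from the sheet; the class MiniMapReduce, the record layout and the sample records are all hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    /** Map phase: emit a (key, 1) pair per record; the key is the first comma-separated field. */
    public static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            String key = record.split(",")[0].trim();
            pairs.add(new AbstractMap.SimpleEntry<>(key, 1));
        }
        return pairs;
    }

    /** Group and reduce phase: collect pairs by key and sum their values. */
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input records: a site identifier followed by a timestamp.
        List<String> records = Arrays.asList("siteA,10:01", "siteB,10:02", "siteA,10:05");
        System.out.println(reduce(map(records))); // prints {siteA=2, siteB=1}
    }
}
```

On a real cluster the map and reduce steps run on different machines, which is why the classes implementing them have to be shipped to the cluster as a jar.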
In this case the MapReduce analysis is created in many pieces in the form of custom functions, reference data and formulas created in different areas of the sheet. The QuantCell client is used to interactively write, test and execute the analysis. It also supports logging of activities on the cluster and callback in addition to providing a form of documentation of the code. Here is a screenshot, as Figure 5, showing the complete sheet and, in particular, the code used to create the Hadoop Mapper in cell c3 of the sheet.
Figure 5: MapReduce analysis
The Mapper object in cell c3 is seen to depend on the function “traffic” defined in cell c16 which itself depends on cells c14, c15 and g4. The packager in the Tools language has to account for all these relationships when creating the byte code which is sent to the Apache Hadoop cluster. Since deployment of code is core functionality in QuantCell, it is efficiently handled by the system internals. The package command creates as many classes in the resulting jar as specified by the input variables.
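The dependency tracking described above amounts to computing everything reachable from a cell. The sketch below shows one way to do that with a depth-first traversal, using the cell relationships named in the text (c3 depends on c16, which depends on c14, c15 and g4). How QuantCell does this internally is not described here, so the class CellDependencies and its closure method are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CellDependencies {

    /** Returns the root cell plus every cell it depends on, directly or indirectly. */
    public static Set<String> closure(String root, Map<String, List<String>> dependsOn) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String cell = stack.pop();
            // Set.add returns false for cells already seen, which also guards against cycles.
            if (visited.add(cell)) {
                for (String dep : dependsOn.getOrDefault(cell, Collections.<String>emptyList())) {
                    stack.push(dep);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("c3", Arrays.asList("c16"));
        dependsOn.put("c16", Arrays.asList("c14", "c15", "g4"));
        // Everything that must be packaged along with the Mapper in c3.
        System.out.println(closure("c3", dependsOn));
    }
}
```

A packager built on this idea would include the code for every cell in the returned set when generating the class for c3.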
In the case at hand, QuantCell creates only the Mapper and the Reducer classes from cells c3 and c4 and makes sure that all the logic is incorporated into these two classes. The Tools command used in cell c5 to interactively package the analysis is “(c3,c4) = [tools] package -jar .\temp\mr-class.jar -java .\temp\mr-java.jar -source 1.6 -target 1.6 -overwrite”. This expression should be self-explanatory, but it should be noted that the jar location is just some location where the Hadoop client can find the API and send it to the Hadoop server.
It remains to be explained how the Hadoop client is able to take advantage of this code. This is something that is part of Hadoop. Basically, one only has to tell the Hadoop configuration object where to locate the code and in our example above, this is done in cell c6 that contains the Configuration object.
Figure 6: Location of the Java archive specified
Line 5 in the code (cell c6), see Figure 6, points Apache Hadoop to the location of the Java archive that contains the Mapper and the Reducer classes from the QuantCell sheet and, as required, other class definitions. The Java archive pointed to is always kept up to date. In other words, changes to the analysis in the QuantCell sheet are immediately written to the jar by the package statement in cell c5. This is, essentially, what enables the Hadoop server to use the QuantCell sheet as the backbone of the analysis being performed.
QuantCell independent deployment
Just as in the previous, shorter example, the Java archive created by the packaging command contains only user-defined code. In other words, no part of the QuantCell environment is mixed into the API communicated to the Hadoop server. This is important since it mimics how other, more traditional IDEs work, i.e., the code created does not depend on the IDE used to write the analysis.
The above example can be experimented with by installing QuantCell, available from quantcell.com, and opening one of the included examples, “CDH4 on Rackspace - Traffic”. The Hadoop setup used in the example is based on Cloudera's Open Source Distribution including Hadoop (CDH4), distributed from Cloudera's homepage (cloudera.com). A CDH4 client is retrieved and installed when the example is opened, but the example is not specific to CDH4; other Hadoop environments can be configured as well.
About the QuantCell environment
QuantCell is an end-user programming environment for building applications and models and for performing big data analysis. There are several fundamental components that make QuantCell a highly capable analytics tool for domain experts and data scientists as well as a powerful tool for developers. We like to think of our approach as being divided into the following components, pictured here as Figure 7.
Figure 7: Elements of end-user programming
To round out the description above, we give a short overview of each of these pillars.
The spreadsheet is the only well-established end-user development tool. It is proven to shorten turnaround time and most users are familiar with working with a spreadsheet. QuantCell takes the spreadsheet to the next level by bringing the power of major programming languages to spreadsheet users, allowing them to benefit from big data technologies, open source APIs, machine learning algorithms and cloud-based resources.
QuantCell supports intuitive, on-demand access to thousands of libraries and tools and big data frameworks from online open source repositories and proprietary sources. Literally, a terabyte of various algorithms and methods can be accessed and used within QuantCell by the end-user from online repositories. Practically any Java API can be imported into QuantCell from day one and turned into the end-user’s formula library.
Access to Data Sources
QuantCell simplifies connecting to and using data sources. Reference data may be accessed using applicable APIs, and data may be accessed using JDBC connectors and SQL or NoSQL languages, Apache Hadoop and other frameworks. Additionally, our DataMarket wizard may be used to connect to data from over 50 different providers and to access well over 100 million time-series.
QuantCell models derive their performance from the underlying system, whether it be the local JVM, remote Hadoop servers, databases, or other scalable compute clouds such as AWS. Model execution benefits enormously from state-of-the-art just-in-time compilation, garbage collection, concurrency support and other optimization methods available in the newest Java platform.
Deployment of Models and Applications
Often an afterthought in other analytical software products, deployment into production is ingrained into QuantCell's DNA, as explained in the tutorial. Deployment is a critical part of the workflow when aiming to shorten the time from prototyping and developing to deploying a solution into production.
The QuantCell team seeks to provide solutions that tackle end-user programming from all sides. In addition to simplifying how users access algorithms, libraries, data and compute power, and in addition to addressing deployment of models into production, the QuantCell environment provides features that reduce and often eliminate the need to write code at all.
Delivering on end-user programming is an ongoing project for us at QuantCell Research. We believe it to be tremendously valuable to domain experts, end-users and developers alike by shortening turnaround times and allowing each group to focus on what it does best.
Agust Egilsson, Ph.D.
Agust is a skilled developer and has been writing code for over thirty years. He is the co-founder of QuantCell Research and the creator and lead architect of the QuantCell Big Data spreadsheet. He received his doctorate in mathematics from UC Berkeley. Prior to founding QuantCell, Agust was an assistant professor at UC Berkeley and a lecturer and researcher at the University of Wisconsin and the University of Iceland. He was an investment banker, where he directed quantitative research and risk management. He also worked for DeCode Genetics and has a background in Business Intelligence. He has published several research papers in mathematics and computer science.
Kris is the co-founder of QuantCell Research and a former product manager for Java at Sun Microsystems. An experienced product and marketing manager, he oversaw the launch of the first Java platform for Linux, and launched Sun's Java Enterprise System and SunGrid cloud computing utility. Prior to co-founding QuantCell, he was the CMO and VP of the Dohop travel search engine, worked at Compaq Computer Corporation, and headed marketing and product management for an online banking service in 11 international markets. Kris holds an MBA degree from Rice University and is a graduate of the University of Georgia.