This is not just science - this is big data science

Using QuantCell, the polyglot big data spreadsheet, to deploy MapReduce algorithms

Agust Egilsson, Ph.D., and Kris Thorleifsson

Now spreadsheet users can tap into the capabilities of major programming languages! Agust Egilsson and Kris Thorleifsson show us how.

There’s a lot of talk at the moment about how SMEs can get
meaningful benefits from Big Data. By enabling non-developers to
build complex analyses, models and applications, QuantCell
endeavours to bring the capabilities of major programming languages
to the spreadsheet user. In this article, originally published in
JAX Magazine, Agust Egilsson and Kris Thorleifsson take us under
the hood.

QuantCell is a polyglot big data spreadsheet and an
end-user programming environment. It is used by domain experts and
developers to quickly prototype, develop and deploy
production-ready code. To support the data scientist or analyst,
QuantCell models and applications can be created using a variety of
languages such as SQL, R and Python, the principal languages used
by analysts and data scientists. Regular users can simply write
formulas using the well-understood spreadsheet notation. Last but
not least, QuantCell supports and is built upon Java, arguably
the most commonly used programming language today.

The QuantCell environment takes on the challenge of
facilitating end-user programming through a number of approaches:
access to data, access to algorithms, access to compute power,
access to domain expertise, and paths for deployment into
production. This tutorial explains in some detail how QuantCell
uses the Tools language to deploy models and applications created
in the system as production-ready Java APIs. This matters
greatly, for example, when performing analysis against Hadoop
server environments. We end the discussion with a short overview of
the other pillars that support end-user programming in
QuantCell.

Deployment and Packaging

QuantCell allows the developer to create models and
applications using code snippets from a variety of programming
languages. Currently the spreadsheet supports expressions written
as traditional spreadsheet formulas, Java code snippets and SQL,
with R and Python support a few weeks away. In addition to these
languages, the system supports expressions from a so-called “Tools”
language that is QuantCell-specific and is used to deploy
solutions created in QuantCell directly into production
environments or to cloud infrastructures. An example of such a
sheet, combining SQL, code snippets and spreadsheet syntax into an
application, is shown in Figure 1.

Figure 1: SQL and other expressions

The user application created is interactive just like
a regular spreadsheet, yet every piece of it is represented as
Java code and then translated into byte code using the
Compiler API (JSR 199), before being further optimized by the JVM.
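
To make the compilation step concrete, here is a minimal sketch of how a single Java expression could be wrapped in a class, compiled with the standard Compiler API (javax.tools) and evaluated. It is an illustration only and not QuantCell's actual implementation: the class name CellCompiler, the method evaluate and the generated wrapper class are assumptions made for this example. It needs to run on a JDK, since a plain JRE does not ship the system compiler.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not QuantCell code): turns one "cell" expression into byte code.
class CellCompiler {

    /** Wraps the expression in a generated class, compiles it and returns its value. */
    static Object evaluate(String className, String expression) {
        // Assumed wrapper shape: one public class with one static method per cell.
        String source =
            "public class " + className + " {\n"
          + "    public static Object value() { return " + expression + "; }\n"
          + "}\n";
        try {
            Path dir = Files.createTempDirectory("cells");
            Path file = dir.resolve(className + ".java");
            Files.write(file, source.getBytes(StandardCharsets.UTF_8));

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system compiler: run on a JDK, not a JRE");
            }
            // Same option syntax as javac; -d selects the output directory for class files.
            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Compilation failed: " + expression);
            }
            // Load the freshly compiled class and call its static method reflectively.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Method value = loader.loadClass(className).getMethod("value");
                return value.invoke(null);
            }
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate cell " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("CellD7", "Math.max(3, 4) * 2")); // prints 8
    }
}
```

Once loaded, the generated class is ordinary byte code, so the JVM's just-in-time compiler treats it like any hand-written class.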

Introducing the Tools language

For applications created in QuantCell to be immediately
useful outside the spreadsheet, we have built deployment paths
that take a model or application created in QuantCell and
represent it as a service, as Java APIs packaged in a jar file, or
as a particular implementation of a class. These are just a
few of the deployment options that need to be available to the
developer and the end-user. Deployment is implemented using the
Tools language in QuantCell.

It is the only language in the QuantCell environment
that is specifically designed to address deployment from QuantCell
and to make models and applications independent of QuantCell. The
features and commands in the Tools language will grow as we add new
deployment paths. Currently, for example, the Tools language
supports deployment of QuantCell models as Java archive files (.jar
files). When using the Tools language to deploy anonymous classes
from models, as compiled byte code or source code stored in Java
archives, one can use the package command from the language as
follows:

Listing 1

(Anonymously defined cell variables, …) = [tools] package
    -jar <preferred location of jar containing byte code>
    -java <preferred location of jar containing source code>
    -source <source code version>
    -target <preferred byte code version>
    -override

A typical use of the Tools language in QuantCell, which
we will come back to later, is shown in Figure 2.

Figure 2: Tools language command

This is useful when writing Mappers, Reducers and
Combiners for Hadoop jobs running from QuantCell. But let’s start
with a less complicated example.

Assume you create an icon/path using JavaFX from
QuantCell, as shown in cell d7 in Figure 3.

Figure 3: Simple deployment example

By issuing the package command seen in cell d9, i.e.,
“(d7) = [tools] package -jar c:\temp\q-icon-code.jar -java
c:\temp\q-icon-javafx.jar -overwrite”, QuantCell returns the
application created in this sheet as a Java program and API written
to the “temp” directory. The actual code created by the package
command, opened and formatted in an IDE, is shown in Figure 4.

Figure 4: Deployed code

The class name shown in Figure 4 looks a little
mechanical, but then again, it is mechanically created.
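
The last step of such a package command, writing files into a Java archive, can be sketched with the standard java.util.jar classes. This is a rough illustration under stated assumptions, not the Tools language implementation: the class JarPackager, its methods and the placeholder files demo/Mapper.class and demo/Reducer.class are invented for this example, and the placeholder files do not contain real byte code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not QuantCell code): packs a directory of files into a jar.
class JarPackager {

    /** Writes every regular file under root into the jar, replacing an existing jar. */
    static void packageDirectory(Path root, Path jar) {
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> files = walk.filter(Files::isRegularFile).sorted()
                                   .collect(Collectors.toList());
            // Files.newOutputStream truncates an existing file by default.
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jos = new JarOutputStream(out)) {
                for (Path file : files) {
                    // Jar entry names always use forward slashes.
                    String name = root.relativize(file).toString().replace('\\', '/');
                    jos.putNextEntry(new JarEntry(name));
                    Files.copy(file, jos);
                    jos.closeEntry();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the entry names stored in a jar. */
    static List<String> entryNames(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream().map(JarEntry::getName).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small sample directory with placeholder files and packages it. */
    static Path packageSample() {
        try {
            Path root = Files.createTempDirectory("classes");
            Path pkg = Files.createDirectories(root.resolve("demo"));
            Files.write(pkg.resolve("Mapper.class"), new byte[] {1, 2, 3});   // placeholder bytes
            Files.write(pkg.resolve("Reducer.class"), new byte[] {4, 5, 6}); // placeholder bytes
            Path jar = Files.createTempFile("sample", ".jar");
            packageDirectory(root, jar);
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(entryNames(packageSample()));
    }
}
```

Keeping the archive free of anything but the files handed to it mirrors the idea that the deployed jar contains only user-defined code.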

Hadoop and MapReduce

The above example explains a feature of the Tools
language used in QuantCell for deployment. On the other hand, the
example is not necessarily practical. Let’s look at a more
comprehensive example where we build mappers and reducers in
QuantCell and send the analysis to a Rackspace Hadoop cluster for
evaluation on a particular data set. In this example, the packaging
command becomes necessary since the analysis is run on an outside
cluster requiring us to deploy our code onto the cluster.

In this case the MapReduce analysis is created in many
pieces, in the form of custom functions, reference data and
formulas defined in different areas of the sheet. The QuantCell
client is used to interactively write, test and execute the
analysis. It also supports logging of activities on the cluster and
callbacks, in addition to providing a form of documentation of the
code. Figure 5 is a screenshot showing the complete sheet
and, in particular, the code used to create the Hadoop Mapper in
cell c3 of the sheet.

Figure 5: MapReduce analysis
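
For readers new to the pattern, the roles of a mapper and a reducer can be sketched in plain Java without any Hadoop dependency. This is not the code from the sheet in Figure 5: the record format ("<road> <vehicleCount>"), the class MiniMapReduce and the summing logic are assumptions chosen only to illustrate the two steps and the grouping that a cluster performs between them.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process stand-in for a MapReduce job.
class MiniMapReduce {

    /** Map step: one input record yields zero or more (key, value) pairs. */
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] parts = record.trim().split("\\s+");
        if (parts.length == 2) { // silently skip malformed records
            out.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
        }
        return out;
    }

    /** Reduce step: all values collected for one key are folded into one result. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Driver: groups mapped pairs by key (the "shuffle"), then reduces each group. */
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("A1 10", "B7 4", "A1 5"))); // prints {A1=15, B7=4}
    }
}
```

On a real cluster the map and reduce steps run on different machines, which is why the classes implementing them have to be shipped there as byte code.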

The Mapper object in cell c3 is seen to depend on the
function “traffic” defined in cell c16 which itself depends on
cells c14, c15 and g4. The packager in the Tools language has to
account for all these relationships when creating the byte code
which is sent to the Apache Hadoop cluster. Since deployment of
code is core functionality in QuantCell, it is efficiently handled
by the system internals. The package command creates as many
classes in the resulting jar as specified by the input
variables.
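
One way to picture the bookkeeping involved is a depth-first walk over the cell references that collects everything a packaged cell needs, dependencies first. The sketch below uses the relationships named above (c3 depends on c16, which depends on c14, c15 and g4); the class CellDependencies and its data structures are assumptions for illustration, and the real packager may well work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not QuantCell code): resolves what a packaged cell depends on.
class CellDependencies {

    /** Returns the root cells plus all transitive dependencies, dependencies first. */
    static List<String> closure(Map<String, List<String>> dependsOn, String... roots) {
        Set<String> ordered = new LinkedHashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String root : roots) {
            visit(root, dependsOn, ordered, inProgress);
        }
        return new ArrayList<>(ordered);
    }

    private static void visit(String cell, Map<String, List<String>> dependsOn,
                              Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(cell)) {
            return; // already resolved through another path
        }
        if (!inProgress.add(cell)) {
            throw new IllegalStateException("Circular reference at " + cell);
        }
        for (String dep : dependsOn.getOrDefault(cell, Collections.emptyList())) {
            visit(dep, dependsOn, ordered, inProgress);
        }
        inProgress.remove(cell);
        ordered.add(cell); // post-order: a cell is emitted after everything it needs
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("c3", Arrays.asList("c16"));
        dependsOn.put("c16", Arrays.asList("c14", "c15", "g4"));
        System.out.println(closure(dependsOn, "c3")); // prints [c14, c15, g4, c16, c3]
    }
}
```

The post-order result is convenient for code generation, since each definition appears after the definitions it refers to.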

In the case at hand, QuantCell creates only the Mapper
and the Reducer class from cells c3 and c4 and makes sure that all
the logic is incorporated into these two classes. The Tools command
used in cell c5 to interactively package the analysis is “(c3,c4) =
[tools] package -jar .\temp\mr-class.jar -java .\temp\mr-java.jar
-source 1.6 -target 1.6 -overwrite”. This expression should be
self-explanatory, but it should be noted that the jar location is
simply a location where the Hadoop client can find the API and
send it to the Hadoop server.

It remains to be explained how the Hadoop client is
able to take advantage of this code. This is something that is part
of Hadoop. Basically, one only has to tell the Hadoop configuration
object where to locate the code and in our example above, this is
done in cell c6 that contains the Configuration object.

Figure 6: Location of the Java archive specified

Line 5 in the code (cell c6), see Figure 6, points
Apache Hadoop to the location of the QuantCell sheet containing the
Mapper and the Reducer class and, as required, other class
definitions. The Java archive pointed to is always kept up to date.
In other words, changes in the analysis in the QuantCell sheet are
immediately written to the jar by the package statement in cell c5.
This is, essentially, what enables the Hadoop server to use the
QuantCell sheet as the backbone of the analysis being
performed.

QuantCell independent deployment

Just as in the previous, shorter example, the Java
archive created by the packaging command contains only user-defined
code. In other words, no part of the QuantCell environment is mixed
into the API communicated to the Hadoop server. This is
important since it mimics how other, more traditional IDEs work,
i.e., the code created does not depend on the IDE used by the user
to write the analysis.

The above example can be tried out by installing
QuantCell, available from quantcell.com, and opening the included
example “CDH4 on Rackspace – Traffic”. The Hadoop setup used in
the example is based on Cloudera’s Open Source Distribution
including Hadoop (CDH4), distributed from Cloudera’s homepage
(cloudera.com). A CDH4 client is retrieved and installed when the
example is opened, but the example is not specific to CDH4;
other Hadoop environments can be configured as well.

About the QuantCell environment

QuantCell is an end-user programming environment for
building applications and models and for performing big data
analysis. Several fundamental components make QuantCell a
highly capable analytics tool for domain experts and data scientists
as well as a powerful tool for developers. We like to think of our
approach as being divided into the components pictured in
Figure 7.

Figure 7: Elements of end-user programming

To give an overview of QuantCell we will add to our
above description by giving a short overview of each of these
pillars.

Spreadsheet

The spreadsheet is the only well-established end-user
development tool. It is proven to shorten turnaround time and most
users are familiar with working with a spreadsheet. QuantCell takes
the spreadsheet to the next level by bringing the power of major
programming languages to spreadsheet users, allowing them to
benefit from big data technologies, open source APIs, machine
learning algorithms and cloud-based resources.

Algorithms

QuantCell supports intuitive, on-demand access to
thousands of libraries and tools and big data frameworks from
online open source repositories and proprietary sources. Literally,
a terabyte of various algorithms and methods can be accessed and
used within QuantCell by the end-user from online repositories.
Practically any Java API can be imported into QuantCell from day
one and turned into the end-user’s formula library.
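
As a rough idea of what turning a Java API into a formula library can mean, the sketch below uses reflection to register the public static methods of a class under spreadsheet-style upper-case names and to call them with numeric arguments. It is only an illustration: the class FormulaLibrary and its methods are assumptions for this example, and QuantCell's own import mechanism is not described here.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (not QuantCell code): exposes static Java methods as formulas.
class FormulaLibrary {

    private final Map<String, List<Method>> formulas = new TreeMap<>();

    /** Registers every public static method of the class under its upper-cased name. */
    void importClass(Class<?> api) {
        for (Method method : api.getMethods()) {
            if (Modifier.isStatic(method.getModifiers())) {
                String name = method.getName().toUpperCase(Locale.ROOT);
                formulas.computeIfAbsent(name, k -> new ArrayList<>()).add(method);
            }
        }
    }

    /** Calls the overload that takes exactly these doubles and returns a double. */
    double call(String name, double... args) {
        String key = name.toUpperCase(Locale.ROOT);
        for (Method method : formulas.getOrDefault(key, Collections.emptyList())) {
            if (method.getReturnType() != double.class || !takesDoubles(method, args.length)) {
                continue;
            }
            Object[] boxed = new Object[args.length];
            for (int i = 0; i < args.length; i++) {
                boxed[i] = args[i];
            }
            try {
                return (Double) method.invoke(null, boxed);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Formula " + key + " failed", e);
            }
        }
        throw new IllegalArgumentException("No formula " + key + " taking " + args.length + " numbers");
    }

    private static boolean takesDoubles(Method method, int count) {
        Class<?>[] types = method.getParameterTypes();
        if (types.length != count) {
            return false;
        }
        for (Class<?> type : types) {
            if (type != double.class) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FormulaLibrary library = new FormulaLibrary();
        library.importClass(Math.class);
        System.out.println(library.call("MAX", 2, 5)); // prints 5.0
        System.out.println(library.call("SQRT", 16));  // prints 4.0
    }
}
```

Restricting the lookup to double-valued overloads keeps the sketch short; a fuller version would also need to convert between cell values and arbitrary parameter types.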

Access to Data Sources

QuantCell simplifies connecting to and using data
sources. Reference data may be accessed using applicable APIs, and
data may be accessed using JDBC connectors and SQL or NoSQL
languages, Apache Hadoop and other frameworks. Additionally, our
DataMarket wizard may be used to connect to data from over 50
different providers and to access well over 100 million
time series.

Compute Power

QuantCell models derive their performance from the
underlying system, whether it be the local JVM, remote Hadoop
servers, databases, or other scalable compute clouds such as AWS.
Model execution benefits enormously from state-of-the-art
just-in-time compilation, garbage collection, concurrency support
and other optimization methods available in the newest
Java platform.

Deployment of Models and Applications

Often an afterthought in other analytical software
products, deployment into production is ingrained into QuantCell’s
DNA, as explained in the tutorial. Deployment is a critical part of
the workflow when aiming to shorten the time from prototyping and
developing to deploying a solution into production.

End-User Programming

The QuantCell team seeks to provide solutions that
tackle end-user programming from all sides. In addition to
simplifying how users access algorithms, libraries, data and
compute power, and in addition to addressing deployment of models
into production, the QuantCell environment provides features that
reduce and often eliminate the need to write code at all.

Delivering on end-user programming is an ongoing
project for us at QuantCell Research. We believe it to be
tremendously valuable to domain experts, end-users and developers
alike, shortening turnaround times and allowing each group to
focus on what it does best.

Agust Egilsson, Ph.D.

Agust is a skilled developer and has been writing
code for over thirty years. He is the co-founder of QuantCell
Research and the creator and lead architect of the QuantCell Big
Data spreadsheet. He received his doctorate in mathematics from UC
Berkeley. Prior to founding QuantCell, Agust was an assistant
professor at UC Berkeley, and a lecturer and researcher at the
University of Wisconsin and the University of Iceland. He was an
investment banker, where he directed quantitative research and risk
management. He also worked for DeCode Genetics and has a background
in Business Intelligence. He has published several research papers
in mathematics and computer science.

Kris Thorleifsson

Kris is the co-founder of QuantCell Research and a
former product manager for Java at Sun Microsystems. An experienced
product and marketing manager, he oversaw the launch of the first
Java platform for Linux, and launched Sun’s Java Enterprise System
and SunGrid cloud computing utility. Prior to co-founding
QuantCell, he was the CMO and VP of the Dohop travel search engine,
worked at Compaq Computer Corporation, and headed marketing and
product management for an online banking service in 11
international markets. Kris holds an MBA degree from Rice
University and is a graduate of the University of Georgia.

 


 

