Hadoop – Think Big
The data revolution has started. Or has it?
The price of storage has fallen in recent years – even SSD storage
is expected to fall below a dollar per GB by the
end of 2012. Clusters with thousands of machines are no longer
the preserve of big corporations. With technologies like
Apache Hadoop and HBase, a “throw nothing away culture” has become
common. Data scientists are happily sifting through all that raw
data finding ever more valuable information and extracting
business-relevant pieces that provide competitive advantages.
Machine learning helps solve problems that even a few years ago
seemed intractable. The data revolution has started. Or has it?
Reading publications on big data today gives the impression that
development processes have to be reshaped: growing amounts of data
have to be stored to remain competitive. Projects that deal with
storing, analysing and visualising data are hyped all over the
place. When taking a closer look at individual solutions in
production today, the view gets much less coherent – often
solutions are built from several different pieces at varying
maturity. It is not unusual for developers to take longer than
originally estimated to put an architecture to work.
This article tries to put data-driven development in context –
comparing and contrasting it with what is already being done in
practice. It shows how the big data technologies fit together and
where yellow elephants and machines that adapt them fit in.
Let’s start with a hypothetical example of building a web shop.
When brainstorming, requirements that come to mind are things like
a place to store products, user data and user transactions: active
ones (e.g. “Mary bought new headphones, product ID is
9887876, they are in shipment”) and ideally also past ones, to be
able to make better product recommendations for Mary later. Once
the shop is in place, we may guess what changes and improvements
users might find helpful. We might explicitly ask Mary for her
suggestions. However, the average user is lazy and provides
feedback only rarely.
It is easier to watch users online: which products
they look at, which ones they end up buying, which
pages have the highest rate of users leaving the site and which
pages are typical entry pages. When going down that road, development
turns into a cycle of four steps:
The four steps of development:
1. Observe how users interact with the application – that will
lead to finding multiple deficiencies with varying benefits when
fixed: Product search might not be ideal, in that users often
search for headphones in a particular colour, but colour might not
yet be indexed. Users might be searching for the “Buy” button as it
is not at the very top of the page.
2. Orient by defining in which direction the application should
be improved – decide which deficiency to fix in your next iteration
and define the criteria by which to measure how successful your fix
was, for example: make the colour of headphones navigable and
expect to double headphone sales within a month.
3. Decide what to implement – implementation details may differ.
In our toy example, options may be to make the colour of
headphones part of the index so users can search for it,
or to include it in a faceting user interface.
4. Act – implement the fix.
In the end, the cycle begins again by observing how users react
to your fix. That last step is where feedback on the current and
future implementation is gathered. The tighter that cycle can be
made, the faster feedback can flow back into development and the
more likely a project is to outpace its competitors (Figure 1).
Fig. 1: The four steps of development
This development cycle is not surprising, nor is it
particularly new. It is what has been done implicitly in
any successful implementation. However, this explicit form shows
that there are steps that can greatly benefit from gathering
well-defined additional data sets or using already existing
Observation is greatly supported by keeping track of any user
behaviour – both common and unusual, both successful and
unsuccessful. Each time a user successfully interacts with the web
shop, it shows what functions work particularly well. Unsuccessful
interaction reveals deficiencies and room for improvement. In
orientation, a metric is defined against which the success of a new
feature is then evaluated. Making this step explicit and measurable
gives clear numbers against which changes should be compared. It
also makes the goal of a change explicit – and helps define if and
when that goal is reached.
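A success criterion such as "double headphone sales within a month" can be written down as a small check. The following is only a minimal sketch: the function name, the sales figures and the default target factor are invented for illustration.

```python
def goal_reached(baseline_sales, current_sales, target_factor=2.0):
    """Return True if sales grew by at least target_factor over the baseline."""
    if baseline_sales <= 0:
        raise ValueError("baseline_sales must be positive")
    return current_sales / baseline_sales >= target_factor


# Hypothetical numbers: headphones sold in the month before and after the fix.
before, after = 120, 250
print(goal_reached(before, after))
```

Writing the criterion down before the change ships keeps the later evaluation from being adjusted to fit the result.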
A second goal of collecting interaction data is to use that data
to provide better services to individual users. Most likely Mary
will not want her search results to include a pair of headphones
she has given a bad review, so she should get different search
results. She might also be interested in a very specific type of
music and be very happy when presented with compilations that she
likes.
In the end, both ways of using interaction data result in the
flowchart displayed in Figure 2.
Fig. 2: Two types of interaction data
Is either way of thinking about data gathering new? Not really.
Interaction data has been gathered since the early days of the
internet. Users tend to do all sorts of interesting stuff within
the infrastructure they are given. As a result, service providers
started very early with logging user interactions – if only to
diagnose what caused a problem after the fact.
More common types of requirements engineering on that level
include machine sizing based on past traffic patterns and hardening
a setup against observed intrusion attempts. Typical features based on
user data include showing only new content to readers of an online
magazine or showing different content based on the origin of the
The data sources used usually include web server access logs,
health check results and response time logs. All of these come in
different but usually text-based formats that are treated with
tools like sed, awk or Python scripts. Results are then presented
in custom dashboards, gnuplot graphs or even semi-standard tooling
for log analysis – AWStats being one prominent and well-known
example for web server logs.
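As an illustration of this kind of script, here is a minimal sketch that counts requests per path and per status code. It assumes access logs in the widespread Common Log Format; the sample lines are made up, and a real log may well use a different layout.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarise(lines):
    """Count requests per path and per status code, skipping unparsable lines."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        paths[match.group("path")] += 1
        statuses[match.group("status")] += 1
    return paths, statuses


# Made-up sample lines, including one that does not parse.
sample = [
    '10.0.0.1 - - [10/Jan/2012:13:55:36 +0100] "GET /products/9887876 HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Jan/2012:13:55:40 +0100] "GET /products/9887876 HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Jan/2012:13:56:02 +0100] "GET /checkout HTTP/1.1" 500 312',
    'garbage line',
]
paths, statuses = summarise(sample)
print(paths.most_common(1))
print(statuses["500"])
```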
Let’s go one step further and look at application-specific data:
These include customer databases, transaction logs and the like.
What can be learnt from this?
In terms of new requirements, we might find out that all
customers come from Europe so marketing should expand to different
countries. Based on the demographics of users, the site itself may
need different features – your average technology-savvy customer
has different information needs than someone who is barely able to
features, one might want to recommend products to users based on
their individual past behaviour.
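The question of where customers come from is a plain aggregation over customer data. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for whatever relational database is actually in use; the table layout and the rows are invented for illustration.

```python
import sqlite3

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Mary", "DE"), ("John", "FR"), ("Anna", "DE"), ("Luis", "ES")],
)

# Number of customers per country, largest group first.
rows = conn.execute(
    "SELECT country, COUNT(*) AS n FROM customers "
    "GROUP BY country ORDER BY n DESC, country"
).fetchall()
print(rows)
conn.close()
```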
Tools used here are standard relational databases. For analysis
there is a standard query language, with some custom extensions
depending on the database system provider. For visualisation,
developers can come up with custom reports or use tools like
Now what if log sizes outgrow the storage or compute capacity of
one single machine? What if analysis of a customer database takes
too long on a single machine or is too expensive to speed up by
scaling this one machine up? There is a very simple answer: Take
The not-so-simple, but probably more correct answer is: You
won’t get around a distributed system for data analysis. When
opting for commodity hardware, your best bet is to use Apache
Hadoop as the basis of your system. What you need on top depends on
your specific case.
Hadoop itself comes with two components: HDFS and MapReduce.
HDFS provides a distributed filesystem on top of your GNU/Linux
filesystem (Windows is not officially supported yet). It is built
on the assumption that the system runs on commodity hardware, so
failures do happen. To cope, it comes with auto failover and
built-in replication. It also assumes that you are operating
hardware on which a disk scan is cheaper than a disk seek – this
assumption holds true even for SSDs. The third assumption is that
your system
will process huge amounts of data – as moving data is more
expensive than moving computation, Hadoop will try to run
processing as close to the data as possible, ideally on the same
machine the data is stored on. MapReduce then provides the
programming library to make efficient use of HDFS’ features.
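To make the programming model concrete, here is a single-process sketch of the three stages of a MapReduce job: map, shuffle (grouping by key) and reduce. It is not the Hadoop API. The function names and the tab-separated record layout are assumptions for illustration, and a real job would run the same logic distributed across machines close to the data.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (product_id, 1) for one 'user<TAB>product' view event."""
    user, product = record.split("\t")
    yield product, 1


def reduce_phase(key, values):
    """Sum up all counts that were emitted for one product."""
    return key, sum(values)


def run_job(records):
    # Shuffle: group all mapper output by key, as the framework would do.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


views = ["mary\t9887876", "john\t9887876", "mary\t1234567"]
print(run_job(views))
```

Because mapper and reducer only ever see one record or one key at a time, the framework is free to spread them over as many machines as the data requires.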
As HDFS only provides raw file storage, there are multiple
libraries that provide structured storage of data – in a compressed
binary format that provides upgradable data structures. Examples
are Avro, Thrift and Protocol Buffers.
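What "upgradable" means can be shown without any of these libraries. The sketch below uses plain dictionaries, not the API or binary format of Avro, Thrift or Protocol Buffers, and the schema and field names are hypothetical. The idea it illustrates is that a reader with a newer schema fills in defaults for fields old records lack, while a reader with an older schema ignores fields it does not know.

```python
# Hypothetical schema versions: field name -> default used when a field is missing.
SCHEMA_V1 = {"user": None, "product_id": None}
SCHEMA_V2 = {"user": None, "product_id": None, "colour": "unknown"}


def read_record(stored, reader_schema):
    """Project a stored record onto the reader's schema.

    Fields missing from the stored record take the reader's default;
    fields the reader does not know are dropped.
    """
    return {
        field: stored.get(field, default)
        for field, default in reader_schema.items()
    }


old_record = {"user": "mary", "product_id": 9887876}  # written with v1
new_record = {"user": "mary", "product_id": 9887876, "colour": "red"}  # written with v2

print(read_record(old_record, SCHEMA_V2))
print(read_record(new_record, SCHEMA_V1))
```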
HDFS itself is well-suited for background processing, but less
suitable for online use. If you have so many users that standard
databases can no longer cope with the traffic and you’d like your
system to integrate well with Hadoop (in particular with MapReduce
jobs), Apache HBase and Hypertable are the systems to look
at for storage. Both provide good online access performance.
When it comes to analysis, developers will need to write
MapReduce jobs. There is a Java API (including a separate library
that makes unit testing easier) as well as a C++ API. However,
Hadoop also comes with a streaming interface that makes it possible
to use STDIN and STDOUT for processing thus enabling e. g. any
However, sometimes developers will not need to touch either
low-level programming language. Instead, query languages such as
Pig Latin or Hive’s query language, which is very close to SQL, are
sufficient for most tasks. Where their syntax is not enough,
developers have the option to extend the language with their own
Cascading tries to provide a middle ground between low-level
Java MapReduce and high-level Pig. Apache Giraph focuses on
processing graph data and includes operators that are particularly
important when analysing graphs.
When writing jobs, the need to chain multiple MapReduce phases
quickly arises. One might want to mix and match phases written in
Java and those written in Pig, or trigger processing based on a
timer or on data availability. Those tasks can be solved by Oozie –
a workflow manager for Hadoop.
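The core of chaining phases is running steps in an order that respects their dependencies. The sketch below shows only that idea, using Python's standard graphlib module. It does not use Oozie or its workflow format, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: step -> set of steps that must finish first.
workflow = {
    "import_database": set(),
    "collect_logs": set(),
    "join_logs_with_customers": {"import_database", "collect_logs"},
    "build_report": {"join_logs_with_customers"},
}


def execution_order(steps):
    """Return one valid order in which the steps can be run."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(workflow)
print(order)
```

The two steps without prerequisites may come out in either order; only the constraints between dependent steps are guaranteed.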
What if instead of providing simple statistics the goal is to
automatically group users according to the types of products they
buy? What if the goal is to classify users by how likely they are
to prefer a certain type of product? What if the goal is to predict
what queries a user is most likely to type next? Most of the more
sophisticated problems can be phrased in a way that makes them
simple to solve with automated methods.
Apache Mahout helps with problems that are of large scale and comes
with Hadoop bindings where necessary.
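Grouping users by what they buy is a clustering problem. As a toy illustration, here is a minimal in-memory sketch of k-means, one common clustering algorithm. It is not Mahout's API; the purchase counts and the starting centroids are invented, and a library built for large scale would distribute the same two steps of assigning and recomputing.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up purchase counts per user: (headphones, kitchenware)
users = [(5, 0), (4, 1), (6, 0), (0, 5), (1, 4), (0, 6)]
centroids, clusters = k_means(users, centroids=[(5, 0), (0, 5)])
print(clusters)
```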
In terms of visualisation, again, what is left to do is to
create custom reports and then come up with custom dashboards.
It is also possible to export data to your preferred file format
and then visualise results in common tools like Gephi for graph
visualisation, or to import data into regular relational databases
and use their tools for visualisation.
It’s never that simple in the real world
What we have looked at so far is a great, coherent
infrastructure entirely based on Apache Hadoop. However,
infrastructure in the real world never is all that simple. There
will be relational databases left running that are just too
expensive to migrate. If not integrated well with a Hadoop setup,
those will remain isolated data silos that no one can benefit from
except the original developers. MapReduce jobs running directly
against the database would bring down that system very quickly:
Hadoop is very efficient at generating DoS attacks against
existing systems. Instead, it is common practice to import database
content into HDFS on a regular schedule. With Sqoop there is a tool
that has been proven to work reliably.
Now all that is left is a working solution to get log files from
distributed systems back into HDFS. With Flume, Scribe, and Chukwa,
there are three systems available to choose from that help with
distributed logging and provide Hadoop bindings.
Taking one step back, the result already looks quite clean: data
storage, analysis and visualisation are available, existing systems
are well integrated. However, now we have the software to scale the
system up to tens, hundreds, even thousands of machines. Those
numbers are no longer manually manageable by any operations team.
Instead, automated configuration management and the option for
controlled roll-back become ever more important. Tools like Chef
and Puppet for managing machines, as well as ZooKeeper for
configuration management, suddenly become crucial for scaling
(Figure 3).
Fig. 3: A complex but coherent jigsaw
The final picture no longer looks all that simple. In addition
to the many different pieces and best practices, there are various
steps that involve making a conscious decision for one system and
against another, if one is to avoid mixing and matching all sorts
of systems that all solve very common problems but differ in
What is to be gained by such a setup? The most important
advantage is integration of data storage. If data is no longer
separated in data silos, all sorts of applications become possible
and less expensive to build as data no longer has to be copied from
team to team but is available and processable to all. Also,
business reporting becomes simple by making it possible to
integrate all necessary data sources in a consistent way.
To be processable, however, means that the data format is known,
well-documented or at least self-describing. Suddenly logging is no
longer optional: it is needed for debugging and post-mortem
analysis in case of issues. All of a sudden, log files are an asset
in themselves – if they contain the right information, enough data
and valid entries.
In general, it has been observed that use cases for Hadoop are
identified easily after its introduction. However, one should not
take introduction of such a system all too lightly – the learning
curve for developers is still quite steep.
Systems are still to be considered at a DevOps level, which
means both development and operations have to work closely together
to make things work. Integration is not yet as polished as
with other, more common systems. The most promising
route is to identify an existing use case that would greatly
benefit from using Hadoop. Introducing Apache Hadoop, and all that
is needed around it, for this one use case will provide enough
experience to simplify adapting more and more use cases over time.
Having one driving project helps focus on real business needs and
figure out what the important features are. It also helps make
decisions as to which projects to use to solve real technical
problems.
Overall, Hadoop and its ecosystem provide unique features that
allow scaling to large, distributed systems on commodity hardware
at reasonable cost. The system has been used in production by both
larger (Yahoo!, Facebook, Twitter) and smaller (Neofonie GmbH,
nurago, nuxeo) companies to solve their analytics needs. Being an
Apache project, it is possible to get involved and have issues
fixed within reasonable time. Though release cycles grew longer
in the past, the project has since issued releases faster
to get features and bug fixes out to users at shorter intervals. In
addition, the project itself unites developers from different
backgrounds and brings together knowledge from teams with lots of
experience with distributed processing around the world, which does
not only help harden the implementation, but also provides a broad
basis to draw from when developing new features.
The release of version 1.0 after several years shows the
developers’ trust in the maturity of the code base. In upcoming
releases, the architecture will become much more flexible by
separating resource management from computation libraries. This
will enable developers to treat MapReduce as a simple programming
library and evolve MapReduce APIs independently of cluster
versions. 2012 will bring many improvements to the project. Make
sure to be part of the process by joining the user and developer
mailing lists. You can shape the process by putting in your
knowledge and work by providing valuable patches that solve still