Interview with Hadoop PMC, Arun Murthy
We talk to the Hadoop PMC, Arun Murthy about the importance of Hadoop 1.0, how it got this far and what ‘The Swiss Army Knife of the 21st century’ needs to overcome in the future
It has been a busy old year at Hortonworks: They have been
at the heart of Hadoop’s development as they prepared to ship their
first stable release – laying down the marker for the future. Back
in January (just after the pivotal Apache Hadoop 1.0 release), we
talked to Hadoop PMC Chair and Hortonworks Founder and Architect
Arun Murthy, who has been part of the team driving Hadoop since the
very early days. We discussed the ideas behind Hadoop, what this
latest release means and more importantly, we asked him: where next
for the yellow elephant?
JAXenter: So after six years of development, Hadoop 1.0
has finally landed. How significant is this banner release and what
does it signal?
Hadoop has already been in
production in very large clusters in places like Yahoo and Facebook
for a very long time at this point. I started on the Hadoop team
six years ago: two years into the project, we were pretty much
using it extensively at Yahoo. It’s just got better and
I think the key with this release it signals to the rest of the
enterprise, especially the mid-to-late adopters that we as a
community are confident that this is stable and performant,
scalable and reliable and so on. More importantly, we can support
this in a compatible manner for a very long time. That’s a pretty
big deal for us.
JAX: What features are new this time round?
Murthy: In terms of features, it’s got end-to-end very strong data
security which is important for lots of enterprises. I mean if you
deal with sensitive data, financial data, user data – you
absolutely need strong security – it has to be auditable and this
provides that for the enterprise.
There’s very strong support for HBase, another project in the
Apache Ecosystem which is very popular. In terms of easing adoption
and easing integration, we’ve got something called mapHDFS, which
is a way of accessing the filesystem through an HTTP-based REST
API. The advantage of having that is that you don’t need any Java
clients or C clients and if you have existing tools and
infrastructure, you can easily plugin and start using the file
system. This will make it easier to consume data from webhdfs from
So overall, it’s got a bunch of features that improve
performance, scalability and so on. It’s been deployed on 50,000
nodes at Yahoo and around 10,000 nodes at Facebook so it’s
something we as a community are confident of deploying in a
compatible manner for a long time.
JAX: Why was it in development for so long? I suppose
you wanted to make sure you got it right first before introducing
Murthy: We’ve been in development for a long time, but
we’ve done lots of releases. Each one of them has been used
upgrading. It’s been massive for a few years. As I’ve said, this is
of more significance for mid-to-late adopters that they can be
confident that it will be supported for the next couple of years at
least. It’s more of a signal that it is mature.
JAX: For those late adopters, can you describe the key
components within Hadoop – HDFS and MapReduce?
Murthy: Well when we talk about Hadoop, there’s the core
of MapReduce and HDFS, but also the whole of the ecosystem. This
includes not only MapReduce and HDFS but HBase, Pig, Hive,
Zookeeper. There’s a whole bunch going on within the
So, HDFS is a massively scalable high throughput oriented
filesystem. The advantages of HDFS, and also with MapReduce and the
whole ecosystem, is that you assume that your hardware will bake
because its strewn across thousands of commodity hardware
These are not high-end boxes at $15-20,000 apiece: these are
commodity boxes that are around a couple thousand of dollars
apiece. The hard disks and the reliability of the individual box
isn’t great. But what HDFS does it tackle the complexity within the
software. It assumes that the hardware in the unit will fail. The
software will deal with the failures and we’ve spent a long time
working on this.
Also what it means is that because you are spread across a lot
of machines, you get lots of parallelism in the system. You get so
much parallelism within the system that your throughput goes up by
a lot. A high-end hard disk will give you about 200 to 300MB per
second. But a commodity box will only give you 60-100MB per second.
Now if you spread the I/O across thousands of these boxes, the
aggregate bandwidth you get is massive as you are multiplying over
thousands of servers. People have known this for a long time but
the hard part is dealing with all these failures, so you need a
software layer to paper over all the cracks.
So HDFS is the file system that stores the data, MapReduce is
the processing system. One of the core things about Hadoop is that
both the file system and processing system are co-designed and
co-developed. This means that MapReduce deals with the distributed
data like its GFS under the hood and it makes a lot of
optimisations because it knows the distributed nature of the data.
Instead of moving the data onto the processing system, you compute
onto the node where the data is present. It understands that you
have a rack, and that a rack has so many nodes so even if it can
find available processing time on that particular node, it will try
and find a node on the same rack, so your inter-rack bandwidth is
much higher than your off-rack bandwidth.
As well as being co-designed and co-developed, they are both
deployed in the same manner. You don’t have a compute network and
storage network, you have a single cluster on which you do both
compute and storage. That’s a big deal, right?
And similarly for HDFS, it’s not only aware that your hard disk
could go, it is aware that one of your nodes can go or that one of
your racks could go. You get multi-replicas of the blocks. A file
is made of lots of blocks, each of which is replicated and the file
system makes sure it is replicated in such a manner that even a
single disk, rack or node failing will not affect it. For example,
we make sure we have replicas on at least two racks. The typical
replica count is three. So if one rack goes down, you have access
to that data on another rack.
JAX: How essential is Apache to being the epicentre for
Hadoop, and what benefits does that give to the wider
Murthy: Apache, in my opinion, is probably the most
established open source foundation at this point. It’s got
tremendous amounts of history and pedigree delivering great open
source software, starting from HTTPd which is the most popular web
and valued web server. Things like Tomcat, hundreds of projects and
of course now, Hadoop and its ecosystem.
With that pedigree and history, the Apache Foundation
understands open source projects and how to manage very successful
open source projects. The ASF is great at that and it always places
the community at its core, which means that even if a few key
developers get hit by a bus or whatever, you makes sure there’s a
wider community behind it so it’s not built on a set of ‘superstar’
or ‘rockstar’ developers.
The ASF also employs the Apache license, which again is a very
enterprise-friendly license. So lots of enteprises are developing
Apache and contributing to Apache. That’s massive when you’re
talking about really successful projects like this. At the end of
the day, developers have to get food and shelter. You have
employers you need to fund. By making sure that the ASF is friendly
to enterprises, it gets a lot of people to contribute to it. For
example, when we were at Yahoo, it was at least an option to
develop with the ASF; with something else, it would be harder.
Yahoo was confident about investing millions of dollars into it,
from funding the developer to running the software, ie 50,000
nodes. As a result the two big things with the ASF are that you can
build a community and it’s also friendly to enterprises thus gets a
lot more adoption.
JAX: You work with Hortonworks, I suppose with some many
companies springing up, there’s a lot of variations in
distributions. How important is that to Hadoop’s progression to
being accepted as the gold standard?
Murthy: By having new projects, people see the value
proposition. Hadoop is really cost-effective and renewing that
initial investment. It has also has many advantages such as dealing
with structured and unstructured data. At the end of the day, if
you are a bank or insurance company or whatever, you still need
enterprise support. Your CIO or CDO is not going to be happy if you
say you’re going to download software and play with it. It’s good
in the POC stage (proof of concept) but when you’re in production,
you definitely want someone supporting you. And that doesn’t just
apply to Hadoop, think of Linux for example, or Red Hat. It’s
exactly the same for most of open source projects The big deal with
our ecosystem, I mean me and most of the founders is that we’ve
been working with Hadoop since Day 1, for six years now. More
importantly, not only have we developed, but we deployed the
largest Hadoop install at Yahoo and supported them 24/7. For
example, I was leading of MapReduce and if anything went wrong over
those 50,000 nodes, it was my responsibility. Even among Yahoo we
were one group, lots of other groups were using it.
They’d fund the hardware, we’d run it, monitor it, manage it and
deploy it and all of that. As a result we’ve got a tremendous
amount of experience not only developing it, but supporting it at
scale. As a result, if you go to a bank or a company, they are
confident that they can trust us because we’ve done it for so
JAX: You mentioned a lot of sub-projects within Hadoop
earlier. Which one stands out most for you or do they all play a
Murthy: They all definitely play a part. Among the more
popular ones are HBase, Pig and Hive. So Pig offers a higher level
programming interface so you don’t have to write Java code. It
makes it easier to adopt for the non-Java programmer, and there’s
lots of them in the world.
Hive is an SQL-interface on top of MapReduce so you write a SQL
query and Hive will take that and translate it into a set of
MapReduce jobs. Both of these systems will take these queries and
translate it down to MapReduce, so MapReduce is the foundation for
both of these. The advantages for the user is that you can be an
analyst writing SQL queries or a Java programmer writing Java code
for MapReduce. It increases the scope and the ease with which you
can interact with Hadoop and get your job done.
Of course there’s HBase, the NoSQL or distributed database in
Hadoop, which is again very popular. The other interesting one for
me is HCatalog, which is a mandated system. So you have data on
your file system, those are raw bytes. We want to think of this in
terms of this is my user data, customer data, banking data etc.
HCatalog gives you that metadata system on top of the raw file
system. That’s a very important part of the ecosystem.
JAX: What do you make of Oracle’s decision to jump
first, so to speak, with their Hadoop offering – the Big Data
Appliance? And partnering with Cloudera.
Murthy: At the end of the day, it’s great to have a
partnership with Microsoft, you know. There’s Oracle,
there’s IBM. The most important
thing is that a lot of the big software vendors are validating
Hadoop at this point. So what it means is the end customer, an
interim bank for example, is now more and more confident that
Hadoop is mainstream and here to stay. The other important
part is that everyone is converging on the open source part of
Hadoop, i.e. the Apache Foundation. Again, we contribute a lot to
it. Thankfully, it means we’re not in a phase like the Unix wars of
the 80’s, where it would splinter into a million pieces and
everybody does their own thing. The good thing is not only are the
major vendors converging on Hadoop and the open source version of
it. It’s exciting times, I mean, Hadoop is quickly getting
established as the centre-piece of the big data ecosystem.
JAX: Recently, Hadoop was awarded the top prize at The
Guardian’s Innovation Awards here in the UK. It was dubbed ‘the
Swiss Army knife of the 21st century’. Do you feel that the wider
media are picking up on how important Hadoop could be?
Oh absolutely. The Media
Guardian Innovation Award last year was great. We also got an award
this year from InfoWorld at their Technology of the Year awards.
Hadoop is increasingly getting a lot of mainstream press and again,
this can only help its adoption. Every single SME (small to middle
enterprise) and vendors are looking at as well as the larger ones,
so Oracle, IBM, Microsoft. All of them are interested. This is good
stuff. Good news leading to more investment, more investment to
better software and it will repeat in some sense.
JAX: What is the biggest problem for Hadoop to overcome
in the future?
Murthy: At the end of the day, we must remember that
Hadoop has only been around for 5-6 years in the enterprise space:
there’s a long way to go yet. Some of the big ones to overcome for
example is the whole availability thing. Forget security which is
still massively important for enterprise adoption.
We at Hortonworks are working really hard on availability at
this point. We should have the high availability stuff done for
both the file system and processing system by the middle of the
year, so we’re excited to tick that box off. There’s disaster
recovery and snapshot and all that within the file system.
On the MapReduce side, so far Hadoop has only supported
MapReduce as a programming paradigm. As I’ve said, I’ve worked with
MapReduce for 6 years, I really like it.On the other hand, there
are loads of other data-processing gaps in which MapReduce is not
connected. You name it – high performance computing for example.
MapReduce is not the right programming paradigm, currently. It’s
not just the implementation, it’s the ideas. It’s not the right
design, in some sense.
So, what we’re doing with the next generation of Hadoop.
Actually we had a press release recently about it that talks about
it. The big deal in the next generation of MapReduce is that we can
support not just MapReduce but any other programming paradigm in
the world. We’ve generalised MapReduce to the point where you can
have data processing infrastructure and we can bring in different
applications, most will continue to be MapReduce but it can be MPI
or whatever it is.
That’s something I’ve been working on for something like 18-24
months now and we’re excited to be at a point where we can push it
out. It’s taken a lot of investment and we can start rolling it out
to customers. It’ll be on the Yahoo grid soon, with more
scalability and better performance. Lots of good stuff. We’re
confident we can continue to push Hadoop.