Originally from Java Tech Journal

Interview with Hadoop PMC Chair, Arun Murthy

It has been a busy old year at Hortonworks: They have been at the heart of Hadoop’s development as they prepared to ship their first stable release – laying down the marker for the future. Back in January (just after the pivotal Apache Hadoop 1.0 release), we talked to Hadoop PMC Chair and Hortonworks Founder and Architect Arun Murthy, who has been part of the team driving Hadoop since the very early days. We discussed the ideas behind Hadoop, what this latest release means and more importantly, we asked him: where next for the yellow elephant?

JAXenter: So after six years of development, Hadoop 1.0 has finally landed. How significant is this banner release and what does it signal?

Arun Murthy: Hadoop has already been in production in very large clusters in places like Yahoo and Facebook for a very long time at this point. I started on the Hadoop team six years ago: two years into the project, we were pretty much using it extensively at Yahoo. It’s just got better and better.

I think the key with this release is that it signals to the rest of the enterprise, especially the mid-to-late adopters, that we as a community are confident that this is stable, performant, scalable, reliable and so on. More importantly, we can support this in a compatible manner for a very long time. That’s a pretty big deal for us.

JAX: What features are new this time round?

Murthy: In terms of features, it’s got very strong end-to-end data security, which is important for lots of enterprises. I mean, if you deal with sensitive data, financial data, user data - you absolutely need strong security - it has to be auditable, and this provides that for the enterprise.

There’s very strong support for HBase, another project in the Apache ecosystem which is very popular. In terms of easing adoption and integration, we’ve got something called WebHDFS, which is a way of accessing the filesystem through an HTTP-based REST API. The advantage of having that is that you don’t need any Java clients or C clients, and if you have existing tools and infrastructure, you can easily plug in and start using the file system. This will make it easier to consume data from WebHDFS from existing applications.
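
To make the WebHDFS idea concrete, here is a minimal sketch of reading a file over its REST API from plain Java, with no Hadoop client libraries involved. The host name, port (50070 was the typical NameNode web port in that era) and file path are placeholders for your own cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: read an HDFS file over the WebHDFS REST API.
// Host, port and path below are placeholders, not values from the interview.
public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // OPEN returns the file contents; the NameNode redirects the request
        // to a DataNode that holds the data.
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/user/demo/data.txt?op=OPEN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // follow the redirect to the DataNode

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Because it is just HTTP, the same call works from curl, a browser, or any language with an HTTP client, which is exactly the integration point Murthy describes.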

So overall, it’s got a bunch of features that improve performance, scalability and so on. It’s been deployed on 50,000 nodes at Yahoo and around 10,000 nodes at Facebook so it’s something we as a community are confident of deploying in a compatible manner for a long time.

JAX: Why was it in development for so long? I suppose you wanted to make sure you got it right first before introducing Hadoop 1.0?

Murthy: We’ve been in development for a long time, but we’ve done lots of releases along the way. Each one of them has been used in production and upgraded. It’s been in massive use for a few years. As I’ve said, this is of more significance for mid-to-late adopters: they can be confident that it will be supported for the next couple of years at least. It’s more of a signal that it is mature.

JAX: For those late adopters, can you describe the key components within Hadoop - HDFS and MapReduce?

Murthy: Well, when we talk about Hadoop, there’s the core of MapReduce and HDFS, but also the whole of the ecosystem. This includes not only MapReduce and HDFS but also HBase, Pig, Hive and ZooKeeper. There’s a whole bunch going on within the ecosystem.

So, HDFS is a massively scalable, high-throughput-oriented filesystem. The advantage of HDFS, and also of MapReduce and the whole ecosystem, is that you assume your hardware will break, because it’s strewn across thousands of commodity hardware servers.

These are not high-end boxes at $15-20,000 apiece: these are commodity boxes at around a couple of thousand dollars apiece. The hard disks and the reliability of the individual box aren’t great. But what HDFS does is tackle that complexity within the software. It assumes that the hardware in the unit will fail. The software deals with the failures, and we’ve spent a long time working on this.

Also, what it means is that because you are spread across a lot of machines, you get lots of parallelism in the system. You get so much parallelism that your throughput goes up by a lot. A high-end hard disk will give you about 200 to 300MB per second, but a commodity box will only give you 60-100MB per second. Now if you spread the I/O across thousands of these boxes, the aggregate bandwidth you get is massive, because you are multiplying over thousands of servers. People have known this for a long time, but the hard part is dealing with all these failures, so you need a software layer to paper over all the cracks.
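
The arithmetic behind that claim is worth spelling out. Using the rough per-disk figures Murthy quotes and an assumed cluster size of 1,000 nodes (an illustrative number, not one from the interview):

```java
// Back-of-the-envelope sketch of the aggregate-bandwidth argument above.
// Per-disk figures are the rough numbers quoted in the interview;
// the 1,000-node cluster size is an assumption for illustration.
public class AggregateBandwidth {
    public static void main(String[] args) {
        double highEndDiskMBps = 300;   // one high-end disk
        double commodityDiskMBps = 80;  // one commodity disk
        int nodes = 1000;               // assumed cluster size

        double aggregateMBps = commodityDiskMBps * nodes;
        System.out.printf("Single high-end disk:    %.0f MB/s%n", highEndDiskMBps);
        System.out.printf("%d commodity disks:    %.0f MB/s (~%.0f GB/s aggregate)%n",
                nodes, aggregateMBps, aggregateMBps / 1024);
    }
}
```

Even though each individual disk is slower, the cluster as a whole delivers orders of magnitude more I/O than any single machine could.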

So HDFS is the file system that stores the data, and MapReduce is the processing system. One of the core things about Hadoop is that both the file system and the processing system are co-designed and co-developed. This means that MapReduce knows the distributed data lives in HDFS, the GFS-like file system underneath, and it makes a lot of optimisations because it knows the distributed nature of the data. Instead of moving the data to the processing system, you move the computation to the node where the data is present. It understands that you have a rack and that a rack has so many nodes, so even if it can’t find available processing capacity on that particular node, it will try to find a node on the same rack, because your intra-rack bandwidth is much higher than your off-rack bandwidth.
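
The scheduling preference Murthy describes can be sketched in a few lines. The following is a hypothetical illustration of the node-local, then rack-local, then off-rack fallback; the class and field names are made up for clarity and are not Hadoop’s actual scheduler code.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the locality preference described above:
// prefer a node that already holds the block (node-local), then a free node
// on the same rack (rack-local), and only then any free node (off-rack).
public class LocalityChooser {

    record Node(String name, String rack, boolean hasFreeSlot) {}

    static Optional<Node> choose(List<Node> replicaHolders, List<Node> allNodes) {
        // 1. Node-local: a node that stores a replica and has a free slot.
        Optional<Node> nodeLocal = replicaHolders.stream()
                .filter(Node::hasFreeSlot)
                .findFirst();
        if (nodeLocal.isPresent()) return nodeLocal;

        // 2. Rack-local: any free node sharing a rack with a replica holder.
        Optional<Node> rackLocal = allNodes.stream()
                .filter(Node::hasFreeSlot)
                .filter(n -> replicaHolders.stream()
                        .anyMatch(r -> r.rack().equals(n.rack())))
                .findFirst();
        if (rackLocal.isPresent()) return rackLocal;

        // 3. Off-rack: fall back to any free node in the cluster.
        return allNodes.stream().filter(Node::hasFreeSlot).findFirst();
    }

    public static void main(String[] args) {
        Node a = new Node("node-a", "rack-1", false); // holds the block, but busy
        Node b = new Node("node-b", "rack-1", true);  // same rack, free
        Node c = new Node("node-c", "rack-2", true);  // other rack, free
        System.out.println(choose(List.of(a), List.of(a, b, c))); // picks node-b (rack-local)
    }
}
```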

As well as being co-designed and co-developed, they are both deployed in the same manner. You don’t have a compute network and a storage network; you have a single cluster on which you do both compute and storage. That’s a big deal, right?

And similarly, HDFS is not only aware that your hard disk could go; it’s aware that one of your nodes could go or that one of your racks could go. You get multiple replicas of the blocks. A file is made up of lots of blocks, each of which is replicated, and the file system makes sure the replication is done in such a manner that even a single disk, node or rack failing will not affect it. For example, we make sure we have replicas on at least two racks. The typical replica count is three. So if one rack goes down, you still have access to that data on another rack.
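
For reference, the replica count Murthy mentions is something an application can control through the standard Hadoop client API. This is a minimal sketch, assuming a reachable cluster and a placeholder file path; in practice the default is usually set cluster-wide via dfs.replication in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of controlling the HDFS replica count from client code.
// The file path is a placeholder for illustration.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default for files this client writes

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/important.dat");

        // Raise the replication factor of an existing file to three replicas.
        fs.setReplication(file, (short) 3);

        System.out.println("Replication of " + file + " is now "
                + fs.getFileStatus(file).getReplication());
    }
}
```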

JAX: How essential is Apache to being the epicentre for Hadoop, and what benefits does that give to the wider ecosystem?

Murthy: Apache, in my opinion, is probably the most established open source foundation at this point. It’s got a tremendous amount of history and pedigree delivering great open source software, starting with HTTPd, which is the most popular and valued web server, then things like Tomcat, hundreds of projects, and of course now Hadoop and its ecosystem.

With that pedigree and history, the Apache Foundation understands open source projects and how to manage very successful ones. The ASF is great at that and it always places the community at its core, which means that even if a few key developers get hit by a bus or whatever, you make sure there’s a wider community behind it, so it’s not built on a set of ‘superstar’ or ‘rockstar’ developers.

The ASF also employs the Apache license, which again is a very enterprise-friendly license. So lots of enterprises are developing with Apache and contributing to Apache. That’s massive when you’re talking about really successful projects like this. At the end of the day, developers have to get food and shelter; you need employers to fund that. By making sure that the ASF is friendly to enterprises, it gets a lot of people to contribute to it. For example, when we were at Yahoo, it was at least an option to develop with the ASF; with something else, it would have been harder. Yahoo was confident about investing millions of dollars into it, from funding the developers to running the software, i.e. 50,000 nodes. As a result, the two big things with the ASF are that you can build a community and that it’s friendly to enterprises, and thus it gets a lot more adoption.

JAX: You work with Hortonworks, and I suppose with so many companies springing up, there are a lot of variations in distributions. How important is that to Hadoop’s progression to being accepted as the gold standard?

Murthy: By having new projects, people see the value proposition. Hadoop is really cost-effective and repays that initial investment. It also has many advantages, such as dealing with structured and unstructured data. At the end of the day, if you are a bank or an insurance company or whatever, you still need enterprise support. Your CIO or CDO is not going to be happy if you say you’re going to download software and play with it. That’s fine in the POC (proof of concept) stage, but when you’re in production, you definitely want someone supporting you. And that doesn’t just apply to Hadoop: think of Linux, for example, or Red Hat. It’s exactly the same for most open source projects.

The big deal with our ecosystem, I mean me and most of the founders, is that we’ve been working with Hadoop since day one, for six years now. More importantly, not only have we developed it, but we deployed the largest Hadoop install at Yahoo and supported it 24/7. For example, I was leading MapReduce, and if anything went wrong across those 50,000 nodes, it was my responsibility. Even within Yahoo we were one group; lots of other groups were using it.

They’d fund the hardware; we’d run it, monitor it, manage it, deploy it and all of that. As a result we’ve got a tremendous amount of experience, not only developing it but supporting it at scale. So when we go to a bank or a company, they are confident they can trust us because we’ve done it for so long.

JAX: You mentioned a lot of sub-projects within Hadoop earlier. Which one stands out most for you or do they all play a part?

Murthy: They all definitely play a part. Among the more popular ones are HBase, Pig and Hive. Pig offers a higher-level programming interface so you don’t have to write Java code. It makes Hadoop easier to adopt for the non-Java programmer, and there are lots of them in the world.

Hive is an SQL interface on top of MapReduce: you write a SQL query and Hive will take that and translate it into a set of MapReduce jobs. Both of these systems take their queries and translate them down to MapReduce, so MapReduce is the foundation for both of them. The advantage for the user is that you can be an analyst writing SQL queries or a Java programmer writing Java code for MapReduce. It increases the scope and the ease with which you can interact with Hadoop and get your job done.
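
To give a concrete sense of the MapReduce layer that Pig and Hive compile down to, here is the classic word-count job written directly against the Hadoop Java API. It is a generic illustration of a MapReduce job (with input and output paths taken from the command line), not the code Hive would literally generate for a query.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic word-count job: the kind of map and reduce functions that
// higher-level tools like Pig and Hive generate behind the scenes.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same computation is a one-line GROUP BY in Hive or Pig, which is exactly the point Murthy makes about widening who can use the platform.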

Of course there’s HBase, the NoSQL or distributed database in Hadoop, which is again very popular. The other interesting one for me is HCatalog, which is a metadata system. You have data on your file system, and those are raw bytes. We want to think of it in terms of ‘this is my user data, customer data, banking data’ and so on. HCatalog gives you that metadata system on top of the raw file system. That’s a very important part of the ecosystem.

JAX: What do you make of Oracle’s decision to jump first, so to speak, with their Hadoop offering - the Big Data Appliance? And partnering with Cloudera.

Murthy: At the end of the day, it’s great to have a partnership with Microsoft, you know. There’s Oracle, there’s IBM. The most important thing is that a lot of the big software vendors are validating Hadoop at this point. What it means is that the end customer, a bank for example, is now more and more confident that Hadoop is mainstream and here to stay. The other important part is that everyone is converging on the open source part of Hadoop, i.e. the Apache Foundation. Again, we contribute a lot to it. Thankfully, it means we’re not in a phase like the Unix wars of the ’80s, where it splintered into a million pieces and everybody did their own thing. The good thing is that the major vendors are converging not only on Hadoop but on the open source version of it. It’s exciting times; I mean, Hadoop is quickly getting established as the centre-piece of the big data ecosystem.

JAX: Recently, Hadoop was awarded the top prize at The Guardian’s Innovation Awards here in the UK. It was dubbed ‘the Swiss Army knife of the 21st century’. Do you feel that the wider media are picking up on how important Hadoop could be?

Murthy: Oh absolutely. The Media Guardian Innovation Award last year was great. We also got an award this year from InfoWorld at their Technology of the Year awards. Hadoop is increasingly getting a lot of mainstream press, and again, this can only help its adoption. Every single SME (small to medium enterprise) and vendor is looking at it, as well as the larger ones: Oracle, IBM, Microsoft. All of them are interested. This is good stuff. Good news leads to more investment, more investment leads to better software, and it will repeat in some sense.

JAX: What is the biggest problem for Hadoop to overcome in the future?

Murthy: At the end of the day, we must remember that Hadoop has only been around for five or six years in the enterprise space: there’s a long way to go yet. One of the big ones to overcome, for example, is the whole availability thing. Not to forget security, which is still massively important for enterprise adoption.

We at Hortonworks are working really hard on availability at this point. We should have the high-availability work done for both the file system and the processing system by the middle of the year, so we’re excited to tick that box. There’s also disaster recovery and snapshots and all that within the file system.

On the MapReduce side, so far Hadoop has only supported MapReduce as a programming paradigm. As I’ve said, I’ve worked with MapReduce for six years and I really like it. On the other hand, there are loads of other data-processing use cases to which MapReduce is not suited. You name it: high-performance computing, for example. MapReduce is not the right programming paradigm there, currently. It’s not just the implementation, it’s the ideas. It’s not the right design, in some sense.

So that’s what we’re doing with the next generation of Hadoop; actually, we had a press release recently that talks about it. The big deal in the next generation of MapReduce is that we can support not just MapReduce but any other programming paradigm in the world. We’ve generalised MapReduce to the point where you have a generic data-processing infrastructure and we can bring in different applications; most will continue to be MapReduce, but it can be MPI or whatever it is.

That’s something I’ve been working on for around 18-24 months now, and we’re excited to be at a point where we can push it out. It’s taken a lot of investment, and we can start rolling it out to customers. It’ll be on the Yahoo grid soon, with more scalability and better performance. Lots of good stuff. We’re confident we can continue to push Hadoop.

Chris Mayer
