Open source, Lucene, Nutch, Hadoop and more

Road to JAX London: A chat with Doug Cutting - Part 2

Doug Cutting


JAX: When you read the Google paper, did a lightbulb go off in your head?

DC: Very much so. You know, we saw that and it was clear that was a better way to do things. It took us, I don't know, maybe a six-month effort or so to really get things to a point where we could demonstrate MapReduce running, and it would be running much better than the crawler and the index before we had the MapReduce implementation. That was around 2004, when a lot of that research was done.

And then Yahoo! came along, and they had their own framework for distributed computing that they were building their web search on, and it had sort of outlived its useful life and was becoming stretched pretty thin. They too had read the papers from Google and thought that would be a good way to go, and thought the implementation that we'd built within Nutch was the one that was furthest along, and thought that doing this as open source would be a good way to go. And [they] wanted to join forces with Nutch.

They didn't want to do it in the context of an open source search project for legal reasons, and instead really wanted it to be split out into a separate distributed computing project. So the split of Hadoop from Nutch was really motivated by Yahoo!’s use. I think it was a good one, and I think the general-purpose distributed computing platform is obviously a more general thing than something that's doing crawling, and it needed to be done. And Yahoo! provided the impetus to do that, and that was done in January of 2006.

JAX: You probably get asked this a lot - did you ever envisage Hadoop becoming what it is today? Did you ever see it getting so huge?

DC: No, I didn't really - I started something that was useful beyond web search, certainly, and [I knew] there would be utility in having this general thing. I'd never been a big fan of relational databases, through my whole career. I've dabbled with them, and always found them insufficient for the sort of text search and web links I was doing, not an appropriate solution. But I also never had anything to do with enterprise software; I'd always worked at web companies and search projects and at Apple, on desktop computing OSes. It really wasn't something I spent any time thinking about. So, no, I didn't see this as having a huge impact across a lot of industries at the time. So, I've been very pleased to see it! [laughs]

JAX: Absolutely. Do you think we've reached the limits for storage/processing? Or have we only really scratched the surface at this point?

DC: I think we're definitely still in the early stages of where this stuff can be used, and what it can do. There's a long-term exponential trend in the affordability of hardware to store and process data; there's a long-term exponential trend in the consumption of that hardware by industry - by lots of industries, all industries, to store and process more of their data, and use it to improve their business. So, both of those obviously have something to do with one another, and I don't see it ending anytime soon. It's possible they'll start to slow, but there'll still continue to be massive growth.

And the approach to data processing that we've seen in Hadoop is really one that is more appropriate to keep track of these trends. If you're trying to store massive quantities of data, you need things which are designed from the ground up to scale as linearly as possible, and also to use the most economical hardware. The classic relational databases really aren't designed from the ground up that way - they're dated a bit, they came from a different era.

The other thing that's really exciting is looking at what Google's been doing more recently - they recently published the Spanner paper, talking about their F1 system. They're a few years ahead of the curve, ahead of the rest of us in this world. So they give a big glimpse of where we can go, and when they write these papers, they actually give us a roadmap of where we could go [laughs].

It's now looking like we can sort of have it all, we really can have transactional database systems that scale very far, that scale to a global level. You can have them across data centres, holding petabytes in the tables, and still responding to interactive queries. So we're not quite to that point in the open source ecosystem, but I think it's clear we're going to get there.

So in terms of features, we're on our way there. So I think that's going to cause even more adoption, more use cases. So you've got existing industries that are growing, and existing technological capabilities, and existing hardware economics, and those economics are going to get better, industries are going to mature and realise how they can use this more, and the technology's going to advance, and enable them to do even more. Yeah, I don't think it's over, by any means.

JAX: Some enterprises seem slightly hesitant to adopt Hadoop at this stage. Does there need to be a change of mindset there for adoption? What things need to be considered before picking up a Hadoop distribution?

DC: I think all enterprises are inherently conservative, and for good reason - when they adopt a new technology, they've got to support it for several years, so they need something they can support. They don't want to just run the 'next big thing' and be stuck with it, and that's largely why Cloudera was formed - to be able to support those kinds of enterprise customers, and give them that confidence, and provide them with an offering that is easy for them to use and give them a partner in supporting it.

It's happening - we're seeing broad adoption, and I think half of the Fortune 50 are now customers, and are adopting this stuff. Most of them are still in an early stage - they're not yet deployed across the whole business. But it's spreading, all these companies are spreading their use of these technologies.

It won't happen overnight. If the core of business is in some data technology, it's hard to pick it up and move it. There's a lot of cases where the Hadoop stack is simply not ready for that, but there's also a lot of cases where it can be used today to give folks a real advantage against their competitors, and so we're seeing pretty steady growth and adoption.

JAX: What do you think is the biggest challenge for Hadoop to overcome, looking forward? Or is it a case of continuing the way it's going, and that'll probably see it through?

DC: I think the challenge is to meet the hype. I think we're doing it pretty well so far, that you really can store data and process it effectively. It is a very young technology - people's imaginations get ahead of them, ahead of the technology easily, so we need to control expectations. At the same time, we need to listen to those expectations, and if we can't meet them this year, see if we can meet them next year.

So far, I think we're doing pretty well with that - we've got a lot of folks adopting it. But there's a lot of areas for improvement. There's the security story - we need to be able to support things encrypted everywhere; we need to support online systems better, being able to do interactive queries, more complex online queries; just a lot of integration with various tools. So, plenty of work out there.

JAX: With the Hadoop 2.0 codebase, do you think it's heading in the right direction to address those problems?

DC: Yeah, there's been a lot of work on performance at the file system level, a lot of work on security. I mean definitely - the direction is set by the demands of the users, so I think almost by definition we're heading in the right direction [laughs]. Cloudera and others listen to the customers, and we build what people need most next. It's demand-driven, and hopefully we're listening to the right people. I think we are!



Chris Mayer
