Open source, Lucene, Nutch, Hadoop and more
Road to JAX London: A chat with Doug Cutting - Part 2
JAX: When you read the Google paper, did a lightbulb go off in your head?
DC: Very much so. You know, we saw that and it was clear that was a better way to do things. It took us, I don't know, maybe a six-month effort or so to really get things to a point where we could demonstrate MapReduce running, and it would be running much better than the crawler and the index before we had the MapReduce implementation. That was around 2004, when a lot of that research was done.
And then Yahoo! came along, and they had their own framework for
distributed computing that they were building their web search on,
and it had sort of outlived its useful life and was becoming
stretched pretty thin. They too had read the papers from Google and
thought that would be a good way to go, and thought the
implementation that we'd built within Nutch was the one that was
furthest along, and thought that doing this as open source would be
a good way to go. And [they] wanted to join forces with
Nutch.
They didn't want to do it in the context of an open source search
project for legal reasons, and instead really wanted it to be split
out into a separate distributed computing project. So the
split of Hadoop from Nutch was really motivated by Yahoo!’s
use. I think it was a good one, and I think the general-purpose
distributed computing platform is obviously a more general thing
than something that's doing crawling, and it needed to be done. And
Yahoo! provided the impetus to do that, and that was done in
January of 2006.
JAX: You probably get asked this a lot - did you ever
envisage Hadoop becoming what it is today? Did you ever see it
getting so huge?
DC: No, I didn't really - I started something that was useful
beyond web search, certainly, and [I knew] there would be utility
in having this general thing. I'd never been a big fan of
relational databases, through my whole career. I've dabbled with
them, and always found them insufficient for the sort of text
search and web link work I was doing, not an appropriate solution.
But I also never had anything to do with enterprise software; I'd
always worked at web companies and search projects and at Apple, on
desktop computing OSes. It really wasn't something I spent any time
thinking about. So, no, I didn't see this as having a huge impact
across a lot of industries at the time. So, I've been very pleased
to see it! [laughs]
JAX: Absolutely. Do you think we've reached the limits for
storage/processing? Or have we only really scratched the surface at
this point?
DC: I think we're definitely still in the early stages of where
this stuff can be used, and what it can do. There's a long-term
exponential trend in the affordability of hardware to store and
process data; there's a long-term exponential trend in the
consumption of that hardware by industry - by lots of industries,
all industries, to store and process more of their data, and use it
to improve their business. So, both of those obviously have
something to do with one another, and I don't see it ending anytime
soon. It's possible they'll start to slow, but there'll still
continue to be massive growth.
And the approach to data processing that we've seen in Hadoop is
really one that is more appropriate to keep track of these trends.
If you're trying to store massive quantities of data, you need
things which are designed from the ground up to scale as linearly
as possible, and also to use the most economical hardware. The
classic relational virtues really aren't designed from the ground
up that way - they're dated a bit, they came from a different
era.
The other thing that's really exciting is looking at what Google's
been doing more recently - they recently
published the Spanner paper, talking about their F1 system.
They're a few years ahead of the curve, ahead of the rest of us in
this world. So they give a big glimpse of where we can go, and when
they write these papers, they actually give us a roadmap of where
we could go [laughs].
It's now looking like we can sort of have it all, we really can
have transactional database systems that scale very far, that scale
to a global level. You can have them across data centres, holding
petabytes in the tables, and still responding to interactive
queries. So we're not quite to that point in the open source
ecosystem, but I think it's clear we're going to get there.
So in terms of features, we're on our way there. So I think that's
going to cause even more adoption, more use cases. So you've got
existing industries that are growing, and existing technological
capabilities, and existing hardware economics, and those economics
are going to get better, industries are going to mature and realise
how they can use this more, and the technology's going to advance,
and enable them to do even more. Yeah, I don't think it's over, by
any means.
JAX: Some enterprises seem slightly hesitant to adopt
Hadoop at this stage. Does there need to be a change of mindset
there for adoption? What things need to be considered before
picking up a Hadoop distribution?
DC: I think all enterprises are inherently conservative, and for
good reason - when they adopt a new technology, they've got to
support it for several years, so they need something they can
support. They don't want to just run the 'next big thing' and be
stuck with it, and that's largely why Cloudera was formed - to be able to
support those kinds of enterprise customers, and give them that
confidence, and provide them with an offering that is easy for them
to use and give them a partner in supporting it.
It's happening - we're seeing broad adoption, and we've got, I
think, half of the Fortune 50 are now customers, and are adopting
this stuff. Most of them are still in an early stage - they're not
yet deployed across the whole business. But it's spreading, all
these companies are spreading their use of these
technologies.
It won't happen overnight. If the core of business is in some data
technology, it's hard to pick it up and move it. There's a lot of
cases where the Hadoop stack is simply not ready for that, but
there's also a lot of cases where it can be used today to give
folks a real advantage against their competitors, and so we're
seeing pretty steady growth and adoption.
JAX: What do you think is the biggest challenge for Hadoop
to overcome, looking forward? Or is it a case of continuing the way
it's going, and that'll probably see it through?
DC: I think the challenge is to meet the hype. I think we're doing
it pretty well so far, in that you really can store data and process
it effectively. It is a very young technology - people's
imaginations get ahead of them, ahead of the technology easily, so
we need to control expectations. At the same time, we need to
listen to those expectations, and if we can't meet them this year,
see if we can meet them next year.
So far, I think we're doing pretty well with that - we've got a lot
of folks adopting it. But, there's a lot of areas for improvement.
There's the security story - we need to be able to support things
encrypted everywhere; we need to support online systems better,
being able to do interactive queries, more complex online queries;
just a lot of integration with various tools. So plenty of work out
there.
JAX: With the Hadoop 2.0 codebase, do you think it's
heading in the right direction to address those
problems?
DC: Yeah, there's been a lot of work on performance in the file
system level, a lot of work on security. I mean definitely - the
direction is set by the demands of the users, so I think almost by
definition we're heading in the right direction [laughs]. Cloudera
and others listen to the customers, and we build what people need
most next. It's demand-driven, and hopefully we're listening to the
right people. I think we are!