Open source, Lucene, Nutch, Hadoop and more

Road to JAX London: A chat with Doug Cutting

Chris Mayer

With JAX London a few weeks away, we caught up with one of the keynoters – Doug Cutting, the creator of Big Data sensation Hadoop.

JAX London and Big Data Con are just a few weeks
away and naturally, we’re very excited about what’s in store for
the events. It seemed like the perfect time to catch up with those
keynoting at the event itself in what we’re calling the
“Road to JAX” series, which will cover some
important topics for the worlds of Java and Big Data.

Up first is a man who needs no introduction, but we’ll give
him one anyway. He’s the creator of some hugely influential open
source projects – the text search engine library Apache Lucene, its
web-focused follow-up Apache Nutch and the Big Data crunching
platform taking the industry by storm, Apache Hadoop. We had the
pleasure of talking to Doug Cutting, currently Architect at
Cloudera, about these projects and much more.

-----

JAX: How did you get into software development, and in
particular open source stuff?

 
Doug Cutting: When I was in college at Stanford in the early 80s,
it became pretty clear to me that the industry to get into was the
software industry. I started taking some classes and really liked
it. I pretty much decided I wanted to do something in that area from
that point on. I got some summer jobs working with folks from Xerox
PARC and really enjoyed working with those guys. Right after I
graduated, I actually went to Edinburgh for 18 months and worked
there on a research project on speech recognition. Then I came back
and worked at PARC for 5 years doing research, and then I was steered
in the direction of search engines.

From the start, I found I enjoyed programming – but it was also
always a career. I amassed some debts in college and didn’t want to
graduate without a job that I could pay them off with. I worked at
Xerox, Apple, Excite through the internet years, working on mostly
non-open source software. I had little things that ended up being
open source, but not a lot. Then around 2000, I had written this
program called Lucene, and I originally thought I might try and turn
it into a company of some sort, but realised I didn’t have the
stomach for that. So I decided to try it as open source and that
went really well. I enjoyed that; I found it very rewarding as a way
to work.

It’s interesting. Since roughly 2000, I’ve worked almost
exclusively on open source software, with a few breaks here and
there but mostly open source – but I’ve also earned a paycheck
every month in that time. Even for 5 years I was an independent
consultant helping people use Lucene and other software, not
working for any particular company.

Some people think of me as altruistic and I don’t know if that’s
really the case – I’ve totally gotten paid for the code I write.
More recently, I made the stipulation that I wanted the code to be
open source, but there’s usually someone who wants to have the
software and they want to have it as open source and they’re
willing to pay to have it written.

JAX: So you see yourself as a problem solver/troubleshooter
of sorts, as well as contributing heavily to open source
projects?

DC: I think I’ve been an instigator. In the case of Lucene, I
thought Java was a new platform and I’d been working on search
engines for a long time. I thought that a text search engine
in Java would be a good thing to have.

JAX: What was the allure of Java as a starting
point?

DC: The combination of relatively good performance as well as
high-level programming features and reliability. It’s relatively
difficult to crash the system, and that is combined with
the ability to move things between different platforms, between
various flavours of UNIX and Windows and Macs, relatively
painlessly.

I think there’s a real sweet spot there, certainly for
engineering, to be able to write things that, you know, run pretty
well, but you don’t spend a lot of time on these niggling details
of porting them, nor when you’re debugging them are you getting all
these strange memory leaks and memory errors that you find in C and
C++ programs.

It just seemed like a sweet spot. I mean, people have criticised
the Lucene and Hadoop projects for not being in C++, and there
would be some performance benefits, but we’d have moved a lot
slower if we were operating in that because there’s some inertia
when you have to debug in those languages.

JAX: So definitely the right decision when looking back on
it to stick with Java?

DC: A great decision for me certainly. It’s history, we can’t
change it. Somebody could have done it differently and done very
well but I think these projects are doing very well and speak for
themselves at this point as being effective tools.

JAX: That’s definitely true. You’ve talked about Lucene – I
was wondering how Nutch came about?

DC: It was in my days of doing consulting, freelance work based
around Lucene. Somebody approached me and said, “You know, it’d be
really neat if there was a full text search engine, a
crawler-based, web search engine, that was all open source. Would
you be interested in starting a project like that?” So I was like,
of course I would, I’d love to do that. I’d spent years at Excite
working on crawler-based search engines and saw a lot of work on
closed source ones.

I was of the somewhat naive belief that – I think I knew it was
naive at the time – pretty much all software eventually becomes a
commodity, and there’s an open source implementation of it. And I
thought web search engines should not be an exception. That hasn’t
really held true. I think the real web search engines are at
Google and Microsoft these days, with a few others around the world
– and they’re not open source. The amount of work it takes to
maintain one is oftentimes so – it hasn’t yet, anyway, yielded an
open source one that’s really world class, but it seemed like a
good idea at the time.

So I threw myself at it, got a couple of collaborators and we tried
to do what we could do. I was familiar with the way that we had
done distributed processing at Excite, which was fairly crude. Just
having a bunch of machines and managing processes on them manually,
and copying files around various stages. We had something that
would in theory scale arbitrarily to many machines. But it was
onerous to operate – a lot of manual steps in there.

And that was the time when I saw the papers from Google, talking
about how they did this stuff, and where they had automated all of
these manual steps, and had a framework. And it was pretty much the
same algorithms and data structures that were directly supported by
MapReduce…

So I think that was an obvious improvement, to go towards the
automation and try adding that to Nutch. But we had a working
system at that point, that was an open source project before we
read those papers, and then it got much better after we saw that
work.
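
For readers unfamiliar with the model Cutting refers to, the following is a minimal, single-process sketch of the map, shuffle and reduce steps, written in Python for brevity. It is an illustration only, not Nutch or Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this example.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Chain the three steps in one process (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(documents)))


if __name__ == "__main__":
    pages = ["open source search", "open source crawler"]
    print(word_count(pages))
```

On a real cluster the map and reduce functions would run on many machines, with the framework taking over the grouping step and the copying of files between stages that Cutting describes doing by hand at Excite.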

      

JAX: When you read the Google paper, did a lightbulb go off
in your head?

DC: Very much so. You know, we saw that and it was clear that was a
better way to do things. It took us, I don’t know, maybe a
six-month effort or so to really get things to a point where we
could demonstrate MapReduce running, and it would be running much
better than the crawler and the index before we had the MapReduce
implementation. That was around 2004, when a lot of that research
was done.

And then Yahoo! came along, and they had their own framework for
distributed computing that they were building their web search on,
and it had sort of outlived its useful life and was becoming
stretched pretty thin. They too had read the papers from Google and
thought that would be a good way to go, and thought the
implementation that we’d built within Nutch was the one that was
furthest along, and thought that doing this as open source would be
a good way to go. And [they] wanted to join forces with
Nutch.

They didn’t want to do it in the context of an open source search
project for legal reasons, and instead really wanted it to be split
out into a separate distributed computing project. So the
split of Hadoop from Nutch was really motivated by Yahoo!’s
use. I think it was a good one, and I think the general-purpose
distributed computing platform is obviously a more general thing
than something that’s doing crawling, and it needed to be done. And
Yahoo! provided the impetus to do that, and that was done in
January of 2006.

JAX: You probably get asked this a lot – did you ever
envisage Hadoop becoming what it is today? Did you ever see it
getting so huge?

DC: No, I didn’t really – I started something that was useful
beyond web search, certainly, and [I knew] there would be utility
in having this general thing. I’d never been a big fan of
relational databases, through my whole career. I’ve dabbled with
them, and always found them insufficient for the sort of text
search and web link work I was doing; they were not an appropriate
solution. But I also never had anything to do with enterprise
software; I’d always worked at web companies and search projects
and at Apple, on desktop computing OSes. It really wasn’t something I spent any time
thinking about. So, no, I didn’t see this as having a huge impact
across a lot of industries at the time. So, I’ve been very pleased
to see it! [laughs]

JAX: Absolutely. Do you think we’ve reached the limits for
storage/processing? Or have we only really scratched the surface at
this point?

DC: I think we’re definitely still in the early stages of where
this stuff can be used, and what it can do. There’s a long-term
exponential trend in the affordability of hardware to store and
process data; there’s a long-term exponential trend in the
consumption of that hardware by industry – by lots of industries,
all industries, to store and process more of their data, and use it
to improve their business. So, both of those obviously have
something to do with one another, I don’t see it ending anytime
soon. It’s possible they’ll start to slow, but there’ll still
continue to be massive growth.

And the approach to data processing that we’ve seen in Hadoop is
really one that is more appropriate to keep track of these trends.
If you’re trying to store massive quantities of data, you need
things which are designed from the ground-up to scale as linearly
as possible, and also using the most economical hardware. The
classic relational databases really aren’t designed from the ground
up that way – they’re dated a bit, they came from a different
era.

The other thing that’s really exciting is looking at what Google’s
been doing more recently – they recently published the Spanner
paper, talking about their F1 system.
They’re a few years ahead of the curve, ahead of the rest of us in
this world. So they give a big glimpse of where we can go, and when
they write these papers, they actually give us a roadmap of where
we could go [laughs].

It’s now looking like we can sort of have it all, we really can
have transactional database systems that scale very far, that scale
to a global level. You can have them across data centres, holding
petabytes in the tables, and still responding to interactive
queries. So we’re not quite to that point in the open source
ecosystem, but I think it’s clear we’re going to get there.

So in terms of features, we’re on our way there. So I think that’s
going to cause even more adoption, more use cases. So you’ve got
existing industries that are growing, and existing technological
capabilities, and existing hardware economics, and those economics
are going to get better, industries are going to mature and realise
how they can use this more, and the technology’s going to advance,
and enable them to do even more. Yeah, I don’t think it’s over, by
any means.

JAX: Some enterprises seem slightly hesitant to adopt
Hadoop at this stage. Does there need to be a change of mindset
there for adoption? What things need to be considered before
picking up a Hadoop distribution?

DC: I think all enterprises are inherently conservative, and for
good reason – when they adopt a new technology, they’ve got to
support it for several years, so they need something they can
support. They don’t want to just run the ‘next big thing’ and be
stuck with it, and that’s largely why Cloudera was formed – to be able to
support those kinds of enterprise customers, and give them that
confidence, and provide them with an offering that is easy for them
to use and give them a partner in supporting it.

It’s happening – we’re seeing broad adoption, and I think half
of the Fortune 50 are now customers, and are adopting
this stuff. Most of them are still in an early stage – they’re not
yet deployed across the whole business. But it’s spreading, all
these companies are spreading their use of these
technologies.

It won’t happen overnight. If the core of a business is in some data
technology, it’s hard to pick it up and move it. There’s a lot of
cases where the Hadoop stack is simply not ready for that, but
there’s also a lot of cases where it can be used today to give
folks a real advantage against their competitors, and so we’re
seeing pretty steady growth and adoption.

JAX: What do you think is the biggest challenge for Hadoop
to overcome, looking forward? Or is it a case of continuing the way
it’s going, and that’ll probably see it through?

DC: I think the challenge is to meet the hype. I think we’re doing
it pretty well so far, that you really can store data and process
it effectively. It is a very young technology – people’s
imaginations get ahead of them, ahead of the technology easily, so
we need to control expectations. At the same time, we need to
listen to those expectations, and if we can’t meet them this year,
see if we can meet them next year.

So far, I think we’re doing pretty well with that – we’ve got a lot
of folks adopting it. But, there’s a lot of areas for improvement.
There’s the security story – we need to be able to support things
encrypted everywhere; we need to support online systems better,
being able to do interactive queries, more complex online queries;
just a lot of integration with various tools. So plenty of work out
there.

JAX: With the Hadoop 2.0 codebase, do you think it’s
heading in the right direction to address those
problems?

DC: Yeah, there’s been a lot of work on performance in the file
system level, a lot of work on security. I mean definitely – the
direction is set by the demands of the users, so I think almost by
definition we’re heading in the right direction [laughs]. Cloudera
and others listen to the customers, and we build what people need
most next. It’s demand-driven, and hopefully we’re listening to the
right people. I think we are!

      

JAX: You recently described Hadoop as the ‘kernel’ of the
platform itself. What other Big Data technologies catch your eye at
the moment? There are quite a few incubating at the Apache Software
Foundation right now…

DC: I think it is pretty exciting that Hadoop has become this
kernel, and I think Bigtop is becoming the open source integration
point for getting all these parts to coordinate. I think the YARN
project within Hadoop, generalising the runtime of the kernel so
that we can support different kinds of processing, and things like
the Giraph project for graph processing, are gonna be pretty
useful. Then there’s the whole real-time processing area, which is
kind of a separate thread of development; it isn’t really as
integrated into the Hadoop stack. That’s interesting to watch,
though not really something I’ve been involved in, so things like
Storm and S4 should come into play more. HBase has long been the
primary online system within the stack, and I think in the next
year we’re going to see a lot more that is really integrated with
Hadoop and gives you interactive queries beyond the simple
key-value lookups of HBase. So you’re likely to see interactive SQL
queries, as well as SolrCloud in the Lucene camp, giving you
scalable search – so you can search petabytes of data with very low
latency, as well as getting pretty good throughput, with lots of
queries running simultaneously. Both of those are directions we’re
going to see a lot of progress on.

JAX: I just wanted to touch upon your role at the Apache
Software Foundation. What do you plan to achieve with your role
there?

DC: I’m chairman at Apache at the moment and I have been
for the last few years. It’s not really a position of power –
Apache is pretty much an all-volunteer organisation, and we’ve got
a few contractors that do system administration, but it’s mostly
volunteers. So you can’t really have a top-down power structure in
that kind of organisation. It’s really much more about
coordination, and so the only times we flex our muscles are in sort
of ‘police actions’, where a community is not operating according
to the principles that we espouse.

And we really want communities to operate as level playing fields,
where anyone can come along, and get involved in a project, and
their contributions will be evaluated on their technical merit, and
we don’t want any one company to exercise too much power within the
project, and control the project for their commercial interests,
rather than the technical needs of the community.

So when we see something like that happen, we have to step in now
and then. Those are the times when we become activists. It’s
unclear that this all-volunteer structure is going to scale
indefinitely – it’s scaled well for over a decade, but eventually
we may need to start hiring more staff. The number of contractors
we have doing system administration has grown. We’ve got a
contractor doing marketing and communications. We’ve got an
executive assistant now, who helps manage some of the paperwork of
the foundation, so eventually we may actually grow. Figuring out
how to do that is a bit of a puzzle, and how we do the fundraising
to back that. So far, our fundraising has been relatively passive –
we’re able to get big companies who really value the existence of
the ASF, and just [get them to] give us money, which is wonderful,
with no strings attached, and so far that’s been sufficient.
Whether we’re going to do more active fundraising, and actually
employ more people as this continues to grow, we’ll see.

It’s pretty amazing, the size of the foundation. We’ve got 3000
committers, and over 100 active projects being developed – it’s a
lot of software that comes out of something that’s really a
grassroots, bottom-up foundation.

JAX: It must be amazing to see so many projects stepping up
with something innovative.

DC: And it largely runs itself, that’s the design. We always push
anything down and say we can’t drive it from the top down – we just
can’t afford to. Not only is it against the principles, we don’t
have any management paid to do that and we couldn’t expect people
to respond to them if we did. So we force everything down and it
does run itself.

JAX: That spirit certainly shines through.

DC: It’s unusual – there aren’t a lot of examples for us to look to
and we’re making it up as we go along. So far it seems to be
working. Hopefully, we can keep scaling and if we can’t, we’ll
think of something else.

JAX: One final question: can you discuss your Cloudera role
a bit more, and the newest release, CDH4? What problems does that
tackle for the enterprise?

DC: At Cloudera, my title is Chief Architect. They’ve sort of given
me the James Bond role – they’ve given me a license to hack
[laughs]. Mostly I work on Apache things: working on software,
helping manage the ASF, and also trying to help Cloudera achieve
its mission by spending some time with customers. A lot of the time
I’m explaining how Hadoop works and how Apache works.

CDH4 is really the next-generation commercial packaging of the
Hadoop ecosystem. It’s based on the open source Bigtop project
– the version of Bigtop which will have long-term commercial
support. Cloudera will continue to work on critical bug fixes and
security fixes to CDH in a way the Bigtop open source project
doesn’t yet do. There’s also an associated commercial
proprietary offering, Cloudera Enterprise, to help folks manage
their Hadoop clusters. The line we’ve drawn between our open source
and proprietary efforts is that the APIs you code against when
developing an application are all open source. The stuff you run to
configure, run and monitor that software is mainly proprietary.
If you want a console where you can see in one place everything
that’s going on and change the way it is working, then that’s
something we’re happy to sell you along with support.

JAX: Seems like you’ve struck a great balance there. Great
talking to you, Doug.

Doug will be keynoting at the upcoming JAX London
and Big Data Con London, telling us how to go about “Unlocking the
Power of Big Data with Hadoop”.

Photo Courtesy of Felix
O
