Open source, Lucene, Nutch, Hadoop and more
Road to JAX London: A chat with Doug Cutting
JAX London and Big Data Con are just a few weeks away and naturally, we're very excited about what's in store for the events. It seemed like the perfect time to catch up with those keynoting at the event itself in what we're calling the "Road to JAX" series, which will cover some important topics for the worlds of Java and Big Data.
Up first is a man who needs no introduction, but we'll give him one anyway. He's the creator of some hugely influential open source projects - the text search engine library Apache Lucene, its web-focused follow-up Apache Nutch and the Big Data crunching platform taking the industry by storm, Apache Hadoop. We had the pleasure of talking to Doug Cutting, currently Architect at Cloudera, about these projects and much more.
-----
JAX: How did you get into software development, and in
particular open source stuff?
Doug Cutting: When I was in college at Stanford in the early 80s,
it became pretty clear to me that the industry to get into was the
software industry. I started taking some classes and really liked
it. I pretty much decided I wanted to do something in that area from
that point on. I got some summer jobs working with folks from Xerox
PARC and really enjoyed working with those guys. Right after I
graduated, I actually went to Edinburgh for 18 months and worked
there on a research project on speech recognition. Then I came back
and worked at PARC for 5 years doing research, and then I was steered
in the direction of search engines.
From the start, I found I enjoyed programming - but it was also
always a career. I amassed some debts in college and didn’t want to
graduate without a job that I could pay them off with. I worked at
Xerox, Apple, Excite through the internet years, working on mostly
non-open source software. I had little things that ended up being
open source, but not a lot. Then around 2000, I had written this
program called Lucene, and I originally thought I might try and turn
it into a company of some sort, but realised I didn’t have the stomach
for that. So I decided to try it as open source and that went really
well. I enjoyed that; I found it very rewarding as a way to work.
It’s interesting. Since roughly 2000, I’ve worked almost
exclusively on open source software, with a few breaks here and
there but mostly open source - but I’ve also earned a paycheck
every month in that time. Even for 5 years I was an independent
consultant helping people use Lucene and other software, not
working for any particular company.
Some people think of me as altruistic and I don’t know if that’s
really the case - I've totally gotten paid for the code I write.
More recently, I made the stipulation that I wanted the code to be
open source, but there’s usually someone who wants to have the
software and they want to have it as open source and they’re
willing to pay to have it written.
JAX: So you see yourself as a problem solver/troubleshooter
of sorts, as well as contributing heavily to open source
projects?
DC: I think I’ve been an instigator. In the case of Lucene, I
thought Java was a new platform and I’d been working on search
engines for a long time. I thought that having a text search engine
in Java would be a good thing to have.
JAX: What was the allure of Java as a starting
point?
DC: The combination of relatively good performance as well as
high-level programming features and reliability. It is relatively
difficult to crash the system, and that is combined with
the ability to move things between different platforms, between
various flavours of UNIX and Windows and Macs relatively
painlessly.
I think there’s a real sweet spot there, certainly for
engineering to be able to write things that, you know, run pretty
well, but you don’t spend a lot of time on these niggling details
of porting them, nor, when you’re debugging them, are you getting all
these strange memory leaks and memory errors that you find in C and
C++ programs.
It just seemed like a sweet spot. I mean, people have criticised
the Lucene and Hadoop projects for not being in C++, and there
would be some performance benefits, but we’d have moved a lot
slower if we were operating in that because there’s some inertia
when you have to debug in those languages.
JAX: So definitely the right decision when looking back on
it to stick with Java?
DC: A great decision for me certainly. It’s history, we can’t
change it. Somebody could have done it differently and done very
well but I think these projects are doing very well and speak for
themselves at this point as being effective tools.
JAX: That’s definitely true. You’ve talked about Lucene - I
was wondering how Nutch came about?
DC: It was in my days of doing consulting, freelance work based
around Lucene. Somebody approached me and said, “You know, it’d be
really neat if there was a full text search engine, a
crawler-based, web search engine, that was all open source. Would
you be interested in starting a project like that?” So I was like,
of course I would, I'd love to do that. I’d spent years at Excite
working on crawler-based search engines and saw a lot of work on
closed source ones.
I was of the somewhat naive belief that - I think I knew it was
naive at the time - pretty much all software eventually becomes a
commodity, and there’s an open source implementation of it. And I
thought web search engines should not be an exception. That hasn’t
really been borne out. I think the real web search engines are at
Google and Microsoft these days, with a few others around the world
- and they’re not open source. The amount of work it takes to
maintain one is oftentimes so - it hasn’t yet, anyway, yielded an
open source one that’s really world class, but it seemed like a
good idea at the time.
So I threw myself at it, got a couple of collaborators and we tried
to do what we could do. I was familiar with the way that we had
done distributed processing at Excite, which was fairly crude. Just
having a bunch of machines and managing processes on them manually,
and copying files around between various stages. We had something that
would in theory scale arbitrarily to many machines. But it was
onerous to operate - a lot of manual steps in there.
And that was the time when I saw the papers from Google, talking
about how they did this stuff, and where they had automated all of
these manual steps, and had a framework. And it was pretty much the
same algorithms and data structures that were directly supported by
MapReduce...
So I think that was an obvious improvement, to go towards the automation and try adding that to Nutch. But we had a working system at that point, that was an open source project before we read those papers, and then it got much better after we saw that work.