Open source, Lucene, Nutch, Hadoop and more
Road to JAX London: A chat with Doug Cutting
JAX London and Big Data Con are just a few weeks away and naturally, we're very excited about what's in store for the events. It seemed like the perfect time to catch up with those keynoting at the event itself in what we're calling the "Road to JAX" series, which will cover some important topics for the worlds of Java and Big Data.
Up first is a man who needs no introduction, but we'll give him one anyway. He's the creator of some hugely influential open source projects: the text search engine library Apache Lucene, its web-focused follow-up Apache Nutch, and the Big Data crunching platform taking the industry by storm, Apache Hadoop. Currently Architect at Cloudera, Doug Cutting took the time to talk to us about these projects and much more.
JAX: How did you get into software development, and in particular open source stuff?
Doug Cutting: When I was in college at Stanford in the early 80s, it became pretty clear to me that the industry to get into was the software industry. I started taking some classes and really liked it. I pretty much decided I wanted to do something in that area from that point on. I got some summer jobs working with folks from Xerox PARC and really enjoyed working with those guys. Right after I graduated, I actually went to Edinburgh for 18 months and worked there on a research project on speech recognition. Then I came back and worked at PARC for 5 years doing research, and then I was steered in the direction of search engines.
From the start, I found I enjoyed programming - but it was also always a career. I amassed some debts in college and didn't want to graduate without a job that I could pay them off with. I worked at Xerox, Apple and Excite through the internet years, working mostly on non-open source software. I had little things that ended up being open source, but not a lot. Then around 2000, I had written this program called Lucene, and I originally thought I might try and turn it into a company of some sort, but realised I didn't have the stomach for that. So I decided to try it as open source, and that went really well. I enjoyed that; I found it very rewarding as a way to work.
It's interesting. Since roughly 2000, I've worked almost exclusively on open source software, with a few breaks here and there but mostly open source - but I've also earned a paycheck every month in that time. For 5 years I was even an independent consultant helping people use Lucene and other software, not working for any particular company.
Some people think of me as altruistic and I don’t know if that’s really the case - I've totally gotten paid for the code I write. More recently, I made the stipulation that I wanted the code to be open source, but there’s usually someone who wants to have the software and they want to have it as open source and they’re willing to pay to have it written.
JAX: So you see yourself as a problem solver/troubleshooter of sorts, as well as contributing heavily to open source projects?
DC: I think I've been an instigator. In the case of Lucene, Java was a new platform, and I'd been working on search engines for a long time. I thought that having a text search engine in Java would be a good thing to have.
JAX: What was the allure of Java as a starting point?
DC: The combination of relatively good performance with high-level programming features and reliability. The relative difficulty of crashing the system, combined with the ability to move things between different platforms - between various flavours of UNIX, Windows and Macs - relatively painlessly.
I think there's a real sweet spot there, certainly for engineering: you can write things that, you know, run pretty well, but you don't spend a lot of time on those niggling details of porting them, and when you're debugging them you're not getting all these strange memory leaks and memory errors that you find in C. It just seemed like a sweet spot. I mean, people have criticised the Lucene and Hadoop projects for not being in C++, and there would be some performance benefits, but we'd have moved a lot slower if we were operating in that, because there's some inertia when you have to debug in those languages.
JAX: So definitely the right decision when looking back on it to stick with Java?
DC: A great decision for me certainly. It’s history, we can’t change it. Somebody could have done it differently and done very well but I think these projects are doing very well and speak for themselves at this point as being effective tools.
JAX: That’s definitely true. You’ve talked about Lucene - I was wondering how Nutch came about?
DC: It was in my days of doing consulting, freelance work based around Lucene. Somebody approached me and said, “You know, it’d be really neat if there was a full text search engine, a crawler-based, web search engine, that was all open source. Would you be interested in starting a project like that?” So I was like, of course I would, I'd love to do that. I’d spent years at Excite working on crawler-based search engines and saw a lot of work on closed source ones.
I was of the somewhat naive belief - I think I knew it was naive at the time - that pretty much all software eventually becomes a commodity, and there's an open source implementation of it. And I thought web search engines should not be an exception. That hasn't really borne out. I think the real web search engines are at Google and Microsoft these days, with a few others around the world - and they're not open source. The amount of work it takes to maintain one is oftentimes so great that it hasn't yet, anyway, yielded an open source one that's really world class. But it seemed like a good idea at the time.
So I threw myself at it, got a couple of collaborators and we tried to do what we could do. I was familiar with the way that we had done distributed processing at Excite, which was fairly crude: just having a bunch of machines, managing processes on them manually, and copying files around at various stages. We had something that would in theory scale to arbitrarily many machines, but it was onerous to operate - a lot of manual steps in there.
And that was the time when I saw the papers from Google, talking about how they did this stuff, and where they had automated all of these manual steps, and had a framework. And it was pretty much the same algorithms and data structures that were directly supported by MapReduce...
So I think that was an obvious improvement, to go towards the automation and try adding that to Nutch. But we had a working system at that point, that was an open source project before we read those papers, and then it got much better after we saw that work.
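The map-and-reduce pattern Cutting alludes to can be sketched in miniature with a word count, the canonical MapReduce example. This is a toy, single-machine illustration of the idea only - the class and method names are hypothetical, not Hadoop's actual API, and a real MapReduce job distributes the map, shuffle and reduce phases across machines:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the MapReduce pattern: the "map" phase emits a
// (word, 1) pair per word, and the "reduce" phase sums counts per key.
// Names here are illustrative; this is not Hadoop's API.
public class WordCount {

    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // map: split each input line into words
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    // reduce: sum the emitted 1s for each word
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result =
                count(List.of("open source search", "open data"));
        System.out.println(result); // e.g. {data=1, search=1, source=1, open=2}
    }
}
```

The appeal of the framework Cutting describes is that, once a job is expressed in this map/reduce shape, the system can automate the manual steps - distributing work, moving intermediate files, retrying failures - that the Excite-era pipelines handled by hand.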