Open source, Lucene, Nutch, Hadoop and more

Road to JAX London: A chat with Doug Cutting

Chris Mayer

With JAX London a few weeks away, we caught up with one of the keynoters – Doug Cutting, the creator of Big Data sensation Hadoop.

JAX London and Big Data Con are just a few weeks away and naturally, we’re very excited about what’s in store for the events. It seemed like the perfect time to catch up with those keynoting at the event itself in what we’re calling the “Road to JAX” series, which will cover some important topics for the worlds of Java and Big Data.

Up first is a man who needs no introduction, but we’ll give him one anyway. He’s the creator of some hugely influential open source projects – the text search engine library Apache Lucene, its web-focused follow-up Apache Nutch and the Big Data crunching platform taking the industry by storm, Apache Hadoop. We had the pleasure of talking to Doug Cutting, currently Architect at Cloudera, about these projects and much more.


JAX: How did you get into software development, and in particular open source stuff?

Doug Cutting: When I was in college at Stanford in the early 80s, it became pretty clear to me that the industry to get into was the software industry. I started taking some classes and really liked it. I pretty much decided I wanted to do something in that area from that point on. I got some summer jobs working with folks from Xerox PARC and really enjoyed working with those guys. Right after I graduated, I actually went to Edinburgh for 18 months and worked there on a research project on speech recognition. Then I came back and worked at PARC for 5 years doing research, and then I was steered in the direction of search engines.

From the start, I found I enjoyed programming – but it was also always a career. I amassed some debts in college and didn’t want to graduate without a job that I could pay them off with. I worked at Xerox, Apple and Excite through the internet years, working on mostly non-open source software. I had little things that ended up being open source, but not a lot. Then around 2000, I had written this program called Lucene. I originally thought I might try and turn it into a company of some sort, but realised I didn’t have the stomach for that. So I decided to try it as open source and that went really well. I enjoyed that; I found it very rewarding as a way to work.

It’s interesting. Since roughly 2000, I’ve worked almost exclusively on open source software, with a few breaks here and there but mostly open source – but I’ve also earned a paycheck every month in that time. For 5 years of that I was even an independent consultant helping people use Lucene and other software, not working for any particular company.

Some people think of me as altruistic and I don’t know if that’s really the case – I’ve totally gotten paid for the code I write. More recently, I made the stipulation that I wanted the code to be open source, but there’s usually someone who wants to have the software and they want to have it as open source and they’re willing to pay to have it written.

JAX: So you see yourself as a problem solver/troubleshooter of sorts, as well as contributing heavily to open source projects?

DC: I think I’ve been an instigator. In the case of Lucene, Java was a new platform and I’d been working on search engines for a long time. I thought that a text search engine in Java would be a good thing to have.

JAX: What was the allure of Java as a starting point?

DC: The combination of relatively good performance as well as high-level programming features and reliability. The system is relatively hard to crash, and that’s combined with the ability to move things between different platforms – between various flavours of UNIX, Windows and Macs – relatively painlessly.

I think there’s a real sweet spot there, certainly for engineering, to be able to write things that, you know, run pretty well, but you don’t spend a lot of time on these niggling details of porting them. Nor, when you’re debugging them, are you getting all these strange memory leaks and memory errors that you find in C and C++ programs.

It just seemed like a sweet spot. I mean, people have criticised the Lucene and Hadoop projects for not being in C++, and there would be some performance benefits, but we’d have moved a lot slower if we were operating in that because there’s some inertia when you have to debug in those languages.

JAX: So definitely the right decision when looking back on it to stick with Java?

DC: A great decision for me certainly. It’s history, we can’t change it. Somebody could have done it differently and done very well but I think these projects are doing very well and speak for themselves at this point as being effective tools.

JAX: That’s definitely true. You’ve talked about Lucene – I was wondering how Nutch came about?

DC: It was in my days of doing consulting, freelance work based around Lucene. Somebody approached me and said, “You know, it’d be really neat if there was a full text search engine, a crawler-based, web search engine, that was all open source. Would you be interested in starting a project like that?” So I was like, of course I would, I’d love to do that. I’d spent years at Excite working on crawler-based search engines and saw a lot of work on closed source ones.

I was of the somewhat naive belief that – I think I knew it was naive at the time – pretty much all software eventually becomes a commodity, and there’s an open source implementation of it. And I thought web search engines should not be an exception. That hasn’t really been borne out. I think the real web search engines are at Google and Microsoft these days, with a few others around the world – and they’re not open source. The amount of work it takes to maintain one is oftentimes so – it hasn’t yet, anyway, yielded an open source one that’s really world class, but it seemed like a good idea at the time.

So I threw myself at it, got a couple of collaborators and we tried to do what we could do. I was familiar with the way that we had done distributed processing at Excite, which was fairly crude. Just having a bunch of machines and managing processes on them manually, and copying files around various stages. We had something that would in theory scale arbitrarily to many machines. But it was onerous to operate – a lot of manual steps in there.

And that was the time when I saw the papers from Google, talking about how they did this stuff, and where they had automated all of these manual steps, and had a framework. And it was pretty much the same algorithms and data structures that were directly supported by MapReduce…

So I think that was an obvious improvement, to go towards the automation and try adding that to Nutch. But we had a working system at that point, that was an open source project before we read those papers, and then it got much better after we saw that work.


JAX: When you read the Google paper, did a lightbulb go off in your head?

DC: Very much so. You know, we saw that and it was clear that was a better way to do things. It took us, I don’t know, maybe a six-month effort or so to really get things to a point where we could demonstrate MapReduce running, and it would be running much better than the crawler and the index before we had the MapReduce implementation. That was around 2004, when a lot of that research was done.

And then Yahoo! came along, and they had their own framework for distributed computing that they were building their web search on, and it had sort of outlived its useful life and was becoming stretched pretty thin. They too had read the papers from Google and thought that would be a good way to go, and thought the implementation that we’d built within Nutch was the one that was furthest along, and thought that doing this as open source would be a good way to go. And [they] wanted to join forces with Nutch.

They didn’t want to do it in the context of an open source search project for legal reasons, and instead really wanted it to be split out into a separate distributed computing project. So the split of Hadoop from Nutch, which was really motivated by Yahoo!’s use – I think it was a good one, and I think the general-purpose distributed computing platform is obviously a more general thing than something that’s doing crawling, and it needed to be done. And Yahoo! provided the impetus to do that, and that was done in January of 2006.

JAX: You probably get asked this a lot – did you ever envisage Hadoop becoming what it is today? Did you ever see it getting so huge?

DC: No, I didn’t really – I started something that was useful beyond web search, certainly, and [I knew] there would be utility in having this general thing. I’d never been a big fan of relational databases, through my whole career. I’ve dabbled with them, and always found them insufficient for the sort of text search and web link work I was doing, and not an appropriate solution. But I also never had anything to do with enterprise software; I’d always worked at web companies and search projects and at Apple, on desktop computing OSes. It really wasn’t something I spent any time thinking about. So, no, I didn’t see this as having a huge impact across a lot of industries at the time. So, I’ve been very pleased to see it! [laughs]

JAX: Absolutely. Do you think we’ve reached the limits for storage/processing? Or have we only really scratched the surface at this point?

DC: I think we’re definitely still in the early stages of where this stuff can be used, and what it can do. There’s a long-term exponential trend in the affordability of hardware to store and process data; there’s a long-term exponential trend in the consumption of that hardware by industry – by lots of industries, all industries, to store and process more of their data, and use it to improve their business. So, both of those obviously have something to do with one another, and I don’t see it ending anytime soon. It’s possible they’ll start to slow, but there’ll still continue to be massive growth.

And the approach to data processing that we’ve seen in Hadoop is really one that is more appropriate to keep track of these trends. If you’re trying to store massive quantities of data, you need things which are designed from the ground up to scale as linearly as possible, and also to use the most economical hardware. The classic relational databases really aren’t designed from the ground up that way – they’re dated a bit, they came from a different era.

The other thing that’s really exciting is looking at what Google’s been doing more recently – they recently published the Spanner paper, talking about their F1 system. They’re a few years ahead of the curve, ahead of the rest of us in this world. So they give a big glimpse of where we can go, and when they write these papers, they actually give us a roadmap of where we could go [laughs].

It’s now looking like we can sort of have it all, we really can have transactional database systems that scale very far, that scale to a global level. You can have them across data centres, holding petabytes in the tables, and still responding to interactive queries. So we’re not quite to that point in the open source ecosystem, but I think it’s clear we’re going to get there.

So in terms of features, we’re on our way there. So I think that’s going to cause even more adoption, more use cases. So you’ve got existing industries that are growing, and existing technological capabilities, and existing hardware economics, and those economics are going to get better, industries are going to mature and realise how they can use this more, and the technology’s going to advance, and enable them to do even more. Yeah, I don’t think it’s over, by any means.

JAX: Some enterprises seem slightly hesitant to adopt Hadoop at this stage. Does there need to be a change of mindset there for adoption? What things need to be considered before picking up a Hadoop distribution?

DC: I think all enterprises are inherently conservative, and for good reason – when they adopt a new technology, they’ve got to support it for several years, so they need something they can support. They don’t want to just run the ‘next big thing’ and be stuck with it, and that’s largely why Cloudera was formed – to be able to support those kinds of enterprise customers, and give them that confidence, and provide them with an offering that is easy for them to use and give them a partner in supporting it.

It’s happening – we’re seeing broad adoption, and I think half of the Fortune 50 are now customers, and are adopting this stuff. Most of them are still in an early stage – they’re not yet deployed across the whole business. But it’s spreading; all these companies are spreading their use of these technologies.

It won’t happen overnight. If the core of a business is in some data technology, it’s hard to pick it up and move it. There are a lot of cases where the Hadoop stack is simply not ready for that, but there are also a lot of cases where it can be used today to give folks a real advantage against their competitors, and so we’re seeing pretty steady growth and adoption.

JAX: What do you think is the biggest challenge for Hadoop to overcome, looking forward? Or is it a case of continuing the way it’s going, and that’ll probably see it through?

DC: I think the challenge is to meet the hype. I think we’re doing it pretty well so far, that you really can store data and process it effectively. It is a very young technology – people’s imaginations get ahead of them, ahead of the technology easily, so we need to control expectations. At the same time, we need to listen to those expectations, and if we can’t meet them this year, see if we can meet them next year.

So far, I think we’re doing pretty well with that – we’ve got a lot of folks adopting it. But there are a lot of areas for improvement. There’s the security story – we need to be able to support things encrypted everywhere; we need to support online systems better, being able to do interactive queries, more complex online queries; just a lot of integration with various tools. So plenty of work out there.

JAX: With the Hadoop 2.0 codebase, do you think it’s heading in the right direction to address those problems?

DC: Yeah, there’s been a lot of work on performance in the file system level, a lot of work on security. I mean definitely – the direction is set by the demands of the users, so I think almost by definition we’re heading in the right direction [laughs]. Cloudera and others listen to the customers, and we build what people need most next. It’s demand-driven, and hopefully we’re listening to the right people. I think we are!


JAX: You recently described Hadoop as the ‘kernel’ of the platform itself. What other Big Data technologies catch your eye at the moment? There are quite a few incubating at the Apache Software Foundation right now…

DC: I think it is pretty exciting that Hadoop has become this kernel, and I think Bigtop is becoming the open source integration point for getting all these parts to coordinate. I think the YARN project within Hadoop, of generalising the runtime of the kernel so that we can support different kinds of processing; things like the Giraph project for graph processing, I think that’s gonna be pretty useful.

Then there’s the whole real-time processing, which is kind of a separate thread of development; it isn’t really as integrated into the Hadoop stack. That’s interesting to watch, not really something I’ve been involved in, so things like Storm and S4 should come into play more.

HBase has long been the primary online system within the stack, and I think in the next year, we’re going to see a lot more that is really integrated with Hadoop, that gives you interactive queries beyond the simple key-value lookups of HBase. So you’re likely to see interactive SQL queries, as well as SolrCloud in the Lucene camp, giving you scalable search – so you can search petabytes of data with very low latency, as well as getting pretty good throughput, having lots of queries running simultaneously. Both of those are directions we’re going to see a lot of progress on.

JAX: I just wanted to touch upon your role at the Apache Software Foundation. What do you plan to achieve with your role there?

DC: I’m chairman at Apache at the moment and I have been for the last few years. It’s not really a position of power – Apache is pretty much an all-volunteer organisation, and we’ve got a few contractors that do system administration, but it’s mostly volunteers. So you can’t really have a top-down power structure in that kind of organisation. It’s really much more about coordination, and so the only times we flex our muscles are in sort of ‘police actions’, where a community is not operating according to the principles that we espouse.

And we really want communities to operate as level playing fields, where anyone can come along, and get involved in a project, and their contributions will be evaluated on their technical merit, and we don’t want any one company to exercise too much power within the project, and control the project for their commercial interests, rather than the technical needs of the community.

So when we see something like that happen, we have to step in now and then. Those are the times when we become activists. It’s unclear that this all-volunteer structure is going to scale indefinitely – it’s scaled well for over a decade, but eventually we may need to start hiring more staff. The number of contractors we have doing system administration has grown. We’ve got a contractor doing marketing and communications. We’ve got an executive assistant now, who helps manage some of the paperwork of the foundation, so eventually we may actually grow. Figuring out how to do that is a bit of a puzzle, and how we do the fundraising to back that. So far, our fundraising has been relatively passive – we’re able to get big companies who really value the existence of the ASF, and just [get them to] give us money, which is wonderful, with no strings attached, and so far that’s been sufficient. Whether we’re going to do more active fundraising, and actually employ more people as this continues to grow, we’ll see.

It’s pretty amazing, the size of the foundation. We’ve got 3000 committers, and over 100 active projects being developed – it’s a lot of software that comes out of something that’s really a grassroots, bottom-up foundation.

JAX: It must be amazing to see so many projects stepping up with something innovative.

DC: And it largely runs itself, that’s the design. We always push anything down and say we can’t drive it from the top down – we just can’t afford to. Not only is it against the principles, we don’t have any management paid to do that and we couldn’t expect people to respond to them if we did. So we force everything down and it does run itself.

JAX: That spirit certainly shines through.

DC: It’s unusual – there aren’t a lot of examples for us to look to and we’re making it up as we go along. So far it seems to be working. Hopefully, we can keep scaling and if we can’t, we’ll think of something else.

JAX: One final question: can you discuss your Cloudera role a bit more, and the newest release, CDH4? What problems does that tackle for the enterprise?

DC: At Cloudera, my title is Chief Architect. They’ve sort of given me the James Bond role – given me a license to hack [laughs]. Mostly I work on Apache things, working on software, helping manage the ASF, and I also try to help Cloudera achieve its mission, spending some time with customers. A lot of the time I’m explaining how Hadoop works, how Apache works.

CDH4 is really the next generation commercial packaging of the Hadoop ecosystem. It’s based off of the open source Bigtop project – the version of Bigtop which will have long-term commercial support. Cloudera will continue to work on critical bug fixes and security fixes to CDH in a way the Bigtop open source project doesn’t yet do. There’s also the associated commercial proprietary offering, Cloudera Enterprise, to help folks manage their Hadoop clusters. The line we’ve drawn between our open source and proprietary efforts is that the APIs you code against when developing an application are all open source. The stuff you run to configure, run and monitor that software is mainly proprietary. If you want a console where you can see, in one place, everything that’s going on and change the way it is working, then that’s something we’re happy to sell you along with support.

JAX: Seems like you’ve struck a great balance there. Great talking to you, Doug.

Doug will be keynoting at the upcoming JAX London and Big Data Con London, telling us how to go about “Unlocking the Power of Big Data with Hadoop”. Find more conference info here.

Photo Courtesy of Felix O
