Open source, Lucene, Nutch, Hadoop and more
Road to JAX London: A chat with Doug Cutting - Part 3
JAX: You recently described Hadoop as the 'kernel' of the platform itself. What other Big Data technologies catch your eye at the moment? There are quite a few incubating at the Apache Software Foundation right now...
DC: I think it is pretty exciting that Hadoop has become this kernel, and I think Bigtop is becoming the open source integration point for getting all these parts to coordinate. I think the YARN project within Hadoop - generalising the runtime of the kernel so that we can support different kinds of processing, things like the Giraph project for graph processing - is going to be pretty useful. Then there's the whole real-time processing side, which is kind of a separate thread of development; it isn't really as integrated into the Hadoop stack. That's interesting to watch, though not really something I've been involved in, so things like Storm and S4 should come into play more. HBase has long been the primary online system within the stack, and I think in the next year we're going to see a lot more that is really integrated with Hadoop and gives you interactive queries, beyond the simple key-value lookups of HBase. So you're likely to see interactive SQL queries, as well as SolrCloud in the Lucene camp giving you scalable search - so you can search petabytes of data with very low latency, while also getting pretty good throughput with lots of queries running simultaneously. Both of those are directions we're going to see a lot of progress on.
JAX: I just wanted to touch upon your role at the Apache Software Foundation. What do you plan to achieve there?
DC: I'm chairman at Apache right now, and I have been for the last few years. It's not really a position of power - Apache is pretty much an all-volunteer organisation; we've got a few contractors who do system administration, but it's mostly volunteers. So you can't really have a top-down power structure in that kind of organisation. It's much more about coordination, and so the only times we flex our muscles are in sort of 'police actions', where a community is not operating according to the principles that we espouse.
And we really want communities to operate as level playing fields, where anyone can come along and get involved in a project, and their contributions will be evaluated on their technical merit. We don't want any one company to exercise too much power within a project and control it for their commercial interests rather than the technical needs of the community.
So when we see something like that happen, we have to step in now and then. Those are the times when we become activists. It's unclear whether this all-volunteer structure is going to scale indefinitely - it's scaled well for over a decade, but eventually we may need to start hiring more staff. The number of contractors we have doing system administration has grown. We've got a contractor doing marketing and communications. We've got an executive assistant now, who helps manage some of the paperwork of the foundation, so eventually we may actually grow. Figuring out how to do that is a bit of a puzzle, as is how we do the fundraising to back it. So far, our fundraising has been relatively passive - we're able to get big companies who really value the existence of the ASF to just give us money, which is wonderful, with no strings attached, and so far that's been sufficient. Whether we're going to do more active fundraising, and actually employ more people as this continues to grow, we'll see.
It's pretty amazing, the size of the foundation. We've got 3,000 committers and over 100 active projects being developed - that's a lot of software coming out of something that's really a grassroots, bottom-up foundation.
JAX: It must be amazing to see so many projects stepping up with something innovative.
DC: And it largely runs itself - that's the design. We push everything down and say we can't drive it from the top - we just can't afford to. Not only is it against our principles, we don't have any paid management to do that, and we couldn't expect people to respond to them if we did. So we force everything down, and it does run itself.
JAX: That spirit certainly shines through.
DC: It’s unusual - there aren’t a lot of examples for us to look to, and we’re making it up as we go along. So far it seems to be working. Hopefully we can keep scaling, and if we can’t, we’ll think of something else.
JAX: One final question: can you discuss your Cloudera role a bit more, and the newest release, CDH4? What problems does that tackle for the enterprise?
DC: At Cloudera, my title is Chief Architect. They’ve sort of given me the James Bond role - a licence to hack [laughs]. Mostly I work on Apache things - working on software, helping manage the ASF - and I also try to help Cloudera achieve its mission, spending some time with customers. A lot of the time that means explaining how Hadoop works and how Apache works.
CDH4 is really the next-generation commercial packaging of the Hadoop ecosystem. It’s based off of the open source Bigtop project - the version of Bigtop which will have long-term commercial support. Cloudera will continue to work on critical bug fixes and security fixes to CDH in a way the Bigtop open source project doesn’t yet do. There’s also an associated commercial proprietary offering, Cloudera Enterprise, to help folks manage their Hadoop clusters. The line we’ve drawn between our open source and proprietary efforts is that the APIs you code against when developing an application are all open source. The stuff you run to configure, run and monitor that software is mainly proprietary. If you want a console where you can see everything that’s going on in one place and change the way it’s working, then that’s something we’re happy to sell you along with support.
JAX: Seems like you’ve struck a great balance there. Great talking to you, Doug.
Doug will be keynoting at the upcoming JAX London and Big Data Con London, telling us how to go about “Unlocking the Power of Big Data with Hadoop”. Find more conference info here.
Photo Courtesy of Felix O