Notch another version up for Nutch

Apache Nutch 2.0 announced – primed for Hadoop crawling

Chris Mayer

The Apache Software Foundation has unveiled the second series of its web crawling and indexing search framework, Apache Nutch. Version 2.0 is now generally available, and promises to help users ride the winds of change around Big Data and NoSQL databases.

The highly scalable, Java-based search framework rests on steady foundations, building on several of its Apache stablemates, including Hadoop, Tika, Solr and Gora. Its main goal? To provide a link-graph database and parsing support for HTML, two areas which could do with a scalability spruce-up.

Nutch fits neatly into the ASF’s Big Data offering, linking up to a variety of strong projects such as the aforementioned ones. Building on the work of Apache Gora, a storage abstraction and Big Data persistence layer project, Nutch goes deeper into data mining for big data stores such as Accumulo, Avro, Cassandra, HBase and HDFS (the Hadoop Distributed File System). This gives a large amount of flexibility to those pondering adopting any of these technologies in the near future. It also hooks up to an in-memory data store as well as several high-profile SQL stores.
Whilst there’s clearly a clamouring for it to enter the world of the top enterprises, the Nutch 2.0 developers say that they want it to be the “go-to choice for companies of all sizes, from start-ups and medium sized businesses to large scale organizations.” From running across a large cluster of machines to a single use case on one machine, Nutch claims to be fit for any purpose.
“Having been at the origin of Open Source superstars such as Apache Hadoop or Apache Tika, Nutch now catches up with the NoSQL trends and adopts a table-like representation,” said Apache Nutch Vice President Julien Nioche in a press release.
Nutch has been under development for close to two years. In September 2011, the team split with one half focusing on Nutch 1.0 as their mainstream product, and the rest devoting their time to planning this large-scale friendly release. Nutch has a highly pluggable modular architecture, as this presentation demonstrates.
“Our work on Nutch 2.0 gave birth to Apache Gora in the process, which it uses as an abstraction over the storage backends,” added Nioche. “This enhanced architecture makes Nutch not only more efficient but also easier to integrate with external tools while still solving a large range of use cases ranging from single servers setups to large-scale Internet crawlers hosted in the cloud.”
It’s been quite the community effort, with crossover inevitable. Apache open source efforts such as the search server Solr and the content analysis toolkit Tika play a huge part in Nutch, and both have garnered attention from the community. Whether Nutch can enjoy similar success remains to be seen.
“Nutch v2.0 is particularly exciting as it catches up with Apache projects like HBase, Cassandra, and Accumulo,” added Nioche. “The community’s response to the earlier versions of v2.0 has been very encouraging and we hope to see more and more people getting involved.”
Founded by Doug Cutting and Mike Cafarella in 2003, Nutch has since been eclipsed by Hadoop (also founded by Cutting), with internet giants like Facebook and Amazon turning to the yellow elephant for their huge data needs. With Nutch working on top to make sense of it all, perhaps this release could see the project finally get the love it rightfully deserves.