NASA part of testing programme

New metadata parser project Apache Tika 1.0 showcased

Chris Mayer

The one-stop shop for retrieving parsers

The Apache Software Foundation has released version 1.0 of its highly anticipated Tika project. The embeddable, lightweight toolkit for content detection and analysis looks set to be an essential add-on for anyone who works extensively with search engines.

The creators describe it as a one-stop shop for identifying, retrieving and parsing text and metadata from over 1,200 file formats, including HTML, XML and Microsoft Office, which makes it a useful addition for anyone who works with a variety of file types.

Chris Mattmann, Apache Tika Vice President, Senior Computer Scientist at NASA Jet Propulsion Laboratory and Adjunct Assistant Professor of Computer Science at the University of Southern California, tells us more:

“The Apache Tika v1.0 release is five years in the making, providing numerous improvements and new parsing formats. From a toolkit perspective, it’s easy to integrate, and provides maximum functionality with little configuration.”

Tika also comes with a GUI that lets users work with files graphically. Version 1.0 removes all pre-1.0 API methods and gets rid of the retrotranslated Java 1.4 support. OSGi integration has been improved so that Tika automatically picks up and uses available Parser and Detector services.

Initially part of Lucene, Tika gained top-level billing in April 2010 and has been tested thoroughly within repositories exceeding 500 million documents across a variety of applications in industry, academia and government labs such as NASA. Dan Crichton, Program Manager and Principal Computer Scientist at NASA Jet Propulsion Laboratory, explains:

“At NASA, we leverage Apache Tika on several of our Earth science data system projects. Tika helps us process hundreds of terabytes of scientific data in myriad formats and their associated metadata models. Using Tika with other Apache technologies such as OODT, Lucene, and Solr, we are able to automate, virtualize and increase the efficiency of NASA’s science data processing pipeline.”

Apache Tika v1.0 will be featured at ApacheCon’s Content Technologies track on 10 November 2011. The release notes list the changes in full. The Apache Tika source is available for download.
