NASA part of testing programme

New metadata parser project Apache Tika 1.0 showcased

Chris Mayer
Apache-Tika

The one-stop shop for retrieving parsers

The Apache Software Foundation has released version 1.0 of
its highly anticipated Tika project. The embeddable,
lightweight toolkit for content detection and analysis looks to be
an essential add-on for those using search engines extensively.

The creators describe it as a one-stop shop for
identifying, retrieving, and parsing text and metadata from over
1,200 file formats, including HTML, XML and Microsoft Office,
making it a very useful addition for anyone working with a variety
of file types.
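
To give a feel for what "identifying" a format involves, here is a
minimal sketch of one common detection technique: matching the
leading "magic" bytes of a file against known signatures. It is an
illustration only, not Tika's API or its actual detection rules; the
signature table and the detect_type function are assumptions made up
for this example, and it covers only a handful of formats.

```python
# Toy content-type detection by leading "magic" bytes.
# MAGIC_SIGNATURES and detect_type are hypothetical names for this
# sketch; they are not part of Apache Tika.

MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),  # ZIP container, also used by .docx/.xlsx
    (b"<?xml", "application/xml"),
]


def detect_type(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a document."""
    # Fixed byte signatures are checked first.
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime

    # HTML has no fixed signature, so look for a doctype or <html> tag,
    # ignoring leading whitespace and letter case.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"

    # Unknown content falls back to the generic binary type.
    return "application/octet-stream"


if __name__ == "__main__":
    print(detect_type(b"%PDF-1.4 sample"))
    print(detect_type(b"<html><body>Hello</body></html>"))
```

A real toolkit combines signature checks like these with file name
hints and container inspection, then hands the document to a
format-specific parser to extract text and metadata.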

Chris Mattmann, Apache Tika Vice President, Senior Computer
Scientist at NASA Jet Propulsion Laboratory and Adjunct Assistant
Professor of Computer Science at the University of Southern
California, tells us more:

“The Apache Tika v1.0 release is five years in the making,
providing numerous improvements and new parsing formats. From a
toolkit perspective, it’s easy to integrate, and provides maximum
functionality with little configuration.”

Tika also comes with a GUI, allowing users to work with
files graphically. Version 1.0 removes all deprecated pre-1.0
API methods and gets rid of the retrotranslated Java 1.4 support.
OSGi integration has been improved so that Tika automatically picks
up and uses available Parser and Detector services.

Initially part of Lucene, Tika became a top-level Apache project in
April 2010 and has been tested thoroughly within repositories
exceeding 500 million documents across a variety of applications in
industry, academia and government labs such as NASA. Dan
Crichton, Program Manager and Principal Computer Scientist at NASA
Jet Propulsion Laboratory, explains:

“At NASA, we leverage Apache Tika on several of our Earth
science data system projects. Tika helps us process hundreds of
terabytes of scientific data in myriad formats and their associated
metadata models. Using Tika with other Apache technologies such as
OODT, Lucene, and Solr, we are able to automate, virtualize and
increase the efficiency of NASA’s science data processing
pipeline.”

Apache Tika v1.0 will be featured in ApacheCon’s Content
Technologies track on 10 November 2011. The release notes list the
changes in full. Apache Tika source is available for download.
