Small update for Apache extracting project

Apache Tika looks to reach new communities with 1.1 release

Chris Mayer

Following on from November’s first full version, Apache’s data extraction project has received its first update to the 1.0 series.

Having finally emerged from the Apache Incubator in November after three years of stewing on some lofty ideas, Apache Tika burst onto the scene intent on breaking away from web-search project Apache Nutch and becoming an important metadata extractor in its own right.

Now the team behind Apache Tika has released an update to the content analysis toolkit, which detects languages and extracts metadata from files such as text documents, spreadsheets, PDFs and images. Apache Tika 1.1 pushes the one-stop shop further by adding important improvements to PDF, RTF and MP3 parsing.

The goal of Apache Tika is to provide a single API for anyone looking to extract data from many different file types, including rudimentary support for audio and video files. In this release, the default media type registry loads much more quickly, which significantly improves performance. New features appear on the command line too, such as the ability to list the available detectors.

Other enhancements include improved MIME magic detection for MP4-based formats (QuickTime, MP4 video and audio, and 3GPP), a basic parser for MP4 files, faster PDF extraction, and new parsers for Ogg Vorbis and FLAC files.
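For readers unfamiliar with the term, magic detection means guessing a file's type from its leading bytes rather than from its extension. The sketch below illustrates the idea for MP4-family files in plain Python. It is not Tika's implementation: the function name `sniff_mp4_family`, the small brand table and the `application/mp4` fallback are all assumptions made for illustration, and Tika's own rules cover far more cases.

```python
# Hypothetical brand table for illustration only; a real detector such as
# Tika's recognises many more brands than the handful listed here.
BRAND_TO_MIME = {
    b"isom": "video/mp4",
    b"mp41": "video/mp4",
    b"mp42": "video/mp4",
    b"M4A ": "audio/mp4",
    b"qt  ": "video/quicktime",
}


def sniff_mp4_family(header: bytes):
    """Guess a media type from the first bytes of a file, or return None."""
    # MP4-family files start with a box: a 4-byte size, then the 4-byte
    # box type 'ftyp', then a 4-byte "major brand" naming the variant.
    if len(header) < 12 or header[4:8] != b"ftyp":
        return None
    brand = header[8:12]
    if brand in BRAND_TO_MIME:
        return BRAND_TO_MIME[brand]
    # 3GPP brands look like '3gp4', '3gp5', and so on.
    if brand.startswith(b"3gp"):
        return "video/3gpp"
    # Assumed fallback: an 'ftyp' box with a brand we do not recognise.
    return "application/mp4"


if __name__ == "__main__":
    sample = b"\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00"
    print(sniff_mp4_family(sample))
```

Because the check needs only the first twelve bytes, a detector of this kind can classify a file without reading or parsing the rest of it.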

To see everything that has changed in Apache Tika, check out the Changelog; the Getting Started guide should help iron out any problems you hit when using it. These might seem like small steps, but there are big plans in the offing for this new Top Level Project at the Apache Software Foundation as it looks to spread into other communities.
