Small update for Apache extracting project

Apache Tika to ride on into new communities with 1.1 release

Chris Mayer

Following on from November’s first full version, Apache’s data extracting project has received its first update to the 1.0 series.

After three years stewing in the Apache incubator with some lofty
ideas, Apache Tika finally emerged in November, with the initial
intention of breaking away from web-search project Apache Nutch and
becoming an important metadata extractor in its own right.

Now the team behind Apache Tika have released an update for the
content analysis toolkit, which detects languages and extracts
metadata from files such as text documents, spreadsheets, PDFs and
images. Apache Tika 1.1 pushes the boundaries further for the
one-stop shop, adding improvements to PDF, RTF and MP3 parsing.

The goal with Apache Tika was to provide a single API for anyone
looking to extract various types of data, with even rudimentary
support for audio and video files. With this release, performance is
significantly improved because the default media type registry loads
much more quickly. New features appear on the command line too, such
as the ability to list detectors.

Other enhancements include improved MIME magic detection for
MP4-based formats (QuickTime, MP4 video and audio, and 3GPP), a basic
parser for MP4 files, faster PDF extraction, and new parsers for Ogg
Vorbis and FLAC files.
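
For readers unfamiliar with the term, "MIME magic" detection means guessing a file's type from the signature bytes at its start rather than from its extension. The following is a minimal sketch of that idea in Python. It is not Tika's own detector or API: the function name `sniff_media_type` and the small brand table are assumptions made purely for illustration, and real detectors cover far more cases.

```python
# Illustrative sketch only: a tiny magic-byte sniffer, NOT Tika's detector.
# The brand table is a small, assumed subset of MP4-family brands.
FTYP_BRANDS = {
    b"qt  ": "video/quicktime",
    b"isom": "video/mp4",
    b"mp41": "video/mp4",
    b"mp42": "video/mp4",
    b"M4A ": "audio/mp4",
    b"3gp4": "video/3gpp",
    b"3gp5": "video/3gpp",
}


def sniff_media_type(header: bytes) -> str:
    """Guess a media type from the first bytes of a file."""
    # FLAC streams begin with the ASCII marker "fLaC".
    if header.startswith(b"fLaC"):
        return "audio/x-flac"
    # Ogg containers begin with "OggS"; telling Vorbis apart from other
    # Ogg payloads needs a deeper look, which this sketch skips.
    if header.startswith(b"OggS"):
        return "application/ogg"
    # MP4-family files: 4-byte box size, then "ftyp", then a 4-byte brand.
    if len(header) >= 12 and header[4:8] == b"ftyp":
        return FTYP_BRANDS.get(header[8:12], "application/mp4")
    # Nothing recognised: fall back to the generic binary type.
    return "application/octet-stream"


if __name__ == "__main__":
    sample = b"\x00\x00\x00\x18ftyp3gp5" + b"\x00" * 12
    print(sniff_media_type(sample))
```

The shared "ftyp" layout is why QuickTime, MP4 and 3GPP files are grouped together above: they are told apart only by the brand that follows the marker.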

To see all that has changed within Apache Tika, check out the
Changelog; the Getting Started guide should iron out any problems
when using Apache Tika. These might seem like small steps, but there
are big plans in the offing for this new Top Level Project at the
Apache Software Foundation as it looks to spread into other
communities.
