Apache Tika – “Data-driven analytics are at the heart of modern applications”
In 2016, Forbes published an article in which Apache Tika was identified as one of the key emerging technologies. Sergey Beryozkin (Red Hat Middleware R&D) revealed in an interview at ApacheCon 2019 what Apache Tika can do with GraalVM and where there is room for improvement.
JAXenter: A general question at first: What exactly does Apache Tika do or what is it best used for?
Sergey: Apache Tika is a library for detecting and parsing metadata and text from a large number of document formats. It provides a simple, uniform API that works for all the supported formats.
It is great for parsing well-known formats such as PDF, but it can easily scale to support as many formats as needed, sometimes without changing a single line of code.
JAXenter: In 2016, Forbes published an article identifying Tika as one of the key technologies. What are the reasons for that, in your opinion?
Sergey: Data-driven analytics and transformations are at the heart of many modern applications. Apache Tika helps to feed content from a huge number of document formats into these processes, so developers can keep their focus on the quality of the data analysis instead of spending all their time writing format-specific parser code :-).
JAXenter: How did Quarkus help to prepare Apache Tika to run in a GraalVM native image?
Sergey: The GraalVM native image generator compiles the code ahead of time, under a closed-world assumption, and produces a native image executable that includes the application classes as well as the classes from all the required dependencies, including those from the JDK itself. It also includes the runtime components collectively known as Substrate VM.
Native image has a number of limitations. Configuration is required to enable some features, for example loading provider classes with the Java ServiceLoader API, or delaying until run time the initialization of classes that do not meet the closed-world assumption. Therefore, working directly with GraalVM and its native image generator to make a library such as Apache Tika run in a native image executable, with all its required dependencies, can be very difficult.
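To illustrate the ServiceLoader point, here is a minimal Java sketch, using only the JDK, of the kind of run-time provider lookup that a closed-world build cannot see on its own. The class name 'ServiceLoaderSketch' and the 'DocumentParser' interface are hypothetical and are not part of Apache Tika or Quarkus; they only stand in for a parser-style service interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ServiceLoaderSketch {

    /** Hypothetical service interface, standing in for a parser-style SPI. */
    public interface DocumentParser {
        String parse(byte[] content);
    }

    /**
     * Returns the class names of all providers of the given service that
     * ServiceLoader can discover at run time. Providers are normally
     * declared in META-INF/services/<service interface binary name>.
     */
    public static <T> List<String> providerNames(Class<T> service) {
        List<String> names = new ArrayList<>();
        // The lookup is dynamic: provider class names are read from resource
        // files, not referenced from code, so an ahead-of-time compiler has to
        // be told about them through configuration.
        for (T provider : ServiceLoader.load(service)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // No provider is registered for this made-up interface in the sketch.
        List<String> names = providerNames(DocumentParser.class);
        System.out.println("Providers found: " + names);
    }
}
```

Because no META-INF/services entry exists for this made-up interface, the lookup should come back empty. Adding such an entry is all it takes to make a provider appear on a regular JVM, and it is exactly this kind of discovery through resource files that has to be declared up front for a native image.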
Quarkus makes this process a joy for developers: it makes building the native image executable straightforward by hiding the complexity, while letting developers write extension code that knows what the application’s closed-world assumption is and uses so-called Build Steps to drive and optimize the native image generation. For example, the Quarkus ‘quarkus-tika’ extension has a class called ‘TikaProcessor’, which contains only a few Build Step methods to get Apache Tika running in native mode.
JAXenter: What are the benefits of doing so?
Sergey: In general, there are two main advantages: a significantly faster start-up time and a lower memory footprint.
A native image executable starts so fast that a human can barely notice it. This is thanks to ahead-of-time compilation and to the fact that the application can already be configured at build time, so the executable only restores the recorded state. Quarkus makes this particular optimization easy in both the native Substrate VM and HotSpot VM modes.
The lower memory footprint has a number of causes. One is that reading configuration files, particularly in XML format, loads many of the classes required to parse the configuration; when the configuration is processed at build time, that cost is not paid at run time. Another is that Substrate VM does not need to optimize as aggressively as a traditional Java VM does, so there is no need to keep the extra metadata in memory.
As such, Apache Tika applications that need to start fast, or that run in resource-constrained environments, will benefit most from running as a native image executable.
JAXenter: Is there a feature that you currently miss in Apache Tika or something that needs to be improved?
Sergey: Modularization is what will help make Apache Tika more accessible in some deployments. Today all the supported formats are handled by a single module, which brings in a large number of dependencies. Bob Paulin, an Apache Tika committer, has already done most of the modularization work, and we are now planning to complete it.
JAXenter: Apart from Apache Tika, which tools or technologies do you think have the potential to shake up the development world next year?
Sergey: There are many ingenious projects entering the ASF space today. But let me guess: projects that bring blockchain-like technologies will be worth following, so please watch Apache Milagro, as well as other projects such as Apache Camel, which may offer higher-level integration support for distributed ledgers.
JAXenter: 20 years of Apache Software Foundation and you’re a part of it! How do you see the future of the ASF?
Sergey: The ASF will continue to welcome and support the current and next generations of developers who like to work on open source and would like to change the world.
Thanks very much!