Apache OODT platform: Use metadata as a first class citizen
Apache OODT is award-winning software that originated at NASA’s Jet Propulsion Laboratory and is used for scientific data system projects in Earth Science, Planetary Science, and Astronomy. DARPA XDATA and the National Cancer Institute’s Early Detection Research Network also use OODT. We talked to Tom Barber, software developer at NASA Jet Propulsion Laboratory, about the idea behind Apache OODT and its uses.
JAXenter: What is Apache OODT?
Tom Barber: Good question! Apache OODT (Object Oriented Data Technology) is a data management and processing toolkit initially developed by NASA JPL and then donated to the Apache Software Foundation for distribution to a wider audience.
Apache OODT is not designed to be a turnkey solution where everything “just works”. NASA’s requirements were very diverse, so it is meant to be highly extensible and deployable in virtually any case where data management or data pipelines are required.
OODT has a number of components designed to perform different tasks. For example, the file manager is designed to ingest files and streams from third party applications, extract their metadata and store the files for further processing or end user digestion. There is also a workflow module, which accepts data flowing into the file manager, performs post processing on that data and stores the output. There is resource management for managing workflow loads, there are Python bindings for scripting your OODT processes, and the list goes on. The platform also includes a number of web applications and command line tools to allow for end user interaction and the building of web services for users, and of course all of this can be extended and customized to fit the requirements of the project on which it is being used.
One of the core principles of the OODT platform is the use of metadata as a first class citizen. When files are ingested, the file manager can extract metadata from the files and store it alongside the file being stored. So for example, users could post JPEG images to the OODT platform and OODT would store them and at the same time extract the EXIF image data and tag that alongside the image; a developer could then write a web application that searched the archive for JPEG images with certain tags (this is a real life scenario, but we’ll get to that later).
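To make the idea of storing metadata alongside each file more concrete, here is a minimal Python sketch. It is not OODT's API: the `Product` and `Catalog` classes, the file paths and the metadata keys are all hypothetical, and the metadata is passed in by hand rather than extracted from real EXIF data. It only illustrates how an archive that keeps metadata next to each file can be searched by tag.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """A stored file plus the metadata recorded for it at ingest time."""
    path: str
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Hypothetical in-memory stand-in for an archive catalog (not OODT's API)."""

    def __init__(self):
        self._products = []

    def ingest(self, path, metadata):
        # Keep a copy of the metadata alongside the file reference.
        product = Product(path=path, metadata=dict(metadata))
        self._products.append(product)
        return product

    def search(self, **criteria):
        # Return the paths of products whose metadata matches every key/value pair.
        return [
            p.path
            for p in self._products
            if all(p.metadata.get(k) == v for k, v in criteria.items())
        ]


# Illustrative data only; a real system would extract these values from the files.
catalog = Catalog()
catalog.ingest("img/001.jpg", {"format": "JPEG", "camera": "Nikon D90"})
catalog.ingest("img/002.jpg", {"format": "JPEG", "camera": "Canon EOS 5D"})
catalog.ingest("doc/report.pdf", {"format": "PDF"})

nikon_images = catalog.search(format="JPEG", camera="Nikon D90")
print(nikon_images)
```

A web application like the one described above would sit on top of a query of this kind, with the catalog backed by a persistent store rather than a Python list.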
Users can create their own extractors for custom file formats or specific data points coming into the platform, but they can also leverage the Apache Tika engine to extract metadata on the fly. Tika can read the large majority of mainstream file types and can be extended to add more. For example, documents, audio, video, pictures and so on all have embedded textual metadata that Tika can extract automatically. OODT can then take all of that information and make it searchable by users, by workflows and by further processes which require certain data elements.
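The combination of a general-purpose extractor with custom, project-specific ones can be sketched as follows. This is an illustration under stated assumptions, not OODT or Tika code: `generic_extractor` is a hypothetical stand-in for a general engine and only looks at the file name, and `sensor_csv_extractor` assumes an invented naming scheme of the form `station_date.csv`.

```python
import os


def generic_extractor(path):
    # Stand-in for a general-purpose engine: in this sketch it only records
    # what can be derived from the file name, not the file contents.
    name = os.path.basename(path)
    return {"filename": name, "extension": os.path.splitext(name)[1].lower()}


def sensor_csv_extractor(path):
    # Hypothetical custom extractor for a project-specific naming scheme,
    # e.g. "station42_2015-06-01.csv" yields station and date fields.
    stem = os.path.splitext(os.path.basename(path))[0]
    station, _, date = stem.partition("_")
    return {"station": station, "date": date}


# Custom extractors keyed by (lowercased) file extension.
CUSTOM_EXTRACTORS = {".csv": sensor_csv_extractor}


def extract_metadata(path):
    # Always run the generic extractor, then layer any custom fields on top.
    metadata = generic_extractor(path)
    custom = CUSTOM_EXTRACTORS.get(metadata["extension"])
    if custom is not None:
        metadata.update(custom(path))
    return metadata


csv_meta = extract_metadata("/data/incoming/station42_2015-06-01.csv")
jpg_meta = extract_metadata("/data/incoming/photo.JPG")
print(csv_meta)
print(jpg_meta)
```

Keeping the custom extractors in a lookup table means a new file format can be supported by registering one more function, without touching the ingest path.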
JAXenter: Where can it be used?
Tom Barber: Apache OODT has a number of uses both in the scientific and business domains. NASA has put it to good use on a number of satellite missions, especially Earth Science satellite missions such as Seawinds, OCO, OCO-2, SMAP and more. It helps process the data flowing back down to the ground and makes it available to researchers based throughout the world. A good public example is the NASA Planetary Data System (PDS), which distributes scientific data from NASA missions and is powered by Apache OODT. OODT is also used to help detect cancer via the National Cancer Institute (NCI) Early Detection Research Network (EDRN).
One of the most recent use cases is the DARPA Memex program, an effort to improve the search for and detection of criminal activity on the “dark web”. This uses Apache OODT and a number of other open source technologies to help crawl the dark web looking for images and data that can help locate criminals and underground activity. OODT uses automatic metadata extraction along with a number of other techniques to ingest the data available and make it usable to researchers, agents and stakeholders within the program.
OODT is also used in a number of other scientific research domains including collection of snowfall data across North America and Genomics research.
Of course, that’s not to say OODT sits purely in the scientific domain; it just happened to be written by a team of NASA developers. It also has a number of business use cases, Internet of Things (IoT) use cases and more. For example, companies which make use of a data pool or data staging area in their business intelligence and reporting systems can make good use of OODT as a way of tracking the files within their staging area and making them available to end users either in a raw format or after passing them through a workflow. The platform can also be put to good use in IoT and general sensor processing, collecting and archiving sensor readings and data flows between components to allow researchers to gain further insight into the sensors they are operating.
JAXenter: How can Apache OODT be used?
Tom Barber: OODT isn’t an out of the box solution, but we do try to make it as easy as possible. For that, OODT provides the RADIX build system which will compile a fully operational OODT platform ready for development and deployment.
Building the RADIX distribution is as simple as running the following commands:
export JAVA_HOME=/usr/lib/jvm…   # adjust for your own JAVA_HOME
curl -s "https://git-wip-us.apache.org/repos/asf?p=oodt.git;a=blob_plain;f=mvn/archetypes/radix/src/main/resources/bin/radix;hb=HEAD" | bash
mv oodt oodt-src; cd oodt-src; mvn install
mkdir ../oodt; tar -xvf distribution/target/oodt-distribution-0.1-bin.tar.gz -C ../oodt
cd ../oodt; ./bin/oodt start
./resmgr/bin/batch_stub 2001
If you navigate to http://localhost:8080/opsui, you should see the default OPSUI system, which provides system oversight and interrogation. In the OODT deployment directory you’ll also find the various binaries for server components, client tools and all the other configuration files which constitute a “standard” installation. To get a better understanding of your project, open it in your favorite IDE. As it is a Maven project, you should be able to open the root pom.xml file and have the IDE figure out the rest for you. The great thing about the RADIX build is that you can now make changes to your OODT source, recompile using Maven and the same commands as above, and keep deploying your own configurations alongside the main OODT libraries.
In order to start extending the system, one needs additional information — for that we have extensive Wiki (https://cwiki.apache.org/confluence/display/OODT/Home) documentation, as well as an active community of both NASA and non-NASA developers. So if you feel like OODT might be a good fit for your project, swing by the mailing list and say “Hi!”.