LocationTech GeoTrellis: What’s new?
GeoTrellis is a geographic data processing engine for high-performance applications. In this article, Ross Bernet explains how GeoTrellis continues to advance the field of distributed processing of geospatial imagery and raster data.
LocationTech GeoTrellis is a library based on the Apache Spark project that enables low latency, distributed processing of geospatial data, particularly for imagery and other raster data.
GeoTrellis has made significant progress in the past year. Three accomplishments stand out: (1) graduating from incubation with the 1.0 release through LocationTech in December 2016, (2) the development of a feature-rich Python binding project called GeoPySpark, and (3) support for Raster Foundry, a web application for processing imagery at scale with an emphasis on a user-friendly UI/UX. Azavea is pleased to share more about these endeavors and several additional achievements of the past year.
The core competency of GeoTrellis is raster data processing using the techniques of map algebra. In addition to raster support, the library includes some support for working with vector and point cloud data.
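As a rough illustration of what map algebra means, the sketch below applies a local operation (cell-by-cell combination of two rasters) and a focal operation (a value computed from each cell's neighborhood). It is a minimal single-machine sketch in plain Python: the nested lists stand in for raster tiles, and the function names `local_add` and `focal_mean` are hypothetical, not part of the GeoTrellis or GeoPySpark API.

```python
# Illustrative only: plain nested lists stand in for raster tiles.
# None of these names come from the GeoTrellis API.

def local_add(a, b):
    """Local operation: combine two equally sized rasters cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(raster):
    """Focal operation: mean of each cell's 3x3 neighborhood, clipped at the edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


elevation = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
offset = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

summed = local_add(elevation, offset)   # every cell shifted by 10
smoothed = focal_mean(elevation)        # each cell averaged with its neighbors
```

A distributed engine applies the same kinds of operations, but to tiles spread across a cluster rather than to one in-memory grid.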
1.0, 1.1, 1.2 Releases
GeoTrellis is released under an Apache 2.0 license. It joined the LocationTech working group as an incubating project in 2013, and the 1.0 release was a significant milestone. Besides indicating the project’s maturity and reliability, the 1.0 release marked its graduation from incubation and promotion to a top-level LocationTech project. Beyond the usual incubation tasks of intellectual property review and project governance, our key objective was to expand the community of contributors beyond the original core developers at Azavea, and we believe this is one of our most significant accomplishments of the past few years. The 1.0 release also brought many new features, including:
- Streaming GeoTiff support
- Windowed GeoTiff reading on S3 and Hadoop
- HBase and Cassandra support
- A new Collections API that enables users to bypass Spark in some cases
- Experimental support for Vector Tiles, GeoWave integration, and GeoMesa integration
- Expanded documentation moved to ReadTheDocs. This greatly improves usability, readability, and searchability
Following the 1.0 release, GeoTrellis 1.1 was released in June 2017, and 1.2 is scheduled for release in November 2017. We follow the Semantic Versioning (SemVer) guidelines and aim to have three releases per year.
The 1.1 release introduced Spark-enabled cost distance calculations, expanded the non-Spark feature support through the implementation of a new Collections API, and added support for conforming Delaunay triangulation in order to ingest LiDAR point clouds. A point cloud is a set of data points with X, Y, Z coordinates and can be useful for terrain and volumetric analysis. This work grew out of a private contract that was used to support open source development. It led to the GeoTrellis point cloud subproject and the development of an interactive demo.
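To give a sense of what a cost distance calculation computes, here is a minimal single-machine sketch. It is not the Spark-enabled GeoTrellis implementation, which this article does not describe. It assumes a friction raster of positive per-cell costs, movement restricted to the four cardinal neighbors, and a step cost equal to the average friction of the two cells involved. The function name `cost_distance` is hypothetical, and Dijkstra's algorithm is used here only because it is a standard way to compute least accumulated cost.

```python
import heapq


def cost_distance(friction, sources):
    """Accumulated least cost from any source cell to every cell.

    friction: 2D list of positive per-cell traversal costs.
    sources:  list of (row, col) start cells.
    Assumption: moves are limited to the four cardinal neighbors, and a
    step costs the average friction of the two cells it connects.
    """
    rows, cols = len(friction), len(friction[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in sources:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))

    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry; a cheaper path was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = (friction[r][c] + friction[nr][nc]) / 2
                if d + step < dist[nr][nc]:
                    dist[nr][nc] = d + step
                    heapq.heappush(heap, (d + step, nr, nc))
    return dist


# A cheap ring of cells around one expensive center cell.
friction = [
    [1, 1, 1],
    [1, 9, 1],
    [1, 1, 1],
]
costs = cost_distance(friction, [(0, 0)])
```

In this example the cheapest route to the opposite corner goes around the expensive center cell rather than through it.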
Python bindings: GeoPySpark
GeoTrellis is written in Scala. Scala provides many performance benefits and enables GeoTrellis to fully leverage the power of the Spark and Hadoop ecosystem. However, Python is a more commonly used language for many developers working with geospatial data. We created Python bindings for GeoTrellis to increase the project’s accessibility and released the project as GeoPySpark in September 2017.
Much of the functionality of GeoPySpark is handled by another library, PySpark. Py4J is a tool that Spark (and in turn GeoPySpark) uses to make Scala objects, classes, and their methods accessible from Python.
In addition to expanding the user base through Python support, GeoPySpark was designed to facilitate iterative workflows in a Jupyter Notebook. GeoPySpark can be installed using ‘pip’ or accessed through a Docker container that includes all of the necessary dependencies. Users can develop workflows and iterate on algorithmic design on a single machine in a Jupyter Notebook environment and easily scale the workflow to a national or global dataset by leveraging cluster computing. Azavea worked with the team at Kitware that created GeoNotebook – a tool that provides an embedded map inside of a Jupyter Notebook – to develop interactive workflows.
In addition to developing new libraries and tools that can use GeoTrellis, a new end-user application, Raster Foundry, is under development for earth observation analysis. The goal for Raster Foundry is to enable users to find, combine, and analyze earth imagery at any scale and share it on the web as well as leverage deep learning for object detection.
The development of Raster Foundry is partially supported by two federal research grants from the US Department of Energy (DE-SC0013134) and NASA (NNX16CS04C).
Documentation is an essential component of any open source project. We have made a significant effort this past year to increase the quantity, quality, and searchability of the GeoTrellis documentation. The GeoTrellis documentation is hosted using ReadTheDocs – a tool designed to support usable documentation.
In addition to documentation, we are focused on the creation of example workbooks and tutorials to demonstrate core concepts. This work has begun with GeoPySpark and the tutorials can be found on GeoDocker.
We are also excited to share the work of MarineTraffic, which is using GeoTrellis for its shipping density calculations.
RasterFrames has been a fruitful collaboration that has led to good feedback for GeoTrellis as well as many useful contributions to the core library. RasterFrames is being developed by a company called Astraea. It brings the power of Spark DataFrames to geospatial raster data and leverages GeoTrellis’ map algebra and tile layer operations. This is an exciting project to look into, particularly for data scientists with knowledge of data frames who want to explore large geospatial datasets.
We are currently focusing our efforts on 1.3 and the 2.0 release, which we anticipate for the first half of 2018. We believe that GeoTrellis contains much of the core functionality we envisioned for the project. This does not mean it is done or feature complete, and much work remains to make GeoTrellis more usable and accessible. GeoPySpark is a good start for reaching Python developers, but we still need to keep improving the documentation, tutorials, and demonstrations for both core GeoTrellis and GeoPySpark. There are also several features and optimizations on the roadmap for the coming year:
- Cloud Optimized GeoTiff (COG) is a format for the internal organization of GeoTiff files that enables efficient retrieval of data subsets in cloud workflows. We plan to make COGs a core component of GeoTrellis.
- Bring additional features from GeoTrellis into GeoPySpark.
- Map Algebra Modeling Language (MAML) is a declarative structure that describes a sequence of map algebra operations. This structure can be evaluated against a given collection of datasets to compute a result. Critically, the evaluation logic is not specified in MAML, only the semantic meaning of the operations. This separation allows multiple interpreters to exist that operate in different computational contexts, which has the potential to expose GeoTrellis processing power to custom interpreters.
- Improve GeoTrellis job performance and reduce resource requirements by optimizing data access patterns based on query and job structure.
- Improve query performance for large spatiotemporal layers by implementing more advanced space filling curve indexing techniques, based on the approach used in GeoWave.
- Expose support for the OGC Web Coverage Service (WCS) standard.
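The separation that MAML draws between describing operations and evaluating them can be illustrated with a small sketch. The node classes below (`Source`, `Add`, `Multiply`) and both interpreters are hypothetical and do not reflect MAML's actual schema or API. The sketch only shows the general idea: one declarative expression, and two interpreters that give it meaning in different contexts.

```python
import operator
from dataclasses import dataclass


# Hypothetical expression nodes; this is not MAML's actual schema.
@dataclass(frozen=True)
class Source:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Multiply:
    left: object
    right: object


CELL_OPS = {Add: operator.add, Multiply: operator.mul}
SYMBOLS = {Add: "+", Multiply: "*"}


def evaluate(expr, rasters):
    """Interpreter 1: compute a result raster from named input rasters."""
    if isinstance(expr, Source):
        return rasters[expr.name]
    op = CELL_OPS[type(expr)]
    left = evaluate(expr.left, rasters)
    right = evaluate(expr.right, rasters)
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(left, right)]


def describe(expr):
    """Interpreter 2: render the same expression as text, computing nothing."""
    if isinstance(expr, Source):
        return expr.name
    return f"({describe(expr.left)} {SYMBOLS[type(expr)]} {describe(expr.right)})"


# The expression itself says nothing about how it will be evaluated.
expression = Multiply(Add(Source("a"), Source("b")), Source("weight"))

rasters = {
    "a": [[1, 2], [3, 4]],
    "b": [[10, 20], [30, 40]],
    "weight": [[2, 2], [2, 2]],
}
result = evaluate(expression, rasters)
text = describe(expression)
```

A third interpreter could, under the same assumptions, translate the expression into distributed operations without any change to the expression itself.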
- Gitter (The fastest way to get an answer about GeoTrellis)
- Twitter (Track updates and share what you are working on!)
- Introducing GeoTrellis (March 2014 Eclipse Newsletter)
- GeoTrellis Adapts to Climate Change and Spark (December 2014 Eclipse Newsletter)
- GeoTrellis on GitHub
- GeoTrellis documentation
- GeoPySpark on GitHub
- Raster Foundry website
- RasterFoundry on GitHub
- Azavea blog articles on GeoTrellis
This post was originally published in the November 2017 issue of the Eclipse Newsletter: Location Matters.
For more information and articles check out the Eclipse Newsletter.