Bob DuCharme, a technical writer with CCRi, writes about how the open source GeoMesa project can help users manage large spatio-temporal datasets. GeoMesa can store petabytes of GIS data and serve up tens of millions of points in seconds.
When using open source tools to build GIS applications around a server like OpenGeo Suite’s GeoServer, you can read data from disk files; when your data outgrows what disk files can efficiently store, you can read from a database manager such as PostgreSQL with the PostGIS extension added to provide geospatial support. If you have enough data, though, you’ll eventually hit a limit with PostGIS. The machine where you installed it can only have so much RAM and disk space, and scaling up from there can get expensive if it’s even possible.
As more kinds of data become available for GIS systems, it’s becoming easier to hit these limits. The GIS Lounge article Empowering GIS with Big Data described some of the classes of Big Spatial Data that are leading people to push these boundaries and some of the tools they’re using to work with this data. One relatively new tool is the open source GeoMesa, which adds support for geographic features to the Hadoop-based database systems Apache Accumulo, Apache HBase, and Google Cloud Bigtable. This can let you scale way, way up in your use of GIS data with open source systems.
Big Data database systems
First, what is Hadoop, what do these database systems add to it, and how do they differ from a relational database such as PostgreSQL? And most importantly, what makes this setup so good for Big Spatial Data?
Apache Hadoop is an open source framework that originally became popular for handling Big Data because of two particular components of that framework: the Hadoop Distributed File System (HDFS), which lets you treat the disks in a collection of commodity machines as a single file system, and MapReduce, which lets applications split their processing across multiple commodity machines.
With this combination, an application with growing resource requirements didn’t necessarily need a bigger computer with more RAM and disk space to scale up; the growth could be handled by the addition of inexpensive new machines to the Hadoop cluster. And, as cloud-based resources became easier to use and less expensive, you didn’t even need permanent physical machines for your cluster—if you wanted twelve machines to serve as a cluster for a six-hour job, you only needed to “own” your cloud versions of those machines for that length of time.
This didn’t work with all applications, though.
MapReduce requires you to create a Map procedure and a Reduce procedure to do your processing. The Map procedure runs on each node of the cluster, processing the subset of data passed to that node, and the Reduce procedure gathers the results of the Map executions and performs additional calculations to aggregate or summarize the results. This arrangement worked fine for many kinds of analytics, and companies poured terabytes of data into MapReduce routines so that they could look for otherwise hidden patterns in their transaction, clickstream, and log data. However, for typical database applications, the need to split processing into Map and Reduce operations made it difficult to allow the kind of interactive querying and updating of data that most people want to do with their databases.
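The Map/Reduce split can be illustrated with a minimal sketch. Plain Python stands in here for a real MapReduce framework, and the log format and function names are hypothetical; the point is only the shape of the computation: a map function emits key–value pairs per record, the framework shuffles them by key, and a reduce function aggregates each key’s values.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Emit (key, value) pairs; here, one hit per URL in a
    # hypothetical "timestamp url" log line.
    timestamp, url = record.split()
    yield (url, 1)

def reduce_fn(key, values):
    # Aggregate all values seen for one key.
    return (key, sum(values))

def run_job(records):
    # A real framework shuffles map output across the cluster so that
    # each reducer sees every value for its keys; sorting simulates that.
    pairs = sorted(p for r in records for p in map_fn(r))
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=itemgetter(0)))

logs = ["t1 /home", "t2 /about", "t3 /home"]
print(run_job(logs))  # {'/about': 1, '/home': 2}
```

Forcing interactive queries and record-level updates into this two-phase batch shape is what made raw MapReduce a poor fit for everyday database workloads.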
As Hadoop gained traction, a parallel development in the information technology world was the development of NoSQL database systems. The term originally meant “not SQL” to describe database systems that used alternatives to the table-based data models used with the ISO SQL standard. The term has evolved to mean “Not Only SQL” because modern large applications may involve a collection of database systems that each use a different data model to perform a specialized task, and an SQL system—for example, PostgreSQL—may be part of that mix.
Some of these NoSQL database management systems were designed to run in a Hadoop environment, where they could take advantage of HDFS and shield developers from the need to worry about Mapping and Reducing their data. One particular family of NoSQL databases, inspired by Google’s Bigtable paper in how they store and organize their data, is known as “column-oriented databases.” By grouping each table’s data by its columns instead of using the row orientation of typical SQL databases, these database systems added modeling flexibility, indexing efficiencies, and greater ease in distributing the data across multiple nodes.
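The difference between the two layouts can be sketched with plain Python data structures (a deliberately simplified picture — real column stores add compression, column families, and per-column indexes on top of this idea):

```python
# Row-oriented: each record's fields are stored together,
# as in a typical SQL database page.
rows = [
    {"id": 1, "city": "Richmond", "pop": 226_000},
    {"id": 2, "city": "Norfolk",  "pop": 238_000},
]

# Column-oriented: each column is stored as its own contiguous run,
# which can be compressed, indexed, and distributed independently.
columns = {
    "id":   [1, 2],
    "city": ["Richmond", "Norfolk"],
    "pop":  [226_000, 238_000],
}

# An aggregation over one column touches only that column's data,
# instead of reading every field of every row.
total_pop = sum(columns["pop"])
print(total_pop)  # 464000
```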
Three popular column-oriented database systems that can run on Hadoop are the Apache open source HBase and Accumulo projects and Google Cloud Bigtable, Google’s commercially available version of their internal data management system. The GeoMesa suite of tools for storing, indexing, and querying big spatial data can work with all three of these database systems, letting you perform geospatial analytics on a very large scale. An application like GeoServer can then use a GeoMesa data store just like it would use any other data store: through its web-based graphical user interface, via the OGC’s Common Query Language, or by using the WFS, WMS, WPS, and WCS standard web services. Seeing these abilities added to Cloud Bigtable impressed Google enough that they now recommend GeoMesa contributor and supporter CCRi as a service partner.
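What makes geospatial queries fast on these key-ordered stores is how GeoMesa builds its index: it encodes location (and optionally time) into the row key with a space-filling curve, so points that are near each other in space land near each other in key order. The sketch below shows a simplified Z-order (Morton) encoding in plain Python — an illustration of the idea, not GeoMesa’s actual key format:

```python
def z_encode(lon, lat, bits=16):
    """Interleave the bits of normalized lon/lat into one sortable integer."""
    # Normalize each coordinate to an unsigned integer in [0, 2**bits).
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions: longitude
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: latitude
    return z

# Nearby points produce nearby keys, so a bounding-box query becomes
# a small set of contiguous key-range scans in Accumulo or HBase.
richmond = z_encode(-77.46, 37.54)
ashland  = z_encode(-77.48, 37.76)  # ~25 km from Richmond
tokyo    = z_encode(139.69, 35.68)
assert abs(richmond - ashland) < abs(richmond - tokyo)
```

Because the keys sort lexicographically, the database’s ordinary range-scan machinery does the geographic filtering, with no spatial smarts needed in the storage layer itself.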
Stored geospatial data, streaming geospatial data
In addition to its ability to work with petabytes of stored geospatial data, GeoMesa can work with streaming data. Frequently, consumers of high velocity data (10K records per second or more) implement an infrastructure based on the Lambda Architecture. The Lambda Architecture has a “Speed Layer” for supporting interactive displays and near-real-time analytics. Data stored by GeoMesa in Accumulo or HBase is part of the serving layer, which responds to queries. GeoMesa also includes a streaming data store based on Apache Kafka, which is ideal for supporting recent playback in a mapping visualization. For example, if your system is reading position data about a fleet of vehicles and you want to render and animate it in a map layer, the GeoMesa Suite’s Kafka data store can make this possible. GeoMesa can also use Kafka to cache enough of the data to let you “rewind” animations of your fleet movement around your map. A permanent record of this data can be provided in the serving layer for forensic analysis or batch processing in the future.
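The “rewind” behavior can be approximated with a minimal sketch: keep a bounded, time-ordered cache of recent position reports and replay any window of it on demand. Plain Python stands in here for GeoMesa’s Kafka data store, and the `ReplayCache` class and its record layout are hypothetical:

```python
from collections import deque

class ReplayCache:
    """Bounded cache of (timestamp, vehicle_id, lon, lat) position reports."""
    def __init__(self, max_messages=100_000):
        self.buffer = deque(maxlen=max_messages)  # oldest entries drop off

    def consume(self, ts, vehicle_id, lon, lat):
        # In a real system this would be driven by a Kafka consumer.
        self.buffer.append((ts, vehicle_id, lon, lat))

    def replay(self, start_ts, end_ts):
        """Return cached positions in a time window, e.g. to re-animate a map layer."""
        return [m for m in self.buffer if start_ts <= m[0] <= end_ts]

cache = ReplayCache()
cache.consume(1, "truck-7", -77.46, 37.54)
cache.consume(2, "truck-7", -77.45, 37.55)
cache.consume(9, "truck-7", -77.40, 37.60)
print(cache.replay(1, 5))  # the first two reports
```

The bounded cache serves the short-lived rewind window; anything older lives on in the serving layer (Accumulo or HBase) for forensic or batch analysis, as the article describes.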
GeoMesa can also take advantage of Apache Spark, a computing framework that is increasingly replacing the direct use of MapReduce to do much faster processing on Hadoop clusters. Spark’s libraries for machine learning, streaming, and graph processing also let developers create and run analytics applications more quickly, which opens up some great possibilities when working with spatio-temporal data.
The recent 1.2 release of GeoMesa included a new step: a thorough review by the Eclipse Foundation’s LocationTech Working Group. This review ensured that GeoMesa’s source code and all of its dependencies conform to business-friendly software licenses and are compatible with GeoMesa’s Apache v2 License. This thorough intellectual-property review gives businesses the assurance that they can confidently build solutions on GeoMesa.
Getting started with GeoMesa
Several branches of the U.S. armed forces, intelligence agencies, and commercial companies are already benefiting from GeoMesa’s ability to scale up with Big Spatial Data. To get started yourself, the GeoMesa home page includes tutorials, and YouTube has several videos that demonstrate GeoMesa’s features and describe its architecture. To join the conversations and get support in your own use of GeoMesa, there are both user and developer mailing lists. And because GeoMesa is an open source project, you can jump in and contribute yourself alongside CCRi, LocationTech, and dozens of other contributors.
Animated visualization of Seahawks vs. Patriots tweets for the 2015 Super Bowl using GeoMesa: