We know that the quality of our GIS data is important. The roads must be connected properly, the addresses must be geocoded correctly, and so on. But do we know how good the data should be? Do we know how to give it a grade and make sure our entire project stands up to that level? Maybe more importantly, do we know how to check our data?
When we create or update our data, we do it manually. Afterwards we can run all kinds of automatic procedures, but in general, a lot of our work consists of manual labor. And where there’s a human touch, there are human mistakes. These mistakes can escalate into border disputes, cadastre issues that can result in billion-dollar lawsuits, or just a navigation app that leads us nowhere. The fiasco of the iOS Maps “upgrade” in September 2012 is a pretty good example of what happens when quality assurance doesn’t get proper attention.
The leading international standards body for mapping is ISO/TC 211, Geographic information/Geomatics. Its standards provide statistical tables that show how many errors are allowed for a given grade.
Generally, our data should go through two main Quality Assurance (QA) prisms:
- Automatic – Checks that don’t require the human eye and follow preordained rules: all the rules you want your data to satisfy. For example, the road layer should have no dangles, and buildings shouldn’t intersect roads unless the roads are tunnels.
- Sampling – Checking a sample that reflects the quality of the whole dataset, which requires human judgment. For example, let’s say you want to buy Point of Interest (POI) data containing several categories: culture, education, and commerce. You’ll need to check a sample and see, with your own eyes and mind, that the POIs are categorized correctly. How many you should check depends on the standard you use and the level of error you allow.
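An automatic rule like the no-dangle check above can be expressed in a few lines of code. Here is a minimal sketch in plain Python, not tied to any GIS library; road segments are simplified to straight lines between two coordinate endpoints, and note that a real rule would also exclude legitimate dead ends:

```python
from collections import Counter

def find_dangles(segments):
    """Return endpoints that belong to exactly one road segment.

    A 'dangle' is a segment end that does not connect to any other
    segment -- a common automatic QA rule for road layers.
    Each segment is a pair of (x, y) endpoint tuples.
    """
    counts = Counter()
    for start, end in segments:
        counts[start] += 1
        counts[end] += 1
    return {pt for pt, n in counts.items() if n == 1}

# Three connected segments plus one whose far end touches nothing:
roads = [
    ((0, 0), (1, 0)),
    ((1, 0), (2, 0)),
    ((2, 0), (2, 1)),
    ((1, 0), (1, 5)),   # (1, 5) connects to nothing
]
print(find_dangles(roads))  # every endpoint used exactly once
```

The same counting idea generalizes to any connectivity rule; the hard part in practice is deciding which flagged endpoints are genuine errors.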
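The sampling prism can be sketched the same way. The helper below draws a simple random sample for manual review and applies a naive acceptance rule. Real acceptance sampling, like the tabulated plans such standards rely on (e.g. ISO 2859), uses accept/reject numbers rather than a raw error rate, so treat the threshold here as an illustrative assumption:

```python
import random

def draw_sample(features, sample_size, seed=None):
    """Draw a simple random sample of features for manual review."""
    rng = random.Random(seed)
    return rng.sample(features, sample_size)

def passes(sample_errors, sample_size, max_error_rate):
    """Accept the dataset if the observed error rate in the sample
    stays within the allowed rate (simplified acceptance rule)."""
    return sample_errors / sample_size <= max_error_rate

pois = [f"poi_{i}" for i in range(1000)]
sample = draw_sample(pois, 50, seed=42)

# Suppose manual review finds 2 of the 50 POIs miscategorized:
print(passes(2, 50, 0.05))  # 2/50 = 4% error, within the 5% allowed
```

The sample size itself should come from the standard you follow, since it controls how confidently the sample speaks for the whole dataset.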
It all comes down to the famous cost/benefit equation. If your data is 95% good, does that suffice? Or is 85% enough? Grading your data lets you base decisions on something real rather than a hunch. Maybe you’re wasting time fixing too much of your data; maybe you’re not fixing enough. The grade and your decision follow from the importance of the layer, your customers’ tolerance for mistakes, your abilities, the project’s schedule, the money it would cost you, and, of course, experience. After a couple of projects, it’ll be easier to define your optimal grade.
Last, the automatic and sampling checks should be weighted. For example, if your editor forgets to digitize a new building into the data, that’s (probably) worse than classifying a bend in the road as a ramp rather than a normal lane. Neglecting to erase a nonexistent POI is worse than labeling a pharmacy as a school. Weighting each category leads to results that better reflect your concept of good data.
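One way to turn those weights into a single grade is a weighted average of per-category pass rates. This is a sketch of the idea only; the function, category names, and weights below are illustrative assumptions, not part of any standard:

```python
def weighted_grade(results, weights):
    """Combine per-category pass rates into one weighted data grade.

    results: {category: (errors_found, features_checked)}
    weights: {category: relative severity weight}
    High-weight categories (e.g. a missing building) pull the grade
    down harder than low-weight ones (e.g. a misclassified bend).
    """
    total_w = sum(weights[c] for c in results)
    grade = 0.0
    for cat, (errors, checked) in results.items():
        pass_rate = 1 - errors / checked
        grade += (weights[cat] / total_w) * pass_rate
    return grade

# Hypothetical numbers: missing buildings weigh 3x a misclassified road
results = {"missing_building": (3, 100), "misclassified_road": (10, 100)}
weights = {"missing_building": 3, "misclassified_road": 1}
print(weighted_grade(results, weights))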
Checking Your GIS Data
I’ve been working with Esri’s Data Reviewer extension and have found it more than sufficient.
It gives you the technological means to QA the data with various checks, including composite ones that can be merged into one file. You can create samplings of areas or entities, and finally generate reports that summarize your data grade.
It always starts by creating a session on a geodatabase. Then you can run ad-hoc checks, or employ a batch job (an *.rbj file) that contains several checks. The checks can be grouped by subject or layer, and the report can then give you results by the groups you defined. After the checks have been specified, you run them. The run time depends on the number of checks, their type, and the number of entities; you can also schedule runs for the weekend with a built-in timer option. The results can be displayed on screen, so you’ll see a red spot where roads intersect without nodes, for example. You can mark errors as checked with a built-in right-click option, flag errors with a message, and group the results by one or more fields (check name, for example). Sampling is done easily, creating random locations or entities to check. And in the end, you can create a report covering both kinds of checks to see what the bottom line is.
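That grouping-by-field step is easy to reproduce outside the extension on an exported error table. The records and field names below are hypothetical illustrations, not the Data Reviewer schema:

```python
from collections import defaultdict

# Hypothetical error records, shaped like rows an error table export
# might contain (illustration only, not the actual product schema):
errors = [
    {"check": "dangle",  "layer": "roads",     "status": "open"},
    {"check": "dangle",  "layer": "roads",     "status": "fixed"},
    {"check": "overlap", "layer": "buildings", "status": "open"},
]

def summarize(records, field):
    """Count error records grouped by one field (e.g. check name)."""
    summary = defaultdict(int)
    for rec in records:
        summary[rec[field]] += 1
    return dict(summary)

print(summarize(errors, "check"))  # {'dangle': 2, 'overlap': 1}
print(summarize(errors, "layer"))  # {'roads': 2, 'buildings': 1}
```

A summary like this is often all a project manager needs to decide which check group deserves editing time first.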
Some pros and cons, in my humble opinion:
Thumbs up: an off-the-shelf extension, easy to use, organizes a lot of checks, good user interface (UI), good for standardizing QA, convenient for batching checks.
Thumbs down: not customizable (for use inside ModelBuilder, Workflow Manager, etc.), difficult for outside development, problematic with alphanumeric checks (which are trickier syntactically), some checks can crash ArcGIS Desktop (if too many entities are included), and the layers to be checked must exist in the ArcGIS Desktop table of contents.
In conclusion, the enormous importance of QA for GIS data might not be obvious. We may think that once the editor is done, the data can move on to be sold, but with simple procedures we can verify that our data stands up to standards, whether global or in-house. Conclusions can be drawn: are we giving our projects enough time, or are we missing things in our haste? Are we interpreting well enough? Are we missing a layer from the list of layers we are obliged to update?
The most important thing is that it creates confidence in our data. You can say it’s 98% correct and stand behind that claim with a proper methodology, and that is a very powerful statement.