Artificial intelligence (AI) has made tremendous progress over the last ten years and is increasingly integrated into our daily lives. The same is true in science, where more and more papers use machine learning approaches that enable solutions we could only have dreamed about not long ago.
As with human learning, however, things can go wrong. Machines can learn ‘wrong’ or, more accurately, biased information, which can create large problems and lead to poor conclusions.
Take, for instance, large image libraries such as ImageNet and others, which have been found to be heavily biased towards a Western perspective.
Many studies have used such data sources to train machines in areas such as medicine or agriculture. However, when data come from a selective source, a lot of information is left out.
Bias is usually treated as one problem to be handled within a machine learning workflow, but what if the entire dataset is biased? This can create larger issues if it is not clear what bias means for a given dataset.
When a machine learning algorithm encounters data from outside the image repositories it was trained on, misidentification can follow. This is what the Google Brain Team recently pointed out in a paper on geodiversity in the developing world: imagery from these regions was often missing from large image repositories, potentially biasing machine learning approaches.
Biases in Automated Classification Algorithms
In one case from Pakistan, an algorithm trained to identify power lines worked well in the western desert regions, where the results fit what was expected. In the east, where the land is greener and more agricultural, the algorithm misidentified power lines.
In large part, this is because the data used to train the algorithm were biased towards Western countries, and the features that characterize power lines there do not easily fit the geography of eastern Pakistan. In effect, the background data were not diverse enough to incorporate different possibilities.
Ways around this problem include more diverse data, but also algorithms such as Generative Adversarial Networks (GANs), which can look at regions of a dataset and identify how dissimilar an image scene is from the training data. This flags places where there could be bias or misinformation before larger conclusions are drawn.
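The idea of scoring how dissimilar a scene is from the training data can be illustrated without a GAN. The sketch below is a much simpler stand-in, not the method used in the cited work: it summarizes each scene with a few numeric features and measures how many standard deviations a new scene sits from the training average. The two features and all values are hypothetical.

```python
import math

def feature_stats(training_features):
    """Per-dimension mean and standard deviation of the training features."""
    n = len(training_features)
    dims = len(training_features[0])
    means = [sum(row[d] for row in training_features) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((row[d] - means[d]) ** 2 for row in training_features) / n
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 if a feature never varies
    return means, stds

def dissimilarity(scene, means, stds):
    """Root-mean-square z-score of one scene relative to the training data."""
    squared = [((x - m) / s) ** 2 for x, m, s in zip(scene, means, stds)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical features per scene: [mean greenness, mean brightness].
# The training scenes here all resemble dry, bright desert.
training = [[0.10, 0.80], [0.12, 0.78], [0.08, 0.82], [0.11, 0.79]]
means, stds = feature_stats(training)

desert_like = [0.10, 0.80]    # resembles the training data
farmland_like = [0.60, 0.40]  # greener and darker than anything seen in training

desert_score = dissimilarity(desert_like, means, stds)
farmland_score = dissimilarity(farmland_like, means, stds)
```

A scene with a high score, such as the farmland example, is one where the classifier's output deserves extra scrutiny before conclusions are drawn from it.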
Correcting Geographic Bias
Part of the problem is that some artificial intelligence techniques are over-fitted to their training data.
One way around this problem is to incorporate variation, or to assess variance error in the results, where specific data may bias learning outcomes. Usually, with larger datasets, these errors can be overcome. However, if the information continues to be biased, as with imagery from open data sources, then problems often persist and algorithms, rather than improving over time, end up creating greater selection bias. Some data sources for training satellite imagery classification algorithms are SpaceNet, Functional Map of the World, and xView.
One good approach could be to train a given classifier, in this case for satellite imagery, and then randomly test the identification algorithm in very different settings, using all of the Earth as potential test data. Errors would at least become clearer, selection bias could be avoided, and researchers would consequently be forced to consider diversity more in their training datasets.
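As a minimal sketch of this kind of testing, the following draws random locations over the whole globe and reports accuracy separately for each region, so that a classifier that only works in one part of the world stands out. The region labels, the `predict` function, and the `truth` function are all hypothetical placeholders; a real test would call an actual classifier and compare it against verified labels.

```python
import math
import random
from collections import defaultdict

def random_point_on_earth(rng):
    """Draw a (latitude, longitude) pair uniformly over the sphere's surface."""
    lon = rng.uniform(-180.0, 180.0)
    # Sampling sin(latitude) uniformly avoids over-sampling the poles.
    lat = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
    return lat, lon

def region_of(lat, lon):
    """Hypothetical coarse region label: one of four hemisphere quadrants."""
    return ("N" if lat >= 0 else "S") + ("E" if lon >= 0 else "W")

def accuracy_by_region(points, predict, truth):
    """Fraction of correct predictions within each region."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lat, lon in points:
        region = region_of(lat, lon)
        total[region] += 1
        correct[region] += int(predict(lat, lon) == truth(lat, lon))
    return {region: correct[region] / total[region] for region in total}

# Toy stand-ins: every location contains the target feature, but the
# classifier only detects it in the western hemisphere.
def truth(lat, lon):
    return True

def predict(lat, lon):
    return lon < 0

rng = random.Random(0)  # fixed seed so the sample is reproducible
points = [random_point_on_earth(rng) for _ in range(2000)]
scores = accuracy_by_region(points, predict, truth)
```

Reporting one score per region, rather than a single global average, is the design choice that matters here: an average can hide a classifier that fails completely in places the training data never covered.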
Another possibility is to integrate artificial images into the training data. In other words, machines could combine real-world imagery with artificial imagery to create new data for regions where such data are missing.
Composite creations that combine satellite information with ground data could help make up for some of the bias by diversifying the dataset and providing a new test dataset with greater variation. This could be done with convolutional neural network models, as one example, which take composite and real imagery together so that data diversity is purposely built in.
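One way to picture how real and artificial imagery might be combined is shown below. This sketch covers only the data-assembly step, not the neural network itself: it tops up under-represented regions with synthetic samples until each region reaches a minimum count. The region names, image identifiers, and the `min_per_region` threshold are hypothetical.

```python
import random
from collections import Counter

def build_training_set(real, synthetic, min_per_region, rng):
    """Top up under-represented regions with synthetic samples.

    `real` and `synthetic` are lists of (region, image_id) tuples.
    Real samples are always kept; synthetic ones are added only where
    a region has fewer than `min_per_region` real samples.
    """
    combined = list(real)
    counts = Counter(region for region, _ in real)

    # Group the synthetic samples by the region they depict.
    pool = {}
    for region, image_id in synthetic:
        pool.setdefault(region, []).append((region, image_id))

    for region, candidates in pool.items():
        shortfall = max(0, min_per_region - counts[region])
        rng.shuffle(candidates)
        combined.extend(candidates[:shortfall])

    rng.shuffle(combined)  # mix real and synthetic samples together
    return combined

# Hypothetical data: plenty of real western scenes, few eastern ones.
real = [("west", f"real_w{i}") for i in range(8)] + \
       [("east", f"real_e{i}") for i in range(2)]
synthetic = [("east", f"synth_e{i}") for i in range(10)]

training_set = build_training_set(real, synthetic, min_per_region=6,
                                  rng=random.Random(0))
region_counts = Counter(region for region, _ in training_set)
```

With these toy inputs, four synthetic eastern scenes are added to the two real ones, while the western scenes are left untouched.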
What is evident is that we are still at the beginning of the use of machine learning in the spatial sciences. The future does look better in terms of new methods and capabilities, but biases such as those demonstrated in the Google Brain Team's paper show that we need better and more diverse data sources. This is likely to be addressed by the fact that providing data is only getting easier.
We are not there yet, and techniques such as merging artificial and real-world training data could be one of several avenues to use until we can produce enough information to capture the range of variation in the spatial sciences.
 For more on the Google Brain Team paper, see: Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, D. Sculley. (2017). No classification without representation: Assessing geodiversity issues in open data sets for the developing world. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-5.
 For more on the Pakistan case, see: Bollinger, D. 2018. Development Seed. https://medium.com/devseed/geo-diversity-for-better-fairer-machine-learning-5c64021708dd.
 For more on ways around biased results in data, see: Voyant, C., Notton, G., Kalogirou, S., Nivet, M.-L., Paoli, C., Motte, F., & Fouilloy, A. (2017). Machine learning methods for solar radiation forecasting: A review. Renewable Energy, 105, 569–582. https://doi.org/10.1016/j.renene.2016.12.095.
 For more on integrated training sets using some artificial sources, see: Movshovitz-Attias, Y., Takeo Kanade, Yaser Sheikh. (2016). Computer vision – ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016: proceedings. Part 3: …(G. Hua & H. Jégou, Eds.). Cham: Springer.