
Machine Learning for Aerial Image Labeling

by

Volodymyr Mnih

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

© Copyright 2013 by Volodymyr Mnih

Abstract

Information extracted from aerial photographs has found applications in a wide range of areas including urban planning, crop and forest management, disaster relief, and climate modeling. At present, much of the extraction is still performed by human experts, making the process slow, costly, and error prone. The goal of this thesis is to develop methods for automatically extracting the locations of objects such as roads, buildings, and trees directly from aerial images. We investigate the use of machine learning methods trained on aligned aerial images and possibly outdated maps for labeling the pixels of an aerial image with semantic labels. We show how deep neural networks implemented on modern GPUs can be used to efficiently learn highly discriminative image features. We then introduce new loss functions for training neural networks that are partially robust to incomplete and poorly registered target maps. Finally, we propose two ways of improving the predictions of our system by introducing structure into the outputs of the neural networks. We evaluate our system on the largest and most challenging road and building detection datasets considered in the literature and show that it works reliably under a wide variety of conditions. Furthermore, we are releasing the first large-scale road and building detection datasets to the public in order to facilitate future comparisons with other methods.

Acknowledgements

First, I want to thank Geoffrey Hinton for being an amazing advisor.
I benefited not only from his deep insights and knowledge but also from his patience, encouragement, and sense of humour. I am also grateful to Allan Jepson and Rich Zemel for serving on my supervisory committee and for providing valuable feedback throughout. I also want to thank all the current and former members of the Toronto Machine Learning group for contributing to a truly great and fun research environment and for many interesting discussions. I especially learned a great deal from working with former post-docs Marc’Aurelio Ranzato and Hugo Larochelle, as well as my office mates George Dahl, Navdeep Jaitly, and Nitish Srivastava. My brother, Andriy, also probably deserves a co-supervision credit for the many hours he spent listening to my research ideas. I would also like to thank my parents for their never-ending support, and for giving me the amazing opportunities I have had by moving to Canada. Finally, I would like to thank my wife and best friend, Anita, for her constant support and for putting up with me over the years.

Contents

1 Introduction
2 An Overview of Aerial Image Labeling
  2.1 Early Work - Simple Classifiers and Local Features
  2.2 Move to High-Resolution Data
    2.2.1 Better classifiers
    2.2.2 Better features
    2.2.3 Larger datasets
  2.3 Structured Prediction
    2.3.1 Segmentation
    2.3.2 Post-classification
    2.3.3 Probabilistic Approaches
    2.3.4 Discussion of Structured Prediction
  2.4 Source of Supervision
3 Learning to Label Aerial Images
  3.1 Patch-Based Labeling Framework
    3.1.1 Learning
    3.1.2 Generating Labels
    3.1.3 Evaluating Predictions
  3.2 Datasets
  3.3 Architecture Evaluation
    3.3.1 One Layer Architectures
    3.3.2 Two Layer Architectures
    3.3.3 Deeper Architectures
    3.3.4 Sensitivity to Hyper Parameters
    3.3.5 A Word on Overfitting
  3.4 Qualitative Evaluation
    3.4.1 Peering into the Mind of the Network
  3.5 Conclusions and Discussion
4 Learning to Label from Noisy Data
  4.1 Dealing With Omission Noise
  4.2 Dealing With Registration Noise
    4.2.1 Translational Noise Model
    4.2.2 Learning
    4.2.3 Understanding the Noise Model
  4.3 Results
    4.3.1 Omission Noise
    4.3.2 Registration Noise
  4.4 Conclusions and Discussion
5 Structured Prediction
  5.1 Post-processing Neural Networks
    5.1.1 Results
  5.2 Conditional Random Fields
    5.2.1 Model Description
    5.2.2 Predictions and Inference
    5.2.3 Learning
    5.2.4 Results
  5.3 Combining Structure and Noise Models
    5.3.1 The model
    5.3.2 Inference
    5.3.3 Learning
    5.3.4 Results
    5.3.5 Discussion
6 Large-Scale Evaluation
  6.1 Massachusetts Buildings Dataset
  6.2 Massachusetts Roads Dataset
  6.3 Buffalo Roads Dataset
  6.4 Results
    6.4.1 Massachusetts Datasets
    6.4.2 Buffalo Roads Dataset
7 Conclusions and Future Work
Bibliography

Chapter 1 Introduction

Aerial image interpretation is the process of examining aerial imagery for the purposes of identifying objects and determining various properties of the identified objects. The process originated during the First World War, when photos taken from airplanes were examined for the purpose of reconnaissance. In its nearly one-hundred-year history, aerial image interpretation has found applications in many diverse areas, including urban planning, crop and forest management, disaster relief, and climate modeling.
Much of the work, however, is still performed by human experts. Examining large amounts of aerial imagery by hand is an expensive and time-consuming process. First attempts at automation using computers date back to the late 1960s and early 1970s [Idelsohn, 1970, Bajcsy and Tavakoli, 1976]. While significant progress has been made in the past thirty years, only a few semi-automated systems that work in limited domains are in use today, and no fully automated systems currently exist [Baltsavias, 2004, Mayer, 2008].

The recent explosion in the availability of high-resolution imagery underscores the need for automated aerial image interpretation methods. Such imagery, with resolution as high as 100 pixels per square meter, has greatly increased the number of possible applications, but at the cost of an increase in the amount of required manual processing. Recent applications of large-scale machine learning to such high-resolution imagery have produced object detectors with impressive levels of accuracy [Kluckner and Bischof, 2009, Kluckner et al., 2009, Mnih and Hinton, 2010, 2012], suggesting that automated aerial image interpretation systems may be within reach.

In machine learning applications, aerial image interpretation is usually formulated as a pixel labeling task. Given an aerial image like the one shown in Figure 1.1, the goal is to produce either a complete semantic segmentation of the image into classes such as building, road, tree, grass, and water [Kluckner and Bischof, 2009, Kluckner et al., 2009] or a binary classification of the image for a single object class [Dollar et al., 2006, Mnih and Hinton, 2010, 2012].

Figure 1.1: An aerial image of the city of Boston.

While image labeling or parsing of general scenes has been extensively studied [He et al., 2004, Shotton et al., 2008, Farabet et al., 2012], aerial images have a few distinct characteristics that make aerial image labeling an easier task.
First, by restricting ourselves to overhead imagery with known ground resolution, both the viewpoint and the scale of objects can be assumed to be fixed. Having a fixed viewpoint and scale reduces the possible variations in object appearance and makes the priors on object shape less broad than in general image labeling. This suggests that it should be possible to incorporate strong shape dependencies into aerial image labeling systems. Second, the amount of both unlabeled and labeled aerial imagery is massive compared to the datasets available for general image labeling tasks. Methods that are able to effectively learn from massive amounts of labeled data should have a distinct advantage on aerial image labeling tasks over methods that cannot.

The goal of this thesis is to develop new machine learning methods that are particularly well suited to the task of aerial image labeling. Namely, this thesis focuses on what we see as the three main issues in applying image labeling techniques to aerial imagery:

• Context and Features: The use of context is important for successfully labeling aerial images because local colour cues are not sufficient for discriminating between pairs of object classes like trees and grass, and roads and buildings. Additionally, occlusions and shadows caused by trees and tall buildings often make it impossible to classify a pixel without using any context information. Since the number of input features grows quadratically with the width of an input image patch, the number of parameters and the amount of computation required for a naive approach also increase quadratically. For these reasons, efficient ways of extracting discriminative features from a large image context are necessary for aerial image labeling.

• Noisy Labels: When training a system to label images, the amount of labeled training data tends to be a limiting factor.
The most successful applications of machine learning to aerial imagery have relied on existing maps. These provide abundant labels, but the labels are often incomplete and sometimes poorly registered, which hurts the performance of object detectors trained on them. In order to successfully apply image labeling to buildings and other object types for which the amount of label noise is high, new learning methods that are robust to noise in the labels are required.

• Structured Outputs: Labels of nearby pixels in an image exhibit strong correlations, and exploiting this structure can significantly improve labeling accuracy. Due to the restricted viewpoint and fixed scale of aerial imagery, the structure present in the labels is generally more rigid than that in general image labeling, with shape playing an important role. In addition to being able to handle shape constraints, a structured prediction method suited to aerial imagery should also be able to deal with large datasets and noisy labels.

The main contribution of this thesis is a coherent framework for learning to label aerial imagery. The proposed framework consists of a patch-based formulation of aerial image labeling, new deep neural network architectures implemented on GPUs, and new loss functions for training these architectures, resulting in a single model that can be trained end-to-end while dealing with the issues of context, noisy labels, and structured outputs.

Fully embracing the view of aerial image labeling as a large-scale machine learning task, we assemble a number of road and building detection datasets that far surpass all previous work in terms of both size and difficulty. In addition to releasing the first publicly available datasets for aerial image labeling, we perform the first truly large-scale evaluation of an aerial image labeling system on real-world data.
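The quadratic growth noted under Context and Features can be made concrete with a little arithmetic. The sketch below is illustrative only: the 4,096 hidden units, the 8×8 filter size, and the 64 filters are hypothetical values chosen for the example, not settings taken from this thesis. It counts the parameters of a naive fully connected layer applied to a flattened w × w × 3 RGB patch and contrasts them with a convolutional layer, whose filters are shared across positions so that its parameter count does not depend on w (its computation, however, still grows with the number of positions the filters are applied at).

```python
# Hypothetical layer sizes for illustration only; not the thesis's settings.
HIDDEN_UNITS = 4096
KERNEL_WIDTH = 8
FILTERS = 64


def dense_layer_params(patch_width: int, channels: int = 3,
                       hidden_units: int = HIDDEN_UNITS) -> int:
    """Weights + biases of a fully connected layer on a flattened patch."""
    n_inputs = patch_width * patch_width * channels  # quadratic in patch_width
    return n_inputs * hidden_units + hidden_units


def conv_layer_params(channels: int = 3, kernel_width: int = KERNEL_WIDTH,
                      filters: int = FILTERS) -> int:
    """Weights + biases of a convolutional layer. The filters are shared
    across positions, so the count does not depend on the patch width."""
    return kernel_width * kernel_width * channels * filters + filters


for w in (16, 32, 64):
    print(f"patch {w}x{w}: dense={dense_layer_params(w):,} "
          f"conv={conv_layer_params():,}")
# patch 16x16: dense=3,149,824 conv=12,352
# patch 32x32: dense=12,587,008 conv=12,352
# patch 64x64: dense=50,335,744 conv=12,352
```

Doubling the patch width quadruples the dense layer's weight count (3,145,728 weights at 16 pixels versus 12,582,912 at 32), which illustrates why weight sharing of the kind used in convolutional architectures is attractive when a large image context is needed.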
When trained on the road and building detection datasets assembled for this thesis, our models surpass all published models in terms of accuracy.

The rest of the thesis is organized as follows:

• Chapter 2 presents a brief overview of existing work on applying machine learning to aerial image data. Some related work on general image labeling that has not been applied to aerial imagery is also covered.

• Chapter 3 presents our formulation of aerial image labeling as a patch-based pixel labeling task, as well as an evaluation of several different proposed architectures. The main contribution is a GPU-based, deep convolutional architecture that is capable of exploiting a large image context as well as learning discriminative features. This chapter includes work previously published in Mnih and Hinton [2010] and Mnih and Hinton [2012].

• Chapter 4 addresses the problem of learning from incomplete or poorly registered maps. The main contributions are loss functions that provide robustness to both types of label noise and are suitable for training the architectures proposed in Chapter 3. This work has been previously published in Mnih and Hinton [2012].

• Chapter 5 explores ways of taking advantage of the structure present in the labels. We investigate two complementary ways of performing structured prediction: post-processing neural networks and Conditional Random Fields (CRFs). We argue that neural networks are good at learning high-level structure while CRFs are good at capturing low-level dependencies, with the combination of the two approaches being particularly effective. We also show how to combine a noise model from Chapter 4 with the proposed structured prediction models. This chapter includes work previously published in Mnih et al. [2011] and Mnih and Hinton [2012].

