An object-based convolutional neural network (OCNN) for urban land use classification

Ce Zhang, Isabel Sargent, Xin Pan, Huapeng Li, Andy Gardiner, Jonathon Hare, Peter M. Atkinson
Remote Sensing of Environment, 2018
Urban land use information is essential for a variety of urban-related applications such as urban planning and regional administration. The extraction of urban land use from very fine spatial resolution (VFSR) remotely sensed imagery has, therefore, drawn much attention in the remote sensing community. Nevertheless, classifying urban land use from VFSR images remains a challenging task, due to the extreme difficulty of differentiating complex spatial patterns to derive high-level semantic labels. Deep convolutional neural networks (CNNs) offer great potential to extract high-level spatial features, thanks to their hierarchical nature with multiple levels of abstraction. However, blurred object boundaries and geometric distortion, as well as huge computational redundancy, severely restrict the potential application of CNNs to the classification of urban land use. In this paper, a novel object-based convolutional neural network (OCNN) is proposed for urban land use classification using VFSR images. Rather than pixel-wise convolutional processes, the OCNN relies on segmented objects as its functional units, and CNN networks are used to […]

Keywords: convolutional neural network; OBIA; urban land use classification; VFSR remotely sensed imagery; high-level feature representations

1. Introduction

Urban land use information, reflecting socio-economic functions or activities, is essential for urban planning and management. It also provides a key input to urban and transportation models, and is essential to understanding the complex interactions between human activities and environmental change (Patino and Duque, 2013). With the rapid development of modern remote sensing technologies, a huge amount of very fine spatial resolution (VFSR) remotely sensed imagery is now commercially available, opening new opportunities to extract urban land use information at a very detailed level (Pesaresi et al., 2013). However, urban land features captured by these VFSR images are highly complex and heterogeneous, comprising a juxtaposition of anthropogenic urban and semi-natural surfaces. Often, the same urban land use types (e.g. residential areas) are characterized by distinctive physical properties or land cover materials (e.g. composed of different roof tiles), and different land use categories may exhibit the same or similar reflectance spectra and textures (e.g.
asphalt roads and parking lots) (Pan et al., 2013). Meanwhile, information on urban land use within VFSR imagery is presented implicitly as patterns or high-level semantic functions, in which identical low-level ground features or object classes are frequently shared amongst different land use categories. This complexity and diversity of spatial and structural patterns in urban areas makes classification into land use classes a challenging task (Hu et al., 2015). Therefore, it is important to develop robust and accurate urban land use classification techniques that effectively represent the spatial patterns or structures present in VFSR remotely sensed data.

Over the past few decades, tremendous effort has been made to develop automatic urban land use classification methods. These methods can be categorized broadly into four classes based on the spatial unit of representation (i.e. pixels, moving windows, objects and scenes) (Liu et al., 2016). Pixel-level approaches that rely purely upon spectral characteristics are able to classify land cover, but are insufficient to distinguish land uses that are typically composed of multiple land covers, and such problems are particularly significant in urban settings (Zhao et al., 2016). Spatial information, that is, texture (Herold et al., 2003; Myint, 2001) or context (Wu et al., 2009), has been incorporated to analyse urban land use patterns through moving kernel windows (Niemeyer et al., 2014). However, it could be argued that both pixel-based and moving window-based methods require arbitrary image structures to be predefined, whereas actual objects and regions might be irregularly shaped in the real world (Herold et al., 2003).
Therefore, object-based image analysis (OBIA), built upon objects segmented automatically from remotely sensed imagery, is preferable (Blaschke, 2010), and has been considered the dominant paradigm over the last decade (Blaschke et al., 2014). These image objects, as the base units of OBIA, offer two kinds of information within a spatial partition: within-object information (e.g. spectral, texture, shape) and between-object information (e.g. connectivity, contiguity, distances, and direction amongst adjacent objects). Many studies have applied OBIA to urban land use classification using within-object information with a set of low-level features (such as spectra, texture, shape) of the ground features (e.g. Blaschke, 2010; Blaschke et al., 2014; Hu and Wang, 2013). These OBIA approaches, however, may overlook semantic functions or spatial configurations, because low-level features are unable to represent semantic features. In this context, researchers have attempted to incorporate between-object information by aggregating objects using spatial contextual descriptive indicators over well-defined land use units, such as cadastral fields or street blocks. These descriptive indicators are commonly derived by means of spatial metrics that quantify morphological properties (Yoshida and Omae, 2005) or graph-based methods that model spatial relationships (Barr and Barnsley, 1997; Walde et al., 2014). However, ancillary geographic data for specifying the land use units might not be available for some regions, and spatial contexts are often hard to describe and characterize as a set of "rules", even though the complex structures or patterns might be recognizable and distinguishable by human experts (Oliva-Santos et al., 2014). Thus, advanced data-driven approaches that learn land use semantics automatically through high-level feature representations are highly desirable.
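The two kinds of object information can be made concrete with a toy example (a hypothetical 4×4 label map and single image band, not data from the paper): within-object information summarises the pixels inside each segment, while between-object information, here simple contiguity, is read off adjacent pixel pairs whose labels differ.

```python
import numpy as np

# Toy segmentation: a 4x4 label map with three objects (labels 1-3)
# and a matching single-band image (illustrative values only).
labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [3, 3, 2, 2],
                   [3, 3, 3, 3]])
image = np.array([[0.2, 0.2, 0.8, 0.9],
                  [0.3, 0.2, 0.8, 0.7],
                  [0.5, 0.6, 0.9, 0.8],
                  [0.5, 0.4, 0.5, 0.6]])

# Within-object information: per-object spectral statistics.
within = {obj: image[labels == obj].mean() for obj in np.unique(labels)}

# Between-object information: contiguity, collected from horizontally
# and vertically adjacent pixel pairs that straddle an object boundary.
adjacency = set()
for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
    diff = a != b
    pairs = np.sort(np.stack([a[diff], b[diff]], axis=1), axis=1)
    adjacency |= set(map(tuple, pairs))

# Object 1 is adjacent to both 2 and 3; object 2 is adjacent to 3.
```

In an OBIA classifier both tables would feed the feature set: the within-object statistics describe each unit, and the adjacency relations supply its spatial context.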
Recently, deep learning has become a major topic in machine learning and pattern recognition, in which the most representative and discriminative features are learnt end-to-end and hierarchically (Chen et al., 2016a). This breakthrough was triggered by a revival of interest in the use of multi-layer neural networks to model higher-level feature representations without human-designed features or rules. Convolutional neural networks (CNNs), a well-established and popular deep learning method, have produced state-of-the-art results in multiple domains, such as visual recognition (Krizhevsky et al., 2012), image retrieval (Yang et al., 2015) and scene annotation (Othman et al., 2016). Owing to their superiority in higher-level feature representation and scene understanding, CNNs have demonstrated great potential in many remote sensing tasks, such as vehicle detection (Chen et al., 2014; Dong et al., 2015), road network extraction (Cheng et al., 2017), remotely sensed scene classification (Othman et al., 2016; Sargent et al., 2017), and semantic segmentation (Zhao et al., 2017b). Interested readers are referred to a comprehensive review of deep learning in remote sensing (Zhu et al., 2017).

Land use information extraction from remotely sensed data using CNN models has been undertaken in the form of land-use scene classification, which aims to assign a semantic label (e.g. tennis court, parking lot) to an image according to its content (Chen et al., 2016b; Nogueira et al., 2017). There are broadly two strategies for exploiting CNN models for scene-level land use classification: i) a pre-trained or fine-tuned CNN, and ii) a CNN fully trained from scratch.
The first strategy relies on pre-trained CNN networks transferred from an auxiliary domain of natural images, which has been demonstrated empirically to be useful for land-use scene classification (Hu et al., 2015; Nogueira et al., 2017). However, such networks accept only the three (RGB) input channels of natural images, whereas multispectral remotely sensed imagery often includes a near-infrared band, and this distinction restricts the utility of pre-trained CNN networks. Alternatively, the fully-trained CNN strategy gives full control over the network architecture and parameters, which brings greater flexibility and extensibility (Chen et al., 2016). Previous researchers have explored the feasibility of the fully-trained strategy in building CNN models for scene-level land use classification. For example, Luus et al. (2015) proposed a multi-view CNN with multi-scale input strategies to address land use scene classification and its scale-dependent characteristics. Othman et al. (2016) used convolutional features and a sparse auto-encoder for scene-level land use image classification, further demonstrating the superiority of CNNs in feature learning and representation. Xia et al. (2017) constructed a large-scale aerial scene classification dataset (AID) for performance evaluation among various CNN models and architectures developed under both strategies. However, the goal of these land use scene classifications is essentially image categorization, in which a small patch extracted from the original remote sensing image is labelled with a semantic category, such as 'airport', 'residential' or 'commercial' (Maggiori et al., 2017).
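The channel mismatch described above can be seen in the weight shapes alone. The sketch below is illustrative (the 64-filter, 3×3-kernel first layer and the red-channel copy are assumptions, not the paper's method): a pre-trained RGB first layer cannot ingest a four-band (R, G, B, NIR) patch unless its kernels are extended, e.g. by reusing the red-channel weights for the new band.

```python
import numpy as np

rng = np.random.default_rng(0)

# First-layer weights of a hypothetical pre-trained RGB network:
# shape (out_channels, in_channels=3, kernel_h, kernel_w).
w_rgb = rng.standard_normal((64, 3, 3, 3))

# A 4-band multispectral patch (R, G, B, NIR) cannot be passed through
# this layer directly -- the channel counts do not match.
patch = rng.standard_normal((4, 32, 32))
assert w_rgb.shape[1] != patch.shape[0]

# One common workaround (an assumption here, not taken from the paper):
# append a fourth input channel initialised from the red-band weights,
# so the pre-trained filters respond to NIR in a plausible way.
w_ms = np.concatenate([w_rgb, w_rgb[:, :1]], axis=1)
assert w_ms.shape == (64, 4, 3, 3)
```

Fully training from scratch avoids this surgery entirely, which is part of the flexibility argument made for the second strategy.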
Land-use scene classification, therefore, does not meet the actual requirement of remotely sensed land use image classification, which requires all pixels in an entire image to be identified and labelled into land use categories (i.e., producing a thematic map).

With the intrinsic advantages of hierarchical feature representation, patch-based CNN models provide great potential to extract higher-level land use semantic information. However, this patch-wise procedure introduces artefacts on the borders of the classified patches and often produces blurred boundaries between ground surface objects (Zhang et al., 2018a, 2018b), thus introducing uncertainty into the classification. In addition, to obtain a full-resolution classification map, pixel-wise densely overlapping patches are required at the model inference phase, which inevitably leads to extremely redundant computation. As an alternative, the fully convolutional network (FCN) and its extensions have been introduced into remotely sensed semantic segmentation to address the pixel-level classification problem (e.g. Liu et al., 2017; Paisitkriangkrai et al., 2016; Volpi and Tuia, 2017). These FCN-based methods are, however, mostly developed to solve low-level semantic (i.e. land cover) classification tasks, owing to insufficient spatial information in the inference phase and the lack of contextual information at the up-sampling layers (Liu et al., 2017). In short, we argue that the existing CNN models, including both patch-based and pixel-level approaches, are not well designed in terms of accuracy and/or computational efficiency to cope with the complicated problem of urban land use classification using VFSR remotely sensed imagery.

In this paper, we propose an innovative object-based CNN (OCNN) method to address the complex urban land-use classification task using VFSR imagery.
Specifically, object-based segmentation is initially employed to partition the urban landscape into functional units, which comprise two geometrically different kinds of object: linearly shaped objects (e.g. Highway, Railway, Canal) and other (non-linearly shaped) general objects. Two CNNs with different model structures and window sizes are applied to analyse and label these two kinds of object, and a rule-based decision fusion is undertaken to integrate the models for urban land use classification. The innovations of this research can be summarised as: 1) developing and exploiting CNNs within the framework of OBIA, where within-object information and between-object information are used jointly to characterise objects and their spatial context fully; and 2) designing the CNN networks and positioning them appropriately with respect to object size and geometry, and integrating the models in a class-specific manner to obtain an effective and efficient urban land use classification output (i.e., a thematic map). The effectiveness and computational efficiency of the proposed method were tested on two complex urban scenes in Great Britain.

The remainder of this paper is organized as follows: Section 2 introduces the general workflow and the key components of the proposed methods. Section 3 describes the study area and data sources. The results are presented in Section 4, followed by a discussion in Section 5. The conclusions are drawn in the last section.

2. Method

2.1 Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) is a multi-layer feed-forward neural network designed specifically to process large-scale images or sensory data in the form of multiple arrays, by considering local and global stationary properties (LeCun et al., 2015).
The main building block of a CNN is typically composed of multiple layers interconnected through a set of learnable weights and biases (Romero et al., 2016). Each layer is fed by small patches of the image that scan across the entire image to capture different characteristics of features at local and global scales. These image patches are generalized through alternating convolutional and pooling/subsampling layers within the CNN framework, until high-level features are obtained, on which a fully connected classification is performed (Schmidhuber, 2015). Additionally, several feature maps may exist in each convolutional layer, and the weights of the convolutional nodes in the same map are shared. This setting enables the network to learn different features while keeping the number of parameters tractable. Moreover, a nonlinear activation function (e.g. sigmoid, hyperbolic tangent, rectified linear unit) is applied after the convolutional layer to strengthen the non-linearity (Strigl et al., 2010). Specifically, the major operation performed in the CNN can be summarized as:

y_j = tanh( Σ_i x_i * k_ij + b_j )    (2)

where x_i denotes the i-th input feature map, k_ij the convolutional kernel connecting input map i to output map j, b_j the bias of output map j, * the convolution operator, and tanh the hyperbolic tangent activation function.
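The convolution-activation-pooling pipeline described above can be sketched in a few lines of numpy (a minimal illustration only; the map counts, kernel sizes and tanh activation are assumptions for the example, and the sliding window is implemented as cross-correlation, as in most CNN frameworks):

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' sliding-window correlation of one feature map x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_layer(x_maps, kernels, biases):
    """y_j = tanh(sum_i x_i * k_ij + b_j): one convolutional layer."""
    n_in, n_out = kernels.shape[:2]
    ys = []
    for j in range(n_out):
        s = sum(conv2d_valid(x_maps[i], kernels[i, j]) for i in range(n_in))
        ys.append(np.tanh(s + biases[j]))
    return ys

def max_pool(y, size=2):
    """Non-overlapping max pooling / subsampling of one feature map."""
    h, w = y.shape[0] // size, y.shape[1] // size
    return y[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(1)
x = [rng.standard_normal((8, 8)) for _ in range(2)]  # two 8x8 input maps
k = rng.standard_normal((2, 3, 3, 3))                # kernels k_ij, each 3x3
b = np.zeros(3)

y = conv_layer(x, k, b)       # three 6x6 feature maps, values in (-1, 1)
p = [max_pool(m) for m in y]  # subsampled to three 3x3 maps
```

Note the weight sharing: each kernel k_ij is reused at every spatial position, which is exactly what keeps the number of parameters tractable as the text describes.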
doi:10.1016/j.rse.2018.06.034