Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters

Yongyang Xu, Liang Wu, Zhong Xie, Zhanlong Chen
Remote Sensing, 2018
Very high resolution (VHR) remote sensing imagery has been used for land cover classification, and the field is shifting from land-use classification toward pixel-level semantic segmentation. Inspired by the recent success of deep learning and filtering methods in computer vision, this work presents a segmentation model that designs an image segmentation neural network based on deep residual networks and uses a guided filter to extract buildings from remote sensing imagery. Our method
includes the following steps: first, the VHR remote sensing imagery is preprocessed and some hand-crafted features are calculated. Second, the designed deep network architecture is trained on urban-district remote sensing images to extract buildings at the pixel level. Third, a guided filter is employed to optimize the classification map produced by deep learning; at the same time, salt-and-pepper noise is removed. Experimental results on the Vaihingen and Potsdam datasets demonstrate that our method, which benefits from neural networks and guided filtering, achieves higher overall accuracy than other machine learning and deep learning methods. The proposed method shows outstanding performance in extracting buildings from the diverse objects of the urban district.

A novel mean shift (MS)-based multiscale method was used in urban mapping [14]. Morphological profiles (MP) were incorporated into spatial-spectral classification [15]. Conditional random fields and machine learning methods, such as SVM and random forest, were also introduced to address the classification of remote sensing images [6,16]. In addition, encouraged by deep neural network features that have shown an outstanding capacity for visual recognition [17], object detection [18], and semantic segmentation [19-21], deep learning was introduced to tackle long-standing problems in remote sensing [22]. Deep neural networks have been successfully used to classify and densely label high-resolution remote sensing imagery [23], and can be applied to various remote sensing tasks: detection, classification, or data fusion [24]. A deep learning framework was proposed to detect buildings in high-resolution multispectral imagery (RGB and near-infrared) [25]. Multi-scale convolutional neural networks (CNNs) combined with conditional random fields (CRFs) were used for dense classification of street scenes [26].
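The guided-filter post-processing step above can be sketched in plain NumPy, following He et al.'s local-linear-model formulation. The function names, the toy radius `r`, and the regularizer `eps` below are illustrative choices, not the authors' exact settings:

```python
import numpy as np

def window_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, edge-padded, via an integral image."""
    k = 2 * r + 1
    xp = np.pad(x, r, mode="edge").astype(np.float64)
    ii = np.zeros((xp.shape[0] + 1, xp.shape[1] + 1))
    ii[1:, 1:] = xp.cumsum(axis=0).cumsum(axis=1)
    s = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    return s / (k * k)

def guided_filter(I, p, r=4, eps=1e-3):
    """Smooth a per-pixel building-probability map p using guidance image I.

    Edges present in I (e.g., building outlines) are preserved in the output,
    while isolated salt-and-pepper responses in p are averaged away.
    """
    mean_I, mean_p = window_mean(I, r), window_mean(p, r)
    corr_Ip, corr_II = window_mean(I * p, r), window_mean(I * I, r)
    cov_Ip = corr_Ip - mean_I * mean_p
    var_I = corr_II - mean_I * mean_I
    a = cov_Ip / (var_I + eps)   # local linear coefficient q = a*I + b
    b = mean_p - a * mean_I
    return window_mean(a, r) * I + window_mean(b, r)
```

In a pipeline like the one described here, `p` would be the network's building-probability map and `I` a grayscale version of the VHR image; thresholding the filtered map would then yield the final building mask.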
An end-to-end trainable deep convolutional neural network (DCNN) was built to improve semantic image segmentation with boundary detection [12]. Studies have shown that remote sensing image classification results are still far from conclusive [27]. Although the improved resolution of remote sensing images helps to detect and distinguish various objects on the ground, it has also made some objects, especially spectrally similar classes, harder to separate: the intra-class variance of objects such as buildings, streets, shadows, and cars increases while the inter-class variance decreases [5,28,29]. In other words, different objects may present the same spectral values within the imagery, which makes it more difficult to extract spatial features that reliably classify pixels when extracting buildings. In recent years, fully convolutional networks (FCNs) have shown strong performance in semantic segmentation [30-32]. Indeed, FCNs not only learn to classify pixels, determining what each pixel is, but also predict the structures of spatial objects [33]: the model can detect different classes of objects on the ground and predict their shapes, such as buildings, the curves of roads, and trees. However, FCNs fall short when detecting small objects or objects with many boundaries, because object boundaries are blurred and the results are visually degraded during classification [12]. Some research has tried to improve semantic segmentation by developing the network structure, either by adding skip connections to reintroduce the high-frequency detail of an image after upsampling [34,35] or by using dilated convolutions combined with CRFs [36].
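The skip-connection idea mentioned above can be illustrated with a minimal NumPy sketch: class scores from a coarse, downsampled layer are upsampled and fused with scores from an earlier, higher-resolution layer, reintroducing detail lost during downsampling. The function names and the simple additive fusion are assumptions for illustration, not a specific published architecture:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) score map.
    Real FCNs typically use learned (transposed) or bilinear upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_with_skip(coarse_scores, skip_scores):
    """FCN-style skip fusion: upsample coarse class scores and add the
    scores predicted from an earlier, higher-resolution layer."""
    up = upsample2x(coarse_scores)
    assert up.shape == skip_scores.shape, "skip branch must match upsampled size"
    return up + skip_scores
```

The additive fusion lets the coarse branch supply semantic context while the skip branch restores sharp object boundaries, which is why such connections help with small objects.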
An improved FCN model, designed as a multi-scale network architecture with a skip-layer structure, was trained to perform state-of-the-art natural image semantic segmentation [31]. A deep FCN with no downsampling was introduced to boost the effective training sample size and improve classification accuracy [37]. Applications of urban district classification using VHR remote sensing imagery range from urban management to flow monitoring. Recent research has sought to improve accuracy in areas such as image encoding, feature extraction from raw images [38,39], and the use of deep neural networks such as CNNs and FCNs to label pixels, especially in VHR remote sensing imagery [40,41]. However, pixel labelling of VHR imagery in urban districts is challenging because of the varied semantic classes and geometric shapes involved. Buildings and other impervious objects in urban areas are very complicated in both their spectral and spatial characteristics, which makes them inefficient and difficult to extract. VHR imagery is usually limited to three or four broad bands, and these spectral features alone may lack the power to distinguish objects: different objects can have similar spectral values (for example, roads and roofs), and the same object can have different spectral values (for example, a roof divided into sunlit and shaded parts). Therefore, discriminative appearance-based features are needed to improve performance. Fortunately, most VHR remote sensing imagery has a corresponding overlapping image (or combined camera +
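One common hand-crafted spectral feature for imagery with a near-infrared band, of the kind the preprocessing step might compute, is the Normalized Difference Vegetation Index (NDVI). This is a standard index rather than a feature the source text names explicitly, so treat it as one plausible example:

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - R) / (NIR + R).

    Vegetation reflects strongly in the near-infrared, so high NDVI helps
    separate trees and grass from roofs and roads, which can share very
    similar RGB values. eps guards against division by zero.
    """
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)
```

Appending a band like this to the raw spectral channels gives the network a feature that is invariant to the illumination differences (sunlit vs. shaded roof halves) discussed above.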
doi:10.3390/rs10010144