Image Based Mango Fruit Detection, Localisation and Yield Estimation Using Multiple View Geometry

Madeleine Stein, Suchet Bargoti, James Underwood
Sensors 2016, 16, 1915
This paper presents a novel multi-sensor framework to efficiently identify, track, localise and map every piece of fruit in a commercial mango orchard. A multiple viewpoint approach is used to solve the problem of occlusion, thus avoiding the need for labour-intensive field calibration to estimate actual yield. Fruit are detected in images using a state-of-the-art faster R-CNN detector, and pair-wise correspondences are established between images using trajectory data provided by a navigation system. A novel LiDAR component automatically generates image masks for each canopy, allowing each fruit to be associated with the corresponding tree. The tracked fruit are triangulated to locate them in 3D, enabling a number of spatial statistics per tree, row or orchard block. A total of 522 trees and 71,609 mangoes were scanned on a Calypso mango orchard near Bundaberg, Queensland, Australia, with 16 trees counted by hand for validation, both on the tree and after harvest. The results show that single, dual and multi-view methods can all provide precise yield estimates, but only the proposed multi-view approach can do so without calibration, with an error rate of only 1.36% for individual trees.

[...] variety of platforms, such as manned and unmanned ground vehicles (UGVs) [4–8], unmanned aerial vehicles (UAVs) and hand-held sensors [9]. Different types of imaging sensors have also been used, including "standard" (visible light) cameras as well as stereo, near infra-red, long-wave thermal infrared cameras and LiDAR [2–4,9–14]. Standard cameras are a common choice due to their low cost, the richness and comparatively high resolution of the data they provide, and their familiarity to the machine vision community in terms of data acquisition and processing techniques [12].

Much of the research has focused on improving the accuracy of fruit detection within imagery; however, the relationship between image-based fruit counts and the actual number of fruit on the tree is challenged by visual occlusion, which cannot be addressed by improved classification performance alone. This problem is acknowledged in the vast majority of the literature. Progress over the last five years from the machine vision community with convolutional neural networks (CNNs) has led to highly accurate fruit detection in colour imagery [15–17], so arguably the focus should shift towards fruit counting systems that are designed to acquire and process orchard imagery in a way that delivers the highest accuracy relative to the actual field and harvest fruit counts.

Systems that acquire a single image per tree (or one image from both inter-row perspectives) require either that all fruit are visible from only one or two views, or that there is a consistent, repeatable and modellable relationship between the number of visible fruit and the total on the tree. In the latter case, a process of calibration to manual field counts can be performed, which has proven accurate for some canopy types, including trellised apple orchards [5,6,15], almond orchards [7] and vineyards [18,19]. The calibration process requires manual field or harvest counts, which is labour intensive and would ideally be repeated every year. Calibration was validated from one year to the next by Nuske et al. [18], but it is not guaranteed to hold over multiple years of canopy growth, nor for different fruit varieties and tree canopy geometries. Furthermore, the process is subject to human error, which propagates to all subsequent yield estimates.
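As a concrete illustration of the single-view calibration approach described above, the following sketch (not part of the original paper; all counts and variable names are hypothetical) fits a linear correction factor relating image-based fruit counts to manual field counts for a few calibration trees, then applies it to uncounted trees. It assumes an ordinary least-squares fit through the origin, one common choice for count calibration.

import numpy as np

# Hypothetical calibration data: per-tree counts from a single-view image
# pipeline and the corresponding manual field counts (ground truth).
vision_counts = np.array([52, 80, 61, 95, 70], dtype=float)
field_counts = np.array([130, 210, 150, 240, 185], dtype=float)

# Least-squares fit of field_count ≈ k * vision_count (line through the origin):
# k = sum(x * y) / sum(x * x)
k = np.dot(vision_counts, field_counts) / np.dot(vision_counts, vision_counts)

# Apply the calibration factor to new, uncounted trees.
new_vision_counts = np.array([66, 74, 58], dtype=float)
estimated_yield = k * new_vision_counts

print(f"calibration factor k = {k:.2f}")
print("estimated per-tree yield:", np.round(estimated_yield).astype(int))

Such a factor k is specific to the canopy type, variety and season in which it was estimated, which is precisely why the calibration step is labour intensive to maintain and why the multi-view approach discussed below aims to remove it.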
Another approach is to acquire and process images from multiple viewpoints to combat the effect of occlusion from any one perspective. For relatively open canopies where the fruit is not heavily bunched (e.g., apples, mangoes, citrus), this has the potential to enable every piece of fruit to be directly observed and counted. For heavily bunched fruit, such as grapes, multiple viewpoints may only improve performance if the camera can be flexibly positioned in three dimensions around the bunches. To count fruit seen from multiple views, each fruit must be uniquely identified, associated and tracked between images to avoid over-counting. There are several examples of multi-viewpoint fruit tracking in the literature for different types of fruit: peppers have been tracked and counted using a statistical approach that clusters repeated observations [20]; optical flow has been used to associate fruit between subsequent images of citrus [9]; pineapples were tracked and counted over image sequences using feature matching and structure from motion (SfM) [21]; and stereo vision has been used to detect and associate apples [4,22]. Amongst these examples, quantified comparisons to the true number of fruit counted in the field are given in only two cases [4,21], and both show a near-unity relationship, suggesting that multi-view fruit tracking has promise. For apples, the relationship held when the fruit was thinned, but under-counting occurred when apples were left in bunches [4].

In this paper, we investigate multiple-viewpoint fruit detection, tracking and counting as a means to measure and map the quantity of fruit in a mango orchard. Mangoes are grown on individual trees with three-dimensional (3D) canopies, in contrast to the 2D apple trellis fruit-wall studied by Wang et al. [4]. The 3D canopies present distinct challenges for fruit detection and tracking: the large detection volume is difficult to illuminate uniformly with strobes or by choosing the time of day of the scan; the distance between the sensor and the fruit is variable, causing appearance variation due to scale and optical-flow variation as a function of range; and the complex canopy geometry can cause fine-grained patterns of occlusion between subsequent monocular frames or stereo pairs, which, combined with the sensor-to-target range variation, can cause unpredictable failures in dense stereo depth estimation. To meet these challenges, we use a state-of-the-art CNN approach for fruit detection in individual monocular images, which can handle the variability in scale and illumination of the fruit.
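To make the multiple view geometry step concrete, the sketch below shows one standard way, not necessarily the exact formulation used in the paper, to triangulate a tracked fruit's 3D position from its pixel locations in two or more images, given the 3x4 camera projection matrices that a calibrated camera and the vehicle's navigation trajectory would provide. This is textbook linear (DLT) triangulation; the projection matrices and pixel coordinates in the usage example are hypothetical.

import numpy as np

def triangulate_point(projections, pixels):
    """Linear (DLT) triangulation of a single 3D point.

    projections: list of 3x4 camera projection matrices P_i = K [R_i | t_i],
                 one per view in which the fruit was detected.
    pixels:      list of (u, v) pixel coordinates of the fruit in each view.
    Returns the 3D point in the common (e.g., orchard map) frame.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point X:
        # u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Hypothetical two-view example: identity intrinsics, 1 m baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
print(triangulate_point([P1, P2], [(0.2, 0.1), (0.0, 0.1)]))  # -> approx [1.0, 0.5, 5.0]

Once triangulated, each tracked fruit can be assigned to its tree via the LiDAR-derived canopy mask and accumulated into the per-tree, per-row or per-block statistics described in the abstract.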
doi:10.3390/s16111915 pmid:27854271 pmcid:PMC5134574