GeoSVM: an efficient and effective tool to predict species' potential distributions
W. Zuo, N. Lao, Y. Geng, K. Ma
2008
Journal of Plant Ecology
Here, we also give the results of our evaluation of the performance of GeoSVM. We used data for 30 species of Rhododendron in China as a case study to compare GeoSVM and Genetic Algorithm for Rule-Set Prediction (GARP), one of the most popular models to predict species' potential distributions. We found that GeoSVM is more accurate and efficient than GARP. Furthermore, GeoSVM can handle more environmental information, which significantly improves the prediction accuracy. Patterns of species
... ribution can potentially answer a number of fundamental questions in ecology: where the original habitats of a species are; how species are distributed on Earth; how species achieve their distribution patterns; how the distribution patterns of different species relate to one another; and how to set policy to conserve endangered species. The development of computer technology and machine learning methods enables the use of environmental factors to simulate species' potential distributions. Various statistical models have been explored in previous work for predicting species distributions, e.g. generalized linear models, generalized additive models, logistic regression, neural networks, decision trees, principal components analysis (PCA), Mahalanobis distance, the maximum entropy method, genetic algorithms and regression tree analysis (see the survey in Zuo et al. 2007).
These statistical models are commonly used in a wide range of other applications. However, when they are applied to predicting potential species distributions, a common problem arises: high dimensionality combined with small sample size. The problem is inherent in the task. Predicting potential species distributions generally depends on specimen data, which are accumulated through fieldwork, and because fieldwork is expensive and difficult, the quantity of available data is limited. China has >400 species of Rhododendron, but only 161 of them have >20 location samples (the lower limit of sample size for GARP). On the other hand, >100 environmental factors can potentially affect species distribution, including meteorological factors such as annual, monthly, maximum and minimum values of temperature, precipitation and relative humidity, and geographical factors such as altitude, slope, soil type and vegetation type.
Most statistical methods rely on the large-sample assumption that 'the number of samples is much larger than the number of parameters'. As we can see, however, this assumption does not hold for species distribution data. In this situation, these models usually perform well on the training samples but poorly on new test data, a phenomenon called 'overtraining' (also known as overfitting). Dimension-reduction methods such as PCA can mitigate this problem, but only to some extent.
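The gap between training and test performance described above can be reproduced on synthetic data. The following is a minimal sketch, not GeoSVM or GARP: it fits ordinary least squares (via NumPy) to 20 samples with 100 predictors, echoing the sample and factor counts mentioned in the text. The function name `fit_and_score`, the choice of three informative predictors and the noise level are all illustrative assumptions.

```python
import numpy as np


def fit_and_score(n_train=20, n_features=100, n_test=200, seed=0):
    """Fit least squares with fewer samples than predictors (synthetic data).

    Returns (train_mse, test_mse). All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Assumption: only the first 3 predictors actually influence the response.
    w_true = np.zeros(n_features)
    w_true[:3] = 1.0

    X_train = rng.normal(size=(n_train, n_features))
    X_test = rng.normal(size=(n_test, n_features))
    y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
    y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

    # With more parameters than samples the system is underdetermined;
    # lstsq returns the minimum-norm solution, which fits the training
    # data essentially exactly.
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_mse = float(np.mean((X_train @ w_hat - y_train) ** 2))
    test_mse = float(np.mean((X_test @ w_hat - y_test) ** 2))
    return train_mse, test_mse


if __name__ == "__main__":
    train_mse, test_mse = fit_and_score()
    print(f"training MSE: {train_mse:.3e}")
    print(f"test MSE:     {test_mse:.3e}")
```

Under these assumptions the training error is close to zero while the test error stays large, which is the 'overtraining' pattern: the model has enough free parameters to reproduce the training samples without learning the underlying relationship.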
doi:10.1093/jpe/rtn005
fatcat:4aueph6rzzfqfpouz2pdipciwa