Machine learning for improved data analysis of biological aerosol using the WIBS

Simon Ruske, David O. Topping, Virginia E. Foot, Andrew P. Morse, Martin W. Gallagher
2018 Atmospheric Measurement Techniques Discussions  
<p><strong>Abstract.</strong> Primary biological aerosols, including bacteria, fungal spores and pollen, have important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will respond differently in the presence of ultraviolet light, which could potentially be used to discriminate between different types of biological aerosol. Development of ultraviolet light-induced fluorescence (UV-LIF) instruments such as the
Wideband Integrated Bioaerosol Sensor (WIBS) has made it possible to collect size, morphology and fluorescence measurements in real time. However, without studying responses from the instrument in the laboratory, it is unclear to what extent we can discriminate between different types of particles. Collection of laboratory data is vital to validate any approach used to analyse the data and to ensure that the available data is utilised as effectively as possible. <br><br> In this manuscript we test a variety of methodologies on traditional reference particles and a range of laboratory generated aerosols. Hierarchical Agglomerative Clustering (HAC) has previously been applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means and gradient boosting. <br><br> Whilst HAC was able to discriminate effectively between the reference particles, yielding a classification error of only 1.8<span class="thinspace"></span>%, similar results were not obtained when testing on laboratory generated aerosol, where the classification error was found to be between 11.5<span class="thinspace"></span>% and 24.2<span class="thinspace"></span>%. Furthermore, there is a worryingly large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory generated aerosol tested. <br><br> The best results were obtained using gradient boosting, where the misclassification rate was between 4.38<span class="thinspace"></span>% and 5.42<span class="thinspace"></span>%. The largest contribution to this error was the pollen samples, where 28.5<span class="thinspace"></span>% of the samples were misclassified as fungal spores. 
The technique was also robust to changes in data preparation, provided a fluorescent threshold was applied to the data. <br><br> Where laboratory training data is unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets, where 22.9<span class="thinspace"></span>% of the data was left unclassified, we were able to produce three distinct clusters, obtaining a classification error of only 1.42<span class="thinspace"></span>% on the classified data. These results could not be replicated, however, for the other data set, where 26.8<span class="thinspace"></span>% of the data was not classified and a classification error of 13.8<span class="thinspace"></span>% was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of parameters when using this method on ambient data. <br><br> There is a clear need for the collection of additional laboratory generated aerosol to improve interpretation of current databases and to aid in the analysis of data collected from an ambient environment. New instruments with a greater resolution are likely to improve on current discrimination between pollen, bacteria and fungal spores, and even between their different types; however, the need for extensive laboratory training data sets will grow as a result.</p>
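The DBSCAN results above are reported as two numbers: the fraction of data left unclassified and the classification error on the data that was classified. The following is a minimal Python sketch of how such a pair of figures might be computed; it is not the authors' code. The function name score_clustering, the use of -1 as the "unclassified" label (the convention scikit-learn's DBSCAN uses for noise points), and the majority-vote mapping from clusters to particle classes are all assumptions made for illustration, and the labels in the example are made up rather than taken from the paper.

```python
from collections import Counter

NOISE = -1  # assumed label for points left unclassified (noise)


def score_clustering(true_labels, cluster_labels, noise=NOISE):
    """Return (unclassified_fraction, error_on_classified).

    Each cluster is mapped to the most common true class among its
    members (majority vote, an assumption of this sketch); members of
    any other class count as misclassified. Noise points are excluded
    from the error but counted in the unclassified fraction.
    """
    if len(true_labels) != len(cluster_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    # Keep only the points that were assigned to a cluster.
    classified = [(t, c) for t, c in zip(true_labels, cluster_labels) if c != noise]
    unclassified_fraction = 1 - len(classified) / len(true_labels)

    # Count the true classes present in each cluster.
    per_cluster = {}
    for t, c in classified:
        per_cluster.setdefault(c, Counter())[t] += 1

    # Points matching their cluster's majority class are treated as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    error = 1 - correct / len(classified) if classified else float("nan")
    return unclassified_fraction, error


# Hypothetical labels for ten particles (illustrative only).
true = ["bacteria"] * 4 + ["fungal"] * 4 + ["pollen"] * 2
clusters = [0, 0, 0, 1, 1, 1, 1, -1, 2, -1]
unclassified, error = score_clustering(true, clusters)
print(f"unclassified: {unclassified:.1%}, error on classified: {error:.1%}")
```

Reporting both figures together matters because a low error on the classified data can be obtained simply by leaving more of the difficult particles unclassified.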
doi:10.5194/amt-2018-126 fatcat:f2wjcaghtbggxpzawpmrly432u