Utilization of synergetic human-machine clouds: a big data cleaning case

Deniz Iren, Gokhan Kul, Semih Bilgen
2014 Proceedings of the 1st International Workshop on CrowdSourcing in Software Engineering - CSI-SE 2014  
Cloud computing and crowdsourcing are growing trends in IT. Combining the strengths of both machine and human clouds within a hybrid design enables us to overcome certain problems and achieve efficiencies. In this paper, we present a case in which we developed a hybrid, throw-away prototype software system to solve a big data cleaning problem, in which we corrected and normalized a data set of 53,822 academic publication records. The first step in our solution consists of utilization of external ... OI query web services to label the records with matching DOIs. Then we used customized string similarity calculation algorithms based on Levenshtein Distance and Jaccard Index to grade the similarity between records. Finally, we used crowdsourcing to identify duplicates among the residual record set consisting of similar yet not identical records. We consider this proof of concept to be successful and report that we achieved certain results that we could not have achieved by using either human or machine clouds alone.
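The abstract does not say how the similarity algorithms were customized, so the following is only a minimal sketch of the two textbook measures it names, combined into a hypothetical triage step. The function names, the averaging of the two scores, and the thresholds `AUTO_MATCH` and `CROWD_FLOOR` are all assumptions made for illustration, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard index over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical thresholds, chosen only for this illustration.
AUTO_MATCH = 0.95
CROWD_FLOOR = 0.60


def triage(title_a: str, title_b: str) -> str:
    """Route a record pair by the mean of the two similarity scores."""
    a, b = title_a.lower(), title_b.lower()
    score = (levenshtein_similarity(a, b) + jaccard(a, b)) / 2
    if score >= AUTO_MATCH:
        return "auto_match"   # near-identical: treat as duplicate
    if score >= CROWD_FLOOR:
        return "crowd"        # similar yet not identical: ask humans
    return "distinct"
```

In this sketch, only pairs that fall between the two thresholds would be sent to crowd workers, which mirrors the residual set of similar yet not identical records described in the abstract.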
doi:10.1145/2593728.2593733 dblp:conf/icse/IrenKB14 fatcat:yd4zclelwneipj7np5sq5itghy