An unsupervised classification process for large datasets using web reasoning

Rafael Peixoto, Thomas Hassan, Christophe Cruz, Aurélie Bertaux, Nuno Silva
2016 Proceedings of the International Workshop on Semantic Big Data - SBD '16  
Determining valuable data among large volumes of data is one of the main challenges in Big Data. We aim to extract knowledge from these sources using a Hierarchical Multi-Label Classification process called Semantic HMC. This process automatically learns a label hierarchy and classifies items from very large data sources. The Semantic HMC process comprises five steps: Indexation, Vectorization, Hierarchization, Resolution and Realization. The first three steps automatically construct the label hierarchy from data sources. The last two steps classify new items according to the label hierarchy. This paper focuses on the last two steps and presents a new, highly scalable process to classify items from huge sets of unstructured text by using ontologies and rule-based reasoning. The process is implemented on a scalable, distributed platform for processing Big Data, and some results are discussed.
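The classification idea in the last two steps can be illustrated with a toy sketch. It is not the authors' implementation, which relies on ontologies and a rule-based reasoner on a distributed platform. The hierarchy, the rules and the names `PARENT`, `RULES` and `classify` are all hypothetical, and the rule form (a label applies when all of its terms occur in the item) is an assumption made only for illustration. The sketch shows the general property of hierarchical multi-label classification that an item assigned a label also belongs to every ancestor of that label.

```python
from typing import Dict, Optional, Set

# Hypothetical label hierarchy: child -> parent (None marks a root label).
PARENT: Dict[str, Optional[str]] = {
    "science": None,
    "biology": "science",
    "genetics": "biology",
    "sport": None,
}

# Hypothetical rules (an assumption, not taken from the paper):
# a label applies when all of its required terms occur in the item.
RULES: Dict[str, Set[str]] = {
    "genetics": {"gene", "dna"},
    "biology": {"cell"},
    "sport": {"match", "team"},
}


def classify(text: str) -> Set[str]:
    """Return every label of the item, including inherited ancestor labels."""
    terms = set(text.lower().split())
    # Rule firing: keep labels whose required terms are all present.
    direct = {label for label, required in RULES.items() if required <= terms}
    # Hierarchy closure: walk from each fired label up to its root.
    labels: Set[str] = set()
    for label in direct:
        node: Optional[str] = label
        while node is not None:
            labels.add(node)
            node = PARENT[node]
    return labels
```

For example, `classify("The gene encodes DNA repair")` fires only the `genetics` rule, and the closure step then adds `biology` and `science`, so a single item ends up with several labels at different levels of the hierarchy.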
doi:10.1145/2928294.2928301 dblp:conf/sigmod/PeixotoHCBS16 fatcat:f6ina2i74rfmrpzk76z45wmrnu