An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata

Arash Joorabchi, Abdulhussain E. Mahdi
2011, Journal of Information Science
This article describes an unsupervised approach for automatic classification of scientific literature archived in digital libraries and repositories according to a standard library classification scheme. The method is based on identifying all the references cited in the document to be classified and, using the subject classification metadata of the extracted references as catalogued in existing conventional libraries, inferring the most probable class for the document itself with the help of a voting mechanism. We have demonstrated the application of the proposed method and assessed its performance by developing a prototype software system for automatic classification of scientific documents according to the Dewey Decimal Classification (DDC) scheme. A dataset of one thousand research articles, papers, and reports from a well-known scientific digital library, CiteSeer, was used to evaluate the classification performance of the system. Detailed results of this experiment are presented and discussed.

… represented in the collection, and therefore deemed impractical in many cases due to the sheer volume of new materials published on a daily basis. For example, the number of new scientific publications in the field of biomedical science reportedly exceeds 1800 a day [2]. Motivated by the ever-increasing number of e-documents and the high cost of manual classification, Automatic Text Classification/Categorisation (ATC) - the automatic assignment of natural language text documents to one or more predefined classes/categories according to their contents - has become one of the key methods to enhance the information retrieval and knowledge management of digital textual collections.

Until the late '80s, the use of rule-based methods was the dominant approach to ATC. Rule-based classifiers are built by knowledge engineers who inspect a corpus of labelled sample documents and define a set of rules used to identify the class of unlabelled documents. Since the early '90s, with advances in the field of Machine Learning (ML) and the emergence of relatively inexpensive high-performance computing platforms, ML-based approaches have become widely associated with modern ATC systems. A comprehensive review of the application of ML algorithms in ATC, including the widely used Bayesian Model, k-Nearest Neighbour, and Support Vector Machine, is given in [3].
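The reference-based inference described in the abstract above can be sketched as a simple vote over the subject classes of a document's cited references. This is a minimal illustration, not the paper's actual system: the function name is invented, and a plain majority vote stands in for the paper's weighting of the inferred classes.

```python
from collections import Counter

def classify_by_references(reference_ddc_classes):
    """Infer the most probable DDC class for a document from the DDC
    classes of its cited references, as catalogued in conventional
    libraries (toy majority-vote simplification).

    reference_ddc_classes: list of DDC class codes, one per cited
    reference that could be matched against a library catalogue.
    """
    if not reference_ddc_classes:
        return None  # no matched references, no basis for a decision
    votes = Counter(reference_ddc_classes)
    # The class carried by the largest number of references wins.
    best_class, _ = votes.most_common(1)[0]
    return best_class

# A document whose references mostly carry DDC 006.3 (artificial
# intelligence) is classified under 006.3.
print(classify_by_references(["006.3", "006.3", "025.4", "006.3"]))  # → 006.3
```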
In general, an ML-based ATC algorithm uses a corpus of manually classified documents to train a classification function, which is then used to predict the classes of unlabelled documents. Applications of such algorithms include spam filtering, cataloguing news and journal articles, and classification of web pages, to name a few. However, although considerable success has been achieved in the above-listed applications, the prediction accuracy of ML-based ATC systems depends on a variety of factors, and no single ATC algorithm is adequate for all purposes. For example, it is commonly observed that as the number of classes in a classification scheme increases, the prediction accuracy of ML algorithms decreases. This limitation of ML-based ATC systems becomes much more significant in the case of scientific digital libraries, where the classification schemes used can contain thousands of classes. Furthermore, the quality and quantity of the training dataset used to train the classification function have a decisive effect on the performance of ML-based ATC algorithms. However, in many cases, there is little or no training data available. Consequently, over the past decade, most efforts of the ATC community have been directed towards developing new probability- and statistics-based ML algorithms that can enhance the performance of ML-based ATC systems in terms of prediction accuracy and speed, as well as reduce the number of manually labelled documents required to accurately train the classifiers.

On the other hand, as Golub [4], Yi [5], and Markey [6] discuss, there exists a less investigated approach to ATC that is attributed to the library science community. This approach focuses less on algorithms and more on leveraging comprehensive controlled vocabularies, such as the library classification schemes and thesauri which have been developed and used for manual classification of holdings in conventional libraries.
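The generic ML-based ATC workflow described above - train a classification function on manually classified documents, then predict classes for unlabelled ones - can be illustrated with a minimal multinomial Naive Bayes classifier (one of the Bayesian models the survey in [3] covers). The class name and the toy corpus below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesATC:
    """Minimal multinomial Naive Bayes text classifier: a sketch of
    the train-then-predict ATC workflow, not a production system."""

    def fit(self, documents, labels):
        # Count class frequencies (priors) and per-class word frequencies.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        words = document.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesATC().fit(
    ["protein gene cell biology", "stars galaxy orbit telescope"],
    ["biology", "astronomy"],
)
print(clf.predict("gene and protein expression"))  # → biology
```

As the surrounding text notes, such a classifier degrades as the number of classes grows and as labelled training data becomes scarce, which is precisely the setting of large library classification schemes.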
A library classification system is a coding system for organising library materials according to their subjects, with the aim of simplifying subject browsing. Library classification systems are used by expert library cataloguers to classify books and other materials (e.g., serials, audiovisual materials, computer files, maps, manuscripts, realia) in conventional libraries. The two most widely used classification systems in libraries around the world today are the Dewey Decimal Classification (DDC) [7] and the Library of Congress Classification (LCC) [8], which since their introduction in the late 19th century have undergone numerous revisions and updates.

A promising avenue for the application of this approach is the automatic classification of resources archived in digital libraries, where using standard library classification schemes is a natural and usually most suitable choice because of the similarities between conventional and digital libraries. Another application of this approach is the classification of web pages, where, due to their subject diversity, proper and accurate labelling requires a comprehensive classification scheme that covers a wide range of disciplines. In such applications, using library classification schemes can provide fine-grained classes that virtually cover all categories and branches of human knowledge.

In general, ATC systems that have been developed based on the above library science approach can be divided into two main categories:

1. String matching-based systems: these systems do not rely on ML algorithms to perform the classification task. Instead, they use a method which involves string-to-string matching between words in a term list extracted from library thesauri and classification schemes, and words in the text to be classified.
Here, the unlabelled incoming document can be thought of as a search query against the library classification schemes and thesauri, and the result of this search includes the class(es) of the unlabelled document. One of the well-known examples of such systems is the Scorpion project [9] by the Online Computer Library Center (OCLC) [10]. Scorpion is an ATC system for classifying e-documents according to the DDC scheme. It uses a clustering method based on term frequency to find the classes most relevant to the document to be classified. A similar experiment was conducted by Larson [11] in the early '90s, who built normalised clusters for 8,435 classes in the
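A toy version of this string-matching idea: score each class by how often its term-list entries occur in the text, and return the best-scoring class. The two-class term lists below are invented; real systems such as Scorpion derive their vocabularies from the DDC scheme and library thesauri themselves.

```python
from collections import Counter

# Hypothetical mini term lists keyed by DDC class, standing in for the
# much larger vocabularies extracted from classification schemes.
CLASS_TERMS = {
    "004": {"computer", "software", "hardware", "programming"},
    "610": {"medicine", "clinical", "patient", "disease"},
}

def match_classes(text, class_terms=CLASS_TERMS):
    """Score each class by term-frequency overlap between its term
    list and the text, then return the highest-scoring class."""
    words = Counter(text.lower().split())
    scores = {c: sum(words[t] for t in terms)
              for c, terms in class_terms.items()}
    return max(scores, key=scores.get)

print(match_classes("clinical study of patient disease outcomes"))  # → 610
```

No training corpus is needed here, which is exactly what distinguishes these systems from the ML-based ATC algorithms discussed earlier.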
doi:10.1177/0165551511417785