A Rule-Based Information Extraction Approach for Extracting Metadata from PDF Books

Abrar Alamoudi, Amal Alomari, Sarah Alwarthan, Atta-ur-Rahman
2021 Innovative Computing Information and Control Express Letters, Part B: Applications  
Nowadays PDF books have become widely used because of their easy availability and accessibility through the Internet from anywhere at any time. However, automatically extracting the information structure from PDF sources is considered a challenging task. Nonetheless, it is necessary for effective knowledge representation, archiving, and retrieval through the creation of digital libraries in the semantic web. In this work, an intelligent rule-based approach is proposed for accurately extracting the logical metadata from PDF books. A set of rules and underlying patterns is defined to achieve high accuracy. The experimental results indicate that the proposed approach extracts metadata from PDF books successfully, with overall accuracies of 94.62% and 90.27% on the training and testing sets, respectively. Moreover, this study could help publishing houses enhance the indexing, citation, and usability of their electronic books. It would also help researchers who are interested in improving the performance of information extraction from PDF documents.
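As a rough illustration of what a rule-based extractor with per-field patterns can look like, the Python sketch below applies one regular expression per metadata field to the text of a front-matter page and scores the result against a hand-labelled record. The field names (`isbn`, `year`, `edition`), the patterns, and the helper names `extract_metadata` and `accuracy` are assumptions made for this example; they are not the rules or the evaluation procedure defined in the paper.

```python
import re

# Hypothetical rules: one regular expression per metadata field.
# These patterns are illustrative only, not the paper's rule set.
RULES = {
    "isbn": re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9Xx -]{8,16}[0-9Xx])"),
    "year": re.compile(r"(?:Copyright|©)\s*(?:\(c\)\s*)?((?:19|20)\d{2})"),
    "edition": re.compile(r"\b(\w+)\s+Edition\b", re.IGNORECASE),
}


def extract_metadata(page_text):
    """Apply every rule to the page text and keep the first match per field."""
    found = {}
    for field, pattern in RULES.items():
        match = pattern.search(page_text)
        if match:
            found[field] = match.group(1).strip()
    return found


def accuracy(predicted, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items()
                  if predicted.get(key) == value)
    return correct / len(expected)


# Made-up front-matter text, as it might come out of a PDF text extractor.
sample_page = (
    "Second Edition\n"
    "Copyright © 2019 Example Press\n"
    "ISBN: 978-3-16-148410-0\n"
)
labelled = {"isbn": "978-3-16-148410-0", "year": "2019", "edition": "Second"}

result = extract_metadata(sample_page)
print(result)
print(f"accuracy: {accuracy(result, labelled):.2%}")
```

Keeping the rules in a plain dictionary means a new field can be supported by adding one pattern, and the same scoring function can be run separately on a training set and a held-out testing set.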
doi:10.24507/icicelb.12.02.121 fatcat:4kjs4omuxvhrnhujft5ad7maca