Meta-Data Extraction from Bibliographic Documents for the Digital Library [chapter]

A. Belaïd, D. Besagni
2007 Advances in Pattern Recognition  
This chapter addresses the automatic extraction of metadata from digitized documents by retro-conversion techniques. The focus is on bibliographic documents, which are by nature a source of such metadata. Because they provide strong structure for a digital library (DL), their automatic recognition is of obvious interest. However, as their origins vary widely (references, citations, tables of contents, index cards), a generic methodology is proposed for recovering their structure. Based on a ... st morphological labeling of the text, it looks for syntactic elements (syntagmas) that reveal the nature of each bibliographic field (title, authors, date, publication source, etc.). Depending on the case, the syntax is validated either by a given grammar or by occurrence analysis across the different document elements (e.g. several references in a bibliography, or articles in a table of contents). In the latter case, a bottom-up procedure generates a structure model from the well-recognized elements and applies it to the rest. The modeling must take into account inter- and intra-field relationships. Experiments performed on different types of documents confirm the interest of this approach.
doi:10.1007/978-1-84628-726-8_15 fatcat:6cvk6b2ydneoplpslmuja5jwbe