Wrapping PDF Documents Exploiting Uncertain Knowledge [chapter]

S. Flesca, S. Garruzzo, E. Masciari, A. Tagarelli
2006 Lecture Notes in Computer Science  
The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
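The idea of constraining token groupings with fuzzy logic conditions can be illustrated with a minimal Python sketch. It is not the authors' wrapper language or evaluation algorithm: the `Token` fields, the linear membership function, the tolerances, the min-based conjunction, and the greedy grouping pass are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical information token: text plus layout attributes."""
    text: str
    x: float          # left edge, in page units
    y: float          # baseline position, in page units
    font_size: float


def closeness(a, b, tolerance):
    """Fuzzy membership in [0, 1]: 1 when a == b, falling linearly to 0 at `tolerance`."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def same_line(t1, t2):
    # Assumed tolerance: baselines 4 units apart are no longer "the same line".
    return closeness(t1.y, t2.y, tolerance=4.0)


def same_font(t1, t2):
    # Assumed tolerance: font sizes 2 units apart are no longer "the same font".
    return closeness(t1.font_size, t2.font_size, tolerance=2.0)


def fuzzy_and(*degrees):
    """Conjunction of fuzzy conditions using the minimum t-norm."""
    return min(degrees)


def group_tokens(tokens, threshold=0.5):
    """Greedy bottom-up pass: a token joins the current group when the fuzzy
    constraint against the group's last token holds to at least `threshold`."""
    groups = []
    # Reading order is approximated by sorting on (y, x); real layouts need more care.
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        if groups:
            last = groups[-1][-1]
            if fuzzy_and(same_line(last, tok), same_font(last, tok)) >= threshold:
                groups[-1].append(tok)
                continue
        groups.append([tok])
    return groups


tokens = [
    Token("Invoice", 10, 100, 12),
    Token("No.", 60, 101, 12),
    Token("Total", 10, 140, 10),
]
groups = group_tokens(tokens)
print([[t.text for t in g] for g in groups])
```

Here the second token sits one unit below the first, so the line constraint holds only to degree 0.75 rather than exactly, yet the two tokens are still grouped. Tolerating such small layout deviations is the kind of uncertainty that fuzzy conditions are meant to absorb.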
doi:10.1007/11767138_13 fatcat:rwvjwrcbdrfand5zkuzxvtuq7y