OCD: An Optimized and Canonical Document Format
2009
2009 10th International Conference on Document Analysis and Recognition
Revealing and manipulating the structured content of PDF documents is a difficult task that requires pre-processing and reverse-engineering techniques. In this paper, we present OCD, an optimized, easy-to-process, and canonical format for representing structured electronic documents. We describe the system and methods used to reverse engineer PDF documents into the OCD format, as well as the techniques used to optimize the format. Finally, we present concrete evaluations of the OCD format's compactness and restructuring performance.
doi:10.1109/icdar.2009.159
dblp:conf/icdar/BloechleLI09
fatcat:cfrnrvlk5bdchdf7jpiskkekza