Content-level Annotation of Large Collection of Printed Document Images

A. Kumar, C.V. Jawahar
2007. Proceedings of the International Conference on Document Analysis and Recognition. IEEE.
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of ... ents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
doi:10.1109/icdar.2007.4377025 dblp:conf/icdar/KumarJ07 fatcat:pl74yrflzjd2jm7jxltwjzkleq