Massive, Free and Reproducible Groundtruthed Document Image Databases Generation with DocCreator

Nicholas Journet, Boris Mansencal, Muriel Visani
2017. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE.
Whether your research focuses on image restoration, layout analysis, text-graphic separation, binarization, OCR, etc., you need a groundtruthed database to train or evaluate your method. This article presents DocCreator, a multi-platform, open-source software tool that can create many synthetic document images with controlled groundtruth. With DocCreator, you can create complete synthetic images by choosing the text, font, background and layout to use, add various realistic degradations ... through, light defect, paper deformation, ink degradation, etc.) to original images, or combine both to increase the size of your database. DocCreator comes as an online version (easy to test) and a desktop version (fast computation, and no need to upload copyrighted data). DocCreator is useful for retraining tasks and for assessing precisely how robust your algorithm is. It has already been used successfully and could help other DIAR researchers produce and share groundtruthed databases.
