Slant Removal Technique for Historical Document Images

Ergina Kavallieratou, Laurence Likforman-Sulem, Nikos Vasilopoulos
2018 Journal of Imaging  
Slanted text has been demonstrated to be a salient feature of handwriting. Its estimation is a necessary preprocessing task in many document image processing systems in order to improve the required training. This paper describes and evaluates a new technique for removing the slant from historical document pages that avoids the segmentation procedure into text lines and words. The proposed technique first relies on slant angle detection from an accurate selection of fragments. Then, a slant removal technique is applied. However, the presented slant removal technique may be combined with any other slant detection algorithm. Experimental results are provided for four document image databases: two historical document databases, the TrigraphSlant database (the only database dedicated to slant removal), and a printed database in order to check the precision of the proposed technique.

Recently, Brink et al. [3] categorized the proposed techniques into angle-frequency and repeated-shearing approaches, described as follows:

1. Angle-frequency approach: Down-strokes are first located based on criteria such as the minimum vertical extent or velocity. Next, the angle of the local ink direction is measured at these locations and the resulting angles are accumulated in a histogram. From this histogram, the slant angle is determined. This is a one-step procedure.
2. Repeated-shearing approach: This method is based on the assumption that the projection of dark pixels is maximized along an axis parallel to the slant angle. The basic principle is to repeatedly shear images of individual text lines, varying the shear angle, and optimizing the vertical projection of dark pixels. This approach is clearly more time consuming, but proves more accurate, as indicated by its popularity.

The first category will be referred to here as 'slant estimation' (a one-step procedure), and the second as 'slant detection', since it searches among many angles for the most common one. Slant estimation techniques are presented in [4] [5] [6] [7], whereas a slant detection technique is presented in [9]. According to Brink et al. [3], slant detection techniques are the most popular and give the most precise results. The technique described in [9] is also used in that paper, which reports extensive experiments on slant.
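
The repeated-shearing approach described above can be illustrated with a minimal sketch. It is not the algorithm of [9] or of the proposed technique: the function names, the sign convention (angle measured from the vertical, positive for right-leaning strokes), and the scoring rule (sum of squared column ink counts) are all assumptions chosen for illustration. The input is assumed to be a binary NumPy array in which ink pixels are 1.

```python
import numpy as np


def deslant(ink, angle_deg):
    """Shear a binary image (1 = ink) to undo a slant of `angle_deg`.

    Assumed convention: the angle is measured from the vertical and is
    positive when strokes lean to the right. Each row is shifted
    horizontally in proportion to its height above the bottom row, and
    the canvas is padded on both sides so that no ink is lost.
    """
    h, w = ink.shape
    t = np.tan(np.radians(angle_deg))
    pad = int(np.ceil(abs(t) * (h - 1)))
    out = np.zeros((h, w + 2 * pad), dtype=ink.dtype)
    for y in range(h):
        height = h - 1 - y               # rows above the bottom row
        shift = -int(round(height * t))  # move right-leaning ink back left
        out[y, pad + shift:pad + shift + w] = ink[y]
    return out


def projection_score(ink):
    """Sum of squared column ink counts (an assumed peakedness measure)."""
    columns = ink.sum(axis=0).astype(np.int64)
    return int((columns ** 2).sum())


def detect_slant(ink, candidate_angles):
    """Try every candidate angle and keep the one with the best score."""
    scores = [projection_score(deslant(ink, a)) for a in candidate_angles]
    return float(candidate_angles[int(np.argmax(scores))])
```

For example, `detect_slant(ink, np.arange(-45, 46, 5))` scans candidate angles from -45 to 45 degrees in 5-degree steps, and `deslant(ink, angle)` then applies the chosen shear. Because every candidate requires shearing the whole image, the cost grows with the number of angles tried, which is in line with the remark that this family of methods is more time consuming than one-step estimation.
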
Last but not least, in those experiments the pages were sheared in their entirety, since the alternative, line or word segmentation, is characterized as "less reliable and breaks ink traces at region boundaries" [3]. The techniques proposed up to now require line or word segmentation in order to be applied. Figure 1 presents an example of the slant removal algorithm described in [9]. The image is from the IAM Handwriting Database (IAM-DB), and applying the algorithm requires segmenting the image into text lines (Figure 1, horizontal stripes). For this example, text line segmentation can succeed because the text lines are sufficiently spaced. This is not the case for the document image shown in Figure 2 (17th century), which includes touching ascenders and descenders and noise in the inter-line space. Since all existing algorithms perform slant removal at the word or text-line level, a segmentation-free approach is desirable for difficult-to-segment documents. Moreover, avoiding text-line segmentation is computationally less expensive.
doi:10.3390/jimaging4060080 fatcat:hwsff3ret5aajpggpjubxcli24