Youssouf Chherawala, Robert Wisnovsky, Mohamed Cheriet
2011 Proceedings of the 2011 Workshop on Historical Document Imaging and Processing - HIP '11  
Automatic recognition of Arabic words is a challenging task, and its complexity increases as the lexicon grows. In pre-modern documents, the vocabulary is unconstrained; a lexicon-reduction strategy is therefore needed to reduce the computational complexity of recognition. This paper proposes a novel lexicon-reduction method for Arabic subwords based on the topology and geometry of their shapes. First, the topological and geometrical information of a subword shape is extracted from its skeleton and encoded into a graph. Then the graph is converted into a topological signature vector (TSV), which preserves the graph structure. The lexicon is reduced based on the TSV distance between the shapes of the lexicon subwords and a query shape, by keeping the i nearest subwords. The value of i is selected according to a predetermined lexicon-reduction accuracy. The proposed framework has been tested on a database of pre-modern Arabic subword shapes, with promising results.
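The reduction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each TSV is already available as a fixed-length list of numbers, it uses Euclidean distance as a stand-in for the TSV distance (the abstract does not say which metric is used), and the names tsv_distance, reduce_lexicon and select_i are hypothetical. The value of i is taken to be the smallest one for which the correct subword survives reduction in at least a target fraction of labelled validation queries, which is one plausible reading of "a predetermined lexicon-reduction accuracy".

```python
import math


def tsv_distance(a, b):
    """Euclidean distance between two equal-length vectors.

    Assumption: the source does not specify the TSV metric.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_lexicon(query, lexicon, i):
    """Return the labels of the i lexicon entries nearest to the query.

    lexicon maps a subword label to its (assumed precomputed) TSV.
    """
    ranked = sorted(lexicon, key=lambda label: tsv_distance(query, lexicon[label]))
    return ranked[:i]


def select_i(validation, lexicon, target_accuracy):
    """Smallest i that keeps the true label for at least
    target_accuracy of the validation queries.

    validation is a list of (query_tsv, true_label) pairs.
    """
    # Rank (1-based) of the true label in each fully sorted lexicon.
    ranks = []
    for query, true_label in validation:
        ranked = reduce_lexicon(query, lexicon, len(lexicon))
        ranks.append(ranked.index(true_label) + 1)

    for i in range(1, len(lexicon) + 1):
        hits = sum(1 for r in ranks if r <= i)
        if hits / len(ranks) >= target_accuracy:
            return i
    return len(lexicon)
```

In use, select_i would be run once on held-out labelled shapes, and reduce_lexicon would then be called with that i for each new query shape before the recognizer is applied to the shortened candidate list.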
doi:10.1145/2037342.2037345 dblp:conf/icdar/ChherawalaWC11 fatcat:xcmukoeb6jcunjcr7xo4xrxqre