Concept-oriented labelling of patent images based on Random Forests and proximity-driven generation of synthetic data

Dimitris Liparas, Anastasia Moumtzidou, Stefanos Vrochidis, Ioannis Kompatsiaris
2014 Proceedings of the Third Workshop on Vision and Language  
Patent images are very important for patent examiners to understand the contents of an invention. Therefore, there is a need for automatic labelling of patent images in order to support patent search tasks. Towards this goal, recent research proposes classification-based approaches for patent image annotation. However, one of the main drawbacks of these methods is that they rely upon large annotated patent image datasets, which require substantial manual effort to obtain. In this context, the proposed work extracts concepts from patent images, building upon a supervised machine learning framework that is trained with limited annotated data and automatically generated synthetic data. The classification is realised with Random Forests (RF) and a combination of visual and textual features. First, we make use of RF's implicit ability to detect outliers to rid our data of unnecessary noise. Then, we generate new synthetic data cases by means of the Synthetic Minority Over-sampling Technique (SMOTE). We evaluate the different retrieval parts of the framework by using a dataset from the footwear domain. The results of the experiments indicate the benefits of using the proposed methodology.
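
The synthetic-data step named in the abstract can be illustrated with a minimal sketch of the basic SMOTE idea: pick a sample, pick one of its nearest neighbours from the same class, and create a new point somewhere on the line segment between them. This is not the authors' implementation. The function name `smote`, its parameters, the use of Euclidean distance for the neighbour search, and the toy data are all assumptions made for illustration; the abstract does not specify how neighbours are chosen.

```python
import math
import random


def smote(samples, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between same-class samples.

    Hypothetical helper for illustration; neighbours are found with plain
    Euclidean distance, which is an assumption and not taken from the paper.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # Rank all other samples by distance to the base sample.
        others = [p for j, p in enumerate(samples) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for one under-represented concept class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because each synthetic point lies between two existing samples of the same class, the new cases stay inside the region the class already occupies, which is why removing outliers before this step matters: an outlier used as a base or neighbour would pull synthetic points into the wrong region.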
doi:10.3115/v1/w14-5404 dblp:conf/acl-vl/LiparasMVK14 fatcat:kbuelvv3vbcfrinhtyx4pll5aa