Design and implementation of automatic indexing for information retrieval with Arabic documents

Ismail Hmeidi, Ghassan Kanaan, Martha Evens
1997 Journal of the American Society for Information Science  
We have put together a corpus of 242 abstracts of Arabic documents using the Proceedings of the Saudi Arabian National Conferences as a source. All these abstracts involve computer science and information systems. We also designed and built an automatic information retrieval system from scratch to handle Arabic data. The system was implemented in the C language using the GCC compiler and runs on IBM/PCs and compatible microcomputers. We have implemented both automatic and manual indexing techniques for this corpus. A long series of experiments using measures of recall and precision has demonstrated that automatic indexing is at least as effective as manual indexing and more effective in some cases. Since automatic indexing is both cheaper and faster, our results suggest that we can achieve a wider coverage of the literature with less money and produce as good results as with manual indexing. We have also compared the retrieval results using words as index terms versus stems and roots, and confirmed the results obtained by Al-Kharashi and Abu-Salem with smaller corpora that root indexing is more effective than word indexing.

[...] has been stimulated by the D.O.D. Tipster project (Harman, 1993). Arabic provides a very different context from English, since it is a non-Indo-European language with a complex morphological structure. Investigation of methods of automatic information retrieval for Arabic is essential to the growth of learning in the Arab world. Expansion of information retrieval systems is the simplest and most cost-effective way to make the resources of large reference libraries available to the increasing numbers of students and researchers in the Arab world.

1.1. Automatic Indexing

In the United States the large bibliographic database maintained by the National Library of Medicine is in[...] index terms and articles involved. We are convinced by Salton's arguments that machine indexing in English is more accurate and more cost-effective. He has also sug[...]
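
The abstract evaluates indexing methods with recall and precision. As a minimal sketch of how these two measures are conventionally computed for a single query (this is not the authors' C implementation; the function name and the document ids below are hypothetical, chosen only for illustration):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids, not taken from the 242-abstract corpus.
retrieved_docs = [3, 7, 12, 40]
relevant_docs = [7, 12, 55]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing two indexing methods (for example, word versus root index terms) would then amount to running the same queries against each index and comparing the resulting precision and recall values.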
doi:10.1002/(sici)1097-4571(199710)48:10<867::aid-asi3>3.0.co;2-# fatcat:cx66h2ucvzcuhongrew3dwchse