Translation Term Weighting and Combining Translation Resources in Cross-Language Retrieval
Text Retrieval Conference
In TREC-10 the Berkeley group participated only in the English-Arabic cross-language retrieval (CLIR) track. We submitted one Arabic monolingual run and four English-Arabic cross-language runs. Our approach to cross-language retrieval was to translate the English topics into Arabic using online English-Arabic bilingual dictionaries and machine translation software. The five official runs are named BKYAAA1, BKYEAA1, BKYEAA2, BKYEAA3, and BKYEAA4. BKYAAA1 is the Arabic monolingual
... , and the rest are English-to-Arabic cross-language runs. The same logistic-regression-based document ranking algorithm, without pseudo-relevance feedback, was applied in all five runs. We refer the readers to the paper in  for details.

Test Collection

The document collection used in the TREC-10 cross-language track consists of 383,872 Arabic articles from the Agence France Presse (AFP) Arabic Newswire, covering the period from 13 May 1994 to 20 December 2000. There are 25 English topics with Arabic and French translations. Each topic has three tagged fields: title, description, and narrative. The newswire articles are encoded in UTF-8, while the topics are encoded in ASMO 708. The cross-language retrieval task is to search the Arabic documents using the English topics and to present the retrieved documents in ranked order.

Preprocessing

Because the documents and topics are encoded in different schemes, we converted both to the Windows 1256 encoding. We created a stoplist of 1,131 words from two sources. First, we translated our English stopword list into Arabic using the Ajeeb online English-Arabic dictionary. Second, we took additional stopwords from the Arabic-English glossary published in Elementary Modern Standard Arabic. A consecutive sequence of Arabic letters, excluding punctuation marks, was recognized as a word. Stopwords were removed when the documents and topics were indexed.

The tokens were normalized by removing the initial letter ¢ , the final letter £ ¤ , and the initial letters ¥ §¦ . In addition, the letters ¨¦ and ¦ ¨were changed to the letter ¦ . The marks above or underneath the letter ¦ in © ¦ , ¦ , ¦ © , ¦ , ¦ , ¦ , , ¦ , if present, were also removed.
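
The encoding conversion, tokenization, and stopword-removal steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the official runs: the function names are hypothetical, ASMO 708 is assumed to be decodable with Python's `iso8859_6` codec, and an "Arabic letter" is assumed to mean a character in the basic Unicode ranges U+0621 to U+063A and U+0641 to U+064A. The letter normalization is left out because it depends on the specific letters involved.

```python
import re

# Assumption: "Arabic letters" are the basic letters hamza..ghain and feh..yeh.
# Tatweel (U+0640), diacritics, digits, and punctuation fall outside the
# ranges and therefore act as word boundaries.
ARABIC_WORD = re.compile(r"[\u0621-\u063A\u0641-\u064A]+")


def to_cp1256(raw, source_encoding):
    """Re-encode bytes from source_encoding (e.g. 'utf-8' for documents,
    'iso8859_6' as an assumed stand-in for ASMO 708 topics) to Windows 1256."""
    return raw.decode(source_encoding).encode("cp1256")


def tokenize(text):
    """Return every maximal run of consecutive Arabic letters as a word."""
    return ARABIC_WORD.findall(text)


def index_terms(text, stoplist):
    """Tokenize text and drop any token that appears in the stoplist."""
    return [token for token in tokenize(text) if token not in stoplist]


if __name__ == "__main__":
    # Toy stoplist for illustration only; the actual stoplist had 1,131 words.
    toy_stoplist = {"إلى", "ثم"}
    sentence = "ذهب الولد إلى المدرسة، ثم عاد."
    print(index_terms(sentence, toy_stoplist))
```

Converting both sources to a single encoding before tokenizing means that the same stoplist and the same letter ranges apply to documents and topics alike.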