A theory of term importance in automatic text analysis

G. Salton, C. S. Yang, C. T. Yu
1975 Journal of the American Society for Information Science  
Most existing automatic content analysis and indexing techniques are based on word frequency characteristics applied largely in an ad hoc manner. Contradictory requirements arise in this connection, in that terms exhibiting high occurrence frequencies in individual documents are often useful for high recall performance (to retrieve many relevant items), whereas terms with low frequency in the whole collection are useful for high precision (to reject nonrelevant items). A new technique known
as discrimination value analysis ranks the text words in accordance with how well they are able to discriminate the documents of a collection from each other; that is, the value of a term depends on how much the average separation between individual documents changes when the given term is assigned for content identification. The best words are those which achieve the greatest separation. The discrimination value analysis accounts for a number of important phenomena in the content analysis of natural language texts: (a) the role and importance of single words; (b) the role of juxtaposed words (phrases); (c) the role of word groups or classes, as specified in a thesaurus. Effective criteria can be given for assigning each term to one of these three classes, and for constructing optimal indexing vocabularies. (Author)
doi:10.1002/asi.4630260106 fatcat:vium33mhnjaqtl5ppspljyy5ea
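
The abstract describes the discrimination value of a term as the change in average separation between documents when that term is used for indexing. The sketch below illustrates one way this could be computed. It is not taken from the paper: the abstract gives no formula, so the use of cosine similarity, the averaging over all document pairs, and the names `cosine`, `density`, and `discrimination_values` are all assumptions made for illustration. A term gets a positive score when removing it makes the documents look more alike, meaning the term was helping to separate them.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(docs):
    """Average pairwise similarity of the collection; higher means less separation."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_values(docs):
    """For each term k, density without term k minus density with all terms.

    Positive: the term spreads documents apart (good discriminator).
    Negative: the term pulls documents together (poor discriminator).
    """
    base = density(docs)
    n_terms = len(docs[0])
    values = []
    for k in range(n_terms):
        # Drop column k from every document vector.
        reduced = [d[:k] + d[k + 1:] for d in docs]
        values.append(density(reduced) - base)
    return values


# Hypothetical toy collection: rows are documents, columns are term frequencies.
# Term 0 occurs in every document; terms 1 to 3 each occur in exactly one.
docs = [
    [1, 3, 0, 0],
    [1, 0, 2, 0],
    [1, 0, 0, 4],
]
dv = discrimination_values(docs)
for k, value in enumerate(dv):
    print(f"term {k}: {value:+.4f}")
```

In this toy collection the term shared by every document receives a negative score, while the terms concentrated in a single document receive positive scores, which matches the abstract's statement that the best words are those that achieve the greatest separation.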