Nonlinear transformation of term frequencies for term weighting in text categorization

Zafer Erenel, Hakan Altınçay
2012 Engineering Applications of Artificial Intelligence
In automatic text categorization, the influence of features on the decision is set by the term weights, which are conventionally computed as the product of term frequency and collection frequency factors. The raw form of term frequencies or their logarithmic forms are generally used as the term frequency factor, whereas the leading collection frequency factors take into account the document frequency of each term. In this study, it is first shown that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under consideration. Taking this observation into account, a novel collection frequency factor is proposed which considers term frequencies. Five datasets are first tested to show that the distribution of term frequency values is task dependent. The proposed method is then shown to provide better F1 scores compared to two recent approaches on the majority of the datasets considered. It is confirmed that the use of term frequencies in the collection frequency factor is beneficial on tasks that do not involve highly repeated terms. It is also shown that the best F1 scores are achieved on the majority of the datasets when a smaller number of features is considered.
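The conventional weighting scheme the abstract starts from (term weight = term frequency factor × collection frequency factor) can be illustrated with a minimal sketch. It is not the collection frequency factor proposed in the paper, whose formula the abstract does not give. The sketch assumes inverse document frequency, log(N / df), as the collection frequency factor and 1 + log(tf) as the logarithmic form; the function name `term_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs, log_tf=False):
    """Weight each term as (term frequency factor) x (collection frequency factor).

    Assumptions for illustration only: the collection frequency factor is
    idf = log(N / df), and the logarithmic tf form is 1 + log(tf).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weights = {}
        for term, tf in counts.items():
            tf_factor = 1.0 + math.log(tf) if log_tf else float(tf)
            idf = math.log(n_docs / df[term])
            weights[term] = tf_factor * idf
        weighted.append(weights)
    return weighted


# Toy tokenized documents (hypothetical data).
docs = [
    ["price", "price", "price", "market"],
    ["market", "goal"],
    ["goal", "team", "team"],
]
raw = term_weights(docs, log_tf=False)
log = term_weights(docs, log_tf=True)
```

With raw frequencies, the weight of a repeated term such as "price" grows linearly with its count, while the logarithmic form dampens it. Terms that occur once receive the same weight under both forms, which is one way to see why the better choice can depend on how term frequency values are distributed in a dataset.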
doi:10.1016/j.engappai.2012.06.013 fatcat:gz4tm5zhhzfinlsrqpvui5v7fa