Textual data classification for a sectoral categorisation of public investments
A drawback of the abundance of data on public investments in Italy is the lack of a common sectoral classification: existing classifications cannot be merged into a single hierarchical taxonomy because their categories often map many-to-many onto one another. Moreover, many databases suffer from incomplete or inconsistent sectoral classification data. We therefore present a strategy for applying a homogeneous sectoral categorisation to projects monitored in different Italian databases on public investments, based on the textual information contained in project descriptions. The strategy can be applied incrementally to other data sources, so that the same classification becomes available for new data. The result is achieved through a supervised classification methodology based on the K-Nearest Neighbour algorithm, which operates on the Singular Value Decomposition (SVD) matrices of the supervisor set, uses appropriate weighting functions for the word frequencies, and is evaluated in terms of classification accuracy on a test set. While the supervisor set is taken from the main Italian repository on public investments, the scoring set contains projects from other data sources. To identify an optimal strategy, we show how the final results depend on the number of SVD dimensions, the number of neighbours and the weighting functions chosen for the word frequencies. We also show that classification accuracy improves when the training set is augmented by adding the titles of the known categories to the project descriptions used in the textual analysis. Finally, to check the robustness of the proposed strategy, an unsupervised cluster analysis is performed on the scoring set and its results are compared with those of the supervised classification.
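As an illustration of the pipeline described above (word-frequency weighting, SVD projection of the supervisor set, and a nearest-neighbour vote), the following is a minimal sketch in Python with NumPy. It is not the implementation used in this work. The toy descriptions, the category names, the whitespace tokenizer, the log term frequency times smoothed inverse document frequency weighting, the cosine similarity and the function names `fit` and `predict` are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def weight(tokens, vocab, idf):
    """Weight a document as log(1 + tf) * idf, one possible weighting function."""
    vec = np.zeros(len(vocab))
    for word, count in Counter(tokens).items():
        if word in vocab:  # words unseen in the supervisor set are ignored
            vec[vocab[word]] = np.log1p(count) * idf[vocab[word]]
    return vec


def fit(texts, labels, n_dims):
    """Build the weighted document-term matrix and project it onto n_dims SVD dimensions."""
    docs = [tokenize(t) for t in texts]
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    doc_freq = np.zeros(len(vocab))
    for d in docs:
        for w in set(d):
            doc_freq[vocab[w]] += 1
    idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1.0  # smoothed idf (assumption)
    X = np.vstack([weight(d, vocab, idf) for d in docs])
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_dims].T  # maps term space to the reduced SVD space
    return {"vocab": vocab, "idf": idf, "basis": basis,
            "Z": X @ basis, "labels": list(labels)}


def predict(model, text, n_neighbours):
    """Assign the majority category among the n_neighbours most similar supervisor documents."""
    q = weight(tokenize(text), model["vocab"], model["idf"]) @ model["basis"]
    Z = model["Z"]
    # Cosine similarity in the reduced space; the small constant avoids division by zero.
    sims = (Z @ q) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(q) + 1e-12)
    nearest = np.argsort(-sims)[:n_neighbours]
    votes = Counter(model["labels"][i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical supervisor set: invented descriptions and invented categories.
TRAIN_TEXTS = [
    "road resurfacing highway",
    "railway line road junction",
    "highway bridge railway",
    "school building renovation",
    "university laboratory equipment",
    "primary school classrooms",
]
TRAIN_LABELS = ["transport"] * 3 + ["education"] * 3
QUERY = "highway and railway road works"  # stands in for a scoring-set project

model = fit(TRAIN_TEXTS, TRAIN_LABELS, n_dims=2)
print(predict(model, QUERY, n_neighbours=3))
```

In this sketch, `n_dims`, `n_neighbours` and the body of `weight` correspond to the three choices whose effect on accuracy is studied here. Category titles could be added to the training set simply by appending them to `TRAIN_TEXTS` with their own labels.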