SAO Semantic Information Identification for Text Mining

Chao Yang, Donghua Zhu, Xuefeng Wang
2017 International Journal of Computational Intelligence Systems  
A Subject-Action-Object (SAO) is a triple structure that can be used both to describe topics in detail and to explore the relationships between them. SAO analysis has become popular in bibliometrics; however, there are two challenges in the identification of SAO structures: the low relevance of SAOs to domain topics, and synonyms in SAOs. These problems make SAO identification heavily dependent on domain experts, limiting the further use of SAOs and hindering the mining of SAO characteristics. This paper proposes a parse tree-based SAO identification method that includes (1) a model to identify the core components (candidate terms for subject and object) of SAO structures, involving term clumping processes and co-word analysis; (2) a parse tree-based hierarchical SAO extraction model that divides the extraction of entire SAO structures into a collection of simpler sub-tasks for separate subject, action, and object identification; and (3) an SAO weighting model to rank SAO structures for result selection. The proposed method is applied to publications in the Journal of Scientometrics (SCIM) to identify and rank significant SAO structures. Our experimental results demonstrate the validity and feasibility of the proposed method. This is an open access article under the CC BY-NC license.

... provide an express way to quickly understand massive textual content, and help indicate significant topics. SAO is helpful for (1) resolving ambiguous interpretations caused by homonyms and synonyms [17, 18]; and (2) identifying the specific relationships between topic terms [20]. SAO identification is the basis of SAO analysis. However, it is difficult to identify appropriate SAOs for bibliometric analysis. The problems of traditional SAO identification are as follows.

(1) It is difficult to directly extract the SAOs that are closely related to a topic of interest. Most SAOs identified with general Natural Language Processing are too common to express the detailed meanings of "what to do" and "how to do it," which is precisely what makes SAO structures valuable in topic analysis. The reason is that there are millions of SAOs, and most of them consist of common words (e.g., "we look at an example," "paper consists of three parts"). These common SAOs are irrelevant to the topics, prevent us from obtaining truly valuable SAOs, and cannot be filtered out by post-cleaning and consolidation.
(2) It is difficult to obtain SAOs with good quantitative properties, because we usually face the problem of "synonyms in SAO". This is a serious problem for quantitative analysis, especially when we want to use statistics-based methods (e.g., time series analyses [21], co-occurrence and association analysis). The reason for "synonyms in SAO" is that the SAO structure is complex: many literally different SAO structures can have the same meaning, and it is difficult to merge them.

To overcome the problems described above, this paper proposes an SAO identification method. Compared with traditional SAO identification methods, the main contributions of the proposed method are: (1) we introduce term clumping and design a co-word algorithm (considering co-occurrence with keywords) to identify SAO core components, which helps improve the relevance of SAOs to the topic; (2) based on the syntax tree, we construct a hierarchical SAO extraction model and perform SAO cleaning and consolidation, which helps alleviate the "synonyms in SAO" problem; and (3) we construct an SAO weighting model using the idea of TF-IDF (term frequency-inverse document frequency) to evaluate the importance of each SAO. We apply the proposed method to the publications in the Journal of Scientometrics. The results demonstrate the feasibility of our method and hold interest for related bibliometric studies. The rest of this paper is organized as follows: related works are summarized in the Literature Review, Section 3 presents the SAO identification method, and an empirical study on SCIM follows. Finally, we conclude our study and address future work.
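To make contribution (3) concrete, the following is a minimal Python sketch of a TF-IDF-style weighting over SAO triples. It is not the paper's weighting model, whose exact formula is not given in this excerpt. It assumes the SAOs have already been extracted as (subject, action, object) tuples, one list per document. The function names `normalize`, `sao_tfidf_weights`, and `rank_saos` are hypothetical, and the lowercase-and-strip normalization is only a crude stand-in for the cleaning and consolidation step.

```python
import math
from collections import Counter


def normalize(triple):
    # Hypothetical stand-in for SAO cleaning/consolidation:
    # only lowercases and strips whitespace from each component.
    return tuple(part.strip().lower() for part in triple)


def sao_tfidf_weights(docs):
    """Sum a TF-IDF-style score for every SAO triple over all documents.

    docs: list of documents, each a list of (subject, action, object) tuples.
    Assumption: tf = count / number of triples in the document,
                idf = ln(number of documents / document frequency).
    """
    cleaned = [[normalize(t) for t in doc] for doc in docs]
    n_docs = len(cleaned)

    # Document frequency: in how many documents each triple appears.
    df = Counter()
    for doc in cleaned:
        df.update(set(doc))

    weights = Counter()
    for doc in cleaned:
        if not doc:
            continue  # a document without SAOs contributes nothing
        counts = Counter(doc)
        for sao, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[sao])
            weights[sao] += tf * idf
    return dict(weights)


def rank_saos(docs):
    # Highest-weighted SAOs first.
    weights = sao_tfidf_weights(docs)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example_docs = [
        [("method", "identify", "topic"), ("we", "look at", "example")],
        [("Method", "identify", "topic")],
        [("we", "look at", "example"), ("model", "rank", "SAO structure")],
    ]
    for sao, weight in rank_saos(example_docs):
        print(f"{weight:.3f}  {sao}")
```

Under this scheme, a triple that occurs in every document receives an IDF of zero, which is one simple way to push down generic SAOs such as "we look at an example"; whether the paper handles such cases the same way is not stated here.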
doi:10.2991/ijcis.2017.10.1.40 fatcat:gctynq27avf7nbjbhtcl23wbcq