T. N. Moskvitina, South Ural State Humanitarian Pedagogical University
2018 Vestnik Tomskogo Gosudarstvennogo Pedagogičeskogo Universiteta  
Summarization is the process of condensing a text document into a brief form that retains all the critical points of the original. In extractive summarization, only the most relevant sentences in the document, those that convey the gist of the content, are selected. Extractive techniques usually revolve around discovering the most relevant and frequent keywords and then extracting sentences based on those keywords. Manual extraction or annotation of relevant keywords is a tedious procedure riddled with errors and requiring a great deal of manual effort and time. In this paper, we propose a hybrid approach to extract keywords automatically for multi-document text summarization of e-newspaper articles. The performance of the proposed approach is compared with three additional keyword extraction techniques, namely term frequency-inverse document frequency (TF-IDF), term frequency-adaptive inverse document frequency (TF-AIDF), and number of false alarms (NFA), for automatic keyword extraction and summarization in e-newspaper articles. Finally, we show that the proposed technique outperforms the other techniques for automatic keyword extraction and summarization.

Bharti et al, Euro. J. Adv. Engg. Tech., 2017, 4 (6): 410-427

An abstractive summary requires the system to generate new sentences to summarize an article after reading it, whereas an extractive summary extracts details from the original article itself and presents them to the reader. In this paper, we focus on the extractive summarization method. Extractive summarization relies heavily on keyword extraction; therefore, we focus our attention on the integration between the two. Here, an algorithm is proposed for automatic keyword extraction for text summarization. It handles a limitation of existing techniques: applying a stop list to extract important keywords may discard words that are essential to the document and cause some relevant words to lose their significance. The rarity of a word in other texts is also considered: a word that is highly frequent in the article under review and hardly found in other documents receives a very high score, so that only words pertinent to this article are chosen as keywords.
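The keyword-driven extraction described above can be sketched concretely: score each sentence by how many of a given set of keywords it contains, then keep the top-scoring sentences in their original order. The sentence-splitting rule, scoring function, and keyword list below are illustrative assumptions, not the paper's exact method.

```python
import re

def extractive_summary(text, keywords, max_sentences=2):
    """Pick the sentences containing the most keywords (illustrative sketch)."""
    # Naive sentence split on ., !, ? -- an assumption, not the paper's method.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kw = {k.lower() for k in keywords}
    # Score each sentence by the number of distinct keywords it contains.
    scored = [(sum(w in kw for w in set(re.findall(r"\w+", s.lower()))), i, s)
              for i, s in enumerate(sentences)]
    # Keep the highest-scoring sentences, restoring original document order.
    top = sorted(sorted(scored, reverse=True)[:max_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)
```

The same scheme extends to multiple documents by pooling keywords first and scoring sentences from every article against the pooled list.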
The proposed algorithm follows a hybrid approach combining machine learning and a statistical method. A Hidden Markov Model (HMM) based POS tagger [5] is used to identify the POS information of an article, and a statistical method is then used to extract keywords. The algorithm for automatic keyword extraction uses a learned probability distribution to assign a score to each word. The keywords for the article under consideration are determined from these scores and are used to summarize the article: the summarization algorithm selects sentences accordingly to form the required summary. The algorithm applies to multiple articles at a time for keyword extraction and summarization; it extracts the keywords from all the articles, appends them to a single file, and eliminates redundant keywords in the final output file. In this paper, we propose an algorithm to summarize multi-document text into a single document and compare its performance with three additional text summarization techniques, namely TF-IDF, TF-AIDF, and NFA. The related work on automatic keyword extraction and text summarization is discussed in the next section. The preliminaries used in this paper are also discussed, along with the other three techniques (NFA, TF-IDF, and TF-AIDF) for keyword detection and extraction. Finally, the performance of the proposed schemes is analysed, followed by the conclusion.

RELATED WORK
Keyword extraction is the essential first phase of text summarization. Therefore, this section presents a literature survey on automatic keyword extraction and text summarization [7-15].

Automatic Keyword Extraction
On the basis of past work on automatic keyword extraction from text for summarization, extraction systems can be classified into four classes, namely simple statistical approaches, linguistic approaches, machine learning approaches, and hybrid approaches [1], as shown in Fig. 1.
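The hybrid pipeline above (POS tagging followed by statistical scoring) can be sketched as follows. The paper's HMM tagger [5] and learned probability distribution are not reproduced here; this sketch stands in a toy dictionary tagger and a plain frequency-times-rarity score, so the POS table, the background corpus, and the weighting are all assumptions for illustration.

```python
import math
from collections import Counter

# Toy POS lookup standing in for the paper's HMM-based tagger -- an assumption.
POS = {"summarization": "NN", "keyword": "NN", "extraction": "NN",
       "algorithm": "NN", "selects": "VB", "the": "DT", "a": "DT"}

def hybrid_keywords(article, corpus_docs, top_k=3):
    """Keep nouns (POS step), then score them statistically (frequency x rarity)."""
    words = article.lower().split()
    nouns = [w for w in words if POS.get(w) == "NN"]  # POS filtering step
    tf = Counter(nouns)
    n_docs = len(corpus_docs)
    scores = {}
    for w, f in tf.items():
        # Rarity across a background corpus: an IDF-style weight (an assumption).
        df = sum(w in d.lower().split() for d in corpus_docs)
        scores[w] = f * math.log((1 + n_docs) / (1 + df))
    return [w for w, _ in sorted(scores.items(), key=lambda t: -t[1])[:top_k]]
```

The two stages mirror the paper's design: the tagger restricts candidates to content-bearing word classes, and the statistical score promotes words frequent in the article but rare elsewhere.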
Simple Statistical Approach
These strategies are rough and simplistic and tend to require no training set. They concentrate on statistics derived from non-linguistic features of the document, for example the position of a word within the document, its term frequency, and its inverse document frequency. These statistics are later used to build a list of keywords. Cohen [16] used n-gram statistics to discover keywords within a document automatically. Other techniques in this class include word frequency, term frequency (TF) [17], term frequency-inverse document frequency (TF-IDF) [18], word co-occurrence [19], and PAT-tree [20]. The most basic of these is term frequency, in which the frequency of occurrence is the sole criterion deciding whether a word is a keyword. It is extremely crude and tends to give inappropriate results. An improvement on this strategy is TF-IDF, which also takes the frequency of occurrence of a word as the basis for choosing keywords, but discounts words that are common across all documents. Similarly, word co-occurrence methods use statistical information about the number of times a word has occurred and the number of times it has occurred together with another word. This information is used to compute the support and confidence of the words, and the Apriori technique is then applied to infer the keywords.

Linguistics Approach
This approach uses the linguistic features of words for keyword detection and extraction in text documents. It incorporates lexical analysis [21], syntactic analysis [22], discourse analysis [23], etc. The resources used for lexical analysis include electronic dictionaries, tree taggers, WordNet, n-grams, and POS patterns; similarly, noun phrase (NP) chunks (parsing) serve as resources for syntactic analysis.

Machine Learning Approach
Keyword extraction can also be cast as a learning problem. This approach requires manually annotated training data and trained models.
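Of the statistical baselines above, TF-IDF is the most widely used; a minimal self-contained version is sketched below. The tokenizer (whitespace split) and the unsmoothed IDF form tf(w, d) * log(N / df(w)) are illustrative choices, not necessarily those of the cited works.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document TF-IDF scores: tf(w, d) * log(N / df(w))."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({w: f * math.log(n / df[w]) for w, f in tf.items()})
    return scores
```

Note that a word appearing in every document gets log(N/N) = 0, which is exactly the discounting of ubiquitous words that distinguishes TF-IDF from raw term frequency.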
Hidden Markov models [24], support vector machines (SVM) [25], naive Bayes (NB) [26], bagging [22], etc. are commonly used training models in these approaches. In the second phase, the document whose keywords are to be extracted is given as input to the model, which then extracts the keywords that best fit the trained model.
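The learning formulation above can be illustrated with a miniature naive-Bayes-style keyword classifier: each candidate word is described by binary features, class-conditional probabilities are estimated from annotated examples with add-one smoothing, and a candidate is accepted when the keyword class scores higher. The features, training data, and smoothing below are invented for illustration and do not reproduce any cited system.

```python
import math

def train_nb(examples):
    """examples: list of (feature_dict, is_keyword). Returns log-prob tables."""
    rows_by_label = {True: [], False: []}
    for feats, label in examples:
        rows_by_label[label].append(feats)
    all_feats = {f for feats, _ in examples for f in feats}
    model = {}
    for label, rows in rows_by_label.items():
        prior = math.log((len(rows) + 1) / (len(examples) + 2))
        # P(feature present | class), with add-one smoothing.
        feat_logp = {f: math.log((sum(r.get(f, 0) for r in rows) + 1)
                                 / (len(rows) + 2)) for f in all_feats}
        model[label] = (prior, feat_logp)
    return model

def is_keyword(model, feats):
    """Classify a candidate by comparing class log-scores."""
    def score(label):
        prior, feat_logp = model[label]
        return prior + sum(feat_logp[f] for f in feats if f in feat_logp)
    return score(True) > score(False)
```

In a real system the features would come from the statistical and linguistic signals surveyed earlier (frequency, position, POS pattern), and the annotated examples from manually labelled documents.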
doi:10.23951/1609-624x-2018-8-45-50