A Framework for Generating Extractive Summary from Multiple Malayalam Documents

K. Manju, S. David Peter, Sumam Idicula
2021 Information  
In this paper, we propose a framework for extracting a summary from multiple documents in the Malayalam Language.  ...  This work mainly focuses on generating a summary from multiple news documents. In this case, the summary helps to reduce the redundant news from the different newspapers.  ...  This study presents a generic extractive multi-document summarization model to extract a summary from multiple Malayalam documents.  ... 
doi:10.3390/info12010041 fatcat:hnviqlalybc6re7377l2ghurua

Automatic summarization of Malayalam documents using clause identification method

Sunitha C, A Jaya, Amal Ganesh
2019 International Journal of Electrical and Computer Engineering (IJECE)  
Extractive summarization selects important sentences from the text and produces summary as it is present in the original document.  ...  Finally an algorithm is used to generate the sentences from the semantic triples of the selected clauses which is the abstractive summary of input documents.  ...  In multi document summarization the important sentences related to a particular area/topic from multiple sources are extracted to produce a summary whereas in single document summarization the important  ...
doi:10.11591/ijece.v9i6.pp4929-4938 fatcat:fe5smhjf2bcldgcigf75sx4ssu

Attention based Abstractive Summarization of Malayalam Document

Sindhya K Nambiar, David Peter S, Sumam Mary Idicula
2021 Procedia Computer Science  
The proposed work attempts to create an attention mechanism to generate the summary of the source document.  ...  The objective of the proposed work is to create a brief and understandable abstractive summary of a Malayalam document.  ...  Encoder-Decoder Framework In the proposed methodology, the summary of the document is built using the abstract sentences generated from a model built using a Neural network.  ... 
doi:10.1016/j.procs.2021.05.088 fatcat:eln7prmtcrcwvhjl2l6urzron4

Robust Recognition of Degraded Documents Using Character N-Grams

Shrey Dutta, Naveen Sankaran, K. Pramod Sankar, C.V. Jawahar
2012 2012 10th IAPR International Workshop on Document Analysis Systems  
The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them.  ...  Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.  ...  EXPERIMENTAL SETUP & RESULTS The data for our experiments is obtained from multiple sources such as scanned books and newspapers for the Indian language of Malayalam.  ... 
doi:10.1109/das.2012.76 dblp:conf/das/DuttaSSJ12 fatcat:ab5cqnntqnbxlev3ir5os4pcyy

A Heuristic Approach for Telugu Text Summarization with Improved Sentence Ranking

Kishore Kumar Mamidala
2021 Turkish Journal of Computer and Mathematics Education  
This paper presents a heuristic approach to extract a summary of e-news articles of the Telugu language.  ...  The creation of manual summaries from large text documents is difficult and time-consuming for humans. Text summarization has become an important and challenging area in natural language processing.  ...  Statistical methods like term frequency are used to score the sentences and extract the relevant information from multiple documents. In [7], proposed a text summarization for Tamil.  ...
doi:10.17762/turcomat.v12i3.1714 fatcat:r2qvpfhnzffqddziregdb6rrdu

A Semi-automatic Adaptive OCR for Digital Libraries [chapter]

Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indraneel Deb Sikdar, A. Balasubramanian, C. V. Jawahar
2006 Lecture Notes in Computer Science  
This paper presents a novel approach for designing a semi-automatic adaptive OCR for large document image collections in digital libraries.  ...  We describe an interactive system for continuous improvement of the results of the OCR. In this paper a semi-automatic and adaptive system is implemented.  ...  This work was partially supported by the MCIT, Government of India for Digital Libraries Activities.  ... 
doi:10.1007/11669487_2 fatcat:t7kxm66ohjhwpl5kmyrdlgys6i

A Systematic Survey on Multi-document Text Summarization

2021 International Journal of Advanced Trends in Computer Science and Engineering  
Automatic text summarization is a technique of generating a short and accurate summary of a longer text document.  ...  Multi-document summarization is an automatic process of creating a relevant, informative and concise summary from a cluster of related documents.  ...  [Figure 1: Text Summarization Techniques]  ...  Multi-document text summarization [30] generates a summary from multiple documents, each of which covers a different  ...
doi:10.30534/ijatcse/2021/111062021 fatcat:rs7d7bltbba6nj5ph3tx3hpgwm

Statistical and Analytical Study of Guided Abstractive Text Summarization

Jagadish S. Kallimani, K. G. Srinivasa, B. Eswara Reddy
2016 Current Science  
This paper presents the process that generates an abstractive summary by focusing on a unified model with attribute based Information Extraction (IE) rules and class based templates.  ...  It also draws comparison between abstracts generated and summaries obtained by extractive methods.  ...  The underlined text in the summary indicates the attributes extracted from the document.  ... 
doi:10.18520/cs/v110/i1/65-68 fatcat:3zomonyy6zdm3fhfd4syzw635u

Study of Different Features and Classification Techniques for Recognition of Handwritten Devanagari Text

Vijay Vijay, M U Kharat, S V Gumaste
2018 International Journal of Engineering & Technology  
Recognition of handwritten Devanagari words has been a popular area of research for decades because of its wide scope of applications.  ...  Millions of people all over the globe use the Devanagari script for various purposes such as communication, understanding history, record keeping, research, etc.  ...  A total of 300 handwritten document images were created from these writers. The collected documents were scanned at 300 DPI. Accuracy varies from language to language.  ...
doi:10.14419/ijet.v7i4.19.28285 fatcat:5vziba6hgjhchetwyzsrvqclya

A boundary-based tokenization technique for extractive text summarization

Nnaemeka M Oparauwah, Juliet N Odii, Ikechukwu I Ayogu, Vitalis C Iwuchukwu
2021 World Journal of Advanced Research and Reviews  
Experimental results showed that the proposed approach enhanced word tokenization by enhancing the selection of appropriate keywords from text document to be used for summarization.  ...  This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries in place of specific delimiters.  ...  Acknowledgments The authors are thankful to the anonymous reviewers of this work for their reviews. Disclosure of conflict of interest The authors declare that no known competing interest exists.  ... 
doi:10.30574/wjarr.2021.11.2.0351 fatcat:t67arfwm3neyzkai7sevgm63uy

Social Media Analysis based on Semanticity of Streaming and Batch Data [article]

Barathi Ganesh HB
2018 arXiv pre-print
Knowledge extraction differs with respect to the application in which the research on cognitive science fed the necessities for the same.  ...  Every second swing of such a micro posts exist which induces the need of processing those micro posts, in order to extract knowledge out of them.  ...  feature from a single batch document.  ...
arXiv:1801.01102v2 fatcat:sr5d2epwa5ejhgtnaru6n5kgzy

HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation [article]

Ayan Sengupta, Sourabh Kumar Bhattacharjee, Tanmoy Chakraborty, Md Shad Akhtar
2021 arXiv pre-print
HIT is a hierarchical transformer-based framework that captures the semantic relationship among words and hierarchically learns the sentence-level semantics using a fused attention mechanism.  ...  In this paper, we propose HIT, a robust representation learning method for code-mixed texts.  ...  Acknowledgement The work was partially supported by the Ramanujan Fellowship (SERB) and the Infosys Centre for AI, IIITD.  ... 
arXiv:2105.14600v1 fatcat:bqtnemsqmvb5zitwnz6rjdl3n4

An empirical study of CTC based models for OCR of Indian languages [article]

Minesh Mathew, CV Jawahar
2022 arXiv pre-print
We compare our models with popular publicly available OCR tools for end-to-end document image recognition.  ...  We also introduce a new public dataset called Mozhi for word and line recognition in Indian language.  ...  The results demonstrate that a unified framework that uses CTC transcription works well for recognition of multiple Indian languages without the need for any language/script specific modules.  ... 
arXiv:2205.06740v1 fatcat:pqu2nzagkjbcfny6fqm3bqcscm

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages [article]

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
2020 arXiv pre-print
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.  ...  We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embedding on multiple evaluation tasks.  ...  In some cases, we wrote custom extractors for each website using BeautifulSoup 4, a Python library for parsing HTML/XML documents.  ...
arXiv:2005.00085v1 fatcat:eiyrelngcbhxpmzyelcfj62qua

OCR with Adaptive Dictionary [chapter]

Chenyang Wang, Yanhong Xie, Kai Wang, Tao Li
2015 Lecture Notes in Computer Science  
In this paper, a framework is proposed for improving OCR performance with the adaptive dictionary, in which text categorization is utilized to construct dictionaries using web data and identify the category  ...  Compared with other existing methods for language identification, the proposed method shows a better performance.  ...  Texture-based methods are used for script identification, in which the texture features are extracted from the text patches and a classifier is applied to identify the script of the imaged documents.  ... 
doi:10.1007/978-3-319-21963-9_56 fatcat:cptaquvuy5bvveeonmnkohylve