SISA: Securing Images by Selective Alteration [article]

Prutha Gaherwar, Shraddha Joshi, Raviraj Joshi, Rahul Khengare
2021 arXiv   pre-print
With the growing popularity of mobile and camera devices, digital content in the form of images has increased drastically. As personal life is being continuously documented in pictures, the risk of losing it to eavesdroppers is a matter of grave concern. Secondary storage is the most preferred medium for the storage of personal and other images. Our work is concerned with the security of such images. While encryption is the best way to ensure image security, full encryption and decryption is a computationally-intensive process. Moreover, as cameras are getting better every day, image quality, and thus the pixel density, has increased considerably. The increased pixel density makes encryption and decryption more expensive. We, therefore, delve into selective encryption and selective blurring based on the region of interest. Instead of encrypting or blurring the entire photograph, we only encode selected regions of the image. We present a comparative analysis of the partial and full encryption of the photos. This kind of encoding will help us lower the encryption overhead without compromising security. The applications utilizing this technique will become more usable due to the reduction in the decryption time. Additionally, because blurred images are more readable than encrypted ones, blurring allows us to define the level of security. We leverage machine learning algorithms such as Mask-RCNN (Region-based convolutional neural network) and YOLO (You Only Look Once) to select the region of interest. These algorithms have set new benchmarks for object recognition. We develop an end-to-end system to demonstrate our idea of selective encryption.
arXiv:2106.11770v1 fatcat:63v4fsxqlndixienbzaewktny4

Deep Learning for Hindi Text Classification: A Comparison [article]

Ramchandra Joshi, Purvi Goel, Raviraj Joshi
2020 arXiv   pre-print
Natural Language Processing (NLP) and especially natural language text analysis have seen great advances in recent times. Usage of deep learning in text processing has revolutionized the techniques for text processing and achieved remarkable results. Different deep learning architectures like CNN, LSTM, and the more recent Transformer have been used to achieve state-of-the-art results on a variety of NLP tasks. In this work, we survey a host of deep learning architectures for text classification tasks. The work is specifically concerned with the classification of Hindi text. The research in the classification of the morphologically rich and low-resource Hindi language written in Devanagari script has been limited due to the absence of a large labeled corpus. In this work, we used translated versions of English datasets to evaluate models based on CNN, LSTM and Attention. Multilingual pre-trained sentence embeddings based on BERT and LASER are also compared to evaluate their effectiveness for the Hindi language. The paper also serves as a tutorial for popular text classification techniques.
arXiv:2001.10340v1 fatcat:asbxrs6wybc5ppkmaeyhpirm4i

Comparative Study of Long Document Classification [article]

Vedangi Wagh, Snehal Khandve, Isha Joshi, Apurva Wani, Geetanjali Kale, Raviraj Joshi
2021 arXiv   pre-print
The amount of information stored in the form of documents on the internet has been increasing rapidly. Thus it has become a necessity to organize and maintain these documents in an optimum manner. Text classification algorithms study the complex relationships between words in a text and try to interpret the semantics of the document. These algorithms have evolved significantly in the past few years. There has been a lot of progress from simple machine learning algorithms to transformer-based architectures. However, existing literature has analyzed different approaches on different datasets, thus making it difficult to compare the performance of machine learning algorithms. In this work, we revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets. We present an exhaustive comparison of different algorithms on a range of long document datasets. We reiterate that long document classification is a simpler task and even basic algorithms perform competitively with BERT-based approaches on most of the datasets. The BERT-based models perform consistently well on all the datasets and can be blindly used for the document classification task when the computational cost is not a concern. In the shallow models category, we suggest the usage of the raw BiLSTM + Max architecture, which performs decently across all the datasets. An even simpler GloVe + Attention bag-of-words model can be utilized for simpler use cases. The importance of using sophisticated models is clearly visible in the IMDB sentiment dataset, which is a comparatively harder task.
arXiv:2111.00702v1 fatcat:ehkbdytarjatxixpnjvrc6wmpi

Hierarchical Neural Network Approaches for Long Document Classification [article]

Snehal Khandve, Vedangi Wagh, Apurva Wani, Isha Joshi, Raviraj Joshi
2022 arXiv   pre-print
Text classification algorithms investigate the intricate relationships between words or phrases and attempt to deduce the document's interpretation. In the last few years, these algorithms have progressed tremendously. Transformer architecture and sentence encoders have proven to give superior results on natural language processing tasks. But a major limitation of these architectures is that they are applicable only to text no longer than a few hundred words. In this paper, we explore hierarchical transfer learning approaches for long document classification. We employ pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) in a hierarchical setup to capture better representations efficiently. Our proposed models are conceptually simple: we divide the input data into chunks and then pass these through base models of BERT and USE. The output representation for each chunk is then propagated through a shallow neural network comprising LSTMs or CNNs for classifying the text data. These extensions are evaluated on 6 benchmark datasets. We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart. However, the hierarchical BERT models are still desirable as they avoid the quadratic complexity of the attention mechanism in BERT. Along with the hierarchical approaches, this work also provides a comparison of different deep learning algorithms like USE, BERT, HAN, Longformer, and BigBird for long document classification. The Longformer approach consistently performs well on most of the datasets.
arXiv:2201.06774v1 fatcat:tlx2zldiefb2rbsrwof7z4zena
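
The chunk-then-aggregate setup described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a pre-trained BERT or USE encoder, and mean pooling is swapped in for the shallow LSTM or CNN that the abstract describes. All function names and the chunk size are invented for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split a document into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def toy_encoder(chunk: str) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder such as BERT or USE.

    Returns a 2-dimensional vector: [word count, total character count].
    """
    words = chunk.split()
    return [float(len(words)), float(sum(len(w) for w in words))]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average the chunk vectors (assumed stand-in for the LSTM/CNN layer)."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def encode_document(
    text: str,
    chunk_size: int,
    encoder: Callable[[str], List[float]] = toy_encoder,
) -> List[float]:
    """Encode each chunk independently, then aggregate into one vector."""
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return []
    return mean_pool([encoder(chunk) for chunk in chunks])
```

Because each chunk is encoded independently, the encoder never sees more than `chunk_size` words at once, which is the property the abstract credits with avoiding the quadratic attention cost on long inputs.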

Evaluating Deep Learning Approaches for Covid19 Fake News Detection [article]

Apurva Wani, Isha Joshi, Snehal Khandve, Vedangi Wagh, Raviraj Joshi
2021 arXiv   pre-print
Social media platforms like Facebook, Twitter, and Instagram have enabled connection and communication on a large scale. They have revolutionized the rate at which information is shared and enhanced its reach. However, the other side of the coin tells an alarming story. These platforms have led to an increase in the creation and spread of fake news. Fake news has not only influenced people in the wrong direction but also claimed human lives. During these critical times of the Covid19 pandemic, it is easy to mislead people and make them believe in fatal information. Therefore it is important to curb fake news at source and prevent it from spreading to a larger audience. We look at automated techniques for fake news detection from a data mining perspective. We evaluate different supervised text classification algorithms on the Constraint@AAAI 2021 Covid-19 Fake news detection dataset. The classification algorithms are based on Convolutional Neural Networks (CNN), Long Short Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT). We also evaluate the importance of unsupervised learning in the form of language model pre-training and distributed word representations using an unlabelled covid tweets corpus. We report the best accuracy of 98.41% on the Covid-19 Fake news detection dataset.
arXiv:2101.04012v2 fatcat:2ny3dzconfhxhgqzbm3mtm53y4

L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models [article]

Parth Patil, Aparna Ranade, Maithili Sabane, Onkar Litake, Raviraj Joshi
2022 arXiv   pre-print
Named Entity Recognition (NER) is a basic NLP task and finds major applications in conversational and search systems. It helps us identify key entities in a sentence used for the downstream application. NER or similar slot filling systems for popular languages have been heavily used in commercial applications. In this work, we focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state. Marathi is a low-resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc. MahaBERT provides the best performance among all the models. The data and models are available at .
arXiv:2204.06029v1 fatcat:xur45zzzkvc6hkpdmqqvyqdyhu

ICodeNet – A Hierarchical Neural Network Approach for Source Code Author Identification [article]

Pranali Bora, Tulika Awalgaonkar, Himanshu Palve, Raviraj Joshi, Purvi Goel
2021 arXiv   pre-print
With the open-source revolution, source codes are now more easily accessible than ever. This has, however, made it easier for malicious users and institutions to copy the code without regard for the license or credit to the original author. Therefore, source code author identification is a critical task of paramount importance. In this paper, we propose ICodeNet, a hierarchical neural network that can be used for source code file-level tasks. ICodeNet processes source code in image format and is employed for the task of per-file author identification. ICodeNet consists of an ImageNet-trained VGG encoder followed by a shallow neural network. The shallow network is based either on CNN or LSTM. Different variations of models are evaluated on a source code author classification dataset. We have also compared our image-based hierarchical neural network model with a simple image-based CNN architecture and text-based CNN and LSTM models to highlight its novelty and efficiency.
arXiv:2102.00230v1 fatcat:2zmvhql6cnbmpkdybisqls2oj4

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models [article]

Ravindra Nayak, Raviraj Joshi
2022 arXiv   pre-print
Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are too scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed data in Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. HingGPT is a GPT2-based generative transformer model capable of generating full tweets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed data using the process outlined in this work. The dataset and models are available at .
arXiv:2204.08398v1 fatcat:b3ltly4s6bbprofijwhhhbx3xi

L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library [article]

Raviraj Joshi
2022 arXiv   pre-print
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall we present MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at
arXiv:2205.14728v2 fatcat:g3xh4re6znb2dhrmuxwqsywrbi

L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset [article]

Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, Raviraj Joshi
2021 arXiv   pre-print
Alternatively, due to the low-resource nature of many languages, translated versions of English datasets were used for analysis (Joshi et al., 2019; Refaee and Rieser, 2015; Mohammad et al., 2016) ...
arXiv:2103.11408v2 fatcat:bycabla2mbfolonfcfmit5ezsu

Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [article]

Abhishek Velankar, Hrushikesh Patil, Raviraj Joshi
2022 arXiv   pre-print
Transformers are the most eminent architectures used for a vast range of Natural Language Processing tasks. These models are pre-trained over a large text corpus and are meant to deliver state-of-the-art results on tasks like text classification. In this work, we conduct a comparative study between monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi. We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare them with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi. We further show that Marathi monolingual models outperform the multilingual BERT variants on five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not work well on out-of-domain social media datasets. We consider two Marathi hate speech datasets, L3Cube-MahaHate and HASOC-2021; a Marathi sentiment classification dataset, L3Cube-MahaSent; and the Marathi Headline and Articles classification datasets.
arXiv:2204.08669v1 fatcat:oui46mug4jeujh57zyrxggx5rm

Domain Adaptation of NMT models for English-Hindi Machine Translation Task at AdapMT ICON 2020 [article]

Ramchandra Joshi, Rushabh Karnavat, Kaustubh Jirapure, Raviraj Joshi
2020 arXiv   pre-print
Recent advancements in Neural Machine Translation (NMT) models have been shown to produce state-of-the-art results on machine translation for low-resource Indian languages. This paper describes the neural machine translation systems for the English-Hindi language pair presented in the AdapMT Shared Task ICON 2020. The shared task aims to build a translation system for Indian languages in specific domains like Artificial Intelligence (AI) and Chemistry using a small in-domain parallel corpus. We evaluated the effectiveness of two popular NMT models, namely the LSTM and Transformer architectures, for the English-Hindi machine translation task based on BLEU scores. We train these models primarily using the out-of-domain data and employ simple domain adaptation techniques based on the characteristics of the in-domain dataset. The fine-tuning and mixed-domain data approaches are used for domain adaptation. Our team was ranked first in the chemistry and general domain En-Hi translation task and second in the AI domain En-Hi translation task.
arXiv:2012.12112v2 fatcat:if6ffjqgkbggppuefxk2o6mkc4

On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [article]

Raviraj Joshi, Subodh Kumar
2022 arXiv   pre-print
Streaming automatic speech recognition (ASR) models are more popular and suitable for voice-based applications. However, non-streaming models provide better performance as they look at the entire audio context. To leverage the benefits of the non-streaming model in streaming applications like voice search, it is commonly used in second-pass re-scoring mode. The candidate hypothesis generated using streaming models is re-scored using a non-streaming model. In this work, we evaluate the non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall we show that the Transformer model offers an acceptable word error rate (WER) with the lowest latency requirements. We report a relative WER improvement of around 16% with the second-pass LAS re-scoring with latency overhead under 5ms. We also highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable WER. Moreover, we observe that in the second-pass re-scoring mode all the encoders provide similar benefits, whereas the difference in performance is prominent in standalone text generation mode.
arXiv:2206.12829v1 fatcat:o535kv2k7bfjrevk4ybid2zbxy

Hate and Offensive Speech Detection in Hindi and Marathi [article]

Abhishek Velankar, Hrushikesh Patil, Amol Gore, Shubham Salunke, Raviraj Joshi
2021 arXiv   pre-print
Sentiment analysis is the most basic NLP task to determine the polarity of text data. There has been a significant amount of work in the area of multilingual text as well. Still, hate and offensive speech detection faces a challenge due to inadequate availability of data, especially for Indian languages like Hindi and Marathi. In this work, we consider hate and offensive speech detection in Hindi and Marathi texts. The problem is formulated as a text classification task using state-of-the-art deep learning approaches. We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa. The basic models based on CNN and LSTM are augmented with FastText word embeddings. We use the HASOC 2021 Hindi and Marathi hate speech datasets to compare these algorithms. The Marathi dataset consists of binary labels and the Hindi dataset consists of binary as well as more fine-grained labels. We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance. Moreover, with normal hyper-parameter tuning, the basic models perform better than BERT-based models on the fine-grained Hindi dataset.
arXiv:2110.12200v2 fatcat:v6pkub4ycbgivjebzg3ju6stoe

ShufText: A Simple Black Box Approach to Evaluate the Fragility of Text Classification Models [article]

Rutuja Taware, Shraddha Varat, Gaurav Salunke, Chaitanya Gawande, Geetanjali Kale, Rahul Khengare, Raviraj Joshi
2021 arXiv   pre-print
Text classification is the most basic natural language processing task. It has a wide range of applications ranging from sentiment analysis to topic classification. Recently, deep learning approaches based on CNN, LSTM, and Transformers have been the de facto approach for text classification. In this work, we highlight a common issue associated with these approaches. We show that these systems are over-reliant on the important words present in the text that are useful for classification. With limited training data and a discriminative training strategy, these approaches tend to ignore the semantic meaning of the sentence and rather just focus on keywords or important n-grams. We propose a simple black box technique, ShufText, to present the shortcomings of the model and identify the over-reliance of the model on keywords. This involves randomly shuffling the words in a sentence and evaluating the classification accuracy. We see that on common text classification datasets there is very little effect of shuffling and with high probability these models predict the original class. We also evaluate the effect of language model pretraining on these models and try to answer questions around model robustness to out-of-domain sentences. We show that simple models based on CNN or LSTM as well as complex models like BERT are questionable in terms of their syntactic and semantic understanding.
arXiv:2102.00238v1 fatcat:7oqkw6f4mrcxvdogx6qqvm6cgu
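
The shuffling probe described in this abstract can be sketched as below. This is a minimal illustration rather than the authors' code: the function names are hypothetical, `keyword_classifier` is an invented toy model, and the metric shown (the fraction of sentences whose predicted class is unchanged after shuffling) is one assumed way to quantify "predicting the original class".

```python
import random
from typing import Callable, Sequence


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def shuffled_agreement(
    classify: Callable[[str], str],
    sentences: Sequence[str],
    seed: int = 0,
) -> float:
    """Fraction of sentences whose predicted label survives word shuffling.

    The classifier is treated as a black box: only its predictions are used.
    A value near 1.0 suggests the model relies on word presence, not order.
    """
    if not sentences:
        return 0.0
    rng = random.Random(seed)
    unchanged = sum(
        classify(s) == classify(shuffle_words(s, rng)) for s in sentences
    )
    return unchanged / len(sentences)


def keyword_classifier(text: str) -> str:
    """Toy keyword-driven model (hypothetical), insensitive to word order."""
    return "positive" if "good" in text.lower().split() else "negative"
```

A purely keyword-driven model such as `keyword_classifier` keeps every prediction under shuffling, which is the kind of order-insensitivity the probe is meant to expose.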