Limits of Detecting Text Generated by Large-Scale Language Models [article]

Lav R. Varshney, Nitish Shirish Keskar, Richard Socher
2020 arXiv   pre-print
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.  ...  Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.  ...  INTRODUCTION Building on a long history of language generation models that are based on statistical knowledge that people have [1] - [6] , large-scale, neural network-based language models (LMs) that  ... 
arXiv:2002.03438v1 fatcat:o636j5cl4ngo7mzgpmjajyafg4

Open Vocabulary Object Detection with Pseudo Bounding-Box Labels [article]

Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, Caiming Xiong
2022 arXiv   pre-print
To enlarge the set of base classes, we propose a method to automatically generate pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs.  ...  Our method leverages the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels and then directly uses them for training object detectors.  ...  from large-scale image-caption pairs by leveraging the localization ability of pre-trained vision-language (VL) models.  ... 
arXiv:2111.09452v2 fatcat:d3vl6ubbdjcptakyqgw5vc2kfa

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [article]

Alex Tamkin, Miles Brundage, Jack Clark, Deep Ganguli
2021 arXiv   pre-print
2) What are the societal effects of widespread use of large language models? Here, we provide a detailed summary of the discussion organized by the two themes above.  ...  Broadly, the discussion centered around two main questions: 1) What are the technical capabilities and limitations of large language models?  ...  What are the technical capabilities and limitations of large language models?  ... 
arXiv:2102.02503v1 fatcat:woj37ay5obhj7cnurrq7czkakq

Large-Scale Hate Speech Detection with Cross-Domain Transfer [article]

Cagri Toraman, Furkan Şahinuç, Eyup Halit Yılmaz
2022 arXiv   pre-print
large-scale hate speech detection.  ...  In this study, we construct large-scale tweet datasets for hate speech detection in English and a low-resource language, Turkish, consisting of human-labeled 100k tweets per each.  ...  (ii) We analyze the performance of various models for large-scale hate speech detection with a special focus on model scalability.  ... 
arXiv:2203.01111v1 fatcat:myrwzl6u5naejbzegamuovr4oq

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [article]

Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang
2022 arXiv   pre-print
After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or few shot learning tasks.  ...  We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data  ...  As VLP models are limited by inadequate well-aligned (image, caption) pairs, VIVO proposed to scale up pre-training using a large amount of (image, tag) pairs.  ... 
arXiv:2203.01922v1 fatcat:vnjfetgkpzedpfhklufooqet7y

A Convolutional Neural Network-Based Chinese Text Detection Algorithm via Text Structure Modeling

Xiaohang Ren, Yi Zhou, Jianhua He, Kai Chen, Xiaokang Yang, Jun Sun
2017 IEEE transactions on multimedia  
The spatial pyramid layer is then introduced to enhance the scale invariability of the CNN model for detecting texts in multiple scales.  ...  Furthermore a simplified version of the proposed algorithm with only general components is compared to existing general text detection algorithms on the ICDAR 2011 and 2013 datasets, showing comparable  ...  The scale invariability of the CNN model is enhanced by adding the SPL to generate scale properties for extracted features.  ... 
doi:10.1109/tmm.2016.2625259 fatcat:j2437qiaazbafd6gzlxt6pcwca

An analytical study of information extraction from unstructured and multidimensional big data

Kiran Adnan, Rehan Akbar
2019 Journal of Big Data  
In "IE from images" section, visual relationship detection, text recognition and face recognition techniques as IE subtask, recent work, and limitations have been described.  ...  Finally, the IE improvement model is designed to overcome the identified limitations of existing IE techniques for multidimensional unstructured big data.  ...  Language priors [41] or structural language [38] are used to overcome the limitations of single class modeling.  ... 
doi:10.1186/s40537-019-0254-8 fatcat:qy5l55um7feeblec4hxohr3pqa

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [article]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar
2022 arXiv   pre-print
To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups.  ...  Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate.  ...  Acknowledgements We thank Azure AI Platform and Misha Bilenko for sponsoring this work and providing compute resources, Microsoft Research for supporting our large scale human study, and Alexandra Olteanu  ... 
arXiv:2203.09509v3 fatcat:osfl4z2bmjdibefqxq4v2cdunu

Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech [article]

Tomer Wullach, Amir Adler, Einat Minkov
2021 arXiv   pre-print
Automatic hate speech detection is hampered by the scarcity of labeled datasets, leading to poor generalization. We employ pretrained language models (LMs) to alleviate this data bottleneck.  ...  We utilize the GPT LM for generating large amounts of synthetic hate speech sequences from available labeled examples, and leverage the generated data in fine-tuning large pretrained LMs on hate detection  ...  They too focused on dataset balancing, using limited amounts of synthetic data. In this work, we apply sequence generation at large scale, increasing the original dataset size by orders of magnitude.  ... 
arXiv:2109.00591v1 fatcat:xvhkgin7hrg77pz42fkryeh66a

Connecting the Dots Between Fact Verification and Fake News Detection [article]

Qifei Li, Wangchunshu Zhou
2020 arXiv   pre-print
Fact verification models have enjoyed a fast advancement in the last two years with the development of pre-trained language models like BERT and the release of large scale datasets such as FEVER.  ...  Our approach makes use of the recent success of fact verification models and enables zero-shot fake news detection, alleviating the need of large-scale training data to train fake news detection models  ...  Moreover, few large scale fake news detection datasets are available and they are generally limited in their domains while human annotation of fake news is time-consuming and expensive, which hinders the  ... 
arXiv:2010.05202v1 fatcat:vvjqynhqgzckxofdintt6oqghm

Release Strategies and the Social Impacts of Language Models [article]

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse (+3 others)
2019 arXiv   pre-print
Large language models have a range of beneficial uses: they can assist in prose, poetry, and programming; analyze dataset biases; and more.  ...  However, their flexibility and generative capabilities also raise misuse concerns. This report discusses OpenAI's work related to the release of its GPT-2 language model.  ...  Acknowledgements We thank the following individuals for feedback on earlier versions of this document: Any remaining errors or omissions are the authors' responsibility alone.  ... 
arXiv:1908.09203v2 fatcat:6c2qqp32h5ax7m6xg5conmmqge

Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR) [article]

Charangan Vasantharajan, Uthayasanker Thayasivam
2022 arXiv   pre-print
Especially, our model detects code-mix text, numbers, and special characters from the printed document.  ...  For this purpose, we enhanced the performance of Tesseract 4.1.1 by employing LSTM-based training on many legacy fonts to recognise printed characters in the above languages.  ...  Thus, the enhancement of NLP in those languages has been limited so far.  ... 
arXiv:2109.05952v2 fatcat:q7gskdi7pnh35fxnv2devoadyq

NLP Research and Resources at DaSciM, Ecole Polytechnique [article]

Hadi Abdine, Yanzhu Guo, Moussa Kamal Eddine, Giannis Nikolentzos, Stamatis Outsios, Guokan Shang, Christos Xypolopoulos, Michalis Vazirgiannis
2021 arXiv   pre-print
DaSciM (Data Science and Mining) part of LIX at Ecole Polytechnique, established in 2013 and since then producing research results in the area of large scale data analysis via methods of machine and deep  ...  The group has been specifically active in the area of NLP and text mining with interesting results at methodological and resources level.  ...  While efforts on domain adaptation of large-scale language models to tweets have been made in English, there is no similar work in any other language.  ... 
arXiv:2112.00566v1 fatcat:dcmwpwdwc5emti6jcflqq5ib4a

COLD: A Benchmark for Chinese Offensive Language Detection [article]

Jiawen Deng, Jingyan Zhou, Hao Sun, Fei Mi, Minlie Huang
2022 arXiv   pre-print
Offensive language detection and prevention becomes increasingly critical for maintaining a healthy social platform and the safe deployment of language models.  ...  and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.  ...  Can neural models detect offensive language by training on our dataset?  ... 
arXiv:2201.06025v1 fatcat:e2ahpikv2jeo5av5sydbzjyr2y

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [article]

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
2022 arXiv   pre-print
Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision  ...  By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge.  ...  Inspired by these works, we explore to transfer the knowledge in large-scale vision-language pre-trained models to the downstream dense prediction tasks. Vision-language models.  ... 
arXiv:2112.01518v2 fatcat:vpy4fs655jdcnmf332sur62azm