
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities [article]

Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, Jens Lehmann
2020 arXiv   pre-print
In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages.  ...  We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities  ...  The Multiple Languages and Modalities dataset comprises data points on 236k human settlements for evaluating and optimising multitask learning systems.  ... 
arXiv:2008.06376v1 fatcat:yegbkyznffc6tm2pispzoxaxfy

Video-Grounded Dialogues with Pretrained Generation Language Models [article]

Hung Le, Steven C.H. Hoi
2020 arXiv   pre-print
Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: spatio-temporal level in video and token-sentence level in dialogue  ...  In this paper, we leverage the power of pre-trained language models for improving video-grounded dialogue, which is very challenging and involves complex features of different dynamics: (1) Video features  ...  MLM is learned similarly to response generation by passing through a linear layer with softmax.  ... 
arXiv:2006.15319v1 fatcat:vcpyx2tfqzbalp7ueldtn3nx7y
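The MLM prediction head the Le and Hoi snippet describes - a linear layer followed by a softmax, with cross-entropy taken over the masked positions only - can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; `mlm_loss` and all shapes here are hypothetical.

```python
import numpy as np

def mlm_loss(hidden, weight, targets, mask):
    """Cross-entropy over masked positions: logits = hidden @ weight,
    softmax over the vocabulary, loss averaged over masked tokens only."""
    logits = hidden @ weight                       # (seq_len, vocab)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float((nll * mask).sum() / mask.sum())  # average over masked positions

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))       # 4 token states, hidden dim 8
weight = rng.normal(size=(8, 16))      # projection to a vocabulary of 16
targets = np.array([3, 0, 7, 2])       # ground-truth token ids
mask = np.array([1.0, 0.0, 1.0, 0.0])  # only positions 0 and 2 are masked
loss = mlm_loss(hidden, weight, targets, mask)
```

In practice the same projection is typically tied to the input embedding matrix, and the loss is computed with a fused log-softmax for stability.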

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [article]

Shikib Mehri, Mihail Eric, Dilek Hakkani-Tur
2020 arXiv   pre-print
To progress research in this direction, we introduce DialoGLUE (Dialogue Language Understanding Evaluation), a public benchmark consisting of 7 task-oriented dialogue datasets covering 4 distinct natural language understanding tasks, designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning.  ...  (3) multi-tasking with MLM on the target dataset during fine-tuning and (4) both pre-training and multitasking with MLM.  ... 
arXiv:2009.13570v2 fatcat:tzcnyusgajgn5osewdyuiabczm

UNITER: UNiversal Image-TExt Representation Learning [article]

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
2020 arXiv   pre-print
We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA).  ...  In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions).  ...  Trained with both in-domain and out-of-domain datasets, UNITER outperforms state-of-the-art models over multiple V+L tasks by a significant margin.  ... 
arXiv:1909.11740v3 fatcat:zdlyfiquxngzrnpvl4epubj3p4

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [article]

Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan Wang, Jianfeng Gao, Dongdong Zhang, Nan Duan
2021 arXiv   pre-print
Our goal is to learn universal representations that can map objects occurring in different modalities, or texts expressed in different languages, into a common semantic space.  ...  M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.  ...  To help the model learn different language representations under the shared vision modality, we propose three Multimodal Code-switched Training tasks: MC-MLM, MC-MRM and MC-VLM.  ... 
arXiv:2006.02635v4 fatcat:g4yd5j2mozdyfpclchqoymvmnq

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding [article]

Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng
2021 arXiv   pre-print
The modality switch training randomly swaps speech and text embeddings based on the forced alignment result to learn a joint representation space.  ...  The embedding aligner is a shared linear projection between the text encoder and the speech encoder, trained with a masked language modeling (MLM) loss and a connectionist temporal classification (CTC) loss, respectively.  ...  Each layer is a Transformer block with 8-head self-attention of dimension 64 per head [18]. For multitask learning (MTL), the weights for the CTC and attention losses are set to 0.3 and 0.7, respectively.  ... 
arXiv:2110.12138v1 fatcat:uvg7xrv4pzcixjhisol3vwpojq
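The 0.3/0.7 CTC/attention weighting quoted above amounts to a convex combination of the two losses, as in the usual hybrid CTC/attention formulation for end-to-end ASR. A minimal sketch; `joint_loss` is a hypothetical helper, not from the paper:

```python
def joint_loss(ctc_loss, attn_loss, ctc_weight=0.3):
    """Convex combination of the CTC and attention losses:
    L = w * L_ctc + (1 - w) * L_attn, with w = 0.3 as in the paper."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attn_loss

combined = joint_loss(2.0, 1.0)  # 0.3 * 2.0 + 0.7 * 1.0 = 1.3
```

Keeping the two weights summing to one makes the combined loss scale comparable to either objective alone, which simplifies learning-rate tuning.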

UniT: Multimodal Multitask Learning with a Unified Transformer [article]

Ronghang Hu, Amanpreet Singh
2021 arXiv   pre-print
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations  ...  We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning  ...  We are grateful to Devi Parikh, Douwe Kiela, Marcus Rohrbach, Vedanuj Goswami, and other colleagues at FAIR for fruitful discussions and feedback.  ... 
arXiv:2102.10772v3 fatcat:rt4zw7it4jd2him7gcpf4nbl54

A Survey of Knowledge Enhanced Pre-trained Models [article]

Jian Yang, Gang Xiao, Yulong Shen, Wei Jiang, Xinyu Hu, Ying Zhang, Jinghui Peng
2021 arXiv   pre-print
In this survey, we provide a comprehensive overview of KEPTMs for natural language processing. We first introduce the progress of pre-trained models and knowledge representation learning.  ...  Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning.  ...  The gains of lexical knowledge injection are observed for 9 out of 10 language understanding tasks from the GLUE benchmark, and for 3 lexical simplification benchmarks.  ... 
arXiv:2110.00269v1 fatcat:6y4gi4bmb5fnrogi7nx44jdxie

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction [article]

Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
2020 arXiv   pre-print
ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities.  ...  Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction.  ...  Thanks to the Reverie team for authorizing our usage of the PubChem 77M dataset, which was processed, filtered and split by them.  ... 
arXiv:2010.09885v2 fatcat:dpt4gwxccngg5nbyp6pumfolra

AMMU : A Survey of Transformer-based Biomedical Pretrained Language Models [article]

Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, Sivanesan Sangeetha
2021 arXiv   pre-print
In this survey, we start with a brief overview of foundational concepts like self-supervised learning, the embedding layer, and transformer encoder layers.  ...  We strongly believe there is a need for a paper that provides a comprehensive survey of the various transformer-based biomedical pretrained language models (BPLMs).  ...  A benchmark with one or more datasets for multiple NLP tasks helps to assess the general ability and robustness of models.  ... 
arXiv:2105.00827v2 fatcat:yzsr4tg7lrexzinrn5psw5r5q4

Perceiver IO: A General Architecture for Structured Inputs Outputs [article]

Andrew Jaegle and Sebastian Borgeaud and Jean-Baptiste Alayrac and Carl Doersch and Catalin Ionescu and David Ding and Skanda Koppula and Daniel Zoran and Andrew Brock and Evan Shelhamer and Olivier Hénaff and Matthew M. Botvinick and Andrew Zisserman and Oriol Vinyals and João Carreira
2021 arXiv   pre-print
As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.  ...  The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains  ... 
arXiv:2107.14795v2 fatcat:225bbvmax5c75e3fb7dna6caqi

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [article]

Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang
2021 arXiv   pre-print
Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence.  ...  We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language  ...  A series of cross-modal pretraining methods were proposed, and self-supervised learning provides the models with a strong ability to adapt to multiple multi-modal downstream tasks through finetuning.  ... 
arXiv:2003.13198v4 fatcat:6rp3lxy7fnbmxft5kfm5imuisq

Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond [article]

Zhuosheng Zhang, Hai Zhao, Rui Wang
2020 arXiv   pre-print
In this survey, we provide a comprehensive and comparative review of MRC covering overall research topics about 1) the origin and development of MRC and CLM, with a particular focus on the role of CLMs  ...  Machine reading comprehension (MRC) aims to teach machines to read and comprehend human languages, which is a long-standing goal of natural language processing (NLP).  ...  Mostly, for module design, the answer span prediction and answer verification are trained jointly with multitask learning (Figure 9 (c)).  ... 
arXiv:2005.06249v1 fatcat:htdq7hk6mrghvknwbkchgdioku

REALM: Retrieval-Augmented Language Model Pre-Training [article]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang
2020 arXiv   pre-print
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that  ...  To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from  ...  Retrieve-and-edit with learned retrieval: In order to better explain the variance in the input text and enable controllable generation, Guu et al. (2018) proposed a language model with the retrieve-and-edit  ... 
arXiv:2002.08909v1 fatcat:ky5otedxjrcahpwcvknmj3cgae
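REALM's retrieve-then-predict setup treats the retrieved document as a latent variable and marginalizes over it: p(y|x) = Σ_z p(z|x) p(y|z, x), which is what makes the retriever trainable from the MLM signal alone. A toy numeric sketch (values and the function name `marginal_answer_prob` are illustrative, not from the paper):

```python
import numpy as np

def marginal_answer_prob(p_doc, p_ans_given_doc):
    """p(y|x) = sum_z p(z|x) * p(y|z, x):
    marginalize the answer probability over the retrieved documents z."""
    return float(np.dot(p_doc, p_ans_given_doc))

p_doc = np.array([0.7, 0.2, 0.1])  # retriever distribution p(z|x) over 3 docs
p_ans = np.array([0.9, 0.1, 0.5])  # answer likelihood p(y|z, x) under each doc
prob = marginal_answer_prob(p_doc, p_ans)  # 0.63 + 0.02 + 0.05 = 0.70
```

Because the marginal is a weighted sum, gradients flow back into p(z|x), rewarding documents that raise the masked-token likelihood - the "backpropagating through a retrieval step" the snippet refers to.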

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it? [article]

Tobias Norlund, Lovisa Hagström, Richard Johansson
2021 arXiv   pre-print
A proposed solution to this is to provide the model with additional data modalities that complements the knowledge obtained through text.  ...  Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method.  ...  We also thank the anonymous reviewers for their valuable feedback and knowledge sharing.  ... 
arXiv:2109.11321v2 fatcat:sdda4qr5t5azxhbelh72yzls2u