22,375 Hits in 6.8 sec

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning [article]

Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Zhaopeng Tu
2021 arXiv   pre-print
Encoder layer fusion (EncoderFusion) is a technique to fuse all the encoder layers (instead of the uppermost layer) for sequence-to-sequence (Seq2Seq) models, which has proven effective on various NLP  ...  However, it is still not entirely clear why and when EncoderFusion should work. In this paper, our main contribution is to take a step further in understanding EncoderFusion.  ...  In this work, we demonstrate the effectiveness of the two typical probability-level fusion methods on sequence-to-sequence learning tasks.  ... 
arXiv:2012.14768v2 fatcat:jhbmn7uc5zeeffhxy7srjopqmq
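
To illustrate the layer-fusion idea the snippet describes, here is a minimal PyTorch sketch of one common EncoderFusion variant: a learned softmax-weighted sum over all encoder layers feeds the decoder instead of only the uppermost layer. Class and variable names are hypothetical, not the authors' code.

    import torch
    import torch.nn as nn

    class LayerFusion(nn.Module):
        # One learned scalar weight per encoder layer, normalized with softmax.
        def __init__(self, num_layers):
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, layer_outputs):
            # layer_outputs: list of (batch, seq, d_model) tensors, one per encoder layer
            w = torch.softmax(self.weights, dim=0)
            stacked = torch.stack(layer_outputs)           # (L, batch, seq, d_model)
            return (w.view(-1, 1, 1, 1) * stacked).sum(0)  # fused encoder output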

Attention Mechanism based Cognition-level Scene Understanding [article]

Xuejiao Tang, Tai Le Quy, Eirini Ntoutsi, Kea Turner, Vasile Palade, Israat Haque, Peng Xu, Chris Brown, Wenbin Zhang
2022 arXiv   pre-print
In this paper, we propose a parallel attention-based cognitive VCR network PAVCR, which fuses visual-textual information efficiently and encodes semantic information in parallel to enable the model to  ...  However, these approaches suffer from a lack of generalizability and a loss of information in long sequences.  ...  fusion layer and commonsense encoder layer.  ... 
arXiv:2204.08027v2 fatcat:prxrxgr44bf6li6hnsv5fhhwom
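
The snippet mentions parallel attention streams feeding a fusion layer. Below is a minimal sketch of one plausible reading, two cross-attention streams running in parallel with their pooled outputs concatenated; the dimensions, pooling, and fusion form are assumptions, not details from the paper.

    import torch
    import torch.nn as nn

    class ParallelAttentionFusion(nn.Module):
        def __init__(self, d=512, heads=8):
            super().__init__()
            self.t2v = nn.MultiheadAttention(d, heads, batch_first=True)
            self.v2t = nn.MultiheadAttention(d, heads, batch_first=True)
            self.proj = nn.Linear(2 * d, d)

        def forward(self, text, vision):                # each (batch, seq, d)
            t, _ = self.t2v(text, vision, vision)       # text attends to vision
            v, _ = self.v2t(vision, text, text)         # vision attends to text
            # pool each stream, then fuse by concatenation + projection
            return self.proj(torch.cat([t.mean(1), v.mean(1)], dim=-1))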

Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [article]

Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff
2020 arXiv   pre-print
Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical  ...  In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data.  ...  The acoustic encoder used for learning task-specific embeddings consists of a convolutional layer of kernel size 5 and an LSTM hidden layer of size 256.  ... 
arXiv:2008.00702v1 fatcat:mmu44d6r7nb7dbjjywow6ximyq
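
The snippet states the acoustic encoder's configuration explicitly (a convolutional layer of kernel size 5 and an LSTM hidden layer of size 256), which maps to a few lines of PyTorch; the 80-dimensional input feature size is an assumption (e.g., log-mel frames).

    import torch.nn as nn

    class AcousticEncoder(nn.Module):
        def __init__(self, feat_dim=80):
            super().__init__()
            self.conv = nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2)
            self.lstm = nn.LSTM(256, 256, batch_first=True)

        def forward(self, x):                            # x: (batch, frames, feat_dim)
            h = self.conv(x.transpose(1, 2)).transpose(1, 2)
            out, _ = self.lstm(h)                        # (batch, frames, 256)
            return out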

Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech

Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff
2020 Interspeech 2020  
Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical  ...  In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data.  ...  The acoustic encoder used for learning task-specific embeddings consists of a convolutional layer of kernel size 5 and an LSTM hidden layer of size 256.  ... 
doi:10.21437/interspeech.2020-3074 dblp:conf/interspeech/SunkaraRBBK20 fatcat:mcnbaynjqvexnldmqolu22egaa

Prediction of DNA binding proteins using local features and long-term dependencies with primary sequences based on deep learning

Guobin Li, Xiuquan Du, Xinlu Li, Le Zou, Guanhong Zhang, Zhize Wu
2021 PeerJ  
In this paper, we propose a method, called PDBP-Fusion, to identify DBPs based on the fusion of local features and long-term dependencies only from primary sequences.  ...  Many traditional machine learning (ML) methods and deep learning (DL) methods have been proposed to predict DBPs.  ... 
doi:10.7717/peerj.11262 pmid:33986992 pmcid:PMC8101451 fatcat:yvdxmszaebemllvawgwo6dqis4
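
Fusing local features with long-term dependencies from the primary sequence alone suggests a CNN-then-BiLSTM pipeline over one-hot-encoded protein sequences. The sketch below assumes that reading; all hyperparameters are illustrative, not the paper's.

    import torch
    import torch.nn as nn

    class DBPClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv1d(20, 64, kernel_size=7, padding=3)  # 20 amino-acid channels
            self.lstm = nn.LSTM(64, 32, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(64, 1)

        def forward(self, x):                            # x: (batch, seq_len, 20) one-hot
            h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local features
            h, _ = self.lstm(h)                          # long-term dependencies
            return torch.sigmoid(self.fc(h[:, -1]))      # P(DNA-binding protein)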

Double Path Networks for Sequence to Sequence Learning [article]

Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, Tie-Yan Liu
2018 arXiv   pre-print
Encoder-decoder based Sequence to Sequence learning (S2S) has made remarkable progress in recent years. Different network architectures have been used in the encoder/decoder.  ...  In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion.  ...  In particular, we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which contain a convolutional path and a self-attention path with attention information fusion between the encoder  ... 
arXiv:1806.04856v2 fatcat:r7h5h2d5g5exxl3ao5lx6c6abm
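
The double-path design pairs a convolutional path with a self-attention path and fuses their information. Below is a minimal single-layer sketch; the paper fuses the paths with attention between encoder and decoder, while this simplified version just concatenates and projects, so treat it as a structural illustration only.

    import torch
    import torch.nn as nn

    class DoublePathLayer(nn.Module):
        def __init__(self, d=256):
            super().__init__()
            self.conv_path = nn.Conv1d(d, d, kernel_size=3, padding=1)
            self.attn_path = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
            self.fuse = nn.Linear(2 * d, d)

        def forward(self, x):                            # x: (batch, seq, d)
            c = self.conv_path(x.transpose(1, 2)).transpose(1, 2)  # convolutional path
            a = self.attn_path(x)                                  # self-attention path
            return self.fuse(torch.cat([c, a], dim=-1))            # information fusion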

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [article]

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang (+3 others)
2022 arXiv   pre-print
Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment.  ...  This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation.  ...  Finally, the output cross-modal representations are fed into a transformer decoder for sequence to sequence learning, which equips mPLUG with both understanding and generation capabilities.  ... 
arXiv:2205.12005v2 fatcat:cck3km3syjdytc5so2gzglucni
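
One reading of "cross-modal skip-connections" for efficiency is that the long visual sequence joins the text stream only every few layers and skips the rest. The sketch below is an assumption about the mechanism, not mPLUG's actual block layout.

    import torch.nn as nn

    class SkipConnectedFusion(nn.Module):
        def __init__(self, d=768, num_layers=6, stride=3):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d, 12, batch_first=True)
                for _ in range(num_layers))
            self.cross = nn.MultiheadAttention(d, 12, batch_first=True)
            self.stride = stride

        def forward(self, text, vision):
            for i, layer in enumerate(self.layers):
                text = layer(text)
                if (i + 1) % self.stride == 0:           # fuse only periodically
                    fused, _ = self.cross(text, vision, vision)
                    text = text + fused                  # residual skip around fusion
            return text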

Facial Expression Classification using Fusion of Deep Neural Network in Video for the 3rd ABAW3 Competition [article]

Kim Ngan Phan and Hong-Hai Nguyen and Van-Thong Huynh and Soo-Hyung Kim
2022 arXiv   pre-print
In this paper, we employ a transformer mechanism to encode the robust representation from the backbone.  ...  Fusion of the robust representations plays an important role in the expression classification task.  ...  In this paper, we employ a transformer encoder with multi-head attention as the embedding layer to generate sequence representations.  ... 
arXiv:2203.12899v3 fatcat:nbvch5bxzjarvete7d32g5u3cm
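
Here a transformer encoder sits on top of per-frame backbone features to generate sequence representations for expression classification. A minimal sketch; the feature size, depth, pooling, and the 8 expression classes are assumptions.

    import torch.nn as nn

    class ExpressionHead(nn.Module):
        def __init__(self, feat_dim=512, num_classes=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.fc = nn.Linear(feat_dim, num_classes)

        def forward(self, feats):                        # feats: (batch, frames, feat_dim)
            h = self.encoder(feats)                      # sequence representations
            return self.fc(h.mean(dim=1))                # clip-level class logits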

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization [article]

Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung
2021 arXiv   pre-print
Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.  ...  In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their  ... 
arXiv:2109.02401v4 fatcat:2565xxufuvcbtipwxqk7qq6yyy
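
An attention-based add-on layer for a generative pre-trained LM can be read as cross-attention from text states to projected visual features, added residually so the pre-trained text path is preserved. The dimensions below are assumptions (BERT-size text, ResNet-size vision), and the exact fusion form is the paper's to specify.

    import torch.nn as nn

    class VisualGuidanceLayer(nn.Module):
        def __init__(self, d_text=768, d_vis=2048, heads=8):
            super().__init__()
            self.proj = nn.Linear(d_vis, d_text)         # map vision into text space
            self.cross = nn.MultiheadAttention(d_text, heads, batch_first=True)

        def forward(self, text_h, vis_feats):
            v = self.proj(vis_feats)
            attended, _ = self.cross(text_h, v, v)       # text queries, visual keys/values
            return text_h + attended                     # residual keeps text-only behaviour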

TransCouplet:Transformer based Chinese Couplet Generation [article]

Kuan-Yu Chiang, Shihao Lin, Joe Chen, Qian Yin, Qizhen Jin
2021 arXiv   pre-print
This paper presents a transformer-based sequence-to-sequence couplet generation model. With the utilization of AnchiBERT, the model is able to capture ancient Chinese language understanding.  ...  Moreover, we evaluate Glyph, PinYin and Part-of-Speech tagging against the couplet grammatical rules to further improve the model.  ...  With 6 layers in the encoder, 6 layers in the decoder, and 12 attention heads, the total number of parameters for the Fusion Decoder Model and Fusion Transformer Model are 115M and 60M respectively.  ... 
arXiv:2112.01707v1 fatcat:zaatltnh35ddxetlcmidolweha
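
The reported configuration (6 encoder layers, 6 decoder layers, 12 attention heads) can be instantiated directly in PyTorch; d_model=768 is an assumption based on AnchiBERT being BERT-base-sized.

    import torch.nn as nn

    # Hypothetical instantiation matching the reported configuration.
    seq2seq = nn.Transformer(
        d_model=768, nhead=12,
        num_encoder_layers=6, num_decoder_layers=6,
        batch_first=True,
    )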

Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network

Qinhu Zhang, Zhen Shen, De-Shuang Huang
2019 Scientific Reports  
sequences that a bound DNA sequence may have multiple TFBS(s), and (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides.  ...  In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo  ...  Therefore we offer a better and more robust fusion method, Noisy-and, to replace them. (3) WSCNN, like other deep-learning-based methods, used one-hot encoding to transform DNA sequences into image-like  ... 
doi:10.1038/s41598-019-44966-x pmid:31186519 pmcid:PMC6559991 fatcat:aru5ewgl5jf7rlwzbpqf3ttbsm
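
The Noisy-and fusion the authors adopt is a known multiple-instance pooling function that turns per-instance probabilities into a bag-level probability. A sketch following the standard formulation, with the slope a fixed and the soft threshold b learned (clamping of b to [0, 1] omitted for brevity):

    import torch
    import torch.nn as nn

    class NoisyAnd(nn.Module):
        def __init__(self, a=10.0):
            super().__init__()
            self.a = a
            self.b = nn.Parameter(torch.tensor(0.5))     # learned soft threshold

        def forward(self, p):                            # p: (batch, instances) probabilities
            m = p.mean(dim=1)
            num = torch.sigmoid(self.a * (m - self.b)) - torch.sigmoid(-self.a * self.b)
            den = torch.sigmoid(self.a * (1 - self.b)) - torch.sigmoid(-self.a * self.b)
            return num / den                             # bag-level probability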

RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer

Xu Zhang, DeZhi Han, Chin-Chen Chang, Chin-Ling Chen
2021 Mobile Information Systems  
The RDMMFET model consists of three parts: dense language encoder, image encoder, and multimodality fusion encoder.  ...  Therefore, this paper proposes a new model, Representation of Dense Multimodality Fusion Encoder Based on Transformer, for short, RDMMFET, which can learn the related knowledge between vision and language  ... 
doi:10.1155/2021/2662064 fatcat:5qxdumw5hzhutb4txpzuy4rqlu
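
The three-part layout (dense language encoder, image encoder, multimodality fusion encoder) composes naturally; the sketch below shows only that composition, with all sizes, depths, and the concatenation-based fusion assumed.

    import torch
    import torch.nn as nn

    def encoder(d=768, depth=2):
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=depth)

    class ThreePartFusionModel(nn.Module):
        def __init__(self, d=768):
            super().__init__()
            self.lang_enc = encoder(d)                   # dense language encoder
            self.img_enc = encoder(d)                    # image encoder
            self.fusion_enc = encoder(d)                 # multimodality fusion encoder

        def forward(self, text_emb, img_emb):            # both (batch, seq, d)
            t = self.lang_enc(text_emb)
            v = self.img_enc(img_emb)
            return self.fusion_enc(torch.cat([t, v], dim=1))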

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [article]

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang
2021 arXiv   pre-print
In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn  ...  We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture for enhancing visual learning.  ...  token [CLS] in the last encoder layer h^L_CLS.  ... 
arXiv:2106.01804v2 fatcat:echgyssdmrh7hmz2d4pwlslslq
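
The h^L_CLS mentioned in the snippet is simply the [CLS] position of the last encoder layer; a classification head, here a hypothetical image-text matching head, reads it off.

    import torch
    import torch.nn as nn

    d_model = 768                                        # assumed hidden size
    enc_out = torch.randn(4, 50, d_model)                # last-layer states (batch, seq, d)
    h_L_cls = enc_out[:, 0]                              # position 0 holds the [CLS] token
    itm_head = nn.Linear(d_model, 2)                     # hypothetical matching head
    logits = itm_head(h_L_cls)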

Learning Multimodal Attention LSTM Networks for Video Captioning

Jun Xu, Ting Yao, Yongdong Zhang, Tao Mei
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states.  ...  Most existing methods, either based on language templates or sequence learning, have treated video as a flat data sequence while ignoring its intrinsic multimodal nature.  ...  In the encoding stage, multiple encoding LSTMs are used to model the temporal sequence for different modalities.  ... 
doi:10.1145/3123266.3123448 dblp:conf/mm/XuYZM17 fatcat:walf26k3tve45idna47khhtsq4
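
A child-sum fusion unit, by analogy with child-sum TreeLSTMs, gates each modality's encoded state and sums them into the decoder's initial state. The gating form below is an assumption, not the paper's exact unit.

    import torch
    import torch.nn as nn

    class ChildSumFusion(nn.Module):
        def __init__(self, d=512, num_modalities=2):
            super().__init__()
            self.gates = nn.ModuleList(nn.Linear(d, d) for _ in range(num_modalities))

        def forward(self, states):                       # list of (batch, d), one per modality
            # gate each modality, then sum (the "child-sum") for the initial decoder state
            return sum(torch.sigmoid(g(h)) * h for g, h in zip(self.gates, states))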

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training [article]

Ming Yan, Haiyang Xu, Chenliang Li, Bin Bi, Junfeng Tian, Min Gui, Wei Wang
2021 arXiv   pre-print
model is used for cross-modal fusion.  ...  By pre-training only with in-domain datasets, the proposed Grid-VLP method can outperform most competitive region-based VLP methods on three examined vision-language understanding tasks.  ...  We hope our findings can help further advance the progress of vision-language pre-training and potentially provide new perspectives on vision-language pre-training.  ... 
arXiv:2108.09479v1 fatcat:d5hl4372nbdxndt2npcfzz6qrm
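
Grid features, as opposed to region features from an object detector, are just the flattened CNN feature map fed to the cross-modal model as a token sequence. A minimal extraction sketch with a standard ResNet-50; the backbone choice and input size are assumptions.

    import torch
    import torch.nn as nn
    import torchvision

    backbone = torchvision.models.resnet50(weights=None)
    trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

    img = torch.randn(1, 3, 224, 224)
    fmap = trunk(img)                                    # (1, 2048, 7, 7) feature grid
    tokens = fmap.flatten(2).transpose(1, 2)             # (1, 49, 2048) grid tokens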
Showing results 1 — 15 out of 22,375 results