Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[article]
2020
arXiv
pre-print
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embeddings in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision-and-language approaches do. Our Pixel-BERT, which aligns semantic connections at the pixel and text level, solves the limitation of
arXiv:2004.00849v2
fatcat:5ccgm6lrmfdn7kjkbvfp7tiq2m