Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information
[article] · 2021 · arXiv pre-print
Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and this implies discarding the many …
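The vocabulary-mismatch problem described in the abstract can be made concrete with a small sketch. The example below is not from the paper; it assumes the Hugging Face transformers library and the public "bert-base-uncased" checkpoint, and uses a hypothetical novel word to contrast a fixed subword segmentation with the raw character/byte view that "token-free" models operate on.

```python
# A minimal sketch (not from the paper) of the vocabulary-mismatch problem:
# a subword vocabulary fixed at pre-training time fragments novel or
# out-of-domain words, while a character/byte view needs no fixed vocabulary.
# Assumes the Hugging Face `transformers` library and the public
# "bert-base-uncased" checkpoint; neither is prescribed by the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

novel_word = "doomscrolling"  # hypothetical post-pretraining coinage

# Fixed subword vocabulary: the unseen word is split into WordPiece fragments.
subwords = tokenizer.tokenize(novel_word)
print(subwords)  # e.g. something like ['doom', '##sc', '##roll', '##ing']

# "Token-free" view: the same word as raw characters or UTF-8 bytes,
# with no vocabulary for it to fall outside of.
print(list(novel_word))
print(list(novel_word.encode("utf-8")))
```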
arXiv:2108.00391v1