MotifBoost: k-mer based data-efficient immune repertoire classification method
The repertoire of T cell receptors encodes various types of immunological information. Machine learning is indispensable for decoding such information from repertoire datasets measured by next-generation sequencing. In particular, the classification of repertoires is the most basic task, which is relevant for a variety of scientific and clinical problems. Supported by the recent appearance of large datasets, efficient but data-expensive methods have been proposed. However, it is unclear whether
... they can work efficiently when the available sample size is severely restricted as in practical situations. In this study, we demonstrate that the their performances are impaired catastrophically below critical sample sizes. To overcome this, we propose MotifBoost, which exploits the information of short motifs of TCRs. MotifBoost can perform the classification as efficiently as a deep learning method on large datasets while providing more stable and reliable results on small datasets. We also clarify that the robustness of MotifBoost can be attributed to the efficiency of motifs as representation features of repertoires. Finally, by comparing predictions of these methods, we show that the whole sequence identity and sequence motifs encode partially different information and that a combination of such complementary information is necessary for further development of repertoire analysis.