L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models [article]

Ravindra Nayak, Raviraj Joshi
2022 arXiv pre-print
We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed ... We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed data in Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. ... Acknowledgements: This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement. ...
arXiv:2204.08398v1 fatcat:b3ltly4s6bbprofijwhhhbx3xi