***INVITED TALK***: Handling and Mining Linguistic Variation in UGC Distributed Representations of Words and Documents for Discriminating Similar Languages

Preslav Nakov, Marcos Zampieri, Petya Osenova, Liling Tan, Cristina Vertan, Nikola Ljubešić, Jörg Tiedemann, Željko Agić, Laura Alonso Alemany, Jorge Baptista (+52 others)
2015 unpublished
A large number of closely related language varieties and dialects are in daily use, not only as spoken colloquial languages but also in some written media, e.g., in SMS, chats, and social networks. Language resources for these varieties and dialects are sparse, and building them can be very labor-intensive. Yet, these efforts can often be reduced by making use of pre-existing resources and tools for related, resource-richer languages.

Examples of closely related language varieties include the different variants of Spanish in Latin America, the Arabic dialects in North Africa and the Middle East, German in Germany, Austria and Switzerland, French in France and in Belgium, etc. Examples of pairs of related languages include Swedish-Norwegian, Bulgarian-Macedonian, Serbian-Bosnian, Spanish-Catalan, Russian-Ukrainian, Irish-Scottish Gaelic, Malay-Indonesian, Turkish-Azerbaijani, Mandarin-Cantonese, Hindi-Urdu, etc.

Recent interest in language resources and technology for closely related languages, varieties and dialects has led to previous editions of the LT4CloseLang workshop at RANLP2013 and EMNLP2014, and of the VarDial workshop at COLING2014. Both the LT4CloseLang and the VarDial workshops have attracted a lot of research interest, which indicated a need for further activities. Thus, this year we decided to join forces and organize a joint workshop, LT4VarDial, aiming to bring together researchers interested in building language resources for language varieties or dialects and in creating language technology that makes use of language closeness and exploits existing resources in a related language or a language variant.

As part of the workshop, we organized the second edition of the DSL Shared Task on Discriminating between Similar Languages. The first edition was held in conjunction with VarDial, aiming to distinguish between closely related languages and language varieties, thus filling the research gap in fine-grained language identification, which was previously perceived as a solved task. Yet, DSL remains a challenge for state-of-the-art language identification. The attention received from the research community and the feedback provided by the participants of the first edition motivated us to organize this Second DSL Shared Task, where we made two important changes compared to the first edition.
First, in order to simulate a real-world language identification scenario, we included in the testing dataset some languages that were not present in the training dataset. Second, we included an additional test set, where we substituted the named entities with placeholders to make the task more challenging and less dependent on the text topic and domain.

A total of 24 teams registered to participate in the shared task, 10 of them submitted official runs, and 8 of the latter also wrote system description papers. These numbers represent a slight increase in participation compared to the 2014 edition, which attracted 22 teams, 8 submissions, and 5 system description papers. Overall, 12 papers are published in this volume: nine about the DSL shared task (8 system descriptions and the shared task overview) and three regular workshop papers.

Given the above numbers, we consider the workshop a success, and we take the opportunity to thank the LT4VarDial program committee for their professional and thorough reviews, and the DSL Shared Task participants for their valuable feedback and discussions. We further thank our invited speakers and our panelists for sharing with us their thought-provoking opinions on topics of interest to the workshop.

The workshop organizers:
fatcat:am5wftv4o5gmtbfzcw3iubpp2i