Aggregating Dictionaries into the Language Portal Sõnaveeb: Issues With and Without Solutions

Kristina Koppel, Arvi Tavast, Margit Langemets, Jelena Kallas
2019 Zenodo  
In this paper we present Sõnaveeb (Wordweb), a new type of language portal of the Institute of the Estonian Language containing data from a growing number of dictionaries and termbases. Sõnaveeb currently displays a total of 150,000 Estonian headwords, obtained from many databases, with many new types of lexicographic information: collocations, etymology, multi-word expressions, etc. The paper reports on problems encountered so far: the consistency of information and avoiding duplicates when unifying the dictionaries, turning dictionary-specific information into customisations of the central service, deciding on deliberate ambiguities, parsing data fields containing more than one data element, including textual condensation, moving from annotating form (e.g. italics) to annotating content (e.g. a citation), moving from (near-)duplicates to sensible information fragments, deciding on the advantage of an app over a responsive web page, and possible legal problems regarding the authorship of the new central resource, as it may become difficult to show who authored which part of the published resource. The development of Sõnaveeb continues in the direction of both the tighter aggregation of existing datasets and the addition of new data from other dictionaries and termbases, as well as compiling new data in the new DWS Ekilex.
doi:10.5281/zenodo.3612931 fatcat:corrpuefurfmlbjni35t32nidm