Meet the Data [chapter]

Krisztian Balog
2018 Advanced Topics in Information Retrieval  
This chapter introduces the basic types of data sources, as well as the specific datasets and resources, that we will be working with in later chapters of the book. These may be placed on a spectrum of varying degrees of structure, from unstructured to structured data, as shown in Fig. 2.1.

Fig. 2.1 The data spectrum

On the unstructured end of the spectrum we have plain text. Typically, these are documents written in natural language [1]. As a matter of fact, almost any type of data can be converted into plain text, including web pages, emails, spreadsheets, and database records. Of course, such a conversion results in an undesired loss of internal document structure and semantics. It is nevertheless always an option to treat data as unstructured, by not making any assumptions about the particular data format. Search in unstructured text is often referred to as full-text search.

On the opposite end of the spectrum there is structured data, which is typically stored in relational databases; it is highly organized, tabular, and governed by a strict schema. Search in this type of data is performed using formal query languages, like SQL. These languages allow for a very precise formulation of information needs, but require expert knowledge of the query language and of the underlying database schema. This generally renders them unsuitable for ordinary users.

The data we will mostly be dealing with lies at neither of the two extremes and falls somewhere "in the middle." Therefore, it is termed semi-structured. It is ...

[1] Written in natural language does not imply that the text has to be grammatical (or even sensible).
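To make the contrast between the two ends of the spectrum concrete, the following is a minimal sketch rather than anything taken from the chapter. It assumes a toy `person` table with hypothetical columns (`name`, `city`, `born`) and made-up records. The same rows are searched once with SQL, which requires knowing the schema, and once with a naive keyword match over the rows flattened to plain text, which assumes no schema but loses the column semantics.

```python
import sqlite3

# Hypothetical toy records; the table and column names are illustrative only.
ROWS = [
    ("Ada Lovelace", "London", 1815),
    ("Alan Turing", "London", 1912),
    ("Grace Hopper", "New York", 1906),
]


def structured_search(rows, city, born_after):
    """Query a relational table with SQL: precise, but needs schema knowledge."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, city TEXT, born INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT name FROM person WHERE city = ? AND born > ? ORDER BY name",
        (city, born_after),
    )
    names = [r[0] for r in cur.fetchall()]
    conn.close()
    return names


def to_plain_text(row):
    """Flatten a record into plain text, discarding which value came from which column."""
    return " ".join(str(v) for v in row)


def full_text_search(rows, query):
    """Naive keyword match over the flattened text; no schema is assumed."""
    terms = query.lower().split()
    hits = []
    for row in rows:
        text = to_plain_text(row).lower()
        # A row matches only if every query term occurs somewhere in its text.
        if all(t in text for t in terms):
            hits.append(row[0])
    return hits


if __name__ == "__main__":
    print(structured_search(ROWS, "London", 1900))
    print(full_text_search(ROWS, "london"))
```

The SQL query can express "lives in London and born after 1900" exactly, whereas the keyword search can only ask whether the word "london" appears anywhere in a record. That gap is the loss of structure and semantics described above.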
doi:10.1007/978-3-319-93935-3_2 fatcat:mijb4wvy2zgxvkkvbafojxp2r4