Extraction, Transformation, and Loading [chapter]

Alkis Simitsis, Panos Vassiliadis
2017 Encyclopedia of Database Systems  
SYNONYMS

ETL; ETL process; ETL tool; Back Stage of a Data Warehouse; Data warehouse refreshment

DEFINITION

Extraction, Transformation, and Loading (ETL) processes are responsible for the operations that take place in the back stage of a data warehouse architecture. At a high level, an ETL process first extracts the data from the source datastores, which can be in a relational and/or a semi-structured format. In typical cases, the source datastores are On-Line Transactional Processing (OLTP) or legacy systems, files in any format, web pages, various kinds of documents (e.g., spreadsheets and text documents), or even data arriving in a streaming fashion. Typically, only the data that have changed since the previous execution of the ETL process (newly inserted, updated, and deleted information) should be extracted from the sources. After this phase, the extracted data are propagated to a special-purpose area of the warehouse, called the Data Staging Area (DSA), where their transformation, homogenization, and cleansing take place. The most frequently used transformations include filters and checks that ensure the data propagated to the warehouse respect business rules and integrity constraints, as well as schema transformations that ensure the data fit the target data warehouse schema. Finally, the data are loaded into the central data warehouse (DW) and all its counterparts (e.g., data marts and views). In a traditional data warehouse setting, the ETL process periodically refreshes the data warehouse during idle or low-load periods of its operation (e.g., every night) and has a specific time window in which to complete. Nowadays, business demands require near real-time data warehouse refreshment, and significant attention is being drawn to this kind of technological advancement.

HISTORICAL BACKGROUND

Although ETL acquired its name and separate existence during the first decade of the 21st century, ETL processes have been a companion to database technology for far longer -- in fact, from the beginning of its existence. During that period, ETL software was silently hidden as a routine programming task, without any particular name or individual importance. ETL was born on the first day a programmer constructed a program that takes records from a certain persistent file and populates or enriches another file with that information.
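The three phases in the definition above can be sketched end to end in a few lines of Python. This is a minimal, in-memory illustration only: the function and field names (extract_changed, cust_name, amount_usd, and so on) are assumptions made for the example rather than any standard API, and a plain list stands in for the warehouse table.

```python
from datetime import datetime, timezone

def extract_changed(source_rows, last_run):
    """Extract phase: keep only rows modified since the previous ETL run."""
    return [r for r in source_rows if r["updated_at"] > last_run]

def transform(rows):
    """Transform phase (in the DSA): drop rows violating a business rule,
    homogenize values, and reshape records to fit the warehouse schema."""
    out = []
    for r in rows:
        if r["amount"] < 0:  # integrity check: negative amounts are invalid
            continue
        out.append({
            "customer": r["cust_name"].strip().upper(),  # homogenization
            "amount_usd": round(r["amount"], 2),         # schema mapping
        })
    return out

def load(warehouse, rows):
    """Load phase: append the cleansed rows to the central warehouse table."""
    warehouse.extend(rows)
    return warehouse

# Hypothetical source data: one changed valid row, one changed invalid row,
# and one row untouched since the last run.
source = [
    {"cust_name": " alice ", "amount": 10.567,
     "updated_at": datetime(2017, 1, 2, tzinfo=timezone.utc)},
    {"cust_name": "bob", "amount": -5.0,
     "updated_at": datetime(2017, 1, 2, tzinfo=timezone.utc)},
    {"cust_name": "carol", "amount": 7.5,
     "updated_at": datetime(2016, 12, 1, tzinfo=timezone.utc)},
]

last_run = datetime(2017, 1, 1, tzinfo=timezone.utc)
warehouse = load([], transform(extract_changed(source, last_run)))
# Only alice reaches the warehouse: bob fails the integrity check,
# and carol has not changed since the previous run.
```

A production ETL job follows the same shape but reads deltas from change-data-capture logs or timestamps in the source systems, stages intermediate results in the DSA, and bulk-loads the result into the DW within its time window.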
Since then, any kind of data processing software that reshapes or filters records, calculates new values, and populates a data store other than the original one is a form of an ETL program. Apart from this low-profile programming task, ETL tasks have also long been latent in research efforts, although not much attention was paid to them. The earliest form of ETL system that we know of goes back to the EXPRESS system [13], which was intended to act as an engine that produces data transformations given nonprocedural data definition and conversion statements. In later years, during the early days of data integration, the driving force behind data integration was wrapper-mediator schemes; the construction of the wrappers is a primitive form of ETL scripting [12]. In the mid-1990s, data warehousing took center stage in database research, and still, ETL was there, hidden behind the scenes. Popular books [4] do not mention the ETL triplet at all, although its different parts (transformation, cleansing, staging of intermediate data, and loading) are all covered -- even if this is done very
doi:10.1007/978-1-4899-7993-3_158-3