The process of extracting data from operational systems into the warehouse, also called Acquisition.
We can Extract, Transform and Load, or
we can Extract, Load and then Transform.
ETL is extract, transform, load, in which transformation takes place on a transformation server using either an ‘engine’ or generated code.
ELT is extract, load, transform in which data transformations take place in a relational database on the data warehouse server.
An ETL Process typically involves the following steps:
- Extracts from source systems
- Data movement
- Transformations
- Data loading
- Index maintenance
- Statistics collection
- Summary data maintenance
- Data mart construction
- Backups
Stakeholders
- Extracting Techniques
- Transforming Techniques
- Loading Techniques
Major steps in the ETL process
Data Pipelining
The process of dividing data into chunks and concurrently running all dependent jobs as soon as a chunk is ready or collected
Starting the next job as soon as some output of the prerequisite job is ready to be processed
An example could be:
- Let's say we want to join three tables. The third table cannot be joined until the first two tables are joined. By implementing a pipeline here, we can join the first few rows of the result of the first two tables with the third table as soon as they are available, as sketched below.
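A minimal sketch of this idea in Python, assuming the rows are dictionaries and that `orders`, `customers_by_id` and `products_by_id` are illustrative names for the three inputs (none of these names come from the notes):

```python
def join_in_chunks(left_rows, right_index, key, chunk_size=1000):
    """Yield joined rows in chunks as soon as each chunk is complete."""
    chunk = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:
            chunk.append({**row, **match})
            if len(chunk) >= chunk_size:
                yield chunk          # the downstream join can start right away
                chunk = []
    if chunk:
        yield chunk

def load_into_staging(rows):
    print(f"loaded {len(rows)} joined rows")   # placeholder for the real load step

def pipeline(orders, customers_by_id, products_by_id):
    # Join orders with customers chunk by chunk; each finished chunk is
    # immediately joined with products instead of waiting for the whole
    # first join to complete.
    for chunk in join_in_chunks(orders, customers_by_id, key="customer_id"):
        for joined in join_in_chunks(chunk, products_by_id, key="product_id"):
            load_into_staging(joined)
```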
Data Extraction
The process of fetching data from operational systems
Techniques
Source identification: a stepwise approach
Deferred Source Data
Deferred data extraction does not capture changes in real time; the capture happens later.
Options for Deferred Data Extraction
Deferred Extraction Techniques
Date and Time Stamps
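A minimal sketch of timestamp-based deferred extraction, assuming the source rows carry a `last_modified` stamp; the `customer` table, column names and timestamp format are illustrative assumptions:

```python
import sqlite3

# Set up an illustrative source table whose rows carry a last_modified stamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Alice", "2024-01-01 09:00:00"),
    (2, "Bob",   "2024-01-02 14:30:00"),
])

def extract_changed_rows(conn, since):
    # Pull only the rows modified after the previous extract ran.
    cur = conn.execute("SELECT * FROM customer WHERE last_modified > ?", (since,))
    return cur.fetchall()

last_extract = "2024-01-01 23:59:59"              # time of the previous extract
print(extract_changed_rows(conn, last_extract))   # -> only Bob's row
```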
Comparing Files
This technique requires keeping prior copies of all the relevant source data. Though simple and straightforward, comparing full rows in a large file can be very inefficient. However, this may be the only feasible option for some legacy data sources that do not have transaction logs or time stamps on source records.
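A minimal sketch of this comparison, assuming the prior and current extracts are CSV files keyed on their first column (the file layout and paths are assumptions for illustration):

```python
import csv

def detect_changes(prev_path, curr_path, key_col=0):
    """Return rows in the current extract that are new or changed vs. the prior copy."""
    with open(prev_path, newline="") as f:
        prev = {row[key_col]: row for row in csv.reader(f)}
    changed = []
    with open(curr_path, newline="") as f:
        for row in csv.reader(f):
            if prev.get(row[key_col]) != row:   # new row, or any column differs
                changed.append(row)
    return changed
```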
Immediate Data Extraction
Options for Immediate Data Extraction
Immediate Extraction Techniques
Transaction Logs
Database Triggers
Get data from the transactional systems based on the triggers defined. A trigger fires on a specific event; when that event occurs, we feed the changed data into the warehouse (put it into the data staging area).
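A minimal sketch using an SQLite trigger; the `account` source table, the `staging_changes` table and the trigger body are illustrative assumptions, not a prescribed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE staging_changes (id INTEGER, balance REAL,
                              captured_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- Fire on every update and copy the new row into the staging area.
CREATE TRIGGER capture_account_update AFTER UPDATE ON account
BEGIN
    INSERT INTO staging_changes (id, balance) VALUES (NEW.id, NEW.balance);
END;
""")

conn.execute("INSERT INTO account VALUES (1, 100.0)")
conn.execute("UPDATE account SET balance = 250.0 WHERE id = 1")
print(conn.execute("SELECT * FROM staging_changes").fetchall())
```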
Source Application
We can configure our source or transaction systems to send data to the warehouse based on our specifications.
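As a rough sketch of this approach, a hypothetical `save_account` routine in the source system is extended to append every change to a staging file; the newline-delimited JSON format and the path are assumptions for illustration only:

```python
import json
from datetime import datetime, timezone

STAGING_FILE = "staging/account_changes.jsonl"   # illustrative staging location

def save_account(account):
    # ... the source system's normal transactional work would happen here ...
    # Then the application itself pushes the change toward the warehouse.
    record = {"captured_at": datetime.now(timezone.utc).isoformat(), **account}
    with open(STAGING_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
```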
Data Capture Techniques: Advantages and Disadvantages
Data Loading
The process of inserting the data into our warehouse
We must also worry about rolling off old data as its economic value drops below the cost of storing and maintaining it.
There is a trade-off between data freshness and insert efficiency. Buffering rows for insertion allows fewer round trips to the RDBMS, but waiting to accumulate rows in the buffer impacts data freshness.
More efficiency = less freshness
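A minimal sketch of this trade-off, assuming an SQLite connection and an illustrative `sales` table; a larger buffer means fewer round trips but staler data:

```python
import sqlite3

class BufferedLoader:
    """Accumulate rows and insert them in batches instead of one at a time."""

    def __init__(self, conn, buffer_size=500):
        self.conn = conn
        self.buffer_size = buffer_size   # bigger = more efficient, less fresh
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.buffer_size:
            self.flush()

    def flush(self):
        # One round trip (and one commit) for the whole buffered batch.
        if self.rows:
            self.conn.executemany(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", self.rows)
            self.conn.commit()
            self.rows = []
```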
Techniques
The choice of loading strategy depends on trade-offs between data freshness and performance, as well as data volatility characteristics.
What is the goal?
- Increased data loading performance
- Increased data freshness
But we can only optimize for one at a time.
Full Data Refresh
Completely reload the table on each refresh or when loading into the DWH
Initially, when the DWH is empty, this technique is used to populate it; however, it can also be used when the cost of insertion is far less than the cost of updating existing records.
Updating a row is always more costly than inserting a new row.
Steps:
- Load the table using block slamming
  - Block slamming / bulk insert is the process of loading a large volume of records at once
- Build indexes (since the previous indexes are wiped out by the reload)
- Collect statistics (maintain metadata)
When to Use It
This technique is used when a high percentage of rows (typically 10% or more) change on each refresh.
Examples include reference lookup tables or account tables where balances change on each refresh.
Performance Hints
- Make sure that referential integrity is disabled during the insertion process
- Make sure that table logging is disabled too
- Make sure to build indexes only after the table has been loaded
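A minimal sketch of a full refresh, assuming SQLite and an illustrative `country_lookup` table; `executemany` stands in for block slamming, and the index and statistics are rebuilt only after the load:

```python
import sqlite3

def full_refresh(conn, rows):
    conn.execute("DROP INDEX IF EXISTS idx_country_code")     # old index will be stale anyway
    conn.execute("DELETE FROM country_lookup")                # empty the table (SQLite has no TRUNCATE)
    conn.executemany(
        "INSERT INTO country_lookup (code, name) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_country_code ON country_lookup (code)")
    conn.execute("ANALYZE country_lookup")                    # refresh optimizer statistics
    conn.commit()
```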
Incremental Data Refresh
Incrementally load new data into existing target table that has already been populated from previous loads
Either load directly into the table or use a shadow table strategy
Indexes are preserved in both strategies; however, there is a significant cost to maintaining them.
Directly Loading into the Table
- Re-collect statistics if demographics have changed significantly
- Typically requires a table lock to be taken during block slamming operations
Shadow Table Implementation
- Use block slamming into an empty shadow table having an identical structure to the target table
- An insert-select operation from the shadow table into the target table will preserve indexes (see the sketch below)
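A minimal sketch of the shadow-table strategy, assuming SQLite and an illustrative `sales` target table with `id` and `amount` columns:

```python
import sqlite3

def incremental_load(conn, new_rows):
    # Empty shadow table with the same column layout as the target.
    conn.execute("CREATE TEMP TABLE sales_shadow AS SELECT * FROM sales WHERE 0")
    # Bulk load into the shadow table (no target indexes to maintain yet).
    conn.executemany(
        "INSERT INTO sales_shadow (id, amount) VALUES (?, ?)", new_rows)
    # The insert-select into the target lets the engine maintain its indexes.
    conn.execute("INSERT INTO sales SELECT * FROM sales_shadow")
    conn.execute("DROP TABLE sales_shadow")
    conn.commit()
```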
Cost of Holding Indices
The cost of holding indices is about 2-3 times the resources of the actual row insertion of data into a table
Consider dropping and re-building index structures if the number of rows being incrementally loaded is more than 10% of the size of the target table
Trickle Feed with Continuous Updates
Data is made available to the DWH immediately rather than waiting for batch loading to complete.
Much higher overhead for data acquisition on a per-record basis as compared to batch strategies.
- We load one row at a time instead of working in batches
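A minimal sketch of a trickle feed, assuming an illustrative `fact_sales` table; each record is inserted and committed on its own, so it becomes visible immediately but pays the full per-row overhead:

```python
import sqlite3

def on_new_transaction(conn, row):
    # Called for every single incoming record rather than per batch.
    conn.execute("INSERT INTO fact_sales (id, amount) VALUES (?, ?)", row)
    conn.commit()   # one commit per row: maximum freshness, minimum efficiency
```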