ETL+ Upsert Concepts & Logic
Upsert refers to the process of inserting new data or updating existing data in a target (data warehouse) table based on matching criteria. The term "Upsert" is a combination of "Update" and "Insert". The limitation of Upsert is that it does not Delete old records.
Purpose and Goal of Upsert
With Upsert, you can ensure that your target database contains the newest data, without having to reload the entire data set every time. Contrast Load Upsert with Load All.
Upsert is a database operation that updates an existing row if a specified value already exists in a table, and inserts a new row if the specified value does not already exist.
Upsert updates or inserts rows – Upsert does not delete/remove records.
The Upsert operation updates only the new or modified data instead of updating the entire dataset.
Upsert loads are commonly used in ETL operations where large volumes of data are involved. Compared to a Load All, a Upsert can save time, reduce network bandwidth, and improve processing efficiency.
Preconditions: The columns in the source table that are designed as Key+ columns should be unique.
Upsert will run without unique keys but the results can be unexpected.
For more see https://dataself.atlassian.net/wiki/spaces/DS/pages/2055176193
Upserts and Delta Updates.
A Delta update or delta load is a data processing technique that updates only the changes or incremental updates that have occurred since the last data update. Upsert is a limited function type of Delta update that does not remove old records. Upsert requires a source table with a field with a chronological order such as a date, datetime stamp, or a Sequentially Incremented Values .
Unique Logical Keys for Upsert Processing / Design Page's Key+
Load Upsert processing uses a ‘logical unique key’. This unique key must be designated by the Key+ check boxes on the target's Design page. See ETL+ Load Upsert Example 2 for an example.
It’s called a “logical unique key” because the uniqueness constraint is enforced by the Load Upsert process. The target table created in the data warehouse doesn’t have a unique key constraint (or primary key) defined on the data warehouse table in SQL Server.
What if there isn’t a Unique key in the Source Table?
For an example of Upsert processing when the Key+ column(s) are not unique in the source table see https://dataself.atlassian.net/wiki/spaces/DS/pages/2055176193
Logic of Operations
Upsert is a database operation that updates an existing row if the value of a designated key column already exists in the target table otherwise it inserts a new row.
The Upsert used in ETL+ assumes that the data in the data source has a mix of new and old records. Some are new or updated records and others are records that already exist in the target data warehouse table.
ETL+'s Upsert begins identifying the changes that have occurred since the last update. The changes are tracked and identified using a timestamp or a sequence number in the Column with Last Modified Value which identifies the records that have been modified since the last update.
Outline of the Steps in the Load Upsert Process
A rough/approximate outline of the key operations in the Load Upsert process.
Find the "highwater” value in the target table’s Last Modified Value.
SQL like expression: highwater value = SELECT MAX(<Last-Modified-Value>) FROM <target-table>
Select from source filtering by the Last Modified Value >= highwater value into a temporary staging table.
For every row in the staging table.
SQL like expression:
DELETE
target-tableWHERE
staging-table.<Key+ column> = target-table.<Key+ column>
Insert every row in the staging-table into the target-table.
The Upsert does the logical equivalent of a SQL UPDATE – a DELETE followed by a INSERT. The DELETE and INSERT steps explains the results shown at https://dataself.atlassian.net/wiki/spaces/DS/pages/2055176193.
Related Pages
ETL+ Load Upsert Example 2 The more typical case when the source table has a unique key such as a primary key.
https://dataself.atlassian.net/wiki/spaces/DS/pages/2055176193 When the source table does not have a unique key the results can be unexpected. This situation can occur with source data from Excel or Text/CSV data source drivers.
v2023.10 ✅