Stream & Merge for Incremental Loading in Snowflake

Is manual data loading slowing down operations in your organization? Discover Snowflake's approach to data warehousing. By leveraging Stream and Merge objects, it enables effortless incremental loading, reducing time and resources compared to full loads. Snowflake's innovative features are redefining efficiency in data management and analytics workflows.
Snowflake Cover image | Factspan

Why this blog?

The blog introduces Snowflake’s innovative approach to data warehousing, emphasizing Stream and Merge objects for easy incremental loading. Understanding these concepts can boost data management efficiency and cut down on resource usage compared to full loads. It also showcases SnowPipe’s transformative potential for optimizing analytics workflows, making it a must-read for anyone aiming to streamline their data engineering tasks.

Snowflake is a cloud-hosted relational database used to create data warehouse on demand. Data in the data warehouse can be loaded as full load or incremental load. The full load is a process of deleting whole existing data and reloading it again. Full loads are time and resource-consuming tasks compared to incremental loads that only load a small amount of new or updated data instead of loading full data every time.

We can achieve incremental loading in Snowflake by implementing change data capture (CDC) using Stream and Merge objects. Stream object is used for change data capture which includes inserts, updates, and deletes, as well as metadata about each change so that actions can be taken using the changed data. The data captured using Stream is then merged to the target table using match and not match condition.

What are Stream and Merge?

Merge:

Merge is a command used to perform some alterations on the table, to update the existing records, delete the old/inactive records, or add new rows from another table.

Snowflake offers two clauses to perform Merge:

  1. Matched Clause: Matched Clause performs Update and Delete operation on the target table when the rows satisfy the condition.
  2. Not Matched Clause: Not Matched Clause performs the Insert operation when the row satisfying conditions are not matched. The rows from the source table that are not matched with the target table will be inserted.
Stream:

Stream is a table created on the top of the source to capture change data; it tracks the changes made to source table rows.

The created stream object just holds the offset from where change data capture can be tracked, however, the main data in source remains unaltered.

Three additional columns are added to the source table in a stream:

Column
Description
METADATA$ACTION
It may have only two values Insert/Delete
METADATA$ISUPDATE
This will be flagged as True when the record is updated
METADATA$ROW_ID
There are unique hash keys that will be tracked against each change

As we now know what Stream and Merge is, let’s see how to use Stream and Merge to load the data:

Step 1:

Connect to the Snowflake DB and Create sample source and target tables:

Step 2:

Create stream on source table using below query:

A screenshot of a computer

Description automatically generated
 Step 3:

Let’s insert some dummy data into the source table:

After inserting data into the source let’s check data captured in the stream:

A screenshot of a computer

Description automatically generated

As we inserted data the first time in the source the newly inserted rows will be flagged as INSERT in the METADATA$ACTION column and METADATA$UPDATE as FALSE in the stream.

Step 4:

Insert data into the target using Stream and Merge using the following query:

A screenshot of a computer

Description automatically generated

As we are inserting data the first time there will not be any matching personal_id in the target table and as the METADATA$ACTION flag is INSERT, the merge command will insert the whole data into the target table as it is.

Step 5:

Let’s update a few source rows and load them again to target:

As soon as we update the source table, the stream will capture these changes and update the stream data.

The updated row will be labelled as “INSERT,” while the older row that was updated will be marked as “DELETE” in the METADATA$ACTION column. Consequently, when we load updated data from the source to the target, the older row with “City Nagpur” will be deleted, and the updated row with “City Mumbai” will be inserted.

Again, run the same Stream and Merge command we used earlier to load only updated data in target, updated target data will look like this:

To automate this load process, we can create a task, this task will run after a specified time interval and load data into the target if there are any source changes.

Most Popular
A Detailed Guide to Create a Snowflake Python Worksheet

How Can We Create a Snowflake Python Wor...

Streamlining ETL Pipeline with Snowflake...

AI based Medical Imaging Tools 2024 | Factspan

Top AI Medical Imaging tools for Healthc...

SnowPipe: Cloud Data Ingestion Tool Powe...

Let’s Connect

    Work Email*

    Phone Number (Optional)

    1000/1000

    Scroll to Top