Storing data and files in snowflake

After sharing briefs over Snowflake concepts and architecture, let’s move to the final part how to load and store data in Snowflake. We all are familiar with the fact that storing data is a critical part of any Snowflake database. The most highlighted two key points in any database are storing data within the environment and accessing data that is stored in other cloud storage environments. Snowflake is capable of both them.

Considering Data Storage

Snowflake provides spectacle features for the stored data. Continuous Data Protection (CDP) is given to all Snowflake accounts which includes Fail-Safe and Time Travel with no additional charger. But that doesn’t count on CDP that it will not have an impact on your storage costs. Storage costs will be incurred until the data leaves the Fail-safe state. Hence, considering the right storage is important while staging.

Staged File Storage in Process to Load Data

The location the data is stored in is known as a stage whether it is stored internally or externally and the process of uploading files into one of these stages is known as staging. To assist loading bulk data into tables, Snowflake has stages where files that have the data to be loaded are staged. Snowflake allows both internal (location objects within a Snowflake database/schema) and external (storage locations outside the Snowflake environment in another cloud storage and this could be S3, Azure) stages. For bulk data, you could utilize this feature.

Cloning Tables, Schemas, and Databases

Snowflake has a feature, called zero-copy cloning that gives an easy way to take a “snapshot” of any schema, table or database. It creates a derived copy of the object (which was shared initially the underlying storage). This can come handy when creating instant backups.

Data Loading Considerations

Preparing Your Data Files

Before we begin to process data loading and storing, data size, text and many multiple features should be checked.

General File Sizing Recommendations

Creating compressed data files that are roughly 10 MB to 100 MB in size should be recommended in order to optimize the number of parallel loads into Snowflake. Combine the smaller files to reduce processing overhead. now, split the large files into a number of smaller files for faster load. This allows distributing the load between servers in the active Snowflake warehouse.

Data Size Limitations for Semi-Structured Data

The variant data type has a size limit of 16 MB in the compressed form on the individual rows. JSON and Avro are the most commonly used data formats and both are a concatenation of many documents. The source software that provides an output to both will provide it in the form of a single huge array having multiple records. For document separation, line breaks and commas are being supported.

While executing the COPY INTO <table> command, it is recommended to enable the STRIP_OUTER_ARRAY file format option. This will load the records into separate table rows by removing the outer array structure and also enhance efficiency. Let’s consider an example,

COPY INTO <table_name>
FROM @~/<file_name>.json
file_format = (type = \'JSON\' strip_outer_array = true);

Data Size Limitations of Parquet files

For smooth loading, it is highly recommended to split parquet files that are greater than 3GB into smaller files of 1GB or lesser in size. This ensures that the loading does not timeout.

Databricks vs. Snowflake (2024)

Preparing Delimited Text Files

While preparing CSV/Delimited text files for loading, the given points must be considered

Files must have data in ASCII format only. The default character set is UTF-8. However, by using the ENCODING file format option, additional encodings can be mentioned.
Records and fields should be delimited by different characters within the files.

Note- Both should be a single like, pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. Often the line feed (\\n) has been used as a row delimiter.

Fields that have delimiter must be enclosed in single or double-quotes. If the data being loaded must be escaped if contains quotes.
Fields should be enclosed in single or double quotes(only those who have carriage returns (\\r \\n)). Carriage returns are commonly introduced along with a line feed character to mark the end of a line individually in windows system.
Each row must be having the same number of columns.

Planning a Data Load

It is advised to dedicate individual Snowflake Clusters for loading and querying operations, especially when dealing with large data sets. This helps in optimizing the performance for both activities i.e. loading and querying operations. As loading requires fewer resources, the standard virtual warehouse is enough to load data. One can choose the size of the warehouse based on the speed at which you want to load data but do not forget to split large data files for faster loading.

Staging Data

Both Snowflake and your data source allow stage references via paths and it is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers, geographical location, etc., along with the date when the data was first written. This exercise will provide the flexibility to copy files by path using a single command and allows take advantage of Snowflake’s parallel operations by letting you execute concurrent COPY statements that match a subset of files.

Loading Data

By using the COPY INTO <table> command data is loaded into Snowflake. One can optimize load in your data warehouse environment by understanding the levers of this command. To execute the COPY command, move data from staged files to an existing table. Given below are the possible staging locations:

Internal stage (i.e. table/user stage).

Note- PUT command can be used to stage files.

For external stage, AWS S3 bucket or Microsoft Azure container can be referenced.
An External location (i.e. AWS S3 bucket or Microsoft Azure container).

For loading data from a stage, the COPY command accepts several options. This will enable to bring only a fraction of data(which is staged into Snowflake).

Handpicked Content

Explore Advanced Prompt Engineering with ChatGPT Frameworks

While wrapping up this article, we understood that uploading files to a Snowflake stage can be done by any Snowflake connector client. Well, a faster data pipeline can accelerate the value of your Big Data strategy.

A Blueprint on Data Engineering Services

FAQ’s

Does Snowflake support unstructured data storage?

While Snowflake primarily focuses on structured data, it offers functionalities for handling some unstructured data formats like JSON and Avro. However, complex unstructured data (e.g., images, videos) might require additional integrations or external storage solutions.

How can I manage data lifecycle and expiration in Snowflake?

Snowflake doesn’t have built-in features for automated data expiration. However, you can achieve this through manual scripting or integration with third-party tools. Options include scheduling regular unload commands to remove expired data or leveraging data lifecycle management solutions.

What are advanced security considerations for Snowflake data storage?

Beyond basic access controls, Snowflake offers features like row-level security and data masking to restrict access to sensitive data at a granular level. Additionally, integrating with external key management services can further enhance data security.

How can I optimize storage costs in Snowflake?

Data tiering: Utilize Snowflake’s auto-clustering feature to automatically move inactive data to less expensive storage tiers.
Data partitioning: Partition tables based on logical criteria to improve query performance and potentially reduce storage footprint by selectively unloading less frequently accessed partitions.
Compression options: Explore advanced compression techniques offered by Snowflake during the COPY INTO command to minimize storage size.

What are some best practices for disaster recovery with Snowflake?

Regular backups: Schedule automated backups to a separate cloud region for disaster recovery purposes.
Failover considerations: Establish a disaster recovery plan outlining steps to failover to a backup environment in case of a primary region outage.
Testing and validation: Regularly test your disaster recovery plan to ensure it functions as intended.

What is Continuous Data Protection (CDP)

Network Policies: These policies control access based on IP addresses, ensuring that only authorized users can access your data.
Multi-Factor Authentication (MFA): This adds an extra layer of security by requiring multiple forms of verification, making it harder for unauthorized users to gain access.
Security Roles: These roles help manage user access to different objects within the system, ensuring that users only have access to the data they need.
Encryption: All data is encrypted both in transit and at rest using AES-256 encryption, providing strong protection against unauthorized access.
Time Travel: This feature allows you to access historical data that has been changed or deleted, giving you the ability to query and restore data to a previous state.
Fail-safe: This disaster recovery feature ensures that data can be recovered even after accidental or intentional deletion, providing an additional layer of protection.

Comments

Harsha

June 30, 2024

Very informative and well explained!

Reply

Data and AI

latest case studies

Functional Expertise

latest case studies

Industries

latest case studies

Knowledge Hub

latest case studies

Company

latest case studies

Data and AI

latest case studies

Functional Expertise

latest case studies

Industries

latest case studies

Knowledge Hub

latest case studies

Company

latest case studies

Storing data and files in snowflake

Considering Data Storage

Staged File Storage in Process to Load Data

Cloning Tables, Schemas, and Databases

Data Loading Considerations

Preparing Your Data Files

General File Sizing Recommendations

Data Size Limitations for Semi-Structured Data

Data Size Limitations of Parquet files

Preparing Delimited Text Files

Planning a Data Load

Staging Data

Loading Data

Handpicked Content

FAQ’s

Comments

Leave a Comment Cancel Reply

Experience Fluid Intelligence

Connect with our team