Storing Data and Files in Snowflake

Snowflake's Cloud Data Warehouse: A Data Architecture Built for the Cloud

After covering Snowflake concepts and architecture in the earlier briefs, let's move to the final part: how to load and store data in Snowflake. Storing data is a critical part of any Snowflake deployment. The two key capabilities to highlight are storing data within the Snowflake environment and accessing data that is stored in other cloud storage environments. Snowflake is capable of both.

Considering Data Storage

Snowflake provides strong protection features for stored data. Continuous Data Protection (CDP), which includes Fail-safe and Time Travel, is available to all Snowflake accounts at no additional charge. That does not mean CDP has no impact on your storage costs, however: storage charges are incurred until the data leaves the Fail-safe state. Hence, considering the right storage approach is important while staging.

STAGED FILE STORAGE FOR LOADING DATA

The location where data files are stored, whether inside or outside Snowflake, is known as a stage, and the process of uploading files into one of these stages is known as staging. To support bulk loading into tables, Snowflake provides stages where the files holding the data to be loaded are staged. Snowflake allows both internal stages (location objects within a Snowflake database/schema) and external stages (storage locations outside the Snowflake environment in another cloud service, such as Amazon S3 or Microsoft Azure). For bulk data, you can take advantage of this feature.
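As a minimal sketch (the stage, bucket, and storage integration names below are hypothetical, and the external example assumes a pre-configured storage integration), creating one stage of each kind might look like:

-- Named internal stage within the current database/schema
CREATE STAGE my_internal_stage;

-- Named external stage pointing at an S3 bucket (storage integration assumed to exist)
CREATE STAGE my_external_stage
  URL = 's3://my-example-bucket/load/'
  STORAGE_INTEGRATION = my_s3_integration;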

CLONING TABLES, SCHEMAS, AND DATABASES

Snowflake has a feature called zero-copy cloning that provides an easy way to take a "snapshot" of any table, schema, or database. It creates a derived copy of the object that initially shares the underlying storage. This comes in handy when creating instant backups.
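For illustration (the object names are placeholders), a clone used as an instant backup could be created like this:

-- Zero-copy clone of a table; storage is shared until either copy changes
CREATE TABLE orders_backup CLONE orders;

-- The same syntax works at the schema and database level
CREATE SCHEMA sales_backup CLONE sales;
CREATE DATABASE analytics_backup CLONE analytics;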

Data Loading Considerations

PREPARING YOUR DATA FILES

Before we begin loading and storing data, the file size, text encoding, and several other characteristics of the data files should be checked.

General File Sizing Recommendations

To optimize the number of parallel load operations into Snowflake, it is recommended to create compressed data files that are roughly 10 MB to 100 MB in size. Combine smaller files to reduce processing overhead, and split larger files into a number of smaller files for faster loading. This distributes the load among the servers in the active Snowflake warehouse.
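As a sketch (the file path and stage name are hypothetical, and the PUT command is run from a client such as SnowSQL), uploading a set of pre-split files with several parallel threads might look like:

-- Upload all split chunks of a large file; PARALLEL controls the number of client-side upload threads
PUT file:///data/sales_part_*.csv @my_internal_stage PARALLEL = 8;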

Data Size Limitations for Semi-Structured Data

Individual rows of the VARIANT data type have a size limit of 16 MB in compressed form. JSON and Avro are the most commonly used semi-structured formats, and both are typically a concatenation of many documents. The source software that produces output in these formats usually provides it as a single huge array containing multiple records. Line breaks and commas are both supported as document separators.

While executing the COPY INTO <table> command, it is recommended to enable the STRIP_OUTER_ARRAY file format option. This removes the outer array structure and loads the records into separate table rows, which also improves efficiency. Consider the following example:

COPY INTO <table_name>
FROM @~/<file_name>.json
file_format = (type = 'JSON' strip_outer_array = true);

Data Size Limitations of Parquet Files

For smooth loading, it is highly recommended to split Parquet files greater than 3 GB into smaller files of 1 GB or less. This ensures that the load does not time out.

Preparing Delimited Text Files

While preparing CSV/delimited text files for loading, the following points must be considered (a sample file format definition is shown after this list):

>Data in the files should use the default UTF-8 character set. Additional encodings can be specified by using the ENCODING file format option.

>Records and fields should be delimited by different characters within the files.

Note: Both should be a single character. Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters, and the line feed (\n) is often used as the row delimiter.

>Fields that contain delimiter characters must be enclosed in single or double quotes. If the data being loaded contains quotes, they must be escaped.

>Fields that contain carriage returns (\r \n) should also be enclosed in single or double quotes. Carriage returns are commonly introduced along with a line feed character to mark the end of a line on Windows systems.

>Each row must have the same number of columns.
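To tie these points together, a delimited file format definition might look like the sketch below (the format name and delimiter choices are illustrative):

-- Hypothetical CSV file format reflecting the recommendations above
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  ENCODING = 'UTF8'                       -- default character set
  FIELD_DELIMITER = '|'                   -- single-character field delimiter
  RECORD_DELIMITER = '\n'                 -- line feed as the row delimiter
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'      -- quote fields that contain delimiters or carriage returns
  ESCAPE_UNENCLOSED_FIELD = '\\';         -- escape character for embedded quotes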

Planning a Data Load

It is advised to dedicate separate Snowflake warehouses to loading and querying operations, especially when dealing with large data sets. This helps optimize the performance of both activities. Since loading requires fewer resources, a standard virtual warehouse is enough to load data. Choose the size of the warehouse based on how quickly you want to load data, but do not forget to split large data files for faster loading.
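As a sketch (the warehouse names and sizes are illustrative), dedicating separate warehouses to loading and querying could look like:

-- Modest warehouse dedicated to data loading
CREATE WAREHOUSE load_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;

-- Larger warehouse dedicated to analytical queries
CREATE WAREHOUSE query_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;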

Staging Data

Both Snowflake and your data source support stage references via paths, and it is good practice to stage regular data sets by partitioning them into logical paths. These paths could include details such as source identifiers and geographical location, along with the date when the data was first written. This provides the flexibility to copy files by path using a single command and lets you take advantage of Snowflake's parallel operations by executing concurrent COPY statements that each match a subset of the files.
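For example (the stage name, table name, and path layout are hypothetical), files partitioned by source, region, and date can be staged and then copied by path with a single command:

-- Stage one day's files under a logical path: <source>/<region>/<year>/<month>/<day>/
PUT file:///data/2023/10/15/*.csv @my_internal_stage/sales/us/2023/10/15/;

-- Copy only that partition
COPY INTO sales
  FROM @my_internal_stage/sales/us/2023/10/15/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');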

Loading Data

Data is loaded into Snowflake by using the COPY INTO <table> command. Understanding the levers of this command helps you optimize loads in your data warehouse environment. Executing the COPY command moves data from staged files into an existing table. Given below are the possible staging locations:

>An internal stage (i.e. a table or user stage).

Note: The PUT command can be used to stage files.

>An external stage that references an AWS S3 bucket or Microsoft Azure container.

>An external location (i.e. an AWS S3 bucket or Microsoft Azure container referenced directly in the COPY statement).

The COPY command accepts several options for loading data from a stage. These options enable you to bring in only a subset of the data that is staged in Snowflake.
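For instance (the table, stage, and file names are placeholders), the FILES and PATTERN options restrict a load to a subset of the staged files:

-- Load only explicitly listed files
COPY INTO sales
  FROM @my_internal_stage/sales/
  FILES = ('day1.csv', 'day2.csv')
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- Or load every staged file whose name matches a regular expression, skipping files that error
COPY INTO sales
  FROM @my_internal_stage/sales/
  PATTERN = '.*2023/10/.*[.]csv'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'SKIP_FILE';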

To wrap up this article: files can be uploaded to a Snowflake stage through any Snowflake connector or client, and a faster data pipeline can accelerate the value of your Big Data strategy.


About the Author

Haswitha Madupalli is a Data Engineer at Factspan who lives by the data: trends in datasets bring her contentment, and new algorithms that solve real-world problems bring her a thrill. Besides being a technophile, she has a keen interest in swimming, driving, and hiking!
