After sharing briefs over Snowflake concepts and architecture, let’s move to the final part how to load and store data in Snowflake. We all are familiar with the fact that storing data is a critical part of any Snowflake database. The most highlighted two key points in any database are storing data within the environment and accessing data that is stored in other cloud storage environments. Snowflake is capable of both them.
Considering Data Storage
Snowflake provides spectacle features for the stored data. Continuous Data Protection (CDP) is given to all Snowflake accounts which includes Fail-Safe and Time Travel with no additional charger. But that doesn’t count on CDP that it will not have an impact on your storage costs. Storage costs will be incurred until the data leaves the Fail-safe state. Hence, considering the right storage is important while staging.
STAGED FILE STORAGE IN PROCESS TO LOAD DATA
The location the data is stored in is known as a stage whether it is stored internally or externally and the process of uploading files into one of these stages is known as staging. To assist loading bulk data into tables, Snowflake has stages where files that have the data to be loaded are staged. Snowflake allows both internal (location objects within a Snowflake database/schema) and external (storage locations outside the Snowflake environment in another cloud storage and this could be S3, Azure) stages. For bulk data, you could utilize this feature.
CLONING TABLES, SCHEMAS, AND DATABASES
Snowflake has a feature, called zero-copy cloning that gives an easy way to take a “snapshot” of any schema, table or database. It creates a derived copy of the object ( which was shared initially the underlying storage). This can come handy when creating instant backups.
Data Loading Considerations
PREPARING YOUR DATA FILES
Before we begin to process data loading and storing, data size, text and many multiple features should be checked.
General File Sizing Recommendations
Creating compressed data files that are roughly 10 MB to 100 MB in size should be recommended in order to optimize the number of parallel loads into Snowflake. Combine the smaller files to reduce processing overhead. now, split the large files into a number of smaller files for faster load. This allows distributing the load between servers in the active Snowflake warehouse.
Data Size Limitations for Semi-Structured Data
The variant data type has a size limit of 16 MB in the compressed form on the individual rows. JSON and Avro are the most commonly used data formats and both are a concatenation of many documents. The source software that provides an output to both will provide it in the form of a single huge array having multiple records. For document separation, line breaks and commas are being supported.
While executing the COPY INTO <table> command, it is recommended to enable the STRIP_OUTER_ARRAY file format option. This will load the records into separate table rows by removing the outer array structure and also enhance efficiency. Let’s consider an example,
COPY INTO <table_name> FROM @~/<file_name>.json file_format = (type = 'JSON' strip_outer_array = true);
Data Size Limitations of Parquet files
For smooth loading, it is highly recommended to split parquet files that are greater than 3GB into smaller files of 1GB or lesser in size. This ensures that the loading does not timeout.
Preparing Delimited Text Files
While preparing CSV/Delimited text files for loading, the given points must be considered
>Files must have data in ASCII format only. The default character set is UTF-8. However, by using the ENCODING file format option, additional encodings can be mentioned.
>Records and fields should be delimited by different characters within the files.
Note- Both should be a single like, pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. Often the line feed (\n) has been used as a row delimiter.
>Fields that have delimiter must be enclosed in single or double-quotes. If the data being loaded must be escaped if contains quotes.
>Fields should be enclosed in single or double quotes(only those who have carriage returns (\r \n)). Carriage returns are commonly introduced along with a line feed character to mark the end of a line individually in windows system.
>Each row must be having the same number of columns.
Planning a Data Load
It is advised to dedicate individual Snowflake Clusters for loading and querying operations, especially when dealing with large data sets. This helps in optimizing the performance for both activities i.e. loading and querying operations. As loading requires fewer resources, the standard virtual warehouse is enough to load data. One can choose the size of the warehouse based on the speed at which you want to load data but do not forget to split large data files for faster loading.
Both Snowflake and your data source allow stage references via paths and it is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers, geographical location, etc., along with the date when the data was first written. This exercise will provide the flexibility to copy files by path using a single command and allows take advantage of Snowflake’s parallel operations by letting you execute concurrent COPY statements that match a subset of files.
By using the COPY INTO <table> command data is loaded into Snowflake. One can optimize load in your data warehouse environment by understanding the levers of this command. To execute the COPY command, move data from staged files to an existing table. Given below are the possible staging locations:
>Internal stage (i.e. table/user stage).
Note- PUT command can be used to stage files.
>For external stage, AWS S3 bucket or Microsoft Azure container can be referenced.
>An External location (i.e. AWS S3 bucket or Microsoft Azure container).
For loading data from a stage, the COPY command accepts several options. This will enable to bring only a fraction of data(which is staged into Snowflake).
While wrapping up this article, we understood that uploading files to a Snowflake stage can be done by any Snowflake connector client. Well, a faster data pipeline can accelerate the value of your Big Data strategy.
Must Read: A Blueprint on Data Engineering Services
About the Author
Haswitha Madupalli is a Data Engineer at Factspan who resides by the data, where trends in datasets bring contentment and new algorithms to solve real-world problems bring a thrill to her. Besides being a technophile, she has a keen interest in swimming, driving and hiking!