Creating DataFrames in Snowflake Snowpark

What are Snowpark DataFrames? Here is a step-by-step guide explaining how to use Snowpark, Snowflake's developer framework, to create DataFrames for building complex data pipelines and efficiently processing data. Learn how DataFrames act as virtual tables and are converted into SQL for execution in Snowflake.
snowflake, create data frame

Why this blog?

If you’re finding it challenging to manage and optimize your data pipelines, this blog is for you. It deep- dives into Snowpark’s powerful DataFrame capabilities within Snowflake, offering a streamlined way to handle complex data processes. By understanding DataFrames, you’ll enhance your data manipulation skills, improve efficiency, and unlock the full potential of Snowflake’s processing power. Whether you’re a developer or data engineer, this detailed guide will provide valuable insights into leveraging Snowpark for your data needs.

Content
  • Introduction
  • What is a DataFrame?
  • Prerequisites to create a DataFrame in Snowpark
  • How to create a DataFrame in Snowpark?
  • How to Read Data from a Snowpark DataFrame?
  • How to create a DataFrame in Snowpark with a List of Specified Values?
  • How to create a DataFrame in Snowpark with a List of Specified Values and Schema?
  • How to create a DataFrame in Snowpark using Pandas?
  • How to create a DataFrame in Snowpark from a range of numbers?
  • How to create a DataFrame in Snowpark from a Database Table?
  • How to create a DataFrame in Snowpark by reading files from a stage?
Introduction

Snowpark is a developer framework from Snowflake that enables developers to directly interact with Snowflake and construct complex data pipelines. In Snowpark, DataFrames are the primary tool for querying and processing data. In this article, we will explore what DataFrames are and provide a step-by-step guide for creating them in Snowpark.

The transformation phase is flexible; data can be restructured and prepared based on current analytical needs without the sequential dependencies typical in ETL processes. The added validation step in ELTV thoroughly checks the transformed data against specific quality criteria and business rules before it is deemed fit for use.

What is a DataFrame?

A DataFrame in Snowpark functions as a virtual table that arranges data in an organized manner. It is similar to expressing a SQL query, but using a different language. DataFrames operate lazily, which means they do not process data until you explicitly instruct them to perform a specific action, such as retrieving or analyzing information.

The Snowpark API ultimately converts the DataFrames into SQL to run your code within Snowflake.

Prerequisites to create a DataFrame in Snowpark

To create a DataFrame, you need to use the Session class in Snowpark, which connects to a Snowflake database and offers methods for creating DataFrames and accessing objects.

When you create a Session object, you provide connection parameters to establish a connection with a Snowflake database as shown below:

import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
connection_parameters = {
   "account": "snowflake account",
   "user": "snowflake username",
   "password": "snowflake password",
   "role": "snowflake role",  # optional
   "warehouse": "snowflake warehouse",  # optional
   "database": "snowflake database",  # optional
   "schema": "snowflake schema"  # optional
}


session = Session.builder.configs(connection_parameters).create()

To create DataFrames in a Snowsight Python worksheet, construct them within the handler function (main) and utilize the Session object (session) passed into the function.

def main(session: snowpark.Session):
    # your code goes here
How to create a DataFrame in Snowpark?

The createDataFrame method of Session class in Snowpark constructs a new DataFrame with the specified values from the local data.

Syntax:The following is the syntax to create a DataFrame using createDataFrame method:

session.createDataFrame(data[, schema])

The accepted values for data in the createDataFrame method are List, Tuple or a Pandas DataFrame.

  • Lists are used to store multiple items in a single variable and are created using square brackets.
    ex: myList = [“one”, “two”, “three”]
  • Tuples are used to store multiple items in a single variable and are created using round brackets. The contents of a tuple cannot change once they have been created in Python.
    ex: myTuple = (“one”, “two”, “three”)
  • Pandas is a Python library used for working with data sets. Pandas allows the creation of DataFrames natively in Python.

The schema in the createDataFrame method can be a StructType containing names and data types of columns, or just a list of column names, or None.

How to Read Data from a Snowpark DataFrame?

Data from a Snowpark DataFrame can be retrieved by using the show method.

Syntax:The following is the syntax to read data from a Snowpark DataFrame using show method.

DataFrame.show([n, max_width])

Parameters:

  • – The value represents the number of rows to print out. This default value is 10
  • max_width – The maximum number of characters to print out for each column
How to create a DataFrame in Snowpark with a List of Specified Values?

Example 1:

df1 = session.createDataFrame([1,2,3,4], schema=["a"])
df1.show()

------
|"A" |
------
|1   |
|2   |
|3   |
|4   |
------

The DataFrame df1 when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT "A" FROM ( 
  SELECT $1 AS "A" 
  FROM  VALUES (1::INT), (2::INT), (3::INT), (4::INT)
) LIMIT 10

Example 2:

The following is an example of creating a DataFrame with multiple lists of values and assigning the column names as “a”,”b”, “c” and “d”.

df2 = session.createDataFrame([[1,2,3,4],[5,6,7,8]], schema=["a","b","c","d"])

df2.show()

--------------------------

|"A"  |"B"  |"C"  |"D"   |

--------------------------

|1    |2    |3    |4     |

|5    |6    |7    |8     |

--------------------------

The DataFrame df2 when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT "A", "B", "C", "D" FROM ( 
  SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D" 
  FROM  VALUES 
  (1::INT, 2::INT, 3::INT, 4::INT), 
  (5::INT, 6::INT, 7::INT, 8::INT)
) LIMIT 10
How to create a DataFrame in Snowpark with a List of Specified Values and Schema?

When schema parameter in the createDataFrame method is passed as a list of column names or None, the schema of the DataFrame will be inferred from the data across all rows.

Example 3:

The following is an example of creating a DataFrame with multiple lists of values with different data types and assigning the column names as “a”,”b”, “c” and “d”.

df3 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']], schema=["a","b","c","d"])
df3.show()


----------------------------------
|"A"  |"B"  |"C"   |"D"          |
----------------------------------
|1    |2    |Snow  |2024-01-01   |
|3    |4    |Park  |2024-01-02   |
----------------------------------

The DataFrame df3 when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT "A", "B", "C", "D" FROM ( 
  SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D" 
  FROM  VALUES 
  (1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING), 
  (3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10

Note that in the above query, since we did not explicitly specify the data types of the columns during definition, the values ‘2024-01-01’ and ‘2024-01-02’, despite being of “Date” data type, are identified as “String” data type.

Example 4:

Create a custom schema parameter of StructType containing names and data types of columns and pass it to the createDataFrame method as shown below:

#//create dataframe with schema
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType, DateType


my_schema = StructType(
    [StructField("a", IntegerType()),
     StructField("b", IntegerType()),
     StructField("c", StringType()),
     StructField("d", DateType())]
    )


df4 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']], schema=my_schema)
df4.show()

------------------------------------------
|"A"  |"B"  |"C"   |"D"                  |
------------------------------------------
|1    |2    |Snow  |2024-01-01 00:00:00  |
|3    |4    |Park  |2024-01-02 00:00:00  |
------------------------------------------

The DataFrame df4 when executed is translated and executed as SQL in Snowflake by Snowpark API referencing to the columns with the defined data types as shown below:

SELECT 
  "A", "B", "C", 
  to_date("D") AS "D" 
FROM ( 
  SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D" 
  FROM  VALUES 
  (1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING), 
  (3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10

Note that in the above query, the column “D” is read as Date data type in Snowflake.

How to create a DataFrame in Snowpark using Pandas?

A Pandas DataFrame can be passed as “data” to create a DataFrame in Snowpark.

Example 5:

The following is an example of creating a Snowpark DataFrame using pandas DataFrame.

import pandas as pd


df_pandas = session.createDataFrame(pd.DataFrame([1,2,3],columns=["a"]))
df_pandas.show()


------
|"a" |
------
|1   |
|2   |
|3   |
------

Unlike DataFrames created with Lists or Tuples using the ‘createDataFrame‘ method, when a DataFrame is created using a pandas DataFrame, the Snowpark API creates a temporary table and imports the data from the pandas DataFrame into it.

When extracting data from the Snowpark DataFrame created using the pandas DataFrame, the data is retrieved by querying the temporary table.The DataFrame df_pandas when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT  *  FROM "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA"."SNOWPARK_TEMP_TABLE_9RSV8KITUO" LIMIT 10
How to create a DataFrame in Snowpark from a range of numbers?

A DataFrame from a range of numbers can be created using range method of Session class in Snowpark. The resulting DataFrame has a single column named “ID” containing elements in a range from start to end.

Syntax:

The following is the syntax to create a DataFrame using range method.

session.range(start[, end, step])

Parameters:

  • start : The start value of the range. If end is not specified, start will be used as the value of end
  • end : The end value of the range
  • step : The step or interval between numbers

Example 6:

The following is an example of creating a DataFrame with a range of numbers from 1 to 9.

df_range = session.range(1,10).to_df("a")
df_range.show()


-------
|"A"  |
-------
|1    |
|2    |
|3    |
|4    |
|5    |
|6    |
|7    |
|8    |
|9    |
-------

The DataFrame df_range when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT  *  FROM ( 
  SELECT ( ROW_NUMBER()  OVER ( ORDER BY  SEQ8() ) -  1 ) * (1) + (1) AS id 
  FROM ( TABLE (GENERATOR(ROWCOUNT => 9)))
) LIMIT 10

Example 7:

The following is an example of creating a DataFrame with a range of numbers from 1 to 9 with a step value of 2 and returning the output column renamed as “A”.

df_range2 = session.range(1,10,2).to_df("a")
df_range2.show()


-------
|"A"  |
-------
|1    |
|3    |
|5    |
|7    |
|9    |
-------

The DataFrame df_range2 when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT "ID" AS "A" FROM ( 
  SELECT ( ROW_NUMBER()  OVER ( ORDER BY  SEQ8() ) -  1 ) * (2) + (1) AS id 
  FROM ( TABLE (GENERATOR(ROWCOUNT => 5)))
) LIMIT 10
How to create a DataFrame in Snowpark from a Database Table?

The sql and table methods of Session class in Snowpark can be used to create a DataFrame from a Database Table.

Example 8:

The following is an example of creating a DataFrame from a database table by executing a SQL query using sql method of Session class in Snowpark.

df_sql = session.sql("SELECT * FROM SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE")
df_sql.show(5)


----------------------------------
|"YEAR"  |"MONTH"  |"REVENUE"    |
----------------------------------
|2012    |5        |3264300.11   |
|2012    |6        |3208482.33   |
|2012    |7        |3311966.98   |
|2012    |8        |3311752.81   |
|2012    |9        |3208563.06   |
----------------------------------

The DataFrame df_sql when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT * FROM (SELECT * FROM SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE) LIMIT 5

Example 9:

The following is an example of creating a DataFrame from a database table by executing a SQL query using table method of Session class in Snowpark.

df_sql = session.table("MONTHLY_REVENUE")
df_sql.show(5)


----------------------------------
|"YEAR"  |"MONTH"  |"REVENUE"    |
----------------------------------
|2012    |5        |3264300.11   |
|2012    |6        |3208482.33   |
|2012    |7        |3311966.98   |
|2012    |8        |3311752.81   |
|2012    |9        |3208563.06   |
----------------------------------

The DataFrame df_table when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

SELECT  *  FROM MONTHLY_REVENUE LIMIT 5
How to create a DataFrame in Snowpark by reading files from a stage?

DataFrameReader class in Snowpark provides methods for loading data from a Snowflake stage to a DataFrame with format-specific options.

To use it:

  1. Create a DataFrameReader object through Session.read method
  2. For CSV file format, create a custom schema parameter of StructType containing names and data types of columns
  3. Set the file format specific properties such as delimiter using options() method
  4. Specify the file path and stage details by calling the method corresponding to the CSV format, csv()

Example 10:

The following is an example of creating a DataFrame in Snowpark by reading CSV files from S3 stage.

from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType


schema = StructType(
    [StructField("EMPLOYEE_ID", IntegerType()),
     StructField("FIRST_NAME", StringType()),
     StructField("LAST_NAME", StringType()),
     StructField("EMAIL", StringType())
    ])


df_s3_employee = session.read.schema(schema).options({"field_delimiter": ",", "skip_header": 1}).csv('@my_s3_stage/Inbox/')
df_s3_employee.show(5)


--------------------------------------------------------------
| EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL               |
| 204384      | Steven     | King      | SKING@test.com      |
| 204388      | Neena      | Kochhar   | NKOCHHAR@test.com   |
| 204392      | Lex        | De Haan   | LDEHAAN@test.com    |
| 204393      | Alexander  | Hunold    | AHUNOLD@test.com    |
| 204394      | Bruce      | Ernst     | BERNST@test.com     |
--------------------------------------------------------------

The DataFrame df_s3_employee when executed is translated and executed as SQL in Snowflake by Snowpark API as shown below:

  1. A temporary file format is created using the properties specified in the options() method
  2. The stage files are queried using the file format created in the first step and the columns are cast into the data types specified in the schema defined
  3. The file format created in the first step is dropped
--create a temporary file format
CREATE SCOPED TEMPORARY FILE  FORMAT  If  NOT  EXISTS "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598 
TYPE  = CSV  FIELD_DELIMITER = ',' SKIP_HEADER = 1


--select data from stage files using the temporary file format
SELECT  *  FROM ( 
  SELECT 
    $1::INT AS "EMPLOYEE_ID", 
    $2::STRING AS "FIRST_NAME", 
    $3::STRING AS "LAST_NAME", 
    $4::STRING AS "EMAIL" 
  FROM @my_s3_stage/Inbox/( FILE_FORMAT  => '"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598')
) LIMIT 5


--drop the temporary file format
DROP  FILE  FORMAT  If  EXISTS "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598

Featured content
Case study : Unified Workforce Data automation using snowflake

Unified Workforce Data and Automated Ins...

Loading Data into Snowflake using Snowpark DataFrames

Loading Data into Snowflake using Snowpa...

Elevating Data Integration with ELTV Data Pipeline

Elevating Data Integration with ELTV Dat...

Streamlining ETL Pipeline with Snowflake...

Snowflake Cover image | Factspan

Stream & Merge for Incremental Load...

SnowPipe: Cloud Data Ingestion Tool Powe...

Ventilation Management in Emergency care with Machine Learning and DBT

ML-powered ventilation systems for emerg...

AWS in Data engineering Cover | Factspan

AWS Data Engineering Essentials Guideboo...

Choosing the Right Cloud Data Engineering & Analytics Platform: Databricks vs. Snowflake

Choosing the Right Cloud Data Engineerin...

FAQs on Data Engineering Services

FAQs on Data Engineering Services...

Let’s Connect

    Work Email*

    Phone Number (Optional)

    1000/1000

    Scroll to Top