Set Up a Development Environment for PySpark

Spark is a popular, fast and reliable cluster computing technology. Compared with other computing technologies, it provides implicit data parallelism and built-in fault tolerance. It also integrates smoothly with Hive and HDFS, giving a seamless experience of parallel data processing. Spark SQL does not run out of the box on every operating system, so an environment has to be set up first. Let’s learn how to set up a development environment on a local machine for testing and local data development, without the need for any cloud infrastructure.

Need to Set Up a Development Environment

A development environment is the combination of hardware and software on which the tests will be executed. It includes everything needed to run those tests, such as the hardware configuration, operating system settings, software configuration, test terminals and more.

CHALLENGES WHILE SETTING UP A DEVELOPMENT ENVIRONMENT

Several challenges come up while setting up a development environment for PySpark.

>Do you want to use external Python packages only when they are needed, while keeping them freely accessible?

>Still setting your environment variables for Spark by hand?

>Still installing several Python versions (3.7, 2.6) to resolve support issues?

>Do you need auto-suggestions while you write Python code?

>Do you want to easily visualize your debug output and data while you write your code?

Overcoming the challenges using Docker in VS Code

When you start working on a project, it sometimes requires extra packages to run your code, and the next project requires a different set of extra packages. Both sets end up installed on a single work machine, and during a review with clients it is no longer clear which extra packages were used in which project. This is not a good way to maintain project-level code.

To resolve this, let’s use localized Docker containers, with one container per project. Inside a project’s container we install all the required packages, so we can show clients exactly which packages were used, the test coverage so far, and any documentation for that project. Extra packages are installed only where they are needed. Switching to another container switches to another project, with that project’s own packages, docs and so on. VS Code is one of the best tools for working with Docker containers.
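To make the "which packages does this project use" question concrete, the following is a minimal sketch that reads the package names from a requirements.txt and reports which of them are not installed in the current container. The function names `parse_requirements` and `missing_packages` are hypothetical, and the parser only handles simple lines such as `name==version`, not the full pip requirements syntax.

```python
import re
from importlib import metadata


def parse_requirements(text):
    """Return the package names from simple requirements.txt content."""
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is everything before the first specifier, extra or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Sample content for illustration; in a project, read requirements.txt instead.
    sample = "pytest==6.1.2\n# linting\nflake8>=3.8\n"
    declared = parse_requirements(sample)
    print("declared:", declared)
    print("missing:", missing_packages(declared))
```

Run inside each project container, a check like this gives a quick per-project view of declared versus installed packages.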

WHY VS CODE

VS Code has become popular thanks to the growth of web development in recent years and developers’ need for a lightweight, well-made editor that has fewer features but is less complex than the alternatives on the market. Let’s walk through some of its advantages:

  • It is easy to install on any machine.
  • It is easy to add extensions.
  • It is easy to maintain the layout of the code:

(This makes the status of code development simple to understand at a glance.)

-README – Project description

-Documentation – Standards followed by the organization

-Module Code – All the code resides here

-Test – Update on test coverage so far

Dockerfile
# set base image OS (CentOS is a Linux distribution that provides a free, community-supported computing platform)
FROM centos

# Every command below happens within the container

# make some directory inside container
RUN mkdir mydirectory

# make mydirectory as working directory inside container
WORKDIR /mydirectory

# set the maintainer label (no spaces around "=", otherwise Docker parses it as a legacy "key value" pair)
LABEL maintainer="Sebastian Srikanth Kumar <sebastin.kumar@factspan.com>"


# Use yum command to install any package in CentOS (Linux dist.)

# install python3, Java, tar, git, wget and zip

# The java-1.8.0-openjdk package contains just the Java Runtime Environment. 
# If you want to develop Java programs then install the java-1.8.0-openjdk-devel package.
RUN dnf install -y python3-pip
RUN yum install -y python3 java-1.8.0-openjdk java-1.8.0-openjdk-devel tar git wget zip

# Import Spark archive file from AWS S3 PUBLIC bucket
# https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

# download spark from AWS public bucket
RUN wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
# Extract spark archive
RUN tar zxfv spark-2.4.3-bin-hadoop2.8.tgz
# Remove the downloaded Spark archive from the container (the extracted folder stays)
RUN rm spark-2.4.3-bin-hadoop2.8.tgz

# create environment variable for spark
ENV SPARK_HOME /mydirectory/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
# create path variable and point to bin location in spark
ENV PATH $PATH:$SPARK_HOME/bin

# copy requirements.txt file to working directory
COPY requirements.txt .

# Install the packages listed in text file
RUN pip3 install -r requirements.txt

# Clean cache inside container
RUN yum clean all
RUN rm -rf /var/cache/yum
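Once the image is built, it can help to confirm that the two variables set by the Dockerfile, SPARK_HOME and PATH, actually point at a usable Spark installation. The following is a minimal sketch using only the Python standard library; the function name `check_spark_env` is hypothetical, and the only assumption about Spark itself is that the distribution contains `bin/spark-submit`.

```python
import os
from pathlib import Path


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    spark_bin = Path(spark_home) / "bin"
    if not (spark_bin / "spark-submit").is_file():
        problems.append(f"spark-submit not found under {spark_bin}")

    # PATH entries are separated by os.pathsep (':' on Linux).
    path_entries = [Path(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if spark_bin not in path_entries:
        problems.append(f"{spark_bin} is not on PATH")
    return problems


if __name__ == "__main__":
    # Check the environment of the current process, e.g. inside the container.
    for message in check_spark_env(os.environ) or ["Spark environment looks fine"]:
        print(message)
```

An empty result means both variables are consistent; any returned message names the setting to revisit in the Dockerfile.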

-Set the environment variables (“remoteEnv”) for the remote container in the devcontainer.json file shown below:

 >>Here we add the project path, the py4j path and the Spark Python path (both under SPARK_HOME) to the PYTHONPATH variable

// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.140.1/containers/docker-existing-dockerfile
{
	"name": "Sebastian Dockerfile",

	// Sets the run context to one level up instead of the .devcontainer folder.
	"context": "..",

	// Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
	"dockerFile": "../Dockerfile",

	// Set *default* container specific settings.json values on container create.
	"settings": {
		"terminal.integrated.shell.linux": null
	},

	/* Add the containerEnv property to devcontainer.json to set variables that should apply
	to the entire container or remoteEnv to set variables for VS Code
	and related sub-processes (terminals, tasks, debugging, etc) */
	"remoteEnv": {
		"WORKSPACE": "${containerWorkspaceFolder}",
		"LOCAL_WORKSPACE_FOLDER": "${localWorkspaceFolder}",
		"PYTHONPATH": "/workspaces/test_project:${containerEnv:SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${containerEnv:SPARK_HOME}/python/"
	},

	// Add the IDs of extensions you want installed when the container is created.
	"extensions": [
		"dongli.python-preview",
		"frhtylcn.pythonsnippets",
		"njpwerner.autodocstring",
		"ms-python.python",
		"ms-azuretools.vscode-docker",
		"RandomFractalsInc.vscode-data-preview",
		"msrvida.vscode-sanddance"
	],

	// Use 'forwardPorts' to make a list of ports inside the container available locally.
	"forwardPorts": [3030],

	// Uncomment the next line to run commands after the container is created - for example installing curl.
	// "postCreateCommand": "apt-get update && apt-get install -y curl",

	// Uncomment when using a ptrace-based debugger like C++, Go, and Rust
	// "runArgs": [ "--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined" ],

	// Bind-mounts the Docker socket to use the Docker CLI from inside the container. See https://aka.ms/vscode-remote/samples/docker-from-docker.
	"mounts": [ "source=/var/run/docker.sock,target=/var/run/docker.sock,type=bind" ]

	// Uncomment to connect as a non-root user if you've added one. See https://aka.ms/vscode-remote/containers/non-root.
	// "remoteUser": "vscode"
}
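The PYTHONPATH value in remoteEnv hard-codes the py4j file name, which changes between Spark releases. As an illustration only, the following sketch shows how the same value could be derived from SPARK_HOME by looking the zip up on disk. The function names `spark_python_paths` and `build_pythonpath` are hypothetical, and the sketch assumes the usual Spark layout with `python/lib/py4j-*-src.zip`.

```python
import os
from pathlib import Path


def spark_python_paths(spark_home):
    """Return the PYTHONPATH entries PySpark needs for a given SPARK_HOME."""
    python_dir = Path(spark_home) / "python"
    # Spark ships py4j as a zip whose version differs between releases,
    # so look it up instead of hard-coding the file name.
    py4j_zips = sorted((python_dir / "lib").glob("py4j-*-src.zip"))
    return [str(p) for p in py4j_zips] + [str(python_dir)]


def build_pythonpath(workspace, spark_home):
    """Join the workspace folder and the Spark entries into one PYTHONPATH value."""
    return os.pathsep.join([str(workspace)] + spark_python_paths(spark_home))


if __name__ == "__main__":
    # Falls back to an empty SPARK_HOME if the variable is not set.
    print(build_pythonpath("/workspaces/test_project", os.environ.get("SPARK_HOME", "")))
```

For the Spark 2.4.3 archive used in the Dockerfile, the discovered zip should be the same `py4j-0.10.7-src.zip` that the configuration names explicitly.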
Improve Style of Coding:

Extensions in VS Code for effective coding:

"eamodio.gitlens"

GitLens supercharges the Git capabilities built into Visual Studio Code. It helps you to visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more.

"shd101wyy.markdown-preview-enhanced"

Previews Markdown files, including headings, horizontal lines, maths and more.

"njpwerner.autodocstring"

Generates docstrings for Python code automatically.

"ms-python.python"

A Visual Studio Code extension with rich support for the Python language (for all actively supported versions of the language: >=3.6).

"ms-azuretools.vscode-docker"

Adds syntax highlighting, commands, hover tips, and linting for Dockerfile and docker-compose files.

"johnpapa.vscode-peacock"

A Visual Studio Code extension that subtly changes the colour of your workspace. Ideal when you have multiple VS Code instances and you want to quickly identify which is which.

"tht13.html-preview-vscode"

Previews HTML files inside VS Code.

"frhtylcn.pythonsnippets"

Provides code snippets for Python.

Extensions can be enabled or disabled from the config file.

This wraps up the extended demo on setting up a development environment for PySpark, which covered the configuration for a local setup in depth. You can also go through MyGithub to learn more.

About the Author

Sebastian Srikanth Kumar is an Associate Business Analyst at Factspan with a keen interest in external automation projects such as WhatsApp automation. He is fond of games and loves to keep himself up to date with tech news every day.
