For our example purposes, we will set up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon.
Prerequisites
- Python 3.5
- JRE 8
- JDK 1.8
- Eclipse plugins: PyDev
Steps to set up:
- Download Spark from https://spark.apache.org/downloads.html:
1. Choose a Spark release: 2.1.0
2. Choose a package type: Pre-built for Apache Hadoop 2.6
3. Download Spark: spark-2.1.0-bin-hadoop2.6.tgz, and extract it to C:\Users\Public\Spark_Dev_set_up
- Download winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and copy it to C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils\bin
- Final folder structure:
C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6
  bin/  conf/  data/  examples/  jars/  licenses/  python/  R/  sbin/  winutils/  winutils/bin/  yarn/
  LICENSE  NOTICE  README.md  RELEASE
- In Eclipse, set the environment variables:
Windows -> Preferences -> PyDev -> Python Interpreter -> Environment:
  Variable: SPARK_HOME, Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6
  Variable: HADOOP_HOME, Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils
- In Eclipse, add the Spark libraries to PYTHONPATH:
Windows -> Preferences -> PyDev -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\pyspark.zip
Windows -> Preferences -> PyDev -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip
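If you prefer not to rely on the Eclipse-managed settings, the same configuration can also be done at the top of a script, before pyspark is imported. A minimal sketch, assuming the paths used above:

import os
import sys

# Point Spark (and winutils, via HADOOP_HOME) at the unpacked distribution;
# these must be set before the JVM gateway is launched.
os.environ["SPARK_HOME"] = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6"
os.environ["HADOOP_HOME"] = os.path.join(os.environ["SPARK_HOME"], "winutils")

# Make pyspark and py4j importable without the Eclipse PYTHONPATH entries
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib", "pyspark.zip"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.4-src.zip"))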
- How to use it:
In Eclipse, create a new Python project:
File -> New -> PyDev Project -> “sample”
Note: If you don’t see a Python interpreter configured, click on “Click here to configure an interpreter not listed” -> Quick Auto-Config.
- Create a sample program:
Right click on the Project “sample” -> New PyDev Module -> test1.py
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession\
    .builder\
    .getOrCreate()

# Get the underlying SparkContext, build a small RDD, and print its elements
sc = spark.sparkContext
myRdd = sc.parallelize([1, 2, 3, 4])
print(myRdd.take(5))

- Run the program:
Right click on the program -> Run As -> Python Run
Program output (after Spark’s startup log messages):
[1, 2, 3, 4]
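Once this runs, the same SparkSession also gives you the DataFrame API. A minimal follow-up sketch (not part of the original example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a small DataFrame from an in-memory list of tuples and print it
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()

# Stop the session when finished
spark.stop()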
Is there any way that IntelliJ can be configured with PySpark?
I haven’t tried it in IntelliJ, but I guess if you add the Spark folders within the same project, it should auto-detect them.
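One IDE-agnostic alternative (also untested by me in IntelliJ) is the findspark package, which locates a Spark installation and adds it to sys.path for you. A minimal sketch, assuming the install path from this post:

import findspark

# Point findspark at SPARK_HOME; called with no argument, it falls back
# to the SPARK_HOME environment variable.
findspark.init(r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()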
Your blog is excellent and very useful to me. Many thanks for that. My warm regards to you.
Excellent blog with detailed explanation.
I am getting the below error while trying to run the pyspark program:
Traceback (most recent call last):
  File "C:\JavaSpark\WC\test\wc.py", line 8, in <module>
    spark = SparkSession\
  File "C:\spark\python\lib\pyspark.zip\pyspark\sql\session.py", line 169, in getOrCreate
  File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 310, in getOrCreate
  File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 115, in __init__
  File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 259, in _ensure_initialized
  File "C:\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", line 86, in launch_gateway
  File "C:\Users\PanySan\Anaconda3\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Users\PanySan\Anaconda3\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
I have Spark 2.1.3. I did not find any help when I searched the internet either, though many have reported this kind of issue.
Any help would be great.
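That FileNotFoundError is raised while PySpark tries to launch the JVM gateway, so it usually means spark-submit (found via SPARK_HOME), cmd.exe, or java cannot be located from your run configuration. A quick diagnostic sketch you could run first, just to see what the process can actually resolve:

import os
import shutil

# The gateway is launched through Spark's spark-submit script,
# which in turn needs cmd.exe and java to be resolvable.
print("SPARK_HOME :", os.environ.get("SPARK_HOME"))
print("cmd.exe    :", shutil.which("cmd"))
print("java       :", shutil.which("java"))

spark_home = os.environ.get("SPARK_HOME", "")
print("spark-submit.cmd exists:",
      os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd")))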
I too got the same error:
'cmd' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "C:\Users\c116262\workspace\pySparkTest\src\testpyspark.py", line 7, in <module>
    spark = SparkSession\
  File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 173, in getOrCreate
  File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 351, in getOrCreate
  File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 115, in __init__
  File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 300, in _ensure_initialized
  File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\java_gateway.py", line 93, in launch_gateway
Exception: Java gateway process exited before sending its port number
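The "'cmd' is not recognized" line is the clue here: the PATH your run configuration passes to the process is apparently missing C:\Windows\System32, so Windows cannot even start the shell that launches Spark. A one-line check (my best guess at the cause, not a confirmed fix):

import os
# System32 provides cmd.exe; if this prints False, add it to PATH in the run configuration
print(any("system32" in p.lower() for p in os.environ.get("PATH", "").split(";")))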
I am getting the below error when I am trying to run a PySpark program. Can you give me a solution for it?
'""C:\Program' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "D:\workspace\test\Test.py", line 2, in <module>
    spark = SparkSession\
  File "D:\spark-2.1.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 169, in getOrCreate
  File "D:\spark-2.1.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 310, in getOrCreate
  File "D:\spark-2.1.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 115, in __init__
  File "D:\spark-2.1.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 259, in _ensure_initialized
  File "D:\spark-2.1.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\java_gateway.py", line 95, in launch_gateway
Exception: Java gateway process exited before sending the driver its port number
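The '""C:\Program' fragment suggests your JDK lives under C:\Program Files, and the space in that path breaks the quoting in Spark 2.1.x's Windows launch scripts, a known problem with that generation of Spark. The usual workaround is to install the JDK to a path without spaces, or to point JAVA_HOME at the 8.3 short name before creating the session; a sketch with a hypothetical JDK folder:

import os

# PROGRA~1 is the 8.3 short name for "C:\Program Files";
# the JDK version below is hypothetical -- adjust to your actual install.
os.environ["JAVA_HOME"] = r"C:\PROGRA~1\Java\jdk1.8.0_131"

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()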