PySpark – dev set up – Eclipse – Windows

October 4, 2017

For this example, we will set up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon

Prerequisites

  1. Python 3.5
  2. JRE 8
  3. JDK 1.8
  4. Eclipse plugins: PyDev
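
Before moving on, it can help to confirm the Python and Java versions match the list above; a minimal check script, assuming java.exe is already on your PATH:

    import subprocess
    import sys

    # Python version PyDev will use (expecting 3.5.x).
    print(sys.version)

    # JVM version (expecting 1.8.x); note that java -version prints to stderr.
    subprocess.run(["java", "-version"], check=True)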

Steps to set up:

  1. Download Spark from https://spark.apache.org/downloads.html
    1. Choose a Spark release: 2.1.0
    2. Choose a package type: Pre-built for Apache Hadoop 2.6
    3. Download the resulting package (spark-2.1.0-bin-hadoop2.6.tgz) and extract it to C:\Users\Public\Spark_Dev_set_up


  2. Download winutils.exe
    Download it from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and copy it to C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils\bin
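
    A quick way to confirm winutils.exe landed in the right place (using the folder chosen above):

    import os

    winutils = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils\bin\winutils.exe"
    print(os.path.exists(winutils))  # should print True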
  3. Final folder structure

    
    C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6
        bin/
        conf/
        data/
        examples/
        jars/
        licenses/
        python/
        R/
        sbin/
        winutils/
            bin/
                winutils.exe
        yarn/
        LICENSE
        NOTICE
        README.md
        RELEASE
    
  4. In Eclipse, set the environment variables:
    Windows -> Preferences -> PyDev -> Python Interpreter -> Environment

    
    Variable: SPARK_HOME
    Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6
    
    Variable: HADOOP_HOME 
    Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils
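
    If you prefer not to depend on the Eclipse preference page, the same variables can be set at the top of your script instead; a minimal sketch (it must run before the SparkSession is created):

    import os

    # Equivalent to the Eclipse interpreter environment settings above.
    os.environ["SPARK_HOME"] = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6"
    os.environ["HADOOP_HOME"] = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils"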
    


  5. In Eclipse, add the Spark libraries to PYTHONPATH:
    
    Windows -> Preferences -> PyDev -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\pyspark.zip
    
    Windows -> Preferences -> PyDev -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip
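
    Alternatively, the two zips can be added to sys.path at runtime rather than through the PyDev Libraries page; a minimal sketch with the same effect:

    import sys

    spark_python = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python"
    # Same libraries as registered under New Egg/Zip(s) above.
    sys.path.insert(0, spark_python + r"\lib\pyspark.zip")
    sys.path.insert(0, spark_python + r"\lib\py4j-0.10.4-src.zip")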
    

  6. How to use it:

    In Eclipse, create a new Python project:

    File -> New -> PyDev Project -> name it “sample”


    Note: If you don’t see a Python interpreter configured, click “Click here to configure an interpreter not listed” -> Quick Auto-Config

  7. Create sample program

    Right-click on the project “sample” -> New PyDev Module -> test1.py

    
    # Create (or reuse) a SparkSession, the entry point for Spark.
    from pyspark.sql import SparkSession
    spark = SparkSession\
            .builder\
            .appName("sample")\
            .getOrCreate()

    # Get the underlying SparkContext and build a small RDD.
    sc = spark.sparkContext
    myRdd = sc.parallelize([1, 2, 3, 4])
    print(myRdd.take(5))  # take(5) returns at most 5 elements: [1, 2, 3, 4]
    
  8. Run the program

    Right-click the program -> Run As -> Python Run
    Program output: [1, 2, 3, 4]
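
    If the RDD test works, the DataFrame API should also work through the same session; a minimal follow-up sketch to confirm:

    from pyspark.sql import SparkSession

    # Reuse (or create) the session, then try the DataFrame API.
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()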

5 thoughts on “PySpark – dev set up – Eclipse – Windows”

    1. admin (post author)

      I haven’t tried it in IntelliJ, but I guess if you add the spark folders within the same project, it should auto-detect.

  1. S Baskara Vishnu

    Your blog is excellent and has been very useful to me. Many thanks for that. My warm regards to you.

  2. Pradeep

    Excellent blog with detailed explanation.
    I am getting the following error while trying to run the PySpark program:

    Traceback (most recent call last):
      File "C:\JavaSpark\WC\test\wc.py", line 8, in <module>
        spark = SparkSession\
      File "C:\spark\python\lib\pyspark.zip\pyspark\sql\session.py", line 169, in getOrCreate
      File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 310, in getOrCreate
      File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 115, in __init__
      File "C:\spark\python\lib\pyspark.zip\pyspark\context.py", line 259, in _ensure_initialized
      File "C:\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", line 86, in launch_gateway
      File "C:\Users\PanySan\Anaconda3\lib\subprocess.py", line 709, in __init__
        restore_signals, start_new_session)
      File "C:\Users\PanySan\Anaconda3\lib\subprocess.py", line 997, in _execute_child
        startupinfo)
    FileNotFoundError: [WinError 2] The system cannot find the file specified

    I have Spark 2.1.3.
    I did not find any help when I searched the internet either, though many have reported this kind of issue.

    Any help would be great.

  3. naveen

    I too got the same error:

    'cmd' is not recognized as an internal or external command,
    operable program or batch file.
    Traceback (most recent call last):
      File "C:\Users\c116262\workspace\pySparkTest\src\testpyspark.py", line 7, in <module>
        spark = SparkSession\
      File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 173, in getOrCreate
      File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 351, in getOrCreate
      File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 115, in __init__
      File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 300, in _ensure_initialized
      File "C:\Users\c116262\Documents\Work\scalasdk\spark-2.3.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\java_gateway.py", line 93, in launch_gateway
    Exception: Java gateway process exited before sending its port number

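
Both tracebacks above fail inside launch_gateway, which starts the JVM as a subprocess; on Windows this usually means java.exe or cmd.exe cannot be found on the PATH of the interpreter PyDev is using. A minimal sketch to check which executables are visible:

    import shutil

    # None means the executable is not on the PATH seen by this interpreter.
    for exe in ("java", "cmd"):
        print(exe, "->", shutil.which(exe))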
