Question: importing pyspark in the Python shell

This is a copy of someone else’s question on another forum that was never answered, so I thought I’d re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.

However, when I run the regular Python shell and try to import pyspark modules, I get this error:

from pyspark import SparkContext

and it says

"No module named pyspark".

How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?


Answer 0

Here is a simple method (if you don't care how it works!).

Use findspark

  1. Install findspark with pip, then go to your Python shell

    pip install findspark
    
    import findspark
    findspark.init()
    
  2. import the necessary modules

    from pyspark import SparkContext
    from pyspark import SparkConf
    
  3. Done!!!
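
If SPARK_HOME is not set in your environment, findspark.init() also accepts the Spark directory explicitly. A minimal sketch, using the /spark/ installation path mentioned in the question as a placeholder:

import findspark
findspark.init("/spark")  # point this at your actual Spark root if SPARK_HOME is not exported

from pyspark import SparkContext, SparkConf

# Quick check that the gateway actually starts (local mode).
conf = SparkConf().setAppName("findspark-check").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()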


Answer 1

If it prints an error such as:

ImportError: No module named py4j.java_gateway

Please add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
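
As a quick sanity check (a sketch, assuming SPARK_HOME and PYTHONPATH are exported as above), both imports should now succeed from a plain Python shell:

import os

# Confirm the shell environment was inherited.
print(os.environ.get("SPARK_HOME"))

# The py4j import is the one that failed before $SPARK_HOME/python/build was added.
from py4j.java_gateway import JavaGateway
from pyspark import SparkContext
print("pyspark and py4j are importable")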

Answer 2

It turns out that the pyspark launcher loads Python and automatically sets up the correct library paths. Check out $SPARK_HOME/bin/pyspark:

# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added this line to my .bashrc file and the modules are now correctly found!


Answer 3

Don't run your .py file as python filename.py; instead use spark-submit filename.py.


Answer 4

By exporting the SPARK path and the Py4j path, it started to work:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH 
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

So, if you don't want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.


Answer 5

On Mac, I use Homebrew to install Spark (formula “apache-spark”). Then, I set the PYTHONPATH this way so the Python import works:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.2.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Replace "1.2.0" with the actual apache-spark version on your Mac.


Answer 6

For Spark execution in pyspark, two components are required to work together:

  • the pyspark Python package
  • a Spark instance in a JVM

When launching things with spark-submit or pyspark, these scripts take care of both, i.e. they set up your PYTHONPATH, PATH, etc., so that your script can find pyspark, and they also start the Spark instance, configuring it according to your params, e.g. --master X.

Alternatively, it is possible to bypass these scripts and run your Spark application directly in the Python interpreter, like python myscript.py. This is especially interesting when Spark scripts start to become more complex and eventually receive their own args.

  1. Ensure the pyspark package can be found by the Python interpreter. As already discussed, either add the spark/python dir to PYTHONPATH or install pyspark directly with pip install.
  2. Set the parameters of the Spark instance from your script (those that used to be passed to pyspark).
    • Spark configurations that you would normally set with --conf are defined with a config object (or string configs) in SparkSession.builder.config.
    • Main options (like --master or --driver-memory) can, for the moment, be set by writing to the PYSPARK_SUBMIT_ARGS environment variable. To make things cleaner and safer, you can set it from within Python itself, and Spark will read it when starting.
  3. Start the instance, which just requires you to call getOrCreate() on the builder object.

Your script can therefore have something like this:

import os

from pyspark.sql import SparkSession

# Main options that would otherwise go on the pyspark/spark-submit command
# line, e.g. "--master local[4]"; leave empty to use the defaults.
spark_main_opts = ""

if __name__ == "__main__":
    if spark_main_opts:
        # Set main options, e.g. "--master local[4]"
        os.environ['PYSPARK_SUBMIT_ARGS'] = spark_main_opts + " pyspark-shell"

    # Set spark config
    spark = (SparkSession.builder
             .config("spark.checkpoint.compress", True)
             .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
             .getOrCreate())
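
Once the session is up, a quick smoke test (not part of the original answer; it reuses the spark object created above) confirms that the JVM-backed instance is reachable from the plain Python interpreter:

# Hypothetical smoke test, reusing the spark session built above.
df = spark.range(5).toDF("n")
df.show()
print("Spark version:", spark.version)
spark.stop()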

Answer 7

To get rid of ImportError: No module named py4j.java_gateway, you need to add the following lines:

import os
import sys

# Point SPARK_HOME at the Spark installation; raw strings avoid trouble
# with backslashes in Windows paths.
os.environ['SPARK_HOME'] = r"D:\python\spark-1.4.1-bin-hadoop2.4"

# Make the pyspark package and the bundled Py4J zip importable.
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python")
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print("success")

except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

Answer 8

On Windows 10 the following worked for me. I added the following environment variables using Settings > Edit environment variables for your account:

SPARK_HOME=C:\Programming\spark-2.0.1-bin-hadoop2.7
PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%

(change “C:\Programming\…” to the folder in which you have installed spark)


Answer 9

For Linux users, the following is the correct (and non-hard-coded) way of including the pyspark library in PYTHONPATH. Both path parts are necessary:

  1. The path to the pyspark Python module itself, and
  2. The path to the zipped library that the pyspark module relies on when imported

Notice below that the zipped library version is dynamically determined, so we do not hard-code it.

export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
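
If you would rather set this up from inside a script than in .bashrc, a rough Python equivalent (a sketch, assuming SPARK_HOME is already exported) looks like this; the py4j zip version is still discovered dynamically rather than hard-coded:

import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]

# 1. The pyspark Python module itself.
sys.path.insert(0, os.path.join(spark_home, "python"))

# 2. The bundled py4j zip, whatever version ships with this Spark.
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])

from pyspark import SparkContext  # should now import cleanly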

Answer 10

I am running a Spark cluster on a CentOS VM, installed from the Cloudera yum packages.

I had to set the following variables to run pyspark.

export SPARK_HOME=/usr/lib/spark;
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Answer 11

export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

This is what I did to use my Anaconda distribution with Spark. It is independent of the Spark version. You can change the first line to your user's Python bin. Also, as of Spark 2.2.0, PySpark is available as a standalone package on PyPI, but I have yet to test it out.


Answer 12

You can get the pyspark installation path using pip (if you have installed pyspark with pip), as shown below:

pip show pyspark
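
A related check from inside Python (a sketch, not from the original answer): importlib can tell you whether the interpreter you are currently running can see pyspark at all, and where it would load it from:

import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark is not visible to this interpreter")
else:
    print("pyspark found at:", spec.origin)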

Answer 13

I had the same problem.

Also make sure you are using the right Python version and installing it with the matching pip version. In my case I had both Python 2.7 and 3.x, and I installed pyspark with

pip2.7 install pyspark

and it worked.
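
To see which interpreter you are actually in (and therefore which pip the install had to go through), a one-off check like this helps:

import sys

# pyspark must be installed for this exact interpreter,
# e.g. with: <path shown below> -m pip install pyspark
print(sys.executable)
print(sys.version)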


Answer 14

I got this error because the Python script I was trying to submit was called pyspark.py (facepalm). The fix was to set my PYTHONPATH as recommended above, then rename the script to pyspark_test.py and clean up the pyspark.pyc that had been created based on my script's original name. That cleared this error up.


Answer 15

In the case of DSE (DataStax Cassandra & Spark), the following location needs to be added to PYTHONPATH:

export PYTHONPATH=/usr/share/dse/resources/spark/python:$PYTHONPATH

Then use dse pyspark to get the modules on the path:

dse pyspark

Answer 16

I had this same problem and would add one thing to the proposed solutions above. When using Homebrew on Mac OS X to install Spark, you will need to correct the py4j path to include libexec (remembering to change the py4j version to the one you have):

PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Answer 17

In my case pyspark was getting installed into a different Python dist-packages directory (Python 3.5) whereas I was using Python 3.6, so the following helped:

python -m pip install pyspark

Answer 18

You can also create a Docker container with Alpine as the OS and install Python and PySpark as packages. That will have it all containerised.

