ablog

Notes from a clumsy, restless engineer

Running PySpark code fails with "ValueError: Cannot run multiple SparkContexts at once; existing SparkContext"

Symptom

Running the following PySpark code in Zeppelin on an AWS Glue development endpoint fails with "ValueError: Cannot run multiple SparkContexts at once; existing SparkContext".

  • Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SQLContext
from pyspark.sql.functions import year, month, date_format

sc = SparkContext()  # this constructor call triggers the error below
glueContext = GlueContext(sc)
spark = glueContext.spark_session 
job = Job(glueContext)
job.init('sh10sales_parquet')
  • Error message
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-302704721498299847.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-302704721498299847.py", line 355, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 9, in <module>
  File "/usr/lib/spark/python/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/usr/lib/spark/python/pyspark/context.py", line 299, in _ensure_initialized
    callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=Zeppelin, master=yarn-client) created by __init__ at /tmp/zeppelin_pyspark-302704721498299847.py:278 
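
The cause is visible in the message itself: the Zeppelin interpreter has already created a SparkContext (app=Zeppelin, master=yarn-client), and PySpark allows only one active SparkContext per process, so the second constructor call is rejected. A minimal sketch that reproduces the same error in any plain PySpark session:

from pyspark.context import SparkContext

sc1 = SparkContext()  # first context: succeeds
sc2 = SparkContext()  # second context: raises
                      # ValueError: Cannot run multiple SparkContexts at once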

Solution

  • "sc = SparkContext()"を"sc = SparkContext.getOrCreate()"に書き換える。
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SQLContext
from pyspark.sql.functions import year, month, date_format

sc = SparkContext.getOrCreate()  # ★ changed here
glueContext = GlueContext(sc)
spark = glueContext.spark_session 
job = Job(glueContext)
job.init('sh10sales_parquet')
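
This works because SparkContext.getOrCreate() hands back the SparkContext that already exists in the process (here, the one Zeppelin created) and only constructs a new one when none exists, so the single-context check is never tripped. A small sketch of that behavior:

from pyspark.context import SparkContext

# Repeated calls return the same singleton instead of
# trying to construct a second context.
sc1 = SparkContext.getOrCreate()
sc2 = SparkContext.getOrCreate()
assert sc1 is sc2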

Reference

This happens because when you type "pyspark" in the terminal, the system automatically initializes a SparkContext for you, so you have to stop the existing one before creating a new one.

You can use

sc.stop()

before you create your new SparkContext.

Also, you can use

sc = SparkContext.getOrCreate()

instead of

sc = SparkContext()

I am new to Spark and don't know much about the parameters of SparkContext(), but both of the approaches shown above worked for me.

python - multiple SparkContexts error in tutorial - Stack Overflow
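
For completeness, here is a sketch of the sc.stop() approach from the quoted answer: stop whatever context the notebook already started, then build a fresh one. The app name below just reuses the job name from this post as an example; any SparkConf settings would go there.

from pyspark import SparkConf
from pyspark.context import SparkContext

# Stop the context the notebook already started...
SparkContext.getOrCreate().stop()

# ...then create a new one with our own configuration.
conf = SparkConf().setAppName('sh10sales_parquet')
sc = SparkContext(conf=conf)

Note that in Zeppelin this discards the interpreter's original context, so the GlueContext and SparkSession have to be rebuilt on top of the new sc; for the Glue notebook case, getOrCreate() is the simpler fix.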