Symptom
Running the following PySpark code in Zeppelin on AWS Glue fails with "ValueError: Cannot run multiple SparkContexts at once; existing SparkContext".
- Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SQLContext
from pyspark.sql.functions import year, month, date_format

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init('sh10sales_parquet')
- Error message
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-302704721498299847.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-302704721498299847.py", line 355, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 9, in <module>
  File "/usr/lib/spark/python/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/usr/lib/spark/python/pyspark/context.py", line 299, in _ensure_initialized
    callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=Zeppelin, master=yarn-client) created by __init__ at /tmp/zeppelin_pyspark-302704721498299847.py:278
Solution
- Replace "sc = SparkContext()" with "sc = SparkContext.getOrCreate()".
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SQLContext
from pyspark.sql.functions import year, month, date_format

sc = SparkContext.getOrCreate()  # ★ changed here
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init('sh10sales_parquet')
Reference
This happens because when you type "pyspark" in the terminal, the system automatically initializes a SparkContext (maybe an object?), so you should stop it before creating a new one.

You can use

sc.stop()

before you create your new SparkContext.

Also, you can use

sc = SparkContext.getOrCreate()

instead of

sc = SparkContext()

I am new to Spark and I don't know much about the meaning of the parameters of the function SparkContext(), but both pieces of code shown above worked for me.
python - multiple SparkContexts error in tutorial - Stack Overflow
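Both workarounds quoted above rely on the same fact: Spark allows at most one active SparkContext per process, and Zeppelin has already created one before your paragraph runs. The singleton behavior can be illustrated with a minimal sketch in plain Python; the Context class below is a hypothetical stand-in for illustration, not the real pyspark API:

```python
# Sketch of the one-active-instance pattern that SparkContext uses
# (hypothetical class, not pyspark itself).
class Context:
    _active = None  # the single allowed instance per process

    def __init__(self):
        if Context._active is not None:
            # mirrors pyspark's "Cannot run multiple SparkContexts at once"
            raise ValueError("Cannot run multiple contexts at once")
        Context._active = self

    @classmethod
    def getOrCreate(cls):
        # return the existing instance if one is active, otherwise create one
        return cls._active if cls._active is not None else cls()

    def stop(self):
        # release the singleton so a new context may be created
        Context._active = None


first = Context()             # Zeppelin has already created a context
try:
    Context()                 # a second constructor call fails
except ValueError as e:
    print(e)                  # "Cannot run multiple contexts at once"
same = Context.getOrCreate()  # fix 1: reuse the existing context
print(same is first)          # True
first.stop()                  # fix 2: stop the existing context first
fresh = Context()             # now a new context is allowed
print(fresh is first)         # False
```

In a Glue Zeppelin notebook, getOrCreate() is usually the safer of the two fixes, since it reuses the interpreter's existing context instead of tearing it down.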