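Symptom
- When a Spark SQL query references a Glue catalog database whose name contains "_" or "-", Spark raises a ParseException: "Possibly unquoted identifier ... detected. Please consider quoting it with back-quotes as `...`". In this example the hyphen is the trigger, since underscores are valid in unquoted identifiers.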
$ pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder. \
...     appName("ExamplePySparkSubmitTask"). \
...     config("spark.databricks.hive.metastore.glueCatalog.enabled", "true"). \
...     enableHiveSupport(). \
...     getOrCreate()
>>> sql("SELECT count(*) FROM tpc-h_10gb_parquet.supplier_tbl").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException:
Possibly unquoted identifier tpc-h_10gb_parquet detected. Please consider quoting it with back-quotes as `tpc-h_10gb_parquet`(line 1, pos 24)
== SQL ==
SELECT count(*) FROM tpc-h_10gb_parquet.supplier_tbl
Solution
- Quote the database name with backticks.
>>> sql("SELECT count(*) FROM `tpc-h_10gb_parquet`.supplier_tbl").show()
+--------+
|count(1)|
+--------+
|  100000|
+--------+
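When database or table names come from variables rather than being typed by hand, the quoting can be applied programmatically. The following is a minimal sketch, not part of the original session: the helper names `quote_identifier` and `qualified_table` are hypothetical, and it assumes the usual Spark SQL rule that a backtick inside a quoted identifier is escaped by doubling it.

```python
def quote_identifier(name: str) -> str:
    """Wrap a Spark SQL identifier in backticks.

    Assumption: an embedded backtick is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def qualified_table(database: str, table: str) -> str:
    """Build a database.table reference with each part quoted separately."""
    return f"{quote_identifier(database)}.{quote_identifier(table)}"


# Hypothetical usage with the names from the session above.
query = f"SELECT count(*) FROM {qualified_table('tpc-h_10gb_parquet', 'supplier_tbl')}"
print(query)
```

In the pyspark shell, the resulting string could then be passed to `sql(query).show()`. Quoting each part separately matters: wrapping the whole `database.table` string in one pair of backticks would make Spark treat it as a single identifier that contains a dot.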
Environment
- Amazon EMR
  - version: emr-6.10.0
  - Installed applications: Spark 3.3.1, Zeppelin 0.10.1
  - AWS Glue Data Catalog settings: Use for Spark table metadata
- Uses a Glue catalog in which the Parquet data was registered by a Crawler