ablog

不器用で落着きのない技術者のメモ

Spark で "Possibly unquoted identifier ... detected. Please consider quoting it with back-quotes as" と怒られる

ApacheSpark

事象

Spark SQL で Glue カタログのデータベース名（"_"や"-"を含む）を指定すると、"Possibly unquoted identifier ... detected. Please consider quoting it with back-quotes as" と怒られる。

$ pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.           \
  appName("ExamplePySparkSubmitTask").  \
  config("spark.databricks.hive.metastore.glueCatalog.enabled", "true"). \
  enableHiveSupport(). \
  getOrCreate()
>>> sql("SELECT count(*) FROM tpc-h_10gb_parquet.supplier_tbl").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException:
Possibly unquoted identifier tpc-h_10gb_parquet detected. Please consider quoting it with back-quotes as `tpc-h_10gb_parquet`(line 1, pos 24)

== SQL ==
SELECT count(*) FROM tpc-h_10gb_parquet.supplier_tbl
------------------------^^^

解決策

データベース名を `` で囲ってやる。

>>> sql("SELECT count(*) FROM `tpc-h_10gb_parquet`.supplier_tbl").show()
+--------+
|count(1)|
+--------+
|  100000|
+--------+

環境

Amazon EMR
- version: emr-6.10.0
- Installed applications: Spark 3.3.1, Zeppelin 0.10.1
- AWS Glue Data Catalog settings: Use for Spark table metadata
Parquet を Crawler で登録した Glue カタログを使用