ablog

Notes from a clumsy, restless engineer

Reading Parquet from a dynamic_frame in a Glue Spark job fails with "Unsupported encoding: DELTA_BINARY_PACKED"

Symptom

  • Reading Parquet from a dynamic_frame in a Glue Spark job fails with "Unsupported encoding: DELTA_BINARY_PACKED".

Solution

  • Set the following:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

Reference

In order to generate a DELTA-encoded Parquet file in PySpark, we need to enable version 2 of the Parquet writer. This is the only way it works. Also, for some reason the setting only works when creating the Spark context. The setting is:

"spark.hadoop.parquet.writer.version": "v2"

and the result is:

time:         INT64 GZIP DO:0 FPO:11688 SZ:84010/2858560/34.03 VC:15043098 ENC:DELTA_BINARY_PACKED ST:[min: 1577715561210, max: 1577839907009, num_nulls: 0]

HOWEVER, one cannot read the same file back in PySpark as is; you will get

java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BINARY_PACKED

In order to read the file back, one needs to disable the following conf:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
scala - Write a parquet file with delta encoded columns - Stack Overflow
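Putting the quoted answer together, a rough end-to-end sketch in PySpark; the data and the output path are made up for illustration.

from pyspark.sql import SparkSession

# As the answer notes, the writer version only takes effect when the
# session/context is created, not when set afterwards.
spark = (
    SparkSession.builder
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

# Hypothetical INT64 column; with the v2 writer it comes out DELTA_BINARY_PACKED
df = spark.range(10**6).withColumnRenamed("id", "time")
df.write.mode("overwrite").parquet("/tmp/delta_encoded")

# Reading it back with the default vectorized reader raises
# UnsupportedOperationException, so disable it first
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/delta_encoded").show(3)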