Issue
- A Glue Spark job fails when reading Parquet through a dynamic_frame with "Unsupported encoding: DELTA_BINARY_PACKED".
Solution
- Set the following configuration:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Reference
From "scala - Write a parquet file with delta encoded coulmns" on Stack Overflow:

In order to generate the DELTA encoded parquet file in PySpark, we need to enable version 2 of the Parquet writer. This is the only way it works. Also, for some reason the setting only works when creating the spark context. The setting is:
"spark.hadoop.parquet.writer.version": "v2"
and the result is:
time: INT64 GZIP DO:0 FPO:11688 SZ:84010/2858560/34.03 VC:15043098 ENC:DELTA_BINARY_PACKED ST:[min: 1577715561210, max: 1577839907009, num_nulls: 0]
HOWEVER, one cannot read the same file back in PySpark as is, as you will get:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BINARY_PACKED
In order to read the file back, one needs to disable the following conf:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")