ablog

Notes from a clumsy, restless engineer

On decompressing large gzip files in Glue jobs

Notes

What could be the problem?
The first thing I looked at was whether the compression type for the data was the problem. GZip is a non splittable compression type, so it is likely the excess time is from uncompression of the data.
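
As a small illustration of the "non splittable" point in the quote above, here is a minimal Python sketch. It is not taken from the quoted article; `can_decode_from` is a hypothetical helper name, and the sample data is made up. It shows that a gzip decoder can start at the beginning of the stream but not at an arbitrary byte offset in the middle, which is why a single worker typically ends up reading the whole file sequentially.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start decoding at `offset`.

    Hypothetical helper for illustration only.
    """
    # wbits=31 tells zlib to expect a gzip header at the start of the input.
    decoder = zlib.decompressobj(31)
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # Without the header (and the preceding stream state) the decoder
        # cannot make sense of bytes taken from the middle of the file.
        return False


# Made-up sample data: one number per line, compressed as a single gzip stream.
sample = "\n".join(str(i) for i in range(100_000)).encode()
blob = gzip.compress(sample)

print(can_decode_from(blob, 0))               # start of the stream
print(can_decode_from(blob, len(blob) // 2))  # arbitrary offset in the middle
```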


(omitted)

A solution to our problem was to either uncompress gzip files using S3 event hooks, prior to them being processed with Glue, or to use smaller GZip files to get over these performance barriers.

The end solution was to use the small GZip files, as it had the least disruption on the existing process, and also meant that the transfer to S3 was quicker.

Fixing slow performance issues with AWS Glue ETL jobs | beardy.digital
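
The "smaller GZip files" approach from the quoted article could be sketched roughly as follows. This is only a sketch under stated assumptions, not the article's actual implementation: it assumes the input is line-oriented text (for example CSV or JSON Lines), and `split_gzip`, `lines_per_part`, and the `part-00000.gz` naming are hypothetical choices made for illustration. It uses only the Python standard library and re-packs one large gzip file into several smaller ones on local disk.

```python
import gzip
from pathlib import Path
from typing import List


def split_gzip(src: str, out_dir: str, lines_per_part: int = 1_000_000) -> List[Path]:
    """Re-pack one large line-oriented gzip file into several smaller gzip files.

    Hypothetical helper: the name, parameters, and output naming are assumptions.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    writer = None
    try:
        # Stream the input line by line so the whole file never sits in memory.
        with gzip.open(src, "rb") as reader:
            for i, line in enumerate(reader):
                if i % lines_per_part == 0:
                    # Close the previous part (if any) and start a new one.
                    if writer is not None:
                        writer.close()
                    part = out / f"part-{len(parts):05d}.gz"
                    parts.append(part)
                    writer = gzip.open(part, "wb")
                writer.write(line)
    finally:
        if writer is not None:
            writer.close()
    return parts
```

Each output file is an independent gzip stream, so the parts can be read in parallel by separate workers. Uploading the parts to S3 is left out of this sketch.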