ablog

不器用で落着きのない技術者のメモ

Amazon Redshift でゾーンマップがあってもソートキーが必要な理由

Amazon Redshift はゾーンマップで 1MB ブロックの最小値と最大値を保持しているので、クエリのフィルタ条件などに合致しないブロックを読み飛ばすことができる。ソートキーでソートされていないと同じ値のデータがあちこちのブロックに散らばるため、ソートされてない場合より読取りブロックが増える。

Will sorting improve zone maps?
A zone map exists for each 1 MB block, and consists of in-memory metadata that tracks the minimum and maximum values within the block. This metadata is accessed before a disk scan in order to identify which blocks are relevant to the query. For an unsorted column, these min-max ranges can overlap from block to block, reducing the effectiveness of these zone maps. In almost every case, a sort key improves the effectiveness of zone maps, by physically ordering the data so values increase in ascending order through each column (and block) within each slice. However, there are some cases where a SORTKEY won’t improve zone maps:

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compound and Interleaved Sort Keys | AWS Big Data Blog

ゾーンマップの情報はスーパーブロックに格納されている。

A table’s data is partitioned into data slices and stored as logical chains of blocks. Each data block is described by its block header (e.g., identity, table ownership and slice information) and indexed via an in-memory construct, called superblock. The superblock is
an indexing structure similar to many filesystems. Queries reach the relevant data blocks by using zone maps to scan the superblock. The superblock also contains query tracking information for data blocks owned by live queries.

Amazon Redshift Re-invented