ablog

不器用で落着きのない技術者のメモ

Amazon Redshift の redistribute と broadcast

Amazon Redshift の redistribute と broadcast の違いについてのメモ。

  • redistribute と broadcast(A copy of the entire table is broadcast to all the compute nodes) がある。

DS_DIST_OUTER
The outer table is redistributed.

DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.

DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.

DS_DIST_BOTH
Both tables are redistributed.

Evaluating the query plan - Amazon Redshift
  • broadcast はテーブル全体または行セットを全ノードに転送すること
  • distribute は行を他ノードに転送すること
Query plan operation Label field value Description
DS_BCAST_INNER BCAST (broadcast) Broadcasts an entire table or some set of rows (such as a filtered set of rows from a table) to all nodes.
DS_DIST_NONE
DS_DIST_ALL_NONE
DS_DIST_INNER
DS_DIST_ALL_INNER
DS_DIST_ALL_BOTH
DIST (distribute) Distributes rows to nodes for parallel joining purposes or other parallel processing.
クエリの概要へのクエリプランのマッピング - Amazon Redshift
  • 大きなファクトテーブルと小さいなディメンションテーブルを結合する際、ファクトテーブルの結合対象の行数分のディメンションテーブルレコードを転送するとファクトテーブルの結合対象行数の回数だけディメンションテーブルの対象行を転送すると効率が悪いため、ディメンションテーブル側の全行を転送する(broadcast)ということか。

ALL distribution replicates the entire table across all nodes. Each node holds a full copy of the table, eliminating data movement during query execution.

Use Case: This style is best suited for small, frequently accessed tables, such as lookup tables. Typical scenarios include:

  • > Small dimension tables joined with large fact tables.
  • > Queries requiring broadcast joins to avoid redistribution costs.
Understanding Amazon Redshift Distribution Styles and Internal Architecture
  • Merge join は結合キーが分散キーかつ結合キーの場合のみのため、distribute や broadcast は発生しないと考えられる。
  • Nested Loop
    • The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some inequality joins.
  • Hash Join and Hash
    • Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table.
  • Merge Join
    • Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO system table.
  • Spatial Join
    • Typically a fast join based on proximity of spatial data, used for GEOMETRY and GEOGRAPHY data types.
Creating and interpreting a query plan - Amazon Redshift

Redistribution of both tables

In my example database, all of the tables are distributed on userid. But what if they weren’t? In particular, what if they had the EVEN distribution style (which is Redshift’s default)? In that case, you’d see a query plan that looks like this:

XN Limit
   ->  XN HashAggregate
       ->  XN Hash Left Join DS_DIST_BOTH
           Outer Dist Key: atc.userid
           Inner Dist Key: cc.userid
           Hash Cond: (("outer".userid)::text = ("inner".userid)::text)
           ->  XN Seq Scan on add_to_cart atc
           ->  XN Hash
               ->  XN Seq Scan on checkout_complete cc

I have seen cases where both tables were redistributed because the query was based on an alternate — but valid — set of join columns. These tend to be extremely rare, but if you find that you’re frequently doing such joins, the best solution is to create a second copy of the tables, distributed on that alternate key.