Amazon Redshift の redistribute と broadcast の違いについてのメモ。
- redistribute と broadcast(A copy of the entire table is broadcast to all the compute nodes) がある。
DS_DIST_OUTER
The outer table is redistributed.DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.DS_DIST_BOTH
Evaluating the query plan - Amazon Redshift
Both tables are redistributed.
- broadcast はテーブル全体または行セットを全ノードに転送すること
- distribute は行を他ノードに転送すること
クエリの概要へのクエリプランのマッピング - Amazon Redshift
Query plan operation Label field value Description DS_BCAST_INNER BCAST (broadcast) Broadcasts an entire table or some set of rows (such as a filtered set of rows from a table) to all nodes. DS_DIST_NONE
DS_DIST_ALL_NONE
DS_DIST_INNER
DS_DIST_ALL_INNER
DS_DIST_ALL_BOTHDIST (distribute) Distributes rows to nodes for parallel joining purposes or other parallel processing.
- 大きなファクトテーブルと小さいなディメンションテーブルを結合する際、ファクトテーブルの結合対象の行数分のディメンションテーブルレコードを転送するとファクトテーブルの結合対象行数の回数だけディメンションテーブルの対象行を転送すると効率が悪いため、ディメンションテーブル側の全行を転送する(broadcast)ということか。
ALL distribution replicates the entire table across all nodes. Each node holds a full copy of the table, eliminating data movement during query execution.
Use Case: This style is best suited for small, frequently accessed tables, such as lookup tables. Typical scenarios include:
Understanding Amazon Redshift Distribution Styles and Internal Architecture
- > Small dimension tables joined with large fact tables.
- > Queries requiring broadcast joins to avoid redistribution costs.
- Merge join は結合キーが分散キーかつ結合キーの場合のみのため、distribute や broadcast は発生しないと考えられる。
- Nested Loop
- The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some inequality joins.
- Hash Join and Hash
- Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table.
- Merge Join
- Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO system table.
Creating and interpreting a query plan - Amazon Redshift
- Spatial Join
- Typically a fast join based on proximity of spatial data, used for GEOMETRY and GEOGRAPHY data types.
Redistribution of both tables
In my example database, all of the tables are distributed on userid. But what if they weren’t? In particular, what if they had the EVEN distribution style (which is Redshift’s default)? In that case, you’d see a query plan that looks like this:
XN Limit -> XN HashAggregate -> XN Hash Left Join DS_DIST_BOTH Outer Dist Key: atc.userid Inner Dist Key: cc.userid Hash Cond: (("outer".userid)::text = ("inner".userid)::text) -> XN Seq Scan on add_to_cart atc -> XN Hash -> XN Seq Scan on checkout_complete ccI have seen cases where both tables were redistributed because the query was based on an alternate — but valid — set of join columns. These tend to be extremely rare, but if you find that you’re frequently doing such joins, the best solution is to create a second copy of the tables, distributed on that alternate key.