ablog

不器用で落着きのない技術者のメモ

筋子からイクラの醤油漬けの作り方

調味液
  • 濃く作った一番だし(3割)
  • 醤油(5割)
  • 日本酒(1割強)
  • みりん(1割弱)
作り方のポイント
  • 軽く煮切って、氷水で冷やして掃除が終わったいくらに合わせる。
  • 調味液を入れる量は、いくらに対してひたひたからちょっと控えるぐらい。

Redshift ユーザーによって COPY/UNLOAD でアクセスできる S3 のフォルダ(プレフィックス)を分ける

Redshift ユーザーによって COPY/UNLOAD でアクセスできる S3 のフォルダ(プレフィックス)を分ける方法をメモ。

実現すること

  • Redshift ユーザー "rs_user_a" は S3 パス "s3://s3-for-redshift/rs_user_a/" 以下のみに COPY/UNLOAD でアクセスできる。
  • Redshift ユーザー "rs_user_b" は S3 パス "s3://s3-for-redshift/rs_user_b/" 以下のみに COPY/UNLOAD でアクセスできる。

検証結果

Redshift ユーザー "rs_user_a" は S3 パス "s3:///rs_user_a/" 以下のみに COPY/UNLOAD でアクセスできる。
  • rs_user_a ユーザーで Redshift にログインする。
$ psql "host=redshift-cluster-poc-central.********.ap-northeast-1.redshift.amazonaws.com user=rs_user_a dbname=dev port=5439" 
  • 許可している S3 パスには COPY/UNLOAD できる。
dev=> unload ('select * from public.customer')
dev-> to 's3://s3-for-redshift/rs_user_a/customer' 
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
dev-> header
dev-> format csv
dev-> gzip
dev-> parallel off
dev-> allowoverwrite;
INFO:  UNLOAD completed, 10125500 record(s) unloaded successfully.
UNLOAD

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;
INFO:  Load into table 'customer' completed, 10125500 record(s) loaded successfully.
COPY
  • 許可していない S3 パスには COPY/UNLOAD できない。
dev=> unload ('select * from public.customer')
dev-> to 's3://s3-for-redshift/rs_user_b/customer' 
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
dev-> header
dev-> format csv
dev-> gzip
dev-> parallel off
dev-> allowoverwrite;
ERROR:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic
DETAIL:  
  -----------------------------------------------
  error:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic
  code:      8001
  context:   Failed to initialize S3 output stream. S3 path: s3://s3-for-redshift/rs_user_b/customer000.gz
  query:     35041819
  location:  s3_text_unloader.cpp:369
  process:   padbmaster [pid=1073750164]
  -----------------------------------------------

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;

ERROR:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic
DETAIL:  
  -----------------------------------------------
  error:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_a/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_b/customer000.gz" because no identity-based polic
  code:      8001
  context:   S3 key being read : s3://s3-for-redshift/rs_user_b/customer000.gz
  query:     35482475
  location:  copy_s3_scanner.cpp:344
  process:   query1_252_35482475 [pid=15960]
  -----------------------------------------------
Redshift ユーザー "rs_user_b" は S3 パス "s3:///rs_user_b/" 以下のみに COPY/UNLOAD でアクセスできる。
  • rs_user_b ユーザーで Redshift にログインする。
$ psql "host=redshift-cluster-poc-central.********.ap-northeast-1.redshift.amazonaws.com user=rs_user_b dbname=dev port=5439" 
  • 許可している S3 パスには COPY/UNLOAD できる。
dev=> unload ('select * from public.customer')
dev-> to 's3://s3-for-redshift/rs_user_b/customer' 
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
dev-> header
dev-> format csv
dev-> gzip
dev-> parallel off
dev-> allowoverwrite;
INFO:  UNLOAD completed, 20251000 record(s) unloaded successfully.
UNLOAD

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;
INFO:  Load into table 'customer' completed, 20251000 record(s) loaded successfully.
COPY
  • 許可していない S3 パスには COPY/UNLOAD できない。
dev=> unload ('select * from public.customer')
dev-> to 's3://s3-for-redshift/rs_user_a/customer' 
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
dev-> header
dev-> format csv
dev-> gzip
dev-> parallel off
dev-> allowoverwrite;

ERROR:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic
DETAIL:  
  -----------------------------------------------
  error:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic
  code:      8001
  context:   Failed to initialize S3 output stream. S3 path: s3://s3-for-redshift/rs_user_a/customer000.gz
  query:     35483157
  location:  s3_text_unloader.cpp:369
  process:   padbmaster [pid=1073988131]
  -----------------------------------------------

dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;

ERROR:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic
DETAIL:  
  -----------------------------------------------
  error:  S3ServiceException:User: arn:aws:sts::123456789012:assumed-role/iam-role_for_rs_role_b/RedshiftIamRoleSession is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::s3-for-redshift/rs_user_a/customer000.gz" because no identity-based polic
  code:      8001
  context:   S3 key being read : s3://s3-for-redshift/rs_user_a/customer000.gz
  query:     35483161
  location:  copy_s3_scanner.cpp:344
  process:   query1_252_35483161 [pid=15960]
  -----------------------------------------------
GRANT されていない IAM ロールには ASSUMEROLE できない
  • rs_user_a
$ psql "host=redshift-cluster-poc-central.ceyg6jv96hfq.ap-northeast-1.redshift.amazonaws.com user=rs_user_a dbname=dev port=5439" 
dev=> copy public.customer from 's3://s3-for-redshift/rs_user_b/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;
ERROR:  User rs_user_a does not have ASSUMEROLE permission on IAM role "arn:aws:iam::123456789012:role/iam-role_for_rs_role_b" for COPY
  • rs_user_b
$ psql "host=redshift-cluster-poc-central.ceyg6jv96hfq.ap-northeast-1.redshift.amazonaws.com user=rs_user_b dbname=dev port=5439" 
dev=> copy public.customer from 's3://s3-for-redshift/rs_user_a/customer000.gz'
dev-> iam_role 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
dev-> format csv
dev-> gzip compupdate off region 'ap-northeast-1'
dev-> ignoreheader 1;
ERROR:  User rs_user_b does not have ASSUMEROLE permission on IAM role "arn:aws:iam::123456789012:role/iam-role_for_rs_role_a" for COPY

設定手順

  • 対象の Redshift クラスターで以下を1度実行する。
    • これを実行しないと grant されてない IAM ロールに AssumeRole できてしまう。
revoke assumerole on all from public for all;
ASSUMEROLE のための IAMロール "iam-role_for_rs_role_a" を作成する。
  • iam-policy_for_rs_role_a
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket"
			],
			"Resource": [
				"arn:aws:s3:::*"
			]
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::s3-for-redshift/rs_user_a/",
				"arn:aws:s3:::s3-for-redshift/rs_user_a/*"
			]
		}
	]
}
  • 信頼関係
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
ASSUMEROLE のための IAMロール "iam-role_for_rs_role_b" を作成する。
  • iam-policy_for_rs_role_b
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket"
			],
			"Resource": [
				"arn:aws:s3:::*"
			]
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::s3-for-redshift/rs_user_b/",
				"arn:aws:s3:::s3-for-redshift/rs_user_b/*"
			]
		}
	]
}
  • 信頼関係
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
  • 作成した IAM ロールを Redshift にアタッチする。


  • Redshift ユーザーを作成し、PUBLIC スキーマのテーブルへのアクセス権を付与する。
CREATE USER rs_user_a PASSWORD 'Password123!';
CREATE USER rs_user_b PASSWORD 'Password123!';
grant usage on schema public to rs_user_a, rs_user_b;
grant all on all tables in schema public to rs_user_a, rs_user_b;
  • Redshift ユーザーが ASSUMEROLE できるよう権限を GRANT する。
GRANT ASSUMEROLE
       ON 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_a'
       TO rs_user_a
       FOR COPY, UNLOAD;

GRANT ASSUMEROLE
       ON 'arn:aws:iam::123456789012:role/iam-role_for_rs_role_b'
       TO rs_user_b
       FOR COPY, UNLOAD;
設定したロールを確認する
  • rs_user_a
dev=# select pg_get_iam_role_by_user('rs_user_a');
                            pg_get_iam_role_by_user                            
-------------------------------------------------------------------------------
 (rs_user_a,all,None)
 (rs_user_a,arn:aws:iam::123456789012:role/iam-role_for_rs_role_a,COPY|UNLOAD)
(2 rows)
  • rs_user_b
dev=# select pg_get_iam_role_by_user('rs_user_b');
                            pg_get_iam_role_by_user                            
-------------------------------------------------------------------------------
 (rs_user_b,all,None)
 (rs_user_b,arn:aws:iam::123456789012:role/iam-role_for_rs_role_b,COPY|UNLOAD)
(2 rows)

補足

  • スーパーユーザーで以下を実行すると "revoke assumerole on all from public for all" を元に戻せる。また、"revoke assumerole on all from public for all" を手順の最後に実行するとそれまでは grant していない IAM ロールにも AssumeRole できる。
grant assumerole on all to public for all;

参考

ASSUMEROLE アクセス許可の付与
指定されたロールを持つユーザーおよびグループに付与される ASSUMEROLE アクセス許可の構文を次に示します。ASSUMEROLE 権限の使用を開始する際は、「ASSUMEROLE アクセス許可を付与するための使用上の注意事項」を参照してください。

GRANT ASSUMEROLE
       ON { 'iam_role' [, ...] | default | ALL }
       TO { username | ROLE role_name | GROUP group_name | PUBLIC } [, ...]
       FOR { ALL | COPY | UNLOAD | EXTERNAL FUNCTION | CREATE MODEL } [, ...]
GRANT - Amazon Redshift

Redshift で UNLOAD した時のファイル suffix の命名規則

Redshift で UNLOAD した時のファイル suffix の命名規則

ネーミングルール

  • CSV
    • 圧縮なし: 000
    • gzip: 000.gz
    • bzip2: 000.bz2
    • zstd: 000.zst
  • Parquet
    • 000.parquet

検証手順

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
csv
 parallel off;


unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
csv gzip
 parallel off;

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
csv zstd
 parallel off;

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
csv bzip2
 parallel off;
 
 
unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
parquet
parallel off;


unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
json gzip
parallel off;

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
json bzip2
parallel off;

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
json zstd
parallel off;

unload ('select * from public.customer')
to 's3://test-rs-copy-bucket/unload/customer_'
allowoverwrite
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess'
json 
parallel off;

Redshift の COPY コマンドでロード時に無効な UTF-8 文字を自動的に置換する

COPY コマンドで Redshift にデータロード時に無効な UTF-8 文字がある場合、ACCEPTINVCHARS で固定の文字列に置換することができる。

ACCEPTINVCHARS [AS] ['replacement_char']
データに無効な UTF-8 文字がある場合でも、VARCHAR 列へのデータのロードを有効にします。ACCEPTINVCHARS を指定した場合、COPY は replacement_char で指定されている文字列から構成される同じ長さの文字列で、無効な各 UTF-8 文字を置き換えます。たとえば、置換文字が '^' である場合、無効な 3 バイト文字は '^^^' で置き換えられます。

置換文字には NULL 以外の任意の ASCII 文字を使用できます。デフォルトは疑問符 (?) です。無効な UTF-8 文字の詳細については、「マルチバイト文字のロードエラー」を参照してください。

COPY は無効な UTF-8 文字を含んだ行の数を返し、対象行ごとに STL_REPLACEMENTS システムテーブルにエントリを追加します (各ノードスライスで最大 100 行まで)。さらに多くの無効な UTF-8 文字も置き換えられますが、それらの置換イベントは記録されません。

ACCEPTINVCHARS を指定しなかった場合、無効な UTF-8 文字があるごとに、COPY はエラーを返します。

ACCEPTINVCHARS は VARCHAR 列に対してのみ有効です。

データ変換パラメータ - Amazon Redshift

Redshift で COPY コマンド実行時に svl_query_metrics_summary や sys_query_history に情報が記録されるか

Redshift で COPY コマンド実行時に svl_query_metrics_summary や sys_query_history に情報が記録されることを確認したメモ。

結果

dev=# copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv
dev-# iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess';
INFO:  Load into table 'customer' completed, 9000000 record(s) loaded successfully.
COPY
Time: 8538.877 ms (00:08.539)
dev=# select pg_last_query_id();
 pg_last_query_id 
------------------
         33416134
(1 row)

Time: 3.421 ms
dev=# \gset
Time: 3.452 ms
dev=# \x
Expanded display is on.
dev=# select * from svl_query_metrics_summary where query = :pg_last_query_id;
-[ RECORD 1 ]--------------+-----------------------------------------------------------------
userid                     | 100
query                      | 33416134
service_class              | 102
query_cpu_time             | 14
query_blocks_read          | 160
query_execution_time       | 8
query_cpu_usage_percent    | 40.07
query_temp_blocks_to_disk  | 
segment_execution_time     | 3
cpu_skew                   | 1.91
io_skew                    | 1.00
scan_row_count             | 8850432
join_row_count             | 
nested_loop_join_row_count | 
return_row_count           | 
spectrum_scan_row_count    | 
spectrum_scan_size_mb      | 
query_queue_time           | 
service_class_name         | Default queue                                                   

Time: 682.883 ms
dev=# select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id);
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------
user_id          | 100
query_id         | 33416133
query_label      | default
transaction_id   | 93744987
session_id       | 1073889466
database_name    | dev
query_type       | COPY
status           | success   
result_cache_hit | f
start_time       | 2024-09-06 00:04:43.957806
end_time         | 2024-09-06 00:04:51.574099
elapsed_time     | 7616293
queue_time       | 0
execution_time   | 7491272
error_message    | 
returned_rows    | 0
returned_bytes   | 0
query_text       | copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv\niam_role '';
redshift_version | 1.0.73348                       
usage_limit      | 
compute_type     | primary
compile_time     | 6539232
planning_time    | 0
lock_wait_time   | 36

Time: 790.980 ms

手順

create table public.customer 
(
  c_custkey      integer not null,
  c_name         varchar(25) not null,
  c_address      varchar(25) not null,
  c_city         varchar(10) not null,
  c_nation       varchar(15) not null,
  c_region       varchar(12) not null,
  c_phone        varchar(15) not null,
  c_mktsegment   varchar(10) not null
)
diststyle even
compound sortkey(c_nation,c_region);

copy public.customer from 's3://test-rs-copy-bucket/csv/customer/' csv
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-s3-fullaccess';

select pg_last_query_id();
\gset

\x
select * from svl_query_metrics_summary where query = :pg_last_query_id;
select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id);

Redshift の STL テーブルと SYS ビュー

Redshift の STL テーブルのクエリID(query列)と SYS ビューのクエリID(query_id)は別の値のため、クエリIDで結合することはできない。トランザクションID(STLとSYSのトランザクションIDが 1:1 の場合)で結合することができる。

dev=# \timing on
dev=# select count(distinct(lo_orderkey)) from lineorder;
   count
-----------
 150000000
(1 row)

Time: 13282.862 ms (00:13.283)
dev=# select pg_last_query_id();
 pg_last_query_id
------------------
         33269125
(1 row)

Time: 3.632 ms
dev=# \gset
Time: 3.703 ms
dev=# \x
Expanded display is on.
dev=# select * from sys_query_history where transaction_id = (select xid from stl_query where query = :pg_last_query_id);
-[ RECORD 1 ]----+----------------------------------------------------
user_id          | 100
query_id         | 33269123
query_label      | default
transaction_id   | 93596628
session_id       | 1073922183
database_name    | dev
query_type       | SELECT
status           | success
result_cache_hit | f
start_time       | 2024-09-05 04:25:48.409507
end_time         | 2024-09-05 04:26:01.689166
elapsed_time     | 13279659
queue_time       | 0
execution_time   | 13263519
error_message    |
returned_rows    | 1
returned_bytes   | 15
query_text       | select count(distinct(lo_orderkey)) from lineorder;
redshift_version | 1.0.73348
usage_limit      |
compute_type     | primary
compile_time     | 155
planning_time    | 6009
lock_wait_time   | 22

Time: 378.407 ms

参考

SYS monitoring views such as such as SYS_QUERY_HISTORY and SYS_QUERY_DETAIL contain the query_id column, which holds the identifier for users’ queries. Similarly, provisioned-only views such as STL_QUERY and SVL_QLOG contain the query column, which also holds the query identifiers. However, the query identifiers recorded in the SYS system views are different from those recorded in the provisioned-only views.

The difference between the SYS views’ query_id column values and the provisioned-only views’ query column values is as follows:

In SYS views, the query_id column records user-submitted queries in their original form. The Amazon Redshift optimizer might break them down into child queries for improved performance, but a single query you run will still only have a single row in SYS_QUERY_HISTORY. If you want to see the individual child queries, you can find them in SYS_QUERY_DETAIL.

In provisioned-only views, the query column records queries at the child query level. If the Amazon Redshift optimizer rewrites your original query into multiple child queries, there will be multiple rows in STL_QUERY with differing query identifier values for a single query you run.

When you migrate your monitoring and diagnostic queries from provisioned-only views to SYS views, consider this difference and edit your queries accordingly. For more information on how Amazon Redshift processes queries, see Query planning and execution workflow.

System tables and views reference - Amazon Redshift