A memo on writing tab-separated CSV with no double quotes from PySpark on Glue.
How to do it
- Specify the following in format_options of glueContext.write_dynamic_frame:
- "quoteChar": -1 turns quoting off
- "separator": "\t" makes the output tab-separated
Example
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://dl-sfdc-dm/test/no_quote_test"},
    format = "csv",
    format_options = {"quoteChar": -1, "separator": "\t"},
    transformation_ctx = "datasink2")
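For comparison, the same unquoted, tab-separated output can be produced with the plain Spark DataFrame writer; Spark's CSV documentation says that setting the quote option to an empty string turns quoting off. A minimal sketch, assuming the applymapping1 DynamicFrame from the example above (the output path is illustrative):

# Sketch: equivalent output via the Spark DataFrame writer.
# An empty "quote" option disables quoting (per Spark's CSV docs).
df = applymapping1.toDF()  # DynamicFrame -> Spark DataFrame
(df.write
    .option("sep", "\t")   # tab-separated fields
    .option("quote", "")   # no quoting at all
    .csv("s3://dl-sfdc-dm/test/no_quote_test_df"))  # illustrative path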
Reference
Format Options for ETL Inputs and Outputs in AWS Glue - AWS Glue

format="csv"
This value designates comma-separated-values as the data format (for example, see RFC 4180 and RFC 7111).
You can use the following format_options values with format="csv":
- separator — Specifies the delimiter character. The default is a comma: ",", but any other character can be specified.
- escaper — Specifies a character to use for escaping. This option is used only when reading CSV files. The default value is none. If enabled, the character which immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).
- quoteChar — Specifies the character to use for quoting. The default is a double quote: '"'. Set this to -1 to turn off quoting entirely.
- multiLine — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. The default value is False, which allows for more aggressive file-splitting during parsing.
- withHeader — A Boolean value that specifies whether to treat the first line as a header. The default value is False. This option can be used in the DynamicFrameReader class.
- writeHeader — A Boolean value that specifies whether to write the header to output. The default value is True. This option can be used in the DynamicFrameWriter class.
- skipFirst — A Boolean value that specifies whether to skip the first data line. The default value is False.
The following example shows how to specify the format options within an AWS Glue ETL job script.
glueContext.write_dynamic_frame.from_options(
    frame = datasource1,
    connection_type = "s3",
    connection_options = {"path": "s3://s3path"},
    format = "csv",
    format_options = {"quoteChar": -1, "separator": "|"},
    transformation_ctx = "datasink2")
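The write examples above don't exercise the read-side options (escaper, withHeader, multiLine). As a hedged sketch, reading tab-separated files with a header row might look like the following; the S3 path and transformation_ctx are made up for illustration:

# Sketch: read-side format_options with create_dynamic_frame.from_options.
# The bucket path and transformation_ctx below are illustrative.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://example-bucket/input/"]},
    format = "csv",
    format_options = {
        "withHeader": True,   # treat the first line as a header
        "separator": "\t",    # tab-separated input
        "escaper": "\\"       # backslash escaping; only used when reading
    },
    transformation_ctx = "datasource")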
Full code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## @type: DataSource
## @args: [database = "default", table_name = "no_quote_test", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "default", table_name = "no_quote_test", transformation_ctx = "datasource0")

## @type: ApplyMapping
## @args: [mapping = [("col0", "string", "col0", "string"), ("col1", "string", "col1", "string"), ("col2", "string", "col2", "string"), ("col3", "string", "col3", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("col0", "string", "col0", "string"), ("col1", "string", "col1", "string"), ("col2", "string", "col2", "string"), ("col3", "string", "col3", "string")], transformation_ctx = "applymapping1")

## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://dl-sfdc-dm/test/no_quote_test"}, format = "csv", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = applymapping1]
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://dl-sfdc-dm/test/no_quote_test"}, format = "csv", format_options = {"quoteChar": -1, "separator": "\t"}, transformation_ctx = "datasink2")

job.commit()
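To sanity-check the result, the written files can be read back with the plain Spark reader inside the same job and inspected for quotes and tabs. A minimal sketch, assuming the job wrote at least one multi-column record:

# Sketch: read the output back and confirm it is unquoted and tab-separated.
out = spark.read.text("s3://dl-sfdc-dm/test/no_quote_test/")
for row in out.take(5):
    line = row.value
    assert '"' not in line   # no double quotes anywhere in the line
    assert "\t" in line      # fields separated by tabs (multi-column rows)
    print(line)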