ablog

不器用で落着きのない技術者のメモ

Glue PySpark で Parquet ファイルの型を変換する

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="tpc-h_10gb_parquet",
    table_name="nation_tbl",
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("col0", "decimal", "col0", "decimal(10,2)"),
        ("col1", "string", "col1", "string"),
        ("col2", "long", "col2", "decimal(20,4)"),
        ("col3", "string", "col3", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.getSink(
    path="s3://redshift-downloads-ablog/TPC-H/2.18/parquet-10GB/sales-type-converted/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
S3bucket_node3.setCatalogInfo(
    catalogDatabase="tpc-h_10gb_parquet",
    catalogTableName="nation_table_type_converted2",
)
S3bucket_node3.setFormat("glueparquet")
S3bucket_node3.writeFrame(ApplyMapping_node2)
job.commit()
  • 元テーブル


  • 変換先のテーブル


  • Spark on EMR で参照
$ pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.           \
  appName("ExamplePySparkSubmitTask").  \
  config("spark.databricks.hive.metastore.glueCatalog.enabled", "true"). \
  enableHiveSupport(). \
  getOrCreate()
>>> sql("SELECT * FROM `tpc-h_10gb_parquet`.nation_table_type_converted2").show(1000)
+-----+--------------+------+--------------------+
| col0|          col1|  col2|                col3|
+-----+--------------+------+--------------------+
|14.00|          null|0.0000| pending excuses ...|
|22.00|        RUSSIA|3.0000| requests against...|
| 9.00|     INDONESIA|2.0000| slyly express as...|
|13.00|          null|4.0000|ic deposits are b...|
|15.00|          null|0.0000|rns. blithely bol...|
|23.00|UNITED KINGDOM|3.0000|eans boost carefu...|
| 3.00|          null|1.0000|eas hang ironic, ...|
| 0.00|       ALGERIA|0.0000| haggle. carefull...|
|17.00|          PERU|1.0000|platelets. blithe...|
|17.00|          null|1.0000|platelets. blithe...|
| 2.00|        BRAZIL|1.0000|y alongside of th...|
|13.00|        JORDAN|4.0000|ic deposits are b...|
|10.00|          IRAN|4.0000|efully alongside ...|
|20.00|  SAUDI ARABIA|4.0000|ts. silent reques...|
|24.00|          null|1.0000|y final packages....|
|11.00|          null|4.0000|nic deposits boos...|
|21.00|          null|2.0000|hely enticingly e...|
|24.00| UNITED STATES|1.0000|y final packages....|
| 8.00|         INDIA|2.0000|ss excuses cajole...|
|19.00|          null|3.0000|ular asymptotes a...|
| 7.00|          null|3.0000|l platelets. regu...|
| 1.00|     ARGENTINA|1.0000|al foxes promise ...|
| 8.00|          null|2.0000|ss excuses cajole...|
| 9.00|          null|2.0000| slyly express as...|
| 1.00|          null|1.0000|al foxes promise ...|
|16.00|    MOZAMBIQUE|0.0000|s. ironic, unusua...|
| 6.00|        FRANCE|3.0000|refully final req...|
|14.00|         KENYA|0.0000| pending excuses ...|
| 6.00|          null|3.0000|refully final req...|
| 5.00|          null|0.0000|ven packages wake...|
|16.00|          null|0.0000|s. ironic, unusua...|
| 3.00|        CANADA|1.0000|eas hang ironic, ...|
| 2.00|          null|1.0000|y alongside of th...|
| 5.00|      ETHIOPIA|0.0000|ven packages wake...|
|10.00|          null|4.0000|efully alongside ...|
|15.00|       MOROCCO|0.0000|rns. blithely bol...|
|21.00|       VIETNAM|2.0000|hely enticingly e...|
| 4.00|          null|4.0000|y above the caref...|
|19.00|       ROMANIA|3.0000|ular asymptotes a...|
|12.00|          null|2.0000|ously. final, exp...|
|18.00|          null|2.0000|c dependencies. f...|
| 7.00|       GERMANY|3.0000|l platelets. regu...|
|20.00|          null|4.0000|ts. silent reques...|
|11.00|          IRAQ|4.0000|nic deposits boos...|
| 0.00|          null|0.0000| haggle. carefull...|
|18.00|         CHINA|2.0000|c dependencies. f...|
| 4.00|         EGYPT|4.0000|y above the caref...|
|23.00|          null|3.0000|eans boost carefu...|
|22.00|          null|3.0000| requests against...|
|12.00|         JAPAN|2.0000|ously. final, exp...|
+-----+--------------+------+--------------------+
  • Redshift Spectrum で参照
dev awsuser 20230606_07:07:24 =# select * from spectrum_schema.nation_table_type_converted2;
 col0  |      col1      |  col2  |                                                        col3
-------+----------------+--------+--------------------------------------------------------------------------------------------------------------------
 23.00 |                | 3.0000 | eans boost carefully special requests. accounts are. carefull
 22.00 |                | 3.0000 |  requests against the platelets use never according to the quickly regular pint
 12.00 | JAPAN          | 2.0000 | ously. final, express gifts cajole a
 18.00 |                | 2.0000 | c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos
  7.00 | GERMANY        | 3.0000 | l platelets. regular accounts x-ray: unusual, regular acco
  9.00 | INDONESIA      | 2.0000 |  slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull
 13.00 |                | 4.0000 | ic deposits are blithely about the carefully regular pa
 10.00 | IRAN           | 4.0000 | efully alongside of the slyly final dependencies.
 20.00 | SAUDI ARABIA   | 4.0000 | ts. silent requests haggle. closely express packages sleep across the blithely
 24.00 |                | 1.0000 | y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be
 14.00 | KENYA          | 0.0000 |  pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t
  6.00 |                | 3.0000 | refully final requests. regular, ironi
  5.00 |                | 0.0000 | ven packages wake quickly. regu
  1.00 |                | 1.0000 | al foxes promise slyly according to the regular accounts. bold requests alon
 16.00 | MOZAMBIQUE     | 0.0000 | s. ironic, unusual asymptotes wake blithely r
  6.00 | FRANCE         | 3.0000 | refully final requests. regular, ironi
  3.00 |                | 1.0000 | eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold
  0.00 | ALGERIA        | 0.0000 |  haggle. carefully final deposits detect slyly agai
 17.00 | PERU           | 1.0000 | platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
  2.00 |                | 1.0000 | y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special
  5.00 | ETHIOPIA       | 0.0000 | ven packages wake quickly. regu
 10.00 |                | 4.0000 | efully alongside of the slyly final dependencies.
 15.00 | MOROCCO        | 0.0000 | rns. blithely bold courts among the closely regular packages use furiously bold platelets?
 17.00 |                | 1.0000 | platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
  2.00 | BRAZIL         | 1.0000 | y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special
 13.00 | JORDAN         | 4.0000 | ic deposits are blithely about the carefully regular pa
 20.00 |                | 4.0000 | ts. silent requests haggle. closely express packages sleep across the blithely
 11.00 | IRAQ           | 4.0000 | nic deposits boost atop the quickly final requests? quickly regula
 11.00 |                | 4.0000 | nic deposits boost atop the quickly final requests? quickly regula
 21.00 |                | 2.0000 | hely enticingly express accounts. even, final
 24.00 | UNITED STATES  | 1.0000 | y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be
 16.00 |                | 0.0000 | s. ironic, unusual asymptotes wake blithely r
  3.00 | CANADA         | 1.0000 | eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold
 14.00 |                | 0.0000 |  pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t
 22.00 | RUSSIA         | 3.0000 |  requests against the platelets use never according to the quickly regular pint
  1.00 | ARGENTINA      | 1.0000 | al foxes promise slyly according to the regular accounts. bold requests alon
  8.00 |                | 2.0000 | ss excuses cajole slyly across the packages. deposits print aroun
  9.00 |                | 2.0000 |  slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull
 21.00 | VIETNAM        | 2.0000 | hely enticingly express accounts. even, final
  4.00 |                | 4.0000 | y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d
  0.00 |                | 0.0000 |  haggle. carefully final deposits detect slyly agai
 18.00 | CHINA          | 2.0000 | c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos
  4.00 | EGYPT          | 4.0000 | y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d
 19.00 | ROMANIA        | 3.0000 | ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account
 12.00 |                | 2.0000 | ously. final, express gifts cajole a
 15.00 |                | 0.0000 | rns. blithely bold courts among the closely regular packages use furiously bold platelets?
 23.00 | UNITED KINGDOM | 3.0000 | eans boost carefully special requests. accounts are. carefull
  8.00 | INDIA          | 2.0000 | ss excuses cajole slyly across the packages. deposits print aroun
 19.00 |                | 3.0000 | ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account
  7.00 |                | 3.0000 | l platelets. regular accounts x-ray: unusual, regular acco
(50 rows)