# Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

*Published March 20, 2023*
In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg tables using the native support for those data lake formats. This native support simplifies reading and writing data in these data lake frameworks so you can more easily build and maintain your data lakes in a transactionally consistent manner. It also removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs.

These data lake frameworks help you store data more efficiently and enable applications to access your data faster. Unlike simple data file formats such as Apache Parquet, CSV, and JSON, which can store big data, data lake frameworks organize distributed big data files into tabular structures that provide basic database constructs on data lakes.

Expanding on the functionality we announced at AWS re:Invent 2022, AWS Glue now natively supports Hudi, Delta Lake, and Iceberg through the AWS Glue Studio visual editor. If you prefer authoring AWS Glue for Apache Spark jobs using a visual tool, you can now choose any of these three data lake frameworks as a source or target through a graphical user interface (GUI), without writing custom code.

Even without prior experience using Hudi, Delta Lake, or Iceberg, you can easily achieve typical use cases. In this post, we demonstrate how to ingest data stored in Hudi format using the AWS Glue Studio visual editor.

To demonstrate the visual editor experience, this post uses the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see the Registry of Open Data on AWS.
You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.

## Example scenario

The Amazon S3 location `s3://noaa-ghcn-pds/csv/by_year/` has all the observations from 1763 to the present organized in CSV files, one file for each year. The following block shows an example of what the records look like:

```
ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AEM00041194,20220101,TAVG,211,H,,S,
AEM00041217,20220101,TAVG,209,H,,S,
AEM00041218,20220101,TAVG,207,H,,S,
AE000041196,20220102,TAVG,226,H,,S,
...
AE000041196,20221231,TMAX,243,,,S,
AE000041196,20221231,PRCP,0,D,,S,
AE000041196,20221231,TAVG,202,H,,S,
```

The records have fields including ID, DATE, ELEMENT, and more. Each combination of `ID`, `DATE`, and `ELEMENT` represents a unique record in this dataset. For example, the record with `ID` as `AE000041196`, `ELEMENT` as `TAVG`, and `DATE` as `20220101` is unique.

In this tutorial, we assume that the files are updated with new records every day, and we want to store only the latest record per primary key (`ID` and `ELEMENT`) so that the latest snapshot of the data is queryable. One typical approach is to INSERT all the historical data and compute the latest records at query time; however, this introduces additional overhead in every query. When you want to analyze only the latest records, it's better to do an UPSERT (update and insert) based on the primary key and the `DATE` field, rather than a plain INSERT, in order to avoid duplicates and maintain a single updated row of data.

## Prerequisites

To continue this tutorial, you need to create the following AWS resources in advance:

- An AWS Identity and Access Management (IAM) role for your AWS Glue job
- An S3 bucket for storing the output data

## Process a Hudi dataset on the AWS Glue Studio visual editor

Let's author an AWS Glue job to read the daily records for 2022 and write the latest snapshot into a Hudi table on your S3 bucket using UPSERT. Complete the following steps:

1. Open AWS Glue Studio and create a new job in the visual editor.

A new visual job configuration appears. The next step is to configure the data source to read an example dataset:

1. Choose Amazon S3 as the source.
2. For the S3 source location, enter `s3://noaa-ghcn-pds/csv/by_year/2022.csv`.

The data source is configured.

The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket:

1. Choose Amazon S3 as the target, with Apache Hudi as the format.
2. For the Hudi table name, enter `ghcn`.
3. For the Hudi record key field, choose `ID`.
4. For the Hudi precombine key field, choose `DATE`.
5. For the S3 target location, enter `s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/`. (Provide your S3 bucket name and prefix.)

To make it easy to discover the sample data, and also to make it queryable from Athena, configure the job to create a table definition on the AWS Glue Data Catalog:

1. For the database, choose `hudi_native`.
2. For the table name, enter `ghcn`.
3. For the partition key, choose `ELEMENT`.

Now your data integration job is completely authored in the visual editor. Let's add one remaining setting, the IAM role, then run the job:

1. Under Job details, for IAM Role, choose your IAM role.
2. Save and run the job.

## Query the table with Athena

Now that the job has successfully created the Hudi table, you can query the table through different engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.

To query through Athena, complete the following steps:

1. On the Athena console, run the following query:

```sql
SELECT * FROM "hudi_native"."ghcn" LIMIT 10;
```

The following screenshot shows the query result.

2. Let's dive deep into the table to understand how the data is ingested. To focus on the records with `ID='AE000041196'`, run the following query:

```sql
SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';
```

The following screenshot shows the query result. The original source file `2022.csv` has historical records for `ID='AE000041196'` from `20220101` to `20221231`; however, the query result shows only four records, one per `ELEMENT`, each at the latest snapshot day of `20221230` or `20221231`. This is because we wrote the data with the UPSERT option and configured the `ID` field as the Hudi record key field, the `DATE` field as the Hudi precombine field, and the `ELEMENT` field as the partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested the data, it compared all the values in the `DATE` field for each pair of `ID` and `ELEMENT`, and then kept the record with the largest `DATE` value.

According to the preceding result, we were able to ingest the latest snapshot from all the 2022 data. Now let's do an UPSERT of the new 2023 data to overwrite the records on the target Hudi table:

1. On the Amazon S3 source node, change the S3 location to `s3://noaa-ghcn-pds/csv/by_year/2023.csv`, then save and run the job.
2. Run the same Athena query again.

Now you see that the four records have been updated with the new records in 2023.

If you have further future records, this approach works well to upsert new records based on the Hudi record key and the Hudi precombine key.

## Clean up

Now to the final step, cleaning up the resources:

1. Delete the table `ghcn` from the AWS Glue database `hudi_native`.
2. Delete the database `hudi_native`.
3. Delete the S3 objects under `s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn2022/`.

## Conclusion

This post demonstrated how to process Hudi datasets using the AWS Glue Studio visual editor. The AWS Glue Studio visual editor enables you to author jobs that take advantage of data lake formats without needing expertise in them. If you have comments or feedback, please feel free to leave them in the comments section.
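The snapshot logic described in this scenario, keeping one row per `ID` and `ELEMENT` pair and letting the largest `DATE` win, can be sketched in plain Python. This is an illustrative model of the UPSERT outcome, not Hudi's actual implementation; the row values are taken from the sample above:

```python
# Illustrative sketch: keep only the latest record per (ID, ELEMENT),
# mimicking an UPSERT keyed on ID + ELEMENT with DATE as the tiebreaker.
# This models the desired end state, not Hudi's internals.

def upsert_latest(snapshot, records):
    """Merge records into snapshot, keeping the row with the largest
    DATE for each (ID, ELEMENT) key."""
    for rec in records:
        key = (rec["ID"], rec["ELEMENT"])
        current = snapshot.get(key)
        # GHCN dates are YYYYMMDD strings, so string comparison matches
        # chronological ordering.
        if current is None or rec["DATE"] > current["DATE"]:
            snapshot[key] = rec
    return snapshot

# Sample rows shaped like the GHCN-D CSV shown above.
rows_2022 = [
    {"ID": "AE000041196", "DATE": "20220101", "ELEMENT": "TAVG", "DATA_VALUE": 204},
    {"ID": "AE000041196", "DATE": "20220102", "ELEMENT": "TAVG", "DATA_VALUE": 226},
    {"ID": "AE000041196", "DATE": "20221231", "ELEMENT": "TAVG", "DATA_VALUE": 202},
]

snapshot = upsert_latest({}, rows_2022)
print(snapshot[("AE000041196", "TAVG")]["DATE"])  # prints 20221231
```

Running the same function later with 2023 rows would replace these entries in place, which is exactly the upsert behavior the tutorial relies on.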
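For readers who prefer code, the target configuration above maps naturally onto standard Apache Hudi write options. The option keys below are regular Hudi datasource options; the exact set that AWS Glue Studio generates under the hood may differ, so treat this as a sketch:

```python
# Sketch of Hudi write options matching the visual editor settings above.
# These keys are standard Apache Hudi datasource options; the options the
# visual editor actually emits may differ.
hudi_options = {
    "hoodie.table.name": "ghcn",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "ID",
    "hoodie.datasource.write.precombine.field": "DATE",
    "hoodie.datasource.write.partitionpath.field": "ELEMENT",
    # Sync the table definition to the AWS Glue Data Catalog
    # (Hive-compatible metastore sync).
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "hudi_native",
    "hoodie.datasource.hive_sync.table": "ghcn",
}

# In a Spark job you would pass these to the DataFrame writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append") \
#     .save("s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/")
```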
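The precombine rule described above, where the largest precombine value survives among rows sharing a record key within a partition, is equivalent to the following query-time reduction, demonstrated here with Python's built-in sqlite3 module on a few toy rows modeled on the sample data:

```python
import sqlite3

# Demonstrates, with SQLite on toy rows, the reduction Hudi performs at
# write time: one surviving row per (ID, ELEMENT), namely the MAX(DATE) row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ghcn (ID TEXT, DATE TEXT, ELEMENT TEXT, DATA_VALUE INT)"
)
conn.executemany(
    "INSERT INTO ghcn VALUES (?, ?, ?, ?)",
    [
        ("AE000041196", "20220101", "TAVG", 204),
        ("AE000041196", "20221231", "TAVG", 202),
        ("AE000041196", "20221231", "TMAX", 243),
        ("AE000041196", "20221231", "PRCP", 0),
    ],
)

# With a single MAX() aggregate, SQLite takes the bare columns from the
# row where the maximum was found, so DATA_VALUE comes from the max-DATE row.
rows = conn.execute(
    """SELECT ID, MAX(DATE) AS DATE, ELEMENT, DATA_VALUE
       FROM ghcn GROUP BY ID, ELEMENT"""
).fetchall()
```

The two `TAVG` rows collapse into the `20221231` one, leaving three rows, one per `ELEMENT`, which mirrors the four-rows-per-station result observed in the Athena query above.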
### About the authors

**Noritaka Sekiyama** is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his new road bike.

**Scott Long** is a Front End Engineer on the AWS Glue team. He is responsible for implementing new features in AWS Glue Studio. In his spare time, he enjoys socializing with friends and participating in various outdoor activities.

**Sean Ma** is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.