DEA-C01 Question #163: Real Exam Question with Answer & Explanation

The correct answers are A and C: configure the third-party application to create the files in a columnar format, and partition the order data in the S3 bucket based on order date.

Data Store Management

Question

A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket. The company ingests retail order data into the S3 bucket every day. The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size. The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries. Which combination of steps will meet this requirement with LEAST developmental effort? (Choose two.)

Options

  • A. Configure the third-party application to create the files in a columnar format.
  • B. Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.
  • C. Partition the order data in the S3 bucket based on order date.
  • D. Configure the third-party application to create the files in JSON format.
  • E. Load the JSON data into the Amazon Redshift table in a SUPER type column.

Explanation

Option A (correct): A columnar format such as Parquet or ORC improves query performance in Redshift Spectrum. Redshift Spectrum queries data directly in S3, and with a columnar format it reads only the columns a query selects instead of every column in each row. This reduces I/O and the amount of data scanned, which matters for a dataset with more than 100 columns when queries select only a subset of them.

Option C (correct): Partitioning the data in S3 by order date lets Redshift Spectrum prune partitions that a query does not need. Because users aggregate metrics based on daily orders, queries that filter on order date scan only the relevant partitions instead of all the data stored at a single path.

Option B (incorrect): Combining the daily CSV files into one file per day reduces the number of files and might improve performance slightly, but it requires developing an AWS Glue ETL job and does not address the data format or the lack of partitioning, which have a much bigger impact on performance.

Option D (incorrect): JSON is not an efficient format for large-scale analytics and tends to perform worse than columnar formats. Columnar formats such as Parquet or ORC are preferable in this scenario.

Option E (incorrect): The SUPER type is useful for semi-structured data in Redshift, but it does not address the query performance problem in this scenario, where columnar formats and partitioning provide more benefit.
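To make the combined effect of a columnar format and date partitioning concrete, the following is a rough back-of-the-envelope sketch, not a model of how Redshift Spectrum actually plans or bills a query. The function name `estimated_scan_mb` is hypothetical, and the inputs are assumptions chosen for illustration: 30 files per day at about 60 MB each, 100 equally sized columns, 5 columns selected, one year of data stored, and a query over a single day. Compression and file metadata are ignored.

```python
# Illustrative estimate of data read from S3 by one query.
# Every number here is an assumption for illustration, not a measurement.

def estimated_scan_mb(
    daily_mb: float,
    days_stored: int,
    days_queried: int,
    total_columns: int,
    columns_selected: int,
    columnar: bool,
    partitioned: bool,
) -> float:
    """Return an approximate number of MB scanned for one query."""
    # Without partitions on order date, every stored day must be scanned.
    days_scanned = days_queried if partitioned else days_stored
    # Row-oriented CSV forces whole rows to be read; a columnar format
    # lets the engine read only the selected columns.
    # Assumes all columns are the same size and ignores compression.
    column_fraction = columns_selected / total_columns if columnar else 1.0
    return daily_mb * days_scanned * column_fraction


# Assumed workload: 30 files/day at ~60 MB (midpoint of 50-70 MB).
workload = dict(
    daily_mb=30 * 60,
    days_stored=365,
    days_queried=1,
    total_columns=100,
    columns_selected=5,
)

csv_single_path = estimated_scan_mb(**workload, columnar=False, partitioned=False)
columnar_only = estimated_scan_mb(**workload, columnar=True, partitioned=False)
partitioned_only = estimated_scan_mb(**workload, columnar=False, partitioned=True)
columnar_partitioned = estimated_scan_mb(**workload, columnar=True, partitioned=True)

for label, mb in [
    ("CSV, single path", csv_single_path),
    ("Columnar only", columnar_only),
    ("Partitioned only", partitioned_only),
    ("Columnar + partitioned", columnar_partitioned),
]:
    print(f"{label:<24} ~{mb:,.0f} MB scanned")
```

Under these assumptions, each change alone cuts the scanned volume substantially, and the two together cut it by several orders of magnitude compared with CSV files at a single path. Real savings depend on column sizes, compression, and how queries filter on order date.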

Topics

#Redshift Spectrum #Data Partitioning #Columnar Storage #Query Optimization
