DAS-C01 Question #42: Real Exam Question with Answer & Explanation
The correct answer is A: In Apache ORC partitioned by date and sorted by source IP. Partitioning by date lets a query scan only the date partitions it needs instead of the entire dataset.
Question
A company that produces network devices has millions of users. Data is collected from the devices on an hourly basis and stored in an Amazon S3 data lake. The company runs analyses on the last 24 hours of data flow logs for abnormality detection and to troubleshoot and resolve user issues. The company also analyzes historical logs dating back 2 years to discover patterns and look for improvement opportunities. The data flow logs contain many metrics, such as date, timestamp, source IP, and target IP. There are about 10 billion events every day. How should this data be stored for optimal performance?
Options
- A. In Apache ORC partitioned by date and sorted by source IP
- B. In compressed .csv partitioned by date and sorted by source IP
- C. In Apache Parquet partitioned by source IP and sorted by date
- D. In compressed nested JSON partitioned by source IP and sorted by date
Explanation
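The data should be partitioned by date so that a query scans only the date partitions it needs rather than the entire dataset. The 24-hour analysis then touches only the most recent partitions, and the historical analysis can select any date range within the 2 years of logs. Partitioning by source IP (options C and D) does not help date-bounded queries, and with millions of users it would likely produce a very large number of small partitions. ORC is a columnar format, so a query that needs only a few of the many metrics can skip the other columns, which row-oriented .csv (option B) and JSON (option D) do not allow.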
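The effect of date partitioning on scan size can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket name, the `flow-logs` prefix, and the Hive-style `date=YYYY-MM-DD` layout are hypothetical and do not come from the question. The sketch lists which date partitions a 24-hour query has to read and compares that with the roughly 730 daily partitions that 2 years of logs would hold.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List

# Hypothetical data lake root; the real bucket and prefix are not given in the question.
ROOT = "s3://example-bucket/flow-logs"


def partition_prefix(day: date, root: str = ROOT) -> str:
    """Build an assumed Hive-style prefix for one date partition."""
    return f"{root}/date={day.isoformat()}/"


def partitions_for_window(end: datetime, hours: int = 24) -> List[date]:
    """Return every calendar date that overlaps the window ending at `end`."""
    start = end - timedelta(hours=hours)
    days = []
    current = start.date()
    while current <= end.date():
        days.append(current)
        current += timedelta(days=1)
    return days


# Example: a 24-hour query issued mid-morning spans two calendar dates.
end = datetime(2024, 3, 15, 10, 0, tzinfo=timezone.utc)
needed = partitions_for_window(end)
prefixes = [partition_prefix(d) for d in needed]

# Two years of daily partitions (ignoring leap days for simplicity).
total_partitions = 2 * 365
scan_fraction = len(needed) / total_partitions

for p in prefixes:
    print(p)
print(f"partitions scanned: {len(needed)} of {total_partitions}")
```

With the data partitioned by source IP instead, the same 24-hour query could not be narrowed to a handful of prefixes, because every source IP partition may contain events from the window.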