Question

You are analyzing customer purchases in a Fabric notebook by using PySpark. You have the following DataFrames:

- transactions: Contains five columns named transaction_id, customer_id, product_id, amount, and date, and has 10 million rows, with each row representing a transaction.

- customers: Contains customer details in 1,000 rows and three columns named customer_id, name, and country.

You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling.

You write the following code.

from pyspark.sql import functions as F
results =

Which code should you run to populate the results DataFrame?

Answers

A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)

B. transactions.join(customers, transactions.customer_id == customers.customer_id).distinct()

C. transactions.join(customers, transactions.customer_id == customers.customer_id)

D. transactions.crossJoin(customers).where(transactions.customer_id == customers.customer_id)

Accepted Answer: A
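
The F.broadcast hint in option A asks Spark to copy the small customers DataFrame to every executor, so the 10-million-row transactions DataFrame can be joined in place rather than repartitioned by customer_id. The sketch below is a plain-Python model of that idea, not Spark's actual implementation: the function name broadcast_hash_join, the list-of-partitions layout, and the sample rows are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against one shared lookup."""
    # Build the lookup once from the small side. In a broadcast join this
    # table is what gets copied to each worker.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is processed where it already lives; rows of the
        # large side are never moved between partitions.
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined_partitions.append(out)
    return joined_partitions


# Made-up sample data standing in for the two DataFrames.
customers = [
    {"customer_id": 1, "name": "Ana", "country": "PT"},
    {"customer_id": 2, "name": "Bo", "country": "SE"},
]
transactions_partitions = [
    [{"transaction_id": 10, "customer_id": 1, "amount": 5.0}],
    [
        {"transaction_id": 11, "customer_id": 2, "amount": 7.5},
        {"transaction_id": 12, "customer_id": 3, "amount": 1.0},
    ],
]

results = broadcast_hash_join(transactions_partitions, customers, "customer_id")
```

In this model the output keeps the same number of partitions as the large input, which mirrors why a broadcast join avoids shuffling the transactions side. Options B, C, and D carry no broadcast hint, so they leave the join strategy to the optimizer, and the distinct() call in B typically introduces an additional shuffle of its own.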
