Dear Spark experts!
In one of our projects we face the challenge of computing several KPIs using the COUNT(DISTINCT X) operation.
In one example, our input has roughly 230 million rows. We then calculate many different KPIs, most of which follow this pattern:
COUNT(distinct CASE WHEN trx_date >= ADD_MONTHS(X, -3) THEN trx_amount END)
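For context, a minimal sketch of what such a query looks like in our case (the table name, grouping column, and the specific windows are illustrative, not our actual schema):

```sql
-- Several KPIs, each a distinct count over a different trailing window.
-- ref_date stands in for the X in the pattern above.
SELECT
  customer_id,
  COUNT(DISTINCT CASE WHEN trx_date >= ADD_MONTHS(ref_date, -3)  THEN trx_amount END) AS distinct_amounts_3m,
  COUNT(DISTINCT CASE WHEN trx_date >= ADD_MONTHS(ref_date, -6)  THEN trx_amount END) AS distinct_amounts_6m,
  COUNT(DISTINCT CASE WHEN trx_date >= ADD_MONTHS(ref_date, -12) THEN trx_amount END) AS distinct_amounts_12m
FROM transactions
GROUP BY customer_id;
```

Notably, if the Expand step emits one copy of each input row per distinct aggregate, the numbers would roughly line up with what we observe: 230 million rows times roughly ten such KPIs is in the 2 billion range.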
When looking at the Spark execution plan, we found that while computing these KPIs Spark performs an
Expand operation on the data, blowing it up to more than 2 billion rows.
Does anyone have experience with this kind of operation, and can anyone explain why Spark needs such an expansion?
Thank you in advance,