Well, sort of. Except for the “schedule new tasks” part.
You can think of the non-distributed mode as an hourglass. A Spark action, namely a collect(), funnels all the URLs from the input into the processor (not necessarily to the driver, though, but rather via the driver), which then performs the API calls sequentially within the processor’s execution thread. The responses are then “scheduled” to be parallelized again (see the Spark Programming Guide). This is not an immediate Spark action; the collection resides with the processor/driver until an action is triggered by a subsequent processor (e.g. a result table).
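The hourglass shape can be sketched in plain Python (no Spark), with `call_api` as a hypothetical stand-in for the real HTTP request; the names here are illustrative, not the processor’s actual internals:

```python
def call_api(url):
    # hypothetical stand-in for the real HTTP request
    return f"response for {url}"

def non_distributed_mode(urls):
    # ~ rdd.collect(): everything funnels through a single point first
    collected = list(urls)
    # sequential API calls within the processor's execution thread
    responses = [call_api(u) for u in collected]
    # in Spark, this list would now be handed back for parallelization:
    #   sc.parallelize(responses)  -- still lazy until a subsequent action
    return responses

print(non_distributed_mode(["u1", "u2", "u3"]))
```

The narrow waist of the hourglass is the `list(urls)` step: no matter how parallel the input was, every URL passes through one sequential loop.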
In distributed mode, the processor triggers no Spark action. Instead, the API call logic is added to the input RDD as an ordinary transformation (and yes, it’s an RDD from here on, not a DataFrame). This means that the API calls are only executed once a subsequent processor triggers a Spark action.
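The transformation-only behaviour can be mimicked with Python’s lazy `map`, which, like an RDD transformation, does nothing until something consumes it (again a conceptual sketch, with a hypothetical `call_api`):

```python
calls_made = []

def call_api(url):
    calls_made.append(url)          # record side effects to expose laziness
    return f"response for {url}"

urls = ["u1", "u2", "u3"]
lazy = map(call_api, urls)          # ~ rdd.map(call_api): a transformation, not an action
assert calls_made == []             # nothing has been executed yet
results = list(lazy)                # ~ a subsequent action (e.g. collect) triggers the calls
assert calls_made == urls           # only now have the API calls actually run
```

This is exactly why a downstream action is required: attaching the transformation alone sends no requests.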
Additional info for distributed mode: when not all the data gets materialized by the action (e.g. a result table with 400 rows while 500 different API calls enter the processor), chances are that not all calls are actually executed, but only as many as are needed to produce the 400 rows (due to Spark’s lazy evaluation). There is no way to control which calls are executed in this scenario. Therefore, in distributed mode, you have to make sure that an action materializes all the data whenever all calls should be executed. The most extreme example: leaving both output ports of the API processor unconnected will not trigger a single request in distributed mode, whereas in non-distributed mode all API calls are executed regardless of what happens to the output.
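The partial-materialization effect can be sketched the same way: take only 400 results from a lazy pipeline of 500 and the remaining calls simply never happen. (Real Spark evaluates per partition rather than per element, so the exact number of skipped calls varies, but the principle is the same; `call_api` is again a hypothetical stand-in.)

```python
from itertools import islice

calls_made = []

def call_api(url):
    calls_made.append(url)          # record which calls actually ran
    return f"response for {url}"

urls = [f"u{i}" for i in range(500)]
lazy = map(call_api, urls)             # lazy "transformation" over 500 URLs
first_400 = list(islice(lazy, 400))    # ~ an action that only needs 400 rows
assert len(calls_made) == 400          # the remaining 100 calls were never executed
```

To force all 500 calls, the sketch would have to consume the whole pipeline (`list(lazy)`), which corresponds to an action materializing all the data.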