Do virtual data tables provide any kind of performance benefit in Workflows, i.e. loading them compared to loading the whole data table then and applying the SQL query to the data?
The documentation does not state a performance benefit and I suspect there is none, but maybe I am missing something.
You are right. There is in general no performance gain. VDTs are just stored requests for data that get evaluated newly every time when you want to read them. There is no caching.
The answer is (as for usual): It depends.
Take a scenario where the physical data is stored inside our internal PostgreSQL DB and on top, there is a VDT.
When using the VDT in a WF, he whole VDT-query is pushed down to the database in Spark’s JDBCRelation (i.e., the query is executed on the source, the results are loaded into the WF). If the query is optimized for (or faster to execute on) the source system, you can gain drastic speed improvements. However, it depends on the nature of the query since Spark itself is already capable of pushing down certain operations to the source system automatically (column selection, filters). So if your query only contains filters, you probably won’t notice any difference. If it’s a more complex query (Self-Joins, group-by, etc), there’s a chance to gain execution speed since Spark currently cannot push down these elements to the source.
For PARQUET-backed VDTs, we are always using Spark as query engine and there should not be any difference. There for sure also ware situations, where executing the queries on the DB source system can be less efficient (when queries are not optimized for performance or when the source system is simply not that powerful). Then, loading the data into the Spark execution context and doing the query execution there can be faster.
Anyway, VDTs in their current form alweays are computed on the fly (be it by Spark or the source system), so they will always cause computational load somewhere, everytime they are accessed.
Fun fact: Spark is also capable of pushing down additional operations to the source system when using a VDT. This means, that you don’t lose any optimization done by Spark when using a VDT in your WF.
On top of the answers from Flogge and Stefan: From my perspective, the major benefits of VDTs are
a1) reducing redundancy - you can apply transformations on a datatable without having to execute a function or workflow, and the data in it is always up-to-date (if the underlying table is up-to-date)
a2) reducing reduncancy - you can push down and reuse transformations that you would otherwise implement in apps to the source datatable
b) data hiding - you can easily remove columns or filter out rows that you don’t want to show to specific user groups