Predicate Pushdown
Vector to Spark
If a SQL query is issued in Vector against an EXTERNAL TABLE that uses Spark, predicates from the WHERE clause can already be evaluated by Spark, reducing the number of tuples sent to Vector. Currently, however, this works only under the following conditions (see the sketch after the list):
• It is implemented for all Vector column data types except "time with time zone" and "timestamp with time zone."
• Supported predicates are <, <=, =, >=, and >. LIKE is currently not supported. Only predicates that can be translated into a column value range are supported.
• For logical combinations of simple predicates, only AND is supported. If the complex predicate contains even a single OR, nothing is pushed down to Spark; in that case, Spark transfers all tuples to Vector and the filtering is done solely on the Vector side.
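The following minimal sketch illustrates these rules. The external table name ext_orders and its columns are assumptions used only for illustration; a matching definition sketch follows the next paragraph.

-- Assumed external table: ext_orders(order_id INTEGER, region VARCHAR(10), amount FLOAT8)

-- AND of simple range predicates: both conditions can be evaluated by Spark,
-- so only the matching tuples are sent to Vector.
SELECT order_id, amount
FROM   ext_orders
WHERE  amount >= 100 AND amount < 500;

-- The predicate contains an OR: nothing is pushed down; Spark transfers all
-- tuples and Vector evaluates the whole WHERE clause itself.
SELECT order_id, amount
FROM   ext_orders
WHERE  amount >= 100 OR region = 'EU';

-- LIKE cannot be translated into a column value range, so it is also evaluated in Vector.
SELECT order_id
FROM   ext_orders
WHERE  region LIKE 'EU%';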
If additional predicates are to be evaluated by Spark, they must already be specified in the external table definition using the FILTER option.
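The sketch below shows one possible definition of such an external table with a FILTER option. The reference path, format, column layout, and the exact placement and syntax of FILTER are assumptions; consult the CREATE EXTERNAL TABLE syntax reference for the authoritative form.

CREATE EXTERNAL TABLE ext_orders (
    order_id INTEGER,
    region   VARCHAR(10),
    amount   FLOAT8
) USING SPARK
WITH REFERENCE='hdfs://namenode/data/orders.parquet',
     FORMAT='parquet',
     FILTER='region = ''EU''';   -- evaluated by Spark for every query on ext_orders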
Spark to Data Source
The previous section covers predicate pushdown from Vector to Spark only. Whether Spark pushes predicates further down to the data source is determined by Spark's optimizer framework and depends on the specific file format, the data source implementation, and the Spark version. For example, in Spark 2, Parquet files can already be filtered during parsing, but this is not possible for CSV files; filtering during CSV file parsing is supported in Spark 3.
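On the Spark side, this behavior can be inspected with Spark SQL's EXPLAIN: the predicates handed to the data source appear in the PushedFilters list of the scan node in the physical plan. The view name and file path in this sketch are assumptions.

-- Spark SQL sketch
CREATE TEMPORARY VIEW orders_parquet
USING parquet
OPTIONS (path 'hdfs://namenode/data/orders.parquet');

EXPLAIN SELECT * FROM orders_parquet WHERE amount >= 100;
-- The scan node of the physical plan reports the predicates handed to the data
-- source, for example: PushedFilters: [IsNotNull(amount), GreaterThanOrEqual(amount,100.0)]
-- Whether they are actually applied while reading the files depends on the file
-- format and the Spark version, as described above.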
Note: The Actian custom data source implementation (FORMAT='vector') for accessing Vector tables also supports pushdown of the following predicates: =, <=, >=, <, >, IS NOT NULL, IS NULL, IN, LIKE.
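For example, assuming an external table ext_remote has been defined with FORMAT='vector' (the table and column names here are assumptions for illustration), all of the predicates in the following query are candidates for pushdown through Spark to the underlying Vector table:

SELECT order_id, region
FROM   ext_remote
WHERE  region LIKE 'EU%'            -- LIKE is supported by the vector data source
  AND  status IN ('open', 'shipped')
  AND  amount IS NOT NULL;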
Last modified date: 06/28/2024