
If the id
in the Capacitor file for this table is dictionary-encoded, the system’s expression folding will evaluate all dictionary values, and, because none of its values contain two digits, determine that the REGEXP_CONTAINS
condition is always false, and replace the WHERE
clause with a constant false
. As a result, BigQuery completely skips scanning the Capacitor file for this table, significantly boosting performance. Of course, these optimizations are applicable across a broad range of scenarios and not just to the query used in this example.
Data-encoding-enabled optimizations
Our state-of-the art join algorithm tries to preserve dictionary and run-length-encoded data wherever possible and makes runtime decisions taking data encoding into account. For example, if the probe side in the join key is dictionary-encoded, we can use that knowledge to avoid repeated hash-table lookups. Also, during aggregation, we can skip building a hashmap if data is already dictionary-encoded and its cardinality is known.
Parallelizable join and aggregation algorithms
Enhanced vectorization harnesses sophisticated parallelizable algorithms for efficient joins and aggregations. When parallel execution is enabled in a Dremel leaf node for certain query-execution modes, the join algorithm can build and probe the right-hand side hash table in parallel using multiple threads. Similarly, aggregation algorithms can perform both local and global aggregations across multiple threads simultaneously. This parallel execution of join and aggregation algorithms leads to a substantial acceleration of query execution.
Tighter integration with Capacitor
We re-engineered Capacitor for the enhanced vectorization runtime, making it smarter and more efficient. This updated version now natively supports semi-structured and JSON data, using sophisticated operators to rebuild JSON data efficiently. Capacitor enables enhanced vectorization runtime to directly access dictionary and run-length-encoded data and apply various optimizations based on data. It intelligently applies folding to a constant optimization when an entire column has the same value. And it can prune expressions in functions expecting NULL
, such as IF_NULL
and COALESCE
, when a column is confirmed to be NULL
-free.
Filter pushdown in Capacitor
Capacitor leverages the same vectorized engine as enhanced vectorization to efficiently push down filters and computations. This allows for tailored optimizations based on specific file characteristics and the expressions used. When combined with dictionary and run-length-encoded data, this approach delivers exceptionally fast and efficient data scans, enabling further optimizations like expression folding.
Enhanced vectorization in action
Let’s illustrate the power of these techniques with a concrete example. Enhanced vectorization accelerated one query by 21 times, slashing execution time from over one minute (61 seconds) down to 2.9 seconds.
The query that achieved this dramatic speedup was:
Source Credit: https://cloud.google.com/blog/products/data-analytics/understanding-bigquery-enhanced-vectorization/