
BigQuery delivers fast search and lookup query performance by efficiently pruning irrelevant files. In some cases, however, search indexes need additional column information to optimize queries further. To help, we recently announced indexing with column granularity, which lets BigQuery pinpoint relevant data within specific columns, for faster search queries and lower costs.
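As a quick preview, column granularity is enabled through an option on the search index DDL. The sketch below uses placeholder dataset and table names, and the default_index_column_granularity option reflects the BigQuery search index DDL as documented at the time of writing; check the current documentation before relying on it.

```sql
-- Sketch: create a search index that tracks tokens per column rather
-- than only per file. Dataset and table names are placeholders.
CREATE SEARCH INDEX my_column_index
ON my_dataset.my_table (ALL COLUMNS)
OPTIONS (
  -- 'COLUMN' records, for each token, which columns (per file)
  -- contain it; the default 'GLOBAL' records only the files.
  default_index_column_granularity = 'COLUMN'
);
```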
BigQuery arranges table data into one or more physical files, each holding N rows. The data is stored in a columnar format, meaning each column has its own dedicated file block. You can learn more about this in the BigQuery Storage Internals blog.

The default search index operates at the file level: it maintains a mapping from each data token to all the files that contain it. At query time, the index reduces the search space so that only the relevant files are scanned. This file-level approach excels when search tokens are selective, appearing in only a few files. However, some tokens are selective within specific columns but common across others; such tokens appear in most files, which diminishes the effectiveness of a file-level index.
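To make those mechanics concrete, here is a minimal sketch of the default, file-level setup. The index and table names are placeholders; CREATE SEARCH INDEX and the SEARCH function are standard BigQuery features.

```sql
-- Sketch: a default search index maps each token to the set of
-- files containing it, with no column information.
CREATE SEARCH INDEX my_file_level_index
ON my_dataset.my_table (ALL COLUMNS);

-- At query time, the index prunes files that cannot contain the
-- token 'BigQuery'; only the surviving files are scanned.
SELECT *
FROM my_dataset.my_table AS t
WHERE SEARCH(t, 'BigQuery');
```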
For example, imagine a scenario where we have a collection of technical articles stored in a simplified table named TechArticles, with two columns: Title and Content. And let's assume that the data is distributed across four files, as shown in the sketch below.
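The original diagram is not reproduced here, so the following hypothetical setup approximates the scenario. The row contents are invented for illustration, and BigQuery, not the user, decides how rows are packed into physical files; assume these four articles land in four separate files.

```sql
-- Hypothetical data for the TechArticles example (illustrative only).
-- Assume BigQuery spreads these rows across four physical files, with
-- the token 'BigQuery' common in Content but rare in Title.
CREATE TABLE my_dataset.TechArticles (
  Title   STRING,
  Content STRING
);

INSERT INTO my_dataset.TechArticles (Title, Content) VALUES
  ('BigQuery Search Indexes',  'How BigQuery prunes files at query time ...'),
  ('Spanner Fundamentals',     'An OLTP-focused piece; BigQuery appears only in passing ...'),
  ('Dataflow Pipeline Basics', 'Streaming ETL whose results often land in BigQuery ...'),
  ('Pub/Sub Design Patterns',  'Event-driven systems that frequently feed BigQuery ...');
```

With data like this, a search for 'BigQuery' in Content matches every file, while the same search in Title matches only one. This is exactly the case where a file-level index loses its pruning power and column granularity pays off.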
Source Credit: https://cloud.google.com/blog/products/data-analytics/search-indexes-with-column-granularity-in-bigquery/