While petabyte-scale data warehouses are becoming more common, getting the performance you need without escalating costs and effort remain key challenges, even in a modern cloud data warehouse. While many data warehouse platform providers continue to work on these challenges, BigQuery has already moved past petabyte-scale data warehouses to petabyte-scale tables. In fact, some BigQuery users have single tables in excess of 200 petabytes and over 70 trillion rows.
At this scale, even metadata is big data that requires an (almost) infinitely scalable design and high performance. In 2021, we presented the Column Metadata (CMETA) index in a 2021 VLDB paper, which, as the name implies, acts like an index for metadata. Compared to existing techniques, CMETA proved to be superior, meeting both our scalability and performance requirements. Further, BigQuery’s implementation thereof requires no user effort to maintain, and in addition to transparently improving query performance, CMETA may also reduce overall slot usage.
In this blog, we take a look at how CMETA works, the impact it can have on your workloads, and how to maximize its benefits. Let’s jump in.
How BigQuery stores data
All data in BigQuery tables is stored as data blocks that are organized in a columnar format. Data blocks also store metadata about all rows within the block. This includes min and max values for each column in the block and any other necessary properties that may be used for query optimization. This metadata allows BigQuery to perform fine-grained dynamic pruning to improve both query performance and resource efficiency.
This approach is well-known and commonly applied in the data management industry. However, as noted above, BigQuery operates on a vast scale, routinely handling tables that have over a hundred petabytes of data spread across billions of blocks in storage. Metadata for these tables frequently reach terabyte scale — larger than many organizations’ entire data warehouses!
Source Credit: https://cloud.google.com/blog/products/data-analytics/understanding-the-bigquery–column-metadata-cmeta-index/
