

In the world of data-driven decisions, undetected data quality issues can silently sabotage your analytics, lead to flawed insights, and erode trust in your data. How often have critical datasets experienced unexpected drops in row counts, a sudden surge in NULLs, or bizarre metric fluctuations that went unnoticed until it was too late?
Fortunately, Google BigQuery ML offers a powerful, scalable way to move from reactive data fire-fighting to proactive data quality monitoring. By first training a machine learning model and then leveraging the ML.DETECT_ANOMALIES function, you can supercharge your data quality checks and catch these issues early.
In this post, we’ll explore how to use BigQuery ML’s anomaly detection capabilities for robust data quality monitoring. We’ll cover what it involves, why it’s particularly effective for data quality, and walk through practical SQL examples to get you started.
What it is:
BigQuery ML’s ML.DETECT_ANOMALIES function is a powerful tool for identifying outliers or unexpected patterns in your data.
You first need to create and train a model of a supported type — such as ARIMA_PLUS or ARIMA_PLUS_XREG for time-series data, or K-means, PCA, or Autoencoder models for other types of anomaly detection tasks. Once the model is trained, ML.DETECT_ANOMALIES uses this model to identify data points that deviate significantly from the patterns it has learned.
For time-series data quality checks, like monitoring daily row counts or key metrics over time, you’ll typically train an ARIMA_PLUS model on your historical data. The ML.DETECT_ANOMALIES function then uses this model to flag anomalies.
A key option during the training of an ARIMA_PLUS model for anomaly detection purposes is DECOMPOSE_TIME_SERIES=TRUE. This allows ML.DETECT_ANOMALIES to effectively identify anomalies within the historical data the model was trained on, as well as on new, unseen data.
Why it’s useful for Data Quality Monitoring:
Model-Driven & Proactive Detection: Instead of relying solely on manual checks or static threshold rules, ML.DETECT_ANOMALIES leverages a machine learning model trained on your data’s historical patterns. This allows for more nuanced and proactive detection of issues like:
- Sudden, significant drops or spikes in table row counts.
- Unexpected increases in NULL value percentages in critical columns (see the sketch below for turning this into a monitorable time series).
- Unusual shifts or outliers in key business metrics (e.g., average transaction value, daily active users).
- Changes in the distribution or frequency of categorical values.
Efficiency & Scalability: Leverage BigQuery’s massive parallel processing power to monitor petabyte-scale datasets efficiently. No need to move data or manage separate anomaly detection infrastructure.
SQL-Based Workflow: Define your model training and anomaly detection steps using familiar SQL, integrating seamlessly into your existing BigQuery data pipelines and ELT processes.
Customizable Sensitivity: During the anomaly detection step, you can fine-tune sensitivity using parameters like anomaly_prob_threshold to balance catching more anomalies versus reducing false positives.
Rich Output: The function returns clear indicators like is_anomaly (TRUE/FALSE), anomaly_probability, and often expected_value or lower_bound/upper_bound, making it easier to understand and act on the detected anomalies.
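To make the NULL-percentage bullet above concrete: most of these signals reduce to a time series you can model exactly like the row-count walkthrough that follows. Here is a minimal sketch, assuming a hypothetical `your_project.your_dataset.orders` table with a created_at timestamp and a critical customer_id column (none of these names are from the original example):
-- Build a daily NULL-rate time series for a critical column (table and columns are illustrative)
SELECT
  TIMESTAMP(DATE(created_at)) AS time_series_timestamp_col,
  SAFE_DIVIDE(COUNTIF(customer_id IS NULL), COUNT(*)) AS time_series_data_col
FROM
  `your_project.your_dataset.orders`
GROUP BY
  time_series_timestamp_col
ORDER BY
  time_series_timestamp_col;
The resulting two-column series plugs directly into the ARIMA_PLUS training pattern shown next.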
Let’s walk through an example of monitoring the daily row count of a critical table. Unexpected large drops could indicate an ETL failure, while sudden large spikes might suggest data duplication or an upstream issue. The workflow is a two-step process: training a model, then detecting anomalies.
Example: Detecting Anomalies in Daily Table Row Counts
Step 1: Train an ARIMA_PLUS Model on Historical Row Counts
Preparing the Training Data:
- First, we need a historical time series of row counts for the table we want to monitor. We can generate this by querying INFORMATION_SCHEMA.PARTITIONS (for partitioned tables) or by regularly recording row counts into a dedicated metadata table.
- For this example, let’s assume we have a table your_project.your_dataset.my_table_daily_counts_history with time_series_timestamp_col (TIMESTAMP) and time_series_data_col (INT64 representing the row count).
-- Sample data preparation (if not already existing)
CREATE OR REPLACE TABLE `your_project.your_dataset.my_table_daily_counts_history` AS
SELECT
  CAST(PARSE_DATE('%Y%m%d', partition_id) AS TIMESTAMP) AS time_series_timestamp_col,  -- Adjust if your partition_id format is different
  total_rows AS time_series_data_col
FROM
  `your_project.your_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'your_partitioned_table'
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')  -- Skip special partitions that have no date
  AND CAST(PARSE_DATE('%Y%m%d', partition_id) AS DATE) < CURRENT_DATE()  -- Use historical data for training
ORDER BY
  time_series_timestamp_col;
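If the monitored table isn’t partitioned, the alternative mentioned above of recording counts into a dedicated metadata table can be a simple scheduled query. A minimal sketch, where the monitored table name is an illustrative assumption:
-- Hypothetical daily scheduled query: append today's row count to the history table
INSERT INTO `your_project.your_dataset.my_table_daily_counts_history`
  (time_series_timestamp_col, time_series_data_col)
SELECT
  TIMESTAMP(CURRENT_DATE()) AS time_series_timestamp_col,
  COUNT(*) AS time_series_data_col
FROM
  `your_project.your_dataset.your_monitored_table`;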
Create and Train the Model:
- Now, we create an ARIMA_PLUS model. The DECOMPOSE_TIME_SERIES = TRUE option is important as it helps in identifying anomalies later by analyzing the decomposed components of the time series (trend, seasonality, residuals).
CREATE OR REPLACE MODEL `your_project.your_dataset.dq_row_count_anomaly_model`
OPTIONS(
  MODEL_TYPE='ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL='time_series_timestamp_col',
  TIME_SERIES_DATA_COL='time_series_data_col',
  AUTO_ARIMA=TRUE,
  DATA_FREQUENCY='AUTO_FREQUENCY',  -- Or specify 'DAILY', 'HOURLY', etc.
  DECOMPOSE_TIME_SERIES=TRUE,
  CLEAN_SPIKES_AND_DIPS=FALSE,  -- For DQ, we often want to detect these, not clean them from the model training
  ADJUST_STEP_CHANGES=FALSE
) AS
SELECT
  time_series_timestamp_col,
  time_series_data_col
FROM
  `your_project.your_dataset.my_table_daily_counts_history`;
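Before moving on to detection, it can be worth sanity-checking what the model learned. One option (not part of the original walkthrough) is ML.ARIMA_EVALUATE, which reports the candidate ARIMA models, their AIC, and any detected seasonal periods:
-- Inspect the trained model: ARIMA orders, AIC, detected seasonality
SELECT
  *
FROM
  ML.ARIMA_EVALUATE(MODEL `your_project.your_dataset.dq_row_count_anomaly_model`);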
Step 2: Detect Anomalies using the Trained Model
Detecting Anomalies on Historical (Training) Data or New Data:
- Once the model is trained, you can use ML.DETECT_ANOMALIES. If you pass the same historical data (or no data, relying on the model’s training data if DECOMPOSE_TIME_SERIES was true), it can find anomalies within that period. More commonly for ongoing DQ, you’d pass new data to this function.
- Let’s assume we want to check the most recent day’s row count, which is in a table (or could be a query) called your_project.your_dataset.my_table_daily_counts_today with the same column names (time_series_timestamp_col, time_series_data_col).
-- Example: Detecting anomalies in new data (e.g., today's row count)
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `your_project.your_dataset.dq_row_count_anomaly_model`,
    STRUCT(0.05 AS anomaly_prob_threshold),  -- Detects anomalies outside the 95% confidence interval
    TABLE `your_project.your_dataset.my_table_daily_counts_today`  -- Or use a (SELECT … FROM … WHERE date = CURRENT_DATE())
  )
WHERE
  is_anomaly = TRUE;  -- Only show the anomalous records
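Instead of maintaining a separate my_table_daily_counts_today table, you can pass a subquery directly, as the comment above hints. A minimal sketch, where the monitored table name is an illustrative assumption:
-- Equivalent check, passing a query instead of a table
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `your_project.your_dataset.dq_row_count_anomaly_model`,
    STRUCT(0.05 AS anomaly_prob_threshold),
    (
      SELECT
        TIMESTAMP(CURRENT_DATE()) AS time_series_timestamp_col,
        COUNT(*) AS time_series_data_col
      FROM
        `your_project.your_dataset.your_monitored_table`
    )
  )
WHERE
  is_anomaly = TRUE;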
- If you wanted to re-evaluate the training data for anomalies (useful for understanding the model and past issues):
-- Example: Detecting anomalies in the original training data
-- This relies on DECOMPOSE_TIME_SERIES=TRUE during model creation.
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `your_project.your_dataset.dq_row_count_anomaly_model`,
    STRUCT(0.05 AS anomaly_prob_threshold)
    -- No input table/query means it evaluates anomalies over its training data
  )
WHERE
  is_anomaly = TRUE;
Explanation of Parameters in ML.DETECT_ANOMALIES:
- MODEL your_project.your_dataset.dq_row_count_anomaly_model: Specifies the pre-trained BQML model to use.
- TABLE your_project.your_dataset.my_table_daily_counts_today (or a query): This is the input data you want to check for anomalies. Its schema must match what the model expects (e.g., time_series_timestamp_col, time_series_data_col).
- STRUCT(0.05 AS anomaly_prob_threshold): Defines the significance level. A value of 0.05 means data points falling outside the model’s 95% confidence interval (i.e., with less than a 5% probability of occurring under normal patterns) are flagged as anomalies.
Interpreting the Output: The query will return rows from your input data that are flagged as anomalies. Key output columns include:
- is_anomaly (BOOL): TRUE if the data point is an anomaly.
- anomaly_probability (FLOAT64): The probability that the data point is an anomaly.
- The original data columns from your input (e.g., time_series_timestamp_col, time_series_data_col).
- lower_bound, upper_bound (FLOAT64): The expected range for the data point based on the model.
- expected_value (FLOAT64): The value predicted by the model.
(Visual mention: you could then visualize your actual data alongside lower_bound and upper_bound over time in a tool like Looker Studio, highlighting the points where is_anomaly = TRUE. This would clearly show the unexpected deviations; I will show you how in my next blog post.)
Model Training Matters:
- DECOMPOSE_TIME_SERIES=TRUE: During ARIMA_PLUS model creation, this is essential if you want ML.DETECT_ANOMALIES to properly identify anomalies within the historical training data or if your data has trends/seasonality.
- CLEAN_SPIKES_AND_DIPS and ADJUST_STEP_CHANGES: For data quality anomaly detection, you typically set these to FALSE during model training because you want to detect these events as anomalies, not have the model automatically clean or adjust for them during its learning phase.
- Sufficient Historical Data for Training: ARIMA_PLUS models need enough historical data to learn patterns, trends, and seasonality.
Parameter Tuning in ML.DETECT_ANOMALIES:
- anomaly_prob_threshold: This is the most direct way to control sensitivity when calling ML.DETECT_ANOMALIES.
- Data Granularity and Regularity: Ensure your time series data is at a consistent granularity (e.g., hourly, daily) and that timestamps are regular for best model training results.
- Iterative Refinement & Domain Knowledge: Anomaly detection isn’t always “set it and forget it.” Expect to iterate. Use your domain knowledge to interpret results and adjust parameters or model training as needed.
Operationalizing Your Checks:
- Scheduling: Schedule both model re-training (periodically, as needed) and the ML.DETECT_ANOMALIES queries (e.g., daily after ETL).
- Alerting: Integrate the output with alerting mechanisms (e.g., saving anomaly results to a table for BI tool monitoring, or using Dataplex AutoDQ which I will cover in my next blog post).
- Logging Results: Store the results of your anomaly detection runs to track effectiveness.
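As a concrete sketch of the logging step, the following appends each run’s flagged points to a results table (the dq_anomaly_log table is an illustrative assumption and must be created with a matching schema first):
-- Hypothetical anomaly log: persist each run's flagged points
INSERT INTO `your_project.your_dataset.dq_anomaly_log`
  (detected_at, time_series_timestamp_col, time_series_data_col,
   anomaly_probability, lower_bound, upper_bound)
SELECT
  CURRENT_TIMESTAMP() AS detected_at,
  time_series_timestamp_col,
  time_series_data_col,
  anomaly_probability,
  lower_bound,
  upper_bound
FROM
  ML.DETECT_ANOMALIES(
    MODEL `your_project.your_dataset.dq_row_count_anomaly_model`,
    STRUCT(0.05 AS anomaly_prob_threshold),
    TABLE `your_project.your_dataset.my_table_daily_counts_today`
  )
WHERE
  is_anomaly = TRUE;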
Cost Consideration: Model training (CREATE MODEL) and anomaly detection (ML.DETECT_ANOMALIES) both incur BigQuery processing costs. Plan accordingly.
While this post focuses on using ARIMA_PLUS with ML.DETECT_ANOMALIES for time-series based data quality checks (like monitoring row counts or metrics over time), BigQuery ML supports other model types that can be used with ML.DETECT_ANOMALIES for different kinds of data quality scenarios:
ARIMA_PLUS / ARIMA_PLUS_XREG:
- Best for: Time-series data where you’re looking for unusual deviations from historical trends, seasonality, or forecasted values. Examples: sudden spikes/dips in daily transaction counts, unexpected changes in average error rates over time.
- Considerations: Requires a time-series dataset for training. ARIMA_PLUS_XREG allows incorporating external regressors which might be relevant for more complex DQ checks influenced by known external factors.
K-Means (MODEL_TYPE='KMEANS')
- Best for: Detecting anomalies in multi-dimensional numerical data that isn’t necessarily time-series. It can identify data points that are far from any learned cluster centroids, potentially indicating anomalous combinations of feature values.
- Considerations: Requires you to choose the number of clusters (num_clusters option). Feature scaling is often important. (A minimal sketch follows this list.)
PCA (MODEL_TYPE='PCA')
- Best for: Anomaly detection based on reconstruction error in high-dimensional numerical data. Points that are poorly reconstructed from the principal components might be anomalous.
- Considerations: You need to choose the number of principal components.
Autoencoder (MODEL_TYPE='AUTOENCODER')
- Best for: Similar to PCA, using reconstruction error, but with neural networks, potentially capturing more complex non-linear relationships. Good for finding records that don’t conform to the general structure learned by the autoencoder.
- Considerations: More complex to train and tune (e.g., hidden units, activation functions).
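To make the K-Means option above concrete, here is a minimal sketch under assumed names (the model, table, and feature columns are illustrative, not from the original example). Note that for K-Means, PCA, and Autoencoder models, ML.DETECT_ANOMALIES takes a contamination argument, the expected fraction of anomalous rows, rather than anomaly_prob_threshold:
-- Minimal sketch: K-means anomaly detection on non-time-series features
CREATE OR REPLACE MODEL `your_project.your_dataset.dq_kmeans_model`
OPTIONS(
  MODEL_TYPE='KMEANS',
  NUM_CLUSTERS=5,             -- Assumed value; tune for your data
  STANDARDIZE_FEATURES=TRUE   -- Feature scaling, as noted above
) AS
SELECT
  order_value,   -- Hypothetical numeric features
  item_count,
  discount_pct
FROM
  `your_project.your_dataset.orders_features`;

-- Flag roughly the 1% of rows farthest from their learned cluster centroids
SELECT
  *
FROM
  ML.DETECT_ANOMALIES(
    MODEL `your_project.your_dataset.dq_kmeans_model`,
    STRUCT(0.01 AS contamination),
    TABLE `your_project.your_dataset.orders_features`
  )
WHERE
  is_anomaly = TRUE;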
For the common data quality use case of monitoring a specific metric over time, ARIMA_PLUS paired with ML.DETECT_ANOMALIES is a very effective and SQL-friendly approach. This blog post has focused on this pattern.
Maintaining high data quality is non-negotiable in today’s analytical landscape. BigQuery ML’s CREATE MODEL statement and ML.DETECT_ANOMALIES function provide a scalable, SQL-native solution to move beyond manual checks and simplistic rule-based systems.
By training models on your data’s historical patterns and integrating anomaly detection into your workflows, you can proactively identify data issues, reduce the time to detection, and ultimately build greater trust and reliability in your data assets.
Whether you’re monitoring table volumes, tracking critical data metrics, or ensuring data consistency, BigQuery ML offers the tools to supercharge your data quality strategy. I encourage you to experiment with these capabilities and tailor them to your specific data quality challenges.
Source Credit: https://medium.com/google-cloud/supercharge-your-data-quality-anomaly-detection-with-bigqueryml-22a46e573cb5?source=rss----e52cf94d98af---4