Part 8 of an 8-part series on building enterprise-grade MLOps systems
Complete Series:
- Article 1: Architecture Overview
- Article 2: Tools & Workflows for ML Teams
- Article 3: Infrastructure as Code
- Article 4: Reusable KFP Components
- Article 5: Production Training Pipeline
- Article 6: Production Prediction Pipeline
- Article 7: CI/CD for ML
- Article 8: Model Monitoring & Continuous Training (You are here)
Introduction
In the previous article, we automated the entire development workflow with CI/CD. But production ML has one more critical challenge: models degrade over time.
Your model was trained on January data. It’s now November. User behavior has changed. Payment methods shifted. Routes evolved. Your model’s accuracy is silently degrading.
This final article covers:
- Event-driven continuous training (automatic retraining)
- Scheduled retraining patterns
- Production orchestration patterns
- Observability and debugging
- Cost management
- Responding to model degradation
By the end, you’ll know how to keep models fresh and accurate in production.
The Model Degradation Problem
Scenario: Your taxi fare prediction model
January (training data):
- Average trip: 5.2 miles
- Payment: 60% credit, 40% cash
- Peak hour: 8 AM
- Model RMSE: 2.5
November (production data):
- Average trip: 7.8 miles (+50%!)
- Payment: 75% credit, 25% cash
- Peak hour: 9 AM
- Model RMSE: 3.8 (+52% worse!)
Without monitoring: You don’t notice until customers complain. With monitoring: Automatic alerts + retraining.
Event-Driven Continuous Training
Goal: New data arrives → automatically retrain model.
Cloud Run Function Trigger
# Cloud Run Function (simplified)
def mlops_entrypoint(event, context):
    """Triggered when new data arrives in BigQuery."""
    # Parse event
    dataset_id = event['protoPayload']['resourceName']

    # Check if significant new data
    if should_retrain(dataset_id):
        # Trigger training pipeline
        trigger_training_pipeline(
            template_path="gs://.../taxifare-training-pipeline:latest",
            enable_caching=False,
            use_latest_data=True
        )
    return "OK"
Trigger Conditions
def should_retrain(dataset_id):
    # Option 1: Time-based
    if hours_since_last_training() > 24:
        return True

    # Option 2: Data volume-based
    if new_rows_since_last_training() > 100000:
        return True

    # Option 3: Performance-based (requires ground truth)
    if recent_rmse() > champion_rmse * 1.1:
        return True

    return False
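The helper functions above are not defined in the snippet; a rough sketch of the first two, assuming runs are listed from Vertex AI Pipelines and row counts come from BigQuery (the display name, table, and column names are illustrative):

from datetime import datetime, timezone

from google.cloud import aiplatform, bigquery

aiplatform.init(project="my-project", location="us-central1")  # assumed values


def hours_since_last_training():
    """Hours since the most recent production training run was created."""
    runs = aiplatform.PipelineJob.list(
        filter='display_name="prod-training-pipeline"',  # assumed display name
        order_by="create_time desc",
    )
    if not runs:
        return float("inf")
    return (datetime.now(timezone.utc) - runs[0].create_time).total_seconds() / 3600


def new_rows_since_last_training():
    """Rows ingested since the last training run (lookback capped to keep the query bounded)."""
    lookback_hours = int(min(hours_since_last_training(), 24 * 30))
    query = f"""
        SELECT COUNT(*) AS n
        FROM `my-project.taxi_dataset.trips`
        WHERE ingestion_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {lookback_hours} HOUR)
    """
    return next(iter(bigquery.Client().query(query).result())).n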
Event Flow
New BigQuery Rows
↓
Pub/Sub Message
↓
Cloud Run Function
↓
(Decision: Should retrain?)
↓
Trigger Training Pipeline
↓
Train → Evaluate → Compare → Upload (if better)
↓
Pub/Sub Notification (pipeline complete)
↓
Cloud Run Function
↓
Trigger Prediction Pipeline (use new model)
Configuration
# Terraform: Event trigger setup
resource "google_eventarc_trigger" "bigquery_insert_trigger" {
name = "bigquery-data-insert"
location = var.region
matching_criteria {
attribute = "type"
value = "google.cloud.bigquery.dataset.v1.dataInserted"
}
destination {
cloud_run_service {
service = google_cloud_run_service.mlops_trigger.name
region = var.region
}
}
}
Scheduled Retraining
For predictable retraining (e.g., weekly):
Vertex AI Pipeline Schedule Setup
# Create weekly training schedule using Vertex AI Pipeline Schedules
poetry run python -m pipelines.utils.schedule_pipeline \
--pipeline_type=training \
--template_path=https://us-central1-kfp.pkg.dev/my-project/mlops-pipeline-repo/taxifare-training-pipeline/latest \
--pipeline_root=gs://my-project-pl-root \
--display_name=prod-training-pipeline \
--schedule_name=prod-training-schedule \
--cron="0 2 * * 0" \
--enable_caching=false \
--use_latest_data=true
Common Schedules
# Daily at 2 AM
--cron="0 2 * * *"

# Weekly on Sunday at 2 AM
--cron="0 2 * * 0"

# Monthly on the 1st at 2 AM
--cron="0 2 1 * *"

# Every 6 hours
--cron="0 */6 * * *"
Scheduled Pipeline Parameters
# Schedule with parameters
poetry run python -m pipelines.utils.schedule_pipeline \
--project=my-prod-project \
--location=us-central1 \
--template_path=gs://.../training:latest \
--cron="0 2 * * 0" \
--parameters='{
  "use_latest_data": true,
  "enable_caching": false,
  "model_name": "taxi-traffic-model"
}'
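If you would rather skip the utility script, the Vertex AI SDK can create the same schedule directly; a sketch using the same assumed names and paths:

from google.cloud import aiplatform

aiplatform.init(project="my-prod-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="prod-training-pipeline",
    template_path="https://us-central1-kfp.pkg.dev/my-project/mlops-pipeline-repo/taxifare-training-pipeline/latest",
    pipeline_root="gs://my-project-pl-root",
    parameter_values={"use_latest_data": True},
    enable_caching=False,
)

# Weekly on Sunday at 2 AM, with at most one run in flight
schedule = job.create_schedule(
    display_name="prod-training-schedule",
    cron="0 2 * * 0",
    max_concurrent_run_count=1,
)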
Production Orchestration Patterns
Pattern 1: Scheduled Training → Automatic Prediction
Vertex AI Pipeline Schedule (weekly)
↓
Training Pipeline
↓
(If new champion)
↓
Trigger Prediction Pipeline
↓
Generate predictions for next week
Use case: Weekly batch predictions for business planning
Implementation: This pattern is achieved through the Cloud Run Function that listens for training pipeline completion events via Pub/Sub, then triggers the prediction pipeline if a new champion model was uploaded.
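That completion handler is not listed in this article; a simplified sketch of the idea, assuming the completion event arrives as a Pub/Sub-backed CloudEvent whose JSON payload carries the pipeline state (the payload fields and environment variables are assumptions):

import base64
import json
import os

import functions_framework
from google.cloud import aiplatform


@functions_framework.cloud_event
def on_training_complete(event):
    """Trigger the prediction pipeline after a successful training run."""
    message = json.loads(base64.b64decode(event.data["message"]["data"]))

    # Only react to successful runs; in the real flow you would also check
    # whether a new champion model was actually uploaded.
    if message.get("state") != "PIPELINE_STATE_SUCCEEDED":
        return "SKIPPED"

    aiplatform.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])
    aiplatform.PipelineJob(
        display_name="taxifare-prediction-pipeline",
        template_path=os.environ["PREDICTION_TEMPLATE_PATH"],
        pipeline_root=os.environ["PIPELINE_ROOT"],
    ).submit()
    return "OK"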
Pattern 2: New Data → Continuous Training
New Data Arrives (hourly)
↓
Cloud Run Function
↓
(Check: Enough new data?)
↓
Training Pipeline
↓
(Champion/Challenger comparison)
↓
Model Registry (update champion if better)
Use case: Always have the freshest model
Implementation: The Cloud Run Function (terraform/modules/cloudrunfunction/src/main.py) triggers pipelines based on Pub/Sub events:
import json
import os
import functions_framework

@functions_framework.cloud_event
def mlops_entrypoint(event):
    # Read the pipeline config from the environment and submit the job to Vertex AI
    pipeline_config = os.getenv("PIPELINE_CONFIG")
    pipeline_config_dict = json.loads(pipeline_config)
    submit_pipeline_job(pipeline_config_dict)
The function reads configuration from environment variables and submits the appropriate pipeline job (training or prediction) to Vertex AI.
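Conceptually, submit_pipeline_job maps that config dictionary onto a Vertex AI pipeline run, roughly like this (the config field names are assumptions, not the repository's exact schema):

from google.cloud import aiplatform


def submit_pipeline_job(config):
    """Create and submit a pipeline run described by the config dictionary."""
    aiplatform.init(project=config["project_id"], location=config["location"])
    aiplatform.PipelineJob(
        display_name=config["display_name"],
        template_path=config["template_path"],
        pipeline_root=config["pipeline_root"],
        parameter_values=config.get("parameter_values", {}),
        enable_caching=config.get("enable_caching", False),
    ).submit(service_account=config.get("service_account"))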
Pattern 3: Event-Driven Training via Cloud Run Function
The Cloud Run Function can be triggered by various events (Pub/Sub, Cloud Storage, etc.) to automatically start training or prediction pipelines. The actual trigger mechanism is configured in Terraform (terraform/modules/cloudrunfunction/) and the function logic handles pipeline submission to Vertex AI.
Observability and Debugging
Key Metrics to Monitor
1. Model Performance: View in Vertex AI Model Registry:
- RMSE, MAE, MAPE, MSLE metrics for each model version
- Comparison between champion and challenger models
- Evaluation results from test set
2. Data Skew: Monitored by the model_batch_predict_op component in the prediction pipeline:
- Training vs. serving feature distributions
- Skew detection thresholds configured per feature
- Automatic email alerts when skew exceeds threshold
- Metrics logged to Cloud Logging
3. Pipeline Execution: Track in Vertex AI Pipelines console:
- Pipeline success/failure rates
- Component execution times
- Resource utilization
- Error logs and stack traces
4. Training Frequency: Monitor via Vertex AI Pipeline Schedules:
- Scheduled run frequency (hourly, daily, weekly)
- Manual vs. automatic triggers
- Champion model update frequency
Cloud Logging Queries
Find training triggers:
resource.type="cloud_run_revision"
resource.labels.service_name="mlops-trigger"
jsonPayload.message="Triggering training pipeline"
Find champion promotions:
resource.type="aiplatform.googleapis.com/PipelineJob"
jsonPayload.message=~"Challenger wins"
Find skew detections:
resource.type="aiplatform.googleapis.com/BatchPredictionJob"
jsonPayload.skew_detected=true
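The exact payload fields depend on what your components log, so treat these filters as templates. The same queries can be run programmatically with the Cloud Logging client, for example:

import google.cloud.logging as cloud_logging

client = cloud_logging.Client()

# Reuse the "find training triggers" filter from above
log_filter = (
    'resource.type="cloud_run_revision" AND '
    'resource.labels.service_name="mlops-trigger" AND '
    'jsonPayload.message="Triggering training pipeline"'
)

for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.payload)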
Dashboards
Create Cloud Monitoring dashboards to visualize:
- Vertex AI Pipeline execution success rates and durations
- Model evaluation metrics from the Model Registry
- Cloud Run Function invocation counts and errors
- BigQuery job statistics for data processing steps
- Skew detection alerts from batch prediction jobs
Responding to Model Degradation
Alert → Investigate → Retrain Workflow
1. Receive Alert: When the prediction pipeline’s model_batch_predict_op component detects data skew, it sends an email alert configured in the component parameters.
2. Investigate:
- Check Vertex AI Pipelines console for skew detection details
- Review Cloud Logging for skew metrics and feature distributions
- Compare recent prediction data against training dataset in BigQuery
3. Trigger Retraining: Manually trigger the training pipeline with latest data:
cd pipelines
poetry run python -m pipelines.utils.trigger_pipeline \
--template_path=./taxifare-training-pipeline.yaml \
--display_name=manual-retrain-pipeline \
--enable_caching=false \
--use_latest_data=true
Or use the Makefile shortcut:
make training build=false enable_caching=false use_latest_data=true
4. Validate Improvement:
- Check Vertex AI Model Registry for new model metrics
- Compare RMSE between old champion and new model
- The pipeline’s champion/challenger logic automatically promotes better models
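To compare metrics without opening the console, you can also read evaluations from the Model Registry with the SDK; a small sketch (the model display name is an assumption):

from google.cloud import aiplatform

aiplatform.init(project="my-prod-project", location="us-central1")

# Look up all versions of the model by display name (assumed name)
for model in aiplatform.Model.list(filter='display_name="taxi-traffic-model"'):
    for evaluation in model.list_model_evaluations():
        # Evaluation metrics include RMSE, MAE, etc., as logged by the pipeline
        print(model.display_name, model.version_id, evaluation.metrics)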
Optimization Strategies
1. Use Pipeline Caching:
# Enable caching for preprocessing steps that don't change
make training enable_caching=true
2. Adjust Training Schedule: Configure pipeline schedules based on data velocity:
- High-frequency data: Daily training
- Stable data: Weekly training
- Monitor skew alerts to determine optimal frequency
3. Right-Size Training Resources: Configure machine types in the get_workerpool_spec_op component based on dataset size and model complexity (see the sketch after this list).
4. Clean Up Old Artifacts: Regularly manage artifacts in:
- Artifact Registry (old pipeline versions and Docker images)
- Vertex AI Model Registry (non-champion model versions)
- Cloud Storage (old pipeline artifacts and outputs)
5. Optimize BigQuery Costs: The preprocessing SQL queries (ingest.sql, ingest_pred.sql) are optimized to:
- Filter data early in the query
- Use partitioning when available
- Limit data scanned with timestamps
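On point 3, a Vertex AI custom training worker pool spec is just a dictionary; a minimal sketch with assumed machine type and image values:

# Shape of a Vertex AI CustomJob worker pool spec; the values are assumptions
# you would size to your dataset and model complexity.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",      # scale up for larger datasets
            # "accelerator_type": "NVIDIA_TESLA_T4",
            # "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/my-project/mlops/training:latest",
            "args": ["--epochs", "20"],
        },
    }
]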
Best Practices
1. Always Version Everything
The system automatically versions:
- Models: Stored in Vertex AI Model Registry with version numbers
- Pipelines: Tagged in Artifact Registry (e.g., v1.2.3, latest)
- Docker images: Tagged in Artifact Registry matching pipeline versions
- Training data: Timestamped via timestamp parameter in pipeline runs
2. Use Champion/Challenger Pattern
Implemented in the upload_best_model_op component (a simplified sketch follows this list):
- New models are only promoted if they beat the current champion
- RMSE comparison happens automatically during training pipeline
- All models are preserved in registry for rollback capability
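A simplified sketch of that comparison logic, assuming the challenger's RMSE comes from the evaluation step and the champion is tracked with a label in the Model Registry (the labeling scheme is an assumption):

from google.cloud import aiplatform


def is_challenger_better(challenger_rmse, model_display_name):
    """Return True if the challenger beats the current champion (lower RMSE wins)."""
    champions = aiplatform.Model.list(
        filter=f'display_name="{model_display_name}" AND labels.champion="true"'
    )
    if not champions:
        return True  # no champion yet, so promote the first model

    champion_rmse = float(champions[0].labels.get("rmse", "inf"))  # assumes RMSE stored as a label
    return challenger_rmse < champion_rmse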
3. Monitor Before Optimizing
1. Deploy with monitoring
2. Observe for 1 week
3. Identify bottlenecks
4. Optimize selectively
5. Measure improvement
4. Set Up Alerts Thoughtfully
# Bad: Alert on every small change
if rmse > baseline_rmse * 1.01:
    alert()

# Good: Alert on sustained degradation
if rolling_avg_rmse(days=7) > baseline_rmse * 1.15:
    alert()
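A rolling RMSE like this requires ground truth joined back to predictions; a sketch, assuming a BigQuery table that stores predictions alongside observed fares (the table and column names are illustrative):

from google.cloud import bigquery


def rolling_avg_rmse(days=7):
    """RMSE over the last `days` of predictions joined with actual fares."""
    query = f"""
        SELECT SQRT(AVG(POW(predicted_fare - actual_fare, 2))) AS rmse
        FROM `my-project.taxi_dataset.predictions_with_actuals`
        WHERE prediction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
    """
    return next(iter(bigquery.Client().query(query).result())).rmse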
5. Document Retraining Decisions
## Retraining Log
### 2024-01-15
- Trigger: Scheduled weekly retrain
- Data: 2024-01-08 to 2024-01-15
- Result: New model RMSE 2.3 (vs champion 2.5) → Promoted
- Notes: Improved accuracy on credit card payments
### 2024-01-22
- Trigger: Accuracy degradation alert
- Data: 2024-01-15 to 2024-01-22
- Result: New model RMSE 2.8 (vs champion 2.3) → Not promoted
- Notes: Holiday data anomaly, monitoring
Conclusion: Your Complete MLOps System
You’ve now built a complete, production-ready MLOps system across 8 articles:
- Architecture → Multi-environment design
- Developer Experience → Productive workflows
- Infrastructure → Automated with Terraform
- Components → Modular and reusable
- Training → Sophisticated pipeline
- Prediction → Scalable inference
- CI/CD → Complete automation
- Operations → Continuous improvement
Your system now:
- Trains models automatically
- Deploys only better models
- Generates predictions at scale
- Monitors for degradation
- Retrains when needed
- Maintains itself
What’s next?
- Implement it for your use case
- Customize for your data
- Extend with new features
- Share learnings with the community
Thank you for following this comprehensive series!
Now go build amazing, self-maintaining ML systems! 🚀🎉
Thanks for reading! I hope this helps you on your journey. If you found value in this series, please clap, leave a comment, and star the GitHub repo. Hit the Follow button to get notified about my next article, subscribe to the email list, and let's connect on LinkedIn!
