Part 8 of an 8-part series on building enterprise-grade MLOps systems
Complete Series:
- Article 1: Architecture Overview
- Article 2: Tools & Workflows for ML Teams
- Article 3: Infrastructure as Code
- Article 4: Reusable KFP Components
- Article 5: Production Training Pipeline
- Article 6: Production Prediction Pipeline
- Article 7: CI/CD for ML
- Article 8: Model Monitoring & Continuous Training (You are here)
Introduction
In the previous article, we automated the entire development workflow with CI/CD. But production ML has one more critical challenge: models degrade over time.
Your model was trained on January data. It’s now November. User behavior has changed. Payment methods shifted. Routes evolved. Your model’s accuracy is silently degrading.
This final article covers:
- Event-driven continuous training (automatic retraining)
- Scheduled retraining patterns
- Production orchestration patterns
- Observability and debugging
- Cost management
- Responding to model degradation
By the end, you’ll know how to keep models fresh and accurate in production.
The Model Degradation Problem
Scenario: Your taxi fare prediction model
January (training data):
- Average trip: 5.2 miles
- Payment: 60% credit, 40% cash
- Peak hour: 8 AM
- Model RMSE: 2.5
November (production data):
- Average trip: 7.8 miles (+50%!)
- Payment: 75% credit, 25% cash
- Peak hour: 9 AM
- Model RMSE: 3.8 (+52% worse!)
Without monitoring: You don’t notice until customers complain. With monitoring: Automatic alerts + retraining.
Event-Driven Continuous Training
Goal: New data arrives → automatically retrain model.
Cloud Run Function Trigger
# Cloud Run Function (simplified)
def mlops_entrypoint(event, context):
    """Triggered when new data arrives in BigQuery."""
    # Parse event
    dataset_id = event['protoPayload']['resourceName']

    # Check if significant new data
    if should_retrain(dataset_id):
        # Trigger training pipeline
        trigger_training_pipeline(
            template_path="gs://.../taxifare-training-pipeline:latest",
            enable_caching=False,
            use_latest_data=True
        )
    return "OK"
Trigger Conditions
def should_retrain(dataset_id):
    # Option 1: Time-based
    if hours_since_last_training() > 24:
        return True

    # Option 2: Data volume-based
    if new_rows_since_last_training() > 100000:
        return True

    # Option 3: Performance-based (requires ground truth)
    if recent_rmse() > champion_rmse * 1.1:
        return True

    return False
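The helper functions above are not defined in the snippet; a rough sketch of the first two, assuming runs are listed from Vertex AI Pipelines and row counts come from BigQuery (the display name, table, and column names are illustrative):

from datetime import datetime, timezone

from google.cloud import aiplatform, bigquery

aiplatform.init(project="my-project", location="us-central1")  # assumed values


def hours_since_last_training():
    """Hours since the most recent production training run was created."""
    runs = aiplatform.PipelineJob.list(
        filter='display_name="prod-training-pipeline"',  # assumed display name
        order_by="create_time desc",
    )
    if not runs:
        return float("inf")
    return (datetime.now(timezone.utc) - runs[0].create_time).total_seconds() / 3600


def new_rows_since_last_training():
    """Rows ingested since the last training run (lookback capped to keep the query bounded)."""
    lookback_hours = int(min(hours_since_last_training(), 24 * 30))
    query = f"""
        SELECT COUNT(*) AS n
        FROM `my-project.taxi_dataset.trips`
        WHERE ingestion_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {lookback_hours} HOUR)
    """
    return next(iter(bigquery.Client().query(query).result())).n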
Event Flow
New BigQuery Rows
↓
Pub/Sub Message
↓
Cloud Run Function
↓
(Decision: Should retrain?)
↓
Trigger Training Pipeline
↓
Train → Evaluate → Compare → Upload (if better)
↓
Pub/Sub Notification (pipeline complete)
↓
Cloud Run Function
↓
Trigger Prediction Pipeline (use new model)
Configuration
# Terraform: Event trigger setup
resource "google_eventarc_trigger" "bigquery_insert_trigger" {
name = "bigquery-data-insert"
location = var.region
matching_criteria {
attribute = "type"
value = "google.cloud.bigquery.dataset.v1.dataInserted"
}
destination {
cloud_run_service {
service = google_cloud_run_service.mlops_trigger.name
region = var.region
}
}
}
Scheduled Retraining
For predictable retraining (e.g., weekly):
Vertex AI Pipeline Schedule Setup
# Create weekly training schedule using Vertex AI Pipeline Schedules
poetry run python -m pipelines.utils.schedule_pipeline \
--pipeline_type=training \
--template_path=https://us-central1-kfp.pkg.dev/my-project/mlops-pipeline-repo/taxifare-training-pipeline/latest \
--pipeline_root=gs://my-project-pl-root \
--display_name=prod-training-pipeline \
--schedule_name=prod-training-schedule \
--cron="0 2 * * 0" \
--enable_caching=false \
--use_latest_data=true
Common Schedules
# Daily at 2 AM
--cron="0 2 * * *"

# Weekly on Sunday at 2 AM
--cron="0 2 * * 0"

# Monthly on the 1st at 2 AM
--cron="0 2 1 * *"

# Every 6 hours
--cron="0 */6 * * *"
Scheduled Pipeline Parameters
# Schedule with parameters
poetry run python -m pipelines.utils.schedule_pipeline \
--project=my-prod-project \
--location=us-central1 \
--template_path=gs://.../training:latest \
--cron="0 2 * * 0" \
--parameters='{
  "use_latest_data": true,
  "enable_caching": false,
  "model_name": "taxi-traffic-model"
}'
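If you would rather skip the utility script, the Vertex AI SDK can create the same schedule directly; a sketch using the same assumed names and paths:

from google.cloud import aiplatform

aiplatform.init(project="my-prod-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="prod-training-pipeline",
    template_path="https://us-central1-kfp.pkg.dev/my-project/mlops-pipeline-repo/taxifare-training-pipeline/latest",
    pipeline_root="gs://my-project-pl-root",
    parameter_values={"use_latest_data": True},
    enable_caching=False,
)

# Weekly on Sunday at 2 AM, with at most one run in flight
schedule = job.create_schedule(
    display_name="prod-training-schedule",
    cron="0 2 * * 0",
    max_concurrent_run_count=1,
)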
Production Orchestration Patterns
Pattern 1: Scheduled Training → Automatic Prediction
Vertex AI Pipeline Schedule (weekly)
↓
Training Pipeline
↓
(If new champion)
↓
Trigger Prediction Pipeline
↓
Generate predictions for next week
Use case: Weekly batch predictions for business planning
Implementation: This pattern is achieved through the Cloud Run Function that listens for training pipeline completion events via Pub/Sub, then triggers the prediction pipeline if a new champion model was uploaded.
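That completion handler is not listed in this article; a simplified sketch of the idea, assuming the completion event arrives as a Pub/Sub-backed CloudEvent whose JSON payload carries the pipeline state (the payload fields and environment variables are assumptions):

import base64
import json
import os

import functions_framework
from google.cloud import aiplatform


@functions_framework.cloud_event
def on_training_complete(event):
    """Trigger the prediction pipeline after a successful training run."""
    message = json.loads(base64.b64decode(event.data["message"]["data"]))

    # Only react to successful runs; in the real flow you would also check
    # whether a new champion model was actually uploaded.
    if message.get("state") != "PIPELINE_STATE_SUCCEEDED":
        return "SKIPPED"

    aiplatform.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])
    aiplatform.PipelineJob(
        display_name="taxifare-prediction-pipeline",
        template_path=os.environ["PREDICTION_TEMPLATE_PATH"],
        pipeline_root=os.environ["PIPELINE_ROOT"],
    ).submit()
    return "OK"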
Pattern 2: New Data → Continuous Training
New Data Arrives (hourly)
↓
Cloud Run Function
↓
(Check: Enough new data?)
↓
Training Pipeline
↓
(Champion/Challenger comparison)
↓
Model Registry (update champion if better)
Use case: Always have the freshest model
Implementation: The Cloud Run Function (terraform/modules/cloudrunfunction/src/main.py) triggers pipelines based on Pub/Sub events:
import json
import os
import functions_framework

@functions_framework.cloud_event
def mlops_entrypoint(event):
    # Read the pipeline config from the environment and submit the job to Vertex AI
    pipeline_config = os.getenv("PIPELINE_CONFIG")
    pipeline_config_dict = json.loads(pipeline_config)
    submit_pipeline_job(pipeline_config_dict)
The function reads configuration from environment variables and submits the appropriate pipeline job (training or prediction) to Vertex AI.
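Conceptually, submit_pipeline_job maps that config dictionary onto a Vertex AI pipeline run, roughly like this (the config field names are assumptions, not the repository's exact schema):

from google.cloud import aiplatform


def submit_pipeline_job(config):
    """Create and submit a pipeline run described by the config dictionary."""
    aiplatform.init(project=config["project_id"], location=config["location"])
    aiplatform.PipelineJob(
        display_name=config["display_name"],
        template_path=config["template_path"],
        pipeline_root=config["pipeline_root"],
        parameter_values=config.get("parameter_values", {}),
        enable_caching=config.get("enable_caching", False),
    ).submit(service_account=config.get("service_account"))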
Pattern 3: Event-Driven Training via Cloud Run Function
The Cloud Run Function can be triggered by various events (Pub/Sub, Cloud Storage, etc.) to automatically start training or prediction pipelines. The actual trigger mechanism is configured in Terraform (terraform/modules/cloudrunfunction/) and the function logic handles pipeline submission to Vertex AI.
Observability and Debugging
Key Metrics to Monitor
1. Model Performance: View in Vertex AI Model Registry:
- RMSE, MAE, MAPE, MSLE metrics for each model version
- Comparison between champion and challenger models
- Evaluation results from test set
2. Data Skew: Monitored by the model_batch_predict_op component in the prediction pipeline:
- Training vs. serving feature distributions
- Skew detection thresholds configured per feature
- Automatic email alerts when skew exceeds threshold
- Metrics logged to Cloud Logging
3. Pipeline Execution: Track in Vertex AI Pipelines console:
- Pipeline success/failure rates
- Component execution times
- Resource utilization
- Error logs and stack traces
4. Training Frequency: Monitor via Vertex AI Pipeline Schedules:
- Scheduled run frequency (hourly, daily, weekly)
- Manual vs. automatic triggers
- Champion model update frequency
Cloud Logging Queries
Find training triggers:
resource.type="cloud_run_revision"
resource.labels.service_name="mlops-trigger"
jsonPayload.message="Triggering training pipeline"
Find champion promotions:
resource.type="aiplatform.googleapis.com/PipelineJob"
jsonPayload.message=~"Challenger wins"
Find skew detections:
resource.type="aiplatform.googleapis.com/BatchPredictionJob"
jsonPayload.skew_detected=true
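The exact payload fields depend on what your components log, so treat these filters as templates. The same queries can be run programmatically with the Cloud Logging client, for example:

import google.cloud.logging as cloud_logging

client = cloud_logging.Client()

# Reuse the "find training triggers" filter from above
log_filter = (
    'resource.type="cloud_run_revision" AND '
    'resource.labels.service_name="mlops-trigger" AND '
    'jsonPayload.message="Triggering training pipeline"'
)

for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.payload)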
Dashboards
Create Cloud Monitoring dashboards to visualize:
- Vertex AI Pipeline execution success rates and durations
- Model evaluation metrics from the Model Registry
- Cloud Run Function invocation counts and errors
- BigQuery job statistics for data processing steps
- Skew detection alerts from batch prediction jobs
Responding to Model Degradation
Alert → Investigate → Retrain Workflow
1. Receive Alert: When the prediction pipeline’s model_batch_predict_op component detects data skew, it sends an email alert configured in the component parameters.
2. Investigate:
- Check Vertex AI Pipelines console for skew detection details
- Review Cloud Logging for skew metrics and feature distributions
- Compare recent prediction data against training dataset in BigQuery
3. Trigger Retraining: Manually trigger the training pipeline with latest data:
cd pipelines
poetry run python -m pipelines.utils.trigger_pipeline \
--template_path=./taxifare-training-pipeline.yaml \
--display_name=manual-retrain-pipeline \
--enable_caching=false \
--use_latest_data=true
Or use the Makefile shortcut:
make training build=false enable_caching=false use_latest_data=true
4. Validate Improvement:
- Check Vertex AI Model Registry for new model metrics
- Compare RMSE between old champion and new model
- The pipeline’s champion/challenger logic automatically promotes better models
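To compare metrics without opening the console, you can also read evaluations from the Model Registry with the SDK; a small sketch (the model display name is an assumption):

from google.cloud import aiplatform

aiplatform.init(project="my-prod-project", location="us-central1")

# Look up all versions of the model by display name (assumed name)
for model in aiplatform.Model.list(filter='display_name="taxi-traffic-model"'):
    for evaluation in model.list_model_evaluations():
        # Evaluation metrics include RMSE, MAE, etc., as logged by the pipeline
        print(model.display_name, model.version_id, evaluation.metrics)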
Optimization Strategies
1. Use Pipeline Caching:
# Enable caching for preprocessing steps that don't change
make training enable_caching=true
2. Adjust Training Schedule: Configure pipeline schedules based on data velocity:
- High-frequency data: Daily training
- Stable data: Weekly training
- Monitor skew alerts to determine optimal frequency
3. Right-Size Training Resources: Configure machine types in the get_workerpool_spec_op component based on dataset size and model complexity (see the sketch after this list).
4. Clean Up Old Artifacts: Regularly manage artifacts in:
- Artifact Registry (old pipeline versions and Docker images)
- Vertex AI Model Registry (non-champion model versions)
- Cloud Storage (old pipeline artifacts and outputs)
5. Optimize BigQuery Costs: The preprocessing SQL queries (ingest.sql, ingest_pred.sql) are optimized to:
- Filter data early in the query
- Use partitioning when available
- Limit data scanned with timestamps
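On point 3, a Vertex AI custom training worker pool spec is just a dictionary; a minimal sketch with assumed machine type and image values:

# Shape of a Vertex AI CustomJob worker pool spec; the values are assumptions
# you would size to your dataset and model complexity.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",      # scale up for larger datasets
            # "accelerator_type": "NVIDIA_TESLA_T4",
            # "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/my-project/mlops/training:latest",
            "args": ["--epochs", "20"],
        },
    }
]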
Best Practices
1. Always Version Everything
The system automatically versions:
- Models: Stored in Vertex AI Model Registry with version numbers
- Pipelines: Tagged in Artifact Registry (e.g., v1.2.3, latest)
- Docker images: Tagged in Artifact Registry matching pipeline versions
- Training data: Timestamped via timestamp parameter in pipeline runs
2. Use Champion/Challenger Pattern
Implemented in the upload_best_model_op component (a simplified sketch follows this list):
- New models are only promoted if they beat the current champion
- RMSE comparison happens automatically during training pipeline
- All models are preserved in registry for rollback capability
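A simplified sketch of that comparison logic, assuming the challenger's RMSE comes from the evaluation step and the champion is tracked with a label in the Model Registry (the labeling scheme is an assumption):

from google.cloud import aiplatform


def is_challenger_better(challenger_rmse, model_display_name):
    """Return True if the challenger beats the current champion (lower RMSE wins)."""
    champions = aiplatform.Model.list(
        filter=f'display_name="{model_display_name}" AND labels.champion="true"'
    )
    if not champions:
        return True  # no champion yet, so promote the first model

    champion_rmse = float(champions[0].labels.get("rmse", "inf"))  # assumes RMSE stored as a label
    return challenger_rmse < champion_rmse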
3. Monitor Before Optimizing
1. Deploy with monitoring
2. Observe for 1 week
3. Identify bottlenecks
4. Optimize selectively
5. Measure improvement
4. Set Up Alerts Thoughtfully
# Bad: Alert on every small change
if rmse > baseline_rmse * 1.01:
    alert()

# Good: Alert on sustained degradation
if rolling_avg_rmse(days=7) > baseline_rmse * 1.15:
    alert()
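A rolling RMSE like this requires ground truth joined back to predictions; a sketch, assuming a BigQuery table that stores predictions alongside observed fares (the table and column names are illustrative):

from google.cloud import bigquery


def rolling_avg_rmse(days=7):
    """RMSE over the last `days` of predictions joined with actual fares."""
    query = f"""
        SELECT SQRT(AVG(POW(predicted_fare - actual_fare, 2))) AS rmse
        FROM `my-project.taxi_dataset.predictions_with_actuals`
        WHERE prediction_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
    """
    return next(iter(bigquery.Client().query(query).result())).rmse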
5. Document Retraining Decisions
## Retraining Log
### 2024-01-15
- Trigger: Scheduled weekly retrain
- Data: 2024-01-08 to 2024-01-15
- Result: New model RMSE 2.3 (vs champion 2.5) → Promoted
- Notes: Improved accuracy on credit card payments
### 2024-01-22
- Trigger: Accuracy degradation alert
- Data: 2024-01-15 to 2024-01-22
- Result: New model RMSE 2.8 (vs champion 2.3) → Not promoted
- Notes: Holiday data anomaly, monitoring
Conclusion: Your Complete MLOps System
You’ve now built a complete, production-ready MLOps system across 8 articles:
- Architecture → Multi-environment design
- Developer Experience → Productive workflows
- Infrastructure → Automated with Terraform
- Components → Modular and reusable
- Training → Sophisticated pipeline
- Prediction → Scalable inference
- CI/CD → Complete automation
- Operations → Continuous improvement
Your system now:
- Trains models automatically
- Deploys only better models
- Generates predictions at scale
- Monitors for degradation
- Retrains when needed
- Maintains itself
What’s next?
- Implement it for your use case
- Customize for your data
- Extend with new features
- Share learnings with the community
Thank you for following this comprehensive series!
Now go build amazing, self-maintaining ML systems! 🚀🎉
Thanks for reading! I hope this helps you on your journey. If you found value in this series, please clap, leave a comment, and star the GitHub repo. Hit the Follow button to get notified about my next article, subscribe to the email list, and let's connect on LinkedIn!
