
This multi-part blog series is a backstory that deep-dives into the motivation behind building Kubernetes Operators for running databases. This seventh installment outlines my take on the shared responsibility model for the Assisted and AI-Assisted Database Management product, attempts to answer why I chose to rely on the Operator, and wraps up by listing some of the Kubernetes challenges.
Written by: Boris Dali, Database Engineer @ Google (LinkedIn)
Disclaimer
First off, the disclaimer: this blog series is neither official Google documentation, nor is it an authoritative source. For the former, the docs on AlloyDB Omni in particular, please see here; for the latter, please feel free to reach out to the AlloyDB product team. Google Cloud official blogs are posted here. Medium hosts the Google Cloud Community here. This is neither. The opinions I express in this blog post series are my own and may not represent or agree with Google’s official position on the subject.
So what is it then? Well, if you think of this blog series as one engineer’s pseudo-random ramblings on a particular topic (“running databases in containers on K8s”, that is), you won’t be far off 🙂.
Recap
If you missed part 1, part 2, part 3, part 4, part 5 and part 6 of this backstory and want to follow along, it may be more logical to start there (and yes, if you haven’t read them, apologies in advance, it’s not exactly a short read 🙂)
To recap, I started with the goal of exploring my reasoning behind the decision to invest in containers and K8s as a viable alternative to the more mainstream ways of running databases, but went down the rabbit hole of building the foundation first. It took me four installments in this blog series to define with sufficient (I hope) depth what those 13 expectations are. In the fifth installment I presented what I referred to as “a reality check” to see how cloud providers’ DBaaS systems fare against my expectations. TL;DR: they didn’t 😭.
Finally, in the sixth installment of this series I outlined the solution that in my mind closes the gap and satisfies many of my 13 expectations. That solution is what I referred to as the Assisted and AI-Assisted Database Management product: portable, run-anywhere (including temporarily disconnected and even fully air-gapped environments), leaving a customer in the driver’s seat with the “root” privileges, and offering a declarative management interface, just like SQL, but in the form of YAML manifests to state the desired database intent.
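To make the “YAML as the declarative interface” point concrete, here is a minimal sketch of what such a manifest could look like. The API group, kind and field names below follow my recollection of the AlloyDB Omni Operator CRDs, so treat them as illustrative and check the official documentation for the actual schema:
apiVersion: alloydbomni.dbadmin.goog/v1   # illustrative API group/version
kind: DBCluster
metadata:
  name: pg1
  namespace: db
spec:
  databaseVersion: "15.5.0"               # desired Postgres-compatible engine version
  primarySpec:
    adminUser:
      passwordRef:
        name: pg1-admin-password          # Secret holding the initial admin password
    resources:
      cpu: 2
      memory: 8Gi
      disks:
        - name: DataDisk
          size: 10Gi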
The Assisted Database Management product doesn’t yet have a commonly accepted delineation of responsibilities, no clear line in the sand with respect to what a cloud provider vs. a customer is responsible for, so I think it’s critical to define the shared responsibility model to set the right expectations for both parties.
Shared Responsibility Model
In contrast to the fully managed DBaaS systems where a service provider accepts most of the responsibilities and just exposes an API and SQL endpoints, for the Assisted Management product to succeed, both a product provider and a customer are to collaborate and play by the following rules (disclaimer: please see the top of this and the rest of the blog posts in this series, but it bears repeating: this is just my aspirational list from some six years ago and so not all features have been implemented [yet?] in AlloyDB Omni K8s Operator — see the official documentation for details on what’s actually available today and please follow the upcoming blogs on specific subjects):
Provider:
- Defines the exact expectations and the rules of the game for a customer to follow to ensure a successful collaboration. In my opinion, this is critical because in contrast to a fully managed DBaaS where you can get away with just following the best practices because a service provider carries most of the weight, here both parties have to play along.
- Provides a downloadable product (an Operator for K8s or an RPM for VMs, plus a fully compatible Postgres database), as well as security vulnerability and functional bug fixes in the form of patches and new releases. It’s up to a customer to decide whether to apply a provided fix or not, but a provider is to publish release notes and notify registered customers of both a problem (CVE, functional bug) and a patch once it becomes available.
- Provides a rich and well-documented set of Prometheus metrics¹. That is, in addition to the standard Operator metrics, a Critical User Journey (CUJ) can’t be marked as completed by a product provider without delivering a set of metrics, sample alerts (with recommended thresholds) and Grafana dashboards. A customer is obviously free to use any observability platform of their choice, but the Grafana dashboards are to come with the product “out of the box” (and if Grafana is not the o11y tool of choice, they can be used as a reference).
- Enumerates the list of mandatory metrics that customers are expected to scrape from the Prometheus endpoints and create alerts on, and provides templates for scraping and for creating alerts in common frameworks, e.g. Prometheus alerts, Grafana alerts, etc. (a sample alert sketch follows after the next bullet).
- The rest of the Prometheus metrics can be considered optional: they provide additional information for later investigation if needed, but do not serve as the basis for customer alerting on critical events in the product.
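For illustration, here is a minimal sketch of what such an alert template could look like as a Prometheus alerting rule. The metric name, labels and threshold below are hypothetical placeholders of my own, not the actual metrics shipped with the product, so treat this as the shape of the idea rather than a copy-paste config:
groups:
  - name: alloydbomni-mandatory-alerts          # hypothetical rule group
    rules:
      - alert: DBClusterScheduledBackupFailing  # alert on a (hypothetical) mandatory metric
        expr: increase(alloydbomni_backup_failures_total[1h]) > 0   # metric name is a placeholder
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Scheduled backups are failing for {{ $labels.dbcluster }}"
          runbook: "link to the corresponding provider runbook"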
- Defines a clear and documented set of the K8s events associated with every CUJ. Similar to the metrics, alerts and dashboards, a CUJ is not to be marked as complete (i.e. not GA ready), until the set of K8s events is defined, tested and published in the docs.
- Provides conditional SLOs (see expectation #6 in the earlier post) based on the traffic-light supportability behavior: if the “supported status” is marked as green, the otherwise grayed-out SLOs are enabled. If a customer breaks the contract (e.g. by deleting one of the AlloyDB Omni views from the system catalog, running out of space and ignoring an alert, disallowing a scheduled backup, deleting a StatefulSet, scaling the Controller down to 0, deleting a storage class, and so on), the provider is to do everything possible to diagnose the problem and show a path back to a supported state. To be sure, the supportability traffic light is not there to punish a customer, but rather to indicate that one of the prerequisites for the normal functioning of the Operator or the database is not met. This is an innovation, because in contrast to DBaaS systems that run on the cloud providers’ own hardware, are fully controlled by them and are watched like a hawk by their SREs, Assisted Database Management products run on any K8s distribution and work with many variations of BYO storage and compute. alloydbomnichk is a CLI designed to deliver on what’s described in this bullet point.
- The above requirement is there to push a provider to deliver diagnostic tools for their product and the database, so that a customer can either self-diagnose and address an issue on their own (aka “shift-right support”, which is preferable) or automatically gather the artifacts into an incident that the customer can review and upload to the support team. Also provides customers with an AI-based troubleshooting tool hosted on Google Cloud where a customer can describe their problem in natural language and get recommendations based on Google’s internal knowledge base and previous experience.
- Introduces a concept of a Critical Incident (CI) to facilitate troubleshooting: any abnormal behavior detected by a Controller is to result in a condition in the Custom Resource (CR) status with a timestamp, the type of the incident and a clear error/warning message that points further to a runbook (see the sketch below). CIs are to be actionable or they shouldn’t be raised at all. All CIs are to be numbered and so are the runbooks. The runbooks are to be hosted on Google Cloud alongside the official Operator documentation and provided to a customer in a downloadable format if a customer opts for hosting them internally. In addition to outcomes unexpected by a Controller, a CI is to cover SLO violations as CR status conditions as well (assuming, of course, that the supportability traffic light is lit green).
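As a sketch, a CI surfaced as a CR status condition could look roughly like the fragment below. The condition type, CI number and runbook reference are hypothetical and only illustrate the shape of the idea:
status:
  conditions:
    - type: CriticalIncident              # hypothetical condition type
      status: "True"
      lastTransitionTime: "2025-01-15T08:30:00Z"
      reason: CI0042                      # hypothetical CI number, paired with runbook 0042
      message: "Scheduled backup failed: volume out of space. See runbook CI-0042."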
- The goal of the above requirement is that when a customer creates an incident and contacts support, there’s a clear CI number associated with the error condition and an expectation that the customer has already looked at the right runbook, couldn’t resolve the issue at hand and has auto-gathered a snapshot of the required artifacts for troubleshooting (e.g. the CR description including status and conditions, STS, Pod and Controller logs, K8s events including the events history, etc.).
- Ultimately, the objective of these strict requirements is to get to the point where, if no CI, no abnormal metric value, no alert and no event is raised, then whatever problem a customer is facing, they need to look outside of the Operator (at the storage, compute, networking, environment, etc.). And yes, if the Operator itself becomes unhealthy, this is to be reflected in the standard Operator metrics for customers to see.
- Allows a customer to operate the product (the Operator and the database) in a disconnected or even fully air-gapped environment, where it’s the customer’s responsibility to get the patches, transfer them to a location accessible by the Operator and trigger an upgrade. If the product operates in a connected mode, shows an indication to a customer when a new patch/fix is available for download.
- Allows a customer to manage their on-prem hosted (or hosted on other clouds) AlloyDB Omni databases from a cloud console alongside AlloyDB databases created “natively” on Google Cloud. In particular, allows a customer to register these databases, upload logs, metrics, alerts, etc. — see expectation#11 for details.
- Provides an ability for a customer to create an AlloyDB replica on Google Cloud for an AlloyDB Omni database created on-prem or other clouds.
- Provides customers with a two-tier Control Plane in the form of two Operators that can be installed together in the same K8s cluster (in which case they can both be packaged in a single Helm chart or OLM bundle for simplicity) or in separate clusters. The first, customer-facing Operator is to facilitate fleet management operations (e.g. change a database flag for all Dev databases) and so it effectively acts as a management plane, while the second-tier Operator is to act as a local control plane (one that doesn’t know or care about other clusters, the Controllers deployed in them and the replicas they may be hosting). A sketch of a fleet-level change follows below.
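Purely for illustration, a fleet-level change submitted to the management-plane Operator could look something like the manifest below. The API group, kind and field names are entirely hypothetical and only sketch the “change a flag for all Dev databases” example from the bullet above:
apiVersion: fleet.example.com/v1alpha1    # hypothetical API group for the management plane
kind: FleetDatabaseChange                 # hypothetical kind
metadata:
  name: enable-pgaudit-on-dev
spec:
  selector:
    matchLabels:
      environment: dev                    # targets every database cluster labeled as Dev
  change:
    databaseFlags:
      pgaudit.log: "all"                  # the flag to roll out across the selected fleet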
- Provides customers with the AlloyDB Omni SQL CLI utility called alloydbomnisql for managing database clusters via SQL-like commands and for orchestrating multi-step declarative database cluster changes as a single, atomic, transaction-like operation (with an automatic rollback of all changes if any of the individual commands in a block fails halfway through):
cat <<EOF | alloydbomnisql
set dbservice=alloydbomni
set connect=<dbcluster>:ns
# set whenever sqlerror exit-and-rollback (default behavior)
# SQL to be converted to YAML or call an existing YAML directly:
create dbcluster pg1 --type=postgres --version 17;
create dbcluster ora1 --from_file=<yaml_manifest>;
load from <source> into <dbcluster>;
update dbcluster set cpu_count=4, memory=100M, disk=1G where name=pg1;
select * from dbclusters;
EOF
# Think of Omni SQL as a TF for a hybrid and multi-cloud database management
- Provides a GitOps-friendly database management interface where the customer’s checked-in intended state doesn’t get overridden by the Operator’s Controllers.
- Provides a GitOps flow where database changes checked into a repo optionally get deployed on merge to a Build database, a customer-defined soaking period is observed (a week by default), after which the change automatically propagates to the Test, Staging, etc. environments. For the change to be applied to Prod, by default, a manual SRE approval is required, but the flow of the environments (e.g. like this), the soaking period, the revert criteria and the necessary approvals are all fully configurable by a customer. The overall idea is to offer customers a way to ensure consistency across the database fleet, so that config adjustments, patches or any other changes that affect database behavior are peer-reviewed, documented and can be automatically propagated through the fleet in an orderly fashion via a progressive rollout, or not at all (i.e. rolled back all the way if problems are discovered with a new config/patch). A sketch of such a rollout policy follows below.
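Again purely as an illustration, such a progressive rollout policy could be expressed declaratively along these lines; the kind, API group and field names are hypothetical and not part of any shipped product:
apiVersion: fleet.example.com/v1alpha1    # hypothetical API group, same as the fleet sketch above
kind: RolloutPolicy                       # hypothetical kind
metadata:
  name: default-database-change-flow
spec:
  stages:
    - environment: build
      trigger: on-merge                   # deploy automatically once the change is merged
    - environment: test
      soakingPeriod: 168h                 # one week of soaking before promotion
    - environment: staging
      soakingPeriod: 168h
    - environment: prod
      approval: manual                    # an SRE has to approve the Prod promotion
  revertOnFailure: true                   # roll back through the fleet if problems are discovered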
- Provides customers with the ability to mix and match individual services.
- Provides detailed guides and documentation for all major database CUJs, including backups, patching, HA/DR, security and user management, monitoring, logging, etc. An AI chatbot is to be available on the provider’s cloud to accept customer requests in natural language, scour its knowledge base and provide steps to resolve the problem at hand. Depending on the customer’s feedback, the AI is either to proceed with troubleshooting or, if stuck, file a ticket with the support team on the customer’s behalf.
- Stands by their product and provides customer support for it, including support for all the tools (OSS or proprietary) used/included by a product. For the OSS tools used in a product, a provider is to either work with the upstream vendor or patch (and contribute upstream) an OSS fix required by a customer.
Customer:
- Retains the “root” privilege on their K8s cluster (so a cluster admin role) or a VM, as well as the full privileges on a database cluster.
- Reviews the provider’s rules and best practices for the Assisted Management product and assesses whether it suits their needs. If it’s a good fit, a customer is to follow the rules, scrape the mandatory metrics with their favorite tool and create alerts as instructed by the provider in their favorite alerting framework (a sketch of a scrape config follows below).
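As a sketch of the scraping side: assuming the Operator and database Pods expose a Prometheus endpoint and carry an app label like the one below (both are assumptions of mine, so check the labels in your actual install), a scrape job could look roughly like this:
scrape_configs:
  - job_name: alloydbomni                  # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                          # discover Pods via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: alloydbomni                 # assumed Pod label value; adjust to the real labels
        action: keep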
- Checks the support status either via the alloydbomnichk CLI or the corresponding CR. If the status is anything other than green, reviews the recommendations to get back to a supported state.
- (Optional, but recommended) Subscribes to the provider’s mechanism for being notified when fixes (especially critical CVE fixes) become available for download. A customer is to decide whether to apply a patch or skip it based on the release notes published by the provider.
- Provisions the underlying infrastructure (if used on-prem) or requests it from a cloud provider if used on a public/private cloud; in either case, ensures that there’s sufficient capacity according to the requirements outlined by the provider.
While it may be acceptable for a provider to declare some of the CUJs as beta-complete without providing all of the deliverables outlined above (e.g. metrics, alerts, dashboards, events, CIs, runbooks, docs), in general, a CUJ is not to be marked as GA without the complete set (i.e. it should be part of the provider’s pre-release checklist).
For instance, if scheduled backups start failing, a set of metrics is to be emitted and a customer is to have an alert in place to be notified that something (likely in the customer environment, e.g. disk space) has changed and requires attention. If a sync HA replica creation hangs because the underlying backup operation gets stuck (e.g. due to Cilium or other CNI rule changes), the operation is to be time-capped, and a CI, an event and a metric are to be raised to reflect the problem.