
¹ It’s not like compute/storage disaggregation can’t be done (and it obviously has been done) on-prem too, but there it’s usually hidden within the “infra teams” by a server group or a storage group or some other IT department that requires filing 10 tickets, waiting for a week and sometimes buying lunch or bringing a yummy sandwich to the IT folks to “facilitate the process”. Cloud changed that.
Expectation #3: Application as the first-class citizen
My business is selling shirts and shoes. I have my custom apps developed to do just that. I think in terms of these apps on-prem, but when you pitch me your cloud, I’m expected to change my mental model and start thinking in terms of projects and/or accounts (depending on the cloud provider). How’s that “user friendly” or “meeting a user where they are”?
Ultimately, what I want is to describe my app to you or, better yet, have your Database Migration Assessment (DMA) tool check out my app automatically on its own (with me authorizing it, of course, possibly for just one-time use) and automatically create the most appropriate and cost-effective setup on your cloud for me… one that doesn’t make me bend my mental model by forcing me to map my app and my business terms onto your cloud’s idiosyncratic definitions.
My app should be your cloud’s “landing zone” that, instead of introducing friction, allows me to sharpen my focus on my business while relieving me of the unnecessary infrastructure and deployment details.
That is, your cloud shouldn’t be forcing me to operate at the infrastructure level. That could indeed be acceptable and desirable for visibility in some cases, but in general I’d like to move to a whole-stack SRE team, where engineers may specialize in databases or apps, but the whole team can operate the app as a whole (not through a separate DBA team and a separate App Management team). As such, that infra level is too low for me. I want an easy, high-level mechanism for setting up all the infrastructure for my app (and that includes a database) without being bothered with your cloud’s specific implementation details. You, as a cloud provider, should know best what the right setup for me is, based on the myriads of your customers (some of whom surely require a similar setup) and on your DMA tool.
What I’m asking you, as a cloud provider, is to give me an option to think at a higher level of abstraction, lift me up from your infra and make an Application the first-class citizen on your cloud. In particular, free me up from setting up networking, security, accounting and other (platform-specific) infrastructure. This expectation#3 is a step further, an extension if you will, of what I described earlier, where you elevate me by allowing me to just work with my data (expectation#1, preferred), or at least allow me to work with the database, but disaggregate it too, just as you did with storage and compute (expectation#2).
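To make that level of abstraction a bit more tangible, here is a purely hypothetical sketch of what “describing my app in my own terms” could look like. None of this is an existing cloud API; every name below (AppSpec, cloud.provision, the fields) is made up for illustration only.

```python
# Purely hypothetical illustration: not an existing cloud API.
# The point is the level of abstraction: I describe my application and its
# needs in my own terms, and the provider derives the projects, networking,
# security and the right database setup from that description.
from dataclasses import dataclass, field


@dataclass
class AppSpec:
    name: str
    workload: str                 # e.g. "transactional" or "analytical"
    peak_requests_per_sec: int
    data_footprint_gb: int
    availability_target: str      # e.g. "99.95%"
    compliance: list[str] = field(default_factory=list)


shop = AppSpec(
    name="shirts-and-shoes",
    workload="transactional",
    peak_requests_per_sec=500,
    data_footprint_gb=50,
    availability_target="99.95%",
    compliance=["PCI-DSS"],
)

# An imaginary provider-side entry point: "here is my app, you figure out the rest".
# cloud.provision(shop)
print(shop)
```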
Expectation #4: Unearth and reveal data insights
Whether you, dear reader, agree with my pitch of a database being pure overhead, a (pretty heavy) tax on me, as a customer, for hosting my data, or you are more in the camp of folks who prefer a less radical approach and are just looking for some automation of mundane database tasks, I think you’d agree with me that it would be really useful if you could just opt in to a service that makes sense of your data. Just like that. As an on-prem customer, I know I would!
Indeed, whenever in my past consulting life I was introduced to a new customer database setup, as part of the intake interview I always inquired about the kind of data that actually required those gigabytes or terabytes of storage that the customer DBAs were proud of administering. The rest of my questions were usually easily answered, but this one stumped many, and I rarely got a reasonable answer that could be easily digested and supported by back-of-the-envelope arithmetic.
This has always been fascinating to me. Lots of effort, money (including my consulting fee) and infrastructure (compute, storage, networking) were often unquestioningly spent on hosting large databases, but few in the organization could give a straightforward, easy-to-understand answer as to what kind of data was stored inside to really justify its size and all the expenses associated with hosting, patching, securing and protecting it.
“What do you mean? Data is the lifeblood of our business, right?” That was the first standard response. Sure, but if you operate on an active dataset of a few hundred megabytes, or perhaps 1GB or even 10GB, on any given day for your transactional database… what else is stored in your 10TB database?
Naturally, we can design a smart caching solution, get a bigger machine, allocate all the available RAM (and perhaps enable the disk cache for AlloyDB Omni to extend your RAM even further), or perhaps scale horizontally, and, sure, enable the columnar engine for data crunching and analytical queries, and lots of other cool features that, as a consultant, I’d be happy to implement for you, but… do you actually make use of all that data that you claim you need faster access to?
To be sure, most of this data was probably useful… for something, but to me it often wasn’t apparent what for exactly, and so I often had a feeling that my technical solutions/recommendations could’ve been a lot more cost-effective if I had known the exact data access patterns (which I could get if customers allowed me to spend time on it), and if I had understood the exact nature of the business and the treasures hidden inside the raw data.
The fascinating part to me was that the people who looked after that data not only couldn’t easily answer this “what’s actually stored in your 10TB database” question, but were often stunned that I even asked it. Maybe it’s just me, but I’m the kind of person who prefers to have a ballpark figure in mind before I approach the cash register in a grocery store. My estimate may be 10, 20, perhaps even 30% off, but when I review my credit card statement later and see a weekend grocery charge of $500, my spidey senses tingle that something may be off.
I see data in the same light. As I already pointed out earlier in this blog series, I see a database as a tax that I have to pay to host my data (see expectation#1 above). All of these system catalogs or data dictionaries, online/archive redo logs or WAL logs, undo or temp, backups and replicas that take space and consume CPU resources… I didn’t ask for any of it. All I want is to host my data, but I’m forced to pay a tax for my data to be stored in a particular format, accessed in a particular way, my requests queued waiting for locks or latches or mutexes, etc, etc. Overhead!
What I always wondered about, and sometimes asked customers to do, is a simple mental exercise: run a back-of-the-envelope estimate of the useful data vs. the tax they pay for hosting it (i.e. for maintaining that shell of a database). What’s that overhead like? Is it just the data normalization, the keys that are stored in multiple tables, the indexes that are perhaps overblown to avoid hitting the actual tables to satisfy queries, or the “real overhead” that stems from data that is very carefully accumulated but not really used? That is, imagine that you dump your actual, daily-used transactional data to a file, or two, or a hundred of them. How much space would these flat files “weigh” compared to the size of your database? Would it be 20% or 80%?
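For the curious, here is a minimal sketch of that back-of-the-envelope exercise for a PostgreSQL-compatible database (AlloyDB Omni included), assuming you can connect with psycopg2. The connection string is a placeholder and the breakdown is deliberately rough.

```python
# Rough back-of-the-envelope split: what the database "weighs" overall vs.
# the raw heap data in user tables. The rest (indexes, TOAST, visibility and
# free-space maps, system catalogs, bloat) is the "tax" this post talks about.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_database_size(current_database())")
    total_db = cur.fetchone()[0]

    cur.execute("""
        SELECT coalesce(sum(pg_relation_size(relid)), 0)  AS heap_bytes,
               coalesce(sum(pg_indexes_size(relid)), 0)   AS index_bytes,
               coalesce(sum(pg_total_relation_size(relid)
                            - pg_relation_size(relid)
                            - pg_indexes_size(relid)), 0) AS toast_and_maps_bytes
        FROM pg_stat_user_tables
    """)
    heap, idx, toast = cur.fetchone()

gib = 1024 ** 3
print(f"database size:      {total_db / gib:8.1f} GiB")
print(f"  table heap data:  {heap / gib:8.1f} GiB")
print(f"  indexes:          {idx / gib:8.1f} GiB")
print(f"  TOAST and maps:   {toast / gib:8.1f} GiB")
print(f"  everything else:  {(total_db - heap - idx - toast) / gib:8.1f} GiB")
```

Even this crude split tends to spark the right conversation, and keep in mind that the heap number is still an overestimate of the flat-file “weight” (row headers, fill factor, bloat), so the real ratio is often worse than it looks.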
And yes, as they say, space is cheap nowadays, except that the highly available, triple replicated, encrypted at rest, enterprise storage backed by good durability numbers… ain’t.
And sure, you could run a query, perhaps a really clever one, perhaps more than a page long, perhaps with analytics and window functions and correlated subqueries and perhaps with recursive nesting and subquery factoring, etc., to understand what’s really stored in this table or that one. But is that what you’d qualify as easily accessible, easily discoverable, easy-to-visualize metadata that also lends itself easily to unearthing the hidden gems stored within your data and your changing data dynamics?
And if it is, why is it that most customers couldn’t easily tell me what’s really stored in that 10TB database and what the real value of that data is?
In my old consulting days I observed this problem almost everywhere. Tons of data, limited use of it. That is, the old adage of being “data rich, yet information poor” applied to many shops I visited. At startups it was usually “let’s collect that data somehow, clean it up and think of what to do with it later” (which rarely happens, if ever). At the more mature customers the data was usually well normalized, with minimal duplication and discrepancies, with the proper referential integrity constraints in place, as well as indexes, but… not much of that data was “mined” beyond the rigid ETL processes that took a while to develop and that typically didn’t adapt to new data changes fast enough.
While some on-prem shops had sophisticated decision support systems with expensive BI tools and data ware[houses|marts|lakes|ponds|whatever], in my experience as a database consultant most lacked the facilities to help them understand the content of their databases beyond the immediate operational data points that came in the form of standard canned reports. Understanding the true value of their accumulated data and distilling insights from it was often beyond reach, beyond the capabilities of the databases and the toolset deployed on-prem, and so it’s high on my wishlist for you, as a cloud provider, to make this service available to me out of the box as part of your standard value proposition.
What kind of insights? Both operational and business. The former could be a hint that 80% of my data hasn’t been accessed for a year and so it can perhaps be offloaded to tiered/cold storage (HDDs with an order of magnitude lower IOPS and throughput are probably fine if this data isn’t accessed anyway) to save me money. And perhaps offload it even further to object storage if nobody screams about it for another year.
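As a sketch of how such an operational hint could be derived on a PostgreSQL-compatible database today, here is a minimal example. The connection details are placeholders, and note that these counters only cover the period since the statistics were last reset (PostgreSQL 16+ also exposes last_seq_scan/last_idx_scan timestamps for a sharper signal).

```python
# Minimal sketch: flag the largest user tables that haven't been scanned since
# statistics were last reset. These are natural candidates for tiered/cold storage.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT schemaname, relname,
               pg_size_pretty(pg_total_relation_size(relid)) AS size,
               coalesce(seq_scan, 0) + coalesce(idx_scan, 0) AS scans
        FROM pg_stat_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
    """)
    for schema, table, size, scans in cur.fetchall():
        if scans == 0:
            print(f"cold candidate: {schema}.{table} ({size}), no scans since stats reset")
```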
And business insights… well, the sky’s the limit here, no pun intended, for you, as a cloud provider, to show me the benefit from using your data insights service on my specific data (we’ll talk about it later). That’s your chance to impress me with your AI inference service (like this one)!
Expectations ##5-N:
So far, in these first three posts of this series, I’ve only managed to define four expectations. I hope to do better in the next, fourth installment and finish off the rest. Thank you for sticking with me!
Until then! See part 4
Credits
The Omni Database Engineering team at Google is the one that makes the magic presented in this blog series happen. I was the first, founding engineer, helped build the team, became a TL and later UberTL, so I believe I have a good perspective on why and how we approached the Operators and evolved them over time. Today, the AlloyDB Omni K8s Operator is a production-grade database management solution used by major banks, retailers and other customers, and the eng team deserves all the credit.
I’d also like to thank Marc Fielding, Gleb Otochkin, Anand Kumar, Virender Singla, Martin Nash and Ash Gbadamassi for reviewing these blog posts, correcting my broken English and my broken thoughts. All the remaining mistakes and inaccuracies are obviously still mine.