
Great, this is a good-looking query. But what happens when you move beyond this trivial example and use Gemini for text-to-SQL against a real-world database, with real-world user questions? It turns out that the problem is more difficult. The model needs to be complemented with methods to:
- provide business-specific context
- understand user intent
- manage limits of LLM generation, such as differences in SQL dialects
Let’s take a look at each of these challenges.
Problem #1: Providing business-specific context
Just like data analysts or engineers, LLMs need significant amounts of knowledge, or "context," to generate accurate SQL. This context can be explicit (what does the schema look like, what are the relevant columns, and what does the data itself look like?) or implicit (what is the precise semantic meaning of a piece of data, and what does it mean for the specific business case?).
Specialized model training, or fine-tuning, is typically not a scalable solution to this problem. Training on the shape of every database or dataset, and keeping up with schema or data changes, is both difficult and cost-prohibitive. Business knowledge and semantics are often not well documented in the first place, and are difficult to turn into training data.
For example, even the best DBA in the world would not be able to write an accurate query to track shoe sales if they didn't know that cat_id2 = 'Footwear' in a pcat_extension table means that the product in question is a kind of shoe. The same is true for LLMs.
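To make the stakes concrete, here is a sketch of the query that this business knowledge unlocks. Only the pcat_extension table and the cat_id2 filter come from the example above; the orders table, its columns, and the product_id join key are hypothetical.

```sql
-- Hypothetical schema: the orders table and product_id join key are
-- illustrative; pcat_extension and cat_id2 come from the example above.
SELECT
  o.product_id,
  SUM(o.quantity) AS units_sold
FROM orders AS o
JOIN pcat_extension AS p
  ON o.product_id = p.product_id
WHERE p.cat_id2 = 'Footwear'  -- the business rule the model must be given
GROUP BY o.product_id
ORDER BY units_sold DESC;
```

Without the cat_id2 = 'Footwear' mapping in its context, no model, however capable, can produce that WHERE clause.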
Problem #2: Understanding user intent
Natural language is less precise than SQL. An engineer or analyst faced with an ambiguous question can recognize that they need more information and go back to ask the right follow-up questions. An LLM, on the other hand, tends to try to give you an answer regardless, and when the question is ambiguous, it is prone to hallucinating.
Example: Take a question like "What are the best-selling shoes?" One obvious point of ambiguity is what "best-selling" actually means in the context of the business or application: the most-ordered shoes, or the shoe brand that brought in the most money? Further, should the SQL count returned orders? And how many kinds of shoes should the report show?
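The ambiguity becomes obvious once you write the candidate queries down. Both of the following are defensible readings of the same question; the orders and products tables and their columns are hypothetical:

```sql
-- Reading 1: "best-selling" = the most-ordered shoes
SELECT p.product_name, SUM(o.quantity) AS units_sold
FROM orders AS o
JOIN products AS p ON o.product_id = p.product_id
WHERE p.category = 'Footwear'
GROUP BY p.product_name
ORDER BY units_sold DESC
LIMIT 10;

-- Reading 2: "best-selling" = the brand with the most revenue,
-- this time also excluding returned orders
SELECT p.brand, SUM(o.quantity * o.unit_price) AS revenue
FROM orders AS o
JOIN products AS p ON o.product_id = p.product_id
WHERE p.category = 'Footwear'
  AND o.status <> 'RETURNED'
GROUP BY p.brand
ORDER BY revenue DESC
LIMIT 10;
```

A model that silently picks one reading returns a plausible-looking answer to a question the user may not have asked.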
Further, different users need different kinds of answers. If the user is a technical analyst or a developer asking a vague question, giving them a reasonable, though perhaps not 100% correct, SQL query is a good starting point. If, on the other hand, the user is less technical and does not understand SQL, providing precise, correct SQL is more important. In either case, being able to reply with follow-up questions to disambiguate, explain the reasoning behind an answer, and guide the user to what they are looking for is key.
Problem #3: Limits of LLM generation
Out of the box, LLMs are particularly good at tasks like creative writing, summarizing, or extracting information from documents. But some models can struggle to follow precise instructions and get details exactly right, particularly when it comes to more obscure SQL features. To produce correct SQL, the LLM needs to adhere closely to what can quickly become a complex specification.
Example: Consider the differences between SQL dialects, which are more subtle than the differences between programming languages like Python and Java. As a simple example, if you're using BigQuery SQL, the correct function for extracting a month from a timestamp column is EXTRACT(MONTH FROM timestamp_column). But if you are using MySQL, you use MONTH(timestamp_column).
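Side by side, the same monthly rollup looks like this (my_table and timestamp_column are placeholders):

```sql
-- BigQuery (GoogleSQL)
SELECT EXTRACT(MONTH FROM timestamp_column) AS month,
       COUNT(*) AS row_count
FROM my_table
GROUP BY month;

-- MySQL
SELECT MONTH(timestamp_column) AS month,
       COUNT(*) AS row_count
FROM my_table
GROUP BY month;
```

Small as the difference looks, a generated query that uses the wrong form may not run at all on the target engine.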
Text-to-SQL techniques
At Google Cloud, we’re constantly evolving our text-to-SQL agents to improve their quality. To address the problems listed above, we apply a number of techniques.
Source Credit: https://cloud.google.com/blog/products/databases/techniques-for-improving-text-to-sql/