Authors — Akanksha Bhagwanani and Sai Charan Tej Kommuri
Imagine asking your data agent,
“Show me the average order value by region for the last quarter.”
and getting back… a SQL query that joins the wrong tables, sums the wrong field, or uses state_name instead of country.
It’s not the model’s fault — it simply doesn’t know your data landscape.
It’s like asking a tourist for directions in a city without giving them a map.
This post is about giving your agent that map.
It’s about grounding — teaching your AI assistant how your data is structured, what each table means, and how things connect. And we’ll use the Dataplex Knowledge Engine to do it.
Why do Agents Struggle with Your Organization’s Data?
Most data warehouses, including BigQuery, have rich metadata hidden in their schema — but agents only see text prompts.
So when you ask:
“What’s the repeat purchase rate among loyal customers?”
the model needs to figure out:
- Which table has orders?
- What column tracks repeat behavior?
- How is “loyal” defined?
Without context, it guesses. With context, it reasons.
The Knowledge Engine JSON: A Map for Machines
The Dataplex Knowledge Engine converts your dataset metadata into a JSON artifact that an agent can easily consume.
It’s like a schema memory capsule: a compact, machine-readable document that describes your data universe — tables, columns, relationships, and even example queries.
Here’s a simplified snippet from thelook_ecommerce (a BigQuery public dataset):
{
  "dataset": "thelook_ecommerce",
  "tables": [
    {
      "name": "orders",
      "description": "Customer orders including total cost, status, and timestamps"
    },
    {
      "name": "users",
      "description": "Registered customers with demographic attributes"
    }
  ],
  "columns": [
    {"table": "orders", "name": "user_id", "dtype": "INTEGER", "description": "Foreign key to users table"},
    {"table": "orders", "name": "total_cost", "dtype": "FLOAT", "description": "Order value in USD"},
    {"table": "users", "name": "country", "dtype": "STRING", "description": "Country of residence"}
  ],
  "relationships": [
    {"src_table": "orders", "src_column": "user_id", "dst_table": "users", "dst_column": "id", "relation_type": "fk"}
  ],
  "recommended_queries": [
    {"intent": "average order value by country", "sql_template": "SELECT u.country, AVG(o.total_cost) FROM orders o JOIN users u ON o.user_id = u.id GROUP BY u.country"}
  ]
}
This JSON becomes the grounding context that you can pass to any LLM-based data agent.
How Agents use this Context
Let’s walk through the major use cases — with examples from thelook_ecommerce.
(a) NL2SQL that actually understands your Schema
Without grounding:
“Total revenue by gender last month”
→ The model might guess column names like revenue, sales, or gender_code — which may not exist.
With grounding, the agent reads that:
- orders.total_cost stores revenue
- users.gender stores gender
- orders.created_at stores timestamps
- There’s a user_id → users.id relationship
So the agent confidently generates:
SELECT u.gender, SUM(o.total_cost) AS total_revenue
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at BETWEEN '2025-09-01' AND '2025-09-30'
GROUP BY u.gender;
It’s not just syntax; it’s semantic reasoning.
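Here’s a minimal Python sketch of that flow, assuming the exported context is saved locally as knowledge_engine.json: load the JSON, fold it into the prompt, and hand the result to whatever model client you already use (the model call itself is left to you).

import json

def build_grounded_prompt(question, context_path="knowledge_engine.json"):
    """Assemble an NL2SQL prompt that carries the Knowledge Engine JSON as grounding context."""
    with open(context_path) as f:
        context = json.load(f)  # tables, columns, relationships, recommended_queries
    return (
        "You are a data reasoning agent for BigQuery.\n"
        "Use ONLY the tables, columns, and relationships in this dataset context:\n"
        + json.dumps(context, indent=2)
        + f"\n\nWrite a SQL query for: {question}\nReturn only SQL."
    )

prompt = build_grounded_prompt("Total revenue by gender last month")
# Pass `prompt` to your LLM client (Vertex AI, LangChain, etc.); the call is omitted here.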
(b) Detecting Duplicate Tables and Columns
In large enterprises, duplicate tables are a silent cost sink and a drag on data discovery.
Suppose your data warehouse has:
- orders, orders_backup, and orders_cleaned_2023
All with similar columns like total_cost and user_id.
The Knowledge Engine JSON captures column names, types, descriptions and relationships, letting an agent spot overlaps:
“These tables share 92% schema similarity — consider merging or retiring one.”
This can be extended across projects, too — comparing e-commerce and marketing datasets to identify redundant customer dimension tables.
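One way an agent can score that overlap, sketched here under the assumption that you’ve already parsed the Knowledge Engine JSON into a Python dict, is plain Jaccard similarity over each table’s (column name, type) pairs; the 0.9 threshold is illustrative, not a product default.

from itertools import combinations

def column_signature(columns, table):
    """Set of (name, dtype) pairs for one table, taken from the Knowledge Engine JSON."""
    return {(c["name"], c["dtype"]) for c in columns if c["table"] == table}

def schema_similarity(columns, table_a, table_b):
    """Jaccard similarity between two tables' column signatures."""
    a, b = column_signature(columns, table_a), column_signature(columns, table_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicate_candidates(context, threshold=0.9):
    """Return table pairs whose schemas overlap enough to flag for review."""
    tables = [t["name"] for t in context["tables"]]
    cols = context["columns"]
    return [
        (t1, t2, round(schema_similarity(cols, t1, t2), 2))
        for t1, t2 in combinations(tables, 2)
        if schema_similarity(cols, t1, t2) >= threshold
    ]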
(c) Recommending Glossary Terms
Data literacy depends on consistency.
Let’s say your column is named net_payment_amount but your glossary defines “Revenue” as “Sum of order total excluding refunds and discounts.”
The agent, reading column descriptions and examples, suggests:
“Column orders.total_cost maps closely to the glossary term Revenue.”
Over time, your glossary becomes self-updating — not because someone curated it manually, but because your Knowledge Engine gave the agent the clues.
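A toy version of that matching, using difflib string similarity instead of embeddings and an invented two-entry glossary purely for illustration; a real implementation would pull glossary terms from your catalog and use semantic similarity, but the shape of the logic is the same.

from difflib import SequenceMatcher

glossary = {
    # Illustrative entries only -- yours would come from your business glossary.
    "Revenue": "Sum of order total excluding refunds and discounts",
    "Customer Country": "Country of residence of a registered customer",
}

def suggest_glossary_terms(columns, min_score=0.4):
    """Match each column description from the Knowledge Engine JSON to its closest glossary definition."""
    suggestions = []
    for col in columns:
        desc = col.get("description", "")
        best_term, best_score = None, 0.0
        for term, definition in glossary.items():
            score = SequenceMatcher(None, desc.lower(), definition.lower()).ratio()
            if score > best_score:
                best_term, best_score = term, score
        if best_score >= min_score:
            suggestions.append((f'{col["table"]}.{col["name"]}', best_term, round(best_score, 2)))
    return suggestions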
(d) Identifying Possible Data Products
Sometimes, your datasets already contain the ingredients for data products — you just haven’t named them yet.
Example: in thelook_ecommerce, frequent joins between:
- orders,
- users, and
- inventory_items
may indicate a “Customer Purchase Insights” data product.
The agent, observing these patterns and relationships, could suggest:
“Propose a data product combining orders, users, and inventory to support revenue analysis by product category.”
This turns ad hoc analytics into reusable, shareable data assets.
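As a rough sketch, an agent could surface these candidates by grouping tables that are linked through the relationships array in the JSON into connected components; naming and scoping the resulting data product still needs a human (or a smarter agent).

def candidate_data_products(context):
    """Group tables linked by foreign-key relationships into connected components."""
    parent = {t["name"]: t["name"] for t in context["tables"]}

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rel in context.get("relationships", []):
        a, b = rel["src_table"], rel["dst_table"]
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for table in parent:
        groups.setdefault(find(table), set()).add(table)
    # Each multi-table group (e.g. {orders, users, inventory_items}) is a candidate data product.
    return [sorted(g) for g in groups.values() if len(g) > 1]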
(e) Creating a Cross-Dataset Knowledge Graph
If you run multiple datasets — say, ecommerce, marketing, and support — their relationships can form a cross-dataset knowledge graph.
For example:
- ecommerce.orders.user_id ↔ marketing.email_campaigns.user_id
- support.tickets.email ↔ users.email
The agent merges these relationships into a graph persisted in Neo4j, enabling queries like:
“Show me users who purchased after opening a campaign email.”
This is the foundation for reasoning over your organization’s entire data landscape.
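A compact sketch of that persistence step with the official Neo4j Python driver; the connection URI and credentials are placeholders, and the node and relationship labels (Table, JOINS_TO) are just one reasonable modeling choice.

from neo4j import GraphDatabase  # pip install neo4j

# Placeholder connection details -- point these at your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_relationships(context):
    """Persist table-level relationships from the Knowledge Engine JSON as a graph."""
    with driver.session() as session:
        for rel in context.get("relationships", []):
            session.run(
                """
                MERGE (s:Table {name: $src})
                MERGE (d:Table {name: $dst})
                MERGE (s)-[:JOINS_TO {src_column: $src_col, dst_column: $dst_col}]->(d)
                """,
                src=rel["src_table"], dst=rel["dst_table"],
                src_col=rel["src_column"], dst_col=rel["dst_column"],
            )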
Two ways to Ground your Agents
Now that you have the JSON, how do you use it?
Option 1: Inline grounding (system instructions)
If your agent runs inside a managed environment (like LangChain or Vertex AI Agents), embed the JSON directly in its system prompt:
You are a data reasoning agent.
Use the following dataset context to interpret queries:
{… knowledge_engine.json …}
The agent will “see” your schema before parsing user questions.
Option 2: Retrieval grounding (context-on-demand)
If your JSON is large, store it in a vector or graph store (like Pinecone or Neo4j).
When a user asks a question, retrieve only the relevant snippet.
For example:
User: “Show me total sales by country.”
System: Retrieve only nodes related to orders, users, and country.
Agent: Generates grounded SQL.
This keeps grounding lightweight and scalable.
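Here’s a deliberately simple retrieval sketch that slices the JSON into one chunk per table and ranks chunks by keyword overlap with the question; in practice you’d embed those chunks in a vector store like Pinecone, but the slice-and-retrieve shape stays the same.

import json

def table_chunks(context):
    """Split the Knowledge Engine JSON into one self-contained chunk per table."""
    chunks = {}
    for table in context["tables"]:
        name = table["name"]
        chunks[name] = {
            "table": table,
            "columns": [c for c in context["columns"] if c["table"] == name],
            "relationships": [
                r for r in context.get("relationships", [])
                if name in (r["src_table"], r["dst_table"])
            ],
        }
    return chunks

def retrieve_context(question, context, top_k=2):
    """Score chunks by naive keyword overlap and return the top-k as a grounding snippet."""
    words = set(question.lower().split())
    scored = []
    for name, chunk in table_chunks(context).items():
        text = json.dumps(chunk).lower()
        score = sum(1 for w in words if w in text)
        scored.append((score, name, chunk))
    scored.sort(reverse=True, key=lambda s: s[0])
    return {name: chunk for _, name, chunk in scored[:top_k]}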
How to Generate the JSON Context
Run the Insights scan on the dataset of your choice, wait for it to complete, and you’ll see the generated JSON.
Optional: push it into Neo4j for relationship visualizations and reasoning.
A Quick Demonstration
Let’s try a simple before–after.
Ungrounded prompt:
“Top 5 product categories by revenue.”
Model guesses incorrectly:
SELECT category, SUM(revenue) FROM sales GROUP BY category;
→ sales table doesn’t exist.
Grounded prompt (with JSON context):
SELECT p.category, SUM(o.total_cost) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY p.category
ORDER BY revenue DESC
LIMIT 5;
Now it’s valid, joins correctly, and uses real column names.
That’s the power of grounding.
From Query Automation to Data Intelligence
Once your agents understand structure, they start understanding meaning.
They stop being “SQL translators” and start acting like junior data analysts who can:
- Explain which tables influence a metric,
- Suggest glossary mappings,
- Recommend reusable data products,
- Or even audit your schema for redundancy.
You don’t just automate data access — you amplify data intelligence.
Closing Thought
A well-grounded agent is like a data scientist who not only knows SQL syntax but also understands your business model.
You can hand it any question — “Which products are trending in Germany?” — and trust that it knows which tables to look at and how they connect.
So before you scale your next data agent, ask yourself:
“Does it know my data landscape?”
If not, hand it the map — your Knowledge Engine JSON — and watch it finally learn to navigate your data city.
Also published at — https://discuss.google.dev/t/teaching-agents-to-read-your-data-a-practical-guide-to-grounding-with-the-dataplex-knowledge-engine/274768
