Exploring the Data Engineering Agent in BigQuery

Data is the lifeblood of the modern enterprise, but the process of making it useful is often fraught with friction. Data engineers, analysts, and scientists—some of the most skilled and valuable talent in any organization—are spending a disproportionate amount of their time on repetitive, low-impact tasks. What if you could shift your focus from manually building and maintaining pipelines to defining the best practices and rules that automate them?

Today, we’re announcing the preview of the Data Engineering Agent in BigQuery, a first-party agent powered by Gemini and designed to automate the most complex and time-consuming data engineering tasks.

The Data Engineering Agent isn’t just an incremental improvement; it’s fundamentally transforming the way we work, with truly autonomous data engineering operations. According to IDC, ‘GenAI and other automation solutions will drive over $1 trillion in productivity gains for companies by 2026’¹.

Here is a closer look at the powerful capabilities you can access today:

Pipeline development and maintenance

The Data Engineering Agent makes it easy to build and maintain robust data pipelines. The agent is available in BigQuery pipelines and it can help you with:

Natural language pipeline creation: Describe your pipeline requirements in plain language, and the agent generates the necessary SQL code, adhering to data engineering best practices that you can customize through instruction files. For example: “Create a pipeline to load data from the ‘customer_orders’ bucket, standardize the date formats, remove duplicate entries, and load it into a BigQuery table named ‘clean_orders’.”

Intelligent pipeline modification: Need to update an existing pipeline? Just tell the agent what you want to change. It analyzes the existing code and proposes the necessary modifications, leaving you to simply review and approve the changes. The agent follows best-practice design principles and helps you optimize and redesign your existing pipelines to eliminate redundant operations and to take advantage of BigQuery’s query optimization features such as partitioning.

Dataplex Universal Catalog integration: The agent leverages Google Cloud’s Dataplex data governance offering. It automatically retrieves additional resource metadata, such as business glossaries and data profiles, from Dataplex to improve the relevance and performance of the generated pipelines, as well as the table metadata it generates for new tables.

Custom agent instructions and logic: Incorporate your unique business logic and engineering best practices by providing custom instructions and leveraging User-Defined Functions (UDFs) within the pipeline.

Automated code documentation: The agent automatically generates clear and concise documentation for your pipelines, along with column descriptions, making them easier for the entire team to understand and maintain.
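To make the example prompt above more concrete, here is a minimal sketch of the two cleaning steps it names: standardizing date formats and removing duplicate entries. This is not the agent's output (the agent generates SQL). It is a plain Python illustration of the logic, and the column names `order_id` and `order_date`, the list of input date formats, and the "keep the first row per order" rule are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical date formats found in the raw files (an assumption for illustration).
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")


def standardize_date(raw: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def clean_orders(rows: list[dict]) -> list[dict]:
    """Standardize order_date and keep the first row seen for each order_id."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        normalized = {**row, "order_date": standardize_date(row["order_date"])}
        if normalized["order_id"] in seen_ids:
            continue  # duplicate entry: skip it
        seen_ids.add(normalized["order_id"])
        cleaned.append(normalized)
    return cleaned


# Hypothetical sample rows standing in for files in the 'customer_orders' bucket.
raw_rows = [
    {"order_id": 1, "order_date": "2024-03-05"},
    {"order_id": 2, "order_date": "05/03/2024"},
    {"order_id": 1, "order_date": "Mar 05, 2024"},
]
clean_rows = clean_orders(raw_rows)
```

In a real pipeline the deduplication key and the accepted date formats are exactly the kind of rules you would state in the prompt or in an instruction file, rather than hard-code as done here.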

PRISA Media, a Spanish-language news and entertainment group and early access customer, has had a positive experience with the Data Engineering Agent.

“The agent provides solutions that enable us to explore new development approaches, showing strong potential to address complex data engineering tasks. It demonstrates an impressive ability to correctly interpret our requirements, even for sophisticated data modeling tasks like creating SCD Type 2 dimensions. In its current state, it already delivers value in automating maintenance and small optimizations, and we believe it has the foundation to become a truly distinctive tool in the future.” – Fernando Calo, Lead Data Engineer at the Spanish-language news and entertainment group PRISA

Data preparation, transformation and modeling

The first step in any data project is often the most time-consuming: understanding, preparing, and cleaning raw data. For example, the Data Engineering Agent can access raw files in Google Cloud Storage and automatically clean, deduplicate, format, and standardize the data based on the instructions you provide. Integration with Dataplex lets you generate data quality assertions from rules defined in the Dataplex repository and automatically encrypt columns flagged as containing Personally Identifiable Information (PII). No more writing complex queries to identify data quality issues or to standardize formats.

The agent can then generate the necessary code to perform essential data transformation tasks, significantly reducing the time it takes to get your data ready for analysis. This process covers operations like joining and aggregating datasets.
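As a rough illustration of what "joining and aggregating datasets" means here, the sketch below joins a hypothetical `orders` dataset to a hypothetical `customers` dataset and sums order amounts per region. The agent itself generates SQL for this kind of step. The Python version only shows the logic, and every table name, column name, and value is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical sample data; dataset and column names are assumptions.
customers = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "APAC"},
]
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c1", "amount": 30.0},
    {"order_id": 3, "customer_id": "c2", "amount": 75.5},
    {"order_id": 4, "customer_id": "c9", "amount": 10.0},  # no matching customer
]


def revenue_by_region(orders: list[dict], customers: list[dict]) -> dict[str, float]:
    """Inner-join orders to customers on customer_id, then sum amount per region."""
    region_of = {c["customer_id"]: c["region"] for c in customers}
    totals: dict[str, float] = defaultdict(float)
    for order in orders:
        region = region_of.get(order["customer_id"])
        if region is None:
            continue  # inner join: drop orders without a matching customer
        totals[region] += order["amount"]
    return dict(totals)


totals = revenue_by_region(orders, customers)
```

Whether unmatched rows should be dropped (an inner join, as here) or kept is a design choice worth stating explicitly in the prompt.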

The agent assists with complex data modeling, too. You can use natural language prompts to generate sophisticated schemas, such as Data Vault or star schemas, directly from your source tables.
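For readers less familiar with star schemas, the following sketch shows the basic idea on a tiny scale: a flat source table is split into a dimension table (one row per customer, with a surrogate key) and a fact table that references it. It is a simplified Python illustration under assumed column names (`customer_email`, `customer_name`, `order_id`, `amount`), not a description of how the agent models data.

```python
def build_star_schema(flat_rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split flat order rows into a customer dimension and an order fact table."""
    dim_customer: dict[str, dict] = {}  # natural key (email) -> dimension row
    fact_orders: list[dict] = []
    for row in flat_rows:
        email = row["customer_email"]
        if email not in dim_customer:
            dim_customer[email] = {
                "customer_key": len(dim_customer) + 1,  # surrogate key
                "customer_email": email,
                "customer_name": row["customer_name"],
            }
        fact_orders.append({
            "order_id": row["order_id"],
            "customer_key": dim_customer[email]["customer_key"],
            "amount": row["amount"],
        })
    return list(dim_customer.values()), fact_orders


# Hypothetical flat source table.
source = [
    {"order_id": 10, "customer_email": "ana@example.com", "customer_name": "Ana", "amount": 40.0},
    {"order_id": 11, "customer_email": "luis@example.com", "customer_name": "Luis", "amount": 15.0},
    {"order_id": 12, "customer_email": "ana@example.com", "customer_name": "Ana", "amount": 22.5},
]
dim_customer, fact_orders = build_star_schema(source)
```

The fact table keeps only measures and keys, while descriptive attributes live once in the dimension, which is the property a generated star schema would be expected to preserve.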

Source Credit: https://cloud.google.com/blog/products/data-analytics/exploring-the-data-engineering-agent-in-bigquery/