In the enterprise landscape, data is often highly fragmented across multiple source systems. Data curation is the process of organizing, cleaning, and enriching raw data to transform it into high-quality, AI-ready data assets. The traditional process of merging and cleaning this data with ETL tools and manual SQL or Python scripts before building dashboards is a primary bottleneck for AI and analytics.
Google Data Cloud provides several curation accelerators designed to reduce the time-to-insight and automate these workflows.
1. Cloud Storage auto-discovery for semi-structured data
The first step in modern curation is eliminating the manual effort of cataloging dark data in Cloud Storage.
- Automatic data discovery: The automatic discovery feature in Dataplex Universal Catalog scans Cloud Storage buckets to automatically create external tables for structured data and catalog the metadata.
- Ad-hoc analysis: This allows for immediate, Gemini-powered analysis via vibe querying to assess value and quality without first loading the data through a traditional ETL process.
- Unified governance: This also lets you apply fine-grained access control and automated metadata generation directly on the raw storage layer, ensuring security and governance are baked in right from the start.
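To make the idea concrete, the DDL below sketches what discovery produces under the hood: a BigQuery external table over Parquet files in a Cloud Storage bucket. The project, dataset, and bucket names are hypothetical, and in practice Dataplex automatic discovery creates an equivalent table definition for you.

```sql
-- Hypothetical names: my_project, raw_zone, my-raw-bucket.
-- Dataplex automatic discovery generates an equivalent table definition
-- when it scans the bucket; this is the manual DDL for comparison.
CREATE EXTERNAL TABLE `my_project.raw_zone.events_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-raw-bucket/events/*.parquet']
);

-- The data is now queryable in place, with no ETL load step:
SELECT event_type, COUNT(*) AS events
FROM `my_project.raw_zone.events_ext`
GROUP BY event_type;
```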
2. Metadata curation and augmentation
Curation acceleration relies on moving from columns and rows to a semantic understanding of the data.
- Automated insights: Data insights automatically generates column descriptions, relationship graphs, and suggested natural-language questions. This speeds up metadata documentation and accelerates initial exploration and analysis when facing new or unfamiliar data.
- Grounding Conversational Analytics: These insights later serve to ground Conversational Analytics in your data, giving agents the additional context to understand how assets relate to your business. This ensures more accurate responses when you chat with your data using natural language.
3. Integrated governance: Quality, profiling, and lineage
Trusted curation requires a robust metadata framework that tracks data health and movement.
- Data profiling: Data profiling automatically identifies statistical characteristics (e.g., null counts, value distributions) to catch anomalies early.
- Quality controls: Users can define and run data quality checks to ensure that data meets the organization's quality standards. Auto data quality lets users automate scans, validate data against rules, and log alerts when data fails to meet quality requirements.
- Lineage tracking: Table- and column-level lineage allows engineers to trace how data moves through transformations. This transparency accelerates curation by making it easier to debug pipeline errors.
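To illustrate what profiling and quality rules compute, the queries below express profile-style statistics and a simple null-key check as plain BigQuery SQL over a hypothetical orders table; auto data quality lets you declare such rules and schedule them instead of hand-writing queries like these.

```sql
-- Hypothetical table: my_project.curated.orders.
-- Profile-style statistics of the kind a profile scan surfaces:
SELECT
  COUNT(*)                     AS row_count,
  COUNTIF(customer_id IS NULL) AS null_customer_ids,
  COUNT(DISTINCT customer_id)  AS distinct_customer_ids,
  MIN(order_total)             AS min_order_total,
  MAX(order_total)             AS max_order_total
FROM `my_project.curated.orders`;

-- A hand-written equivalent of a "no null keys" quality rule;
-- auto data quality runs checks like this on a schedule and alerts on failure.
SELECT COUNTIF(customer_id IS NULL) = 0 AS null_check_passed
FROM `my_project.curated.orders`;
```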
4. Agentic workflows for pipeline development
Google Data Cloud introduces AI agents to handle the heavy lifting of code generation for ingestion and transformation.
- Data Engineering Agent: This agent allows you to use Gemini in BigQuery to build and manage pipelines using natural language or by passing a technical design document.
- Data Science Agent: Integrated into Colab Enterprise/BigQuery Notebooks, the Data Science Agent automates exploratory data analysis (EDA) and generates Python/PySpark code for complex ML-ready pipelines.
5. Catalog-driven asset discovery and data products
To prevent redundant work in large organizations, curation must focus on reuse and internal marketplaces.
- Discovery first: Before building new pipelines, teams use Dataplex Universal Catalog to discover existing assets.
- Data products: Curated data is published as data products, logical groupings of data assets formally packaged to be discoverable, trusted, and accessible for solving specific business problems.
- BigQuery sharing (formerly Analytics Hub): This enables in-place sharing, allowing internal and third-party teams to access curated data without moving or copying it, which maintains a single source of truth.
6. Built-in AI functions for multi-modal data curation
As enterprises generate increasing amounts of multi-modal data, curation now extends to unstructured formats like images, audio, and documents. The following capabilities address these evolving needs:
- SQL reimagined with generative AI functions: BigQuery AI functions allow users to perform sentiment analysis, summarization, and entity extraction directly within a SQL statement. Using standard SQL operators, data teams can classify and rank data by quality or other criteria without specialized ML expertise.
- Embeddings generation: Curation pipelines can now generate vector embeddings to enable use cases such as similarity search, product recommendations, log analytics, entity resolution, and deduplication across massive datasets.
- Multimodal tables: Multimodal tables let you integrate unstructured data into standard tables and work with it using SQL.
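As a sketch of these capabilities, the statements below use BigQuery ML's generative functions. The dataset, table, and remote-model names are hypothetical, and the remote-model setup is assumed to already exist, so treat this as illustrative rather than copy-paste ready.

```sql
-- Hypothetical dataset, table, and remote models.
-- Classify support tickets by sentiment with a generative AI function:
SELECT
  ticket_id,
  ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `my_project.curation.text_model`,
  (
    SELECT ticket_id,
           CONCAT('Classify the sentiment of this ticket as positive, ',
                  'neutral, or negative: ', ticket_body) AS prompt
    FROM `my_project.curation.tickets`
  ),
  STRUCT(TRUE AS flatten_json_output)
);

-- Generate vector embeddings for similarity search and deduplication:
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_project.curation.embedding_model`,
  (SELECT ticket_id, ticket_body AS content
   FROM `my_project.curation.tickets`),
  STRUCT(TRUE AS flatten_json_output)
);
```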
7. Real-time curation with continuous queries
For real-time curation, BigQuery provides a simplified experience that enables no-code ingestion and SQL-based transformations for continuous data movement.
- Pub/Sub to BigQuery: Direct subscriptions allow for no-code ingestion of streaming data into BigQuery tables.
- Continuous queries: Continuous queries are SQL statements that run continuously, processing incoming data in real time. Curated output can be streamed immediately to Pub/Sub, Bigtable, or Spanner to power downstream applications and real-time dashboards.
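As an illustration, a continuous query can be written as an EXPORT DATA statement that streams curated rows to a Pub/Sub topic. The project, dataset, and topic names below are hypothetical, and the statement must be submitted as a continuous-query job rather than a regular batch query.

```sql
-- Hypothetical project/dataset/topic; run as a continuous query job.
-- Pub/Sub export expects a single STRING column named `message`.
EXPORT DATA
  OPTIONS (
    format = 'CLOUD_PUBSUB',
    uri = 'https://pubsub.googleapis.com/projects/my_project/topics/curated-events'
  )
AS (
  SELECT TO_JSON_STRING(STRUCT(event_id, user_id, event_type, event_ts)) AS message
  FROM `my_project.streaming.raw_events`
  WHERE event_type IS NOT NULL
);
```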
In summary, these curation accelerators remove the slow, manual work of cleaning and organizing data by automating the most time-consuming steps. Spend less time prepping and more time making decisions — explore these curation accelerators today to get started.
Source Credit: https://cloud.google.com/blog/products/data-analytics/data-curation-accelerators-for-google-data-cloud/
