Generate Structured Output with Gemini in BigFrames | by Garrett Wu | Google Cloud - Community

BigQuery DataFrames, also known as BigFrames, is an open source Python library from Google. BigFrames scales Python data processing by transpiling common Python data science APIs to BigQuery SQL. You can read more in the official introduction to BigFrames and in the public Git repository for BigFrames.

This blog introduces an easy way to generate structured results in BigFrames with Gemini.

AI generating data. Image generated by Gemini’s Imagen feature.

People usually use LLMs to extract or generate text from input prompts, while databases require structured data for processing and analysis. Extracting structured data from AI-generated text can be error prone. Using APIs optimized for structured output can make your AI workflows more robust.

For example, suppose you have a list of city names and want to keep only the US cities. An LLM can tell you whether a city is in the US, but its answer is verbose text. Your database needs simple boolean values to apply the filter.

If you haven’t already, install BigFrames from PyPI with pip:

pip install --upgrade bigframes

If you are in a notebook, use the %pip magic and restart your notebook runtime after installation completes.

%pip install --upgrade bigframes

Make sure you have bigframes >= 2.2.0.
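
If you want to verify the version from Python rather than by eye, here is a minimal sketch using only the standard library. The helper names (version_tuple, meets_minimum, installed_bigframes_version) are made up for this post, and the parsing is deliberately simple: it looks only at the leading digits of the first three version components.

```python
from importlib import metadata

MIN_VERSION = (2, 2, 0)  # the minimum version stated above


def version_tuple(version: str) -> tuple:
    """Turn '2.2.0' or '2.3.0rc1' into a tuple of three integers."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True when the version string is at least the minimum."""
    return version_tuple(version) >= minimum


def installed_bigframes_version():
    """Return the installed bigframes version string, or None if absent."""
    try:
        return metadata.version("bigframes")
    except metadata.PackageNotFoundError:
        return None


installed = installed_bigframes_version()
if installed is None:
    print("bigframes is not installed")
elif not meets_minimum(installed):
    print(f"bigframes {installed} is too old; please upgrade")
else:
    print(f"bigframes {installed} is recent enough")
```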

Before you start using BigFrames, set up the GCP project you are using.

PROJECT = ""  # your Google Cloud project ID
import bigframes
# Set up the project; optional if using BigQuery Studio Python notebooks
bigframes.options.bigquery.project = PROJECT
# Turn off progress bars to keep the output compact
bigframes.options.display.progress_bar = None
import bigframes.pandas as bpd
from bigframes.ml import llm
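
An empty PROJECT is easy to leave in by accident outside BigQuery Studio. As a rough sketch, you could resolve the project ID up front and fail early with a clear message. The helper resolve_project is hypothetical, and falling back to the GOOGLE_CLOUD_PROJECT environment variable is an assumption about how your environment is configured, not something BigFrames requires.

```python
import os


def resolve_project(explicit: str = "", env=None) -> str:
    """Pick a project ID from an explicit value or the environment.

    Assumption: GOOGLE_CLOUD_PROJECT holds the project ID when no
    explicit value is given. Raises ValueError if neither is set.
    """
    env = os.environ if env is None else env
    project = explicit.strip() or env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project:
        raise ValueError(
            "No project ID found: set PROJECT or GOOGLE_CLOUD_PROJECT"
        )
    return project


# Usage (not executed here, since it depends on your environment):
# PROJECT = resolve_project(PROJECT)
# bigframes.options.bigquery.project = PROJECT
```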

Let’s go back to the problem: given a list of cities, you want to know which ones are US cities. First, create a BigFrames DataFrame to represent the city list:

df = bpd.DataFrame({"city": ["Seattle", "New York", "Shanghai"]})
df

  city
0 Seattle
1 New York
2 Shanghai

The Gemini model can give you information about the cities:

gemini = llm.GeminiTextGenerator()
result = gemini.predict(df, prompt=[df["city"], "is a US city?"])
result[["city", "ml_generate_text_llm_result"]]

  city       ml_generate_text_llm_result
0 Seattle    Yes, Seattle is a city in the United States. I...
1 New York   Yes, New York City is a city in the United Sta...
2 Shanghai   No, Shanghai is not a US city. It is a major c...

The outputs are text that humans can read. But if you want the output to be useful for analysis, it is better to convert it to structured data such as boolean, int, or float values. That conversion has usually not been easy.
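
To see why, here is a small sketch of the kind of hand-written parsing you would otherwise need. The function parse_yes_no is hypothetical, the first two answers are the visible parts of the outputs above, and the third is a made-up phrasing added to show where the approach breaks.

```python
import re


def parse_yes_no(text: str):
    """Map a free-text answer to True or False.

    Returns None when the answer does not open with "yes" or "no",
    which is exactly the case that makes this approach brittle.
    """
    match = re.match(r"\s*(yes|no)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


answers = [
    "Yes, Seattle is a city in the United States.",
    "No, Shanghai is not a US city.",
    "Shanghai is located in China.",  # hypothetical phrasing, no leading yes/no
]

for answer in answers:
    print(parse_yes_no(answer), "<-", answer)
```

The last answer is perfectly clear to a human, yet the parser cannot classify it. Every new phrasing from the model needs another rule.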

Now you can get structured output out of the box by passing the output_schema parameter to the Gemini model’s predict method. The parameter is a dictionary that maps output column names to BigQuery data types. In the example below, the output is a single boolean column.

result = gemini.predict(df, prompt=[df["city"], "is a US city?"],
                        output_schema={"is_us_city": "bool"})
result[["city", "is_us_city"]]

  city       is_us_city
0 Seattle    True
1 New York   True
2 Shanghai   False

Databases are happy now: they can easily work with the structured data, for example by filtering to only the US cities.
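
As a sketch of that filtering step, a boolean column can be used directly as a row mask, the same way you would in pandas. The helper keep_us_cities is a made-up name, and it assumes result is the DataFrame returned by the predict call above.

```python
def keep_us_cities(result):
    """Keep only the rows whose is_us_city value is True.

    Assumes `result` is a pandas-style DataFrame with a boolean
    is_us_city column, such as the one returned by predict above.
    """
    return result[result["is_us_city"]]


# Usage, assuming `result` comes from the prediction above:
# us_only = keep_us_cities(result)
# us_only[["city"]]
```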

You can also get float or int values. For example, to get the population of each city in millions:

result = gemini.predict(df, prompt=["what is the population in millions of", df["city"]],
                        output_schema={"population_in_millions": "float64"})
result[["city", "population_in_millions"]]

  city       population_in_millions
0 Seattle    0.75
1 New York   19.68
2 Shanghai   26.32

And yearly rainy days:

result = gemini.predict(df, prompt=["how many rainy days per year in", df["city"]],
                        output_schema={"rainy_days": "int64"})
result[["city", "rainy_days"]]

  city       rainy_days
0 Seattle    152
1 New York   121
2 Shanghai   115

You can also get several output columns of different types in one prediction.

Note that this doesn’t require a dedicated prompt, as long as the output column names are informative to the model.

result = gemini.predict(df, prompt=[df["city"]],
                        output_schema={"is_US_city": "bool",
                                       "population_in_millions": "float64",
                                       "rainy_days_per_year": "int64"})
result[["city", "is_US_city", "population_in_millions", "rainy_days_per_year"]]

  city       is_US_city  population_in_millions rainy_days_per_year
0 Seattle    True        0.75                   152
1 New York   True        19.68                  121
2 Shanghai   False       26.32                  115
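
Since the column names do the work of the prompt here, it can help to keep the schema in one place and derive the column selection from it. The sketch below is one way to do that. CITY_FACTS_SCHEMA and predict_city_facts are hypothetical names, and the function only assumes that model has the predict method used above.

```python
# Output column names mapped to BigQuery data types, as in the example above.
CITY_FACTS_SCHEMA = {
    "is_US_city": "bool",
    "population_in_millions": "float64",
    "rainy_days_per_year": "int64",
}


def predict_city_facts(model, df, schema=CITY_FACTS_SCHEMA):
    """Run one prediction and return the city plus every schema column.

    The prompt is just the city name; the schema's column names tell
    the model what to produce for each row.
    """
    result = model.predict(df, prompt=[df["city"]], output_schema=schema)
    # Select the input column followed by the schema keys, in order.
    return result[["city", *schema]]


# Usage, assuming `gemini` and `df` are defined as above:
# predict_city_facts(gemini, df)
```

Adding a column then means editing only the schema dictionary, and the selection follows automatically.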

You can also generate composite data types like array and struct. This example generates a places_to_visit column as an array of strings and a gps_coordinates column as a struct of floats, along with the previous fields, all in one prediction.

result = gemini.predict(df, prompt=[df["city"]],
                        output_schema={"is_US_city": "bool",
                                       "population_in_millions": "float64",
                                       "rainy_days_per_year": "int64",
                                       "places_to_visit": "array<string>",
                                       "gps_coordinates": "struct<latitude float64, longitude float64>"})
result[["city", "is_US_city", "population_in_millions", "rainy_days_per_year", "places_to_visit", "gps_coordinates"]]

  city       is_US_city   population_in_millions rainy_days_per_year places_to_visit                                   gps_coordinates
0 Seattle    True         0.74                   150                 ['Space Needle' 'Pike Place Market' 'Museum of... {'latitude': 47.6062, 'longitude': -122.3321}
1 New York   True         8.4                    121                 ['Times Square' 'Central Park' 'Statue of Libe... {'latitude': 40.7128, 'longitude': -74.006}
2 Shanghai   False        26.32                  115                 ['The Bund' 'Yu Garden' 'Shanghai Museum' 'Ori... {'latitude': 31.2304, 'longitude': 121.4737}
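
Struct values are convenient to consume downstream because each field has a name. As an illustration only, the sketch below flattens gps_coordinates into separate latitude and longitude fields. It assumes each row has already been brought into local Python as a plain dict shaped like the output above; flatten_coordinates is a made-up helper, and the sample rows copy the coordinates shown in the table.

```python
def flatten_coordinates(rows):
    """Split each row's gps_coordinates struct into two flat fields."""
    flat = []
    for row in rows:
        coords = row["gps_coordinates"]
        flat.append(
            {
                "city": row["city"],
                "latitude": coords["latitude"],
                "longitude": coords["longitude"],
            }
        )
    return flat


# Sample rows copied from the output table above.
sample_rows = [
    {"city": "Seattle", "gps_coordinates": {"latitude": 47.6062, "longitude": -122.3321}},
    {"city": "New York", "gps_coordinates": {"latitude": 40.7128, "longitude": -74.006}},
    {"city": "Shanghai", "gps_coordinates": {"latitude": 31.2304, "longitude": 121.4737}},
]

for entry in flatten_coordinates(sample_rows):
    print(entry)
```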

Check out https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_output_schema.ipynb for more details.

The BigFrames team would love to hear from you. To reach out, send an email to bigframes-feedback@google.com or file an issue at the open source BigFrames repository. To receive updates about BigFrames, subscribe to the BigFrames email list.

Source Credit: https://medium.com/google-cloud/generate-structured-output-with-gemini-in-bigframes-ba3ee8957ce4?source=rss—-e52cf94d98af—4