BigQuery DataFrames (also known as BigFrames) is an open source Python library from Google. It scales Python data processing by transpiling common Python data science APIs to BigQuery SQL. You can read more in the official introduction to BigFrames and in the public Git repository for BigFrames.
This blog introduces an easy way to generate structured results in BigFrames with Gemini.
People usually use LLMs to extract or generate text from input prompts, while databases require structured data for processing and analysis. Extracting structured data from AI-generated text can be error prone, so using APIs optimized for structured output can make your AI workflows more robust.
For example, suppose you have a list of city names and want to keep only the US cities. An LLM can tell you whether a city is in the US, but its output is verbose text, whereas your database needs simple boolean values to apply the filter.
If you haven’t already, install BigFrames from PyPI with pip:
pip install --upgrade bigframes
If you are in a notebook, use the %pip magic and restart your notebook runtime after installation completes.
%pip install --upgrade bigframes
Make sure you have bigframes >= 2.2.0.
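One way to confirm the installed version from Python is sketched below. It uses only the standard library; the helper name meets_minimum is hypothetical, and the comparison looks only at the leading numeric release components, so treat it as a rough check rather than a full version parser.

```python
import re
from importlib import metadata


def meets_minimum(installed: str, minimum: str = "2.2.0") -> bool:
    """Compare dotted versions by their leading numeric components only."""

    def release(version: str) -> tuple:
        parts = []
        for piece in version.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break  # stop at the first non-numeric component, e.g. "dev1"
            parts.append(int(match.group()))
        # Pad to three components so "2.2" compares equal to "2.2.0".
        return tuple(parts + [0] * max(0, 3 - len(parts)))

    return release(installed) >= release(minimum)


try:
    installed_version = metadata.version("bigframes")
    status = "ok" if meets_minimum(installed_version) else "too old"
    print(installed_version, status)
except metadata.PackageNotFoundError:
    print("bigframes is not installed in this environment")
```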
Before you start using bigframes, set up the GCP project you are using.
PROJECT = ""  # your GCP project ID
import bigframes
# Setup project, optional if using BigQuery Studio Python notebooks
bigframes.options.bigquery.project = PROJECT
bigframes.options.display.progress_bar = None
import bigframes.pandas as bpd
from bigframes.ml import llm
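Rather than hard-coding the project ID, one option is to fall back to an environment variable. The sketch below is an assumption-laden illustration: resolve_project is a hypothetical helper, "my-project-id" is a placeholder, and GOOGLE_CLOUD_PROJECT is used only as a conventional variable name, not something BigFrames is confirmed to require.

```python
import os


def resolve_project(explicit: str = "") -> str:
    """Return the explicit project ID, else the environment's, else raise."""
    project = explicit or os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    if not project:
        raise ValueError("No project ID given and GOOGLE_CLOUD_PROJECT is unset")
    return project


# "my-project-id" is a placeholder; pass an empty string to rely on the
# environment variable instead.
PROJECT = resolve_project("my-project-id")
print(PROJECT)
```

The resulting value can then be assigned to bigframes.options.bigquery.project as in the setup code above.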
Let’s go back to the problem: given a list of cities, you want to know which ones are in the US. First, create a BigFrames DataFrame to represent the city list:
df = bpd.DataFrame({"city": ["Seattle", "New York", "Shanghai"]})
df
   city
0  Seattle
1  New York
2  Shanghai
A Gemini model can give you information about the cities:
gemini = llm.GeminiTextGenerator()
result = gemini.predict(df, prompt=[df["city"], "is a US city?"])
result[["city", "ml_generate_text_llm_result"]]
   city      ml_generate_text_llm_result
0  Seattle   Yes, Seattle is a city in the United States. I...
1  New York  Yes, New York City is a city in the United Sta...
2  Shanghai  No, Shanghai is not a US city. It is a major c...
The outputs are text that humans can read. For analysis, however, it is better to convert them to structured data such as boolean, int or float values, and that conversion has usually not been easy.
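To see why, here is a minimal sketch of the manual approach: a hypothetical helper, parse_yes_no, that tries to turn a free-text answer into a boolean by inspecting its first word. The first two sample answers are shortened versions of the outputs above and the third is invented for illustration. Any reply that does not open with "yes" or "no" falls through to None, which is the kind of fragility that structured output avoids.

```python
from typing import Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Guess a boolean from a free-text answer; None when the guess fails."""
    words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # the answer does not start with a clear yes or no


answers = [
    "Yes, Seattle is a city in the United States.",
    "No, Shanghai is not a US city.",
    "Shanghai is located in China.",  # invented phrasing, not model output
]
print([parse_yes_no(a) for a in answers])
```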
Now you can get structured output out of the box by passing the output_schema parameter to the Gemini model's predict method. The parameter is a dictionary that maps output column names to BigQuery data types. In the example below, the output is a single boolean column.
result = gemini.predict(df, prompt=[df["city"], "is a US city?"],
                        output_schema={"is_us_city": "bool"})
result[["city", "is_us_city"]]
   city      is_us_city
0  Seattle   True
1  New York  True
2  Shanghai  False
Databases are happy now: they can easily work with structured data, for example to filter for only the US cities.
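The filtering step itself is then a simple boolean selection. The sketch below uses plain Python dictionaries as a local stand-in for the three result rows shown above, so it runs without a BigQuery project. With a real BigFrames result, the pandas-style equivalent would presumably be boolean indexing such as result[result["is_us_city"]].

```python
# Local stand-in for the rows of `result` shown above (not a BigFrames object).
rows = [
    {"city": "Seattle", "is_us_city": True},
    {"city": "New York", "is_us_city": True},
    {"city": "Shanghai", "is_us_city": False},
]

# Because the column is already boolean, no text parsing is needed.
us_cities = [row["city"] for row in rows if row["is_us_city"]]
print(us_cities)
```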
You can also get float or int values, for example, to get the population of each city in millions:
result = gemini.predict(df, prompt=["what is the population in millions of", df["city"]],
                        output_schema={"population_in_millions": "float64"})
result[["city", "population_in_millions"]]
   city      population_in_millions
0  Seattle   0.75
1  New York  19.68
2  Shanghai  26.32
And yearly rainy days:
result = gemini.predict(df, prompt=["how many rainy days per year in", df["city"]],
                        output_schema={"rainy_days": "int64"})
result[["city", "rainy_days"]]
   city      rainy_days
0  Seattle   152
1  New York  121
2  Shanghai  115
You can also get several output columns of different types in one prediction.
Note that this doesn’t require a dedicated prompt, as long as the output column names are informative to the model.
result = gemini.predict(df, prompt=[df["city"]],
                        output_schema={"is_US_city": "bool",
                                       "population_in_millions": "float64",
                                       "rainy_days_per_year": "int64"})
result[["city", "is_US_city", "population_in_millions", "rainy_days_per_year"]]
   city      is_US_city  population_in_millions  rainy_days_per_year
0  Seattle   True        0.75                    152
1  New York  True        19.68                   121
2  Shanghai  False       26.32                   115
You can also generate composite data types such as array and struct. This example generates a places_to_visit column as an array of strings and a gps_coordinates column as a struct of floats, along with the previous fields, all in one prediction.
result = gemini.predict(df, prompt=[df["city"]],
                        output_schema={"is_US_city": "bool",
                                       "population_in_millions": "float64",
                                       "rainy_days_per_year": "int64",
                                       "places_to_visit": "array<string>",
                                       "gps_coordinates": "struct<latitude float64, longitude float64>"})
result[["city", "is_US_city", "population_in_millions", "rainy_days_per_year", "places_to_visit", "gps_coordinates"]]
   city      is_US_city  population_in_millions  rainy_days_per_year  places_to_visit                                    gps_coordinates
0  Seattle   True        0.74                    150                  ['Space Needle' 'Pike Place Market' 'Museum of...  {'latitude': 47.6062, 'longitude': -122.3321}
1  New York  True        8.4                     121                  ['Times Square' 'Central Park' 'Statue of Libe...  {'latitude': 40.7128, 'longitude': -74.006}
2  Shanghai  False       26.32                   115                  ['The Bund' 'Yu Garden' 'Shanghai Museum' 'Ori...  {'latitude': 31.2304, 'longitude': 121.4737}
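Once such rows are available locally, the composite values can be read like ordinary nested data. The sketch below assumes each struct cell behaves like a dictionary and each array cell like a list, and it hard-codes the coordinates and the non-truncated place names from the output above as stand-in rows. The actual in-memory types depend on how the result is materialized, so treat this as an illustration only.

```python
# Stand-in rows copied from the output above; only place names that are not
# truncated in the display are included.
rows = [
    {"city": "Seattle",
     "places_to_visit": ["Space Needle", "Pike Place Market"],
     "gps_coordinates": {"latitude": 47.6062, "longitude": -122.3321}},
    {"city": "New York",
     "places_to_visit": ["Times Square", "Central Park"],
     "gps_coordinates": {"latitude": 40.7128, "longitude": -74.006}},
    {"city": "Shanghai",
     "places_to_visit": ["The Bund", "Yu Garden", "Shanghai Museum"],
     "gps_coordinates": {"latitude": 31.2304, "longitude": 121.4737}},
]

# Struct fields are addressed by name, array elements by position.
northernmost = max(rows, key=lambda row: row["gps_coordinates"]["latitude"])
first_stops = {row["city"]: row["places_to_visit"][0] for row in rows}

print(northernmost["city"])
print(first_stops)
```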
Check out https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_output_schema.ipynb for more details.
The BigFrames team would love to hear from you. To reach out, send an email to bigframes-feedback@google.com or file an issue at the open source BigFrames repository. To receive updates about BigFrames, subscribe to the BigFrames email list.
Source Credit: https://medium.com/google-cloud/generate-structured-output-with-gemini-in-bigframes-ba3ee8957ce4