


BigQuery Vector Search is a powerful way to find semantically similar search results, but did you know it can also be used to provide recommendations? This article shows you how and takes it a step further by describing how to filter by a location.
In case you are not familiar with Vector Search, to put it simply, it allows you to find rows in your dataset that are semantically similar to a search query. Vector Search is an improvement over simple keyword search. With keyword search, the exact term you search for has to appear in the dataset. With Vector Search you can find results that are similar in meaning, because the search takes the intent and context of the query into account. If you’re not familiar, a great starting point is this blog.
The scenario discussed in this article is using Vector Search as a recommendation engine.
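To make the difference concrete, here is a minimal local sketch in Python. It is not how BigQuery works internally: the three-dimensional vectors are values I made up by hand to stand in for real embeddings, and the movie descriptions are hypothetical. It only shows that a keyword match can come back empty while a similarity ranking still finds the relevant rows.

```python
import math

# Hand-made 3-dimensional "embeddings" (assumed values for illustration only;
# real embedding models return vectors with hundreds of dimensions).
MOVIES = {
    "A kitten causes chaos in a bakery": [0.9, 0.8, 0.1],
    "Detectives hunt a serial killer": [0.1, 0.0, 0.9],
    "Two cats plan a funny heist": [0.8, 0.9, 0.2],
}
QUERY_TEXT = "light-hearted comedy with cats"
QUERY_VECTOR = [0.85, 0.9, 0.1]  # assumed embedding of QUERY_TEXT


def keyword_search(query, documents):
    """Return documents that contain every word of the query verbatim."""
    words = query.lower().split()
    return [d for d in documents if all(w in d.lower() for w in words)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, documents, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(query_vector, documents[d]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    print(keyword_search(QUERY_TEXT, MOVIES))
    print(semantic_search(QUERY_VECTOR, MOVIES))
```

The keyword search finds nothing because no description contains all of the query words, while the similarity ranking returns the two cat movies.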
Note: in this article I will talk about text search only but it is possible to perform multimodal search also.
One option for making recommendations is a Collaborative Recommender. In this approach, you would find users who have similar interests to other users and then provide recommendations based on these similarities. So in the case of movies, if we have one user Julie who had seen several movies and another user Artem who liked 80% of the same movies, then we might recommend movies to Julie that Artem liked and movies to Artem that Julie liked.
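As a rough illustration of the collaborative idea, the following Python sketch measures how much two users' liked movies overlap and, if the overlap is high enough, recommends what the other user liked. The movie titles, the 0.8 threshold and the overlap measure are my own assumptions for the example; real collaborative recommenders are considerably more sophisticated.

```python
def overlap(liked_a, liked_b):
    """Fraction of user A's liked movies that user B also liked."""
    if not liked_a:
        return 0.0
    return len(liked_a & liked_b) / len(liked_a)


def recommend(target, other, threshold=0.8):
    """Suggest movies the other user liked that the target has not seen,
    but only when the two users' tastes overlap enough."""
    if overlap(target, other) < threshold:
        return set()
    return other - target


# Hypothetical viewing histories: Julie and Artem share 4 of 5 movies (80%).
julie = {"Movie A", "Movie B", "Movie C", "Movie D", "Movie E"}
artem = {"Movie A", "Movie B", "Movie C", "Movie D", "Movie F"}

if __name__ == "__main__":
    print(recommend(julie, artem))  # what Artem liked that Julie has not seen
    print(recommend(artem, julie))  # what Julie liked that Artem has not seen
```

Notice that everything here depends on having a viewing history for each user, which is exactly the data you may not have.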
Where this doesn’t work, however, is when you have no feedback data from users but you do have some other searchable information. This is where Vector Search can prove useful as a recommender. We can allow the user to enter some information about the type of movie they want to watch, e.g. “light-hearted comedy with cats”, and Vector Search can use this search string to search a database of movie synopses and identify movies that are similar to the search criteria.
In early 2024, Google released BigQuery Vector Search, making this solution more accessible since users only need a basic understanding of Vector Search along with some simple SQL skills to get started. With just a few lines of code you can build your own Vector Index and then perform semantic searches.
The basic steps to creating and using Vector Search are:
- Create an embedding on a “content” column. This column contains the text you want to search across. An embedding is a semantic representation of the data in the form of vectors for use by ML models. If you’re not familiar, this is a great Medium article on the topic, as well as this Google documentation page.
- Create a vector index, which makes searching the embedded column more efficient.
- Perform a semantic search.
This process is shown in the following diagram:
Here’s an example of how this can be performed. There are a few options in here, but we’ll skim over those to begin with to give an overall sense of what the process looks like, and come back to the details afterwards.
Step 1: Create an Embedding Generation Model
The first step is to create the embedding generation model. This is a remote model in BigQuery that uses a connection to call one of Google’s models and create the embeddings on your data. In this example we use the “text-embedding-005” model. This article outlines the supported models; please review it and use the right one for your use case.
# Create a Vertex AI text embedding generation model
CREATE OR REPLACE MODEL `my_dataset.embedding_model`
REMOTE WITH CONNECTION `us.bqml1`
OPTIONS (ENDPOINT = 'text-embedding-005');
Step 2: Generate a text embedding on your “content” column
Second, we need a STRING column to embed (note that the column must be named “content”, so we have given it an alias below). We can then run the following query to create a new table which contains the embeddings.
# Generate a text embedding against a content column
CREATE OR REPLACE TABLE `my_dataset.embeddings` AS
SELECT * FROM ML.GENERATE_EMBEDDING(
MODEL `my_dataset.embedding_model`,
(
SELECT *, someTextColumn AS content
FROM `project_id.my_dataset.my_table`
WHERE LENGTH(someTextColumn) > 0
)
)
WHERE LENGTH(ml_generate_embedding_status) = 0;
Step 3: Create a vector index
The next step is to create a vector index so the embeddings can be searched efficiently. The following code will achieve this:
# Create a vector index
CREATE OR REPLACE VECTOR INDEX my_index
ON `my_dataset.embeddings`(ml_generate_embedding_result)
OPTIONS(index_type = 'IVF',
distance_type = 'COSINE',
ivf_options = '{"num_lists":500}')
Depending on the size of data this can take some time to create.
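If you are wondering what the IVF options mean, the following Python sketch shows the general idea of an inverted file index: vectors are grouped into lists around centroids, and a query only scans the lists whose centroids are closest to it. This is a toy under my own assumptions, not BigQuery's implementation. The centroids are passed in by hand here (a real index learns them by clustering the data), and the way I turn fraction_lists_to_search into a number of lists is a guess for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def build_ivf(vectors, centroids):
    """Assign each vector id to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, vec in vectors.items():
        nearest = min(
            range(len(centroids)),
            key=lambda i: cosine_distance(vec, centroids[i]),
        )
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, top_k, fraction_lists_to_search):
    """Scan only the lists whose centroids are closest to the query."""
    # Assumption: round the fraction up to a whole number of lists, at least one.
    n_probe = max(1, math.ceil(fraction_lists_to_search * len(centroids)))
    order = sorted(
        range(len(centroids)),
        key=lambda i: cosine_distance(query, centroids[i]),
    )
    candidates = [vec_id for i in order[:n_probe] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: cosine_distance(query, vectors[vec_id]))
    return candidates[:top_k]


# Toy data: two obvious clusters, so num_lists would be 2 here.
VECTORS = {
    "a": [1.0, 0.1], "b": [0.9, 0.2],
    "c": [0.1, 1.0], "d": [0.2, 0.9],
}
CENTROIDS = [[1.0, 0.0], [0.0, 1.0]]
LISTS = build_ivf(VECTORS, CENTROIDS)

if __name__ == "__main__":
    query = [1.0, 0.0]
    print(ivf_search(query, VECTORS, CENTROIDS, LISTS, 3, 0.5))  # scans one list
    print(ivf_search(query, VECTORS, CENTROIDS, LISTS, 3, 1.0))  # scans every list
```

Searching half of the lists returns only two results even though three were requested, because the third-best match lives in a list that was never scanned. That is the trade-off: scanning fewer lists is faster but can miss matches.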
Step 4: Search for results that are similar to the query string
Finally we will perform a search using our very own search string. To make this possible we need to embed the query string as well and then compare the embedded query string to the embeddings we have in our dataset.
This example shows a single query but you can also provide batch queries.
# Search for recommendations matching the query string
SELECT query.query, base.displayName, base.summary
FROM VECTOR_SEARCH(
TABLE `my_dataset.embeddings`, 'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result, content AS query
FROM ML.GENERATE_EMBEDDING(
MODEL `my_dataset.embedding_model`,
(SELECT 'a light-hearted comedy with cats' AS content))
),
top_k => 5, options => '{"fraction_lists_to_search": 0.01}')
Filtering recommendations by location is really simple within BigQuery. We can take advantage of the geospatial functions native to BigQuery to limit our search to a radius around a user’s latitude and longitude.
We will also need to add the geography column to the vector index as a stored column so that we can use it as a pre-filter. This limits the search space that the vector index will use, which should lead to cost savings and performance improvements.
Here’s an example of what this might look like:
Step 1: Use or create a GEOGRAPHY column using ST_GEOGPOINT (longitude, latitude)
This query will just create a copy with an additional GEOGRAPHY column as an illustrative example only:
CREATE TABLE `project_id.my_dataset.items_geo`
AS
SELECT t.*, ST_GEOGPOINT(longitude,latitude) AS location
FROM `project_id.my_dataset.items` t;
Step 2: Find items/places within a given radius
This query searches within a 1km radius of a particular geopoint:
# Search within a 1km radius of a particular geopoint.
SELECT *
FROM `project_id.my_dataset.items_geo`
WHERE ST_DWITHIN (ST_GEOGPOINT(151.207095, -33.873784), location, 1000)
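To get a feel for what a radius check like this does, here is a small Python sketch using the haversine formula for distance on a sphere. It is an approximation for illustration only: the Earth radius constant is my assumption, so results may differ slightly from what BigQuery returns. I have kept the argument order as longitude first, the same as ST_GEOGPOINT, because swapping the two is an easy mistake to make.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed constant)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres. Longitude comes first, like ST_GEOGPOINT."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within(lon1, lat1, lon2, lat2, radius_m):
    """True when the two points are no more than radius_m metres apart."""
    return haversine_m(lon1, lat1, lon2, lat2) <= radius_m


if __name__ == "__main__":
    origin = (151.207095, -33.873784)  # the geopoint used in the query above
    close_by = (151.207095, -33.868784)  # 0.005 degrees of latitude further north
    further = (151.207095, -33.853784)  # 0.02 degrees of latitude further north
    print(within(*origin, *close_by, 1000))
    print(within(*origin, *further, 1000))
```

The first point is a little over half a kilometre away and passes the 1km check; the second is more than two kilometres away and does not.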
Step 3: Create an embedding (same as before) on the table containing the geography column
This query creates a text embedding from a description on the table where we have created the geography column:
# Generate a text embedding from description
CREATE OR REPLACE TABLE `project_id.my_dataset.items_geo_embeddings` AS
SELECT * FROM ML.GENERATE_EMBEDDING(
MODEL `my_dataset.embedding_model_text_005`,
(
SELECT *, summary AS content
FROM `project_id.my_dataset.items_geo`
WHERE LENGTH(summary) > 0 AND location IS NOT NULL
)
)
WHERE LENGTH(ml_generate_embedding_status) = 0
Step 4: Create a vector index with a stored column
This query creates an IVF vector index with a stored column.
# Create a vector index with a stored column
CREATE VECTOR INDEX items_geo_index
ON `my_dataset.items_geo_embeddings`(ml_generate_embedding_result)
STORING (location)
OPTIONS(index_type = 'IVF',
distance_type = 'COSINE',
ivf_options = '{"num_lists":500}')
Step 5: Search using the stored column as a pre-filter
Finally we search the embeddings (our content field) using a search query, with a geography point as a pre-filter. What we end up doing is reducing the search space to only those records within the radius, and then searching within those records. This should improve the performance of the search, and since we don’t want to recommend salons really far away from the search location anyway, those never show up as recommendations.
# Pre-filter on the stored location column, then run the vector search on the remaining rows.
SELECT query.query, base.itemName, base.content, base.location, distance
FROM
VECTOR_SEARCH(
(SELECT * FROM `my_dataset.items_geo_embeddings` WHERE ST_DWITHIN (ST_GEOGPOINT(-73.981111, 40.7409999), location, 2000)),
'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result, content AS query
FROM ML.GENERATE_EMBEDDING(
MODEL `my_dataset.embedding_model_text_005`,
(SELECT 'a hairdresser specialising in highlights' AS content))
),
top_k => 5, options => '{"fraction_lists_to_search": 0.01}')
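The order of operations matters here, so here is a small Python sketch of the same idea: filter by distance first, then rank what is left by similarity. Everything in it is made up for illustration: the salon names, coordinates and two-dimensional embeddings are hypothetical, and the distance calculation is a spherical approximation rather than BigQuery's own.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed constant)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres. Longitude comes first, like ST_GEOGPOINT."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def prefiltered_search(query_vec, origin, radius_m, items, top_k):
    """Keep only items inside the radius, then rank them by cosine distance."""
    lon0, lat0 = origin
    nearby = [
        item for item in items
        if haversine_m(lon0, lat0, item["lon"], item["lat"]) <= radius_m
    ]
    nearby.sort(key=lambda item: cosine_distance(query_vec, item["embedding"]))
    return [item["name"] for item in nearby[:top_k]]


# Hypothetical salons with made-up 2-dimensional embeddings.
ITEMS = [
    {"name": "Highlights Studio", "lon": -73.981111, "lat": 40.745, "embedding": [0.9, 0.1]},
    {"name": "Barber Basics", "lon": -73.975, "lat": 40.741, "embedding": [0.2, 0.9]},
    {"name": "Faraway Foils", "lon": -73.981111, "lat": 40.85, "embedding": [1.0, 0.0]},
]
ORIGIN = (-73.981111, 40.7409999)  # the search location used in the query above
QUERY_VEC = [1.0, 0.0]  # assumed embedding of the hairdresser query

if __name__ == "__main__":
    print(prefiltered_search(QUERY_VEC, ORIGIN, 2000, ITEMS, top_k=5))
```

"Faraway Foils" is the closest match in meaning, but it sits roughly 12km away, so the 2km pre-filter removes it before any similarity is calculated.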
This article showed you how to use BigQuery Vector Search to provide geography-based recommendations using the built-in GEOGRAPHY data type. This is really valuable when you don’t have user data available to provide recommendations, or when you want to get some recommendations quickly without a high level of investment.
Using the built-in BigQuery syntax is really useful for offline serving, where you either don’t have a lot of users or low latency is not a concern. However, if you require faster serving times, you will need to consider online serving. I will share how to do this in my next article.
Source Credit: https://medium.com/google-cloud/using-bigquery-vector-search-for-location-based-recommendations-a93aec507e9e?source=rss—-e52cf94d98af—4