
An Introduction to Google Cloud Vertex AI Search for Commerce: Data Ingestion — Part 3
What Data Do You Need for Vertex AI Search for Commerce?

- ML models are only as good as your input data.
- The most time-consuming step is collecting and ingesting the necessary data. Developers can spend around 80% of their time collecting, cleaning, and ingesting data.
- Two data sources required for Commerce Search are: Product Catalog and Customer Events.
- After data processing, the data will be ingested into the Retail API.
- You want as much data as possible.
- The minimum amount of data depends on the recommendation model type, optimization objective, and user event types.
- The Retail API defines a schema for the data. The format is JSON.
3 Data Questions to Think About Before Starting

What data do you need?
- Product Catalog data
- Customer Events (user online behavior)
How much data do you need?
- As much as possible
- The minimum data requirements depend on the recommendation model type and the optimization objective.
- An optimization objective could be conversion rate.
What format does the data need to be in?
- JSON formatted data that matches a pre-defined schema
What Data Do You Need for Vertex AI Search for Commerce?

Product Catalog Data
- Information about the products a company sells.
- A detailed understanding of each product is essential so the model can predict if a product is useful to a customer.
- Product catalog data is typically stored in a database connected to a store front-end.
- The product catalog data must be exported and transformed to conform to the Retail API schema.
- The product catalog data should be regularly updated via the Retail API so changes in product offerings are reflected in the model and its recommendations.
- The more accurate and specific the catalog data, the higher the quality of the model.
- Small catalogs, with fewer than 100 items, may not see much benefit from recommendations since there are few products to recommend.
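The export-and-transform step described above can be sketched as follows. This is a minimal illustration, not the Retail API's own tooling: the internal row layout (`sku`, `display_name`, `category_path`) is hypothetical, and the output keys `id`, `title`, and `categories` are assumed to match the Retail API product schema, so check them against the current reference.

```python
def row_to_product(row: dict) -> dict:
    """Map one row from a hypothetical internal product table to a product record."""
    return {
        "id": str(row["sku"]),                  # IDs are sent as strings
        "title": row["display_name"].strip(),   # drop stray whitespace from the export
        "categories": [row["category_path"]],   # assumed: a list of category paths
    }


# Hypothetical row as it might come out of a store database.
row = {"sku": 1042, "display_name": " Crew Neck T-Shirt ", "category_path": "Apparel > Tops"}
product = row_to_product(row)
```

Keeping the mapping in one small function makes it easy to rerun on every catalog refresh.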
User Event Data
- Information about user behavior on your website, such as their browsing behavior.
- The best way to collect this information is to capture visitor actions on your website. For example, capture which product pages were viewed by a particular user.
- You need historical events to analyze the past behavior of a user.
- You need live, real-time events to understand the present context.
- The models work best with at least 3 months of product page views, home page views, and add-to-cart events. Ideally, also provide 1 to 2 years of purchase history for the frequently-bought-together model.
- Providing data for a full year will show seasonality and trends.
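Before importing historical events, it can help to check how much history you actually have. The sketch below is one way to do that, assuming event timestamps are ISO 8601 strings with an explicit UTC offset and treating "3 months" as 90 days; both the function names and the 90-day threshold are illustrative assumptions.

```python
from datetime import datetime


def history_span_days(event_times: list) -> int:
    """Return the whole days between the earliest and latest event timestamps."""
    parsed = [datetime.fromisoformat(t) for t in event_times]
    return (max(parsed) - min(parsed)).days


def has_minimum_history(event_times: list, min_days: int = 90) -> bool:
    """Check the span against an assumed 90-day (about 3 months) minimum."""
    return history_span_days(event_times) >= min_days


# Hypothetical timestamps from an exported event log.
times = ["2024-01-01T00:00:00+00:00", "2024-04-15T12:00:00+00:00"]
enough = has_minimum_history(times)
```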
Catalogs and Catalog Information
Required Product Information

- Product ID, product title, and product name are required fields. The rest of the fields are optional.
- These values should correspond with the values used in your internal product database.
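A quick pre-import check for the required fields might look like the sketch below. The key names `id`, `title`, and `name` are assumptions based on the fields listed above; confirm the exact names and the full required set against the Retail API schema you are importing with.

```python
# Assumed key names for the required fields listed above.
REQUIRED_FIELDS = ("id", "title", "name")


def missing_required_fields(product: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required keys that are absent or empty in a product record."""
    return [key for key in required if not product.get(key)]


# Hypothetical record with an empty title and no name.
problems = missing_required_fields({"id": "1042", "title": ""})
```

Records with a non-empty result can be fixed or set aside before the import runs.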
Product Attributes

- Product attributes are key-value pairs associated with the product. Examples include store name, vendor, and style.
- Product attributes are highly recommended since they act as strong signals for the recommendation model.
- System Attributes: Provide more information about the product such as brand, availability, color, and size.
- Custom Attributes: Extra attributes that you define such as store name, vendors, or style.
- Inventory Level Attributes: Provide store level information about the product.
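Custom attributes can be sketched as a small conversion from plain key-value pairs. The example assumes the Retail API's attribute shape, where each key maps to an object holding either a `text` list or a `numbers` list; the helper name and the sample values are hypothetical.

```python
def to_custom_attributes(raw: dict) -> dict:
    """Convert plain key-value pairs into the assumed {"text": [...]} / {"numbers": [...]} shape."""
    attributes = {}
    for key, value in raw.items():
        values = value if isinstance(value, list) else [value]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            attributes[key] = {"numbers": [float(v) for v in values]}
        else:
            attributes[key] = {"text": [str(v) for v in values]}
    return attributes


# Hypothetical custom attributes for one product.
attrs = to_custom_attributes({"vendor": "Acme", "style": ["casual", "summer"], "length_cm": 71})
```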
Product Levels

- Primary Items: The results that the Retail API returns. These can be individual SKUs and SKU groups.
- Variant Items: Versions of the SKU group primary product. Variants can only be individual SKU items. Variants are typically child items of the primary items. For example, a t-shirt could be the primary product and different colors of the t-shirt could be variants.
- Collections: Bundles of primary or variant items.
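The t-shirt example above can be sketched as one primary record plus one variant per color. This assumes the Retail API's `type` values `PRIMARY` and `VARIANT` and the `primaryProductId` field that links a variant to its parent; the ID scheme and the helper name are hypothetical.

```python
def build_product_family(primary_id: str, title: str, colors: list) -> list:
    """Build one primary record followed by one variant record per color."""
    primary = {"id": primary_id, "type": "PRIMARY", "title": title}
    variants = [
        {
            "id": f"{primary_id}-{color.lower()}",   # hypothetical variant ID scheme
            "type": "VARIANT",
            "primaryProductId": primary_id,          # links the variant to its primary
            "title": f"{title} ({color})",
        }
        for color in colors
    ]
    return [primary] + variants


records = build_product_family("tee-01", "Crew Neck T-Shirt", ["Red", "Blue"])
```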
Product Inventory

- Product-Level Inventory: Used by online retailers. Price, availability, and other inventory data are set for each product in the catalog.
- Local-Level Inventory: Used by retailers with brick-and-mortar and online stores. Keeps inventory information on a per-store basis.
User Events
User Event Types

- The recommendation is to record user events for all event types to get high-quality search results.
Event Type Priority

- Log the highest-priority user events to achieve quality data models.
Visitor ID

- The user event JSON format has a visitorID field.
- The visitorID field identifies the visitor and is very important because it is required by every event.
- The visitorID is used to link the same visitor across devices or sessions. It's crucial to providing personalized recommendations and search results.
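A user event carrying the visitor ID might be built as in the sketch below. It assumes the Retail API user event keys `eventType`, `visitorId`, `eventTime`, and `productDetails`, with `detail-page-view` as the event type for a product page view; the helper itself is hypothetical.

```python
from datetime import datetime, timezone


def detail_page_view(visitor_id: str, product_id: str, when: datetime = None) -> dict:
    """Build a product page view event; refuses to build one without a visitor ID."""
    if not visitor_id:
        raise ValueError("a visitor ID is required on every user event")
    when = when or datetime.now(timezone.utc)
    return {
        "eventType": "detail-page-view",
        "visitorId": visitor_id,
        "eventTime": when.isoformat(),
        "productDetails": [{"product": {"id": product_id}}],
    }


# Hypothetical event with a fixed timestamp so the output is reproducible.
event = detail_page_view("sess-8f3a", "tee-01", datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc))
```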
Implement Visitor ID

- If you have a sign-in feature, a web application can assign a unique identifier to every signed-in user. This helps track user behavior across devices.
- If the user has not signed in, you can use SessionID as the visitorID.
- If you use Google Analytics, you can use clientID in Google Analytics. This ID uniquely identifies a visitor on a single device.
- You cannot use PII data, such as an email or mobile number, in the visitorID.
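The selection rules above can be sketched as a small helper. The fallback order (signed-in identifier, then session ID, then Google Analytics client ID) is one reasonable reading of the list, not a prescribed rule, and the PII check is a rough heuristic that only catches values shaped like an email address or a `+`-prefixed phone number. All names here are hypothetical.

```python
import re

# Heuristic patterns only; they will not catch every form of PII.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\+\d{7,15}")


def looks_like_pii(value: str) -> bool:
    """Return True if the whole value is shaped like an email or an international phone number."""
    return bool(EMAIL_RE.fullmatch(value) or PHONE_RE.fullmatch(value))


def choose_visitor_id(user_key=None, session_id=None, ga_client_id=None) -> str:
    """Pick the first available identifier and reject values that look like PII."""
    for candidate in (user_key, session_id, ga_client_id):
        if candidate:
            if looks_like_pii(candidate):
                raise ValueError("visitor ID must not contain PII")
            return candidate
    raise ValueError("no identifier available for this visitor")


visitor_id = choose_visitor_id(session_id="sess-8f3a")
```

If your only stable identifier is an email or phone number, map it to an opaque internal key before it reaches this function.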
What Format Should the Data Be In To Be Ingested?

- There are multiple predefined schemas that you can use to import data into the Retail API. Specifying the wrong schema can lead to errors or data getting dropped.
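As a minimal sketch of preparing an import file, the helper below serializes records as newline-delimited JSON. It assumes the import expects one JSON record per line; the helper name and sample records are hypothetical, and the records still need to match whichever schema you select.

```python
import json


def to_json_lines(records: list) -> str:
    """Serialize records as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical product records ready for export.
payload = to_json_lines([{"id": "1", "title": "A"}, {"id": "2", "title": "B"}])
```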
Resources
- Vertex AI Search for Commerce: Frequently asked questions
- Google Cloud Skills Boost: Vertex AI Search for Commerce → Data Ingestion
- About Catalogs and Products
- Import Catalog Information
- Configure User Events
An Introduction to Google Cloud Vertex AI Search for Commerce: Data Ingestion — Part 3 was originally published in Google Cloud – Community on Medium.
Source Credit: https://medium.com/google-cloud/an-introduction-to-google-cloud-vertex-ai-search-for-commerce-data-ingestion-part-3-563e169e97cf?source=rss----e52cf94d98af---4