Building on the bananas momentum of generative media models on Google Cloud

“We’ve always believed that the future of creativity is a dynamic partnership between creatives and technology. That’s why we’re bringing together the best-in-class AI models across video, images and audio directly into our unlimited Envato subscription. Early signals indicate strong usage of models like Veo 3, accelerating our community’s creativity and empowering them to thrive in the process.” – Hichame Assi, CEO, Envato

Imagen 4 is Generally Available (GA) on Vertex AI

Our leading text-to-image model, Imagen 4, is engineered for creativity and speed. It delivers photorealistic images with sharp clarity and high-quality text rendering and typography, bringing your imagination to life faster than ever before. It is Generally Available and production ready on Vertex AI.

Shutterstock, a family of brands delivering scalable creative and GenAI solutions, is using Google’s Imagen 4 models to power high-quality, commercially ready AI images.

“At Shutterstock, our mission is to empower businesses with the essential, universal ingredients to make their work more effective. By bringing Google’s Imagen 4 models into our AI Image Generator, we’re making it easier than ever for teams to go from an idea to impact with market-ready visuals in seconds. Imagen 4 allows us to deliver high-quality, commercially ready outputs that meet the standards our customers expect. This integration ensures Shutterstock customers are always equipped with the most advanced, future-ready tools in creative AI.” – Keenan Kadam, Senior Product Manager, Shutterstock

Gemini 2.5 TTS (Text to Speech) is Generally Available (GA) on Vertex AI

Our generative media capabilities also include audio. Use Gemini 2.5 Text-to-Speech to build high-fidelity voice applications, all with the security and scale of Vertex AI.

We are thrilled to announce that Gemini 2.5 Text-to-Speech (TTS) is now Generally Available across both Pro and Flash models. These models prioritize human-like expression and control, transforming how you build voice applications:

Studio-Quality Dialogue, Now GA: Stop relying on choppy, single-speaker systems. Gemini 2.5 Flash and Gemini 2.5 Pro TTS are now ready for production, enabling you to generate dynamic, multi-speaker dialogue in a single API call for podcasts, audiobooks, and rich conversational customer service.
Advanced Style and Tone Control: Leverage natural language prompts to dictate the performance, not just the text. You can now control the voice’s tone, emotional expression, and accent, ensuring your brand’s voice is delivered with perfect fidelity.
Global Reach, Perfect Delivery: Gemini 2.5 Flash and Gemini 2.5 Pro TTS are now available in more than 70 languages, so your global audience receives the same high-quality, expressive voice experience, regardless of region.
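
As a rough illustration of the ideas above (a natural-language style direction plus a multi-speaker script sent in one request), the sketch below assembles a single prompt string. The helper names `build_dialogue_prompt` and `speakers_in`, the `Speaker: line` layout, and the sample speakers are assumptions made for illustration, not a documented Vertex AI format. The API call itself is deliberately left out; consult the Gemini TTS documentation for the actual request shape and voice configuration.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker name, line to be spoken)


def build_dialogue_prompt(style: str, turns: List[Turn]) -> str:
    """Combine a style direction and speaker turns into one prompt string.

    Hypothetical helper: the "Speaker: text" layout is an illustrative
    assumption, not a format confirmed by the article.
    """
    if not turns:
        raise ValueError("at least one speaker turn is required")
    lines = [style.strip()]
    for speaker, text in turns:
        lines.append(f"{speaker}: {text.strip()}")
    return "\n".join(lines)


def speakers_in(turns: List[Turn]) -> List[str]:
    """Return the distinct speaker names in order of first appearance."""
    seen: List[str] = []
    for speaker, _ in turns:
        if speaker not in seen:
            seen.append(speaker)
    return seen


if __name__ == "__main__":
    script = [
        ("Host", "Welcome back to the show!"),
        ("Guest", "Glad to be here."),
    ]
    # The style line steers tone and delivery; the turns carry the text.
    print(build_dialogue_prompt("Read this as a warm, upbeat podcast intro:", script))
    print(speakers_in(script))
```

Keeping the style direction separate from the script makes it easy to reuse the same dialogue with a different tone, or to check how many distinct speakers need a voice assigned before sending the request.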

Which gen media model to use, and when

We offer a suite of enterprise-grade options because choosing the right model for your project matters. If you’re not sure where to start, here’s a quick cheat sheet:

Choose Veo 3 if your workflow demands dynamic, high-quality video creation with granular control over scenes, characters, and narrative flow. It’s perfect for social media content, marketing campaigns, and any project where bringing stories to life through motion is key. Your input can be text, images, or a combination. For the latest Veo 3 pricing, see the Vertex AI pricing page.
Choose Gemini 2.5 Flash Image as a starting point for image creation or if your workflow is iterative and requires creating or editing an image with strong visual consistency. It’s the right choice for conversational editing, sketch-to-image tasks, style transfers, and adapting existing visuals. Your input is often a combination of images and text prompts.
Choose Imagen 4 if your workflow is focused on generating net-new images from text. It’s built for high-volume text-to-image applications where speed and higher resolution are your primary concerns. Your input is a text prompt.
Choose Gemini 2.5 Flash or Gemini 2.5 Pro TTS (Text-to-Speech) if your workflow is centered on bringing text to life with high-quality, emotionally expressive audio. It’s the right choice for creating lifelike voice agents, professional narration for content like podcasts and e-learning including multi-speaker synthesis, and dynamic character voices for gaming and entertainment. Your input is text.
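
The cheat sheet above can also be written down as a small lookup, which is handy if a pipeline needs to route requests by task. This is only a sketch: the `Task` enum and `recommend_model` function are hypothetical names, and the strings are the product names used in this post, not API model identifiers.

```python
from enum import Enum


class Task(Enum):
    """Workflow categories taken from the cheat sheet (names are illustrative)."""
    VIDEO = "video"
    IMAGE_EDIT = "image_edit"            # iterative or conversational editing
    IMAGE_FROM_TEXT = "image_from_text"  # high-volume text-to-image
    SPEECH = "speech"


# Product names as written in the article; these are NOT API model IDs.
RECOMMENDATION = {
    Task.VIDEO: "Veo 3",
    Task.IMAGE_EDIT: "Gemini 2.5 Flash Image",
    Task.IMAGE_FROM_TEXT: "Imagen 4",
    Task.SPEECH: "Gemini 2.5 Flash or Gemini 2.5 Pro TTS",
}


def recommend_model(task: Task) -> str:
    """Return the model the cheat sheet suggests for the given task."""
    return RECOMMENDATION[task]


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value}: {recommend_model(task)}")
```

Note that the post also suggests Gemini 2.5 Flash Image as a general starting point for image creation, so a real router might fall back to it when the image task is ambiguous.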

Get started with enterprise-grade creativity on Vertex AI

Gemini 2.5 Flash Image, Veo 3, Imagen 4, and Gemini 2.5 TTS are available on Vertex AI today.

Dive into Vertex AI Studio to get started with Gemini 2.5 Flash Image and Gemini 2.5 TTS. For Veo 3 and Imagen 4, get started in Vertex AI Media Studio.

Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/building-momentum-for-gen-media-including-nano-banana-/