💬 Transcripts
First of all, let’s focus on getting the audio transcripts:
- Gemini has proven to be natively good at audio transcription
- This requires less inference than image analysis
- Transcripts are central information and can be extracted independently of the visuals
💡 Generating an output that starts with correct answers should help to achieve an overall correct output.
We’ve also seen what a typical transcription entry can look like:
00:02 speaker_1: Welcome!
But, right away, there can be some ambiguities in our multimodal use case:
- What is a speaker?
- Is it someone we see/hear?
- What if the person visible in the video is not the one speaking?
- What if the person speaking is never seen in the video?
How do we unconsciously identify who is speaking in a video?
- First, probably by identifying the different voices on the fly?
- Then, probably by consolidating additional audio and visual cues?
Can Gemini understand voice characteristics?
prompt = """
Using only the video's audio, list the following audible characteristics:
- Voice tones
- Voice pitches
- Languages
- Accents
- Speaking styles
"""
video = TestVideo.GDM_PODCAST_TRAILER_PT59S
generate_content(prompt, video, show_as=ShowAs.MARKDOWN)
----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------
Input tokens : 16,730
Output tokens : 168
------------------------------ start of response -------------------------------
Okay, here's a breakdown of the audible characteristics in the video's audio:
- **Voice Tones:** The tones range from conversational and friendly to more serious and thoughtful, depending on the speaker and the topic. There are also moments of excitement and humor.
- **Voice Pitches:** There's a mix of high and low pitches, reflecting the different speakers (both male and female).
- **Languages:** The primary language is English.
- **Accents:** There are a variety of accents, including British and American.
- **Speaking Styles:** The speaking styles vary from casual and conversational to more formal and academic, depending on the speaker and the context.
------------------------------- end of response --------------------------------
What about a French video?
video = TestVideo.BRUT_FR_DOGS_WATER_LEAK_PT8M28S
generate_content(prompt, video, show_as=ShowAs.MARKDOWN)
-------------- BRUT_FR_DOGS_WATER_LEAK_PT8M28S / gemini-2.0-flash --------------
Input tokens : 144,055
Output tokens : 147
------------------------------ start of response -------------------------------
Here is a list of the audible characteristics of the video's audio:
- **Voice tones:** Conversational, informative, enthusiastic, serious, humorous
- **Voice pitches:** Varying, from low to high
- **Languages:** French
- **Accents:** Standard French
- **Speaking styles:** Clear, articulate, professional, casual
------------------------------- end of response --------------------------------
⚠️ We have to be cautious here: responses can consolidate multimodal information or even general knowledge. For example, if a person is famous, their name is most likely part of the LLM’s knowledge. If they are known to be from the UK, a possible inference is that they have a British accent. This is why we made our prompt more specific by including “using only the video’s audio”.
💡 If you conduct more tests, for example on private audio files (i.e., not part of common knowledge and with no additional visual cues), you’ll see that Gemini’s audio tokenizer performs exceptionally well and extracts semantic speech information!
After a few iterations, we can arrive at a transcription prompt focusing on the audio and voices:
prompt = """
Task:
- Watch the video and listen carefully to the audio.
- Identify each unique voice using a `voice` ID (1, 2, 3, etc.).
- Transcribe the video's audio verbatim with voice diarization.
- Include the `start` timecode (MM:SS) for each speech segment.
- Output a JSON array where each object has the following fields:
- `start`
- `text`
- `voice`
"""
video = TestVideo.GDM_PODCAST_TRAILER_PT59S
generate_content(prompt, video, show_as=ShowAs.MARKDOWN)
----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------
Input tokens : 16,800
Output tokens : 635
------------------------------ start of response -------------------------------
[
{
"start": "00:00",
"text": "Do I have to call you Sir Demis now?",
"voice": 1
},
{
"start": "00:01",
"text": "Oh, you don't. Absolutely not.",
"voice": 2
},
{
"start": "00:03",
"text": "Welcome to Google Deep Mind the podcast with me, your host Professor Hannah Fry.",
"voice": 1
},
{
"start": "00:06",
"text": "We want to take you to the heart of where these ideas are coming from. We want to introduce you to the people who are leading the design of our collective future.",
"voice": 1
},
{
"start": "00:19",
"text": "Getting the safety right is probably, I'd say, one of the most important challenges of our time. I want safe and capable.",
"voice": 3
},
{
"start": "00:26",
"text": "I want a bridge that will not collapse.",
"voice": 3
},
{
"start": "00:30",
"text": "just give these scientists a superpower that they had not imagined earlier.",
"voice": 4
},
{
"start": "00:34",
"text": "autonomous vehicles. It's hard to fathom that when you're working on a search engine.",
"voice": 5
},
{
"start": "00:38",
"text": "We may see entirely new genre or entirely new forms of art come up. There may be a new word that is not music, painting, photography, movie making, and that AI will have helped us create it.",
"voice": 6
},
{
"start": "00:48",
"text": "You really want AGI to be able to peer into the mysteries of the universe.",
"voice": 1
},
{
"start": "00:51",
"text": "Yes, quantum mechanics, string theory, well, and the nature of reality.",
"voice": 2
},
{
"start": "00:55",
"text": "Ow.",
"voice": 1
},
{
"start": "00:56",
"text": "the magic of AI.",
"voice": 6
}
]
------------------------------- end of response --------------------------------
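The structured JSON output makes downstream processing straightforward. As a quick sanity check, a sketch like the following (hypothetical, not part of the article's helper code) can validate and load the transcript into typed objects, tolerating a possible Markdown code fence around the JSON:

```python
import json
from dataclasses import dataclass


@dataclass
class SpeechSegment:
    start: str   # "MM:SS" timecode
    text: str    # verbatim transcription
    voice: int   # diarized voice ID (1, 2, 3, etc.)


def parse_transcript(response_text: str) -> list[SpeechSegment]:
    """Parse the model's JSON array into typed speech segments."""
    # The model may wrap the JSON in a Markdown code fence; strip it first.
    cleaned = response_text.strip().removeprefix("```json").removesuffix("```")
    entries = json.loads(cleaned)
    return [SpeechSegment(e["start"], e["text"], int(e["voice"])) for e in entries]


# Example with the first two segments of the response above:
sample = (
    '[{"start": "00:00", "text": "Do I have to call you Sir Demis now?", "voice": 1},'
    ' {"start": "00:01", "text": "Oh, you don\'t. Absolutely not.", "voice": 2}]'
)
segments = parse_transcript(sample)
print(len(segments), segments[0].voice)  # → 2 1
```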
This is looking good! And if you test these instructions on more complex videos, you’ll get similarly promising results.
Notice how the prompt reuses cherry-picked terms from the terminology previously provided by Gemini, while aiming for precision and concision:
- `verbatim` is unambiguous (unlike “spoken words”)
- `1, 2, 3, etc.` is an ellipsis (Gemini can infer the pattern)
- `timecode` is specific (`timestamp` has more meanings)
- `MM:SS` clarifies the timecode format
💡 Gemini 2.0 was trained to understand the specific `MM:SS` timecode format. Gemini 2.5 also supports the `H:MM:SS` format for longer videos. For the latest updates, refer to the video understanding documentation.
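When sorting or aligning segments, it helps to convert timecodes to seconds. A small helper (hypothetical, not from the article) can accept both formats:

```python
def timecode_to_seconds(timecode: str) -> int:
    """Convert an "MM:SS" or "H:MM:SS" timecode to a number of seconds."""
    parts = [int(p) for p in timecode.split(":")]
    seconds = 0
    for part in parts:  # works for both 2-part and 3-part timecodes
        seconds = seconds * 60 + part
    return seconds


print(timecode_to_seconds("00:48"))    # → 48
print(timecode_to_seconds("1:02:03"))  # → 3723
```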
We’re halfway there. Let’s complete our database generation with a second task…
Source Credit: https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec?source=rss—-e52cf94d98af—4
