Text-To-Speech (TTS) using Gemini API with Google Apps Script | by Kanshi Tanaike | Google Cloud - Community

This script provides a simple example for generating Text-To-Speech (TTS) using the Gemini API within Google Apps Script. The Gemini API generates audio data in the audio/L16;codec=pcm;rate=24000 format, which is not directly playable. Since there’s no built-in method to convert this to a standard audio/wav format, this sample script includes a custom function to handle the conversion.
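To make the format string less opaque, here is a minimal sketch that splits the MIME type into its parts and estimates how long the raw audio plays. The helper names parseAudioMimeType_ and estimateDurationSeconds_ are hypothetical and not part of the original script, and the single-channel default is an assumption, since the MIME type itself does not state a channel count.

```javascript
/**
 * Split a MIME type such as "audio/L16;codec=pcm;rate=24000" into its
 * base type and parameters. Hypothetical helper, not in the original script.
 */
function parseAudioMimeType_(mimeType) {
  const [type, ...params] = mimeType.split(";").map((s) => s.trim());
  const parameters = {};
  for (const p of params) {
    const [key, value] = p.split("=");
    if (key) parameters[key.toLowerCase()] = value;
  }
  return { type, parameters };
}

/**
 * Estimate the playback length in seconds of raw 16-bit PCM data.
 * Assumes mono audio by default; the MIME type does not carry this.
 */
function estimateDurationSeconds_(byteLength, mimeType, channels = 1) {
  const { parameters } = parseAudioMimeType_(mimeType);
  const rate = Number(parameters.rate);
  if (!Number.isFinite(rate) || rate <= 0) {
    throw new Error(`No usable sample rate in "${mimeType}".`);
  }
  const bytesPerSample = 2; // L16 is 16-bit linear PCM
  return byteLength / (rate * bytesPerSample * channels);
}
```

Under these assumptions, 48,000 bytes of audio/L16;codec=pcm;rate=24000 data correspond to one second of audio.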

The provided convertL16ToWav_ function is specifically designed for the audio/L16;codec=pcm;rate=24000 MIME type. Using it with other audio formats will result in an error.
The script uses a hardcoded WAV header. This header assumes the audio parameters of the Gemini API’s output for this format: a 24,000 Hz sample rate, 16-bit samples, and one channel (mono). If the Gemini API’s output format changes, the header will need to be adjusted.
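One detail worth knowing is that a WAV header also stores two size fields (the RIFF chunk size and the data chunk size), and in the hardcoded header these are fixed at values that correspond to 373,006 bytes of PCM data. As a minimal sketch of an alternative, the header could be built from the actual data length. The function name buildWavHeader_ and its default parameters are assumptions for illustration and do not appear in the original script.

```javascript
/**
 * Build a 44-byte PCM WAV header for the given data length.
 * Hypothetical helper; the defaults match the 24000 Hz, 16-bit, mono
 * format that the original script assumes.
 */
function buildWavHeader_(dataLength, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second
  const buffer = new ArrayBuffer(44);
  const view = new DataView(buffer);
  const writeAscii = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // RIFF chunk size (little-endian)
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataLength, true); // data chunk size
  // Apps Script byte arrays are signed (-128 to 127), so read back as Int8.
  return [...new Int8Array(buffer)];
}
```

Called as buildWavHeader_(373006), this produces the same 44 bytes as the hardcoded hex string. Inside convertL16ToWav_, the return statement could then be written as return [...buildWavHeader_(data.length), ...data]; so that the size fields follow the audio actually returned.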

Before running, replace "###" with your actual Gemini API key in myFunction.
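If you would rather not keep the key in the source code, one option is to read it from the script properties. The sketch below is only an illustration: getApiKey_ is a hypothetical helper and "GEMINI_API_KEY" is an assumed property name that you would need to create yourself. The store is passed in as an argument; in Apps Script it would be PropertiesService.getScriptProperties().

```javascript
/**
 * Read the API key from a properties store instead of hardcoding it.
 * Hypothetical helper; "GEMINI_API_KEY" is an assumed property name.
 *
 * @param {Object} store Any object with a getProperty(name) method, e.g.
 *   PropertiesService.getScriptProperties() in Apps Script.
 * @param {String} propertyName Name of the property that holds the key.
 * @returns {String} The API key.
 */
function getApiKey_(store, propertyName = "GEMINI_API_KEY") {
  const key = store.getProperty(propertyName);
  // Reject a missing value and the "###" placeholder used in the sample.
  if (!key || key === "###") {
    throw new Error(`Set the script property "${propertyName}" to your Gemini API key.`);
  }
  return key;
}
```

In myFunction, the first line could then read const apiKey = getApiKey_(PropertiesService.getScriptProperties());.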

/**
 * Convert "audio/L16;codec=pcm;rate=24000" generated by Gemini API to "audio/wav" using Google Apps Script.
 * This can be used for only the mimeType of "audio/L16;codec=pcm;rate=24000". Please be careful about this.
 *
 * @param {Byte[]} data Byte array of the input data "audio/L16;codec=pcm;rate=24000"
 * @param {String} mimeType MimeType of the input data.
 * @returns {Byte[]} Converted data.
 */
function convertL16ToWav_(data, mimeType) {
  if (mimeType != "audio/L16;codec=pcm;rate=24000") {
    throw new Error(
      `Sorry. As a simple sample, this can be used for only "audio/L16;codec=pcm;rate=24000".`
    );
  }
  // This header is for a 24000 Hz, 16-bit, mono PCM WAV file.
  const headerData =
    "5249464632B1050057415645666D74201000000001000100C05D000080BB000002001000646174610EB10500";
  const array = [...headerData];
  // Split the hex string into two-character pairs and convert each pair to a
  // signed byte (-128 to 127), which is how Apps Script represents bytes.
  const head = [...Array(Math.ceil(array.length / 2))]
    .map((_) => array.splice(0, 2).join(""))
    .map((e) =>
      parseInt(e[0], 16).toString(2).length == 4
        ? parseInt(e, 16) - 256
        : parseInt(e, 16)
    );
  return [...head, ...data];
}

function myFunction() {
  const apiKey = "###"; // Please set your API key here.
  const text = [
    "Create Text-To-Speech the following conversation.",
    "User A: Hey there! How are you doing today?",
    "User B: Hi! I'm doing well, thanks. How about you?",
    "User A: I'm good too, thanks for asking!",
  ].join("\n");
  const url = `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent?key=${apiKey}`;
  const payload = {
    contents: [{ role: "user", parts: [{ text }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        multiSpeakerVoiceConfig: {
          speakerVoiceConfigs: [
            {
              speaker: "User A",
              voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } },
            },
            {
              speaker: "User B",
              voiceConfig: { prebuiltVoiceConfig: { voiceName: "Leda" } },
            },
          ],
        },
      },
    },
  };
  const res = UrlFetchApp.fetch(url, {
    contentType: "application/json",
    payload: JSON.stringify(payload),
  });
  const obj = JSON.parse(res.getContentText());
  const { data, mimeType } = obj.candidates[0].content.parts[0].inlineData;
  const blob = Utilities.newBlob(
    convertL16ToWav_(Utilities.base64Decode(data), mimeType),
    "audio/wav",
    "sample.wav"
  );
  DriveApp.createFile(blob);
}

Upon successful execution of myFunction, an audio file named sample.wav will be created in the root folder of your Google Drive. You can then click this file to play the generated speech.
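When execution is not successful, the sample script fails at the destructuring line with a fairly unhelpful message, because it assumes the audio is always in the first part of the first candidate. As a hedged sketch, the lookup could be wrapped in a small function that reports what went wrong. The name extractInlineAudio_ is hypothetical, and the error branch assumes the parsed body has an error.message field, which you would only see if the request was sent with muteHttpExceptions: true.

```javascript
/**
 * Find the first part carrying inline audio in a parsed generateContent response.
 * Hypothetical helper, not in the original script.
 *
 * @param {Object} obj Parsed JSON response body.
 * @returns {Object} The inlineData object with "data" and "mimeType".
 */
function extractInlineAudio_(obj) {
  if (obj && obj.error) {
    // Assumed error shape: { error: { message: "..." } }
    throw new Error(`Gemini API error: ${obj.error.message}`);
  }
  const candidate = obj && obj.candidates && obj.candidates[0];
  const parts = (candidate && candidate.content && candidate.content.parts) || [];
  const part = parts.find((p) => p.inlineData && p.inlineData.data);
  if (!part) {
    throw new Error("No inline audio data found in the response.");
  }
  return part.inlineData;
}
```

In myFunction, the destructuring line could then become const { data, mimeType } = extractInlineAudio_(obj);.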

Source Credit: https://medium.com/google-cloud/text-to-speech-tts-using-gemini-api-with-google-apps-script-6ece50a617fd?source=rss----e52cf94d98af---4
