A car you can talk to has been a longstanding dream, whether as the basis for television shows or more recent smartphone integrations.
One way to achieve better, more natural voice commands is to incorporate AI foundation models into vehicle systems. These models offer more intelligence than traditional voice command systems and can connect everyday questions with vehicle functions in a seamless dialogue. By making interactions more intuitive, they allow drivers to focus on the road ahead and enjoy every aspect of the journey.
While large language models (LLMs) offer powerful capabilities, they present one considerable drawback, at least in automotive settings: their reliance on consistent network access makes LLMs impractical for in-vehicle use due to potential lag and interruption.
To deliver reliable, next-level intelligence, BMW Group and Google Cloud completed a proof of concept to build an efficient, reproducible solution that automates the workflows for fine-tuning, optimizing, evaluating, and deploying language models for specific domains, with a special focus on small language models (SLMs). In this blog, we share our results and findings and provide source code to encourage wider adoption.
“Finding the optimal trade-off for small-language models is a challenging, iterative process,” Dr. Celine Laurent-Winter, vice-president for Connected Vehicle Platforms at BMW Group, said. “Automating the workflow for training, testing, and deploying domain-specific SLM allows a big push for our development efficiency. With automated pipelines, we can rapidly adapt models to our domain and rigorously test and evaluate them against domain-specific benchmarks. This allows us to iterate and optimize models in hours rather than days, in an automated, reproducible workflow.”
Small language models: small concept, big potential
Generative AI offers automakers powerful new capabilities, enabling complex voice commands. Before, it would have been almost impossible for a voice command system to understand a request like: “Find me a restaurant with vegetarian offerings along my route that is open now and has a customer rating higher than four stars.” With its language understanding and reasoning capabilities, gen AI can puzzle out such a request.
Integrating this intelligence, however, presents a challenge: Cloud-based LLMs are powerful but rely on a stable network to avoid frustrating lag. Conversely, onboard LLMs are constrained by a vehicle’s limited computing hardware.
Small language models may offer an ideal balance — but finding the right trade-off between size and capability requires careful optimization.
These purpose-built, right-sized generative AI models can be run directly on edge devices, including vehicles. A common approach is to have the SLM handle the most frequently used features locally and route only the more complex requests to a cloud-based LLM. An SLM must be small enough to run on the target device, yet capable enough to be useful, especially when tailored to the specific automotive context via fine-tuning.
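The local-first routing pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the routing logic used in the proof of concept: the intent names, the keyword table standing in for the SLM's intent classifier, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of frequent intents the on-board SLM is tuned for.
LOCAL_INTENTS = {"climate", "media", "navigation_basic"}

# Hypothetical keyword table standing in for a real intent classifier.
KEYWORDS = {
    "climate": ("temperature", "heating", "air conditioning"),
    "media": ("play", "volume", "radio"),
    "navigation_basic": ("navigate", "drive me", "route to"),
}


@dataclass
class Route:
    target: str  # "slm" (on device) or "llm" (cloud)
    intent: str


def classify_intent(utterance: str) -> str:
    """Return the first intent whose keywords appear in the utterance."""
    text = utterance.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "unknown"


def route(utterance: str, online: bool) -> Route:
    """Handle frequent intents locally; send everything else to the cloud if reachable."""
    intent = classify_intent(utterance)
    if intent in LOCAL_INTENTS:
        return Route("slm", intent)
    # Complex or unrecognized request: prefer the cloud LLM, fall back to the SLM offline.
    return Route("llm" if online else "slm", intent)


simple = route("Set the temperature to 21 degrees", online=True)
complex_online = route("Find a vegetarian restaurant with four stars", online=True)
complex_offline = route("Find a vegetarian restaurant with four stars", online=False)
print(simple, complex_online, complex_offline)
```

The offline fallback is a design choice of this sketch: when the network is unavailable, the vehicle still answers with the on-board model rather than failing.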
Challenges of integrating foundation models into vehicles
Compared to the cloud, vehicle infotainment systems have limited storage and computing power. A 5 Series sedan or X3 SUV might look big, but there’s still limited space given all the performance, technology, and luxury that must fit between their four wheels.
Therefore, integrating a large language model such as Gemma 3 27B, which can consume over 40 GB of memory at 16-bit precision, is difficult. While smaller versions exist (e.g., Gemma 3 270M), they still tend to have a broad, generalized focus, albeit with potentially reduced accuracy compared to bigger models.
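A back-of-the-envelope calculation shows why model size matters so much on embedded hardware. The sketch below estimates weight storage only, assuming the nominal parameter counts implied by the model names (27 billion and 270 million) and ignoring activations, the KV cache, and runtime overhead, so real memory use will differ.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names; an assumption, not measured values.
MODELS = {"27B": 27e9, "270M": 270e6}

for name, params in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

Under these assumptions, the 27B weights alone come to about 54 GB at 16-bit precision and about 13.5 GB at 4-bit, while the 270M weights need roughly 0.54 GB at 16-bit.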
Hence, model compression (to reduce size) and tuning (to ensure high accuracy) become necessary for specialized use cases like ours. The goal then is to find the best trade-offs between model size, inference time, and accuracy for the most frequent tasks.
Converting LLMs to SLMs
Turning large, resource-intensive LLMs into efficient SLMs requires well-known compression and quality enhancement techniques. Here’s a condensed overview of the common techniques we’ve explored:
Compression techniques:
The primary goal is to reduce the model’s compute and memory complexity. This can be done via:
- Quantization: Reducing the model’s memory footprint by converting high-precision parameters (e.g., 32-bit floats) to lower-precision formats (e.g., 8-bit integers or 4-bit floats). This leads, however, to a potential, but often minor, reduction in accuracy.
- Pruning: Systematically identifying and removing the least important parameters or connections within the neural network, streamlining the SLM while retaining core capabilities.
- Knowledge distillation: A compact “student” model is trained to replicate the performance of a larger “teacher” LLM, transferring complex knowledge into a much smaller, more efficient architecture.
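To make the first two techniques concrete, here is a toy sketch on a plain list of weights. It assumes symmetric per-tensor 8-bit quantization and simple magnitude-based pruning, which are common textbook variants and not necessarily the ones used in this project. Knowledge distillation is left out because it needs a full training loop.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    num_to_drop = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(by_magnitude[:num_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


# Toy weights for illustration only.
weights = [0.02, -1.27, 0.5, 0.0031, -0.64, 0.9]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
pruned = prune_by_magnitude(weights, sparsity=0.5)

print(quantized, scale, max_error)
print(pruned)
```

The rounding error per weight is bounded by half the scale, which is the source of the "potential, but often minor" accuracy loss mentioned above.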
Post-compression quality enhancement
We also applied methods that can help recover or improve performance lost during compression.
Evaluating performance for automotive tasks
Once a model has been compressed and enhanced, a crucial final step is to rigorously evaluate its performance. This covers system performance (e.g., latency, resource utilization on target hardware) and the qualitative assessment of the model’s generated responses. For assessing quality, established methods are:
- Point-wise evaluation: These methods assess the quality of a single generated response by comparing it against a pre-defined “ground truth” or reference answer. Examples include ROUGE and BLEU metrics.
- Pair-wise evaluation: This approach determines which of two different model outputs is better, often aligning more closely with subjective human preferences for conversational quality. This can be executed with an Auto-rater (or LLM-as-a-judge) or direct Human Feedback.
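Both evaluation styles can be illustrated with a small sketch. The point-wise function below is a simplified ROUGE-1 F1 score (whitespace tokenization, no stemming), so it will not match a full ROUGE implementation exactly. The pair-wise function only aggregates verdicts that an auto-rater or human is assumed to have produced already; the labels "A", "B", and "tie" are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def win_rate(judgments, model="A"):
    """Share of decided comparisons won by `model`; ties are excluded."""
    decided = [j for j in judgments if j in ("A", "B")]
    if not decided:
        return 0.0
    return sum(j == model for j in decided) / len(decided)


# Point-wise: compare one response against a reference answer.
score = rouge1_f1(
    "set cabin temperature to 21 degrees",
    "set the temperature to 21 degrees",
)

# Pair-wise: placeholder verdicts from a judge comparing model A with model B.
rate = win_rate(["A", "B", "A", "tie"])

print(score, rate)
```

In this toy case five of six unigrams overlap in both directions, so precision, recall, and F1 all equal 5/6, and model A wins two of the three decided comparisons.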
Developing a robust testing strategy combining these evaluation methods is essential for validating the success of the compression and fine-tuning efforts.
The challenge of finding the optimal configuration
The path from a general-purpose LLM to a specialized SLM is not straightforward. Every choice, from the type of quantization to the characteristics and contents of the domain-specific fine-tuning dataset, directly affects the quality and efficiency of the final model. Because these choices multiply, the number of possible configurations grows exponentially with the number of options, and each configuration has its own trade-offs.
This intricate landscape is further complicated by practical constraints: Not every compression or enhancement technique is applicable to every language model, and some methods are incompatible. For example, API-only models like Google Gemini permit fine-tuning only through a fixed set of methods.
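A short sketch shows how quickly the number of configurations grows and how compatibility rules trim it. All option names are illustrative placeholders rather than the set evaluated in this project, and the rule for API-only models is an assumption made for the example.

```python
from itertools import product

# Illustrative options only; not the models or methods actually evaluated.
SEARCH_SPACE = {
    "base_model": ["open-weights-small", "open-weights-medium", "api-only"],
    "quantization": ["none", "int8", "int4"],
    "tuning": ["none", "lora", "full"],
    "alignment": ["none", "dpo"],
}


def is_compatible(config: dict) -> bool:
    """Assumed rule: an API-only model cannot be quantized locally or fully fine-tuned."""
    if config["base_model"] == "api-only":
        return config["quantization"] == "none" and config["tuning"] != "full"
    return True


def enumerate_configs(space: dict):
    """Yield every compatible combination of options as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        if is_compatible(config):
            yield config


total = 1
for options in SEARCH_SPACE.values():
    total *= len(options)

viable = list(enumerate_configs(SEARCH_SPACE))
print(f"{total} combinations, {len(viable)} compatible")
```

Even this tiny space yields 54 raw combinations, of which 40 pass the assumed compatibility rule. Adding one more dimension with three options would triple both numbers.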
The sheer volume of viable combinations renders a manual search for the optimal configuration an incredibly tedious, if not impossible, undertaking. To overcome this challenge, we built automated, reproducible workflows through executable pipelines.
Solution: An automated workflow for SLM optimization
Our solution is an automated workflow that orchestrates compression, adaptation, and evaluation steps needed to produce optimized SLMs. This is achieved by designing a flexible pipeline where each step is a modular, parameterized component. This workflow-based approach allows us to systematically explore the vast configuration space and pinpoint the best-performing models for in-vehicle deployment.
The process is structured as a workflow that can be executed automatically on a powerful workflow engine, such as Vertex AI Pipelines. In this workflow, we can define the sequence of operations (e.g., quantization, followed by LoRA fine-tuning and DPO) as a chain of interconnected components. Through pipeline parameters, we can search the entire configuration space, testing different base models, compression techniques, tuning methods, and evaluation datasets.
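The idea of a parameterized chain of components can be sketched without any pipeline framework. The code below is a framework-agnostic illustration, not the project's pipeline definition: the component functions are stubs that only record what they would do, and the registry, parameter names, and base model name are hypothetical. On a workflow engine such as Vertex AI Pipelines, each stub would correspond to a real pipeline component.

```python
from typing import Callable, Dict, List

Artifact = dict
Component = Callable[[Artifact, dict], Artifact]


def quantize(model: Artifact, params: dict) -> Artifact:
    """Stub: record the target precision instead of quantizing real weights."""
    return {**model, "precision": params["precision"], "steps": model["steps"] + ["quantize"]}


def lora_finetune(model: Artifact, params: dict) -> Artifact:
    """Stub: record the LoRA rank instead of running fine-tuning."""
    return {**model, "lora_rank": params["lora_rank"], "steps": model["steps"] + ["lora_finetune"]}


def dpo(model: Artifact, params: dict) -> Artifact:
    """Stub: record the DPO beta instead of running preference optimization."""
    return {**model, "dpo_beta": params["dpo_beta"], "steps": model["steps"] + ["dpo"]}


# Hypothetical registry mapping step names to components.
REGISTRY: Dict[str, Component] = {
    "quantize": quantize,
    "lora_finetune": lora_finetune,
    "dpo": dpo,
}


def run_pipeline(base_model: str, steps: List[str], params: dict) -> Artifact:
    """Run the named components in order, passing each output to the next step."""
    model: Artifact = {"base_model": base_model, "steps": []}
    for name in steps:
        model = REGISTRY[name](model, params)
    return model


result = run_pipeline(
    "example-base-model",
    ["quantize", "lora_finetune", "dpo"],
    {"precision": "int4", "lora_rank": 8, "dpo_beta": 0.1},
)
print(result)
```

Because the step order and every setting arrive as parameters, trying a different configuration means changing the arguments to `run_pipeline`, not the code.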
This automated search allows for the comprehensive exploration of possibilities that would be unfeasible to test manually. The final artifacts from each pipeline execution are fully traceable and ready for deployment. This includes the versioned SLM itself, exact configuration parameters that produced the model, datasets used for evaluation, and a detailed report of its performance metrics, ensuring complete reproducibility.
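One way to picture such a traceable record is a small manifest with a content fingerprint. This is a hypothetical sketch: the `RunManifest` class, its field names, and the values are placeholders, and the metric value is a dummy rather than a reported result.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of one pipeline execution."""
    model_version: str
    config: dict
    eval_datasets: tuple
    metrics: dict

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the manifest, independent of dict key order."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values for illustration; the metric is a dummy, not a measured result.
manifest = RunManifest(
    model_version="slm-example-v1",
    config={"quantization": "int4", "tuning": "lora"},
    eval_datasets=("example-eval-set",),
    metrics={"rouge1_f1": 0.0},
)
same_content = RunManifest(
    model_version="slm-example-v1",
    config={"tuning": "lora", "quantization": "int4"},
    eval_datasets=("example-eval-set",),
    metrics={"rouge1_f1": 0.0},
)

print(manifest.fingerprint())
```

Serializing with sorted keys means two runs with identical settings produce the same fingerprint, which makes it easy to detect whether a model was produced by a given configuration.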
Implementation: An automated workflow with Vertex AI Pipelines
Our solution is built on Google Cloud’s Vertex AI platform, using configurable, executable pipeline templates. This offers a structured and automated way to find optimal SLMs in the vast possible search space. Figure 1 illustrates this workflow, its steps and their interactions with various data and model stores.
Source Credit: https://cloud.google.com/blog/topics/manufacturing/how-bmw-is-testing-slms-not-llms-for-in-vehicle-voice-commands/
