
An evaluation framework for gen AI
Evaluating gen AI applications and agents like DB Lumina requires a custom framework due to the complexity and variability of model outputs. Traditional metrics and generic benchmarks often fail to capture the needs of gen AI features, the nuanced expectations of domain-specific users, and the operational constraints of enterprise environments. This necessitates a new set of gen AI metrics to accurately measure performance.
The DB Lumina evaluation framework employs a rich and extensible set of both industry-standard and custom-developed metrics, which are mapped to defined categories and documented in a central metric dictionary to ensure consistency across teams and features. Standard metrics like accuracy, completeness, and latency are foundational, but they are augmented with custom metrics, such as citation precision and recall, false rejection rates, and verbosity control — each tailored to the specific demands and regulatory requirements of financial research and document-grounded generation. Popular frameworks like Ragas also provide a solid foundation for assessing how well our RAG system grounds its responses in retrieved documents and avoids hallucinations.
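To make the citation metrics concrete, here is a minimal sketch of how citation precision and recall could be computed for a single answer. The set-based definitions, the function name `citation_precision_recall`, and the document IDs are assumptions for illustration; the actual definitions in the DB Lumina metric dictionary are not given in this article.

```python
from typing import Iterable, Tuple


def citation_precision_recall(
    cited: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for one answer's citations.

    Assumed definitions (not taken from the DB Lumina metric dictionary):
      precision = |cited & relevant| / |cited|
      recall    = |cited & relevant| / |relevant|
    An empty denominator yields 0.0 rather than raising.
    """
    cited_set, relevant_set = set(cited), set(relevant)
    hits = len(cited_set & relevant_set)
    precision = hits / len(cited_set) if cited_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the answer cites three documents, two of which
# appear in the reference set of four relevant documents.
precision, recall = citation_precision_recall(
    cited=["doc-1", "doc-2", "doc-7"],
    relevant=["doc-1", "doc-2", "doc-3", "doc-4"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the three citations are correct, and two of the four relevant documents are cited, so precision is about 0.67 and recall is 0.50.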
In addition, test datasets are carefully curated to reflect a wide range of real-world scenarios, edge cases, and potential biases across DB Lumina’s core features like chat, document Q&A, templates, and RAG-based knowledge retrieval. These datasets are version-controlled and regularly updated to maintain relevance as the tool evolves. Their purpose is to provide a stable benchmark for evaluating model behavior under controlled conditions, enabling consistent comparisons across optimization cycles.
Evaluation is both quantitative and qualitative, combining automated scoring with human review for aspects like tone, structure, and content fidelity. Importantly, the framework ensures each feature is assessed for correctness, usability, efficiency, and compliance while enabling the rapid feedback and robust risk management needed to support iterative optimization and ongoing performance monitoring. We compare current metric outputs against historical baselines, using stable test sets, Git hash tracking, and automated metric pipelines, so that performance deviations are caught early and addressed before they impact users or compliance standards.
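The baseline comparison described above can be sketched roughly as follows. This is an illustrative sketch only: the `MetricRun` structure, the metric names, the 2% relative tolerance, and the set of lower-is-better metrics are all assumptions, not details of the DB Lumina pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricRun:
    git_hash: str             # commit the evaluation ran against
    scores: Dict[str, float]  # metric name -> value on the fixed test set


# Hypothetical: metrics where a smaller value is better.
LOWER_IS_BETTER = {"latency_seconds", "false_rejection_rate"}


def find_regressions(
    baseline: MetricRun, current: MetricRun, tolerance: float = 0.02
) -> List[str]:
    """List metrics that worsened by more than `tolerance` (relative change)."""
    regressions = []
    for name, base_value in baseline.scores.items():
        if name not in current.scores or base_value == 0:
            continue  # skip metrics that cannot be compared
        new_value = current.scores[name]
        change = (new_value - base_value) / abs(base_value)
        if name in LOWER_IS_BETTER:
            worsened = change > tolerance
        else:
            worsened = change < -tolerance
        if worsened:
            regressions.append(
                f"{name}: {base_value:.3f} -> {new_value:.3f} "
                f"({baseline.git_hash} -> {current.git_hash})"
            )
    return regressions


# Hypothetical runs on the same versioned test set.
baseline = MetricRun(
    "a1b2c3d",
    {"accuracy": 0.90, "citation_precision": 0.80, "latency_seconds": 2.0},
)
current = MetricRun(
    "e4f5a6b",
    {"accuracy": 0.85, "citation_precision": 0.81, "latency_seconds": 2.5},
)
regressions = find_regressions(baseline, current)
for line in regressions:
    print(line)
```

Keying each run to a Git hash means a flagged regression can be traced to the change that introduced it, while the fixed test set keeps the two runs comparable.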
This layered approach ensures that DB Lumina is not only accurate and efficient but also aligned with Deutsche Bank’s internal standards, achieving a balanced and rigorous evaluation strategy that supports both innovation and accountability.
Bringing new benefits to the business
We developed an initial pilot for DB Lumina with Google Cloud Consulting, creating a simple prototype early in the use case development that used only embeddings without prompts. Though it was surpassed by later versions, this pilot informed the subsequent development of DB Lumina’s RAG architecture.
The project then transitioned through our development and application testing environments to our production deployment, going live in September 2024. DB Lumina is now in the hands of around 5,000 users across Deutsche Bank Research, specifically in divisions like Investment Bank Origination & Advisory and Fixed Income & Currencies. We plan to roll it out to more than 10,000 users across corporate banking and other functions by the end of the year.
DB Lumina is expected to deliver significant business benefits for Deutsche Bank:
- Time savings: Analysts reported saving 30 to 45 minutes on preparing earnings note templates and up to two hours when writing research reports and roadshow updates.
- Increased analysis depth: One analyst increased the analysis in an earnings report by 50%, adding additional sections by region and activity, as well as a summary section for forecast changes. This was achieved through summarization of earnings releases and investor transcripts and subsequent analysis through conversational prompts.
- New analysis opportunities: DB Lumina has created opportunities for teams to analyze new topics. For example, the U.S. and European Economics teams use DB Lumina to score central bank communications to assess hawkishness and dovishness over time. Another analyst was able to analyze and compare budget speeches from eight different ministries, tallying up keywords related to capacity constraints and growth orientation to identify shifts in priorities.
- Increased accuracy: Analysts have also started using DB Lumina as part of their editing process. One supervisory analyst noted that since the rollout, there has been a marked improvement in the editorial and grammatical accuracy across analyst notes, especially from non-native English speakers.
Building the future of gen AI and RAG in finance
We’ve seen the power of RAG transform how financial institutions interact with their data. DB Lumina has proved the value of combining retrieval, gen AI, and conversational AI, but this is just the start of our journey. We believe the future lies in embracing and refining the “agentic” capabilities that are inherent in our architecture. We envision building and orchestrating a system where various components act as agents — all working together to provide intelligent and informed responses to complex financial inquiries.
To support our vision moving forward, we plan to deepen agent specialization within our RAG framework, building agents designed to handle specific types of queries or tasks across compliance, investment strategies, and risk assessment. We also want to incorporate the ReAct (Reasoning and Acting) paradigm into our agents’ decision-making process to enable them to not only retrieve information but also actively reason, plan actions, and refine their searches to provide more accurate and nuanced answers.
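For readers unfamiliar with ReAct, the pattern alternates a reasoning step, an action (such as a search), and an observation until the agent decides it can answer. The sketch below is a toy illustration only: the `scripted_policy` function stands in for a model call, and the `search` tool, the corpus, and all names are hypothetical rather than part of DB Lumina.

```python
from typing import Callable, Dict, List, Tuple

# Toy corpus standing in for a retrieval backend (illustrative data only).
CORPUS: Dict[str, str] = {
    "rates": "The central bank held its policy rate steady.",
    "inflation": "Inflation eased for a third consecutive month.",
}


def search(query: str) -> str:
    """Hypothetical retrieval tool: exact-key lookup in the toy corpus."""
    return CORPUS.get(query, "no results")


Step = Tuple[str, str, str]  # (thought, action, action_input)


def scripted_policy(question: str, observations: List[str]) -> Step:
    """Stand-in for the reasoning step; a real agent would call an LLM here."""
    if not observations:
        return ("I should look up the policy rate first.", "search", "rates")
    if len(observations) == 1:
        return ("I also need the inflation trend.", "search", "inflation")
    return ("I have enough evidence to answer.", "finish", " ".join(observations))


def react_loop(
    question: str,
    policy: Callable[[str, List[str]], Step],
    max_steps: int = 5,
) -> str:
    """Alternate thought -> action -> observation until the policy finishes."""
    observations: List[str] = []
    for _ in range(max_steps):
        thought, action, action_input = policy(question, observations)
        if action == "finish":
            return action_input
        # Only one tool exists in this sketch, so every other action is a search.
        observations.append(search(action_input))
    return "stopped: step limit reached"


answer = react_loop("What is the rate and inflation outlook?", scripted_policy)
print(answer)
```

The step limit is a simple safeguard against an agent that never decides to finish; a production agent would also log each thought and observation for auditability.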
In addition, we’ll be actively exploring and implementing more of the tools and services available within Vertex AI to further enhance our AI capabilities. This includes exploring other models for specific tasks or to achieve different performance characteristics, optimizing our vector search infrastructure, and utilizing AI pipelines for greater efficiency and scalability across our RAG system.
The ultimate goal is to empower DB Lumina to handle increasingly complex and multi-faceted queries through improved context understanding, ensuring it can accurately interpret context like previous interactions and underlying financial concepts. This includes moving beyond simple question answering to providing analysis and recommendations based on retrieved information. To enhance DB Lumina’s ability to provide real-time information and address queries requiring up-to-date external data, we are planning to integrate a feature for grounding responses with internet-based information.
By focusing on these areas, we aim to transform DB Lumina from a helpful information retriever into a powerful AI agent capable of tackling even the most challenging financial inquiries. This will unlock new opportunities for improved customer service, enhanced decision-making, and greater operational efficiency for financial institutions. The future of RAG and gen AI in finance is bright, and we’re excited to be at the forefront of this transformative technology.
Source Credit: https://cloud.google.com/blog/topics/financial-services/deutsche-bank-delivers-ai-powered-financial-research-with-db-lumina/