Welcome to my blog, I am Benjamin, a designer and developer working at Pigment
5 Financial AI Benchmarks for anyone building financial AI agents: what they test vs. what they don't

In late 2023, GPT-4 got 81% of questions about SEC filings wrong. That number comes from FinanceBench. Models have improved since then, but the result set off a wave of new benchmarks, each trying to capture a different slice of what "financial AI" means. In this article, I map those benchmarks, break down what they test, and identify where the gaps are.

Sources used

This article draws on peer-reviewed research, industry frameworks, and benchmark documentation.

The 5 major financial AI benchmarks

1. FinanceBench: document QA over SEC filings

Released in November 2023 by Patronus AI, FinanceBench targets a specific question: can an LLM answer straightforward financial questions when given access to SEC filings?

SEC filings are standardized financial reports that publicly traded companies in the U.S. must submit to regulators. They include 10-Ks (annual reports covering revenue, expenses, cash flows, and risk factors), 10-Qs (quarterly financial updates), and 8-Ks (reports on major events like mergers or leadership changes). These documents are often hundreds of pages, full of tables, and cross-references. FinanceBench focuses on revenue lookups, margin calculations, and factual extraction.

The benchmark contains 10,231 questions about publicly traded companies, each paired with an answer and an evidence string pointing to the source document. The questions are designed to be "clear-cut and straightforward," a minimum performance standard, not an expert-level challenge.
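That question-answer-evidence structure lends itself to a simple evaluation loop. Here is a minimal sketch of a FinanceBench-style harness; the field names (`question`, `answer`, `evidence`) and the exact-match grading are my assumptions based on the benchmark's description, not its actual schema or grader:

```python
# Minimal sketch of a FinanceBench-style evaluation harness.
# Field names and grading rule are illustrative assumptions.

def evaluate(items, model_fn):
    """Score a model against question/answer pairs; return accuracy
    plus the evidence strings for every miss, for error analysis."""
    misses = []
    correct = 0
    for item in items:
        prediction = model_fn(item["question"])
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
        else:
            misses.append(item["evidence"])
    return correct / len(items), misses

# Toy run with a stub model that always answers "$4.2B":
items = [
    {"question": "What was FY2022 revenue?", "answer": "$4.2B",
     "evidence": "10-K, p. 45, consolidated statements of operations"},
    {"question": "What was FY2022 gross margin?", "answer": "61%",
     "evidence": "10-K, p. 46"},
]
accuracy, misses = evaluate(items, lambda q: "$4.2B")
# accuracy is 0.5; the one miss carries its evidence string
```

Keeping the evidence string attached to each miss is the useful part: it lets you jump straight from a wrong answer to the page of the filing the model should have read.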

The first results of this benchmark are from late 2023 and reflect an earlier generation of models, but they established an important baseline: the best-performing configuration, GPT-4-Turbo with retrieval, incorrectly answered or refused 81% of questions.

A 150-case open-source subset is still available for anyone to test against.

2. FinBen: holistic evaluation across 7 financial aspects

Released in February 2024 (revised June 2024), FinBen was designed to close the gap left by FinanceBench: its authors wanted to test more than extracting facts from financial documents.

FinBen evaluates LLMs across 36 datasets spanning 24 financial tasks and 7 distinct aspects: information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making.

The benchmark is fully open-source. Datasets, results, and code are available on GitHub. Early results show that LLMs handle extraction and textual analysis well but struggle with advanced reasoning, forecasting, and text generation.

FinBen is the most comprehensive taxonomy available. If you want to understand where your product is strong versus weak across the full range of financial tasks, this is your starting point.

3. FinanceQA: complex numerical analyst tasks

Released in January 2025, FinanceQA focuses on the actual numerical work. The benchmark is on Hugging Face.

It evaluates tasks that "mirror real-world investment work" at hedge funds, PE firms, and investment banks: hand-spreading metrics, corporate valuation, multi-step analysis requiring assumptions. This test also requires adhering to standard accounting conventions and making judgment calls when information is incomplete, not just crunching numbers.

The results of FinanceQA expose a gap in current LLM performance: assumption generation. The authors put it well: "This [...] highlights the disconnect between existing LLM capabilities and the demands of professional financial analysis that are inadequately tested by current testing architectures."

4. Finance Agent Benchmark (FAB): agent tool-use on analyst tasks

Updated March 2026 by Vals AI (code is open source on GitHub as well), this is the only major benchmark that evaluates agent behavior with tool use. The agents are given access to EDGAR search, Google search, document parsing, and retrieval.

Giving access to tools for a benchmark is where things get tricky. FAB gives agents generic tools, but in practice your agent would use your proprietary tools. A model that scores well with generic tools might behave very differently with yours.

The FAB test itself is about performing entry-level financial analyst tasks. Vals AI created 537 expert-authored questions in collaboration with Stanford researchers, a Global Systemically Important Bank, and industry experts. Questions span nine categories from easy (quantitative retrieval, qualitative retrieval) to hard (financial modeling, market analysis).

50 questions are public, 150 are available under license, and 337 are permanently private. The private questions exist to prevent model training contamination. All published results are based solely on the private test set.

The benchmark uses GAIA-style final-answer accuracy, meaning each question has one correct answer and the agent either gets it right or wrong (no partial credit, no subjective grading).
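Binary final-answer grading usually needs light normalization so that "$1,234" and "1234" compare equal. A sketch of what such a grader might look like; the normalization rules here are my own illustration, not FAB's published grader:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, drop thousands separators and stray symbols so that
    superficially different renderings of the same answer match.
    These rules are illustrative, not FAB's actual normalization."""
    s = answer.strip().lower()
    s = s.replace(",", "")
    s = re.sub(r"[^0-9a-z.%\- ]", "", s)
    return s.strip()

def grade(prediction: str, gold: str) -> bool:
    """GAIA-style grading: exact match after normalization.
    Right or wrong, no partial credit, no subjective scoring."""
    return normalize(prediction) == normalize(gold)

print(grade("$1,234", "1234"))   # True
print(grade("12.5%", "12.4%"))   # False
```

The appeal of this style is reproducibility: there is no LLM judge in the loop, so a reported score cannot drift with the grader.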

FAB v1.1 is the closest thing we have to testing whether an AI can actually do analyst work end to end.

5. MultiFinBen: multilingual and multimodal financial reasoning

Released in June 2025 (revised October 2025) by a large multi-institution team, MultiFinBen is the first expert-annotated benchmark covering five languages. Most financial AI benchmarks are English-only and text-only.

It has two task families: multilingual financial reasoning (cross-lingual evidence integration from filings and news) and financial OCR (extracting structured text from scanned documents with tables and charts).

Rather than aggregating all available datasets, the authors filter them. They run GPT-4o and LLaMA 3.1-70B on every candidate dataset, bucket each into easy/medium/hard tiers, then keep only one dataset per modality-language-task combination, choosing the one where the two models diverge most.

This cuts out easy tasks that inflate scores (their predecessor FinBen had 7 of 36 datasets where zero-shot LLMs already exceeded 60% accuracy) and keeps only the datasets that actually discriminate between models. All datasets, evaluation scripts, and leaderboards are publicly released.
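The selection rule itself is easy to state in code. A simplified sketch (divergence only; the actual paper also buckets candidates into easy/medium/hard tiers, and the scores below are made up):

```python
# Sketch of MultiFinBen-style dataset selection: for each
# (modality, language, task) slot, keep only the candidate dataset
# on which the two probe models disagree most. Scores are invented.
candidates = [
    # (modality, language, task, dataset_name, model_a_acc, model_b_acc)
    ("text", "en", "qa",      "ds_easy", 0.92, 0.90),  # saturated, low divergence
    ("text", "en", "qa",      "ds_hard", 0.55, 0.31),  # discriminative
    ("ocr",  "zh", "extract", "ds_ocr",  0.40, 0.25),
]

best = {}
for modality, lang, task, name, acc_a, acc_b in candidates:
    key = (modality, lang, task)
    divergence = abs(acc_a - acc_b)
    if key not in best or divergence > best[key][1]:
        best[key] = (name, divergence)

selected = {key: name for key, (name, _) in best.items()}
# The text/en/qa slot keeps ds_hard; ds_easy is dropped as saturated.
```

The intuition: a dataset where two very different models score nearly the same tells you little about which model is better, so it earns no slot.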

What the benchmarks collectively reveal

A pattern emerges when you look across these five benchmarks. Models are decent at extraction and simple retrieval. They can pull numbers from a filing, classify sentiment, and answer factual questions, especially with enough context. But performance drops sharply the moment tasks require reasoning, forecasting, multi-step analysis, or working in multilingual settings.

The ~60% ceiling appears repeatedly: FinanceQA's 60% failure rate on realistic analyst tasks and Vals AI's 60.65% top score on agent tasks, with MultiFinBen showing frontier multimodal models even lower, at 46%.

Meanwhile, benchmarks that test simpler capabilities are already saturated. S&P's AI Benchmarks by Kensho went from GPT-4 Turbo scoring 88% at launch to open-source reasoning models hitting 90-92.5% within a year. Kensho had to sunset the benchmarks in July 2025 because they were "no longer providing a discriminative signal between models."

The easy stuff is solved. The hard stuff, the work that actually matters for FP&A, remains largely untested.

The blind spots: what benchmarks aren't testing today

1. Forecasting

FinBen does include a forecasting aspect, but no benchmark tests the kind of forward-looking analysis that defines FP&A: driver-based modeling, rolling forecasts with changing assumptions, or scenario planning under genuine uncertainty. LLMs are tested on what happened, not on what might happen.

2. Assumption generation

FinanceQA identifies assumption generation as a primary failure mode. But no benchmark systematically tests a model's ability to generate, justify, and stress-test assumptions across scenarios.

This is what senior analysts spend most of their time doing.

3. Multi-source reconciliation

Real financial analysis requires reconciling data across sources: a 10-K, an earnings transcript, a competitor's filing, and a macro report. No benchmark tests cross-document reasoning at this scale. FinanceBench tests document QA on individual filings. But the gap between "answer a question about one filing" and "synthesize across five sources" is enormous.

4. Human-in-the-loop

Every benchmark tests autonomous performance: give the model a question, measure the answer. None test the collaborative workflow where an AI drafts an analysis and a human refines it. That is the way financial AI is actually used in practice.

FINOS, the Fintech Open Source Foundation, backed by major banks and technology firms, is working to change this. Their shared benchmarking initiative launched in 2025 argues that "traditional model-level benchmarks fall short of capturing what matters most in financial contexts." Their evaluation framework (still in development) explicitly calls for "system-level evaluation" that covers "workflows, agents, orchestration, not just the LLM." It is the first serious industry effort to benchmark AI the way finance actually uses it.

5. Cost-performance tradeoffs

Vals AI reports that some queries cost over $5 per question for models like o3. No benchmark systematically reports cost alongside accuracy. For enterprise deployment, a model that scores 60% at $0.10 per query may be more valuable than one scoring 62% at $5.

As Vincent Caldeira from Red Hat and FINOS puts it: "The challenge is to design fine-grained and scalable evaluation methods that reflect cost-efficiency [...]"

How to evaluate your financial AI product

If you are working on financial AI, here is what helps most when it comes to evaluation.

Use existing benchmarks as a baseline, not as proof of readiness:

  • FinBen gives you the broadest coverage across financial task types.
  • Vals AI FAB v1.1 gives you the closest proxy for real analyst work.
  • FinanceQA tells you where numerical reasoning breaks down.

Build internal evaluations for the gaps. No public benchmark tests your specific use case. If you are building an FP&A copilot, create your own test suite for scenario modeling, assumption generation, etc. Use real data from your domain.
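The shape of such an internal suite can be very simple. A minimal sketch, grouped by the capability categories public benchmarks miss; the task names, the stub model, and the substring grader are all hypothetical placeholders for your real data and grading logic:

```python
from collections import defaultdict

# Sketch of a minimal internal eval suite for an FP&A copilot.
# Cases, model stub, and grader are placeholders.
suite = [
    {"category": "scenario_modeling",
     "prompt": "Revenue grows 10% but churn doubles; what happens to ARR?",
     "expected": "declines"},
    {"category": "assumption_generation",
     "prompt": "Forecast Q3 headcount cost with no hiring plan given.",
     "expected": "states assumptions"},
]

def run_suite(suite, model_fn, grader):
    """Return per-category pass rates, so you see where the product
    is weak instead of one blended score."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in suite:
        total[case["category"]] += 1
        if grader(model_fn(case["prompt"]), case["expected"]):
            passed[case["category"]] += 1
    return {cat: passed[cat] / total[cat] for cat in total}

# Stub model and naive substring grader, just to show the shape:
report = run_suite(suite, lambda p: "ARR declines",
                   lambda out, expected: expected in out)
# report: scenario_modeling passes, assumption_generation fails
```

The per-category breakdown is the point: a single aggregate score hides exactly the reasoning-versus-extraction split this article keeps running into.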

Watch the FINOS initiative. As mentioned above, FINOS is building a shared benchmarking framework specifically for financial services. Their roadmap targets expanding shared benchmarks by Q2 2026. The project focuses on "establishing guardrails, repeatable test datasets, and reference architectures that financial institutions can trust."

Track benchmark saturation, and look for new ones. A benchmark that was discriminative last year may be saturated today. Kensho's sunsetting was a preview. The benchmarks that matter going forward will test reasoning, agency, and judgment. Those are the capabilities that are hardest to build and hardest to measure.

As FINOS put it: "The financial services industry cannot afford to adopt AI blindly. Benchmarks designed for consumer apps won't cut it." Time to build!

Frequently asked questions about this topic

What are the main benchmarks for evaluating financial AI in 2026?

Five benchmarks cover the most ground. FinanceBench tests document QA over SEC filings. FinBen evaluates 36 datasets across 7 financial task types. FinanceQA targets complex numerical analyst work. Vals AI FAB v1.1 tests agent tool-use on real analyst tasks. MultiFinBen covers multilingual and multimodal financial reasoning.

How accurate are LLMs at financial analysis right now?

Frontier models hit a ceiling around 60% on realistic financial tasks. Vals AI's best agent scores 60.65% on entry-level analyst work. FinanceQA reports a ~60% failure rate on multi-step numerical analysis. MultiFinBen shows even top multimodal models at just 46% when you add multilingual and visual tasks. Models handle extraction and retrieval well but drop sharply on reasoning, forecasting, and assumption generation.

What financial AI capabilities does no benchmark test today?

The biggest gaps are forecasting under uncertainty, scenario modeling, and assumption generation. These define real FP&A work: building driver-based models, running rolling forecasts, stress-testing assumptions with incomplete data. No benchmark systematically evaluates any of them.

What is the difference between a financial LLM benchmark and a financial agent benchmark?

An LLM benchmark tests whether a model can answer questions using only its parameters and provided context. An agent benchmark like Vals AI FAB v1.1 tests whether an AI system can use tools (search, document parsers, data retrievers) to complete tasks end to end. Agent benchmarks are closer to real-world usage.

Should I build my own financial AI evaluation instead of relying on public benchmarks?

Yes, for anything beyond baseline screening. Public benchmarks help you compare models and establish a performance floor. But they do not test your specific use case, your data, or your workflows. If you are building an FP&A tool, create internal test suites for the tasks your product actually needs to support.