Sources used
This article draws on peer-reviewed research, industry frameworks, and benchmark documentation.
- FinanceBench, Patronus AI (arXiv, November 2023)
- FinBen, The FinAI (NeurIPS 2024, February 2024)
- FinanceQA (arXiv, January 2025)
- Finance Agent Benchmark v1.1, Vals AI with Stanford researchers and a G-SIB (March 2026)
- MultiFinBen (arXiv, June 2025)
- FINOS Shared Benchmarking Framework, FINOS (September 2025)
- FINOS AI Governance Framework v2.0, FINOS (November 2025)
- S&P AI Benchmarks by Kensho, Kensho (July 2025)
The 5 major financial AI benchmarks
1. FinanceBench: document QA over SEC filings
Released in November 2023 by Patronus AI, FinanceBench targets a specific question: can an LLM answer straightforward financial questions when given access to SEC filings?
SEC filings are standardized financial reports that publicly traded companies in the U.S. must submit to regulators. They include 10-Ks (annual reports covering revenue, expenses, cash flows, and risk factors), 10-Qs (quarterly financial updates), and 8-Ks (reports on major events like mergers or leadership changes). These documents are often hundreds of pages long and full of tables and cross-references. FinanceBench focuses on revenue lookups, margin calculations, and factual extraction.
The benchmark contains 10,231 questions about publicly traded companies, each paired with an answer and an evidence string pointing to the source document. The questions are designed to be "clear-cut and straightforward," a minimum performance standard, not an expert-level challenge.
The first results date from late 2023 and reflect an earlier generation of models, but they established an important baseline: even the best configuration, GPT-4-Turbo with retrieval, incorrectly answered or refused to answer 81% of questions.
An open-source subset of 150 cases is still available for anyone to test against.
2. FinBen: holistic evaluation across 7 financial aspects
Released in February 2024 (revised June 2024), FinBen was designed to close the gap left by FinanceBench: its authors wanted to test more than extracting facts from financial documents.
FinBen evaluates LLMs across 36 datasets spanning 24 financial tasks and 7 distinct aspects: information extraction, textual analysis, question answering, text generation, risk management, forecasting, and decision-making.
The benchmark is fully open-source. Datasets, results, and code are available on GitHub. Early results show that LLMs handle extraction and textual analysis well but struggle with advanced reasoning, forecasting, and text generation.
FinBen is the most comprehensive taxonomy available. If you want to understand where your product is strong versus weak across the full range of financial tasks, this is your starting point.
3. FinanceQA: complex numerical analyst tasks
Released in January 2025, FinanceQA focuses on the actual numerical work. The benchmark is on Hugging Face.
It evaluates tasks that "mirror real-world investment work" at hedge funds, PE firms, and investment banks: hand-spreading metrics, corporate valuation, multi-step analysis requiring assumptions. This test also requires adhering to standard accounting conventions and making judgment calls when information is incomplete, not just crunching numbers.
The results of FinanceQA expose a gap in current LLM performance: assumption generation. The authors put it well: "This [...] highlights the disconnect between existing LLM capabilities and the demands of professional financial analysis that are inadequately tested by current testing architectures."
4. Finance Agent Benchmark (FAB): agent tool-use on analyst tasks
Updated March 2026 by Vals AI (code is open source on GitHub as well), this is the only major benchmark that evaluates agent behavior with tool use. The agents are given access to EDGAR search, Google search, document parsing, and retrieval.
Tool access is where benchmarking gets tricky. FAB gives agents generic tools, but in practice your agent would use your proprietary ones. A model that scores well with generic tools might behave very differently with yours.
The FAB test itself is about performing entry-level financial analyst tasks. Vals AI created 537 expert-authored questions in collaboration with Stanford researchers, a Global Systemically Important Bank, and industry experts. Questions span nine categories from easy (quantitative retrieval, qualitative retrieval) to hard (financial modeling, market analysis).
50 questions are public, 150 are available under license, and 337 are permanently private. The private questions exist to prevent model training contamination. All published results are based solely on the private test set.
The benchmark uses GAIA-style final-answer accuracy, meaning each question has one correct answer and the agent either gets it right or wrong (no partial credit, no subjective grading).
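GAIA-style grading is simple to sketch: normalize both answers, compare for exact equality, and score binary right/wrong. The normalization rules below are illustrative assumptions, not FAB's actual grader.

```python
# Sketch of GAIA-style final-answer grading: one canonical answer per
# question, binary right/wrong, no partial credit.

def normalize(answer: str) -> str:
    """Lowercase, strip whitespace, drop common numeric formatting noise."""
    return answer.strip().lower().replace(",", "").replace("$", "")

def grade(predicted: str, gold: str) -> bool:
    """Exact match after normalization; no subjective scoring."""
    return normalize(predicted) == normalize(gold)

def accuracy(predictions: list[str], golds: list[str]) -> float:
    correct = sum(grade(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

print(grade("$12,400", "12400"))  # True: formatting is stripped before comparison
```

The appeal of this scheme is that it removes grader subjectivity entirely; the cost is that a nearly-correct answer scores the same as a wild guess.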
FAB v1.1 is the closest thing we have to testing whether an AI can actually do analyst work end to end.
5. MultiFinBen: multilingual and multimodal financial reasoning
Released in June 2025 (revised October 2025) by a large multi-institution team, MultiFinBen is the first expert-annotated benchmark covering five languages. Most financial AI benchmarks are English-only and text-only.
It has two task families: multilingual financial reasoning (cross-lingual evidence integration from filings and news) and financial OCR (extracting structured text from scanned documents with tables and charts).
Rather than aggregating all available datasets, the authors filter them. They run GPT-4o and LLaMA 3.1-70B on every candidate dataset, bucket each into easy/medium/hard tiers, then keep only one dataset per modality-language-task combination, choosing the one where the two models diverge most.
This cuts out easy tasks that inflate scores (their predecessor FinBen had 7 of 36 datasets where zero-shot LLMs already exceeded 60% accuracy) and keeps only the datasets that actually discriminate between models. All datasets, evaluation scripts, and leaderboards are publicly released.
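The selection step above can be sketched as a small grouping routine: within each modality-language-task group, keep the one dataset where the two probe models disagree most. The dataset names and scores below are made up for illustration.

```python
# Sketch of MultiFinBen-style dataset filtering by model divergence.

def select_datasets(candidates):
    """candidates: list of dicts with 'group' (modality, language, task),
    'name', and per-model accuracies 'acc_a' and 'acc_b'.
    Returns one dataset name per group, chosen by max score divergence."""
    best = {}
    for d in candidates:
        divergence = abs(d["acc_a"] - d["acc_b"])
        key = d["group"]
        if key not in best or divergence > best[key][0]:
            best[key] = (divergence, d["name"])
    return {group: name for group, (_, name) in best.items()}

candidates = [
    {"group": ("text", "en", "qa"), "name": "ds1", "acc_a": 0.82, "acc_b": 0.80},
    {"group": ("text", "en", "qa"), "name": "ds2", "acc_a": 0.75, "acc_b": 0.40},
    {"group": ("ocr", "zh", "extract"), "name": "ds3", "acc_a": 0.55, "acc_b": 0.30},
]
print(select_datasets(candidates))
# ds2 wins the ('text', 'en', 'qa') group: 0.35 divergence vs 0.02 for ds1
```

The intuition: a dataset where a strong and a weaker model score alike carries little signal, while one where they diverge is doing real discriminative work.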
What the benchmarks collectively reveal
A pattern emerges when you look across these five benchmarks. Models are decent at extraction and simple retrieval. They can pull numbers from a filing, classify sentiment, and answer factual questions, especially with enough context. But performance drops sharply the moment tasks require reasoning, forecasting, multi-step analysis, or working in multilingual settings.
A ceiling at or below roughly 60% appears repeatedly: FinanceQA's ~60% failure rate on realistic analyst tasks, Vals AI's 60.65% top score on agent tasks, and MultiFinBen's 46% top score for frontier multimodal models.
Meanwhile, benchmarks that test simpler capabilities are already saturated. S&P's AI Benchmarks by Kensho went from GPT-4 Turbo scoring 88% at launch to open-source reasoning models hitting 90-92.5% within a year. Kensho had to sunset the benchmarks in July 2025 because they were "no longer providing a discriminative signal between models."
The easy stuff is solved. The hard stuff, the work that actually matters for FP&A, remains largely untested.
The blind spots: what benchmarks aren't testing today
1. Forecasting
FinBen does include a forecasting aspect, but no benchmark tests the kind of forward-looking analysis that defines FP&A: driver-based modeling, rolling forecasts with changing assumptions, or scenario planning under genuine uncertainty. LLMs are tested on what happened, not on what might happen.
2. Assumption generation
FinanceQA identifies assumption generation as a primary failure mode. But no benchmark systematically tests a model's ability to generate, justify, and stress-test assumptions across scenarios.
This is what senior analysts spend most of their time doing.
3. Multi-source reconciliation
Real financial analysis requires reconciling data across sources: a 10-K, an earnings transcript, a competitor's filing, and a macro report. No benchmark tests cross-document reasoning at this scale. FinanceBench tests document QA on individual filings. But the gap between "answer a question about one filing" and "synthesize across five sources" is enormous.
4. Human-in-the-loop
Every benchmark tests autonomous performance: give the model a question, measure the answer. None test the collaborative workflow where an AI drafts an analysis and a human refines it, which is how financial AI is actually used in practice.
FINOS, the Fintech Open Source Foundation, backed by major banks and technology firms, is working to change this. Their shared benchmarking initiative launched in 2025 argues that "traditional model-level benchmarks fall short of capturing what matters most in financial contexts." Their evaluation framework (still in development) explicitly calls for "system-level evaluation" that covers "workflows, agents, orchestration, not just the LLM." It is the first serious industry effort to benchmark AI the way finance actually uses it.
5. Cost-performance tradeoffs
Vals AI reports that some queries cost over $5 per question for models like o3. No benchmark systematically reports cost alongside accuracy. For enterprise deployment, a model that scores 60% at $0.10 per query may be more valuable than one scoring 62% at $5.
As Vincent Caldeira from Red Hat and FINOS puts it: "The challenge is to design fine-grained and scalable evaluation methods that reflect cost-efficiency [...]"
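A back-of-the-envelope calculation makes the tradeoff concrete: what matters for deployment is often the expected cost per *correct* answer, not cost per query. The figures below are hypothetical, not Vals AI data.

```python
# Cost-accuracy tradeoff: expected spend to obtain one correct answer.

models = [
    {"name": "model_a", "accuracy": 0.62, "cost_per_query": 5.00},
    {"name": "model_b", "accuracy": 0.60, "cost_per_query": 0.10},
]

def cost_per_correct_answer(m):
    """Expected queries per correct answer is 1/accuracy, so expected
    spend per correct answer is cost_per_query / accuracy."""
    return m["cost_per_query"] / m["accuracy"]

for m in models:
    print(m["name"], round(cost_per_correct_answer(m), 2))
# model_a: ~8.06 per correct answer; model_b: ~0.17
```

On this toy metric the two-point accuracy gain costs roughly 47x more per correct answer, which is the kind of comparison no current benchmark reports.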
How to evaluate your financial AI product
If you are working on financial AI, here is what matters most for evaluation.
Use existing benchmarks as a baseline, not as proof of readiness:
- FinBen gives you the broadest coverage across financial task types.
- Vals AI FAB v1.1 gives you the closest proxy for real analyst work.
- FinanceQA tells you where numerical reasoning breaks down.
Build internal evaluations for the gaps. No public benchmark tests your specific use case. If you are building an FP&A copilot, create your own test suite for scenario modeling, assumption generation, etc. Use real data from your domain.
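A minimal internal suite can start as a list of task cases plus a grading loop. Everything below is hypothetical: the task names, the `model_fn` interface, and the substring-match grading are placeholders for your own data and domain-specific graders.

```python
# Minimal sketch of an internal eval suite for an FP&A copilot.

test_suite = [
    {"task": "scenario_modeling",
     "prompt": "Given a 10% revenue decline, what is the new gross margin?",
     "expected": "42%"},
    {"task": "assumption_generation",
     "prompt": "List the key assumptions behind a 3-year rolling forecast.",
     "expected": "growth rate"},  # substring the answer must contain
]

def run_suite(model_fn, suite):
    """Score a model callable against each task; returns per-task pass/fail."""
    results = {}
    for case in suite:
        answer = model_fn(case["prompt"])
        results[case["task"]] = case["expected"].lower() in answer.lower()
    return results

# Usage with a stub in place of a real LLM call:
stub = lambda prompt: "New gross margin is 42%, assuming stable COGS."
print(run_suite(stub, test_suite))
```

Even a crude suite like this gives you something no public benchmark can: a regression signal on the exact tasks your users care about.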
Watch the FINOS initiative. As mentioned above, FINOS is building a shared benchmarking framework specifically for financial services. Their roadmap targets expanding shared benchmarks by Q2 2026. The project focuses on "establishing guardrails, repeatable test datasets, and reference architectures that financial institutions can trust."
Track benchmark saturation, and look for new ones. A benchmark that was discriminative last year may be saturated today. Kensho's sunsetting was a preview. The benchmarks that matter going forward will test reasoning, agency, and judgment. Those are the capabilities that are hardest to build and hardest to measure.
As FINOS put it: "The financial services industry cannot afford to adopt AI blindly. Benchmarks designed for consumer apps won't cut it." Time to build!