Evaluating AI on Combat Arms
As LLMs are integrated into more warfighting functions, it’s critical that the Department of War (DoW), industry, and academia have robust capabilities for benchmarking and evaluating the effectiveness of models for key military tasks. This blog post is part of our broader series describing our evaluation platform, starting with our overview, here.
At EdgeRunner, we build gold test sets for major military specialities so that we can perform static benchmarking on our own models, as well as all major models in the marketplace. We engage subject matter experts with extensive experience to create question-and-answer pairs that are factually and doctrinally correct, that are common and realistic, and that conform to the military’s unique language. The focus of this test set was combat arms, grounded in the U.S. Army operational environment and extended with joint doctrine for fires, targeting, and multi-domain operations.
Our Combat Arms test set spans tactical to operational levels, from small-unit decision-making to company, battalion, and brigade-level planning and execution. It also includes tactical-level intelligence problems for small units operating without dedicated intelligence staff. It reflects how problems are actually framed and solved across combat arms formations under real conditions.
The Combat Arms benchmark was built as an evaluation framework to gauge how well LLMs perform in operational military contexts, particularly in planning and decision support roles. The dataset consists of 544 questions, combining direct Q&A with scenario-based problems requiring multi-step reasoning.
Questions were developed from operational experience across multiple combat arms roles and refined for realism. They were structured to require multi-step reasoning where appropriate, emphasizing the integration of doctrine, context, and judgment as seen in practice. Each was grounded in doctrine, with answers tied to specific sources. The goal was not coverage alone, but fidelity to how tasks were encountered in practice.
The test set was grounded in a wide range of U.S. and joint doctrine spanning operations, planning, fires, intelligence, engineering, and small-unit tactics, including, but not limited to:
These sources spanned foundational doctrine and recent updates, ensuring both baseline accuracy and relevance to current operations.
The dataset was organized into mission-aligned categories reflecting real functional areas across combat arms:
This structure allowed performance to be evaluated not just overall, but within specific operational domains.
The benchmark reflects tasks performed by personnel operating at the point of planning and execution, including:
These roles represented environments where informed and rapid decision-making directly impacts outcomes.
When testing non-military-tuned models using this benchmark, we observed off-topic drift, as well as references to video games or fictional scenarios, loss of task focus, and generic or non-actionable responses. Often errors were not due to a knowledge gap alone, but a failure to maintain alignment under realistic task conditions. In 528 ranked model comparisons, evaluators tagged off-topic behavior in 65 comparisons, or 12.3% of cases. Below we provide examples of common error modes.
Prompt: What are the fundamentals of a raid?
For the prompt above, the doctrinal answer is:
Prompt: What is the max range of a M2?
Though some of the effects above can be mitigated in general-purpose models using a military-specific system prompt, the accuracy differential still remains even when controlling for system prompt. Quantitative results are described further in the next section.
In our recent blog post on EdgeRunner 2, we provided some results on the Combat Arms benchmark using our new models. Those results are reproduced here:

Here we see that with military specialization even EdgeRunner Light 2 can nearly match the performance of frontier models, all while running on-device at no incremental cost to the user. Our Light model is designed for commodity laptops and certain smartphones. It runs on hardware with 8+ GB of VRAM. Our Medium model, which is designed for more powerful workstation-style laptops, runs with as little as 16 GB of VRAM.
Benchmarks are most valuable when they reflect how systems will actually be used. In military contexts, that means evaluating performance against real tasks, grounded in doctrine, under conditions that resemble operational use. If a system’s limitations can’t be determined during evaluation, they will show up when the warfighter needs the AI assistant the most.
Cole Herring is a former U.S. Army Special Forces Green Beret and executive leader with more than 20 years of experience leading teams in complex, high-stakes environments. His background spans international partnerships, combat deployments, private equity-backed business leadership, and artificial intelligence implementation, with a focus on turning emerging technologies into practical operational advantages. He holds an MBA, a M.S. in Defense Innovation from King’s College London, executive education credentials in artificial intelligence from Berkeley Haas, and is pursuing a PhD in Organizational Leadership. Having experienced combat firsthand and received a Purple Heart, Cole is passionate about supporting U.S. warfighters.
Dylan Bates received his Ph.D. in Applied Mathematics from North Carolina State University in 2021, specializing in machine learning. A former mathematics professor, he is currently a research engineer at EdgeRunner AI, where he leads the model benchmarking program and develops military-specific large language models for deployment in edge and resource-constrained environments.
If you’re interested in this research or with collaborating with us, please contact us here or at: research@edgerunnerai.com