
EDGE-OPD: Internalizing Private Data with On-Policy Distillation

Read our new paper here: https://arxiv.org/abs/2605.23493v1

Our mission is to build the world’s best military-specialized LLMs. To that end, we are aggressively searching for more efficient and performant post-training methods. One of the biggest bottlenecks today is that models are usually trained on static datasets generated by humans or by other models, rather than by the model being trained. This creates a fundamental mismatch: the student model is forced to learn behaviors, reasoning patterns, and token preferences from someone else’s outputs rather than its own. In Reinforcement Learning (RL) terms, this is known as off-policy learning, and while it powers many current training pipelines, it can also introduce inefficiencies, instability, and bias.

On-policy learning takes a different approach: the model learns from data it generates itself. In a standard RL setup, an agent interacts with an environment, receives rewards based on its own actions, and gradually learns to select better actions through those reward signals. Since the learning process is directly tied to the agent’s own behavior and exploration, this generally constitutes an on-policy setting.

Recent methods such as On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD) have adapted this idea to LLM post-training, allowing student models to learn from their own generated rollouts while receiving feedback from stronger teacher models. However, these methods still depend heavily on the student exploring useful trajectories on its own, which becomes a major limitation for “rare” tokens and behaviors that are correct but extremely unlikely to appear during sampling. This includes, for example, cases where the responses should include private or proprietary data, which is essentially impossible for the models to generate correctly on their own.
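To make the sampling limitation concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each rollout independently contains the target behavior with a fixed probability p. The function name and the numbers are illustrative assumptions, not taken from the paper.

```python
def prob_seen_at_least_once(p: float, num_rollouts: int) -> float:
    """Chance that a behavior with per-rollout probability p appears
    in at least one of num_rollouts independent samples."""
    return 1.0 - (1.0 - p) ** num_rollouts


# Hypothetical case: a behavior the student produces once per million rollouts.
for n in (16, 1_000, 100_000):
    print(n, prob_seen_at_least_once(1e-6, n))
```

Under these assumptions, even a large rollout budget rarely surfaces the behavior, so there is almost nothing for the teacher to give feedback on.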

In our new paper, EDGE-OPD (EviDence GuidEd On-Policy Distillation), we introduce two key ideas for improving on-policy distillation. First, we use guided rollouts that inject privileged context (such as hints, hidden evidence, or rare personas) into a subset of the student’s prompts during generation. This allows rare target behaviors to actually appear in the on-policy data, while the privileged information itself is masked during loss computation.

Second, we introduce an evidence mask that updates the student only on token positions supported by the privileged context, instead of learning from the entire rollout indiscriminately. We show empirically that, together, these mechanisms enable the student to internalize rare but correct behaviors without directly training on the hidden context. Beyond the method, we also analyze how these rare-token signals localize within specific regions of the rollout, offering new insights into efficient knowledge transfer and preservation of general-purpose capabilities during post-training.
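The two ideas can be illustrated with a minimal sketch. This is not the implementation from the paper: the function names, the prompt template, the random choice of which prompts to guide, and the per-token signal (teacher minus student log-probability on the sampled token) are all assumptions made for illustration, and the evidence mask is simply taken as an input rather than computed.

```python
import random
from typing import Optional, Sequence, Tuple


def build_prompts(question: str, evidence: Optional[str],
                  guide_fraction: float, rng: random.Random) -> Tuple[str, str]:
    """Return (generation_prompt, training_prompt) for one rollout.

    Assumption: a random fraction of prompts is guided, and the privileged
    text is left out of the prompt that the loss is conditioned on.
    """
    guided = evidence is not None and rng.random() < guide_fraction
    generation_prompt = f"Evidence: {evidence}\n\n{question}" if guided else question
    return generation_prompt, question


def evidence_masked_signal(teacher_logprobs: Sequence[float],
                           student_logprobs: Sequence[float],
                           evidence_mask: Sequence[bool]) -> float:
    """Average (teacher - student) log-probability gap over masked-in tokens.

    Assumption: the mask is supplied by the caller; tokens outside it
    contribute nothing to the update.
    """
    if not (len(teacher_logprobs) == len(student_logprobs) == len(evidence_mask)):
        raise ValueError("expected one entry per rollout token in every input")
    gaps = [t - s for t, s, keep
            in zip(teacher_logprobs, student_logprobs, evidence_mask) if keep]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical example: the student sees the evidence only while generating.
rng = random.Random(0)
gen_prompt, train_prompt = build_prompts(
    "What is the internal project codename?",
    "The codename is BLUEBIRD.",
    guide_fraction=1.0,
    rng=rng,
)
print(gen_prompt)
print(train_prompt)

# Only the middle token is marked as evidence-supported in this toy rollout.
print(evidence_masked_signal([-0.5, -0.25, -0.125],
                             [-0.5, -6.0, -0.125],
                             [False, True, False]))
```

In this toy setup, the student generates with the evidence in view, but the update is computed against the plain question and only at the position the mask marks as evidence-supported.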

Aris Lazaridis is a Research Engineer at EdgeRunner AI. He holds a PhD in Reinforcement Learning from Aristotle University of Thessaloniki (AUTH). His expertise spans Reinforcement Learning methods and their applications in complex real-world domains, as well as the intersection of RL and Large Language Models across both research and industry.

Contact us: research@edgerunnerai.com