LLMs produce near-identical outputs when sampled repeatedly, wasting attempts. UpSkill trains models to produce diverse, complementary reasoning strategies by conditioning on a discrete latent variable $z$.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. Within Group Relative Policy Optimization (GRPO), we propose a novel token-level mutual information (MI) reward that encourages each trajectory to be specific to a discrete latent variable $z$. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of $\sim$3% in pass@k for both Qwen and Llama without degrading pass@1. We additionally find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
Overview
Challenge: Response Collapse
In many practical settings -- code generation with unit tests, formal proof verification, and mathematical reasoning -- LLMs are queried multiple times on the same problem, and success requires only one correct output out of $k$ attempts. This is measured by pass@k: the probability that at least one of $k$ completions is correct.
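As a concrete reference, pass@k is usually reported with the standard unbiased estimator used in code-generation evaluations: from $n$ sampled completions of which $c$ are correct, it gives the probability that a random subset of $k$ contains at least one correct one. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples is entirely incorrect.
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct: a single attempt succeeds 30% of the
# time, but 5 attempts almost always include a correct answer.
pass_at_k(10, 3, 1)  # 0.3
pass_at_k(10, 3, 5)  # 1 - 21/252 ≈ 0.917
```

The gap between the two values above is exactly the multi-attempt headroom that response collapse erases: if all samples repeat one strategy, $c$ stays low and the extra attempts buy little.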
However, standard RL fine-tuning (RLVR) optimizes for pass@1, rewarding the single most likely answer. This drives the model to collapse onto one dominant strategy, producing near-identical outputs across repeated attempts. Additional samples yield diminishing returns because they are effectively redundant -- the model keeps trying the same approach. This is the response collapse problem: optimizing pass@1 often hurts pass@k.
Idea: Mutual Information Skill Learning
UpSkill introduces a discrete latent variable $z \in \{1, \ldots, N\}$ that is prepended to the prompt during training. The key insight is to maximize the mutual information $I(\tau; z \mid x)$ between the latent $z$ and the model's output trajectory $\tau$. This objective simultaneously pushes for (i) high marginal entropy so that trajectories cover a broad range of solution strategies, and (ii) low conditional entropy given $z$ so that each value of $z$ reliably produces a specific, reproducible strategy.
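These two pressures can be read off directly from the standard entropy decomposition of conditional mutual information:

$$I(\tau; z \mid x) \;=\; \underbrace{H(\tau \mid x)}_{\text{marginal entropy: diverse strategies}} \;-\; \underbrace{H(\tau \mid x, z)}_{\text{conditional entropy: reproducible per-}z\text{ behavior}}$$

Maximizing the first term spreads trajectories over many strategies, while minimizing the second makes each latent $z$ map to a consistent one.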
We implement this as a token-level MI reward within GRPO training: at each token, the reward measures how much more likely that token is under the current strategy $z$ compared to the average over all strategies:
$$r_{\text{TMI}}(\tau; x, z) = \sum_{t=1}^{|\tau|} \log \frac{p_\pi(y_t \mid x, z, y_{<t})}{p_\pi(y_t \mid x, y_{<t})} \qquad \text{where} \quad p_\pi(y_t \mid x, y_{<t}) = \frac{1}{N}\sum_{z'=1}^{N} p_\pi(y_t \mid x, z', y_{<t})$$
$$r(\tau; x, z) \;=\; r_{\text{corr}}(\tau) \;-\; \beta\, \Delta_{\text{KL}}(\tau) \;+\; \alpha_1\, r_{\text{TMI}}(\tau; x, z)$$
The token-level mutual information reward $r_{\text{TMI}}$ measures how specific the trajectory is to the chosen strategy $z$. The combined reward balances correctness $r_{\text{corr}}$, a KL penalty $\Delta_{\text{KL}}$ toward the base model (weighted by $\beta$), and the MI diversity term (weighted by $\alpha_1$). At inference, we query the model once per distinct $z$, guaranteeing that each attempt uses a different strategy.
How UpSkill works. Before MISL (left), the trajectory distribution is independent of $z$. Standard GRPO training collapses diversity further (middle). Adding the token-level MI reward (right) yields well-separated clusters indexed by $z$, reducing conditional entropy while preserving high marginal entropy.
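The inference protocol described above is a single loop over latent values. A sketch, where `generate` stands in for the fine-tuned model's sampling call and the `"<z=k>"` prompt format is purely illustrative (the actual conditioning format is an assumption here):

```python
def multi_attempt(generate, problem: str, num_strategies: int = 5) -> list:
    """One query per distinct latent z, so each of the k attempts is
    forced onto a different learned strategy rather than resampling
    from a collapsed distribution."""
    return [generate(f"<z={z}> {problem}") for z in range(num_strategies)]
```

A problem is then counted as solved (pass@k with k = `num_strategies`) if any of the returned attempts is correct.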
Advantages
UpSkill prevents response collapse by ensuring each $z$ induces a distinct mode, improving multi-attempt accuracy without hurting pass@1. We prove that pass@k improvement is both upper and lower bounded by the mutual information $I(\tau; z \mid x)$, providing theoretical justification for the MI objective. Our experiments validate this: pass@k gains closely track MI increases during training, and UpSkill can even improve pass@k without ground-truth correctness labels -- using MI rewards alone.
Experiments
Does UpSkill prevent distribution collapse?
We test on a controlled arithmetic environment where a small transformer must produce arithmetic expressions evaluating to a target. This lets us directly inspect the learned strategy distributions.
MISL prevents entropy collapse. Under GRPO alone (blue), pass@1 and pass@5 converge together, indicating that multiple attempts provide little benefit. With MISL (orange), pass@5 improves substantially (+10%) while pass@1 remains modest. Operator distributions show distinct strategy specialization with MISL.
Does UpSkill improve multi-attempt metrics on LLMs?
We fine-tune LoRA adapters (~80M parameters) on Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B using 2,000 GSM8K training problems and evaluate on 500 held-out examples.
UpSkill improves mean multi-attempt accuracy on Qwen 2.5-7B: +3.4% pass@k, +9.1% plurality@k.
Performance on 500 held-out problems with $N$=5 strategies across Qwen and Llama.
| Model | Method | $\Delta$pass@k | Method | $\Delta$pass@k |
|---|---|---|---|---|
| Qwen | Correctness | +3.8% | MI | +5.2% |
| Qwen | MI + correctness | +7.0% | MI + KL | +4.0% |
| Llama | Correctness | +2.8% | MI | +0.8% |
| Llama | MI + correctness | +3.6% | MI + KL | +2.4% |
Combining MI with correctness rewards yields the largest gains: +7.0% pass@k on Qwen and +3.6% on Llama. Notably, MI alone (without correctness labels) still improves pass@k, showing UpSkill can boost diversity without ground-truth answers.
When does improvement happen during training?
The model learns to maximize mutual information within a span of ~100 training steps. Most of the eventual pass@k improvement happens in this range.
GSM8K pass@5 over training checkpoints, overlaid on MI rewards. The tuples in the legend are ($\alpha_1$, lr).
Open Questions
- Semantic Embeddings for Diversity: Can we define a semantic MI reward using text embeddings to encourage meaningfully different strategies, rather than superficial token-level variation?
- More Complex Datasets: GSM8K problems mostly admit a single solution strategy. On harder benchmarks with multiple valid approaches, does UpSkill show larger gains across more model families?
- Relaxing Theoretical Assumptions: Our theory assumes the mixture distribution remains stationary. Can we relax this and still prove similar guarantees on pass@k improvement?
BibTeX
@misc{shah2025upskill,
title = {UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs},
author = {Shah, Devan and Yang, Owen and Yang, Daniel and Zheng, Chongyi and Eysenbach, Benjamin},
year = {2025},
eprint = {2602.22296},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2602.22296}
}