LLMs produce near-identical outputs when sampled repeatedly, wasting attempts. UpSkill trains models to produce diverse, complementary reasoning strategies by conditioning on a discrete latent variable $z$.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. Within Group Relative Policy Optimization (GRPO), we propose a novel token-level mutual information (MI) reward that encourages each trajectory to be specific to a discrete latent variable $z$. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of $\sim$3% in pass@k for both Qwen and Llama without degrading pass@1. We additionally find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
Overview
Challenge: Response Collapse
In many practical settings -- code generation with unit tests, formal proof verification, and mathematical reasoning -- LLMs are queried multiple times on the same problem, and success requires only one correct output out of $k$ attempts. This is measured by pass@k: the probability that at least one of $k$ completions is correct.
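As a concrete reference, pass@k is usually reported with the standard unbiased estimator used in code-generation evaluations: from $n$ sampled completions of which $c$ are correct, it gives the probability that a random subset of $k$ contains at least one correct one. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples is entirely incorrect.
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct: a single attempt succeeds 30% of the
# time, but 5 attempts almost always include a correct answer.
pass_at_k(10, 3, 1)  # 0.3
pass_at_k(10, 3, 5)  # 1 - 21/252 ≈ 0.917
```

The gap between the two values above is exactly the multi-attempt headroom that response collapse erases: if all samples repeat one strategy, $c$ stays low and the extra attempts buy little.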
However, standard RL fine-tuning (RLVR) optimizes for pass@1, rewarding the single most likely answer. This drives the model to collapse onto one dominant strategy, producing near-identical outputs across repeated attempts. Additional samples yield diminishing returns because they are effectively redundant -- the model keeps trying the same approach. This is the response collapse problem: optimizing pass@1 often hurts pass@k.
Idea: Mutual Information Skill Learning
UpSkill introduces a discrete latent variable $z \in \{1, \ldots, N\}$ that is prepended to the prompt during training. The key insight is to maximize the mutual information $I(\tau; z \mid x)$ between the latent $z$ and the model's output trajectory $\tau$. This objective simultaneously pushes for (i) high marginal entropy so that trajectories cover a broad range of solution strategies, and (ii) low conditional entropy given $z$ so that each value of $z$ reliably produces a specific, reproducible strategy.
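These two pressures can be read off directly from the standard entropy decomposition of conditional mutual information:

$$I(\tau; z \mid x) \;=\; \underbrace{H(\tau \mid x)}_{\text{marginal entropy: diverse strategies}} \;-\; \underbrace{H(\tau \mid x, z)}_{\text{conditional entropy: reproducible per-}z\text{ behavior}}$$

Maximizing the first term spreads trajectories over many strategies, while minimizing the second makes each latent $z$ map to a consistent one.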
We implement this as a token-level MI reward within GRPO training: at each token, the reward measures how much more likely that token is under the current strategy $z$ compared to the average over all strategies:
$$r_{\text{TMI}}(\tau; x, z) = \sum_{t=1}^{|\tau|} \log \frac{p_\pi(y_t \mid x, z, y_{<t})}{p_\pi(y_t \mid x, y_{<t})} \qquad \text{where} \quad p_\pi(y_t \mid x, y_{<t}) = \frac{1}{N}\sum_{z'=1}^{N} p_\pi(y_t \mid x, z', y_{<t})$$
$$r(\tau; x, z) \;=\; r_{\text{corr}}(\tau) \;-\; \beta\, \Delta_{\text{KL}}(\tau) \;+\; \alpha_1\, r_{\text{TMI}}(\tau; x, z)$$
The token-level mutual information reward $r_{\text{TMI}}$ measures how specific the trajectory is to the chosen strategy $z$. The combined reward balances correctness $r_{\text{corr}}$, a KL penalty $\Delta_{\text{KL}}$ toward the base model (weighted by $\beta$), and the MI diversity term (weighted by $\alpha_1$). At inference, we query the model once per distinct $z$, guaranteeing that each attempt uses a different strategy.
How UpSkill works. Before MISL (left), the trajectory distribution is independent of $z$. Standard GRPO training collapses diversity further (middle). Adding the token-level MI reward (right) yields well-separated clusters indexed by $z$, reducing conditional entropy while preserving high marginal entropy.
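The inference protocol described above is a single loop over latent values. A sketch, where `generate` stands in for the fine-tuned model's sampling call and the `"<z=k>"` prompt format is purely illustrative (the actual conditioning format is an assumption here):

```python
def multi_attempt(generate, problem: str, num_strategies: int = 5) -> list:
    """One query per distinct latent z, so each of the k attempts is
    forced onto a different learned strategy rather than resampling
    from a collapsed distribution."""
    return [generate(f"<z={z}> {problem}") for z in range(num_strategies)]
```

A problem is then counted as solved (pass@k with k = `num_strategies`) if any of the returned attempts is correct.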
Advantages
UpSkill prevents response collapse by ensuring each $z$ induces a distinct mode, improving multi-attempt accuracy without hurting pass@1. We prove that pass@k improvement is both upper and lower bounded by the mutual information $I(\tau; z \mid x)$, providing theoretical justification for the MI objective. Our experiments validate this: pass@k gains closely track MI increases during training, and UpSkill can even improve pass@k without ground-truth correctness labels -- using MI rewards alone.
Experiments
Does UpSkill prevent distribution collapse?
We test on a controlled arithmetic environment where a small transformer must produce arithmetic expressions evaluating to a target. This lets us directly inspect the learned strategy distributions.
MISL prevents entropy collapse. Under GRPO alone (blue), pass@1 and pass@5 converge together, indicating that multiple attempts provide little benefit. With MISL (orange), pass@5 improves substantially (+10%) while pass@1 remains modest. Operator distributions show distinct strategy specialization with MISL.
Does UpSkill improve multi-attempt metrics on LLMs?
We fine-tune LoRA adapters (~80M parameters) on Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B using 2,000 GSM8K training problems and evaluate on 500 held-out examples.
UpSkill improves mean multi-attempt accuracy on Qwen 2.5-7B: +3.4% pass@k, +9.1% plurality@k.
Performance on 500 held-out problems with $N$=5 strategies across Qwen and Llama.
| Model | Method | $\Delta$pass@k | Method | $\Delta$pass@k |
|---|---|---|---|---|
| Qwen | Correctness | +3.8% | MI | +5.2% |
| Qwen | MI + correctness | +7.0% | MI + KL | +4.0% |
| Llama | Correctness | +2.8% | MI | +0.8% |
| Llama | MI + correctness | +3.6% | MI + KL | +2.4% |
Combining MI with correctness rewards yields the largest gains: +7.0% pass@k on Qwen and +3.6% on Llama. Notably, MI alone (without correctness labels) still improves pass@k, showing UpSkill can boost diversity without ground-truth answers.
When does improvement happen during training?
The model learns to maximize mutual information within a span of ~100 training steps. Most of the eventual pass@k improvement happens in this range.
GSM8K pass@5 over training checkpoints, overlaid on MI rewards. The tuples in the legend are ($\alpha_1$, lr).
Open Questions
- Semantic Embeddings for Diversity: Can we define a semantic MI reward using text embeddings to encourage meaningfully different strategies, rather than superficial token-level variation?
- More Complex Datasets: GSM8K problems mostly admit a single solution strategy. On harder benchmarks with multiple valid approaches, does UpSkill show larger gains across more model families?
- Relaxing Theoretical Assumptions: Our theory assumes the mixture distribution remains stationary. Can we relax this and still prove similar guarantees on pass@k improvement?
BibTeX
@misc{shah2025upskill,
title = {UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs},
author = {Shah, Devan and Yang, Owen and Yang, Daniel and Zheng, Chongyi and Eysenbach, Benjamin},
year = {2025},
eprint = {2602.22296},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2602.22296}
}