FutureFill pre-computes the contribution of past tokens to all future tokens via a single FFT-based operation. This primitive enables quasilinear-time generation from any convolutional sequence model, reducing complexity from $O(L^2)$ to $O(L \log^2 L)$.
Abstract
We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill—a method for fast generation that applies to any sequence prediction algorithm based on convolutional operators. Our approach reduces the generation time from quadratic to quasilinear relative to the context length. Additionally, FutureFill requires a prefill cache sized only by the number of tokens generated, which is smaller than the cache requirements for standard convolutional and attention-based models. We validate our theoretical findings with experimental evidence demonstrating correctness and efficiency gains in language generation tasks.
Overview
Quadratic Generation in Convolutional Models
Convolution-based sequence prediction models have emerged as strong alternatives to Transformers, leveraging the Fast Fourier Transform (FFT) to achieve near-linear scaling during training. These include State Space Models (SSMs), Spectral Transform Units (STUs), and the Hyena architecture.
However, during inference, generating tokens auto-regressively from convolutional models requires recomputing convolutions at each step, leading to $O(L^2)$ total generation time—the same quadratic cost as attention-based models. Previous attempts to address this distill SSMs from the trained convolutional filters, but such distillations are approximations whose error is poorly understood.
The FutureFill Primitive
FutureFill is a simple but powerful operation: given past tokens and a convolutional filter, it computes the contribution of all past tokens to every future position in a single FFT call. Formally, given two sequences $v \in \mathbb{R}^{t_1}$ and $w \in \mathbb{R}^{t_2}$:
$$\forall s \in [t_2 - 1] \quad [\text{FutureFill}(v, w)]_s = \sum_{i=1}^{\min(t_1,\, t_2 - s)} v_{t_1 - i + 1} \cdot w_{s+i}$$
Conceptually, $[\text{FutureFill}(v, w)]_s$ is the contribution of the input $v$ of length $t_1$ to the convolution output at position $t_1 + s$. This can be computed via FFT in time $O((t_1 + t_2) \log(t_1 + t_2))$.
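Concretely, the primitive is just the tail of a zero-padded FFT convolution. A minimal NumPy sketch (our own illustration, not the paper's reference code; the name `futurefill` is ours):

```python
import numpy as np

def futurefill(v, w):
    """[FutureFill(v, w)]_s for s = 1..t2-1: the contribution of the past
    inputs v (length t1) to the convolution output at position t1 + s.
    This is exactly the tail of the full linear convolution v * w, so a
    single zero-padded FFT of length >= t1 + t2 - 1 suffices."""
    t1, t2 = len(v), len(w)
    n = 1 << (t1 + t2 - 1).bit_length()        # power-of-2 FFT length
    full = np.fft.irfft(np.fft.rfft(v, n) * np.fft.rfft(w, n), n)
    return full[t1 : t1 + t2 - 1]              # future positions only

# agreement with the definition, term by term
rng = np.random.default_rng(0)
v, w = rng.standard_normal(5), rng.standard_normal(8)
naive = [sum(v[len(v) - i] * w[s + i - 1]
             for i in range(1, min(len(v), len(w) - s) + 1))
         for s in range(1, len(w))]
assert np.allclose(futurefill(v, w), naive)
```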
The key structural insight is Proposition 1: any convolution can be decomposed at an arbitrary split point into a FutureFill operation plus a smaller convolution on only the recent tokens:
$$[a * b]_s = \begin{cases} [a_{1:t_1} * b_{1:t_1}]_s & \text{if } s \leq t_1 \\ [a_{t_1+1:t} * b_{1:t-t_1}]_{s-t_1} + [\text{FutureFill}(a_{1:t_1}, b)]_{s-t_1} & \text{otherwise} \end{cases}$$
This decomposition enables two algorithms with different runtime–memory trade-offs, both generating exactly from the convolutional model without any approximation.
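As a quick sanity check, both branches of Proposition 1 can be verified numerically against the full convolution (a NumPy sketch we add for illustration, using `np.convolve` tails in place of the FFT-based primitive):

```python
import numpy as np

rng = np.random.default_rng(1)
t, t1 = 12, 5                                # arbitrary split point t1 in [1, t)
a, b = rng.standard_normal(t), rng.standard_normal(t)

full = np.convolve(a, b)                     # [a * b]_s = full[s - 1]
recent = np.convolve(a[t1:], b[: t - t1])    # small conv on recent tokens only
ff = np.convolve(a[:t1], b)[t1:]             # FutureFill(a_{1:t1}, b) = conv tail

# first branch: outputs up to the split point ignore tokens after it
assert np.allclose(full[:t1], np.convolve(a[:t1], b[:t1])[:t1])
# second branch: recent conv plus pre-computed past contributions
for s in range(t1 + 1, t + 1):
    assert np.isclose(full[s - 1], recent[s - t1 - 1] + ff[s - t1 - 1])
```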
Algorithms
Epoched-FutureFill
The first algorithm batches the FutureFill computation into epochs of length $K$. Every $K$ steps, a full FutureFill is computed to pre-fill a cache of size $K$ with the contributions of all past tokens. Between epochs, only a small inner product of at most $K$ terms is needed per token.
Epoched-FutureFill. Every $K$ steps, a FutureFill operation costing $O(L \log L)$ pre-computes future contributions. Between epochs, each token costs $O(K)$. Setting $K = \sqrt{L \log L}$ balances the two terms.
Input: Convolutional filter $\phi \in \mathbb{R}^L$. Input sequence $u \in \mathbb{R}^L$, streaming one coordinate every round. Epoch length $K$.
Set $\tau = 1$. Set FutureFill cache $C \in \mathbb{R}^K$ to $0$.
for $t = 1, 2, \ldots, L$ do
Receive $u_t$, and compute and output $\hat{y}_t = \sum_{j=1}^{\tau} u_{t+1-j} \cdot \phi_j + C_\tau$
if $\tau = K$ then
Compute FutureFill cache $C \in \mathbb{R}^K$ defined as $C_j = [\text{FutureFill}(u_{1:t}, \phi_{1:t+K})]_j$
$\tau \leftarrow 1$
else
$\tau \leftarrow \tau + 1$
end if
end for
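The loop above translates almost directly into code. A minimal NumPy sketch (the names `futurefill` and `epoched_generate` are our own; we assume the filter is available out to length $L + K$ so the final cache refill can read ahead), checked against the naive causal convolution:

```python
import numpy as np

def futurefill(v, w):
    """Tail of the full linear convolution v * w via one zero-padded FFT."""
    t1, t2 = len(v), len(w)
    n = 1 << (t1 + t2 - 1).bit_length()
    full = np.fft.irfft(np.fft.rfft(v, n) * np.fft.rfft(w, n), n)
    return full[t1 : t1 + t2 - 1]

def epoched_generate(phi, u, K):
    """Stream y_t = sum_{j=1}^{t} u_{t+1-j} phi_j with Epoched-FutureFill.

    phi must have length >= len(u) + K for the cache refills."""
    L = len(u)
    C = np.zeros(K)                      # cached contributions of past tokens
    tau = 1
    y = np.zeros(L)
    for t in range(1, L + 1):
        # inner product over the current epoch's tokens, plus the cached past
        y[t - 1] = u[t - tau : t][::-1] @ phi[:tau] + C[tau - 1]
        if tau == K:
            # one FFT refills the next K cache entries from all past tokens
            C = futurefill(u[:t], phi[: t + K])[:K]
            tau = 1
        else:
            tau += 1
    return y

# exact agreement with the naive causal convolution
rng = np.random.default_rng(2)
L, K = 64, 8
phi, u = rng.standard_normal(L + K), rng.standard_normal(L)
assert np.allclose(epoched_generate(phi, u, K), np.convolve(u, phi)[:L])
```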
Runtime Analysis
The total runtime of Epoched-FutureFill consists of two components. First, at every iteration the output requires a sum of $\tau$ products—costing $O(\tau)$. Second, every $K$ iterations, the FutureFill cache is updated via FFT in $O(L \log L)$ time. Summing over $L$ iterations:
$$\frac{L}{K} \left( L \log L + \sum_{\tau=1}^{K} \tau \right) = O\!\left( \frac{L^2 \log L}{K} + KL \right) = O\!\left( L^{3/2} \sqrt{\log L} \right)$$
where the last equality holds when $K = \sqrt{L \log L}$ is chosen to minimize the sum.
Continuous-FutureFill: Quasilinear Generation
The second algorithm uses a divide-and-conquer strategy. At each time step $t$, it computes a FutureFill whose size depends on the highest power of 2 dividing $t$. Small FutureFills happen frequently, large ones happen rarely—yielding $O(L \log^2 L)$ total time.
Input: Convolutional filter $\phi \in \mathbb{R}^L$. Input sequence $u \in \mathbb{R}^L$, streaming one coordinate every round.
Set $b = \lfloor \log L \rfloor$. Set FutureFill cache $C \in \mathbb{R}^L$ to $0$.
for $t = 1, \ldots, L$ do
Receive $u_t$. Output $\hat{y}_t = C_t + u_t \cdot \phi_1$.
Let $k(t)$ be the exponent of the largest power of 2 dividing $t$, i.e. $k(t) = \max\{i \in \{0, 1, \ldots, b\} : t \bmod 2^i = 0\}$.
Compute $\text{FF} = \text{FutureFill}(u_{t-2^{k(t)}+1:t},\; \phi_{1:2^{k(t)+1}})$
Set $C_i = C_i + \text{FF}_{i-t}$ $\forall\, i \in [t+1,\, t + 2^{k(t)}]$
end for
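The loop above, transcribed into a NumPy sketch (our own illustration; `futurefill` is the FFT tail from the definition, the cache is sized $2L$ here so the largest blocks can write past position $L$ without clipping, and we assume a filter of length $2L$):

```python
import numpy as np

def futurefill(v, w):
    """Tail of the full linear convolution v * w via one zero-padded FFT."""
    t1, t2 = len(v), len(w)
    n = 1 << (t1 + t2 - 1).bit_length()
    full = np.fft.irfft(np.fft.rfft(v, n) * np.fft.rfft(w, n), n)
    return full[t1 : t1 + t2 - 1]

def continuous_generate(phi, u):
    """Online convolution via Continuous-FutureFill (divide and conquer).

    phi must have length >= 2 * len(u)."""
    L = len(u)
    C = np.zeros(2 * L)                  # future-contribution cache
    y = np.zeros(L)
    for t in range(1, L + 1):
        y[t - 1] = C[t - 1] + u[t - 1] * phi[0]
        k = (t & -t).bit_length() - 1    # exponent of largest power of 2 dividing t
        blk = 1 << k                     # FutureFill block size 2^k(t)
        ff = futurefill(u[t - blk : t], phi[: 2 * blk])
        C[t : t + blk] += ff[:blk]       # add contributions at positions t+1 .. t+2^k
    return y

# exact agreement with the naive causal convolution
rng = np.random.default_rng(3)
L = 48
phi, u = rng.standard_normal(2 * L), rng.standard_normal(L)
assert np.allclose(continuous_generate(phi, u), np.convolve(u, phi)[:L])
```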
Theorem 3. Algorithm 2 computes the online convolution of sequences with length $L$ and runs in total time $O(L \log^2 L)$ with a total additional memory requirement of $O(L)$.
Complexity Comparison
Generating $L$ tokens from scratch (constant factors suppressed):

| Method | Runtime | Memory |
|---|---|---|
| Standard Conv | $L^2$ | $1$ |
| Standard Attn. | $L^2$ | $1$ |
| EpochedFF (ours) | $L^{3/2} \sqrt{\log L}$ | $\sqrt{L \log L}$ |
| ContinuousFF (ours) | $L \log^2 L$ | $L$ |
Prefill of length $L$ followed by $K$ generated tokens:

| Method | Prefill + Generation | Cache |
|---|---|---|
| Standard Conv | $LK + L\log L + K^2$ | $L + K$ |
| Standard Attn. | $L^2 + KL$ | $L + K$ |
| EpochedFF (ours) | $L\log L + K^{3/2}\sqrt{\log K}$ | $K$ |
| ContinuousFF (ours) | $L\log L + K\log^2 K$ | $K$ |
Importantly, our algorithms generate exactly from the convolutional model without relying on any approximations. Our methods are applicable to any convolutional model, regardless of how it was trained—including STUs, Hyena, LongConv, SGConv, and hybrid architectures.
Experiments
Empirical Validation on Isolated Convolutions
We verify our theoretical results on isolated online convolution operations with randomly initialized filters. The behavior is consistent with theory: the naive algorithm costs amortized $O(L)$ per step, Epoched-FutureFill achieves sublinear amortized $O(\sqrt{L \log L})$ per step, and Continuous-FutureFill achieves polylogarithmic amortized $O(\log^2 L)$ per step.
Total and amortized generation time when generating $L$ tokens from scratch. Naive convolution grows quadratically; both FutureFill variants exhibit clear sub-quadratic scaling.
Convolutional Language Models
We evaluate Epoched-FutureFill on FlashSTU-T models—both fully convolutional and hybrid (STU + attention) variants—on a single NVIDIA H100 GPU. FutureFill achieves substantial speedups that grow with sequence length.
| Model | Cache Type | $L_{\text{gen}}$ = 4096 | $L_{\text{gen}}$ = 8192 | $L_{\text{gen}}$ = 16384 |
|---|---|---|---|---|
| 515.46M STU-only | Epoched FutureFill | 13.12 | 26.18 | 52.22 |
| 515.46M STU-only | Baseline | 25.23 | 52.31 | 111.92 |
| 670.75M STU-only | Epoched FutureFill | 19.06 | 37.80 | 75.66 |
| 670.75M STU-only | Baseline | 37.21 | 77.15 | 165.13 |
Key results: At the largest generation length without prefill ($L_{\text{gen}} = 126{,}976$), we observe a 1.7× speedup for STU-only and 1.5× for the hybrid variant. With prefill ($L_{\text{prompt}} = 32{,}768$, $L_{\text{gen}} = 16{,}384$), we observe a 2× speedup for both models. Accuracy on downstream tasks (WinoGrande, HellaSwag, PiQA) matches the naive convolution baseline exactly.
Applicability
FutureFill applies to any convolutional sequence model. We highlight several architectures:
- Spectral Transform Units (STU): Convolutional models using fixed spectral filters derived from a Hankel matrix. FutureFill directly accelerates the $k$ convolutions per auto-regressive step, and preserves the dimension-free sublinear regret guarantees of spectral filtering.
- Hyena: Models that sequentially apply learnable convolutions and element-wise products. FutureFill accelerates each convolution in the Hyena operator, achieving up to 1.38× speedup at sequence length $1{,}048{,}576$.
- Hybrid models: Architectures combining convolutional blocks with local attention. FutureFill accelerates only the convolutional layers, and the speedup grows as the convolutional fraction of the model increases.
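For architectures that apply many filters per layer (e.g. the $k$ spectral filters of an STU), the primitive batches naturally across channels. A hypothetical sketch, with `futurefill_bank` our own name and per-channel shapes our own assumption:

```python
import numpy as np

def futurefill_bank(V, W):
    """FutureFill for a bank of filters at once. V has shape (c, t1)
    (per-channel past inputs) and W has shape (c, t2) (per-channel
    filters); the FFTs are batched along the channel axis."""
    t1, t2 = V.shape[1], W.shape[1]
    n = 1 << (t1 + t2 - 1).bit_length()
    full = np.fft.irfft(np.fft.rfft(V, n) * np.fft.rfft(W, n), n)
    return full[:, t1 : t1 + t2 - 1]

# matches channel-by-channel convolution tails
rng = np.random.default_rng(4)
V, W = rng.standard_normal((3, 6)), rng.standard_normal((3, 10))
per_channel = np.stack([np.convolve(v, w)[6:] for v, w in zip(V, W)])
assert np.allclose(futurefill_bank(V, W), per_channel)
```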
BibTeX
@inproceedings{agarwal2026futurefill,
  title     = {FutureFill: Fast Generation from Convolutional Sequence Models},
  author    = {Agarwal, Naman and Chen, Xinyi and Dogariu, Evan and Shah, Devan and
               Strauss, Hubert and Feinberg, Vlad and Suo, Daniel and
               Bartlett, Peter and Hazan, Elad},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}