FutureFill: Fast Generation from Convolutional Sequence Models

Naman Agarwal* Xinyi Chen* Evan Dogariu* Devan Shah Hubert Strauss Vlad Feinberg* Daniel Suo* Peter Bartlett* Elad Hazan*
*Google DeepMind    Princeton University
FutureFill operation between an input sequence and a convolutional filter

FutureFill pre-computes the contribution of past tokens to all future tokens via a single FFT-based operation. This primitive enables quasilinear-time generation from any convolutional sequence model—reducing complexity from $O(L^2)$ to $O(L \log^2 L)$.

Abstract

We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill—a method for fast generation that applies to any sequence prediction algorithm based on convolutional operators. Our approach reduces the generation time from quadratic to quasilinear relative to the context length. Additionally, FutureFill requires a prefill cache whose size is determined only by the number of tokens to be generated, which is smaller than the cache required by standard convolutional and attention-based models. We validate our theoretical findings with experimental evidence demonstrating correctness and efficiency gains in language generation tasks.

Overview

Quadratic Generation in Convolutional Models

Convolution-based sequence prediction models have emerged as strong alternatives to Transformers, leveraging the Fast Fourier Transform (FFT) to achieve near-linear scaling during training. These include State Space Models (SSMs), Spectral Transform Units (STUs), and the Hyena architecture.

However, during inference, generating tokens auto-regressively from convolutional models requires recomputing convolutions at each step, leading to $O(L^2)$ total generation time—the same quadratic cost as attention-based models. Previous attempts to address this distill the learned convolutions into SSMs, but such distillations are approximations whose error is poorly understood.
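To make the quadratic cost concrete, here is a minimal NumPy sketch (illustrative code, not the authors' implementation) of naive auto-regressive decoding from a convolution: each step recomputes an inner product over the entire history.

```python
import numpy as np

def naive_online_conv(u, phi):
    """Naive decoding from a convolutional model: y_t = sum_{j=1}^{t} u_{t+1-j} phi_j.

    Each step recomputes an inner product over the whole history, so step t
    costs O(t) and generating L tokens costs O(L^2) in total.
    """
    L = len(u)
    y = np.zeros(L)
    for t in range(1, L + 1):
        # inner product of the reversed history with the filter head: O(t) work
        y[t - 1] = u[:t][::-1] @ phi[:t]
    return y
```

The output equals the causal convolution `np.convolve(u, phi)[:L]`; the point is the per-step cost, which FutureFill eliminates.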

The FutureFill Primitive

FutureFill is a simple but powerful operation: given past tokens and a convolutional filter, it computes the contribution of all past tokens to every future position in a single FFT call. Formally, given two sequences $v \in \mathbb{R}^{t_1}$ and $w \in \mathbb{R}^{t_2}$:

$$\forall s \in [t_2 - 1] \quad [\text{FutureFill}(v, w)]_s = \sum_{i=1}^{t_2 - s} v_{t_1 - i + 1} \cdot w_{s+i}$$

Conceptually, $[\text{FutureFill}(v, w)]_s$ is the contribution of the input $v$ of length $t_1$ to the convolution output at position $t_1 + s$. This can be computed via FFT in time $O((t_1 + t_2) \log(t_1 + t_2))$.
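As a sketch of the FFT computation (illustrative NumPy code, not the authors' implementation): in 0-indexed terms, $[\text{FutureFill}(v, w)]_s$ is exactly the tail of the full linear convolution of $v$ and $w$, so one pair of zero-padded FFTs suffices.

```python
import numpy as np

def futurefill(v, w):
    """[FutureFill(v, w)]_s for s = 1, ..., t2-1, via one FFT-based convolution.

    In 0-indexed terms, FutureFill(v, w)[s-1] == conv(v, w)[t1 + s - 1]:
    the tail of the full linear convolution of v and w.
    """
    t1, t2 = len(v), len(w)
    n = t1 + t2 - 1                                  # full linear-convolution length
    prod = np.fft.rfft(v, n) * np.fft.rfft(w, n)     # zero-padded FFTs
    return np.fft.irfft(prod, n)[t1:]                # positions t1+1, ..., t1+t2-1
```

For example, `futurefill(u[:t], phi)` returns the contribution of the first $t$ tokens to every later output in $O((t_1 + t_2)\log(t_1 + t_2))$ time.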

The key structural insight is Proposition 1: any convolution can be decomposed at an arbitrary split point into a FutureFill operation plus a smaller convolution on only the recent tokens:

$$[a * b]_s = \begin{cases} [a_{1:t_1} * b_{1:t_1}]_s & \text{if } s \leq t_1 \\ [a_{t_1+1:t} * b_{1:t-t_1}]_{s-t_1} + [\text{FutureFill}(a_{1:t_1}, b)]_{s-t_1} & \text{otherwise} \end{cases}$$

This decomposition enables two algorithms with different runtime–memory trade-offs, both generating exactly from the convolutional model without any approximation.
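A quick numeric check of this decomposition (an illustrative NumPy script; `np.convolve` stands in for the FFT-based convolution):

```python
import numpy as np

rng = np.random.default_rng(0)
t, t1 = 20, 12                          # total length and an arbitrary split point
a, b = rng.normal(size=t), rng.normal(size=t)

# causal convolution: [a * b]_s = sum_{i=1}^{s} a_i b_{s-i+1}
y = np.convolve(a, b)[:t]

# FutureFill(a_{1:t1}, b) is the tail of the full linear convolution
ff = np.convolve(a[:t1], b)[t1:]

for s in range(1, t + 1):               # 1-indexed position
    if s <= t1:
        expected = np.convolve(a[:t1], b[:t1])[s - 1]
    else:
        expected = np.convolve(a[t1:], b[: t - t1])[s - t1 - 1] + ff[s - t1 - 1]
    assert np.isclose(y[s - 1], expected)
```

Past the split point, each output is a small convolution over recent tokens plus a cached FutureFill term—the pattern both algorithms below exploit.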

Algorithms

Epoched-FutureFill

The first algorithm batches the FutureFill computation into epochs of length $K$. Every $K$ steps, a full FutureFill is computed to pre-fill a cache of size $K$ with the contributions of all past tokens. Between epochs, only a small inner product of at most $K$ terms is needed per token.

Illustration of Epoched-FutureFill (Algorithm 1)

Epoched-FutureFill. Every $K$ steps, a FutureFill operation costing $O(L \log L)$ pre-computes future contributions. Between epochs, each token costs $O(K)$. Setting $K = \sqrt{L \log L}$ balances the two terms.

Algorithm 1: Epoched-FutureFill — Memory Efficient Prediction

Input: filter $\phi \in \mathbb{R}^L$; input sequence $u \in \mathbb{R}^L$, streamed one coordinate per round; epoch length $K$.

Set $\tau = 1$. Set FutureFill cache $C \in \mathbb{R}^K$ to $0$.
for $t = 1, 2, \ldots, L$ do
    Receive $u_t$, and compute and output $\hat{y}_t = \sum_{j=1}^{\tau} u_{t+1-j} \cdot \phi_j + C_\tau$
    if $\tau = K$ then
        Compute FutureFill cache $C \in \mathbb{R}^K$ defined as $C_j = [\text{FutureFill}(u_{1:t}, \phi_{1:t+K})]_j$
        $\tau \leftarrow 1$
    else
        $\tau \leftarrow \tau + 1$
    end if
end for
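The loop above can be sketched in NumPy (illustrative code, not the authors' implementation; the `futurefill` helper computes the tail of the full linear convolution via one zero-padded FFT):

```python
import numpy as np

def futurefill(v, w):
    # tail of the full linear convolution of v and w, via one zero-padded FFT
    n = len(v) + len(w) - 1
    return np.fft.irfft(np.fft.rfft(v, n) * np.fft.rfft(w, n), n)[len(v):]

def epoched_futurefill(u, phi, K):
    """Algorithm 1 sketch: O(tau) work per token, plus one FFT every K tokens."""
    L = len(u)
    y, C = np.zeros(L), np.zeros(K)
    tau = 1
    for t in range(1, L + 1):                        # 1-indexed time
        # hat{y}_t = sum_{j=1}^{tau} u_{t+1-j} phi_j + C_tau
        y[t - 1] = u[t - tau:t][::-1] @ phi[:tau] + C[tau - 1]
        if tau == K:
            # refill cache: contribution of u_{1:t} to the next K outputs
            C = futurefill(u[:t], phi[: min(t + K, L)])[:K]
            tau = 1
        else:
            tau += 1
    return y
```

Because no approximation is involved, the output matches the naive causal convolution `np.convolve(u, phi)[:L]` up to floating-point error.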

Runtime Analysis

The total runtime of Epoched-FutureFill consists of two components. First, at every iteration the output requires a sum of $\tau$ products—costing $O(\tau)$. Second, every $K$ iterations, the FutureFill cache is updated via FFT in $O(L \log L)$ time. Summing over $L$ iterations:

$$\frac{L}{K} \left( L \log L + \sum_{\tau=1}^{K} \tau \right) = O\!\left( \frac{L^2 \log L}{K} + KL \right) = O\!\left( L^{3/2} \sqrt{\log L} \right)$$

where the last equality holds when $K = \sqrt{L \log L}$ is chosen to minimize the sum.
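For completeness, the balancing step: setting the derivative with respect to $K$ to zero,

$$\frac{d}{dK}\!\left(\frac{L^2 \log L}{K} + KL\right) = -\frac{L^2 \log L}{K^2} + L = 0 \quad\Longrightarrow\quad K^* = \sqrt{L \log L},$$

at which point both terms equal $L^{3/2}\sqrt{\log L}$.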


Continuous-FutureFill: Quasilinear Generation

The second algorithm uses a divide-and-conquer strategy. At each time step $t$, it computes a FutureFill whose size depends on the highest power of 2 dividing $t$. Small FutureFills happen frequently, large ones happen rarely—yielding $O(L \log^2 L)$ total time.

Algorithm 2: Continuous-FutureFill — Quasilinear Generation

Input: Convolutional filter $\phi \in \mathbb{R}^L$. Input sequence $u \in \mathbb{R}^L$, streaming one coordinate every round.

Set $b = \lfloor \log L \rfloor$. Set FutureFill cache $C \in \mathbb{R}^L$ to $0$.

for $t = 1, \ldots, L$ do
    Receive $u_t$. Output $\hat{y}_t = C_t + u_t \cdot \phi_1$.
    Let $k(t)$ be the exponent of the largest power of 2 dividing $t$, i.e. $k(t) = \max\{i \in \{0, 1, \ldots, b\} : t \bmod 2^i = 0\}$.
    Compute $\text{FF} = \text{FutureFill}(u_{t-2^{k(t)}+1:t},\; \phi_{1:2^{k(t)+1}})$
    Set $C_i \leftarrow C_i + \text{FF}_{i-t}$ for all $i \in [t+1,\, t + 2^{k(t)}]$
end for
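A minimal NumPy sketch of Algorithm 2 (illustrative code, not the authors' implementation; cache updates past position $L$ are skipped, since they can never affect an output):

```python
import numpy as np

def futurefill(v, w):
    # tail of the full linear convolution of v and w, via one zero-padded FFT
    n = len(v) + len(w) - 1
    return np.fft.irfft(np.fft.rfft(v, n) * np.fft.rfft(w, n), n)[len(v):]

def continuous_futurefill(u, phi):
    """Algorithm 2 sketch: at step t, one FutureFill of size 2^{k(t)},
    where 2^{k(t)} is the largest power of 2 dividing t."""
    L = len(u)
    y, C = np.zeros(L), np.zeros(L)
    for t in range(1, L + 1):                        # 1-indexed time
        y[t - 1] = C[t - 1] + u[t - 1] * phi[0]
        k = (t & -t).bit_length() - 1                # exponent of largest 2-power dividing t
        m = min(2 ** k, L - t)                       # only in-range future positions matter
        if m > 0:
            ff = futurefill(u[t - 2 ** k : t], phi[: min(2 ** (k + 1), L)])
            C[t : t + m] += ff[:m]                   # contributions to outputs t+1, ..., t+m
    return y
```

Small FutureFills (odd $t$) cost $O(1)$; a size-$2^k$ one occurs every $2^k$ steps, so the total work is $\sum_k (L/2^k) \cdot O(2^k \cdot k) = O(L \log^2 L)$, and the output again matches the exact causal convolution.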

Theorem 3. Algorithm 2 computes the online convolution of sequences with length $L$ and runs in total time $O(L \log^2 L)$ with a total additional memory requirement of $O(L)$.

Complexity Comparison

Generating $L$ tokens from scratch. Runtime in asymptotic notation.

| Method | Runtime | Memory |
|---|---|---|
| Standard Conv | $L^2$ | $1$ |
| Standard Attn. | $L^2$ | $1$ |
| EpochedFF (ours) | $L^{3/2} \sqrt{\log L}$ | $\sqrt{L \log L}$ |
| ContinuousFF (ours) | $L \log^2 L$ | $L$ |

Generating $K$ tokens from a prompt of length $L$.

| Method | Prefill + Generation | Cache |
|---|---|---|
| Standard Conv | $LK + L\log L + K^2$ | $L + K$ |
| Standard Attn. | $L^2 + KL$ | $L + K$ |
| EpochedFF (ours) | $L\log L + K^{3/2}\sqrt{\log K}$ | $K$ |
| ContinuousFF (ours) | $L\log L + K\log^2 K$ | $K$ |

Importantly, our algorithms generate exactly from the convolutional model without relying on any approximations. Our methods are applicable to any convolutional model, regardless of how it was trained—including STUs, Hyena, LongConv, SGConv, and hybrid architectures.

Experiments

Empirical Validation on Isolated Convolutions

We verify our theoretical results on isolated online convolution operations with randomly initialized filters. The behavior is consistent with theory: the naive algorithm runs in amortized $O(L)$ time per step, while Epoched-FutureFill achieves sublinear and Continuous-FutureFill polylogarithmic amortized per-step complexity.

Total and average number of seconds per step when generating L tokens

Total and amortized generation time when generating $L$ tokens from scratch. Naive convolution grows quadratically; both FutureFill variants exhibit clear sub-quadratic scaling.


Convolutional Language Models

We evaluate Epoched-FutureFill on FlashSTU-T models—both fully convolutional and hybrid (STU + attention) variants—on a single NVIDIA H100 GPU. FutureFill achieves substantial speedups that grow with sequence length.

End-to-end inference time (seconds) with prefill on a prompt of 32,768 tokens.

| Model | Cache Type | $L_{\text{gen}}$ = 4096 | $L_{\text{gen}}$ = 8192 | $L_{\text{gen}}$ = 16384 |
|---|---|---|---|---|
| 515.46M STU-only | Epoched FutureFill | 13.12 | 26.18 | 52.22 |
| 515.46M STU-only | Baseline | 25.23 | 52.31 | 111.92 |
| 670.75M STU-only | Epoched FutureFill | 19.06 | 37.80 | 75.66 |
| 670.75M STU-only | Baseline | 37.21 | 77.15 | 165.13 |

Key results: At the largest generation length without prefill ($L_{\text{gen}} = 126{,}976$), we observe a 1.7× speedup for STU-only and 1.5× for the hybrid variant. With prefill ($L_{\text{prompt}} = 32{,}768$, $L_{\text{gen}} = 16{,}384$), we observe a 2× speedup for both models. Accuracy on downstream tasks (WinoGrande, HellaSwag, PiQA) matches the naive convolution baseline exactly.

Applicability

FutureFill applies to any convolutional sequence model. We highlight several architectures:

  • Spectral Transform Units (STU): Convolutional models using fixed spectral filters derived from a Hankel matrix. FutureFill directly accelerates the $k$ convolutions per auto-regressive step, and preserves the dimension-free sublinear regret guarantees of spectral filtering.
  • Hyena: Models that sequentially apply learnable convolutions and element-wise products. FutureFill accelerates each convolution in the Hyena operator, achieving up to 1.38× speedup at sequence length $1{,}048{,}576$.
  • Hybrid models: Architectures combining convolutional blocks with local attention. FutureFill accelerates only the convolutional layers, and the speedup grows as the convolutional fraction of the model increases.

BibTeX

@inproceedings{agarwal2026futurefill,
    title     = {FutureFill: Fast Generation from Convolutional Sequence Models},
    author    = {Agarwal, Naman and Chen, Xinyi and Dogariu, Evan and Shah, Devan
                 and Strauss, Hubert and Feinberg, Vlad and Suo, Daniel
                 and Bartlett, Peter and Hazan, Elad},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year      = {2026}
}