The rapid advancement of language models has primarily been driven by scaling up model size and increasing training data. However, an emerging approach challenges this paradigm: scaling test-time computation through latent reasoning. A new paper titled “Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach”[1] introduces a novel recurrent depth transformer architecture designed to enhance the reasoning capabilities of language models by iterating over internal latent states. This method enables models to allocate more computational effort to complex tasks without requiring larger models or longer context windows.

Traditionally, language models improve their reasoning by generating intermediate steps through chain-of-thought prompting [2]. While effective, this approach is constrained by the need to verbalize every intermediate step into tokens, which can be computationally inefficient and memory-intensive. In contrast, humans often reason internally, exploring multiple lines of thought before verbalizing an answer. The authors propose a model that mimics this latent internal reasoning process by iteratively updating a hidden state before outputting a token.
Moreover, the authors have released the model weights [3] along with the code and data [4], so you can try this idea on your own!
Core Idea: Recurrent Depth
The paper introduces a depth-recurrent transformer architecture. Instead of increasing the model's depth statically, this architecture includes a core recurrent block that iterates over a latent state multiple times during inference. This allows the model to spend more computational effort per token when needed, enabling flexible computational scaling based on task complexity.
The recurrent block is situated between two fixed components: the prelude and the coda. The prelude maps input tokens into a latent space, while the coda decodes the final latent state into token probabilities. The recurrent block can be unrolled for an arbitrary number of iterations, allowing the model to refine its internal representation before making a prediction.
During training, the number of recurrent steps is sampled randomly from a log-normal Poisson distribution. This stochastic approach encourages the model to generalize across different depths and prevents overfitting to a specific computational budget. At test time, the number of recurrent steps can be adjusted dynamically, enabling the model to allocate more computation to difficult queries and less to simple ones.
Key motivations for this approach include:
Adaptive Computation: Complex queries can be given more computational resources, while simpler ones are processed quickly.
Latent Reasoning: The model reasons in a continuous latent space, potentially capturing abstract thought processes not easily verbalized.
Reduced Memory Overhead: Since the model does not rely on longer context windows, it requires less memory compared to traditional chain-of-thought approaches.
Increased Computational Efficiency: The model performs more floating-point operations per parameter, reducing inter-device communication overhead during distributed training.
Enhanced Problem-Solving Ability: The iterative refinement process encourages the model to develop generalizable problem-solving strategies, moving beyond rote memorization towards reasoning and abstraction.
Model Architecture
The model consists of three main components that work in conjunction to enable latent reasoning through recurrent depth:

Prelude: This component is responsible for embedding the input tokens into a high-dimensional latent space. It consists of multiple transformer layers that transform the token sequence into an initial latent state representation. The prelude can be thought of as a standard transformer encoder[5] that prepares the input for iterative refinement.
Recurrent Block: The core innovation lies in this block. The recurrent block is a sequence of transformer sub-layers that update the latent state iteratively. The input embeddings from the prelude are injected into the recurrent block at each iteration, ensuring that the model continuously considers the original input while refining its latent state. Each iteration processes the latent state and generates an updated representation. This iterative process can be repeated an arbitrary number of times at test time, enabling adaptive computational depth.
Hidden State Initialization: The initial latent state is sampled from a Gaussian distribution. This randomized initialization stabilizes the recurrence process and promotes path independence, ensuring that the final state does not overly depend on the initial state.
Token-wise Recurrence: The recurrent block updates each token's latent state in parallel, while its attention layers still let tokens exchange information. This allows the model to refine individual token representations without losing contextual dependencies.
Unrolling: At test time, the recurrent block can be unrolled for as many iterations as needed. Each iteration involves applying the transformer sub-layers to the latent state, with the number of iterations serving as a computational budget.
Coda: The coda is a final sequence of transformer layers that decodes the refined latent state back into the vocabulary space, producing the output token distribution. Just as the prelude acts as the encoder into latent space, the coda acts as the decoder back out of it.
The architecture can be summarized as follows:
Input tokens are embedded by the prelude into an initial latent state.
The latent state is refined iteratively by the recurrent block, with the number of iterations being flexible.
The final latent state is decoded into token probabilities by the coda.
This design enables the model to adjust its computational depth dynamically, balancing efficiency and performance based on the complexity of the task.
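To make the data flow concrete, here is a minimal PyTorch sketch of the prelude → recurrent block → coda loop. The layer counts, dimensions, module names, and the linear adapter used to re-inject the prelude output are illustrative assumptions rather than the authors' implementation, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Minimal sketch: prelude -> iterated recurrent block -> coda."""

    def __init__(self, vocab_size, d_model=512, n_heads=8):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, batch_first=True, norm_first=True)

        self.embed = nn.Embedding(vocab_size, d_model)
        self.prelude = nn.TransformerEncoder(make_layer(), num_layers=2)
        # The recurrent block is re-applied to the latent state; the prelude
        # output e is re-injected at every iteration via a linear adapter.
        self.adapter = nn.Linear(2 * d_model, d_model)
        self.recurrent = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.coda = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, num_steps=32):
        e = self.prelude(self.embed(tokens))   # map tokens into latent space
        s = torch.randn_like(e)                # random initial latent state
        for _ in range(num_steps):             # unrolled recurrence
            s = self.recurrent(self.adapter(torch.cat([s, e], dim=-1)))
        return self.lm_head(self.coda(s))      # decode latent state to logits
```

The same weights support any budget at inference time, e.g. calling model(tokens, num_steps=64) simply unrolls the recurrence further.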
Training Methodology
The model was trained on a mixture of data, including general web text, code, scientific literature, and mathematical content. The 3.5-billion-parameter model was pretrained on 800 billion tokens. This data mixture was chosen to encourage the development of reasoning and problem-solving skills, particularly in mathematical and coding contexts.

A key aspect of the training process was the randomization of the number of recurrent steps. During each training step, the number of recurrent iterations was sampled from a log-normal Poisson distribution. This distribution was selected because it produces a heavy tail, occasionally resulting in a high number of iterations. This approach trained the model to handle both low-computation and high-computation scenarios effectively. The mean recurrence was set to 32 steps, but the actual number varied around this value.
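A plausible way to implement such a sampler (the exact parameterization in the paper may differ) is to draw a Poisson rate log-normally, shifted slightly so its expectation stays near the target mean of 32, and then draw the step count from that Poisson:

```python
import numpy as np

def sample_num_steps(mean_recurrence=32.0, sigma=0.5, rng=None):
    """Sample a recurrence count from a log-normal Poisson distribution.

    The heavy right tail comes from the log-normal rate: most draws land
    near the mean, but occasionally a much larger step count is sampled.
    The constants here (sigma, the +1 floor) are assumptions.
    """
    rng = rng or np.random.default_rng()
    # E[exp(N(mu, sigma^2))] = exp(mu + sigma^2 / 2), so subtracting
    # sigma^2 / 2 keeps the expected Poisson rate near mean_recurrence.
    log_rate = rng.normal(np.log(mean_recurrence) - 0.5 * sigma**2, sigma)
    # At least one iteration is always performed.
    return int(rng.poisson(np.exp(log_rate))) + 1
```

During training, a function like this would be called once per step (or per micro-batch), and the recurrent block unrolled that many times.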
The training utilized truncated backpropagation through time to reduce memory consumption. Gradients were only backpropagated through the last eight iterations of the recurrent block. This technique allowed the model to benefit from long recurrences without incurring excessive memory costs.
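The saving comes from running the early iterations without building a computation graph and keeping gradients only for the final few. A rough sketch, reusing the hypothetical module names from the architecture snippet above:

```python
import torch

def forward_truncated(model, tokens, num_steps, backprop_steps=8):
    """Unroll the recurrence but backpropagate only through the last k steps."""
    e = model.prelude(model.embed(tokens))
    s = torch.randn_like(e)
    no_grad_steps = max(num_steps - backprop_steps, 0)
    # Early iterations run without autograd, so their activations are
    # freed immediately -- this is the memory saving of truncated BPTT.
    with torch.no_grad():
        for _ in range(no_grad_steps):
            s = model.recurrent(model.adapter(torch.cat([s, e], dim=-1)))
    # Only the last `backprop_steps` iterations record a graph; the prelude
    # still receives gradients through the e injected in these steps.
    for _ in range(num_steps - no_grad_steps):
        s = model.recurrent(model.adapter(torch.cat([s, e], dim=-1)))
    return model.lm_head(model.coda(s))
```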
The training was conducted on a large-scale computing cluster with parallel processing across multiple GPUs. The Adam optimizer was employed with a constant learning rate after a brief warm-up period. Weight initialization and normalization were carefully tuned to stabilize the recurrent training process and prevent gradient explosions.
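A constant learning rate after a linear warm-up can be expressed with a simple LambdaLR schedule; the learning rate, betas, and warm-up length below are placeholders, not the paper's values:

```python
import torch

# Hypothetical hyperparameters for illustration only; `model` is assumed
# to be the RecurrentDepthLM sketch defined earlier.
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.95))
warmup_steps = 1000

def lr_lambda(step):
    # Ramp linearly from zero to the base rate, then stay constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```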
Results
The model demonstrated competitive performance across multiple benchmarks [6], particularly excelling in tasks requiring complex reasoning. Key findings include:
Adaptive Improvement: Performance improved with more recurrent steps, with some benchmarks showing dramatic gains when increasing from 1 to 32 steps. Notably, benchmarks such as GSM8K (a mathematical reasoning task) exhibited up to fivefold improvements as the number of recurrent steps increased.
Task Dependence: Tasks with lower computational complexity, like OpenBookQA, saturated with fewer steps, while more complex benchmarks like GSM8K benefited from extended computation. This task-specific behavior highlights the model's ability to allocate computational resources dynamically based on the difficulty of the task.
Mathematical and Code Reasoning: The model achieved strong results on math and coding tasks, suggesting that latent reasoning is particularly valuable in domains requiring multi-step logical processes. In tasks like HumanEval and MBPP (coding benchmarks), the model outperformed similarly sized non-recurrent models, indicating that recurrent latent processing is particularly beneficial for structured problem-solving.

Emergent Computation Patterns
One of the most intriguing aspects of the paper is the emergence of distinct computational patterns in the latent space. Visualizing token trajectories during recurrent processing revealed behaviors such as:
Convergence: The latent state stabilizes as the model iterates.
Orbits: Cyclical patterns for numerical reasoning, reminiscent of patterns observed in models trained on arithmetic tasks.
Sliders: Gradual drift in latent space, potentially representing counting mechanisms.
These findings suggest that recurrent depth models develop internal strategies akin to human cognitive processes.
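These trajectory patterns can be probed in a simple way (not the authors' exact analysis): record the latent state of one token at every iteration and measure how far consecutive states move. Step sizes shrinking toward zero indicate convergence, roughly periodic step sizes hint at orbits, and a steady non-zero step size suggests a slider-like drift. A sketch, again reusing the hypothetical module names from above:

```python
import torch

@torch.no_grad()
def latent_trajectory(model, tokens, token_idx, num_steps=64):
    """Collect one token's latent state at every recurrence step."""
    e = model.prelude(model.embed(tokens))
    s = torch.randn_like(e)
    states = []
    for _ in range(num_steps):
        s = model.recurrent(model.adapter(torch.cat([s, e], dim=-1)))
        states.append(s[0, token_idx].clone())  # batch 0, chosen position
    traj = torch.stack(states)                  # (num_steps, d_model)
    # Distance between consecutive latent states along the trajectory.
    step_sizes = (traj[1:] - traj[:-1]).norm(dim=-1)
    return traj, step_sizes
```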
Practical Implications
The recurrent depth approach offers several practical benefits:
Per-Token Adaptive Compute: The model can stop iterating early for a token once its output has stabilized, for instance when successive iterations barely change the predicted distribution (see the sketch after this list). This reduces latency and computational cost, making the approach suitable for real-time applications such as chatbots and virtual assistants.
KV-Cache Sharing: Key-value cache entries can be shared across recurrence iterations rather than stored separately for every step, reducing memory overhead. This is particularly advantageous in long-sequence processing tasks, enabling more efficient memory management.
Self-Speculative Decoding: The model can draft multiple tokens with fewer iterations and then verify them with more compute. This allows faster text generation while maintaining output quality, improving user experience in applications requiring rapid responses like search engines and content generation platforms.
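As a sketch of the per-token early-exit idea mentioned above: keep iterating until the decoded next-token distribution stops changing between steps, measured here by a KL divergence falling below a threshold. The threshold, the decision to decode at every iteration, and the module names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_steps(model, tokens, max_steps=64, kl_threshold=5e-4):
    """Iterate until the next-token distribution stabilizes, then stop early."""
    e = model.prelude(model.embed(tokens))
    s = torch.randn_like(e)
    prev_logprobs, step = None, 0
    for step in range(1, max_steps + 1):
        s = model.recurrent(model.adapter(torch.cat([s, e], dim=-1)))
        logits = model.lm_head(model.coda(s))[:, -1]   # last position only
        logprobs = F.log_softmax(logits, dim=-1)
        if prev_logprobs is not None:
            # KL(current || previous): a small value means one more iteration
            # would barely change the prediction, so we can stop.
            kl = F.kl_div(prev_logprobs, logprobs, log_target=True,
                          reduction="batchmean")
            if kl < kl_threshold:
                break
        prev_logprobs = logprobs
    return logprobs, step
```

The same machinery supports the self-speculative decoding scheme above: draft several tokens with a small step budget, then verify them with a larger one.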
Philosophical Considerations
The paper also touches on a broader philosophical question: Can models reason beyond the limitations of token sequences? Latent reasoning opens up the possibility of capturing non-verbal cognitive processes such as spatial intuition and motor planning. This aligns more closely with human cognition, where internal thought often precedes verbalization.
The approach builds upon and diverges from several lines of research:
Fixed-Depth Transformers: Standard transformers rely on fixed computational depth, limiting their adaptability.
Chain-of-Thought Prompting: This technique externalizes reasoning but is constrained by the need for verbalized steps.
Deep Equilibrium Models: These models also iterate towards a stable state but focus on finding fixed points, whereas recurrent depth models optimize reasoning capacity through latent iteration.
Challenges and Limitations
While promising, the recurrent depth approach has limitations:
Training Complexity: The stochastic nature of recurrent steps introduces additional training complexity.
Evaluation Ambiguity: Determining the optimal number of recurrent steps for a specific task remains an open problem.
Limited Data and Compute: The authors acknowledge that their model was trained with fewer resources compared to leading models like GPT, indicating room for further scaling and optimization.
Future Directions
The paper suggests several promising avenues for future research and development, which could further enhance the capabilities and efficiency of the recurrent depth approach:
Hybrid Architectures: Combining the recurrent depth approach with mixture-of-experts (MoE) models could enable models to leverage both adaptive compute and selective activation of specialized subnetworks. This hybrid approach could lead to even more efficient and capable models, balancing memory, computation, and performance more effectively.
Fine-Tuning for Domain-Specific Reasoning: Post-training fine-tuning on specific domains, such as scientific research, law, or medicine, could tailor the recurrent depth mechanism to better address domain-specific reasoning tasks. Fine-tuning could also involve optimizing the recurrence mechanism to align with human reasoning chains for specific applications.
Task-Adaptive Recurrence: Developing more sophisticated algorithms to automatically determine the optimal number of recurrent steps based on task complexity during inference could further improve efficiency. Reinforcement learning or adaptive controllers could be explored to dynamically adjust the computational budget in real-time.
Scaling Up: Training larger models with more parameters and additional data could amplify the observed performance gains. Extending the recurrent depth approach to models on the scale of GPT-4 or beyond could reveal new emergent reasoning capabilities.
Multi-Modal Integration: Integrating the recurrent depth mechanism into multi-modal models that process text, images, and other data types could enable better joint reasoning across modalities. This would be particularly useful in tasks requiring spatial reasoning or visual understanding.
Neuroscientific Inspiration: Further exploration into cognitive science and neuroscience could inspire refinements to the recurrent latent reasoning process. Studying how humans allocate cognitive resources and model uncertainty during complex problem-solving could inform improvements to computational resource allocation in AI systems.
Evaluation Metrics: Developing more granular evaluation metrics to measure the quality and efficiency trade-offs of recurrent reasoning processes could help standardize assessments and drive progress in adaptive compute architectures.
These future directions highlight the potential for recurrent depth models to evolve into more robust, flexible, and general-purpose reasoning systems. By combining adaptive computation with larger-scale models and domain-specific optimizations, the recurrent depth approach could play a key role in shaping the next generation of AI models.
Conclusion
The recurrent depth transformer represents a significant step towards models that reason more like humans—allocating computational resources flexibly and exploring multiple latent possibilities before verbalizing an answer. By shifting the focus from static scaling to adaptive, test-time computation, this approach challenges conventional wisdom and opens the door to more efficient and capable language models. As the field progresses, latent reasoning may become a cornerstone of next-generation AI systems.
References
[1] Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., ... & Goldstein, T. (2025). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv preprint arXiv:2502.05171.
[2] DeepSeek, the game-changing model, Transcendent AI
[3] Model weights, Hugging Face
[4] Code and data, GitHub
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).
[6] Measuring Intelligence: Key Benchmarks and Metrics for LLMs, Transcendent AI