Tested Recursive Language Models across 4 GPT models (6,600 evals). RLMs scale with model capability: -9pp on nano, +3pp on mini, +22pp on 5.4-mini, +30pp on 5.2.

Exploring Recursive Language Models Across Multiple GPT Variants: A Comprehensive Evaluation

This post looks at how recursive language models (RLMs), which have the model work on its input by writing and running code, scale with model capability. The evaluation covered four GPT models and 6,600 evaluation runs, enough to show how the benefit of RLMs changes from the weakest model to the strongest.

Key Findings on Model Scaling and Performance

Across the four models tested, the benefit of RLMs rose with model capability. Specifically:

The nano model lost about 9 percentage points when run as an RLM, the only regression among the four models.
The mini model gained about 3 percentage points, a small benefit.
The 5.4-mini model gained about 22 percentage points.
The 5.2 model gained about 30 percentage points, the largest improvement in the evaluation.
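The figures above are percentage-point differences in accuracy, not relative percentages. As a rough illustration of the arithmetic, the sketch below records the four reported deltas and computes a percentage-point difference from raw counts. The helper `pp_delta` and the counts passed to it are hypothetical and are not taken from the evaluation; only the four deltas come from the post.

```python
def pp_delta(baseline_correct: int, rlm_correct: int, total: int) -> float:
    """Accuracy difference in percentage points (RLM minus baseline)."""
    return 100.0 * (rlm_correct - baseline_correct) / total


# Deltas as reported in the post, in percentage points.
reported_pp = {"nano": -9, "mini": 3, "5.4-mini": 22, "5.2": 30}

# Hypothetical counts, chosen only to show the arithmetic:
# 50/100 correct without the RLM, 72/100 correct with it.
delta = pp_delta(baseline_correct=50, rlm_correct=72, total=100)

# The same hypothetical change expressed as a relative improvement.
relative = 100.0 * (72 - 50) / 50

print(f"{delta:+.0f}pp absolute, {relative:+.0f}% relative")
```

With these made-up counts the change is 22 percentage points but a 44% relative improvement, which is why the two units should not be mixed when comparing models.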

The Role of minRLM and the REPL Approach

A central technique examined here is minRLM, which changes where the data lives during execution. Instead of including the data in the prompt, minRLM stores it in a Python Read-Eval-Print Loop (REPL) variable, and the model writes code that operates directly on that variable.

The results split by model capability:

On smaller models, the performance difference between minRLM and traditional RLM approaches is negligible, essentially a tie.
On larger models, the REPL-based method shows an improvement of 30 percentage points.

The case of GPT-5.4-mini stands out. Both vanilla (standard) RLMs and official implementations regressed substantially compared to GPT-5-mini. The REPL-based minRLM, however, held steady, which suggests the technique is more robust to changes in model version.
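The REPL idea described in this section can be sketched in a few lines. This is a minimal illustration of the general pattern, not the minRLM implementation: the variable name `context`, the helper `run_in_repl`, the sample data, and the hard-coded `model_code` (standing in for a real model response) are all assumptions made for the example.

```python
import contextlib
import io


def run_in_repl(code: str, namespace: dict) -> str:
    """Run model-written code against the namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # assumption: a real system would sandbox this
    return buffer.getvalue().strip()


# Hypothetical long input. It lives in the REPL namespace, not in the prompt.
document = "\n".join(
    f"record {i}: status={'error' if i % 7 == 0 else 'ok'}" for i in range(1000)
)
namespace = {"context": document}

# The prompt only describes the variable; it never contains the data itself.
prompt = (
    f"A variable `context` holds a string of {len(document)} characters. "
    "Write Python that prints how many lines contain 'status=error'."
)

# Stand-in for the model's reply to the prompt above.
model_code = "print(sum('status=error' in line for line in context.splitlines()))"

answer = run_in_repl(model_code, namespace)
print(answer)
```

The prompt stays short no matter how large the data is, because the model only sees a description of the variable. Executing model-written code with `exec` is unsafe outside an isolated environment, so anything beyond a toy example would need sandboxing.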

Open Source and Reproducibility

The entire evaluation framework is openly available, promoting transparency and enabling researchers and practitioners to reproduce the experiments or build upon this foundation. The detailed instructions, code, and data are accessible at https://avilum.github.io/minrlm/.

Conclusion and Future Directions

In this evaluation, the benefit of recursive approaches grew with model capability, from a 9 percentage point loss on the nano model to a 30 percentage point gain on 5.2. Keeping the data in a REPL variable rather than in the prompt mattered most on the more capable models.

As models continue to grow in size and complexity, techniques such as minRLM could play a crucial role in unlocking higher levels of reasoning and problem-solving abilities. Continued research and open collaboration will be vital in harnessing these advancements to develop more robust, efficient, and intelligent AI systems.

For those interested in exploring this methodology or conducting further experiments, full details and resources are available at the provided link.
