“Welcome to the Era of Experience.”
— David Silver and Richard S. Sutton
As the learning paradigm transitions from static data to interaction-driven experience, environments play a central role in enabling agents to learn, adapt, and evolve continuously through interaction. Yet the high cost and limited scalability of manually constructed environments pose a fundamental bottleneck to experience learning for agents. Therefore, automatically scaling environments is a necessary step toward the era of experience.
In this paper, we investigate how far we are from fully automated environment scaling for self-evolving agents. Specifically, our contributions are threefold: (1) We propose an automated environment synthesis workflow with explicit control over environment complexity, enabling environments to adapt to the agent’s evolving capabilities; (2) We introduce a principled evaluation framework to assess synthesized environments along the dimensions of correctness, difficulty, and diversity; (3) We conduct a systematic study and find that environment properties such as scale, complexity, correctness, and feedback design play a critical role in agent learning.
To this end, we introduce GymVerse, a comprehensive framework for environment synthesis, evaluation, and agent training. We further propose a simple yet effective reinforcement learning algorithm (PERPO) to support stable training on synthesized and evolving environments.
Extensive experiments on GymVerse demonstrate that training on synthesized environments enables effective generalization to unseen environments.
RQ I How can environments be automatically synthesized and continuously evolved?
RQ II What dimensions are essential for evaluating synthesized environments?
RQ III Which environment properties most strongly influence agent learning?
We posit that effective environments should adapt their difficulty dynamically as an agent’s learning progresses, thereby maintaining an appropriate level of challenge that supports efficient learning.
To this end, we formalize environment complexity as an explicit control variable in environment synthesis, reflected in factors such as interaction length, environment structure, and transition dependencies.
We model environment synthesis as a two-stage LLM workflow: an abstraction stage that produces high-level environment design concepts capturing the core interaction logic, followed by a construction stage that translates these concepts into executable and verifiable environment code.
Our synthesis workflow supports the generation of environments across multiple domains, including Tool, Game, Algo, and Logic.
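The two-stage workflow can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the prompts, the `EnvConcept` fields, and the `llm` callable are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class EnvConcept:
    domain: str        # e.g. "Tool", "Game", "Algo", "Logic"
    complexity: int    # explicit control variable for synthesis
    description: str   # high-level interaction logic

def synthesize_environment(llm, domain: str, complexity: int) -> str:
    """Two-stage synthesis: abstraction, then construction.
    `llm` is any callable mapping a prompt string to a completion string."""
    # Stage 1: abstraction -- produce a high-level environment design concept.
    concept = EnvConcept(
        domain=domain,
        complexity=complexity,
        description=llm(
            f"Design a {domain} environment at complexity level {complexity}: "
            "specify states, actions, and transition dependencies."
        ),
    )
    # Stage 2: construction -- translate the concept into executable code.
    return llm(
        "Implement this environment design as a Gym-style Python class:\n"
        + concept.description
    )
```

Because complexity is an explicit argument, the same concept family can be re-synthesized at a higher level as the agent improves.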
We therefore decompose environment evaluation into three dimensions: correctness, difficulty, and diversity.
For correctness, we adopt a three-stage evaluation pipeline: an execution checker that validates environment functionality via unit tests, a solvability check that algorithmically verifies that at least one valid solution exists, and a rubric judge that assesses compliance with predefined requirements.
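A sketch of this three-stage filter is below. The stage implementations (`unit_tests`, `solver`, `judge`) are hypothetical placeholders; only the short-circuiting structure reflects the pipeline described above.

```python
def verify_environment(env, unit_tests, solver, judge, rubric) -> bool:
    """Three-stage correctness check: execution, solvability, rubric compliance.
    All stage callables here are hypothetical placeholders."""
    # Stage 1: execution checker -- run unit tests against the environment.
    if not all(test(env) for test in unit_tests):
        return False
    # Stage 2: solvability check -- search for at least one valid solution.
    if solver(env) is None:
        return False
    # Stage 3: rubric judge -- assess compliance with predefined requirements.
    return judge(env, rubric)
```

Environments failing any stage are discarded before training, which the correctness-filtering experiments below suggest matters substantially.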
For difficulty, we quantify environment difficulty based on agent performance measured by success rate over multiple randomly sampled task instances. We find that the synthesized environment complexity is strongly correlated with the resulting environment difficulty.
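Difficulty estimation by success rate admits a direct sketch. The instance count and the `env_factory`/`agent` interfaces are assumptions, not the paper's exact protocol.

```python
import random

def estimate_difficulty(env_factory, agent, n_instances: int = 32) -> float:
    """Estimate difficulty as 1 minus the agent's success rate over
    randomly sampled task instances of one environment."""
    successes = 0
    for _ in range(n_instances):
        env = env_factory(seed=random.randrange(10**6))  # fresh task instance
        successes += int(agent(env))  # agent returns True on task success
    return 1.0 - successes / n_instances
```

A difficulty near 1.0 means the probe agent almost never solves the environment; near 0.0, it almost always does.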
For diversity, our analysis shows that strong LLMs generate environments that are well separated across domains, while maintaining meaningful variability within the same domain.
To support training on synthesized environments in GymVerse, we propose Progressive Environment Relative Policy Optimization (PERPO), which performs advantage normalization at the environment level and progressively evolves environment complexity, leading to more stable and effective agentic reinforcement learning.
We compute the environment-level advantage by normalizing returns over all trajectories sampled from the same environment:
$$A^{(n)}_t = \frac{G^{(n)}_t - \operatorname{mean}(\mathcal{G}_e)}{\operatorname{std}(\mathcal{G}_e)}$$
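In code, the normalization above amounts to standardizing returns within each environment's own trajectory group; the small epsilon is an added numerical-stability assumption, not part of the formula.

```python
import numpy as np

def environment_advantages(returns_by_env):
    """Environment-level advantage: normalize each trajectory's return by the
    mean and std of all trajectory returns from the same environment."""
    advantages = {}
    for env_id, returns in returns_by_env.items():
        g = np.asarray(returns, dtype=np.float64)
        advantages[env_id] = (g - g.mean()) / (g.std() + 1e-8)  # epsilon for stability
    return advantages
```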
The resulting PERPO objective is defined as:
$$\mathcal{J}_{\text{PERPO}}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}A^{(n)}_t\log \pi_\theta\!\left(a^{(n)}_t \mid o^{(n)}_t\right)$$
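Term by term, the objective is an advantage-weighted log-likelihood summed over timesteps and averaged over the N trajectories. A minimal sketch, with precomputed per-token log-probabilities standing in for the policy network:

```python
import numpy as np

def perpo_objective(advantages, log_probs):
    """PERPO objective J(theta): advantages and log_probs are (N, T) arrays of
    A_t^(n) and log pi_theta(a_t^(n) | o_t^(n)); returns the scalar objective
    (to be maximized)."""
    A = np.asarray(advantages, dtype=np.float64)
    logp = np.asarray(log_probs, dtype=np.float64)
    return (A * logp).sum(axis=1).mean()  # sum over t, mean over trajectories
```

In an actual training loop the log-probabilities would come from the policy network and the negated objective would be minimized by gradient descent.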
In addition, PERPO incorporates progressive complexity scaling, where the environment complexity is adaptively adjusted based on the agent’s performance. Let \(c_t\) denote the environment complexity at training step \(t\), \(\bar{G}\) the agent’s mean return, and \(\tau^{+}, \tau^{-}\) the upper and lower performance thresholds:
$$c_t \leftarrow \begin{cases} \min(c_t+1,\, c_{\max}) & \text{if } \bar{G} > \tau^{+} \\ \max(c_t-1,\, c_{\min}) & \text{if } \bar{G} < \tau^{-} \end{cases}$$
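The update rule is a simple bounded step scheduler. The threshold values below are illustrative assumptions; the paper does not state specific numbers here.

```python
def update_complexity(c: int, mean_return: float,
                      tau_hi: float = 0.8, tau_lo: float = 0.2,
                      c_min: int = 1, c_max: int = 10) -> int:
    """Progressive complexity scaling: step the complexity level up when the
    agent's mean return exceeds tau_hi, down when it falls below tau_lo,
    clipped to [c_min, c_max]. Thresholds and bounds are illustrative."""
    if mean_return > tau_hi:
        return min(c + 1, c_max)
    if mean_return < tau_lo:
        return max(c - 1, c_min)
    return c
```

This keeps training in a band where tasks are neither trivially solved nor hopeless, matching the adaptive-difficulty motivation above.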
Environment scaling: Increasing the number of environments enhances generalization to unseen environments, though the gains saturate beyond a certain scale. Among the 64 rigorously verified environments, 32 are used for training (GymVerse-ID) and 32 unseen environments are reserved for evaluation (GymVerse-OOD).

Training dynamics on GymVerse-ID.

Evaluation performance on GymVerse-OOD.
Complexity evolution: Gradually increasing environment complexity leads to improved learning performance and greater training stability compared to fixed-difficulty environments.

Training dynamics with different complexity levels.
Correctness filtering: Filtering out incorrect or unsound environments via environment evaluation significantly improves training efficiency. As the training dynamics below show, the presence of incorrect environments interferes with learning, even on environments that are otherwise correctly verified.

Training dynamics comparing clean and mixed environment-quality settings across four shared correct environments.
Feedback design: Richer feedback, including informative environment observations and dense process rewards, consistently accelerates learning compared to sparse outcome rewards.

Training dynamics with different feedback designs.
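The contrast between sparse outcome rewards and dense process rewards can be sketched as follows. The subgoal structure and the 0.1 step bonus are illustrative assumptions, not the paper's actual reward design.

```python
def sparse_outcome_reward(trajectory) -> float:
    """Sparse feedback: reward only the final outcome of the episode."""
    return 1.0 if trajectory[-1].get("success") else 0.0

def dense_process_reward(trajectory) -> float:
    """Dense feedback: a small bonus for each step that completes a subgoal,
    plus the final outcome reward (subgoal keys here are hypothetical)."""
    step_bonus = sum(0.1 for step in trajectory if step.get("subgoal_met"))
    return step_bonus + sparse_outcome_reward(trajectory)
```

Under sparse feedback an agent that makes partial progress but fails receives the same zero signal as one that does nothing, which is one intuition for why dense process rewards accelerate learning.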
GymVerse-OOD is constructed to assess cross-environment generalization and consists of 8 environments from each of the four domains, namely Tool, Game, Algo, and Logic.

Evaluation performance across four domains under different environment complexity levels.
📏 We introduce GymVerse, a comprehensive framework for environment synthesis, evaluation, and agent training.
💥 GymVerse supports the automatic generation of diverse environments across multiple domains, together with a multi-stage correctness filtering pipeline that ensures environment reliability for downstream training and evaluation.
🧠 Building on GymVerse, we systematically investigate how key environment properties, including environment scale, complexity evolution, correctness, and feedback design, affect agent learning dynamics and generalization.
🚀 In particular, training Qwen3-4B-Instruct on only 16 synthesized environments yields consistent and substantial generalization gains across both multi-turn and single-turn benchmarks, demonstrating robustness to unseen environments.








[BibTeX placeholder]