Enhancing AI Efficiency

Vision Integration and Multi-Agent Collaboration in the Voyager Framework for Minecraft

University of California, Berkeley
Group Name: Mind-Guer
Track: Applications
Figure 1a: Multi-Agent Voyager agents exploring in Minecraft
Figure 1b: Multi-Agent Voyager with vision capabilities

Abstract

This project aims to advance the development of embodied AI agents by leveraging video games as a testing ground, specifically building upon the Voyager framework in Minecraft. We propose enhancing the existing architecture through a combination of state-of-the-art LLM integration, reinforcement learning techniques, and optimized prompt engineering. Our approach focuses on improving long-horizon task completion and skill acquisition, using Minecraft's open-ended environment as a low-risk platform for developing foundational AI capabilities. By optimizing agent behavior through direct gameplay feedback and sophisticated skill library management, we aim to demonstrate significantly improved performance over baseline metrics, particularly in exploration and task completion efficiency.

We build upon the Voyager paper, which introduced large language model agents capable of playing the video game Minecraft. A central focus of the original research was the enhancement of agent behavior through iterative gameplay feedback and the strategic management of a comprehensive skill library, improvements that yielded significant gains in both exploration efficiency and task completion effectiveness over baseline models. Our project takes this idea further by introducing multi-agent capabilities and incorporating vision-based agents, enriching the interactions and collaborative potential within the Minecraft ecosystem.

Minecraft Environment Demo
Figure 2: Minecraft Agent in Action
Multi-Agent Voyager System Architecture
Figure 3: Multi-Agent Voyager System Architecture with Vision Integration
Methodology

1. Multi-Bot Communication

The multi-bot system significantly expands upon the original Voyager architecture by enabling multiple bot instances to operate concurrently within the same Minecraft world. At the heart of this system is the Multi-Agent Manager, a central coordination component that handles the initialization and lifecycle management of multiple bot instances. The manager creates a separate checkpointing directory for each bot and assigns unique identifiers (such as "bot1-alby" and "bot2-france") along with dedicated ports for communication.
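To make the manager's bookkeeping concrete, the sketch below (a simplified illustration with assumed class and directory names, not Voyager's actual internals) shows how per-bot identifiers, dedicated ports, and checkpoint directories might be assigned:

```python
# Illustrative sketch of Multi-Agent Manager bookkeeping; class and directory
# names are assumptions for this example, not Voyager's actual internals.
import os
from dataclasses import dataclass

@dataclass
class BotConfig:
    bot_id: str    # e.g. "bot1-alby", "bot2-france"
    port: int      # dedicated communication port for this bot
    ckpt_dir: str  # per-bot checkpointing directory

class MultiAgentManager:
    def __init__(self, bot_ids, base_port=3000, ckpt_root="ckpt"):
        self.configs = []
        for i, bot_id in enumerate(bot_ids):
            ckpt_dir = os.path.join(ckpt_root, bot_id)
            os.makedirs(ckpt_dir, exist_ok=True)  # separate checkpoints per bot
            self.configs.append(BotConfig(bot_id, base_port + i, ckpt_dir))

# Example: two bots sharing one Minecraft world, each with its own port.
manager = MultiAgentManager(["bot1-alby", "bot2-france"])
for cfg in manager.configs:
    print(f"{cfg.bot_id}: port={cfg.port}, checkpoints={cfg.ckpt_dir}")
```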

System Design:

The system employs an independent process architecture where each bot operates in its own Node.js process through the Mineflayer framework. Communication between bots is facilitated through a REST API interface, with each bot assigned its own port (starting from 3000). To maintain accountability and enable debugging, separate logging directories track each bot's activities, and individual first-person viewports are available for monitoring each bot's perspective in real-time.
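As a rough illustration of this process-per-bot design, the following sketch launches each bot's Mineflayer bridge in its own Node.js process and polls its REST interface; the script path, the --port flag, and the /status route are assumptions for this example rather than Voyager's exact CLI:

```python
# Hedged sketch: spawn one Node.js process per bot and check liveness over
# REST. The "mineflayer/index.js" path and "/status" route are hypothetical.
import os
import subprocess
import requests

def launch_bot_process(bot_id: str, port: int, log_dir: str) -> subprocess.Popen:
    """Start one bot in its own Node.js process, logging to a per-bot file."""
    os.makedirs(log_dir, exist_ok=True)
    log_file = open(os.path.join(log_dir, f"{bot_id}.log"), "a")
    return subprocess.Popen(
        ["node", "mineflayer/index.js", "--port", str(port)],
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )

def bot_is_alive(port: int) -> bool:
    """Poll the bot's REST interface on its dedicated port."""
    try:
        return requests.get(f"http://localhost:{port}/status", timeout=2).ok
    except requests.RequestException:
        return False
```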

A crucial aspect of the design is the shared environment state. While bots operate independently, they all exist within the same Minecraft world instance. This means they can observe and interact with blocks and items placed by other bots, creating opportunities for collaborative behavior. The environment maintains a consistent state across all bot instances, ensuring coherent interactions between bots.

Multi-Bot Workflow:

In our implementation, we focused on analyzing how agents in Minecraft naturally organize themselves and how quickly this self-organization emerges. Our multi-agent system allows us to examine how agents learn to avoid conflicts, share resources, and develop complementary roles, while also studying whether agents can accelerate their learning by observing each other's behaviors.

Skill sharing represents a key collaborative aspect of the system. While bots maintain their individual skill libraries, they can contribute successful behaviors to a shared skill pool. When one bot successfully completes a task, the associated skill becomes available to other bots, fostering collaborative learning across the bot population. This knowledge-sharing mechanism accelerates the overall learning process and promotes the development of more sophisticated behaviors.
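The sketch below illustrates one way such a shared skill pool could be wired up; the class names and the idea of storing raw skill code in a dictionary are simplifications for this example, not the exact Voyager skill-library format:

```python
# Simplified sketch of skill sharing between bots; names and storage format
# are illustrative assumptions, not Voyager's actual skill-library schema.
class SharedSkillPool:
    """Skills verified by any bot become visible to every other bot."""
    def __init__(self):
        self.skills = {}  # skill name -> generated code

    def contribute(self, name: str, code: str):
        self.skills[name] = code

    def lookup(self, name: str):
        return self.skills.get(name)

class BotSkillLibrary:
    """Each bot keeps a private library but publishes successes to the pool."""
    def __init__(self, bot_id: str, pool: SharedSkillPool):
        self.bot_id = bot_id
        self.pool = pool
        self.local = {}

    def add_skill(self, name: str, code: str):
        self.local[name] = code
        self.pool.contribute(name, code)  # make it available to other bots

    def get_skill(self, name: str):
        # Prefer a locally learned skill, then fall back to the shared pool.
        return self.local.get(name) or self.pool.lookup(name)
```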

Bot Trajectory

Figure 4: Visualization of bot exploration trajectory in Minecraft

2. Vision Agent Integration

As noted in Voyager's original implementation, a significant limitation exists in the realm of multimodal feedback: Voyager does not support visual perception because the GPT-4 API available at its inception was text-only. This constraint restricts the framework's ability to perform tasks that require nuanced spatial understanding and visual analysis.

To address this limitation, we developed and integrated a Vision Agent to enhance spatial reasoning and task efficiency within a multi-agent Minecraft environment. The VisionAgent analyzes images using GPT-4 Vision, extracting critical spatial relationships such as block positions and accessibility, and stores these insights in a structured vision_memory.json file. This capability allows the agent to leverage past experiences and adapt strategies for both repetitive and complex tasks through a multi-modal memory system.
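A minimal sketch of the VisionAgent's analyze-and-remember loop is shown below, assuming the OpenAI Python SDK and a vision-capable model; the prompt wording, the model name, and the vision_memory.json schema are simplifications rather than the exact implementation:

```python
# Hedged sketch of the VisionAgent: send a screenshot to a vision-capable
# model and store the returned spatial insight in vision_memory.json.
import base64, json, os
from openai import OpenAI

class VisionAgent:
    def __init__(self, memory_path="vision_memory.json", model="gpt-4o"):
        self.client = OpenAI()           # reads OPENAI_API_KEY from the env
        self.memory_path = memory_path
        self.model = model
        self.memory = []
        if os.path.exists(memory_path):
            with open(memory_path) as f:
                self.memory = json.load(f)

    def analyze(self, screenshot_path: str, task: str) -> str:
        with open(screenshot_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Task: {task}. Describe the visible blocks, "
                             "their relative positions, and which are accessible."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        insight = response.choices[0].message.content
        self.memory.append({"task": task, "insight": insight})
        with open(self.memory_path, "w") as f:
            json.dump(self.memory, f, indent=2)  # persist for later runs
        return insight
```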

The integration of the Vision Agent into the multi-agent workflow was achieved through a modular architecture that allows for seamless communication and data sharing among agents. The Vision Agent is responsible for analyzing visual data captured from the environment and providing actionable insights to other agents in the system.

Multi-Agent Workflow:

In the multi-agent workflow, each agent has a distinct role that contributes to the overall functionality of the system; a sketch of the resulting control loop follows the list. The primary agents include:

  • Vision Agent: This agent captures and analyzes visual data from the environment. It identifies objects, assesses their properties, and provides insights to other agents. The Vision Agent acts as the sensory component of the system, enabling other agents to understand their surroundings.
  • Action Agent: This agent receives insights from the Vision Agent and determines the appropriate actions to take based on the current context. It utilizes information about optimal blocks and other relevant objects to execute tasks such as mining, crafting, or navigating.
  • Critic Agent: The Critic Agent evaluates the actions taken by the Action Agent based on the insights provided by the Vision Agent. It assesses the effectiveness of the actions and provides feedback, which can be used to refine the decision-making process of the Action Agent.
  • Curriculum Agent: Acts as the communication hub and task manager, integrating vision insights into agent observations so that the tasks it proposes are informed by visual information.
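The following sketch ties these roles together into a single iteration of the control loop; the method names and signatures are illustrative assumptions rather than Voyager's exact interfaces:

```python
# Illustrative control loop for one iteration; the agent interfaces shown
# here are assumptions that mirror the roles described above.
def run_iteration(vision_agent, curriculum_agent, action_agent, critic_agent, env):
    screenshot = env.capture_screenshot()   # first-person viewport image
    observation = env.get_observation()     # text-based game state

    # 1. Vision Agent: extract spatial insights from the current screenshot.
    insight = vision_agent.analyze(screenshot, task="survey surroundings")

    # 2. Curriculum Agent: propose the next task, informed by vision.
    task = curriculum_agent.propose_task(observation, vision_insight=insight)

    # 3. Action Agent: generate and execute code for the task.
    result = action_agent.execute(task, observation, vision_insight=insight)

    # 4. Critic Agent: judge the outcome and feed it back to the curriculum.
    success, feedback = critic_agent.evaluate(task, result, vision_insight=insight)
    curriculum_agent.record_outcome(task, success, feedback)
    return success
```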

3. Tools and Frameworks

The development and implementation of the multi-agent system utilized a variety of technologies and frameworks, including:

  • LangChain: Utilized to manage interactions between agents and facilitate the processing of natural language inputs and outputs.
  • OpenAI API: Integrated for leveraging large language models (LLMs) to enhance the reasoning capabilities of the agents.
  • Custom Logging Tools: Developed to capture detailed logs of agent interactions and system performance, enabling thorough analysis and debugging.
  • JSON: Used for data interchange between agents, ensuring a structured and easily parsable format for communication.
  • React and Flask: Used to build the front end and back end of the system, respectively.
  • Pandas, Matplotlib, and Seaborn: Python libraries for data handling and visualization, used to generate the evaluation graphs.

By leveraging these tools and frameworks, the project was able to create a robust and efficient multi-agent system capable of performing complex tasks in a dynamic environment.

Experiments and Evaluation

We evaluated our enhanced Voyager framework using metrics similar to those in the original paper, including distance traveled and unique items collected. These metrics allow us to quantitatively assess the improvements gained through vision integration and multi-agent collaboration.
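As an example of how these metrics are turned into the plots below, the following sketch computes the cumulative count of unique items from an event log with Pandas and renders it with Matplotlib; the CSV path and column names are assumptions about the logging format:

```python
# Hedged sketch: plot cumulative unique items collected over time.
# The log path and the "iteration"/"item" columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

def plot_unique_items(log_csv: str):
    df = pd.read_csv(log_csv)                                 # columns: iteration, item
    df["unique_items"] = (~df["item"].duplicated()).cumsum()  # cumulative unique count
    plt.plot(df["iteration"], df["unique_items"])
    plt.xlabel("Prompting iteration")
    plt.ylabel("Unique items collected")
    plt.title("Unique items collected over time")
    plt.savefig("unique_items.png")

plot_unique_items("logs/bot1-alby/items.csv")  # hypothetical log location
```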

Unique Items Collected Chart

Figure 7: Preliminary results showing unique items collected over time

The preliminary results shown above were obtained from a single bot run. We were unable to run a bot long enough to produce results that could be directly compared to the original Voyager benchmarks: with the number of API calls required, extended runs became quite expensive ($5-$10 for even short runs of about fifteen minutes). This is an area we have identified for optimization in future work.

Despite these limitations, the initial results appear promising, and we expect this enhanced model to outperform the original in terms of exploration efficiency and task completion due to the added vision capabilities and potential for collaborative behavior among agents.

Project Presentation Video

Conclusion and Future Work

This project successfully extended the Voyager framework by integrating vision capabilities and enabling multi-agent workflows within the Minecraft environment. By enhancing the agents with vision processing and facilitating collaboration between multiple agents, we've created a more versatile system capable of tackling complex tasks through coordinated effort and spatial awareness.

The integration of vision agents into multi-agent workflows has substantial implications for the development of embodied AI. Enhanced performance in simulated environments like Minecraft serves as a stepping stone toward more complex real-world applications, including multi-agent workforces, autonomous systems, and human-AI collaborative platforms.

Future Work

  • Scalability Enhancements: Expanding the system to support a larger number of agents and more complex tasks, assessing performance and coordination efficiency at scale.
  • Collaborative Spatial Reasoning: Enabling multiple agents to work together on spatial tasks, sharing information about their environments and coordinating actions.
  • Human-Agent Collaboration: Exploring interactions between AI agents and human players, fostering more natural and intuitive collaboration.
  • Real-World Applications: Transitioning the multi-agent vision framework from simulated environments to real-world applications, such as robotics and autonomous systems.
  • Adaptive Learning Mechanisms: Implementing adaptive learning algorithms that allow agents to learn and evolve their strategies based on dynamic environmental changes and collaborative experiences.
  • Improved Prompting Mechanisms: Exploring different prompting strategies, such as incorporating DSPy and chain-of-thought prompting, and reducing the number of API calls to lower monetary costs.

By pursuing these avenues, future work can build upon the foundational advancements achieved in this project, driving the development of more sophisticated and versatile AI agents. Agents that learn to navigate a virtual world and complete tasks within it may ultimately transfer to real-world settings, supporting the development of robots that can navigate their surroundings and perform a wide variety of tasks.

References

[1] Zhang, K., Yang, Z., and Basar, T. Deep Reinforcement Learning for Multi-Agent Systems: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, 2020.

[2] Foerster, J., Nardelli, N., Farquhar, G., Torr, P. H., Kohli, P., and Whiteson, S. Counterfactual Multi-Agent Policy Gradients. Advances in Neural Information Processing Systems, 31:4297-4307, 2018.

[3] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., and Farhadi, A. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint arXiv:1712.05474, 2017.

[4] Savva, M., Malik, J., Parikh, D., Batra, D., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., and Koltun, V. Habitat: A Platform for Embodied AI Research. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[5] Zhu, Y., Wong, J., Mandlekar, A., and Martín-Martín, R. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv preprint arXiv:2009.12293, 2020.

[6] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020.

[7] Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., and Anandkumar, A. MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. arXiv preprint arXiv:2206.08853, 2022.

[8] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023. https://arxiv.org/abs/2305.16291
