Memp (2025) equips AI agents with procedural memory, boosting efficiency, adaptability, and lifelong learning across benchmarks like TravelPlanner & ALFWorld.

The advancement of artificial intelligence (AI) hinges on enabling machines to execute complex, multi-step tasks with the efficiency and adaptability of human experts. A significant limitation of traditional large language models (LLMs) is their stateless operation, requiring solutions to be recomputed for each task, which hinders performance in dynamic environments. Procedural memory architectures provide a transformative solution by equipping AI agents with the ability to store, retrieve, and refine task-specific procedures, mirroring the human capacity to master skills through practice. This article, the first in a series exploring cutting-edge AI memory innovations, focuses on the Memp framework, introduced in a 2025 research paper by Zhejiang University and Alibaba Group. Future installments will explore semantic, episodic, and hybrid memory systems, offering a comprehensive view of AI’s cognitive evolution.

The 2025 Memp framework highlights the potential of procedural memory to help AI agents store, refine, and reuse task strategies, making them more adaptable and efficient. The paper describes how Memp gives LLM-based agents a learnable, updatable, long-term procedural memory that could strengthen their resilience and performance. This article walks through the framework's mechanics, benchmark results, and broader implications, as a reference point for researchers, developers, and industry professionals.

Procedural Memory: Emulating Human Skill Acquisition

In human cognition, procedural memory governs the implicit knowledge required to perform tasks, such as playing an instrument or navigating a familiar route, without conscious recall of each step. Supported by brain regions like the basal ganglia, it encodes habits through repetition, becoming automatic with practice. For example, a skilled typist executes keystrokes effortlessly, relying on procedural memory rather than explicit deliberation.

In AI, replicating this capability addresses a fundamental drawback of LLMs, which process inputs independently within limited context windows, losing prior interactions after each session. Procedural memory architectures create persistent repositories of operational knowledge, enabling agents to recall and refine task procedures. The Memp framework, developed by Zhejiang University and Alibaba Group, draws on this human paradigm, allowing agents to capture task trajectories, retrieve relevant procedures, and update their knowledge based on new experiences, paving the way for autonomous, adaptive systems.

The Memp Framework: Design and Functionality

Introduced in a 2025 research paper by Zhejiang University and Alibaba Group, the Memp framework represents a significant advancement in procedural memory for LLM-based agents. It overcomes the rigidity of static memory systems by enabling agents to learn from experience, adapt to changes, and maintain efficiency. Below, we detail its three core components, as outlined in the paper.

1. Memory Construction

Memp captures agent trajectories—sequences of actions, decisions, and outcomes from past tasks—and distills them into reusable formats:

  • Fine-grained instructions: Detailed, step-by-step logs of actions, such as “search for flights, filter by price, book the cheapest option” for a travel planning task.

  • High-level abstractions: Generalized, script-like summaries, such as “plan travel by prioritizing cost and schedule,” which enable flexibility across similar tasks.

This dual representation, achieved through a process termed “Proceduralization,” ensures agents can leverage both precise guidance and adaptable principles. LLMs parse raw trajectories to extract key actions and outcomes, structuring the memory for efficient reuse.
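A minimal sketch of the two representations, assuming a trajectory is a list of (action, observation) steps. The names `Step`, `ProcedureMemory`, `proceduralize`, and `toy_summarizer` are hypothetical and do not come from the paper. `toy_summarizer` is a placeholder for the LLM that would write the high-level script.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    action: str       # what the agent did
    observation: str  # what the environment returned

@dataclass
class ProcedureMemory:
    task: str
    steps: list[str]  # fine-grained, step-by-step instructions
    script: str       # high-level, script-like abstraction
    success: bool     # outcome of the original trajectory

def proceduralize(task: str, trajectory: list[Step], success: bool,
                  summarize: Callable[[str, list[str]], str]) -> ProcedureMemory:
    """Distill one trajectory into both representations (illustrative only)."""
    steps = [s.action for s in trajectory]  # keep every action, in order
    script = summarize(task, steps)         # an LLM would write this in practice
    return ProcedureMemory(task, steps, script, success)

def toy_summarizer(task: str, steps: list[str]) -> str:
    """Hypothetical stand-in for an LLM summarization call."""
    if not steps:
        return f"No steps recorded for '{task}'."
    return f"To {task}: start with '{steps[0]}' and finish with '{steps[-1]}'."

trajectory = [
    Step("search for flights", "12 flights found"),
    Step("filter by price", "3 flights left"),
    Step("book the cheapest option", "booking confirmed"),
]
mem = proceduralize("plan a trip", trajectory, True, toy_summarizer)
print(mem.steps)
print(mem.script)
```

Storing both fields on one record lets an agent fall back on the exact steps for a near-identical task, or on the script when the new task only resembles an old one.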

2. Memory Retrieval

Effective retrieval ensures agents access relevant procedures during task execution. Memp employs two strategies:

  • Query-based retrieval: Uses embeddings to match the current task’s semantic context with stored trajectories, ensuring relevance.

  • AveFact retrieval: Averages keyword similarities to target specific terms or metadata, such as task type or tools used.

These mechanisms let agents fetch applicable procedures for new tasks. For instance, for a household task like “clean the kitchen,” Memp can retrieve prior cleaning workflows that ended successfully.
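The two retrieval strategies can be illustrated with a toy sketch. Here a bag-of-words counter takes the place of a real embedding model, and a stopword filter takes the place of LLM-based keyword extraction. All function names are hypothetical. `query_based` compares the whole query with each stored task description. `ave_fact` scores each memory by averaging the similarity of the query's individual keywords.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "for", "and", "of", "in", "on"}

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts (stand-in for a real embedding model).
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def query_based(query: str, memories: dict[str, str], k: int = 1) -> list[str]:
    # Rank stored tasks by similarity of the whole query to each task description.
    q = embed(query)
    return sorted(memories, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

def ave_fact(query: str, memories: dict[str, str], k: int = 1) -> list[str]:
    # Average the similarity of each query keyword to a stored task.
    keywords = list(embed(query))  # hypothetical keyword extractor
    def score(task: str) -> float:
        t = embed(task)
        if not keywords:
            return 0.0
        return sum(cosine(Counter([kw]), t) for kw in keywords) / len(keywords)
    return sorted(memories, key=score, reverse=True)[:k]

# Keys are task descriptions; values are the stored procedures.
memories = {
    "clean the kitchen counter": "wipe surfaces, then put items away",
    "book a flight to Paris": "search, filter by price, book the cheapest",
    "heat an apple in the microwave": "find apple, open microwave, heat",
}
best = query_based("clean the kitchen", memories)
print(best, memories[best[0]])
print(ave_fact("clean the kitchen", memories))
```

Both functions return task keys. The stored procedure is then looked up as `memories[key]` and handed to the agent.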

3. Memory Updating

Memp’s dynamic updating ensures the memory repository evolves with experience. Three strategies facilitate this:

  • Validation filtering: Retains only accurate and successful trajectories, discarding those that lead to errors, such as procedures relying on outdated APIs.

  • Adjustment (reflection): Agents assess their performance and revise failed procedures, for example, refining a travel plan to account for layover times after a suboptimal booking.

  • Remove (deprecation): Obsolete or redundant memories are pruned to maintain a lean repository, preventing performance degradation.

This continuous updating cycle emulates human skill refinement, ensuring agents adapt to evolving task requirements or data formats.
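The three update strategies can be combined in a small store, sketched below under assumptions of our own. Failed trajectories are never stored. A retrieved procedure that fails is revised by a reflection function (an LLM in practice). A procedure that keeps failing is pruned. The class name `ProceduralStore`, the method `record_outcome`, and the `max_failures` threshold are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Entry:
    task: str
    procedure: str
    failures: int = 0  # failures since this procedure last succeeded

class ProceduralStore:
    def __init__(self, max_failures: int = 2):
        self.entries: dict[str, Entry] = {}
        self.max_failures = max_failures  # hypothetical deprecation threshold

    def add(self, task: str, procedure: str, success: bool) -> bool:
        # Validation filtering: only successful trajectories are stored.
        if not success:
            return False
        self.entries[task] = Entry(task, procedure)
        return True

    def record_outcome(self, task: str, success: bool,
                       reflect: Callable[[str], str]) -> None:
        entry = self.entries.get(task)
        if entry is None:
            return
        if success:
            entry.failures = 0  # a success resets the failure count
            return
        entry.failures += 1
        if entry.failures >= self.max_failures:
            del self.entries[task]  # Removal: prune a procedure that keeps failing
        else:
            entry.procedure = reflect(entry.procedure)  # Adjustment via reflection

store = ProceduralStore()
store.add("book travel", "search flights; book cheapest", success=True)
store.add("book hotel", "pick first result", success=False)  # filtered out
store.record_outcome("book travel", False,
                     reflect=lambda p: p + "; check layover times")
print(store.entries["book travel"].procedure)
store.record_outcome("book travel", False, reflect=lambda p: p)
print("book travel" in store.entries)
```

Counting consecutive failures before pruning is one simple policy. A real system might instead deprecate on age, redundancy, or validation against a changed environment.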

Technical Integration

Memp integrates with LLMs as an external memory module. The framework relies on vector embeddings and retrieval mechanisms to fetch relevant procedures efficiently. The paper does not prescribe deployment details. In practice, teams may implement retrieval with a vector database and choose whether to host memory locally or expose it through APIs. Further optimizations such as compression are engineering choices, not part of the framework as published. Importantly, Memp worked with both closed-source models (e.g., GPT-4o) and open-source ones (e.g., Qwen2.5-14B). Procedural memory built by a stronger model also improved a weaker one, effectively transferring knowledge between them.
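The external-module pattern can be sketched in a few lines. This sketch assumes procedures are stored as plain text and treats the model as any function from a prompt string to a completion. `build_prompt`, `run_with_memory`, and the echo model are hypothetical names used only for illustration.

```python
from typing import Callable

def build_prompt(task: str, procedures: list[str]) -> str:
    # Retrieved procedures are injected as plain text ahead of the task.
    if not procedures:
        return f"Task: {task}"
    guidance = "\n".join(f"- {p}" for p in procedures)
    return f"Relevant procedures from memory:\n{guidance}\n\nTask: {task}"

def run_with_memory(task: str,
                    retrieve: Callable[[str], list[str]],
                    llm: Callable[[str], str]) -> str:
    # The memory lives outside the model, so any LLM callable can be plugged in.
    return llm(build_prompt(task, retrieve(task)))

memory = {"clean the kitchen": "wipe counters, then put dishes away"}

def retrieve(task: str) -> list[str]:
    # Exact-match lookup; a real system would use embedding search.
    return [memory[task]] if task in memory else []

def echo_llm(prompt: str) -> str:
    # Stand-in for a real model such as GPT-4o or Qwen2.5-14B.
    return prompt

print(run_with_memory("clean the kitchen", retrieve, echo_llm))
```

The memory is ordinary text held outside the model, so the same repository could in principle be passed to a different `llm` callable. The paper reports this kind of transfer from GPT-4o to Qwen2.5-14B.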

Empirical Performance: Validated Results

The Memp framework was rigorously evaluated on two benchmarks: TravelPlanner, which tests travel itinerary planning, and ALFWorld, which focuses on household task automation. Key findings include:

  • Improved Success Rates: On TravelPlanner, GPT-4o's commonsense-constraint score (CS) rose from 71.93 with no memory to 79.94 with Proceduralization, reflecting more accurate planning. On ALFWorld, GPT-4o's success rate rose from 42.14% to 77.86% on the Test split and from 39.28% to 87.14% on the Dev split.

  • Transferability Across Models: Procedural memory constructed from a stronger model (e.g., GPT-4o) significantly boosts the performance of weaker models (e.g., Qwen2.5-14B). On TravelPlanner, Qwen2.5-14B gained +5% completion and −1.6 steps using GPT-4o’s memory, reducing reliance on resource-intensive LLMs.

  • Lifelong Learning: Memp enables agents to continuously improve without retraining, as evidenced by steady performance gains across evaluation rounds in both benchmarks.

These results highlight Memp’s ability to enhance efficiency and adaptability, making it suitable for tasks requiring sequential reasoning and dynamic adjustment.
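To gauge the size of these gains, the snippet below computes absolute and relative improvements from the GPT-4o figures quoted in this section. The relative gains are computed against the no-memory baseline. The paper may present them differently.

```python
# GPT-4o scores quoted above: (no memory, with Proceduralization).
results = {
    "TravelPlanner CS": (71.93, 79.94),
    "ALFWorld Test (%)": (42.14, 77.86),
    "ALFWorld Dev (%)": (39.28, 87.14),
}

def gains(before: float, after: float) -> tuple[float, float]:
    """Absolute gain (points) and relative gain (%) over the no-memory baseline."""
    absolute = round(after - before, 2)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative

for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

On the ALFWorld Test split, for example, the gain is 35.72 points, about 85% above the no-memory baseline.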

Applications: Industry Impact

Memp’s procedural memory architecture has significant implications for industries requiring adaptive, autonomous agents:

  • Enterprise Automation: Agents can learn and refine workflows for tasks like supply chain optimization or customer query resolution, adapting to disruptions such as inventory changes.

  • Healthcare: Diagnostic agents can store and refine procedural knowledge for analyzing medical data, potentially improving accuracy as cases accumulate.

  • Software Development: Coding agents can recall debugging patterns, streamlining development and reducing errors.

  • Robotics: Autonomous systems can optimize movements by learning from past trajectories, enhancing efficiency in manufacturing or logistics.

By supporting lifelong learning, Memp could help agents keep improving in dynamic environments such as cybersecurity or real-time analytics, where static models fall short.

Addressing Challenges

Procedural memory introduces challenges, some of which Memp's design helps mitigate:

  • Memory Obsolescence: Procedures may become outdated as task requirements change. Memp’s validation and deprecation strategies are designed to keep only relevant memories in the repository.

  • Privacy Concerns: Persistent memory risks storing sensitive data. Compliance with regulations like GDPR is critical, and Memp’s modular design supports transparent memory management.

  • Bias Risks: If initial trajectories are biased, agents may perpetuate those errors. Memp’s validation filtering keeps only successful outcomes. This limits the reuse of failed procedures, but it cannot by itself detect bias in trajectories that succeeded.

Together, these measures help make Memp more robust and trustworthy in production settings.

Future Directions: Toward Integrated Cognition

The Memp framework lays a foundation for advanced AI cognition. Future research could explore integrating procedural memory with other memory types to create holistic systems. Part 2 of this series will examine semantic memory architectures, exploring their role in grounding factual accuracy and complementing procedural systems.

Procedural memory architectures, exemplified by Memp, represent a significant leap in AI’s pursuit of human-like adaptability. By enabling agents to capture, retrieve, and refine task procedures, Memp fosters efficiency, resilience, and lifelong learning. Its empirical success on benchmarks like TravelPlanner and ALFWorld, coupled with its ability to enhance weaker models, underscores its transformative potential. As we continue this series, the interplay of memory architectures promises to redefine AI’s capabilities, bringing us closer to systems that evolve alongside us.

Citation: Zhejiang University and Alibaba Group. (2025). Memp: Exploring Agent Procedural Memory.


Discover more from Poniak Times
