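History token dropping: in the upper ViT layers (the last 4 layers), only the current frame's tokens are kept, so the output shape goes from (B*K, N, D) back to (B, N, D) and the number of tokens passed on for inference is exactly the same as in the single-frame case: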
# drop_history logic in Encoder.__call__
if use_video and lyr == drop_after:
    BK, N, D = x.shape
    B = BK // K
    x = x.reshape(B, K, N, D)[:, -1, :, :]  # keep only the last (current) frame
    use_video = False  # later layers process the B frames normally
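To make the indexing concrete, here is a plain-Python sketch of what the reshape-and-slice does. It assumes the frames are laid out batch-major (all K frames of sample 0, then all K frames of sample 1, and so on), which is what `reshape(B, K, N, D)` implies. The helper name `drop_history_frames` and the nested-list representation are illustrative and not part of the original code.

```python
def drop_history_frames(x, K):
    """Keep only the last (current) frame of each sample.

    x is a list of B*K frames in batch-major order; each frame is a list
    of N tokens. This mirrors x.reshape(B, K, N, D)[:, -1] on nested lists.
    """
    if K <= 0 or len(x) % K != 0:
        raise ValueError("len(x) must be a positive multiple of K")
    B = len(x) // K
    # Frame k of sample b sits at flat index b * K + k, so the current
    # frame (k = K - 1) of sample b is at b * K + (K - 1).
    return [x[b * K + (K - 1)] for b in range(B)]


# Toy input: B=2 samples, K=3 frames, N=2 tokens, each token tagged (b, k, n).
B, K, N = 2, 3, 2
x = [[(b, k, n) for n in range(N)] for b in range(B) for k in range(K)]
current = drop_history_frames(x, K)
print(len(x), "->", len(current))
```

After the drop, each sample contributes a single frame of N tokens, which is why the later layers see the same sequence length as a single-frame model.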
for t, subtask in enumerate(subtasks):
    # Build the history from successful steps only
    successful = [
        f"Step {i+1}: {s['instruction']}"
        for i, s in enumerate(subtasks[:t])
        if s.get("success", True)  # key point: record only the successful steps
    ]
    new_memory = self._call_llm(successful) if successful else ""
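A self-contained sketch of the success-only filtering follows, with the LLM call replaced by a stub that simply joins the step descriptions. `build_memory` and `summarize` are hypothetical names; the original `_call_llm` is not shown in the post, so its real behavior is not assumed here.

```python
def summarize(steps):
    # Stand-in for the LLM summarizer (hypothetical): just joins the lines.
    return "; ".join(steps)


def build_memory(subtasks, t, summarizer=summarize):
    """Return the language memory available before executing subtask t."""
    successful = [
        f"Step {i + 1}: {s['instruction']}"
        for i, s in enumerate(subtasks[:t])
        if s.get("success", True)  # a missing flag counts as success, as in the original
    ]
    return summarizer(successful) if successful else ""


subtasks = [
    {"instruction": "open the drawer", "success": True},
    {"instruction": "pick up the sponge", "success": False},
    {"instruction": "pick up the sponge"},  # retry, no explicit flag
    {"instruction": "wipe the counter", "success": True},
]
memory = build_memory(subtasks, t=3)
print(memory)
```

Step numbers keep their original positions, so a failed step leaves a gap in the numbering rather than shifting the later steps.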
# (A) Long-term language memory
if self.use_language_memory and obs.tokenized_memory is not None:
    memory_tokens = self.PaliGemma.llm(obs.tokenized_memory, method="embed")
    tokens.append(memory_tokens)
# (B) Image tokens (the video encoder fuses the history frames)
for name in obs.images:
    if self.use_video_memory and obs.image_history is not None:
        image_tokens = self._encode_video_frames(
            obs.image_history[name],  # (B, K-1, H, W, C)
            obs.images[name],         # (B, H, W, C)
        )
    else:
        image_tokens, _ = self.PaliGemma.img(obs.images[name], train=False)
    tokens.append(image_tokens)
# (C) Language prompt (original logic unchanged)
if obs.tokenized_prompt is not None:
    tokens.append(self.PaliGemma.llm(obs.tokenized_prompt, method="embed"))
# (D) Proprioceptive history (continuous linear projection, no text tokens)
if self.use_state_history and obs.state_history is not None:
    state_hist_tokens = self.state_history_proj(obs.state_history)  # (B, K-1, D)
    tokens.append(state_hist_tokens)
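Blocks (A) to (D) all append to the same `tokens` list, so the final prefix is presumably their concatenation along the sequence axis in that order. The sketch below only tracks sequence lengths to show how the prefix grows as each memory source is enabled. The function name `prefix_layout` and the example sizes (256 image tokens per camera, a 48-token prompt, and so on) are illustrative assumptions, not values from the original post.

```python
def prefix_layout(memory_len, image_lens, prompt_len, state_hist_len):
    """Return [(segment_name, start, end)] for the assembled prefix.

    A length of 0 (or None) means the segment is disabled or absent.
    Order follows blocks (A)-(D): memory, images, prompt, state history.
    """
    segments = [("language_memory", memory_len)]
    segments += [(f"image:{name}", n) for name, n in image_lens.items()]
    segments += [("prompt", prompt_len), ("state_history", state_hist_len)]

    layout, cursor = [], 0
    for name, length in segments:
        if not length:
            continue  # skipped segments contribute no tokens
        layout.append((name, cursor, cursor + length))
        cursor += length
    return layout


layout = prefix_layout(
    memory_len=32,
    image_lens={"base": 256, "wrist": 256},
    prompt_len=48,
    state_hist_len=5,  # K-1 history steps, one token each
)
for name, start, end in layout:
    print(f"{name:16s} [{start:4d}, {end:4d})")
```

Because the video encoder drops history tokens before its output, the image segments have the same length whether or not video memory is enabled; only the language memory and state history segments add to the prefix.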
# Episode loop
policy.reset_episode(task_goal="Clean the kitchen")
for step in range(max_steps):
    obs = env.get_observation()
    result = policy.infer(obs)  # maintains the frame cache and memory state automatically
    env.step(result["actions"])
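The comment says `infer` maintains the frame cache automatically, but the post does not show how. One plausible approach, sketched below with a hypothetical `FrameCache` class, is a fixed-length deque per camera that is cleared on episode reset and padded by repeating the oldest available frame until K-1 history frames exist. The padding strategy in particular is an assumption.

```python
from collections import deque


class FrameCache:
    """Rolling buffer of the last K-1 frames per camera (illustrative)."""

    def __init__(self, num_history):
        self.num_history = num_history
        self._buffers = {}

    def reset(self):
        # Called at the start of an episode so frames never leak across episodes.
        self._buffers.clear()

    def history(self, name, frame):
        """Return the K-1 history frames for `name`, then record `frame`."""
        buf = self._buffers.setdefault(name, deque(maxlen=self.num_history))
        past = list(buf)
        # Pad at the front: repeat the oldest known frame, or the current
        # frame itself when nothing has been seen yet.
        pad = past[0] if past else frame
        past = [pad] * (self.num_history - len(past)) + past
        buf.append(frame)
        return past


cache = FrameCache(num_history=3)
for step in range(5):
    hist = cache.history("base", f"frame{step}")
    print(step, hist)
```

In this sketch the history always has a fixed length, so the encoder input shape stays constant from the first step of an episode.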
10. Future Plans

[ ] Pretrained weights release: train π0.6-MEM base weights on open-source robot datasets
[ ] Real-robot validation: verify long-horizon task performance on ALOHA / SO-100 robot arms
[ ] Longer memory window: extend from 54 seconds (18 frames) toward the hour scale
[ ] Lightweight deployment: quantization and latency optimization for edge inference

References

Physical Intelligence, MEM: Multi-Scale Embodied Memory for Vision Language Action Models, 2026, arXiv
Physical Intelligence, π0: A Vision-Language-Action Flow Model for General Robot Control, 2024
Physical Intelligence, π0.5: A Vision-Language-Action Model with Open-World Generalization, 2025
Bertasius et al., Is Space-Time Attention All You Need for Video Understanding?, ICML 2021
openpi codebase: https://github.com/Physical-Intelligence/openpi