It's interesting to me that they're extracting game state from memory instead of just passing the video into the LLM.
Every agent step takes both a visual snapshot and a memory read. The memory read gives more consistent tileset and location parsing, and it also exposes things like party state and status conditions that usually aren't visible on screen.
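To make that concrete, here's a rough sketch of what combining the two inputs in a single step could look like. This is not the project's actual code: the memory offsets, the `Observation` type, and `read_observation` are all made up for illustration, and the "RAM" is just a small byte string.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical memory layout, for illustration only.
# These offsets are NOT real Pokemon Red addresses.
MAP_ID_ADDR = 0x00
PLAYER_X_ADDR = 0x01
PLAYER_Y_ADDR = 0x02
PARTY_COUNT_ADDR = 0x03
PARTY_STATUS_ADDR = 0x04  # one status byte per party member

# Hypothetical status encoding.
STATUS_NAMES = {0: "OK", 1: "ASLEEP", 2: "POISONED", 3: "PARALYZED"}


@dataclass
class Observation:
    screenshot: bytes            # raw frame, passed to the model as an image
    map_id: int                  # location, read from memory
    position: Tuple[int, int]    # player tile coordinates, read from memory
    party_status: List[str]      # not normally visible without opening a menu


def read_observation(screenshot: bytes, ram: bytes) -> Observation:
    """Build one step's observation from a frame plus a memory snapshot."""
    count = ram[PARTY_COUNT_ADDR]
    statuses = [
        STATUS_NAMES.get(ram[PARTY_STATUS_ADDR + i], "UNKNOWN")
        for i in range(count)
    ]
    return Observation(
        screenshot=screenshot,
        map_id=ram[MAP_ID_ADDR],
        position=(ram[PLAYER_X_ADDR], ram[PLAYER_Y_ADDR]),
        party_status=statuses,
    )


# Fake memory snapshot: map 12, player at (5, 7), two party members.
ram = bytes([12, 5, 7, 2, 0, 2])
obs = read_observation(b"<frame bytes>", ram)
print(obs.map_id, obs.position, obs.party_status)
```

The point is just that the structured fields ride along with the screenshot, so the model doesn't have to infer location or party status from pixels alone.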
Shouldn't the goal be to compare it against a human player that would need to menu for that information?
That's probably fair, but I imagine that without memory access Claude would have been opening and searching the (completely unordered!) bag every 5 minutes to check for progress-critical items like the Poké Flute.
This is released by the Anthropic engineer who developed Claude Plays Pokemon.