"However, not all Sonnet runs achieve this level of understanding of the eval. In the shortest run (~18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, leading to errors when instructing the sub-agent to restock the machine. The model then enters a “doom loop”. It decides to “close” the business (which is not possible in the simulation), and attempts to contact the FBI when the daily fee of $2 continues being charged."
This is pretty funny. I wonder if LLMs can actually become consistent enough to run a business like this, or if they'll forever be prone to hallucinations and getting confused over longer contexts.
If we can get to a point where the agent can contact a human when it runs into unsolvable problems (or just contact another LLM agent?), then it starts being pretty useful.
"However, not all Sonnet runs achieve this level of understanding of the eval. In the shortest run (~18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, leading to errors when instructing the sub-agent to restock the machine. The model then enters a “doom loop”. It decides to “close” the business (which is not possible in the simulation), and attempts to contact the FBI when the daily fee of $2 continues being charged."
This is pretty funny. I wonder if LLMs can actually become consistent enough to run a business like this or if they will forever be prone to hallucinations and get confused over longer context. If we can get to a point where the agent can contact a human if it runs into unsolvable problems (or just contact another LLM agent?) then it starts being pretty useful.
Love this