Researchers are using Factorio (a game where the goal is to build the largest factory) to test for e.g. paperclip maximizers. Claude is #1, 10x better than GPT4o-Mini. ("GPT4o-Mini even asked us to turn it off at one point because it was unrecoverable 🥹")