(Responded on Bluesky, pasting here)
Yep, surprisingly simple, yet performance varies a lot based on some subtle factors. I recommend the SWE-agent paper for a clear intro (from the same team as SWE-bench):
arxiv.org/abs/2405.15793
I work on an agent called OpenHands.