#6 LeadDev's The Shift: Do we really care about SWE-bench?
Plus: James Stanier wants to distill your leadership wisdom
When it comes to AI models and coding agents, it’s easy to obsess over public benchmarks like SWE-bench. They give a tidy view of performance, and a good starting point for evaluating the many large language models (LLMs) available, a lineup that can change at any moment. But they also have significant limitations.
The big story
Benchmarks like SWE-bench, HumanEval, LiveCodeBench, or the latest one – Crosscheck by LinkedIn Labs – take different approaches to measuring how well models complete real coding tasks.
But, as Lizzie Matusov writes in Research-Driven Engineering Leadership, these benchmarks are pretty limited.
When engineering teams pick an AI-coding agent to roll out, they often lean on public benchmarks like SWE-bench to compare models. But those benchmarks are built from Python GitHub issues, while your developers are typing rough, context-laden requests into a chat window inside a giant private monorepo.
How those models eventually perform in real-world circumstances comes down to far more than their scores on those benchmarks.
“SWE-bench has burned me in the past. There was a model on top of the list a year ago that was getting things wildly wrong, so I started tuning it out honestly,” James Garret, AI enablement lead at Tilt, told me. “I’ve also seen a couple of engineers become so concerned with the models and testing them repetitively that they are just caught up in the noise of it and kind of forget that they are a tool to deliver business value with.”
In a world where being seen as an AI leader in your organization is valuable, some engineers will over-invest in switching models and reporting back to gain that visibility, without delivering any real value.
As Ravi Mehta points out in Ravi on Product, Anthropic’s models often come out second best on the popular benchmarks, but that doesn’t tell the whole story. Look at these two charts:
“OpenAI is competing at the model layer. Anthropic is competing at the platform layer — and that’s where applied AI gets won.”
“Benchmarks aren’t irrelevant. A significantly less capable model would have lost regardless of the platform it’s plugged into. Hitting benchmarks is necessary, not sufficient.”
While benchmarks are valuable for initial shortlisting, the decision should never end there. What this also shows is just how bunched up these models are at the top end. Six models now score within 0.8 points of each other on SWE-bench Verified, with three of them launching in the last five weeks. Instead of obsessing over those 0.8 points, or arguing about them online, the more important question is how to make a model work for you and your use case.
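To put that 0.8-point spread in perspective, here is a rough back-of-the-envelope sketch. It relies on the fact that SWE-bench Verified contains 500 tasks, and it assumes a pass rate of around 80% purely for illustration (that figure is hypothetical, not taken from any leaderboard). It also treats each task as an independent pass/fail trial, which is a simplification.

```python
import math


def pass_rate_standard_error(pass_rate: float, num_tasks: int) -> float:
    """Binomial standard error of a pass rate, in percentage points."""
    return 100 * math.sqrt(pass_rate * (1 - pass_rate) / num_tasks)


NUM_TASKS = 500           # number of tasks in SWE-bench Verified
ASSUMED_PASS_RATE = 0.80  # hypothetical score, chosen only for illustration
GAP_POINTS = 0.8          # spread between the top models, from the text above

# How many tasks does a 0.8-point gap correspond to?
tasks_in_gap = GAP_POINTS / 100 * NUM_TASKS

# How much would a single model's score wobble from sampling noise alone?
se = pass_rate_standard_error(ASSUMED_PASS_RATE, NUM_TASKS)

print(f"A {GAP_POINTS}-point gap is about {round(tasks_in_gap)} tasks")
print(f"Standard error at this pass rate: about {se:.1f} points")
```

Under these assumptions, the gap between the bunched-up models amounts to a handful of tasks and is smaller than the sampling noise on a single score, which is one more reason not to read too much into it.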
Why this matters
Because context matters.
“Tooling can matter as much as the model. A richer harness (with roughly 3x the tools of a simpler baseline) consistently outperformed the basic one. Adding developer-authored Context Files lifted the weaker harness by 6.4pp, but on the stronger harness the same files added ‘context noise without new information’ and slightly hurt performance,” Matusov wrote.
To test this out, researchers at Meta built REAP (Relevance and Execution-Audited Pipeline), “an automated curation pipeline that constructs evaluation benchmarks directly from real developer-agent sessions in their monorepo.”
The paper points towards a better way to evaluate models in the real world. Rather than relying on sanitized public benchmarks, REAP evaluates models based on real developer-agent sessions. This does require building telemetry into your agentic systems, but it is the most reliable way to actually evaluate them.
On the other hand, Garret says he “put a lot of thought into building out my own benchmarks and gave up because I am OK with subjectiveness right now. If a developer thinks a model is better, let them have the tool they feel most productive with. That’s more streamlined than benchmarking the models yourself in my opinion.”
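For teams that do want to try the session-based route, the core idea can be sketched in a few lines. To be clear, this is not REAP's actual pipeline, which the paper describes in far more detail. It is a minimal illustration that assumes you already log developer-agent sessions and can replay each one against a pass/fail check. The `SessionResult` record, the model names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionResult:
    session_id: str  # hypothetical ID of a logged developer-agent session
    model: str       # model that was replayed on the session's task
    passed: bool     # whether the replayed task passed its check


def pass_rates(results: list[SessionResult]) -> dict[str, float]:
    """Return the fraction of replayed sessions each model passed."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.model] += 1
        passes[result.model] += int(result.passed)
    return {model: passes[model] / totals[model] for model in totals}


# Made-up results for three sessions replayed against two placeholder models.
sample = [
    SessionResult("s1", "model-a", True),
    SessionResult("s2", "model-a", True),
    SessionResult("s3", "model-a", False),
    SessionResult("s1", "model-b", True),
    SessionResult("s2", "model-b", False),
    SessionResult("s3", "model-b", False),
]

for model, rate in pass_rates(sample).items():
    print(f"{model}: {rate:.0%} of replayed sessions passed")
```

The hard part is everything this sketch skips: capturing sessions in the first place, deciding which ones are worth keeping, and defining a trustworthy check for each. That overhead is exactly what Garret is weighing against simply letting developers choose.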
Hot links
Distilling leadership wisdom - The Engineering Manager
Another great example of how James Stanier thinks outside the box when it comes to working with agents.
This one is particularly wild: “You can create your own coaches and advisors from some of the smartest minds out there, using nothing more than their interviews, podcasts, and talks.”
I don’t think AI will make your processes go faster - Frederick Vanbrabant
Yes, AI can generate code quickly (whether that’s a good thing is open for debate), but that doesn’t mean it’s generating the correct code.
In comparisons between human vs AI development they always ignore the handholding that is needed for AI to do its thing.
The AI-native developer - Engineering Enablement
Brian Houck, now an Applied Scientist at DX, digs into a paper he co-authored, which surveyed 1,300 developers and interviewed 22 AI-fluent practitioners.
The framing of AI as a productivity tool—one that reduces toil and frees up time for meaningful work—may be too simple. The tasks developers find tedious aren’t necessarily the ones they trust AI to handle. And the tasks they trust AI to handle aren’t necessarily the ones creating the gap.
The more useful question for leaders isn’t “what can we automate?” but “are our tools earning the trust to touch what developers find meaningful?” and “what about the work is worth protecting?”
Andrej Karpathy, Tesla Alum and OpenAI Co-Founder, Joins Anthropic - Wall Street Journal
Another big coup for Anthropic. The OpenAI cofounder will go back to his roots by joining Anthropic’s pretraining team, overseeing the data training process for Claude models.
How I Choose Which Cloudflare Employees to Replace With AI - Matthew Prince
A typically bullish assessment of things from the Cloudflare CEO.
Upcoming events
LDX3 London
Our biggest event of the year, LDX3, kicks off in less than two weeks in London and you can still be there. We’ve got Justin Reock, Michael Lopp, Maude Lemaire, and more speaking, as well as big debates, table talks, and most importantly, I’ll be giving a sneak peek into our latest engineering leadership research findings, on stage at the start of day 2.






