Huge issue with reasoning model benchmarks

After seeing some initial thoughts on QwQ, the consensus seems to be that it is a solid model, but it uses 2-3x as many tokens as R1 to solve the same tasks. And if we then compare R1 to o3-mini, R1 is the one that often uses more tokens for the same tasks (which makes o3-mini's price point very competitive in practice, something people often overlook).

And you cannot see this quantified anywhere on any official benchmark. Token usage like this heavily influences the real cost of actually deploying/using these models, so there really needs to be a control for it on some notable benchmark, tbh.

Specifically: how many tokens does it take to reach the claimed benchmark scores, and what does the effective pricing look like once that is factored in? The effect on latency should also be considered. And lastly, how do other models perform when they are encouraged to 'think harder', so that their token output matches what the model in question used to reach its claimed numbers?
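To make the pricing point concrete, here is a minimal sketch of cost per solved task. All prices and token counts below are made-up placeholders, not measured figures for any real model; the point is just that a model with cheaper per-token pricing can still be the more expensive one per task if it reasons with 3x the output tokens:

```python
def cost_per_task(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one task, given per-million-token prices."""
    return (prompt_tokens / 1e6) * input_price_per_m + \
           (completion_tokens / 1e6) * output_price_per_m

# Model A: cheaper per token, but reasons with ~3x the output tokens.
# Model B: 2x the per-token price, but terse chains of thought.
model_a = cost_per_task(prompt_tokens=1_000, completion_tokens=9_000,
                        input_price_per_m=0.50, output_price_per_m=2.00)
model_b = cost_per_task(prompt_tokens=1_000, completion_tokens=3_000,
                        input_price_per_m=1.00, output_price_per_m=4.00)

print(f"Model A: ${model_a:.4f} per task")  # $0.0185
print(f"Model B: ${model_b:.4f} per task")  # $0.0130 -- the 'pricier' model wins per task
```

The per-token price list alone says Model B is twice as expensive, but the benchmark-relevant number (cost per solved task) says the opposite.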

There is a clear connection between longer CoTs and increased accuracy for reasoning models; o3-mini-high vs o3-mini default is a great example.
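If you want to check this yourself, here's a rough sketch that compares token usage across effort levels, assuming the `reasoning_effort` parameter OpenAI exposes for its o-series models (the prompt is just a placeholder; `completion_tokens` includes the hidden reasoning tokens, which is exactly the number that should show up next to benchmark claims):

```python
from openai import OpenAI

client = OpenAI()
prompt = "How many primes are there below 100?"  # placeholder task

# Run the same task at each effort level and log how many completion tokens
# (reasoning tokens included) it takes to produce an answer.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{effort:>6}: {resp.usage.completion_tokens} completion tokens "
          f"-> {resp.choices[0].message.content[:60]!r}")
```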