wincy 1 days ago [-]
Just tried it out for a prod issue I was experiencing. Claude never does this sort of thing. I had it write an update statement after doing some troubleshooting, and I said “okay, let’s write this in a transaction with a rollback”, and GPT-5.5 gave me the old “okay,
BEGIN TRAN;
-- put the query here
commit;
I feel like I haven’t had to prod a model to actually do what I told it to in a while, so that was a shock. I guess it does use fewer tokens that way; it’s just annoying, when I’m paying for the “cutting edge” model, to have it be lazy on me like that.
This is in Cursor; the model popped up in the model selector, so I tried it out.
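For readers unfamiliar with the pattern being asked for: run the UPDATE inside an explicit transaction, inspect the result, then roll it back. A toy sketch in Python/sqlite3 rather than T-SQL, with a made-up table:

```python
import sqlite3

# Autocommit mode so we can manage the transaction manually with BEGIN/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending'), (2, 'pending')")

# Dry run: do the UPDATE inside an explicit transaction and inspect the result
# before deciding whether to commit or roll back.
conn.execute("BEGIN")
cur = conn.execute("UPDATE orders SET status = 'shipped' WHERE status = 'pending'")
print("rows affected:", cur.rowcount)  # verify the blast radius first
conn.execute("ROLLBACK")               # undo; swap for COMMIT once satisfied

# After the rollback, the data is untouched.
print(conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0])
```

In T-SQL the equivalent wrapper is `BEGIN TRAN; ... ROLLBACK TRAN;` (or `COMMIT TRAN` once the row counts look right).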
XCSme 1 days ago [-]
I feel like the last 2-3 generations of models (after gpt-5.3-codex) didn't really improve much; they just changed stuff around and made different tradeoffs.
pixel_popping 1 days ago [-]
I disagree; it improved enormously, especially at staying consistent on long tasks. I have a task that has been running for 32 days (400M+ tokens) via Codex, and that's only been possible since gpt-5.4.
ericpauley 1 days ago [-]
Has that task accomplished anything yet?
codemog 1 days ago [-]
I think the OP is in for a rude surprise when the task is “finished”.
hagbard_c 1 days ago [-]
It will go somewhat like this:
"You're really not going to like it," observed Codex.
"Tell us!"
"All right," said Codex. "The answer to your Great Question..."
"Yes...!"
"Is..." said Codex, and paused.
"Yes...!"
"Is..."
"Yes...!!!...?"
"Forty-two," said Codex, with infinite majesty and calm.
pixel_popping 1 days ago [-]
I bet you've asked Codex for that joke :p
xp84 1 days ago [-]
Too soon to tell, give it a billion tokens before we make up our minds
pixel_popping 1 days ago [-]
Oh boy, you are far from what it requires; we are probably talking 3B+. But note that this is just Codex. Obviously Codex is also doing automatic adversarial runs against the regular zoo (gemini-3.1-pro-preview, opus-4.6/4.7, gpt-5.3-codex, minimax-2.7, glm-5.1, mimo-2 (now 2.5) and so on, you get the gist) :)
fl4regun 1 days ago [-]
what is that task doing???
anonym00se1 1 days ago [-]
The correct question is: what isn't that task doing?
owebmaster 17 hours ago [-]
Interesting that they had the opportunity to explain but decided that hyping it up more made more sense. 3 billion tokens!!1!
elAhmo 1 days ago [-]
It made Sam richer.
pixel_popping 12 hours ago [-]
I don't know their margins so I can't really say, but we do have 8 OpenAI accounts, and I doubt they are making that much from us, seeing that there isn't a single hour where we don't saturate the accounts.
cheevly 10 hours ago [-]
Wtf are you even talking about? Sam has zero stake in OpenAI.
elAhmo 8 hours ago [-]
Of course he doesn't.
SecretDreams 1 days ago [-]
Kept the OP employed for a full extra month at their high AI metric firm, hopefully.
pixel_popping 1 days ago [-]
Just making Jensen proud is all.
lowdude 1 days ago [-]
That’s actually crazy, what kind of task is that? And is that a recurring kind of task like some analysis, or coding related?
pixel_popping 1 days ago [-]
Coding (along with docs and tests, obviously): rewriting a huge chunk of the KVM hypervisor (in kernel 7, started at -rc2) plus KSM and other modules. Can't say too much about it yet (might do an announcement in the coming weeks). The coding is automated, but the plan took days of manual arguing (with all models possible) beforehand, while I did other things during the waiting times, as I currently manage 70 repos for an upcoming release of our beta.
I think users really underestimate the capabilities of "AI" when using the right tooling, combinations of models, procedures, and loops. That's speaking with two decades of dev behind me. Genuinely, I'm not in phase with people saying it produces slop of any kind; at this stage it's mostly the fault of the prompter (or the prompter not having enough tokens to do mass adversarial runs). I can genuinely state that the code produced is overall the SAME quality as I would produce by being extremely meticulous.
I'm like a bot following 30+ threads concurrently. Sometimes it's fun, sometimes it feels like playing casino, sometimes it's boring, but this is truly an insane era if you have the funding for it. Obviously we stack many, MANY accounts in rotation 24/7; the equivalent API cost doing it myself would be about $100K+ a month, but we pay only a fraction of that thanks to the plans.
PS: I have 8 monitors in front of me to manage all that (portable monitors stacked together).
Urahandystar 1 days ago [-]
Please do an update when you're ready, this sounds like madness to me so I'd love to see what the output is. Whatever it is I have to know.
owebmaster 17 hours ago [-]
Typical AI psychosis. They might notice it soon or stay in this condition for months.
pixel_popping 13 hours ago [-]
I don't think you really grasp the direction the world is taking, or really understand AI capabilities when it's all put together to reach high automation. You might not agree with it or embrace it yet, but you will be joining the loop wagon soon enough.
owebmaster 8 hours ago [-]
Yeah right. Sam Altman is as high as you on this drug, but you both are going to wake up soon.
pixel_popping 7 hours ago [-]
Can you explain further? Most especially, why do you see AI stopping anytime soon rather than just getting insanely better and better for the next decades (whether through combinations of models or models alone, harnesses or whatever; that's just a technicality)?
Why would I need to "wake up"?
owebmaster 3 minutes ago [-]
Is what you're working on public? Publish it and let us know how it goes.
AlexCoventry 1 days ago [-]
Is it hitting intermediate milestones with solid pre-written and human-reviewed acceptance tests? If not, sounds like a very risky commitment.
ericreg92 1 days ago [-]
Please do a post about this (though I realize that takes time). This sounds amazing. I have always dreamed of doing this too but just don't have the budget.
stirfish 1 days ago [-]
Specifically, write a post about this and do not have Claude write a post about this.
PeterStuer 11 hours ago [-]
I'm also in that boat of not understanding how people fail to get a huge productivity boost from GenAI. And it's not just novices, but sometimes seriously accomplished coders. It can't be that they're just typing 'Make me an ERP' and then going 'these things are dumb slop machines', right?
7thpower 1 days ago [-]
I have yet to talk to someone who is taking this approach and doesn’t end up with a dumpster fire, but here is to hoping this time is different.
Hope it works and you post about it.
Culonavirus 1 days ago [-]
I hope it doesn't work and they don't post about it.
ziml77 1 days ago [-]
It's just too bad the subsidized costs mean they won't actually feel any real punishment for their failure. Like normally time wasted on its own is enough of a punishment for making a poor decision, but they're not even doing anything themselves here!
jamwil 1 days ago [-]
I’m vague on the specific reason for this feeling, because there are a few to choose from and none overpowers the others, but the emotion that comes to mind when I read this is disgust. As a society, I feel we will look back on the subsidized opulence of this moment with total and utter contempt.
deaux 19 hours ago [-]
I know exactly the feeling you mean. I get a much stronger feeling of that when I talk with friends who frequently take a plane for a 250 mile trip which has a world-class comfortable high-speed train connection with very frequent trains, each taking less than 3 hours. I'm sure you have friends who would do this in this situation - do you feel the same disgust when you hear them talking about such choices?
I still haven't seen a single person who actually cares about the environment and has willingly made significant sacrifices for it, who clamors about the environmental cost of AI. Every time I see someone do it it's someone who never cared about this before, and still doesn't really. Who buys plenty of new clothes and furniture, loves a good burger, has the latest iPhone, flies 4 times per year.
Maybe you're the unicorn in which case fair enough, you've earned the right to feel disgusted.
holmesworcester 1 days ago [-]
Or nostalgia for simpler times
jamwil 1 days ago [-]
That as well. But everyone reading GP’s posts knows in their bones that it’s unsustainable. It’s economically unsustainable and environmentally unsustainable, and in that context it strikes me as pure hoarding behaviour. Taking as much as they can for themselves before the house of cards crashes down.
I have no sympathy for OpenAI or Anthropic as corporations, but if these are the new tools of the trade, then platform abuse like GP is bragging about serves only to destroy the livelihoods of the rest of us who are content to use our fair share.
There’s no such thing as a free lunch, and the bill always comes at the end.
ragequittah 10 hours ago [-]
I mostly hate it because the token crunch is now coming for us regular users because of people like this. A few people always ruin it for the rest of us.
jamwil 10 hours ago [-]
Yea. It’s greed, pure and simple. And also a major misstep on the part of the inference providers to offer these subsidized plans and not anticipate these slop mills.
owebmaster 17 hours ago [-]
There's no opulence in spending tokens for entertainment. Vibecoding your own game is the new viral game.
owebmaster 17 hours ago [-]
> (might do an announcement in coming weeks).
Don't be surprised if/when people ignore your AI slop
r_lee 1 days ago [-]
...what? what kind of a task are you running?
ninkendo 1 days ago [-]
Sorry if I’m not getting it, but what was wrong exactly? Is the issue that it merely put “-- put the query here” in the reply, instead of repeating it again?
If so, I’m not sure I’d even consider that a problem. If the goal is for it to give you a query to run, and you ask it “let’s do it in a transaction”, it’s a reasonable thing for it to simply inform you, “yeah you can just type begin first” since it’s assuming you’re going to be pasting the query in anyway. And yeah, it does use fewer tokens, assuming the query was long. Similar to how, if it gave me a command to run, and I say “I’m getting a permission denied”, it would be reasonable for it to say “yeah do it as root, put sudo before the command”, and it’s IMO reasonable if it didn’t repeat the whole thing verbatim just with the word “sudo” first.
But if the context was that you actually expected it to run the query for you, and instead it just said “here, you run it”, then yeah that’s lazy and I’d understand the shock.
endymi0n 1 days ago [-]
OpenAI is the first company that has reached a level of intelligence so high, the model has finally become smart enough to make YOU do all the work. Emergent behavior in action.
Joking aside, OpenAI’s oddly specific, singular focus on “intelligence per token” (also in the benchmarks), which literally no one else pushes so hard, eerily reminds me of Apple’s MacBook anorexia era pre-M1. One metric to chase at the cost of literally anything else. GPT-5.3+ are some of the smartest models out there and could be a pleasure to work with, if they weren’t lazy bastards to the point of being completely infuriating.
syspec 1 days ago [-]
Can't tell if above is good or bad.
wincy 1 days ago [-]
I mean, I was doing triage, so I wanted an immediate fix. The actual issue is that we’re getting some exploding complexity when double-checking that the action the API is taking is valid against the data, so that needs to be refactored. I suppose it reduces token usage, but Claude Opus will happily do exactly what I want it to.
hbn 1 days ago [-]
GPT-5.5 shatters benchmarks for amount of faith it puts in the user.
I know it's only on a single benchmark, but I don't understand how it can be so bad...
Art9681 13 hours ago [-]
A junior tinkering in their garage, in domains they have little experience in, executed a flawed test and decided to call it a benchmark. It's extremely common nowadays, because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.
HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.
goldenarm 1 days ago [-]
gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong
guilamu 1 days ago [-]
Yes, those two models were tested on my own PC (local inference using my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.
data-ottawa 15 hours ago [-]
The early quants for Gemma4 26b had issues and needed to be updated, might be worth checking
embedding-shape 1 days ago [-]
Sounds like maybe a worse quantization was used on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in a benchmark, it probably should be.
ac29 1 days ago [-]
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.
guilamu 1 days ago [-]
Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to Gemini 3.1 Pro's evaluation.
ac29 1 days ago [-]
Your table doesn't indicate reasoning vs non-reasoning, or reasoning level
guilamu 1 days ago [-]
When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, where available).
The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek v4 was tested through Cline (with max reasoning too).
mosselman 1 days ago [-]
You even traveled in time to deliver us this benchmark.
I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to setup my own similar benchmark.
guilamu 1 days ago [-]
Haha, just fixed the date!
I haven't evaluated the judge benchmark. You have everything needed in the repo to do so, though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.
BTW, if you explore the repo, sorry for all the French files...
DrProtic 1 days ago [-]
Seems like benchmark for how good a model is for vibe coding.
Your prompt is extremely slim yet you score it on a bunch of features.
guilamu 1 days ago [-]
Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".
That’s the thing, not everyone wants and values the model based on that. But I guess it works for you, and that benchmark achieves it.
I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.
I found 5.4/5.5 much better at following spec while Opus makes some things up, which aligns with your benchmark but that makes 5.4/5.5 better for me while worse for you.
guilamu 1 days ago [-]
Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.
What strikes me as very strange, though, is that zero models were able to just use the search input already present on the GravityForms forms list page; all of them created a second input.
Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.
gizmodo59 1 days ago [-]
[dead]
Topfi 1 days ago [-]
Pricing by context length:
Input: $5/M tokens at <=272K, $10/M tokens above 272K.
Output: $30/M tokens at <=272K, $45/M tokens above 272K.
Cache read: $0.50/M tokens at <=272K, $1/M tokens above 272K.
Significantly more expensive than Opus 4.7 beyond 272K, and at least in my tasks I haven't seen the model be that much more token efficient, certainly not to a degree that compensates for the difference. GPT-5.4 had a solid context window at 400K with reliable compaction; both appear somewhat regressed, though it's still too early to say whether compaction is truly less reliable. Also, I have found frontend output still skews toward that one very distinct, easily noticeable, card-laden, bluesy-hued, overindulged template that made me skeptical of Horizon Alpha/Beta before GPT-5's release. It ended up doing amazingly at the time on task adherence, which made it very useful for me outside that one major deficit. The fact that GPT-5.5 is still so restricted in that area is weird, considering it's supposed to be an entirely new foundation.
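As a sanity check on those numbers, the tiering works out to something like this (a sketch using the rates quoted above; it assumes the whole request is billed at the higher tier once input context crosses 272K, which may not match OpenAI's exact accounting):

```python
def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate dollar cost of one request under the tiered pricing quoted above.

    Assumption: the entire request is billed at the higher tier once the
    combined input context exceeds 272K tokens (per-request tiering).
    """
    long_context = input_tokens + cached_tokens > 272_000
    in_rate, out_rate, cache_rate = (
        (10.0, 45.0, 1.0) if long_context else (5.0, 30.0, 0.50)
    )
    return (
        input_tokens * in_rate
        + output_tokens * out_rate
        + cached_tokens * cache_rate
    ) / 1_000_000

# A 200K-token prompt with 5K output stays in the cheap tier:
print(round(request_cost(200_000, 5_000), 2))  # 1.15
```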
gertlabs 1 days ago [-]
Comprehensive coding reasoning benchmark results for GPT 5.5 with max reasoning are up at https://gertlabs.com/
Live decision and heavier agentic evals will continue being uploaded for 24 hours but I don't expect its leaderboard position to change at this point.
GPT 5.5 is the most intelligent public model. And significantly faster than its predecessor.
sigmoid10 1 days ago [-]
Huh. Yesterday they said:
>API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale.
And now this. I guess one day counts as "very soon." But I wonder what that meant for these safeguards and security requirements.
FINDarkside 1 days ago [-]
When stuff is delayed due to "safeguards" it just means they don't think they have the compute to release it right now.
simonw 1 days ago [-]
I wonder if the fact that GPT-5.5 was already available in their Codex-specific API which they had explicitly told people they were allowed to use for other purposes - https://simonwillison.net/2026/Apr/23/gpt-5-5/#the-openclaw-... - accelerated this release!
embedding-shape 1 days ago [-]
The same person who's mercilessly lied about safety is still running the company, so I'm not sure why anyone would expect any different from them moving forward. Previous example:
> In 2023, the company was preparing to release its GPT-4 Turbo model. As Sutskever details in the memos, Altman apparently told Murati that the model didn’t need safety approval, citing the company’s general counsel, Jason Kwon. But when she asked Kwon, over Slack, he replied, “ugh . . . confused where sam got that impression.”
Am I the only one that feels OpenAI has bots/commenters on payroll on all this kind of news, downplaying Claude and stating how much better Codex is?
There are too many of them, and some of their takes don’t fly if you use Claude daily.
AlexCoventry 1 days ago [-]
Yeah, it's eerie, same with how everyone seems to have forgotten that OpenAI betrayed democracy by committing to work on unsupervised autonomous weapons and domestic mass surveillance.
int3trap 22 hours ago [-]
Honestly I find comments like yours much more eerie. By all accounts they never agreed to any of that but you say it with such confidence like it's a fact.
AlexCoventry 21 hours ago [-]
The Trump administration's handling of Anthropic showed that regardless of what the contract or the law says or means, they will severely penalize any vendor who refuses their demands. And OpenAI stepped right into that relationship immediately after the administration showed that. So either they were signing up for a supply-chain risk designation and whatever other punishments the Trump administration dreams up, or they're complying.
If this sounds crazy to you, though, I'd like to know, and understand why. I miss ChatGPT/Codex.
psteinweber 17 hours ago [-]
I find that very obvious too. It started (visibly) shortly after the Opus 4.6 hype.
Aboutplants 1 days ago [-]
Of course they do. As do all of the other companies pushing their product these days.
neosat 1 days ago [-]
Enterprise user here and still seeing only 5.4.
Yesterday's announcement said that it will take a few hours to roll out to everybody. OpenAI needs better GTM to set the right expectations.
neosat 1 days ago [-]
Just refreshed and see 5.5 now - yay! Love the speedy resolution ;) Thanks folks, I'll complain faster next time....
API page lists the knowledge cutoff as Dec 01, 2025 but when prompting the model it says June 2024.
Knowledge cutoff: 2024-06
Current date: 2026-04-24
You are an AI assistant accessed via an API.
BeetleB 1 days ago [-]
I don't know why this keeps coming up. This has always been the least reliable way to know the cutoff date (and indeed, it may well have been trained on sites with comments like these!)
Just ask it about an event that happened shortly before Dec 1, 2025. Sporting event, preferably.
czk 1 days ago [-]
The model obviously knows things after the reported date, but it's just curious that it reports that date consistently.
could be they do it intentionally to encourage more tool calls/searches or for tuning reasons
jumploops 1 days ago [-]
[dead]
htrp 1 days ago [-]
Can you really believe things that the model says? (A lot of prior model api pages say knowledge cutoffs of June 2024, maybe the model picks that up?)
czk 1 days ago [-]
You can't, but it's pretty reproducible across the API, Codex, and other agents, so I just thought it was odd. Full text it gives:
Knowledge cutoff: 2024-06
Current date: 2026-04-24
You are an AI assistant accessed via an API.
# Desired oververbosity for the final answer (not analysis): 5
An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using
concise phrasing and avoiding extra detail or explanation."
An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and
possibly multiple examples."
The desired oververbosity should be treated only as a *default*. Defer to any user or developer requirements regarding
response length, if present.
bakugo 1 days ago [-]
Models don't know what their cutoff dates are unless told via a system prompt.
The proper way to figure out the real cutoff date is to ask the model about things that did not exist or did not happen before the date in question.
A few quick tests suggest 5.5's general knowledge cutoff is still around early 2025.
czk 1 days ago [-]
i wonder if they put an older cutoff date into the prompt intentionally so that when asked on more current events it leans towards tool calls / web searches for tuning
ssl-3 1 days ago [-]
I wonder if the cutoff date is the result of so many people posting about the date over time and poisoning the data. "Dead cutoff date theory," perhaps.
Whatever it is, the cutoff date reporting discrepancy isn't new. Back when Musk was making headlines about buying/not buying Twitter, I was able to find recent-ish related news that was published well after the bot's stated cutoff date.
ChatGPT was not yet browsing/searching/using the web at that point. That tool didn't come for another year or so.
MallocVoidstar 1 days ago [-]
OpenAI does tell the model the current date via API, so it's odd for them not to also tell the model its cutoff
soco 1 days ago [-]
Stupid question: wouldn't it then search the web for that event?
bakugo 1 days ago [-]
If you have web search enabled, sure. But if you're testing on the API, you can just not enable it.
swyx 1 days ago [-]
can u test it on say who won the 2024 US election
ghurtado 1 days ago [-]
I can't really think of a less reliable test for anything at all than making a random guess as to something that had about 50/50 odds to begin with
Easiest Turing test ever...
himata4113 1 days ago [-]
ask it 10 times.
pixel_popping 1 days ago [-]
MASSIVE ADVERSARIAL x50
WarmWash 1 days ago [-]
Usually the labs do some kind of post training on major events so the model isn't totally lost.
A better test is something like "what is the latest version of NumPy?"
bakugo 1 days ago [-]
That sort of test isn't super reliable either, in my experience.
You're probably better off asking something like "what are the most notable changes in version X of NumPy?" and repeating until you find the version at which it says "I don't know" or hallucinates.
czk 1 days ago [-]
with thinking off and tools disabled:
Donald Trump won the 2024 U.S. presidential election.
redsocksfan45 1 days ago [-]
I thought that one specifically was placed in the default system prompts of basically all providers.
jumploops 1 days ago [-]
[dead]
robertwt7 1 days ago [-]
GPT-5.5 combined with Codex is really good. I actually have no doubt whenever I ask it questions, plan, or implement code with it. With Opus 4.7, I have to keep double-checking because it doesn't follow the CLAUDE.md instructions, it hallucinates a lot, and by default it makes things up when it can't find the answer to something. It's crazy how quickly people were saying OpenAI was left behind last year when they declared code red, and look at where we are now.
zerof1l 1 days ago [-]
I don't see any meaningful performance improvements in those paid models anymore.
They all roughly produce junior developer-level code, continue to have mental breakdowns in their “thinking” stage, occasionally hallucinate things, delete pieces of code/docs they don’t understand or don’t like, use 1.5 times the necessary words to explain things when generating docs and so on.
I'm now testing "avoid sycophancy, keep details short and focus on the facts" in my AGENTS.md files.
podnami 1 days ago [-]
This is snark. Since when has a junior level dev managed to debug and deploy say a cloudformation stack and follow up with notes under 3 minutes?
gjsman-1000 1 days ago [-]
Heard this analogy elsewhere, but worth repeating:
AI is like having the greatest developer who ever lived, but she is always on 4 beers.
nathan-hello 1 days ago [-]
personifying ai is incredibly cringe no matter how weird your comparison is
jubilanti 1 days ago [-]
It's an analogy.
nathan-hello 15 hours ago [-]
that’s the personification i’m referring to, yes. incredibly weird.
gjsman-1000 1 days ago [-]
Imagine a drunk developer. Sparks of brilliance while missing obvious trees.
xXSLAYERXx 13 hours ago [-]
I know of a publicly traded company which in its early years was built on beer. Literally. 3 guys in a co-working space in Cambridge, MA. Beer fueled their progress. 15 years later the software is still the backbone of the org.
nathan-hello 15 hours ago [-]
weird.
nimchimpsky 1 days ago [-]
[dead]
ftonon 1 days ago [-]
Looks like the default config in chat is instant 5.3; it only uses 5.5 in the thinking variant.
bnm04 1 days ago [-]
They moved a few months ago to have separate instant and thinking models. 5.3 is the latest instant, and 5.5 is a reasoning model.
QuadrupleA 1 days ago [-]
Exactly double the cost of GPT 5.4 - $5 per MTok input, $0.50 cached, $30 output.
All the AI players definitely seem to be trying to claw more money out of their users at the moment.
languid-photic 1 days ago [-]
It's 2x/token, but for default reasoning we've found GPT-5.5 uses fewer tokens overall, so net cheaper on median. [1]
(Note, that stops being true at higher reasoning levels, where our observed total cost goes up ~2-3x.)
I think that's Pro. Regular 5.5 is 2x regular 5.4.
croemer 15 hours ago [-]
It refuses to write bioinformatics code that involves analysis of SARS-CoV-2. Even when it's totally obvious I'm not trying to do any bioengineering of any sorts. Totally harmless stuff I'm doing and I just get rejected.
redsaber 1 days ago [-]
Not available for GitHub Copilot Pro (only in Pro+, Business, and Enterprise). I am really feeling now that the era of subsidized AI is over.
This is where the emigration to Chinese providers begins.
throw03172019 1 days ago [-]
Faster than anticipated because of Deepseek release?
XCSme 1 days ago [-]
Doubt it, DeepSeek v4 is quite underwhelming.
swyx 1 days ago [-]
more like they wanted to release it yesterday but merely had some last min flags they wanted to hold off for
Jhonwilson 1 days ago [-]
ok not bad
m3kw9 1 days ago [-]
Maybe, but no one serious is using DeepSeek.
brianbest101 1 days ago [-]
[dead]
pants2 1 days ago [-]
Is anyone here actually using pro models through the API? I'd be very curious what the use-case is.
chadash 1 days ago [-]
Yes. High value work where cost (mostly) doesn't matter. For example, if I need to look over a legal doc for possible mistakes (part of a workflow i have), it doesn't matter (in my case) whether it costs $0.01 or $10.00, since it's a somewhat infrequent event. So i'll pay $9.99 more, even if the model is only slightly better.
bogtog 1 days ago [-]
I'm surprised I never heard people talking about using -Pro variants, even though their rates ($125-175/M?) aren't drastically larger than old Opus ($75/M), which people seemed to use
freedomben 1 days ago [-]
Indeed, even just Terms of Service and Privacy Policy work. Infrequent enough that cost isn't an issue, but model quality absolutely is
ComputerGuru 1 days ago [-]
Yes? The same reason you would use it via the tooling.
_pdp_ 1 days ago [-]
A very expensive model for API usage. Fine in codex I think.
gigatexal 1 days ago [-]
What's the real-world comparison to Opus 4.7, fellow coders?
Sembiance 1 days ago [-]
I gave 4.6, 4.7 and GPT 5.5 the same prompt and task to reverse engineer a collection of sample vector files from an obscure Amiga CAD program and create a detailed txt specification and a python converter that converts to SVG and produce a report so I can visually verify.
4.6 did very well. 90% perfect on first try, got to 100% with just a few followups.
4.7 failed horribly. First produced garbage output and claimed it was done, admitted it did that when called out, proceeded to work at it a lot longer and then IT GAVE UP.
GPT 5.5 codex was shockingly good. Achieved 90% perfect on first try in about a fourth of the time. Got to 100% faster and with fewer follow-ups.
I’m impressed.
gigatexal 20 hours ago [-]
Interesting that 4.7 failed like that. Seems 5.5 is impressive but is oh so expensive.
Would be interesting if you ran your same test with Deepseek v4 and some of the other Chinese models.
Sembiance 13 hours ago [-]
Just tried with DeepSeek V4 Pro through OpenCode. It didn't do great. The first attempt produced somewhat correct drawings for some of the original samples, but most were just a spaghetti mess of lines. Some prodding got it to do a little better, but still not right. A third prod and it went down a wild rabbit hole and was much worse. I gave up.
I also tried GLM 5.1; its first attempt was such a disaster I didn't bother working with it any further. It also took by far the longest and wasted a bunch of time/tokens trying to find other converters online (and failing) instead of just reverse engineering the format from the sample files given.
Please consider the ethical aspects of giving money to OpenAI versus alternatives.
AlexCoventry 1 days ago [-]
You need to be more specific. OpenAI's commitment to assist the Trump administration with domestic mass surveillance seems to have been largely memory-holed.
pillefitz 24 hours ago [-]
You're right, unfortunately. How naive of me to think that at least the HN audience would care.
refulgentis 1 days ago [-]
I'm absolutely stunned by what I've seen from 5.5. I thought it'd be a nothingburger and ~= Opus.
Gave it two very long-running problems I haven't had the courage to work on in the last 2.5 years, solved each within an hour.
- An incremental streaming JSON decoder that can optionally take a list of keys to stop decoding after. 1800 LOC and about 30 minutes later, my local-first app's first sync time is 0.8s instead of 75s when there's 1.5 GB of data locally.
- Flutter Web can compile to WASM and then render via Skia WASM. I've been getting odd crashes during rapid animation for months. In an hour, it got Skia WASM checked out, building locally, a Flutter test script, and root caused the issue to text shadows and font glyphs (technically, not solved yet, I want to get to the point we have Skia / Flutter patch(es))
If you told me a week ago that an LLM could do either of these, without heavy guidance, I'd be stunned. And I regularly push them to limits, ex. one of Opus' last projects was a tolerant JSON decoder, and it ended up being 8% faster than the one built-in to Dart/Flutter, which has plenty of love and attention. (we're cheating a little, that's why it's faster. TL;DR: LLMs will emit control characters in JSON and that's fine for me, treating them as fine means file edit error rates go from ~2% to 0%)
I just wish it was cheaper, but, don't we all...
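The stop-keys idea is the interesting part: once the values for the keys you care about are complete, everything after them is never decoded. Here is a toy Python sketch of that behavior (the commenter's decoder is presumably Dart; `decode_until` and its chunk scanning are my own illustration, naive about values split mid-number, and `strict=False` is the stdlib's way of tolerating control characters inside strings, as mentioned in the comment):

```python
import json

def decode_until(stream_chunks, stop_keys):
    """Incrementally consume JSON text chunks for a top-level object and
    return as soon as every key in stop_keys has a complete value.
    Tracks brace/bracket depth and string state so it only attempts a
    real parse when we are back at the top level of the object."""
    buf = ""
    decoder = json.JSONDecoder(strict=False)  # tolerate control chars in strings
    depth = 0
    in_str = False
    esc = False
    for chunk in stream_chunks:
        buf += chunk
        for ch in chunk:
            if esc:
                esc = False
            elif in_str:
                if ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = False
            elif ch == '"':
                in_str = True
            elif ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
        # Cheap check: back inside the top-level object, all wanted keys seen?
        if depth == 1 and not in_str and all(f'"{k}"' in buf for k in stop_keys):
            # Close the top-level object ourselves and try to parse the prefix.
            candidate = buf.rstrip().rstrip(",") + "}"
            try:
                obj, _ = decoder.raw_decode(candidate)
            except ValueError:
                continue
            if all(k in obj for k in stop_keys):
                return obj  # stop early: later keys are never decoded
    return decoder.raw_decode(buf)[0]  # stream ended: parse everything
```

The win on a workload like the one described comes from the early return: the bulk of the payload after the wanted keys is never parsed at all.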
Jhonwilson 1 days ago [-]
that is great news
willj 1 days ago [-]
[dead]
benjiro3000 1 days ago [-]
[dead]
woohin 1 days ago [-]
[dead]
theo_park87 1 days ago [-]
[dead]
XCSme 1 days ago [-]
GPT 5.5 is close to Opus 4.7, but at 7x the cost[0]...
Either Opus 4.7 miscounts reasoning tokens, or it's A LOT more efficient than GPT 5.5
I thought they made GPT 5.5 more token efficient than 5.4, but it uses 2x the reasoning tokens.
These safeguards are a very bad habit. Such "safety" filters are counter-productive and can even be dangerous.
Where I live, for example, a lot of doctors are using ChatGPT both to search for diagnoses and to communicate with non-English-speaking patients.
The same goes for you, when you want to learn about a disease, about real-world threats, statistics, self-defense techniques, etc.
Otherwise it's like blocking Wikipedia for the reason that using that knowledge you can do harmful stuff or read things that may change your mind.
Freedom to read about things is good.
NicuCalcea 1 days ago [-]
> a lot of doctors are using ChatGPT both to search diagnosis and communicate with non-English speaking patients
I think that's the problem. Who's going to claim responsibility when ChatGPT hallucinates or mistranslates a patient's diagnosis and they die? For OpenAI, this would at best be a PR nightmare, so that's why they have safeguards.
rvnx 1 days ago [-]
Adults bear responsibility for choices about their own lives. In fact, the more educated they are, the better choices they can make.
A doctor who gets refused by ChatGPT doesn't stop needing to communicate with the patient; they fall back to a worse option (Google Translate, a family member interpreting, guessing). Refusal isn't safety, it's liability-shifting dressed up as safety.
If there's no doctor, no interpreter, no pharmacist, just a person with a sick kid and a phone, then "refuse and redirect to a professional" is advice from a world that doesn't exist for them. The refusal doesn't send them to a better option; there is no better option, it's a large majority of people on this planet.
The road to hell is paved with good intentions, but open education and unlimited access to knowledge are very good.
It doesn't change human nature: bad people stay bad, good people stay good.
About PR, they're optimizing for not being the named defendant in a lawsuit or the subject of a bad news cycle, it's self-interest wearing benevolence as a costume.
This is because harms from answering are punishable (bad PR, unhappy advertisers, unhappy investors, unhappy politicians / dictators, unhappy lobbies, unhappy army, etc); but harms from refusing are invisible and unpunished.
NicuCalcea 1 days ago [-]
> A doctor who gets refused by ChatGPT doesn't stop needing to communicate with the patient; they fall back to a worse option
I think AI proves the contrary. There are plenty of examples of things that are getting worse because of technological advancement, particularly AI. Software quality, writing, online discourse, misinformation have all suffered over the last few years. I truly believe the internet is a worse place than it was 5 years ago, and I can't imagine bringing that to medicine would work out differently.
The medical system shouldn't rely on falling back to crappy workarounds, it should aspire to build the best system it reasonably can.
1 days ago [-]
hellohello2 1 days ago [-]
The doctor would be responsible.
If I had a choice between a doctor that used AI and one that didn't, I would much prefer the one that did...
NicuCalcea 1 days ago [-]
The doctor would be responsible for the accuracy of their translation tool, something they can't verify but you expect them to use?
hellohello2 1 days ago [-]
I was answering for hallucinations, not really for translation. Re-reading your initial post I do agree with what you are saying (i.e. you are explaining why OpenAI is looking to avoid a PR nightmare). What I meant to express is that I would personally trust doctors to use these tools as best they can to provide care.
lacunary 1 days ago [-]
"What you see is all there is." It's generally much easier to verify something you've been made aware of than it is to know of it in the first place (and still verify it).
rvnx 1 days ago [-]
The irony is that licensed interpreters / translators usually perform worse than AI.
Only the liability shifts from OpenAI to them.
Furthermore, where the alternative to a licensed professional is nothing, a random untrained person, or a weak professional, the refusal harms the user on the pretext of protecting them.
(like in the other mentioned contexts).
rvnx 1 days ago [-]
What's the alternative, then?
-> You are in China, you go to the emergency room, and nobody speaks your language.
Wave your hands? DeepSeek is better than using your hands, and so is Baidu Translate, ChatGPT, or whatever you find.
Other solutions are theoretically nice on paper but almost delusional.
An imperfect solution is better than no solution.
==
Similarly, a deaf person is theoretically better off with a certified interpreter who can sign, but they may prefer voice-recognition software or AI tools
(or... signing is more confusing, more annoying, or less understandable for them).
Of course ChatGPT transcription can have issues, but that's the difference between the real world and the disconnected world of Silicon Valley's lawyers.
==
If ChatGPT says: "sorry I won't be able, please go to see a licensed interpreter, good luck!" then it's just OpenAI trying to save their asses, at your risk/expense.
If you have a choice, you can make the choice, and you can double-check what is said. In other cases, you have no choice, nothing to check, only problems but no hints of solutions.
This is why openness is important.
NicuCalcea 1 days ago [-]
When I registered with my GP in the UK, they asked me whether I would need an interpreter and what language. They then provide professional interpreters.
I think users really underestimate the capabilities of "AI" when using the right tooling, combinations of models, and procedures (and loops). That's speaking with two decades of dev behind me. I'm genuinely not in phase with people saying it produces slop of any kind; at this stage, it's mostly the fault of the prompter (or of the prompter not having enough tokens to do mass adversarial passes). I can genuinely state that the code produced is overall the SAME quality as I would get by being extremely meticulous.
I'm like a bot following 30+ threads concurrently. Sometimes it's fun, sometimes it feels like playing the casino, sometimes it's boring, but this is truly an insane era if you have the funding for it. Obviously we stack many, MANY accounts in rotation 24/7; the equivalent API cost by myself would be about $100K+ a month, but we pay only a fraction of that thanks to the plans.
PS: I have 8 monitors in front of me to manage all that (portable monitors stacked together).
Why would I need to "wake up"?
Hope it works and you post about it.
I still haven't seen a single person who actually cares about the environment, and has willingly made significant sacrifices for it, clamor about the environmental cost of AI. Every time I see someone do it, it's someone who never cared about this before, and still doesn't really: who buys plenty of new clothes and furniture, loves a good burger, has the latest iPhone, flies 4 times per year.
Maybe you're the unicorn in which case fair enough, you've earned the right to feel disgusted.
I have no sympathy for OpenAI or Anthropic as corporations, but if these are the new tools of the trade, then platform abuse like GP is bragging about serves only to destroy the livelihoods of the rest of us who are content to use our fair share.
There’s no such thing as a free lunch, and the bill always comes at the end.
Don't be surprised if/when people ignore your AI slop
If so, I’m not sure I’d even consider that a problem. If the goal is for it to give you a query to run, and you ask it “let’s do it in a transaction”, it’s a reasonable thing for it to simply inform you, “yeah you can just type begin first” since it’s assuming you’re going to be pasting the query in anyway. And yeah, it does use fewer tokens, assuming the query was long. Similar to how, if it gave me a command to run, and I say “I’m getting a permission denied”, it would be reasonable for it to say “yeah do it as root, put sudo before the command”, and it’s IMO reasonable if it didn’t repeat the whole thing verbatim just with the word “sudo” first.
But if the context was that you actually expected it to run the query for you, and instead it just said “here, you run it”, then yeah that’s lazy and I’d understand the shock.
All earnestness aside, OpenAI's oddly specific, singular focus on "intelligence per token" (also in the benchmarks), which literally no one else pushes so hard, eerily reminds me of Apple's MacBook anorexia era pre-M1. One metric to chase at the cost of literally anything else. GPT-5.3+ are some of the smartest models out there and could be a pleasure to work with, if they weren't lazy bastards to the point of being completely infuriating.
I know it's only a single benchmark, but I don't understand how it can be so bad...
HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.
The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek v4 was tested through Cline (with max reasoning too).
I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark.
I haven't evaluated the judge benchmark. You have everything needed in the repo to do so, though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.
BTW, if you explore the repo, sorry for all the French files...
Your prompt is extremely slim yet you score it on a bunch of features.
The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...
I personally develop from a very detailed spec, and I want nothing more and nothing less than the spec.
I found 5.4/5.5 much better at following the spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me while worse for you.
What strikes me as very strange, though, is that zero models were able to just use the search input already present on the Gravity Forms forms list page; all of them created a second input.
Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.
Input: $5/M tokens at <=272K, $10/M tokens above 272K.
Output: $30/M tokens at <=272K, $45/M tokens above 272K.
Cache read: $0.50/M tokens at <=272K, $1/M tokens above 272K.
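Assuming those tiers apply to a whole request based on its context size (the exact billing semantics may differ), the arithmetic can be sanity-checked with a small helper; `request_cost` is hypothetical, not an official calculator:

```python
def request_cost(input_toks, output_toks, cached_toks=0, context=0):
    """Estimate one GPT-5.5 request cost in USD from the tiered rates
    quoted above. Assumption: the higher rate applies to the whole
    request once its context exceeds 272K tokens."""
    long_ctx = context > 272_000
    rate_in = 10.0 if long_ctx else 5.0     # $/M fresh input tokens
    rate_out = 45.0 if long_ctx else 30.0   # $/M output tokens
    rate_cache = 1.0 if long_ctx else 0.50  # $/M cache-read tokens
    fresh = input_toks - cached_toks
    return (fresh * rate_in + cached_toks * rate_cache + output_toks * rate_out) / 1e6
```

For example, 100K fresh input and 10K output under the short-context tier comes to $0.80, while the same request above 272K context would cost $1.45, which is where the "significantly more expensive than Opus 4.7 beyond 272K" complaint comes from.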
Significantly more expensive than Opus 4.7 beyond 272K, and at least in my tasks, I haven't seen the model be that much more token efficient, certainly not to a degree that compensates for the difference. GPT-5.4 had a solid 400K context window with reliable compaction; both appear somewhat regressed, though it's still too early to truly say whether compaction is less reliable. Also, I have found frontend output to still skew toward that one very distinct, easily noticeable, card-laden, bluesy-hued, overindulged template that made me skeptical of Horizon Alpha/Beta before GPT-5's release. It ended up doing amazingly at the time at task adherence, which made it very useful for me outside that one major deficit. The fact that GPT-5.5 is still so restricted in that area is weird, considering it's supposed to be an entirely new foundation.
Live decision and heavier agentic evals will continue being uploaded for 24 hours but I don't expect its leaderboard position to change at this point.
GPT 5.5 is the most intelligent public model. And significantly faster than its predecessor.
>API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale.
And now this. I guess one day counts as "very soon." But I wonder what that meant for these safeguards and security requirements.
> In 2023, the company was preparing to release its GPT-4 Turbo model. As Sutskever details in the memos, Altman apparently told Murati that the model didn’t need safety approval, citing the company’s general counsel, Jason Kwon. But when she asked Kwon, over Slack, he replied, “ugh . . . confused where sam got that impression.”
Lots of cases where Altman has not been entirely forthcoming about how important (or not) safety is for OpenAI. https://www.newyorker.com/magazine/2026/04/13/sam-altman-may... (https://archive.is/a2vqW)
There is too much and there are too many, and some of their takes don’t fly if you use Claude daily.
If this sounds crazy to you, though, I'd like to know, and understand why. I miss ChatGPT/Codex.
Cheaper and slower than Opus.
Just ask it about an event that happened shortly before Dec 1, 2025. Sporting event, preferably.
could be they do it intentionally to encourage more tool calls/searches or for tuning reasons
The proper way to figure out the real cutoff date is to ask the model about things that did not exist or did not happen before the date in question.
A few quick tests suggest 5.5's general knowledge cutoff is still around early 2025.
Whatever it is, the cutoff date reporting discrepancy isn't new. Back when Musk was making headlines about buying/not buying Twitter, I was able to find recent-ish related news that was published well after the bot's stated cutoff date.
ChatGPT was not yet browsing/searching/using the web at that point. That tool didn't come for another year or so.
Easiest Turing test ever...
A better test is something like "what is the latest version of NumPy?"
You're probably better off asking something like "what are the most notable changes in version X of NumPy?" and repeating until you find the version at which it says "I don't know" or hallucinates.
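That probing loop is easy to mechanize. A sketch with a stubbed oracle standing in for the real API call (`find_knowledge_cutoff`, `ask_model`, and the version list are all hypothetical; a real harness would also need to detect hallucinated answers, e.g. by diffing against the actual changelog):

```python
def find_knowledge_cutoff(versions, ask_model):
    """Walk release versions in chronological order and return the last
    one the model can describe. ask_model(version) returns None when the
    model admits ignorance about that release."""
    last_known = None
    for v in versions:
        if ask_model(v) is None:
            break  # first release the model doesn't know bounds the cutoff
        last_known = v
    return last_known

# Stubbed "model" that only knows releases up to 2.1.
known = {"1.26": "maintenance release", "2.0": "major rewrite", "2.1": "bug fixes"}
cutoff = find_knowledge_cutoff(["1.26", "2.0", "2.1", "2.2", "2.3"], known.get)
```

With the stub above, `cutoff` lands on "2.1", the last release before the simulated training cutoff.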
They all roughly produce junior developer-level code, continue to have mental breakdowns in their “thinking” stage, occasionally hallucinate things, delete pieces of code/docs they don’t understand or don’t like, use 1.5 times the necessary words to explain things when generating docs and so on.
I'm now testing "avoid sycophancy, keep details short and focus on the facts" in my AGENTS.md files.
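For reference, the sort of AGENTS.md stanza being described; only the quoted instruction is from the comment, the surrounding structure is illustrative:

```markdown
## Communication style
- Avoid sycophancy, keep details short and focus on the facts.
- Do not praise the user's ideas or restate the request back.
- Report results as terse bullet points rather than prose.
```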
AI is like having the greatest developer who ever lived, but she is always on 4 beers.
All the AI players definitely seem to be trying to claw more money out of their users at the moment.
(Note, that stops being true at higher reasoning levels, where our observed total cost goes up ~2-3x.)
[1] https://x.com/voratiq/status/2047737190323769488?s=20
30/180 usd on Openrouter. Did I miss something?
4.6 did very well. 90% perfect on first try, got to 100% with just a few followups. 4.7 failed horribly. First produced garbage output and claimed it was done, admitted it did that when called out, proceeded to work at it a lot longer and then IT GAVE UP. GPT 5.5 codex was shockingly good. Achieved 90% perfect on first try in about a fourth of the time. Got to 100% faster and with fewer follow-ups.
I’m impressed.
Would be interesting if you ran your same test with Deepseek v4 and some of the other Chinese models.
[0]: https://aibenchy.com/compare/openai-gpt-5-5-medium/openai-gp...
https://www.england.nhs.uk/interpreting/