Hacker News | SatvikBeri's comments

I'm a fairly moderate user and have never hit any kind of usage limit, but I used 44 million cache-create tokens and 1.5 billion cache-read tokens. ccusage, which calculates the different token categories separately, estimates that would have cost $990.

Not OP, but Dario said on Dwarkesh's podcast 3 months ago that their gross margins are "significantly higher than 50%"

Run rate is annualized revenue based on some recent period, e.g. taking the last month of revenue and multiplying by 12. Revenue (classic) is a historical measure, e.g. revenue in 2025.
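
As a minimal sketch of that definition (the function name and the revenue figure are made up for illustration, not taken from any company's numbers):

```python
def annual_run_rate(recent_revenue, period_months=1):
    """Annualize revenue from a recent period (default: the last month)."""
    return recent_revenue * 12 / period_months

# Hypothetical figure, for illustration only.
last_month_revenue = 10_000_000
run_rate = annual_run_rate(last_month_revenue)
print(run_rate)
```

Classic revenue would instead be the sum of what was actually booked over the historical period, so the two can differ a lot for a fast-growing company.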

Julia does this – you generally write synchronous, single-threaded functions most of the time, and can use code like `t = @spawn foo(b)` to get a Task, and then `output = fetch(t)` to wait for it and get the value.

I like this general approach a lot. It's quite nice for Julia's core use case of number crunching, because you typically make decisions around concurrency at the call sites. It does rely heavily on Julia's runtime, though, and it can be a bit difficult to figure out what's going on under the hood.
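
For readers who don't know Julia, here is a rough Python analog of the call-site pattern, using the standard library's `concurrent.futures`. This is a sketch of the shape only, not of Julia's scheduler; `foo` and its argument are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def foo(b):
    # An ordinary synchronous, single-threaded function.
    return b * 2

with ThreadPoolExecutor() as pool:
    t = pool.submit(foo, 21)  # roughly `t = @spawn foo(b)`: returns a Future
    output = t.result()       # roughly `output = fetch(t)`: blocks until the value is ready

print(output)
```

In standard CPython builds the GIL limits CPU parallelism across threads, so this mirrors where the concurrency decision is made rather than the performance you'd get from Julia tasks.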


See also Cilk/Cilk++

I usually aim to have Claude end up with about 500 lines of code after a night of work. Most of what it's doing is experimenting with many different approaches, summarizing them, and then giving me a relatively small diff to review and modify.

This is the way to go. I usually play with relatively stable software where the improvements are either performance or very small niche features that are built on top of already existing ones. Big changes are undesirable by both the others working on it and its users.

The context window has nothing to do with RAM usage, and even if it did, a million tokens of context is maybe 5 MB.


'A million tokens of context' is literally terabytes of KV cache VRAM on very expensive Nvidia silicon - on the model.

On the Agent, yes, the context window does relate to RAM, because the 'entire conversational history' is generally kept in memory. So ballpark 1M 'words' across a bunch of strings. It's not all that much.

Claude Code is not inefficient because 'it's not Rust' - it's just probably not very efficiently designed.

Rust does not really bestow magical properties that make memory more efficient.

A bit more, but it's not going to change this situation.

'Doing it in Rust' might yield amazing returns just because the very nature of the activity is 'optimization'.


Rust "denialism" is as annoying as Rust evangelism.

Of course any seemingly idiomatic Rust is going to run circles around TS transpiled into JIT-compiled JS.


Lamenting any 'not even criticism' of Rust as 'denialism' is just evidence of the insane cult that is Rust.

Rebuilding Claude Code in Rust will make almost no difference in terms of real world performance. V8 is 'relatively fast', and there wouldn't be any noticeable improvements there, and probably not memory footprint either.

The source for Claude Code was leaked and it's a vibe-coded mess; there's not much thought given to clean architecture. It's unlikely they've cleaned it up even a bit or given thought to memory consumption etc. If they did, they'd get by far most of the way there and likely abnegate any real want to 'do it in Rust', unless there are other architectural considerations.


You're the delusional one for bringing up the memory usage of the inference server that clearly isn't running inside the coding agent.

The problem with your comments is that you're showing off a fundamental lack of understanding of the difference between managed languages and unmanaged languages.

The vast majority of GCs are optimized for throughput and allocate big chunks of memory. They also tend to never release it if there was a temporary memory spike. The most advanced GCs also tend to have either read or write barriers, which slow down basic object accesses.

Just in time compilation and managed languages in general need to retain a runtime representation of the source code to perform JIT compilation and then they have to store the compiled code in memory as well.

JavaScript uses references against dynamic objects, which means you have to pay the indirection cost of a pointer but you also need to store type information as well to monomorphize the object literals and classes at runtime and fall back to a regular hashmap when fields are added dynamically.

All of these things will add up and increase the amount of memory the application uses and how slow it runs.

Sure Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM, but if those were not there you could easily build a C++ based alternative that runs circles around a hypothetical JavaScript based Claude Code that got its act together.


1) I'm not 'delusional' for bringing up 'What Memory is Used Where' - I'm clarifying for the people who seem a bit confused (see above) as to 'where the context lives' - and trying to provide a simple mental model for that.

That's the opposite of delusional.

It's just information.

Attacking people for anything 'Rust related' however - is the quintessential reason why everyone hates the Rust community.

2) 'The problem with your comment' is that it's presumptive and arrogant - as if I 'don't know the difference between GC and managed languages'.

I've been writing software since 1990.

Embedded (on custom silicon), UI, SaaS, backend. Some embedded work I did almost 30 years ago is still in production today.

I've written a scripting language (for production), and a cyclic ref-counting GC (didn't make it to production).

Your comments about GC etc. are fine - but they don't really offer any insight into the actual problem.

There's one critical detail, aka 'memory not released after spikes'. Yes, this is observed behaviour, but it's usually accommodated with a little bit of decent engineering.

If you're going to make the comparative basis an 'Idiomatic Rust' solution (aka good patterns), then we should make the assumption of an 'Idiomatic Node' solution for Claude Code.

3) 'The other problem with your comment' is that your conclusion is wrong - by your own hand.

Right here: "Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM," - the implication being that Claude Code does not inherently have to 'leak all that RAM' - and would run just fine with some basic work.

An 'Idiomatic Node' implementation of Claude Code wouldn't exhibit those problems, and would perform pragmatically just as well as an Idiomatic Rust implementation.

From a memory management standpoint, Rust might use significantly less memory, but a 150 MB footprint vs a 350 MB footprint for an average session is 'pragmatically immaterial'.

The difference in 'perceived performance' would be negligible - if any.

The 'cost' of writing the 'kind of program that Claude Code is' in a systems-level language would be quite high, for not really much benefit.

The 'Rust or C++' solution would not 'run circles' around the 'node' implementation in anything but some 'performative', inward-looking benchmarks, aka 'the worst kind of Engineering'.

Consider pondering why almost nobody writes such applications in Rust or C++.


You have a point, but it's definitely not TBs for 1M. Should be more like 100 GB.


It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5 MB.

The rough estimate is 2 * L * H_kv * D * bytes per element

Where:

* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values

The dominant factor here is typically 2 * H_kv * D since it’s usually at least 2048 bytes. Per token.

For Llama 3 8B you’re looking at 128 GiB if your context is really 1M (not that that particular model supports a context so big). DeepSeek4 uses something called sparse attention, so the above calculus is improved: 1M of context would use 5-10 GiB.

But regardless of the details, you’re off by several orders of magnitude.
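
To make the estimate concrete, here is a small sketch of the formula above. The configuration values (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are my assumed numbers for a Llama-3-8B-style model with an fp16 cache, and `kv_cache_bytes` is just an illustrative helper name.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element=2):
    """KV cache size: 2 (keys + values) * L * H_kv * D * bytes per element, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return tokens * per_token

GIB = 2**30

# Assumed Llama-3-8B-style configuration with an fp16 cache and 2**20 tokens.
size = kv_cache_bytes(tokens=2**20, layers=32, kv_heads=8, head_dim=128)
print(size / GIB)
```

Under those assumptions each token costs 128 KiB of cache, which is where a figure on the order of 128 GiB for a million tokens comes from.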


Pretty sure we're talking about the output text, not the tensors.


These LLM replies are really getting annoying.


Mine? I literally wrote what I wrote because “context window” as a term of art refers to the LLM’s context window.

I guess get better at detecting LLMs instead of accusing everything of being an LLM reply?


I've never actually run into the issues that people talk about online, like Claude suddenly getting dumb or running out of usage. So there's just not a lot of incentive for me to shop around. I've used Amp a bit, and it's quite nice, but a bit more expensive without the subsidized subscription.


It has always been like this. We actually know that the model performance has been mostly steady[0], but you cannot beat the notion of "evil companies secretly serving us worse models." The meme value is too strong.

[0]: https://marginlab.ai/trackers/claude-code/


Your data support actual strength shifts, not narrative manipulation:

Range of 48-73.5 (peak 53.1+% higher than trough) with a single day shift of ~30%.
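
The peak-versus-trough figure is just the relative change between the two ends of that range, which can be checked like this:

```python
low, high = 48.0, 73.5  # trough and peak pass rates quoted above

# Relative increase from trough to peak.
relative_gain = (high - low) / low
print(f"{relative_gain:.1%}")
```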

You suggest people are usually influenced more by narrative than data, but provide a narrative-heavy, data-light comment, e.g. "always" "know" "mostly steady" (hazy terms for data) "cannot beat" "evil companies" "meme strong".

A followup defining "mostly" and "steady" more clearly, and your purpose in writing in a narrative-shaping style would be helpful.


Hmm, today's pass rate rose to 73% - interesting, are they A/B-testing some new model? This is too high for Opus 4.7.


Are you using Opus? Sonnet remains as useful as it was, while Opus efficacy and token burn rate have soured over the last 4 months.


I'm using Opus on xhigh 10+ hours a day, and I've only reached 80% of weekly limits when doing massive ports or refactors. I haven't once hit hourly limits, and I've used Claude very, very aggressively. I guess it's a pain point for power users.


I sometimes run multiple claudes at the same time, with each terminal working on a different task. I have 2 going right now.

It's very easy to burn through your quota if you work like that. Especially on high / xhigh.


I used to be mostly at high/xhigh, but now I'm at medium, and I think it actually performs quite well on both results and token usage.


Yes, I've pretty much used Opus exclusively for the last year, except for a brief period when Sonnet was ahead


When do you use it the most? I’ve noticed that it most often starts to degrade during 10-5 US East coast time. Late at night, I have the least amount of issues, but without fail, if I’m trying to do anything complex during the day, Claude gets loopy.


9-5 Pacific Time


Same here. Works every time. Never ran into usage limits either.


It's definitely closer to MATLAB than Python, but it's closer to Python than most mainstream programming languages. I ported ~20k lines of Python code to Julia over a couple of years manually, and for the most part could do line-by-line translations that worked (but weren't necessarily performant until I profiled and switched to using Julia idioms).


Well, my workflow uses Revise.jl. I develop either in Jupyter notebooks or in the REPL, prototyping code there and then moving functions to files when they're ready. In that context, rapid iteration is fairly fast.

Nowadays I often use Claude Code, working with a Julia REPL in a tmux or zellij session via send-keys. I'll have it prototype and try to optimize an algorithm there, then create a notebook to "present its results", then I'll take the bits I like and add them to the production codebase.


How do you develop a program which will run for a longer duration on HPCs? How do you quickly modify struct definitions? How do you define imports? (using vs include syntax is so confusing!)

REPL-based workflow doesn't make sense to me other than scripting work.


Re: REPL use, you just use it to run code and look at results. e.g. for TDD – you can modify your code files normally in the IDE, changes get picked up by revise, and then you re-run the tests in the REPL.

For long-running jobs, I basically follow the same process as in any other language: make the functions I want to run, test them locally on a small dataset that runs relatively quickly, then launch them on the remote machines with the full data.

Revise.jl has struct redefinition now, but before that I would just use NamedTuples while iterating, then make a struct when I was ready to move something to production.

`using` is for importing modules, `include` is for specific files. At work, we currently have a monorepo, with one top-level OurProject.jl file that uses `using` to import external packages, and `include` for all the internal files.


> How do you develop a program which will run for longer duration on HPCs.

The main strategy is to have a way to parameterize the program to bring the runtime down to seconds or minutes on a laptop. E.g. for PDEs, you may be running the HPC version on a giant mesh, but you can run the same algorithm on your local computer on a much coarser mesh.

> How do you quickly modify struct definitions
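
A toy sketch of that idea in Python (not Julia, and not anyone's production solver): the mesh size is an ordinary parameter, so the same function runs in well under a second on a laptop and can be launched with a far larger mesh on the cluster. The function name and the 1D Laplace problem are chosen purely for illustration.

```python
def solve_laplace_1d(n_points, iterations):
    """Jacobi iteration for u'' = 0 on [0, 1] with u(0) = 0 and u(1) = 1."""
    u = [0.0] * n_points
    u[-1] = 1.0  # right-hand boundary condition
    for _ in range(iterations):
        new_u = u[:]
        for i in range(1, n_points - 1):
            # Each interior point becomes the average of its two neighbours.
            new_u[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new_u
    return u

# Laptop run: coarse mesh, finishes almost instantly.
coarse = solve_laplace_1d(n_points=11, iterations=2000)
print(coarse[5])

# The cluster job would call the same function with a much finer mesh
# (and a correspondingly larger iteration count).
```

The exact solution is the straight line u(x) = x, so the midpoint value gives a quick local sanity check before committing cluster hours.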

Thankfully on 1.12 this has been solved. You can redefine structs while keeping the REPL up.

> how do you define imports (using vs include syntax is so confusing!)

Yeah, Julia messed this up. The basic rule is that include and using are basically the same.


This is 7 years old. Julia is a totally different language by now.

As a quick anecdote, in our take-home interview exercise, we usually receive answers in C++ or Julia, and the two fastest answers have been in Julia.


I'd have to guess that this is because of ease of use. C++ lets you get as close to the metal as you choose to, so there is no reason why a C++ solution shouldn't be at least as fast as one written in any other language, and yet ...

Of course it also depends on what additional libraries you are using, especially when it comes to parallel/GPU programming in C++, but it's easy to believe that Julia out of the box makes it easy to write high-performance parallel software.


> C++ lets you get as close to the metal as you choose to

This only ends up being true (for any language, but it's too often cited for C++) in a pretty useless Turing Tarpit sort of sense.

So it's not "no reason"; it's just sometimes impractical to solve some problems as well in C++ as in a language that was better suited.

Now people do do impractical things sometimes. It's not very practical to swim across the English Channel, but people do it. It's not very practical to climb Mt Everest, but loads of people do that for some reason. Going to the moon wasn't practical, but the Americans decided to do it anyway. But the reason even the Americans stopped going for a long time is that actually "that was too hard and I don't want to" is in fact a reason.


Drawing from the analogies, what’s the Julia equivalent of them?


Yes, with unlimited development time I would expect C++ solutions to be as fast or faster. But Julia hits a really nice combination of development speed and performance that I haven't found in other languages, at least for number crunching and data pipelines.


> This is 7 years old.

Yeah, I actually totally forgot to check the date...

