These are ChatGPT and Claude Desktop crawlers we’re talking about? Or what is it exactly? Are these really creating significant load while not honoring robots.txt?
Is this the first time you are reading HN? Every day there are posts from people describing how AI crawlers are hammering their sites, with no end. Filtering user agents doesn't work because they spoof it, filtering IPs doesn't work because they use residential IPs. Robots.txt is a summer child's dream.
They seem to mostly be third-party upstarts with too much money to burn, willing to do what it takes to get data, probably in hopes of later selling it to big labs. Maaaybe Chinese AI labs too, I wouldn't put it past them.
And doing it over, and over, and over and over again. Because sure it didn't change in the last 8 years but maybe it's changed since yesterdays scrape?
Genuinely interested.