Hacker News

Many AI projects in academia and research get all of their web data from Common Crawl -- alongside many non-AI uses of our dataset.

The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
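Respecting robots.txt can be as simple as running every candidate URL through a parser before fetching. Here is a minimal sketch using Python's stdlib `urllib.robotparser`; the rules and the `SomeLLMBot` name are hypothetical, purely for illustration:

```python
from urllib import robotparser

# A well-behaved crawler parses the site's robots.txt before fetching
# anything. These rules are a made-up example, not any real site's policy.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: CCBot",      # Common Crawl's crawler: allowed everywhere
    "Disallow:",
    "",
    "User-agent: *",          # everyone else: rate-limited, /private/ off-limits
    "Crawl-delay: 10",
    "Disallow: /private/",
])

# The most specific matching group wins.
print(rp.can_fetch("CCBot", "https://example.com/private/page"))
print(rp.can_fetch("SomeLLMBot", "https://example.com/private/page"))
print(rp.crawl_delay("SomeLLMBot"))  # seconds to wait between requests
```

A crawler that honors `crawl_delay()` between requests to the same host is already far better behaved than the bots being complained about below.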



Thank you!

> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.

But how can they aspire to do any of that if they cannot even build a well-behaved bot?

My case, which I know is the same for many people:

My content is updated infrequently, so Common Crawl must already have all of it. I do not block Common Crawl, and I see the genuine bot (coming from the published IP ranges, not the impostors) visiting regularly. Yet the LLM bots hit the same URLs all the time, multiple times a day.
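Telling the genuine crawler from the fakes comes down to checking the requester's IP against the ranges the operator publishes. A minimal sketch with Python's `ipaddress` module; the ranges below are RFC 5737 documentation addresses, stand-ins for whatever ranges the operator actually publishes:

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT Common Crawl's
# real ranges -- substitute the operator's published list here.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]

def is_genuine(addr: str) -> bool:
    """True if the claimed crawler IP falls inside a published range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PUBLISHED_RANGES)

print(is_genuine("192.0.2.44"))    # inside a published range
print(is_genuine("203.0.113.7"))   # spoofed user-agent, block or challenge it
```

A user-agent string alone proves nothing, since anyone can send `CCBot` in the header; the IP check is what separates the genuine crawler from the impostors mentioned above.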

I plan to start blocking more of them, even the User and Search variants. The situation has become absurd.


Well, yes, it is distressing that ill-behaved crawlers are causing so much damage -- and collateral damage, too, when well-behaved bots get blocked along with them.




