The author claims that IPFS enables a "permanent web" and eliminates 404-like experiences. How does IPFS guarantee that all published content will be available forever? (In my beginner-level knowledge of IPFS, this is the first I've heard that claim. It seems absurd.)
And quite annoyingly, it's the opposite of how IPFS works. IPFS nodes only cache content as long as it's actively requested; depending on the cache policy of the node, this can be as short as 24 hours. Yes, it could still exist on your private node that you're running from your laptop, but this is the equivalent of saying that all published content is available forever because it's on your hard drive.
Honestly, I like IPFS as a tech, and blockchain isn't useless, but the entire community just makes me want to puke because it's full of so many extremely ignorant people (at best), but mostly just fraudulent liars (at worst, and most common).
> Yes, it could still exist on your private node that you're running from your laptop, but this is the equivalent of saying that all published content is available forever because it's on your hard drive.
No, the difference is that IPFS will use the same address to fetch content from anyone who's seeding it.
If Hacker News shuts down, it will no longer be accessible at 'news.ycombinator.com'; all existing links to that address will die, or worse, will start showing some unrelated content (probably domain-squatter spam). That cannot be prevented by making a copy on my hard drive (or even in the Wayback Machine).
On the other hand, an IPFS version will continue to exist at the same address for as long as anyone is seeding it. All links will keep working; anyone can join in the hosting if they like; and even if it eventually stops getting seeded, it may still reappear if someone re-inserts it, e.g. if they insert the contents of an old hard drive they found in an attic (as long as the same hash algorithm is used, the address will stay the same).
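A minimal sketch of that content-addressing property, using plain SHA-256 in place of IPFS's actual multihash/CID encoding (so the addresses here are not real IPFS addresses, just an illustration of the idea):

```python
import hashlib

def content_address(data: bytes) -> str:
    # The address is derived from the bytes themselves, so any host serving
    # the same bytes serves them at the same address. (Real IPFS wraps the
    # digest in a multihash/CID; plain SHA-256 hex is used here as a stand-in.)
    return hashlib.sha256(data).hexdigest()

page = b"<html>archived Hacker News thread</html>"
addr_from_original_host = content_address(page)
addr_from_attic_backup = content_address(page)  # same bytes, years later
assert addr_from_original_host == addr_from_attic_backup
```

The point is that the address never encodes *who* is hosting, only *what* is hosted, which is why a re-inserted backup lights up all the old links again.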
> On the other hand, an IPFS version will continue to exist at the same address for as long as anyone is seeding it.
I wonder how that would work; a naive direct translation seems impractical. An address identifies an exact piece of content, so a Hacker News article gets a new address every time a comment is added?
If the HTML differs, it would get a different address; just like the Wayback Machine, but the addresses are based on content rather than timestamp, and it's distributed. Anyone can do that right now, without any changes by (or permission from) the site operator (YCombinator in this case).
You couldn't host the "live" version of Hacker News (with user accounts, new comments, etc.) unless YCombinator opened up their databases and re-architected the system to work in a distributed-friendly way.
The latter is an interesting idea, but isn't required to fix HTTP issues like link rot.
Pulling things out of the ether, my guess is that during normal operation news.ycombinator.com constantly creates and propagates new content IDs as new comments are added (the front page simply fetches the newest IDs, and there is a system for querying for updated IDs which simulates how browser refresh works now). If news.ycombinator.com dies, it becomes impossible to create new IDs or post comments, but all already-propagated pieces of content are still accessible (possibly refresh still works on outdated content).
Bingo. The marketing strongly implies permanence, but the only permanence is that the same content chunks will have the same hash, not that you can always retrieve content for a given hash. Then you start digging deeper and realize you need a paid pinning service to ensure that your content remains available. And these pinning services are way more expensive per GB than traditional storage options.
To be fair, this is all hard work to get right. I think there’s a third, much larger category of people who are pursuing the idea but just haven’t solved all the problems yet.
IPFS doesn't give you any guarantees about whether the content is actually stored anywhere; it does, however, give you reasonable guarantees that the addresses for that content stay the same. Meaning if somebody finds an old backup tape decades down the road, they can just stuff it back onto IPFS and all the dead links start functioning again. That's something that is impossible with HTTP, as there isn't even a guarantee that the same URL will return the same content when you access it twice. With IPFS anybody can mirror the content and keep it alive as long as they want; they don't have to hope that the server that hosts it right now keeps running.
That said, IPFS isn't quite perfect here, as IPFS hashes do not actually point to the content itself; they point to a package that contains the content, and depending on how that package was built, the hash will change.
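A toy illustration of why the packaging matters (real IPFS builds UnixFS Merkle DAGs with multihash CIDs; this sketch only shows the chunking effect with plain SHA-256):

```python
import hashlib

def merkle_root(data: bytes, chunk_size: int) -> str:
    # IPFS splits a file into chunks and hashes a tree of chunk hashes, so
    # the top-level hash depends on chunking parameters, not just the raw
    # bytes. (Toy two-level scheme; not the actual UnixFS layout.)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_hashes = b"".join(hashlib.sha256(c).digest() for c in chunks)
    return hashlib.sha256(leaf_hashes).hexdigest()

data = b"x" * 1000
# Same bytes, different chunk size -> different top-level address:
assert merkle_root(data, 256) != merkle_root(data, 512)
```

So two people who independently add the same file with different chunking settings can end up with different addresses for identical content.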
Seems like this pointlessly duplicates the "Naming Things with Hashes" ni: and nih: URI schemes, standardised in https://tools.ietf.org/html/rfc6920 . (IPFS actually uses a proprietary "package" representation as the parent points out, so there's little harm in it not using ni: URIs. But standard hashes of the underlying resource should use those. "Magnet" links are similarly problematic, but those at least support a few convenience features.)
That still requires you to deliberately decide to archive a certain page, which you can do just as easily with http. The SingleFile browser extension will grab everything that's linked to a page and build a locally stored directory with all the dependencies of that page.
> That still requires you to deliberately decide to archive a certain page
You don't "archive" a page with IPFS; with IPFS, everything you keep a copy of stays available under the same address, and that happens either automatically via the cache or via a manual 'pin'. That's fundamentally different from what HTTP does.
The archive copy you create of an HTTP site is your own personal copy and completely inaccessible to anybody else. Even if you put it online, people would have no clue where to find it. With IPFS the document never leaves the address space and stays accessible to everybody under the same address, no matter who decides to host it.
Another important practical difference is that IPFS has native support for directories, so you don't have to try to spider around to try to guess all the URLs, you can just grab the whole directory at once. That in turn also has the nice side effect that .zip archives essentially become irrelevant, as you can just upload the directory itself.
It's the fundamental lie of IPFS. IPFS people will jump in and say that it's not a lie, what they're actually saying is yadda yadda, but then they turn around and say exactly that five minutes later. It's the motte-and-bailey ( https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy ) of IPFS.
In a world of finite storage, nobody's going to keep up a copy of everything. Nobody will have a copy of most things. Even if IPFS worked acceptably, even if it worked as very narrowly promised, plenty of stuff would fall off the web. At best, we'd see somewhat fewer temporary disruptions of very currently popular content.
Yeah, in the article there is an example of a video that was downloaded multiple times, which is supposedly so inefficient because "HTTP".
The proposed solution for making it more efficient is that someone else should host it for you, ideally for free.
It is just an exercise in throwing around big numbers and utter ignorance to impress people; the downloaded megabytes are not magically going away.
Just like torrents: no one wants to seed or pay hosting costs, everyone wants to download. There is no protocol that is going to fix that. Why is everyone mining BTC like crazy? Because they get money for it.
A big problem in the NFT space of late is that OpenSea sells NFTs that link to an IPFS URL, then doesn't bother seeding the image after it's sold, so a pile of NFT images no longer exist anywhere on IPFS.
Isn't that a feature? Some NFTs are hashes of recordings of destroying some physical piece of art. Surely the next level is for NFT of jpegs where only the hash is left! It will be the next big thing to go to the moon! The uncertainty about someone finding a forgotten copy adds incredible depth to the sport.
I don't know about IPFS but Arweave solved the "permanent" part by asking nodes to periodically prove that they're actually storing what they're supposed to be storing: https://www.arweave.org/
The idea is to incentivise hosting via crypto: with Arweave, there's a bigger upfront fee but no ongoing fees. In theory, it's meant to secure 200 years of storage by putting aside most of the upfront fees for later and (conservatively) assuming storage costs decrease by at least 0.5% per year.
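The endowment arithmetic, as a rough sketch using the numbers mentioned above (illustrative units only, not Arweave's actual pricing model):

```python
# Back-of-envelope sketch of an endowment-style storage fee: charge one
# upfront amount sized to cover ~200 years, assuming the annual cost of
# storing a fixed amount of data falls by at least 0.5% per year.
annual_cost_today = 1.0   # hypothetical cost units per GB-year
decline = 0.005           # conservative 0.5% annual cost decrease
years = 200

upfront_fee = sum(annual_cost_today * (1 - decline) ** t for t in range(years))
# The fee is large but finite: roughly 127x one year's storage cost here,
# noticeably less than a flat 200x because later years keep getting cheaper.
```

The more aggressively storage costs actually fall, the more of that upfront fee is left over to fund storage beyond the 200-year target.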
Basically, with IPFS you try to download some content by referring to it by a hash. This means that the content will be available forever, as long as there is someone willing to cache that content. Which is, obviously, something that won't always happen. But it _can_ theoretically happen, so people with a cache of some old web page may revive it, even after the original site is gone.
How do you refer to something "by name"? Let's say the current version of Wikipedia's article on Lagrange multipliers: the data is changing, so you can't use a hash of the content.
With IPFS you access content by hash. So if you have a plain IPFS link to "Lagrange multiplier", it will always point to exactly the same version you are looking at right now. No way to change or update that.
If you want to update content, you have to point the user to a new hash. IPFS has the IPNS mechanism for that: it adds a layer of indirection, so instead of pointing to the IPFS hash directly, you point to an IPNS name which in turn points to the current hash. What an IPNS name points to can be updated by the owner of that name.
Another option is to do it via plain old DNS and have a DNS record point to the current hash of the website.
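The indirection scheme can be sketched in a few lines (plain SHA-256 stands in for real IPFS CIDs, and the name records here are a hypothetical in-memory stand-in for IPNS, not the real API):

```python
import hashlib

def cid(data: bytes) -> str:
    # Stand-in for a real IPFS CID: address derived from the content.
    return hashlib.sha256(data).hexdigest()

name_records = {}  # mutable name -> current immutable content hash

def publish(name: str, data: bytes) -> str:
    addr = cid(data)
    name_records[name] = addr  # the only mutable step; owner-controlled in IPNS
    return addr

v1 = publish("/ipns/example-site", b"article v1")
v2 = publish("/ipns/example-site", b"article v1 + new comment")
assert v1 != v2                                    # each version keeps its address
assert name_records["/ipns/example-site"] == v2    # the name follows the latest
```

Old hashes keep resolving to their exact versions; only the name-to-hash mapping moves, which is what makes updates possible without breaking existing links.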
The hash is not a literal hash of the content; I think it's more like a key that lets anyone looking for it verify it's legitimate. The IPFS p2p search mechanism is then what lets you find it.
Thanks! I was thinking too vaguely of "hash", and didn't think about the requirement that the hash of something large, and the hash of that same thing but with a big chunk of it replaced with, uh, a hash or digest of that chunk, be different.
I wonder if it could make running a service like archive.org easier. Or if it could allow for a distributed archive.org service where nerds worldwide can contribute x GB storage to the cause?
This is one of the biggest stated use cases of IPFS as it effectively replaces archive.org with a better version that is more distributed, has better uptime, and more importantly has a much larger and nearly complete and pristine version of the entire internet for conceivably as long as the internet itself exists.
It does only the first of those things: using content hashes means that anyone can populate an archive which is easily discovered.
For the rest, hosting takes money. People will not archive the entire internet for free and IPFS is not a magic wand which eliminates the need to have people like the skilled IA team. It could make their jobs easier but that’s far from “nearly complete” and no more or less pristine.
It gives you the tools to build an archive.org equivalent using volunteer storage though, rather than asking for monetary donations. All you need is a database of known content hashes and a database of volunteers; you randomly distribute content among volunteers and periodically ensure a minimum number of clients are replicating each known hash.
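That coordinator loop could be sketched roughly like this (all names hypothetical; a real system would ask nodes to pin CIDs via the IPFS pinning interface rather than mutate a dict):

```python
import random
from collections import defaultdict

MIN_REPLICAS = 3  # assumed minimum copies per content hash

def rebalance(hashes, volunteers, holders):
    # holders: content hash -> set of volunteer ids currently pinning it.
    # For any under-replicated hash, recruit extra volunteers at random.
    for h in hashes:
        deficit = MIN_REPLICAS - len(holders[h])
        if deficit > 0:
            candidates = [v for v in volunteers if v not in holders[h]]
            for v in random.sample(candidates, deficit):
                holders[h].add(v)  # in reality: tell node v to pin hash h

holders = defaultdict(set)
rebalance(["QmAAA", "QmBBB"], ["n1", "n2", "n3", "n4"], holders)
assert all(len(holders[h]) >= MIN_REPLICAS for h in ("QmAAA", "QmBBB"))
```

Run periodically, a sweep like this is what keeps replication above the floor as volunteers come and go.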
“All you need” is only true at the highest level: IPFS gives you a great way to discover replicated content. It doesn't help you know that the list of known hashes is complete (consider how much work IA has spent making sure that they crawl sites completely enough to be able to replay complex JavaScript), handle the scale of that list (this is a VERY large database which updates constantly), or provide networked storage at a scale measured in the hundreds of petabytes.
Volunteer capacity at the scale of many petabytes of online storage is unproven, and the long tail of accesses means you're going to have to think not just about the high replication factor needed but also about the bandwidth available to serve that content in a timely manner and rebuild a missing replica before another fails.
> It doesn't help you know that the list of known hashes is complete
Right, which is why I said it would have to periodically scan the registered clients to ensure a minimum number of clients has each block to ensure redundancy.
> also the bandwidth available to serve that content in a timely manner and rebuild a missing replica before another fails.
I think a slow, cheap but reliable archive is better than "more expensive but lower latency", so I'm not particularly concerned with timeliness.
> > It doesn't help you know that the list of known hashes is complete
> Right, which is why I said it would have to periodically scan the registered clients to ensure a minimum number of clients has each block to ensure redundancy.
That's the easy problem, not the hard one I was referring to: doing what IA does requires you to be able to crawl web resources and identify everything which needs to be available for a page snapshot to be usable. IPFS only helps with that in the sense that you can tell whether you have the same URL payload without requesting it — you still need to handle dynamic behaviour and that's most of the work.
> I think a slow, cheap but reliable archive is better than "more expensive but lower latency", so I'm not particularly concerned with timeliness.
What I would be concerned with is “more expensive, higher latency, and greater risk of irrecoverable failure”. Relying on volunteers means that you need far more copies because nobody has a commitment to provide resources or even tell you if they decide to stop (“Oops, out of space. Let me clear some up — someone else must have this…”), and the network capacity isn't just a factor for user experience — although that can prevent adoption if it's too slow — but more importantly because it needs to be available enough to rebuild missing nodes before other ones also disappear.
> you still need to handle dynamic behaviour and that's most of the work.
I'm not sure what you think would be difficult exactly. You've said that archive.org has already done the programming needed to ensure dynamic resources are discovered, and now those resources are content ids rather than URLs. Nothing's really changed on this point.
> Relying on volunteers means that you need far more copies because nobody has a commitment to provide resources or even tell you if they decide to stop
Yes, but you would also have many more volunteers. Many people who wouldn't donate financially would donate CPU and storage. We saw this with SETI@home and folding@home, for instance.
> nobody has a commitment to provide resources or even tell you if they decide to stop
Why not? If you provide a client to participate as a storage node for archive.org, like SETI@home, then they would know your online/offline status and how much storage you're willing to donate. If you increase/decrease the quota, it could notify the network of this change.
> I'm not sure what you think would be difficult exactly. You've said that archive.org has already done the programming needed to ensure dynamic resources are discovered, and now those resources are content ids rather than URLs. Nothing's really changed on this point.
The point was that it's outside of the level which IPFS can possibly help with. IA actively maintains the code which does this and any competing project would need to spend the same time on that for the same reasons.
> Yes, but you would also have many more volunteers. Many people who wouldn't donate financially would donate CPU and storage. We saw this with SETI@home and folding@home, for instance.
That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project. There's a substantial difference between saying something can use idle CPU and a modest amount of traffic versus using large amounts of storage and network bandwidth.
SETI@home appears to have on the order of ~150k participating computers. IA uses many petabytes of storage, so let's assume that each of those computers has 10TB of storage free to offer — far more than the average consumer system — and that all of them switch over; that'd be 1.5EB of raw storage. That sounds like a lot, but you need many copies to handle unavailable nodes, and one factor controlling how many copies you need is how much bandwidth each owner can give you: it doesn't help very much if someone has 2PB of storage if they're on a common asymmetric 1000/50Mbps connection and want to make sure that archive access doesn't interfere with their household's video calls or gaming. Once you start making more than a couple of copies, that total capacity no longer looks like far more resources than IA has.
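The back-of-envelope numbers above, spelled out (every figure here is one of the stated assumptions, not a measurement):

```python
# Rough capacity check for a volunteer-run archive:
nodes = 150_000                 # assumed SETI@home-scale participation
tb_per_node = 10                # assumed (generous) free storage per volunteer
raw_tb = nodes * tb_per_node    # 1,500,000 TB = 1.5 EB raw capacity
replication = 10                # assumed extra copies to cover flaky nodes
effective_pb = raw_tb / replication / 1_000   # usable capacity in petabytes
# ~150 PB effective: in IA's ballpark rather than vastly beyond it.
```

The replication factor is the whole argument: the raw exabyte-scale number shrinks by an order of magnitude once you account for unreliable, bandwidth-constrained volunteers.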
> > nobody has a commitment to provide resources or even tell you if they decide to stop
> Why not? If you provide a client to participate as a storage node for archive.org, like SETI@home, then they would know your online/offline status and how much storage you're willing to donate. If you increase/decrease the quota, it could notify the network of this change.
All of what you're talking about is voluntary. One challenge of systems like this is that you don't know whether a node which simply disappears is going to come back or whether you need to create a new replica somewhere else. Did someone disappear because they had a power outage or ISP failure, tripped over the power cord for an external hard drive, temporarily killed the client to avoid network contention, just got hit with ransomware, etc., or did they decide they were bored with the project and uninstall it?
Since you don't have an SLA, you have to take a conservative approach — lots of copies, geographically separated, etc. — which reduces the total system capacity and introduces performance considerations.
> IA actively maintains the code which does this and any competing project would need to spend the same time on that for the same reasons.
I'm not sure why they would have to compete. They're literally solving the same problem in basically the same way. I see no reason to fork this code.
> That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project.
But this isn't a concern of a web archiving project any more if content-based addressing becomes the standard, because pervasive caching is built into the protocol itself. Publishing anything on such a network means you are already giving up some control you would otherwise have in where this content will be served from, how it's cached, how long it lasts, etc.
> All of what you're talking about is voluntary. One challenge of systems like this is that you don't know whether a node which simply disappears is going to come back or you need create a new replica somewhere else.
Yes, you would have to be more pessimistic and plan for more redundancy than you would otherwise need. Each node in a Google-scale distributed system has a low expected failure rate, but they still see regular failures. No doubt they have a minimum redundancy calculation based on this failure rate. The same logic applies here, but the failure rate would likely have to be jacked up.
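The minimum-redundancy calculation can be sketched directly (independent failures assumed; the availability figures are hypothetical):

```python
def replicas_needed(availability: float, max_loss: float = 1e-5) -> int:
    # Smallest replica count r such that the chance of all r copies being
    # offline at once, (1 - availability)^r, drops below max_loss.
    r = 1
    while (1 - availability) ** r > max_loss:
        r += 1
    return r

# A well-run node at 99% uptime vs a volunteer laptop at 70% uptime:
assert replicas_needed(0.99) == 3
assert replicas_needed(0.70) == 10
```

That's the "jacked up" failure rate in concrete terms: dropping per-node availability from 99% to 70% more than triples the copies needed for the same reliability target.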
> Since you don't have an SLA, you have to take conservative approach — lots of copies, geographically separated, etc. — which reduces the total system capacity and introduces performance considerations.
Whether there would be performance problems isn't clear. Content-based addressing is already fairly slow (at this time), but once content is resolved, fragments of it can be delivered from multiple sources concurrently, and from spatially closer sources. Higher latency, but more parallelism.
I'm not willing to invest the time needed to gather the data you're asking about to actually quantify all of the requirements, but despite the points you've raised, I still don't see any real obstacles in principle.
> I'm not sure why they would have to compete. They're literally solving the same problem in basically the same way. I see no reason to fork this code.
The point was simply that the original comment I was replying to claiming that this made it easy to replace archive.org was really only relevant to one fraction of what an archiving project would involve. If IA is going strong on their side, it's not clear why this project would get traction.
> > > That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project.
> But this isn't a concern of a web archiving project any more if content-based addressing becomes the standard, because pervasive caching is built into the protocol itself. Publishing anything on such a network means you are already giving up some control you would otherwise have in where this content will be served from, how it's cached, how long it lasts, etc.
That's a separate problem: the two I described cover the commitment of storage, which is unlike the distributed computing projects in that it's only valuable if sustained for more than a short period of time, and the legal considerations. If you run SETI@home, you aren't going to get a legal threat or an FBI agent inquiring why your IP address was serving content which you don't have rights to or which isn't legal where you live.
> The same logic applies here, but the failure rate would likely have to be jacked up.
Yes, that's the point: running a service like this on a voluntary basis requires significantly more redundancy because you're getting fewer resources per node, have a higher risk of downtime or permanent loss of a node, and replication times are significantly greater. Yes, all of those are problems which can be addressed with careful engineering but I think they're also a good explanation for why P2P tools have been far less compelling in practice than many of us hoped. Trying to get volunteers to host things which don't personally and directly benefit them seems like more than a minor challenge unless the content is innocuous and relatively small.
> Whether there would be performance problems isn't clear. Content-based addressing is already fairly slow (at this time), but once content is resolved, fragments of content can be delivered from multiple sources concurrently, and from more spatially close sources. Higher latency, but more parallelism.
The problem is bootstrapping: until you get a lot of people, those assumptions won't be true, and a worse experience is one of the major impediments to getting more people. For something like web archiving — where the hardest part is crawling, which this doesn't help with at all, and where there's already a popular, generally well-liked service — having a detailed plan for that seems like the most important part.
It's more that IPFS enables a permanent web, not that the technology immediately has this property as a feature. It requires players like the Internet Archive, libraries, website hosts, and individuals to pin content and develop interesting pinning strategies. For example, browsers could pin everything that is bookmarked and everything that is in the browser cache.
The claims are too strong, but it is strictly more available than today: if the original host stays up, the content is guaranteed to stay available, same as today. If the host disappears, the content may still stay available, which is stronger than today (ignoring the Internet Archive).