
This is one of the biggest stated use cases of IPFS: it could effectively replace archive.org with something better, more distributed, with better uptime, and, more importantly, with a much larger, nearly complete, and pristine copy of the entire internet for conceivably as long as the internet itself exists.


It does only the first of those things: using content hashes means that anyone can populate an archive which is easily discovered.

For the rest, hosting takes money. People will not archive the entire internet for free and IPFS is not a magic wand which eliminates the need to have people like the skilled IA team. It could make their jobs easier but that’s far from “nearly complete” and no more or less pristine.


It gives you the tools to build an archive.org equivalent using volunteer storage, though, rather than asking for monetary donations. All you need is a database of known content hashes and a database of volunteers: you randomly distribute content among the volunteers and periodically ensure that a minimum number of clients is replicating each known hash.
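A toy sketch of that coordination loop (all names are hypothetical; it assumes the hash and volunteer registries already exist and hand-waves the actual pin requests to nodes):

```python
import random

# Hypothetical registries: which volunteers currently hold which content hash.
known_hashes = {"cid_a": {"node1"}, "cid_b": {"node1", "node2"}}
volunteers = ["node1", "node2", "node3", "node4"]
MIN_REPLICAS = 2

def assign_replica(cid, holders):
    """Ask a randomly chosen volunteer that lacks this hash to pin it."""
    candidates = [v for v in volunteers if v not in holders]
    if not candidates:
        return False
    holders.add(random.choice(candidates))  # in reality: send a pin request
    return True

def rebalance():
    """Periodic pass: top up any hash that fell below the replication target."""
    for cid, holders in known_hashes.items():
        while len(holders) < MIN_REPLICAS and assign_replica(cid, holders):
            pass

rebalance()
```

In a real system the "scan" direction also matters: holders must be periodically audited (e.g. with challenge reads) rather than trusted to still have the data.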


“All you need” is only true at the highest level: IPFS gives you a great way to discover replicated content. It doesn't help you know that the list of known hashes is complete (consider how much work IA has spent making sure that they crawl sites completely enough to be able to replay complex JavaScript), handle the scale of that list (this is a VERY large database which updates constantly), or provide networked storage at a scale measured in the hundreds of petabytes.

Volunteer capacity at the scale of many petabytes of online storage is unproven, and the long tail of accesses means you're going to have to think not just about the high replication factor needed but also about the bandwidth available to serve that content in a timely manner and to rebuild a missing replica before another fails.
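To make the bandwidth concern concrete, here is a back-of-envelope calculation (every number is an illustrative assumption, not a measurement) of how long rebuilding one lost replica takes over a typical consumer uplink:

```python
# How long to restream a lost 10 TB replica from a volunteer's 50 Mbit/s uplink?
replica_bytes = 10e12            # 10 TB held by the failed node (assumed)
uplink_bytes_per_s = 50e6 / 8    # 50 Mbit/s upstream, in bytes per second
rebuild_days = replica_bytes / uplink_bytes_per_s / 86_400
print(round(rebuild_days, 1))    # ~18.5 days, during which more nodes can fail
```

A multi-week repair window is exactly why the replication factor has to be high: other copies of the same data must survive the entire rebuild.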


> It doesn't help you know that the list of known hashes is complete

Right, which is why I said it would have to periodically scan the registered clients and verify that a minimum number of them holds each block, maintaining redundancy.

> also the bandwidth available to serve that content in a timely manner and to rebuild a missing replica before another fails.

I think a slow, cheap but reliable archive is better than "more expensive but lower latency", so I'm not particularly concerned with timeliness.


> > It doesn't help you know that the list of known hashes is complete

> Right, which is why I said it would have to periodically scan the registered clients and verify that a minimum number of them holds each block, maintaining redundancy.

That's the easy problem, not the hard one I was referring to: doing what IA does requires you to be able to crawl web resources and identify everything which needs to be available for a page snapshot to be usable. IPFS only helps with that in the sense that you can tell whether you have the same URL payload without requesting it — you still need to handle dynamic behaviour and that's most of the work.

> I think a slow, cheap but reliable archive is better than "more expensive but lower latency", so I'm not particularly concerned with timeliness.

What I would be concerned with is “more expensive, higher latency, and greater risk of irrecoverable failure”. Relying on volunteers means that you need far more copies, because nobody has a commitment to provide resources or even to tell you if they decide to stop (“oops, out of space. Let me clear some up — someone else must have this…”). And the network capacity isn't just a factor for user experience — although that can prevent adoption if it's too slow — but more importantly because it needs to be available enough to rebuild missing nodes before other ones also disappear.


> you still need to handle dynamic behaviour and that's most of the work.

I'm not sure what you think would be difficult exactly. You've said that archive.org has already done the programming needed to ensure dynamic resources are discovered, and now those resources are content ids rather than URLs. Nothing's really changed on this point.

> Relying on volunteers means that you need far more copies because nobody has a commitment to provide resources or even tell you if they decide to stop

Yes, but you would also have many more volunteers. Many people who wouldn't donate financially would donate CPU and storage. We saw this with SETI@home and folding@home, for instance.

> nobody has a commitment to provide resources or even tell you if they decide to stop

Why not? If you provide a client to participate as a storage node for archive.org, like SETI@home, then they would know your online/offline status and how much storage you're willing to donate. If you increase/decrease the quota, it could notify the network of this change.


> I'm not sure what you think would be difficult exactly. You've said that archive.org has already done the programming needed to ensure dynamic resources are discovered, and now those resources are content ids rather than URLs. Nothing's really changed on this point.

The point was that it's outside of the level which IPFS can possibly help with. IA actively maintains the code which does this and any competing project would need to spend the same time on that for the same reasons.

> Yes, but you would also have many more volunteers. Many people who wouldn't donate financially would donate CPU and storage. We saw this with SETI@home and folding@home, for instance.

That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project. There's a substantial difference between saying something can use idle CPU and a modest amount of traffic versus using large amounts of storage and network bandwidth.

SETI@home appears to have on the order of ~150k participating computers. IA uses many petabytes of storage, so let's assume that each of those computers has 10TB of storage free to offer — which is far more than the average consumer system has — and that all of them switch over: that'd be 1.5EB of raw storage. That sounds like a lot, but you need many copies to handle unavailable nodes, and one factor controlling how many copies you need is how much bandwidth each owner can give you — it doesn't help very much if someone has 2PB of storage if they're on a common asymmetric 1000/50Mbps connection and wants to make sure that archive access doesn't interfere with their household's video calls or gaming. Once you start making more than a couple of copies, that total capacity no longer looks like far more resources than IA has.
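The arithmetic above can be checked directly (same assumed numbers as in the comment; the replica count is also an assumption):

```python
nodes = 150_000               # ~SETI@home-scale participation (assumed)
per_node_tb = 10              # optimistic free storage per volunteer (assumed)
raw_tb = nodes * per_node_tb
print(raw_tb / 1_000_000)     # 1.5 EB of raw capacity

replicas = 5                  # conservative copy count for unreliable nodes (assumed)
print(raw_tb / replicas / 1_000)  # 300.0 PB usable after replication
```

At five copies, the optimistic 1.5EB shrinks to a few hundred petabytes of usable space — the same order of magnitude as IA itself, not a vast improvement.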

> > nobody has a commitment to provide resources or even tell you if they decide to stop

> Why not? If you provide a client to participate as a storage node for archive.org, like SETI@home, then they would know your online/offline status and how much storage you're willing to donate. If you increase/decrease the quota, it could notify the network of this change.

All of what you're talking about is voluntary. One challenge of systems like this is that you don't know whether a node which simply disappears is going to come back or whether you need to create a new replica somewhere else. Did someone disappear because they had a power outage or ISP failure, tripped over the power cord for an external hard drive, temporarily killed the client to avoid network contention, just got hit with ransomware, etc., or did they decide they were bored with the project and uninstall it?

Since you don't have an SLA, you have to take a conservative approach — lots of copies, geographically separated, etc. — which reduces the total system capacity and introduces performance considerations.


> IA actively maintains the code which does this and any competing project would need to spend the same time on that for the same reasons.

I'm not sure why they would have to compete. They're literally solving the same problem in basically the same way. I see no reason to fork this code.

> That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project.

But this isn't a concern of a web archiving project any more if content-based addressing becomes the standard, because pervasive caching is built into the protocol itself. Publishing anything on such a network means you are already giving up some control you would otherwise have in where this content will be served from, how it's cached, how long it lasts, etc.

> All of what you're talking about is voluntary. One challenge of systems like this is that you don't know whether a node which simply disappears is going to come back or whether you need to create a new replica somewhere else.

Yes, you would have to be more pessimistic and plan for more redundancy than you would otherwise need. Each node in a Google-scale distributed system has a low expected failure rate, but they still see regular failures. No doubt they have a minimum redundancy calculation based on this failure rate. The same logic applies here, but the assumed failure rate would have to be jacked up.
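Under the simplest independence assumption (every number here is illustrative), that calculation is straightforward: an object with r replicas is lost only if all r disappear within one repair window, with probability p**r, so the required replica count grows quickly as per-node reliability drops:

```python
import math

def replicas_needed(p_node_loss, max_object_loss):
    """Smallest replica count keeping object-loss probability under the target,
    assuming independent node failures within one repair window."""
    return math.ceil(math.log(max_object_loss) / math.log(p_node_loss))

# Fairly reliable nodes (2% loss per window) vs flaky volunteers (15% loss),
# both targeting a one-in-a-billion object-loss probability.
print(replicas_needed(0.02, 1e-9))   # 6 replicas
print(replicas_needed(0.15, 1e-9))   # 11 replicas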

> Since you don't have an SLA, you have to take conservative approach — lots of copies, geographically separated, etc. — which reduces the total system capacity and introduces performance considerations.

Whether there would be performance problems isn't clear. Content-based addressing is still fairly slow today, but once content is resolved, fragments of it can be delivered from multiple sources concurrently, and from sources that are physically closer. Higher latency, but more parallelism.
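A toy illustration of that parallelism argument (simulated peers and chunks; real peer selection, content hashing, and verification are all elided):

```python
import concurrent.futures

# Three simulated peers, each holding all eight four-byte chunks of one item.
peers = {f"peer{i}": {f"chunk{c}": bytes([c]) * 4 for c in range(8)}
         for i in range(3)}

def fetch_chunk(peer, chunk_id):
    """Stand-in for a network fetch of one chunk from one peer."""
    return peers[peer][chunk_id]

def fetch_parallel(chunk_ids):
    """Spread chunk requests round-robin over peers, fetch them concurrently,
    and reassemble the item in order."""
    names = list(peers)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_chunk, names[i % len(names)], cid)
                   for i, cid in enumerate(chunk_ids)]
        return b"".join(f.result() for f in futures)

data = fetch_parallel([f"chunk{c}" for c in range(8)])
print(len(data))  # 32 bytes reassembled from three peers
```

This is essentially how BitTorrent-style swarming trades per-connection throughput for aggregate parallelism.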

I'm not willing to invest the time needed to gather the data you're asking about to actually quantify all of the requirements, but despite the points you've raised, I still don't see any real obstacles in principle.


> I'm not sure why they would have to compete. They're literally solving the same problem in basically the same way. I see no reason to fork this code.

The point was simply that the original comment I was replying to, which claimed that this made it easy to replace archive.org, addressed only one fraction of what an archiving project would involve. If IA is going strong on its side, it's not clear why this project would get traction.

> > > That's an interesting theory but do we have any evidence suggesting that it's likely? In particular, SETI@home / folding@home did not involve either substantial resource commitments or potential legal problems, both of which would be a concern for a web archiving project.

> But this isn't a concern of a web archiving project any more if content-based addressing becomes the standard, because pervasive caching is built into the protocol itself. Publishing anything on such a network means you are already giving up some control you would otherwise have in where this content will be served from, how it's cached, how long it lasts, etc.

That's a separate problem: the two concerns I described were the commitment of storage, which is unlike the distributed computing projects in that it's only valuable if people provide it for more than a short period of time, and the legal consideration. If you run SETI@home you aren't going to get a legal threat or an FBI agent inquiring why your IP address was serving content which you don't have rights to or which isn't legal where you live.

> The same logic applies here, but the failure rate would likely have to be jacked up.

Yes, that's the point: running a service like this on a voluntary basis requires significantly more redundancy because you're getting fewer resources per node, have a higher risk of downtime or permanent loss of a node, and replication times are significantly greater. Yes, all of those are problems which can be addressed with careful engineering but I think they're also a good explanation for why P2P tools have been far less compelling in practice than many of us hoped. Trying to get volunteers to host things which don't personally and directly benefit them seems like more than a minor challenge unless the content is innocuous and relatively small.

> Whether there would be performance problems isn't clear. Content-based addressing is still fairly slow today, but once content is resolved, fragments of it can be delivered from multiple sources concurrently, and from sources that are physically closer. Higher latency, but more parallelism.

The problem is bootstrapping: until you get a lot of people, those assumptions won't hold, and a worse experience is one of the major impediments to getting more people. For something like web archiving, where the hardest part is crawling (which this doesn't help with at all) and where there's already a popular, generally well-liked service, having a detailed plan for that bootstrapping seems like the most important part.



