Is it irony or something else that a system of supposedly vast computing power and learning (and certain real-world power via its distribution of search riches) is broken by something as tiny as a missing slash?
URLs are really, really, really, really hard to get right on a large scale. For a side project I've written my own crawler/indexer and I try to do deduplication where possible, and the reality is that:
domain.com/this-page-here
can serve entirely different content from
domain.com/this-page-here/
depending on the server (and application) configuration.
Pretty much the only way to 100% reliably deduplicate URLs is to look at their content, and somehow magically compare content that can change from page load to page load -- which is a whole other problem.
Exactly. It's so difficult to get URLs "right", and that's quite non-obvious until you do something like writing a crawler.
Another example is whether foo.com/bar is the same as foo.com/BAR. Usually yes, but it's entirely possible that they will serve different content.
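To make the trailing-slash and case points concrete, here is a rough sketch of what a "safe" normalizer might look like in Python. The function name is made up for illustration, and it assumes absolute URLs with a scheme and no userinfo part. It lowercases only the scheme and host (which are case-insensitive) and deliberately leaves the path case and any trailing slash alone, since those can select different content.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_conservatively(url: str) -> str:
    """Hypothetical helper: apply only the rewrites that cannot change meaning."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),   # scheme is case-insensitive
        parts.netloc.lower(),   # host is case-insensitive (assumes no user:pass@ part)
        parts.path,             # path case and trailing slash are kept as-is
        parts.query,            # query is kept as-is
        "",                     # fragment is never sent to the server, so drop it
    ))


print(normalize_conservatively("HTTP://Foo.com/BAR"))
print(normalize_conservatively("http://foo.com/bar") ==
      normalize_conservatively("http://foo.com/BAR"))
```

The second print shows False: after normalization the two paths are still treated as distinct URLs, which is the cautious default.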
Also, which URL parameters should be disregarded, and which should be considered important? A crawler must do quite a bit of nontrivial page introspection in order to figure out the answer to that all on its own.
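As a small illustration of the parameter problem, this is roughly what the easy half looks like once you already know which parameters to ignore. The ignore list below is purely an assumption for the example (common tracking-style names); working out the right list per site is the hard part and is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed ignore list, for illustration only; a real crawler would have to learn this per site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def strip_ignored_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Hypothetical helper: drop query parameters assumed not to affect content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    # Parameter order is preserved, because some servers are order-sensitive.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


print(strip_ignored_params("http://foo.com/bar?id=7&utm_source=hn"))
```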
Often pages that are essentially the same will be a bit different. Timestamps and time-sensitive data (e.g. listings on a marketplace) will trip you up here.
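One common way to tolerate that kind of noise is to compare overlapping word sequences ("shingles") rather than exact bytes. The sketch below uses plain Jaccard similarity over 3-word shingles. It is not what any particular search engine does, the helper names are made up, and the cutoff for "same page" is something you would have to tune yourself.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word windows in the text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty pages are treated as identical
    return len(sa & sb) / len(sa | sb)


page_a = "Red bicycle for sale in good condition listed at 10:01"
page_b = "Red bicycle for sale in good condition listed at 10:02"
print(jaccard(page_a, page_b))
```

The two sample pages differ only in the timestamp, so they score high but below 1.0, whereas an exact hash comparison would simply call them different.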
I wouldn't say the crawler is broken at all. It's picky, as it should be. A URL that ends with a / can be an entirely different web page than a URL that doesn't end with a /.
Also, you might ask why Google won't ignore the canonical URL if it's an invalid URL. Well, that's what you get with the canonical URL: you're explicitly telling Google this is the "real" URL of the web page. You can't have it both ways, and then complain Google is ignoring your canonical tag.
Well, if you give a canonical tag out, you're really taking on the responsibility of resolving different URLs, so you should make sure you do it right.
<link rel="canonical" href="http://blog.mavnn.co.uk/type-providers-from-the-ground-up">
for a URL which cannot possibly return an HTTP 200. (It 301s to a URL with a / on the end.)
This combination could cause Google to conclude that you have no page which requires inclusion in their main index.
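A site owner could catch this sort of mismatch before a crawler does. Here is a minimal sketch: pull the canonical href out of the page, then check that the URL itself answers 200 rather than redirecting. The `fetch_status` argument is a stand-in you would supply yourself (for example a wrapper around an HTTP HEAD request with redirects disabled); it is faked here so the example runs without a network, and none of this reflects how Google actually evaluates the tag.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def check_canonical(html: str, fetch_status) -> str:
    """fetch_status(url) must return (status_code, location_or_None)."""
    finder = CanonicalFinder()
    finder.feed(html)
    url = finder.canonical
    if url is None:
        return "no canonical tag"
    status, location = fetch_status(url)
    if status == 200:
        return "ok"
    if status in (301, 302, 307, 308) and location:
        return f"canonical {url} redirects ({status}) to {location}"
    return f"canonical {url} returned {status}"


def fake_fetch_status(url):
    # Simulated server for this example: slashless URLs 301 to the slashed form.
    return (200, None) if url.endswith("/") else (301, url + "/")


page = ('<link rel="canonical" '
        'href="http://blog.mavnn.co.uk/type-providers-from-the-ground-up">')
print(check_canonical(page, fake_fetch_status))
```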