Is this file system corruption? The article isn’t 100% clear it ‘only’ loses som...

cmurf · on Jan 2, 2021

There's not enough information to know if all the reported problems are the result of the same defect. But in: https://github.com/microsoft/WSL/issues/5895

The first instance of a problem is:

    [    1.956835] JBD2: Invalid checksum recovering block 97441 in log

And that's corruption that leads to log replay failing, i.e. rejecting it because honoring the replay in the face of checksum errors could make things much worse. Subsequently mount fails:

    [   21.151232] ERROR: MountExt4:1659: mount(/dev/sdb) failed 5

The failed mount is a good thing, because the purpose of journal replay is to make the file system consistent following a crash/power failure. If the file system is dirty, replay is called for, but it can't happen due to a corrupt journal, so now an fsck is required. i.e. it is in an inconsistent (you could say partly broken) state and needs repair.

I haven't seen syslog/systemd journal for other cases to know if there are instances where ext4 log replay succeeds, but with missing files. That's not file system corruption, even if it leads to an inconsistent state in a git repository (or even a database). But this is still concerning, because a situation where log replay is clean but files are missing suggests an entire transaction was just dropped. It never made it to stable media, and the metadata was not even partially written to the ext4 journal.

qemu-kvm has a (host) cache setting called "unsafe". Default is typically "none" or "writeback". The unsafe mode can result in file system corruption if the host crashes or has a power failure. The guest's IO is faster with this mode, but the write ordering expected by the file system is not guaranteed if the host crashes. i.e. writes can hit stable media out of order. If the guest crashes, my experience has been that things are fine - subsequent log replay (in the guest) is successful, because the guest writes that made it to the host cache do make it to stable media by the time the guest reboots. The out of order writes don't matter... unless the host crashes, and then it's a big problem. The other qemu cache modes have rather different flush/fua policies that can still keep a guest file system consistent following a host crash. But they are slower.
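To make the ordering argument concrete, here's a toy model (my own simplification, not how jbd2 or qemu actually work). It enumerates which writes from one hypothetical journal transaction could be on stable media after a host crash, with and without the flush being honored. The block names and the `FLUSH` marker are made up for illustration.

```python
from itertools import combinations

# Hypothetical write sequence for one journal transaction.
# "FLUSH" stands for a cache flush / barrier issued by the guest file system.
WRITES = ["journal_block_1", "journal_block_2", "FLUSH", "commit_record"]


def respects_flushes(writes, persisted):
    """True if no write after a flush is persisted while a write before it is missing."""
    before_flush = []  # writes issued before the most recent flush
    since_flush = []   # writes issued since the most recent flush
    for w in writes:
        if w == "FLUSH":
            before_flush.extend(since_flush)
            since_flush = []
            continue
        if w in persisted and any(b not in persisted for b in before_flush):
            return False
        since_flush.append(w)
    return True


def crash_states(writes, honor_flush):
    """All sets of writes that might be on stable media when the host crashes."""
    blocks = [w for w in writes if w != "FLUSH"]
    states = set()
    for n in range(len(blocks) + 1):
        for subset in combinations(blocks, n):
            state = frozenset(subset)
            if honor_flush and not respects_flushes(writes, state):
                continue
            states.add(state)
    return states


def is_torn(state):
    """Commit record is present but some journal block it covers is missing."""
    return "commit_record" in state and not {"journal_block_1", "journal_block_2"} <= state


safe = crash_states(WRITES, honor_flush=True)
unsafe = crash_states(WRITES, honor_flush=False)

print("torn state possible when flush is honored:", any(is_torn(s) for s in safe))
print("torn state possible when flush is ignored:", any(is_torn(s) for s in unsafe))
```

In this model, a "torn" state (commit record on disk, journal contents not) only becomes reachable once the flush is ignored, which is the kind of state that shows up as a journal checksum failure at replay time.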

So it makes me suspicious that, for performance reasons, WSL2 might be using a possibly volatile host side caching policy. As an additional data point, it might be interesting to try to reproduce this problem using e.g. Btrfs for the guest file system. If write order is honored and writes are flushed to stable media as appropriate for a default, out of the box VM configuration, I'd expect Btrfs to never complain, though it might drop up to 30s of writes. But if there are out of order writes making it to stable media, Btrfs will also complain. I'd expect transid errors, which are also a hallmark of drive firmware not consistently honoring flush/fua followed by a badly timed crash. (And similar for ZFS for that matter - nothing is impervious to having its write order expectations blown up.)
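For reference, this is roughly what "flushed to stable media" looks like from an application inside the guest. It's a minimal sketch assuming a POSIX/Linux guest; `durable_write` is a name I made up, not something from git or WSL. Note that it can only *request* durability: if a layer underneath (host cache, drive firmware) ignores the flush, the application has no way to know.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path so that, if flushes are honored, a crash leaves either
    the old file or the complete new one (hypothetical helper, Linux/POSIX only)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to flush the file to stable media
        os.replace(tmp, path)     # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # fsync the directory so the rename itself is requested to be durable
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Usage would be something like `durable_write("state.bin", b"payload")`. The write-then-fsync-then-rename order is the point: it only works if everything below the guest preserves that order.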

Someone · on Jan 2, 2021

Thanks. That definitely is file system corruption. And that is very scary, as (assuming you have backups, which you should) losing files you’re working on is not the biggest problem you can have. That will lose you a few days at most.

Silent corruption of parts of the disk that you rarely access but still want to keep is scarier (you might have rotating backups for months or years and still eventually lose data)

cmurf · on Jan 3, 2021

So long as the file system is fixed, it should be straightforward to fix the git repository. I'm no git expert but maybe 'git repair' can deal with it; and if not then 'rm -rf' and 'git clone'.

Avoiding silent corruption requires full metadata and data checksumming, à la Btrfs or ZFS. In those cases, not only is corruption unambiguously detected, but it's not allowed to propagate.

Gibbon1 · on Jan 2, 2021

Not my area but I seem to remember bitches that Linux lies about fsync. As in it'll swear up and down that it flushed everything to disk, but it's lying.

Also over the years it seems like everyone I've seen that habitually edits files remotely ends up with this sort of pain and butthurt.