Bug 201685
Summary: | Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption | |
---|---|---|---
Product: | IO/Storage | Reporter: | Claude Heiland-Allen (claude) |
Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
Status: | RESOLVED CODE_FIX | |
Severity: | normal | CC: | a3at.mail, abennett72, adilger.kernelbugzilla, alexander, alzaagman, angelsl, aros, axboe, beniamino, bjo, bjoernv, borneo.antonio, bvanassche, c.andersen, calestyo, caravena, carlphilippreh, ck+kernelbugzilla, damien.wyart, dang.sananikone, dharding, donwulff, dvyukov, ego.cordatus, elliot.li.tech, eric, frederick888, ghibo, harry, hb, henrique.rodrigues, himself, jaapbuurman, James, jaygambrel, jbuchert+kbugs, Jimmy.Jazz, kernel.org, kernel.org, kernel, kevin, L.Bonnaud, linux, linuxkernel.severach, lskrejci, m, Manfred.Knick, me, michael, michael, michel, michel, molgaard, mricon, nclauzel, nestorm_des, omarandemad, pbrobinson, rdnetto, reg, richts, rlrevell, rob.izzard, scotte, seth, shopper2k, snitzer, stefan.hoelldampf, stevefan1999, steven, sven.koehler, thilo.wiesner, thomas.tomdan, tom.leiming, tytso, void1976, zajec5 |
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 4.19.x 4.20-rc | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
* dmesg 4.18.18 amdgpu.dc=0
* more infos (lspci, dmesg, etc.)
* eiskaffee - logs and other info
* dm-7 device w/ bad extra_isize errors
* dmesg w/ EXT4-fs error only
* .config of 4.19.2
* .config of 4.20.0-rc3
* 4.19 patch
* 4.20-rc3 patch
* dmesg EXT4 errors
* dumpe2fs
* debug info
* Ext4 from 4.18 (tar.gz)
* fsck output kernel 4.18
* dmesg shows errors before reboot
* logs show no error after reboot
* Config of first computer
* Config of second computer
* new generated server tecciztecatl linux kernel 4.19.6 .config
* Reproducer
* git bisect between v4.18 and 4.19-rc1
* description of my Qemu and Ubuntu configuration
* dmesg with mdraid1
* 4.19 fix
* 4.19 patch v2
* 4.19/4.20 patch v3
Similar problems on Linux Mint 19 Tara using kernels 4.19.0 and 4.19.1 On rebooting my ASUS UX 430U laptop I ended up at the initramfs prompt and had to run fsck to repair the root file system. I was then able to continue booting. Sorry didn't save the log files. This has happened randomly twice. I will post them if this happens again. Did not happen under 4.18 I'm using a 4.19.0 based kernel (with some ext4 patches for the 4.20 mainline) and I'm not noticing any file system problems. I'm running a Dell XPS 13 with an NVME SSD, and Debian testing as my userspace. It's hard to do anything with a "my file system is corrupted" report without any kind of reliable reproduction information. Remember that file system corruptions can be caused by any number of things --- buggy device drivers, buggy Nvidia binary modules that dereference wild pointers and randomly corrupt kernel memory, RAID code if you are using RAID, etc., etc. Also, the symptoms reported by Claude and Jason are very different. Claude has reported that a data block in a shared library file has gotten corrupted. Jason has reported that file system metadata corruption. This could very well be coming from different root causes. So it's better with these sorts of things to file separate bugs, and to include detailed hardware configuration details, kernel configuration, dumpe2fs outputs of the file system in question, as well as e2fsck logs. (In reply to Theodore Tso from comment #2) > I'm using a 4.19.0 based kernel (with some ext4 patches for the 4.20 > mainline) and I'm not noticing any file system problems. I'm running a > Dell XPS 13 with an NVME SSD, and Debian testing as my userspace. > > It's hard to do anything with a "my file system is corrupted" report without > any kind of reliable reproduction information. Remember that file system > corruptions can be caused by any number of things --- buggy device drivers, > buggy Nvidia binary modules that dereference wild pointers and randomly > corrupt kernel memory, RAID code if you are using RAID, etc., etc. > > Also, the symptoms reported by Claude and Jason are very different. Claude > has reported that a data block in a shared library file has gotten > corrupted. Jason has reported that file system metadata corruption. This > could very well be coming from different root causes. > > So it's better with these sorts of things to file separate bugs, and to > include detailed hardware configuration details, kernel configuration, > dumpe2fs outputs of the file system in question, as well as e2fsck logs. Thank you for your reply Theodore and I apologize for my unhelpful post. I am relatively new to this space so I find your advice very helpful. If the file system corruption happens a 3rd time (hopefully it won't), I will post a separate bug report. I also wasn't previously aware of dumpe2fs, so I will provide that helpful information next time. I have also searched to find any additional logs and it looks like fsck logs the boot info under /var/logs/boot.log and potentially /var/logs/syslog. Unfortunately the information from my last boot requiring fixation had already been overwritten. I will keep this in mind for the future. If it helps, my system uses an i7 with integrated Intel graphics. I am not running any proprietary drivers. No Raid. 500gb SSD. 16gb ram with a 4gb swap file (not a swap partition). I have been using ukuu to install mainline kernels. I did not change anything else on my system. When I jumped from 4.18.17 to 4.19.0 this problem first appeared. 
Then it occurred again after updating to 4.19.1. I'm uncertain as to whether it would be helpful or not, but while trying to figure out why this happened to me, I came across a post on Ask Ubuntu with a few others reporting similar problems. They did provide some debugging information in their post at: https://askubuntu.com/questions/1092558/ubuntu-18-04-4-19-1-kernel-after-closing-the-lid-for-the-night-not-logging-ou Again it might be a different problem from what Claude Heiland-Allen is experiencing. Thank you very much for your advice and I will try and provide some useful information including logs in a separate bug report if it happens again. Thanks for pointing out that bug. I'll note that the poster who authoritatively claimed that 4.19 is safe, and the bug obviously was introduced in 4.19.1 didn't bother to do a "git log --stat v4.19 v4.19.1". This would show that the changes were all in the Sparc architecture support, networking drivers, the networking stack, and a one-line change in the crypto subsystem.... This is why I always tell users to report symptoms, not diagnosis. And for sure, not to bias their observations by their their certainty that they have diagnosed the problem. (If they think they have diagnosed the problem, send me a patch, preferably with a reliable repro so we can add a regression test. :-) Two of my Linux machines experience regular ext4 file system corruption since I updated them to 4.19. 4.18 was fine. I noticed the problem first when certain file operations returned "Structure needs cleaning". fsck then mostly finds dangling inodes (of files I have written recently), incorrect reference counts, and so on. Both machines do not use RAID, don't use any proprietary drivers and both have an Intel board. One of them uses an SSD and one of them a HDD. Unfortunately, I don't know which information might be useful to you. Please send detailed information about your hardware (lspci -v and dmesg while it is booting would be helpful). Also please send the results of running dumpe2fs on the file system, and the kernel logs when file system operations started returning "Structure needs cleaning". I want to see if there are any other kernel messages in and around the ext4 error messages that will be in the kernel logs. Also please send me the fsck logs, and what sort of workload (what programs) you have running on your system. Also, do you do anything unusual on your machine; do you typically do clean shutdowns, or do you just do forced power-offs? Are you regularly running into a large amount of memory pressure (e.g., are you regularly using a large percentage of the physical memory available on your system.) This is going to end up being a process of elimination. 4.19 works for me. I'm using a 2018 Dell XPS 13, Model 9370, with 16GB of memory and I run a typical kernel developer workload. We also run a large number of ext4 regression testing, which generally happens on KVM for one developer, and I use Google Compute Engine for my tests. None of this detected any problems before 4.19 was released. So the question then is --- what makes people who are experiencing difficulties different from my development laptop (which also has an Intel board, and an SSD connected using NVMe) from those who are seeing problems? This is why getting lots of details about the precise hardware configuration is going to be critically important. In the ideal world we would come up with a clean, simple, reliable reproducer. 
Then we can experiment and see if the reliable reproducer continues to reproduce on different hardware, etc. Finally, since in order to figure things out we may need a lot of detail about the hardware, the software, and the applications running on each of the systems where people are seeing problems, it's helpful if new people upload all of this information onto new kernel bugzilla issues, and then mention the kernel bugzilla issue here, so people can follow the links. I'll note that a few years ago, we had a mysterious "ext4 failure" that ultimately turned out to be a Intel virtualization hardware bug, and it was the *host* version that mattered, not the *guest* kernel version that mattered. Worse, it was fixed in the very next vesion of the kernel, and so it was only people using Debian host kernels that ran into troubles --- but **only** if they were using a specific Intel chipset and Intel CPU generation. Everyone kept on swearing up and down it was an ext4 bug, and there were many angry people arguing this on bugzilla. Ultimately, it was a problem caused by a hardware bug, and a kernel workaround that was in 3.18 but not in 3.17, and Debian hadn't noticed they needed to backport the kernel workaround.... And because everyone was *certain* that the host kernel version didn't matter --- after all, it was *obviously* an ext4 bug in the guest kernel --- they didn't report it, and that made figuring out what the problem was (it took over a year) much, Much, MUCH harder. Confirming for now, similar problems with 4.19.1, 4.19.2, 4.20-rc1, 4.20-rc2 and 4.20-rc3 from the Ubuntu mainline-kernel repository (http://kernel.ubuntu.com/~kernel-ppa/mainline/). I could reproduce the issue. Probably not relevant but I had to modify the initramfs script, 4.19.0 kernel for any reason changed the mdp raid major number from 245 to 9 (i.e on a devtmpfs filesystem) and renamed them /dev/mdX instead of /dev/md_dX as before. Partitions are now major 259. Also the md/lvm devices became faster by the way. I tried ext4 mmp protection but without success (i.e not a multi mount issue). 4.20.0-rc3 kernel gives the same sort of issue. It is reproducible on an other amd64 machine with the same configuration but different hardware. Applications used for the test, sys-fs/lvm2-2.02.173, sys-fs/mdadm-4.1, sys-fs/e2fsprogs-1.44.4, sys-apps/coreutils-8.30, sys-apps/util-linux-2.33 More info in ext4_iget.txt.xz Created attachment 279557 [details]
more infos (lspci, dmesg, etc.)
compressed text file
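(For anyone else filing the details Ted asks for above, a minimal collection pass might look like the following; the device name /dev/sda2 and the output file names are only placeholders for the affected file system:)

$ lspci -v > lspci.txt
$ dmesg > dmesg-boot.txt
$ sudo dumpe2fs /dev/sda2 > dumpe2fs.txt
$ sudo tune2fs -l /dev/sda2 > tune2fs.txt
$ cat /boot/config-$(uname -r) > kernel-config.txt    (or: zcat /proc/config.gz)
$ journalctl -k -b -1 | grep -iE 'ext4|fsck' > previous-boot-errors.txt

The journalctl line only works where journald keeps persistent logs; elsewhere the fsck output may have to be copied by hand from the console or the distro's boot log.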
(In reply to Jimmy.Jazz from comment #8) typo: read -rc2 not -rc3 I'm using rc3 release now. /etc/mke2fs.conf default_mntopts user_xattr is deactivated (tune2fs -o ^user_xattr /dev/mapper/xx) on all my lvm devices. Native mdp devices still have the option set. One of my machines is always on heavy load because of daily compilations basis I do in a nilfs sandbox environment. No error for the moment. The issue was all of sudden and affected all my lvm devices. Next reboot, fsck randomly couldn't "see" any failure the kernel had reported but detection improved when used with the -D optimization option. It could be some old corruptions undetected until now. One of the server is more then 5 years old without reinstall but still with regular updates. But the other one still has the issue at its first installation. The kernel was compiled with GCC and LD=ld.bfd I was unsuccessful with CLANG. # gcc --version gcc (Gentoo 8.2.0-r4 p1.5) 8.2.0 I'm bit puzzled. Thanks Jimmy for your report. Can you specify what sort of LVM devices are you using? Is it just a standard LVM volume (e.g., no LVM raid, no LVM snapshops, no dm-thin provisioning)? The reason why I ask is because I've run gce-xfstests on 4.19, 4.19.1, and 4.19.2, and it uses LVM (nothing fancy just standard LVM volumes, although xfstests will layer some dm-error and dm-thin on top of the LVM volumes for specific xfstests) on top of virtio-scsi on top of Google Compute Engine's Persistent Disks, and I'm not noticing any problems. I just noticed that my .config file for my GCE testing has CONFIG_SCSI_MQ_DEFAULT set to "no", which means I'm not using the new block-mq data path. So perhaps this is a MQ specific bug? (Checking... hmm, my laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is *also* using CONFIG_SCSI_MQ_DEFAULT=n.) And Kconfig recommends that SCSI_MQ_DEFAULT be defaulted to y. This is why having people include their Kernel configs, and what devices they use is so important. The vast amount of time, given the constant testing which we do in the ext4 layer, more often than not the problem is somewhere *else* in the storage stack. There have been bugs which have escaped notice by our tests, yes. But it's rare, and it's almost never the case when a large number of users are reporting the same problem. Created attachment 279569 [details]
eiskaffee - logs and other info
I replaced my single dmesg attached with further information (as requested) in a tarball. Contains kernel configs, dmesg logs, tune2fs -l, lspci -vvv. I don't use LVM or MD on my machine (eiskaffee). The file system corruption I experienced was on the root partition on SSD.
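(A quick, hedged way to check whether a given machine is actually running the blk-mq path Ted mentions above; sda and dm-0 are just example device names:)

$ grep CONFIG_SCSI_MQ_DEFAULT /boot/config-$(uname -r)
$ cat /sys/module/scsi_mod/parameters/use_blk_mq
$ cat /sys/block/sda/queue/scheduler
$ cat /sys/block/dm-0/dm/use_blk_mq

With blk-mq active the scheduler line lists the mq schedulers (e.g. [mq-deadline] kyber bfq none); with the legacy path it lists noop/deadline/cfq. The config default can also be overridden at boot with scsi_mod.use_blk_mq=Y or =N on the kernel command line.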
(In reply to Theodore Tso from comment #11) > Thanks Jimmy for your report. Can you specify what sort of LVM devices are > you using? Standard lvm linear volumes on top of a full /dev/md0p5 pv partition on both machines with respectively kernel 4.19.2 and 4.20.0-rc3 > > I just noticed that my .config file for my GCE testing has > CONFIG_SCSI_MQ_DEFAULT set to "no", which means I'm not using the new > block-mq data path. So perhaps this is a MQ specific bug? (Checking... > hmm, my laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is > *also* using CONFIG_SCSI_MQ_DEFAULT=n.) And Kconfig recommends that > SCSI_MQ_DEFAULT be defaulted to y. CONFIG_SCSI_MQ_DEFAULT=y on both machines CONFIG_DM_MQ_DEFAULT is not set > This is why having people include their Kernel configs, and what devices > they use is so important. sorry it was an oversight. See attachments. server with kernel 4.19.2 # smartctl -i /dev/sdb smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.2-radeon] (local build) === START OF INFORMATION SECTION === Device Model: MKNSSDRE512GB Serial Number: MK15090210005157A LU WWN Device Id: 5 888914 10005157a Firmware Version: N1007C User Capacity: 512 110 190 592 bytes [512 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2 (minor revision not indicated) SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Nov 21 19:02:07 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled # smartctl -i /dev/sda smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.2-radeon] (local build) === START OF INFORMATION SECTION === Device Model: MKNSSDRE512GB Serial Number: MK150902100051556 LU WWN Device Id: 5 888914 100051556 Firmware Version: N1007C User Capacity: 512 110 190 592 bytes [512 GB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Device is: Not in smartctl database [for details use: -P showall] ATA Version is: ACS-2 (minor revision not indicated) SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Nov 21 19:04:01 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled server with 4.20.0-rc3 # smartctl -i /dev/sda === START OF INFORMATION SECTION === Model Family: HGST Travelstar 7K1000 Device Model: HGST HTS721010A9E630 Serial Number: JR10004M0BD4YF LU WWN Device Id: 5 000cca 8a8c52dba Firmware Version: JB0OA3J0 User Capacity: 1 000 204 886 016 bytes [1,00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 2.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 6 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Nov 21 19:02:21 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled I do periodic backup and the strange is dm-7 is only accessed read only and it just triggered ext4-iget failures. Created attachment 279571 [details]
dm-7 device w/ bad extra_isize errors
Created attachment 279573 [details]
dmesg w/ EXT4-fs error only
Created attachment 279575 [details]
.config of 4.19.2
Created attachment 279577 [details]
.config of 4.20.0-rc3
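(Since several of the affected setups put ext4 on LVM on top of md raid1, it may also be worth ruling out a diverged mirror before suspecting the file system. A sketch, assuming the array is called md0; the check is read-only but keeps the disks busy for a while:)

# cat /proc/mdstat
# mdadm --detail /dev/md0
# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat                        (wait until the check finishes)
# cat /sys/block/md0/md/mismatch_cnt      (should be zero, or at least small and stable)

A large or growing mismatch_cnt would point at the raid layer rather than at ext4.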
kernel 4.19.2 md0 (formerly md_d0) is a raid1 GPT bootable device md0 : active raid1 sda[0] sdb[1] 499976512 blocks super 1.2 [2/2] [UU] bitmap: 3/4 pages [12KB], 65536KB chunk md0p1 is the grub boot partition # fdisk -l /dev/md0 Disque /dev/md0 : 476,8 GiB, 511975948288 octets, 999953024 secteurs Unités : secteur de 1 × 512 = 512 octets Taille de secteur (logique / physique) : 512 octets / 512 octets taille d'E/S (minimale / optimale) : 512 octets / 512 octets Type d'étiquette de disque : gpt Identifiant de disque : 9A6C46CD-3B9C-4C64-AE3C-EDB416548134 Périphérique Début Fin Secteurs Taille Type /dev/md0p1 40 2088 2049 1M Amorçage BIOS /dev/md0p2 2096 264240 262145 128M Système de fichiers Linux /dev/md0p3 264248 2361400 2097153 1G Système de fichiers Linux /dev/md0p4 2361408 6555712 4194305 2G Système de fichiers Linux /dev/md0p5 6555720 999952984 993397265 473,7G Système de fichiers Linux # fdisk -l /dev/sdb Disque /dev/sdb : 477 GiB, 512110190592 octets, 1000215216 secteurs Modèle de disque : MKNSSDRE512GB Unités : secteur de 1 × 512 = 512 octets Taille de secteur (logique / physique) : 512 octets / 512 octets taille d'E/S (minimale / optimale) : 512 octets / 512 octets Type d'étiquette de disque : dos Identifiant de disque : 0x58e9a5ac Périphérique Amorçage Début Fin Secteurs Taille Id Type /dev/sdb1 8 1000215215 1000215208 477G fd RAID Linux autodétec idem for second computer same configuration but one HGST Travelstar 7K1000 disk attached and kernel cmdline has mdraid=forced If I missed something just let me know. Thx for your help. (In reply to Theodore Tso from comment #11) So perhaps this is a MQ specific bug? I checked old .config and I had CONFIG_SCSI_MQ_DEFAULT=y activated since version 4.1.6. MQ investigation will probably lead us to a dead end. Can someone try 4.19.3? I was working with another Ubuntu user who did *not* have see the problem with 4.19.0, but did see it with 4.19.1, but one of the differences in his config was: -# CONFIG_SCSI_MQ_DEFAULT is not set +CONFIG_SCSI_MQ_DEFAULT=y Furthermore, he tried 4.19.3 and after two hours of heavy I/O, he's no longer seeing problems. Based on the above observation, his theory is this commit may have fixed things, and it *is* blk-mq specific: commit 410306a0f2baa5d68970cdcf6763d79c16df5f23 Author: Ming Lei <ming.lei@redhat.com> Date: Wed Nov 14 16:25:51 2018 +0800 SCSI: fix queue cleanup race before queue initialization is done commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream. c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has already fixed this race, however the implied synchronize_rcu() in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused performance regression. Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") tried to quiesce queue for avoiding unnecessary synchronize_rcu() only when queue initialization is done, because it is usual to see lots of inexistent LUNs which need to be probed. However, turns out it isn't safe to quiesce queue only when queue initialization is done. Because when one SCSI command is completed, the user of sending command can be waken up immediately, then the scsi device may be removed, meantime the run queue in scsi_end_request() is still in-progress, so kernel panic can be caused. In Red Hat QE lab, there are several reports about this kind of kernel panic triggered during kernel booting. This patch tries to address the issue by grabing one queue usage counter during freeing one request and the following run queue. 
This commit just landed in mainline and is not in 4.20-rc2, so the theory that it was a blk-mq bug that was fixed by the above commit is consistent with all of the observations made to date. My kernels have this: 4.18.19.config:CONFIG_SCSI_MQ_DEFAULT=y 4.19.2.config:CONFIG_SCSI_MQ_DEFAULT=y 4.20-rc3.config:CONFIG_SCSI_MQ_DEFAULT=y I will ... if I can reboot safely. This time it affects / (i.e /dev/md0p3) What a nightmare. Jimmy, I don't blame you. Unfortunately, I don't have a clean repro of the problem because when I tried building a 4.20-rc2 kernel with CONFIG_SCSI_MQ_DEFAULT=y, and tried running gce-xfstests, no problems were detected. And I'm too chicken to try running a kernel version which does have the problem reported with CONFIG_SCSI_MQ_DEFAULT=y on my primary development laptop. :-) I will say that if you are seeing problems on a particular file system (e.g. /), by the time the kernel is reporting inconsistencies, the damage is already done. Yes, you might want to try doing a backup before you reboot, in case the system doesn't come back, but realistically speaking, the longer you keep running, the problems are more likely to compound. So from a personally very selfish perspective, I'm hoping someone who has already suffered corruption problems is willing to try either 4.19.3, or disabling CONFIG_SCSI_MQ_DEFAULT, or both, and report that they are no longer seeing problems, than my putting my own personal data at risk.... Maybe over T-day weekend, I'll try doing a full backup, and then try using 4.19.3 on my personal laptop --- but a "it works fine for me" report won't necessarily mean anything, since to date I'm not able to reproduce the problem on one of my systems. It'd be critical to know if 4.19.3 is still showing the issue with MQ being on. I'm going to try my luck at reproducing this issue as well, but given that there hasn't been a lot of noise about it, not sure I'll have too much luck. I've got a few suspects, so I'm also willing to spin a patch against 4.19.3 if folks are willing to give that a go. Ted, it seems to be affecting nvme as well, so there's really no escaping for you. But there has to be some other deciding factor here, or all the block testing would surely have caught this. Question is just what it is. What would be the most helpful is if someone who can reproduce this at well could run a bisect between 4.18 and 4.19 to figure out wtf is going on here. This commit: commit 410306a0f2baa5d68970cdcf6763d79c16df5f23 Author: Ming Lei <ming.lei@redhat.com> Date: Wed Nov 14 16:25:51 2018 +0800 SCSI: fix queue cleanup race before queue initialization is done might explain the SCSI issues seen, but the very first comment is from someone using nvme that the above patch has no bearing on that at all. It is, however, possible that some of the queue sync patches caused a blk-mq issue, and that is why nvme is affected as well, and why the above commit seems to fix things on the SCSI side. I'm going to attach a patch here for 4.19 and it'd be great if folks could try that. Created attachment 279579 [details]
4.19 patch
Created attachment 279581 [details]
4.20-rc3 patch
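(For anyone wanting to confirm whether a given stable kernel already carries the "SCSI: fix queue cleanup race before queue initialization is done" backport discussed above, something like the following should work against a stable git tree; the SHAs are the ones quoted in this thread:)

$ git log --oneline v4.19.2..v4.19.3 | grep -i 'queue cleanup race'
$ git log --oneline v4.19.3 --grep=8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a | head -1

The second form works because stable backports carry the upstream commit id in their commit message.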
If it's not this, another hint might be a discard change. Is everyone affected using discard? Sorry for the late response, but I have been trying to reproduce the problem with 4.19.2 for some while now. It seems that the problem I was experiencing only happens with 4.19.1 and 4.19.0, and it did so very frequently. I can at least confirm that I have CONFIG_SCSI_MQ_DEFAULT=y set in 4.19 but I didn't in 4.18. I hope that this is, at least for me, fixed for now. (In reply to Theodore Tso from comment #23) > So from a personally very selfish perspective, I'm hoping someone who has > already suffered corruption problems is willing to try either 4.19.3, or > disabling CONFIG_SCSI_MQ_DEFAULT, or both, and report that they are no > longer seeing problems, than my putting my own personal data at risk.... On a Ubuntu 18.10 machine I've upgraded to 4.19.0 and started getting these corruption errors. Yesterday I've upgraded to 4.19.3 and was still getting corrupted. 4.18 was fine. Unfortunately the latest corruption rendered the operating system unbootable. I'm going to try and fix it tonight and then will try to disable CONFIG_SCSI_MQ_DEFAULT and test. I'm in a somewhat fortunate position since the data that I care about lives on another disk with a different filesystem type, so the corruption on the root filesystem is just annoying and not really that dangerous. I'm also running 4.20-rc2 but does not experience any corruption for now (*crossing fingers*) (In reply to Theodore Tso from comment #11) > hmm, my laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is > *also* using CONFIG_SCSI_MQ_DEFAULT=n But I do have CONFIG_SCSI_MQ_DEFAULT: $ zgrep CONFIG_SCSI_MQ_DEFAULT=y /proc/config.gz CONFIG_SCSI_MQ_DEFAULT=y $ head /sys/block/dm-0/dm/use_blk_mq 1 (In reply to Jens Axboe from comment #25) > Ted, it seems to be affecting nvme as well And I do have nvme ssd: nvme0n1 259:0 0 953.9G 0 disk ├─nvme0n1p1 259:1 0 260M 0 part /boot └─nvme0n1p2 259:2 0 953.6G 0 part └─cryptroot 254:0 0 953.6G 0 crypt / And as you can see I have dm-crypt So it looks like that this is not that simple (IOW not every setup/env/hw affected). (In reply to Jens Axboe from comment #28) > If it's not this, another hint might be a discard change. Is everyone > affected using discard? And what a coincidence, before upgrading to 4.20-rc2 I enabled discard: # findmnt / TARGET SOURCE FSTYPE OPTIONS / /dev/mapper/cryptroot ext4 rw,relatime,discard # cat /proc/cmdline cryptdevice=...:cryptroot:allow-discards # cryptsetup status cryptroot /dev/mapper/cryptroot is active and is in use. ... flags: discards Plus I triggered fstrim manually at start: # systemctl status fstrim Nov 19 00:16:14 azat fstrim[23944]: /boot: 122.8 MiB (128716800 bytes) trimmed on /dev/nvme0n1p1 Nov 19 00:16:14 azat fstrim[23944]: /: 0 B (0 bytes) trimmed on /dev/mapper/cryptroot But what is interesting here is that it did not do any discard for the "/", hm (does ext4 did it for me at start?) (In reply to Jens Axboe from comment #28) > If it's not this, another hint might be a discard change. Is everyone > affected using discard? All 'cat /sys/block/dm-*/dm/use_blk_mq' are zero. Could MQ still be a suspect ? I reproduced the issue with 4.19.3 as well but without your patch. The difference is, it happens less often but still under heavy load (hours of work, mostly compilations and monitoring). 
The strange is, the affected disks are not obliged to be under load and on the next reboot fsck -f show some of them as clean despite they were declared with ext4_iget corruptions (tested during reboot from 4.19.2 to 4.19.3 kernel)! It's like some shared fs cache failure to me with unpleasant consequences. Disabling user_xattr seems to be more helpful with 4.20.0-rc3 anyway, no error since. Actually not under heavy load. Failures appear also on an plain old HDD device. For me, SSD discard is more a consequence as a reason but it's worse investigating it. I will try your patch ASAP. Thx I hit filesystem corruption with a desktop system running openSUSE Tumbleweed, kernel v4.19.3 and ext4 on top of a SATA SSD with scsi_mod.use_blk_mq=Y in /proc/cmdline. Discard was not enabled in /etc/fstab. After having enabled fsck.mode=force the following appeared in the system log after a reboot: /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower. IGNORED. > /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower.
> IGNORED
that is completely unrelated; I have seen that for years now on several machines and it is not cleaned up automatically, and wasting my time booting into rescue mode is not worth it given the low importance of "could"
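(For anyone unsure whether a message like the quoted one indicates real damage, a non-destructive check can be run from a rescue system while the file system is unmounted; /dev/sda2 is only an example:)

# e2fsck -fn /dev/sda2                 (read-only: report problems, change nothing)
# e2fsck -f -E fixes_only /dev/sda2    (skip purely cosmetic optimizations such as the one quoted above)

The -E fixes_only option is the one Ted mentions further down in this thread and needs a reasonably recent e2fsprogs.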
(In reply to Reindl Harald from comment #35) > > /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower. > > IGNORED > > that is completly unrelated, i see that for years now on several machines > and not cleaned up automatically and wasting my time to boot in rescure mode > is not worth given the low importance of "could" That's good to know. The reason I commented on this bug report and that I replied that I hit data corruption is because my workstation failed to boot due to fsck not being able to repair the file system automatically. I had to run fsck manually, answer a long list of scary questions and reboot. @Jen Axboe please read worth not worse in comment 33 I tried your patch for 4.19.3 and still get quite harmful ext4 errors like this one, EXT4-fs error (device dm-4): ext4_xattr_ibody_get:592: inode #4881425: comm rsync: corrupted in-inode xattr The filesystems were clean at boot time and the system was idle. tune2fs ends with FS Error count: 64 First error time: Fri Nov 23 00:19:25 2018 First error function: ext4_xattr_ibody_get First error line #: 592 First error inode #: 4881425 First error block #: 0 Last error time: Fri Nov 23 00:19:25 2018 Last error function: ext4_xattr_ibody_get Last error line #: 592 Last error inode #: 4881430 Last error block #: 0 MMP block number: 9255 MMP update interval: 5 If you are interested in its dumpe2fs result let me know. I don't use binary modules. (In reply to Jimmy.Jazz from comment #37) > @Jen Axboe > > please read worth not worse in comment 33 > > I tried your patch for 4.19.3 and still get quite harmful ext4 errors like > this one, > > EXT4-fs error (device dm-4): ext4_xattr_ibody_get:592: inode #4881425: comm > rsync: corrupted in-inode xattr > > The filesystems were clean at boot time and the system was idle. > > tune2fs ends with > FS Error count: 64 > First error time: Fri Nov 23 00:19:25 2018 > First error function: ext4_xattr_ibody_get > First error line #: 592 > First error inode #: 4881425 > First error block #: 0 > Last error time: Fri Nov 23 00:19:25 2018 > Last error function: ext4_xattr_ibody_get > Last error line #: 592 > Last error inode #: 4881430 > Last error block #: 0 > MMP block number: 9255 > MMP update interval: 5 > > If you are interested in its dumpe2fs result let me know. > > I don't use binary modules. Jimmy, what *I* would do if I were in your shoes is - run a kernel < 4.19, make sure the fs is OK and *backup important data* - compile 4.19.3 with CONFIG_SCSI_MQ_DEFAULT *not* set and see what happens. If you still get corruption, CONFIG_SCSI_MQ_DEFAULT probably is not the culprit. If not, it has at least something to do with it. It seems that CONFIG_SCSI_MQ_DEFAULT *not* set was the default <= 4.18.19. Others here obviously don't have problems with CONFIG_SCSI_MQ_DEFAULT=y and kernels >= 4.19, but you never know. I also experienced an ext4 file system corruption with 4.19.1, after resuming from suspend-to-ram. I've ran 4.12, 13, 14, 16, 17, and 18 on the same machine with near identical .config and never had a file system corruption. For all those kernels, I've had CONFIG_SCSI_MQ_DEFAULT=y. (In reply to AdamB from comment #39) > I also experienced an ext4 file system corruption with 4.19.1, after > resuming from suspend-to-ram. > > I've ran 4.12, 13, 14, 16, 17, and 18 on the same machine with near > identical .config and never had a file system corruption. > > For all those kernels, I've had CONFIG_SCSI_MQ_DEFAULT=y. I can say the same for CONFIG_SCSI_MQ_DEFAULT=n. 
But this is not exactly the same as running 4.19.x with CONFIG_SCSI_MQ_DEFAULT=n. As someone already pointed out: the best way to find out what's behind this is bisecting between 4.18 and 4.19 by someone affected by the problem. This is time consuming but in the end may also be the fastest way. A backup is IMO mandatory in this case. I have this bug with ubuntu 18.04 kernel 4.15.0-39, too. My Desktop: SSD (Samsung 840) with three partions: /boot : ext2 / : ext4 swap HDD1: one ext4 partition HDD2: luks encrypted, never mounted at boot time and not used when the error happens. No Raid-Stuff used. The problems only occurs on the ext4 part. from the ssd. Sometimes at booting there are some message like "could not access ata devices", there are some timeouts with ATA-commands. It retries several times until it gives up, I dont reach the busy box command line. Sometimes I reach the busybox command line but cant fix it there, because there is no fsck in busybox. I have to connect the ssh to a notebook via usb2sata Adapter, the two partions were recognised without problems and are in most cases automounted. If I force a fsck there are some orphaned inodes discovered and the fs is fixed. After this I can boot from this SSD in the desktop without problems until it happens again. The weired thing is that sometimes the SSD is not recognised after this and has this ATA-Timeouts above. Even turning the desktop completeley powerless (disconnecting from power socket and waiting some minutes then doing a cold boot) it gets stuck the same way. On the notbook were I fix the SSD is the same OS installed and there never occured this type of problem. Maybe I had only luck until now, I dont use the notebook very much. add: I (In reply to HB from comment #41) > I have to connect the ssh I mean the "SSD" not ssh. I do some bisecting on the linux-master git source. I'm at the kernel version 4.19.0-rc2-radeon-00922-gf48097d294-dirty currently. I hope that all the fscks didn't make my system immune to this issue :) Thanks a lot, Jimmy! That's what we need to make some progress here, in lieu of me and/or Ted being able to reproduce this issue. (In reply to Rainer Fiebig from comment #38) > what *I* would do if I were in your shoes is > > - run a kernel < 4.19, make sure the fs is OK and *backup important data* > - compile 4.19.3 with CONFIG_SCSI_MQ_DEFAULT *not* set > > and see what happens. This is what I did: recompile my 4.19.3 kernel with CONFIG_SCSI_MQ_DEFAULT=n. I've used the computer normally, ran some heavy read/write operations, rebooted a bunch of times and had no problems since then. I'm on Ubuntu 18.10 with an SSD, encrypted LUKS root partition. So Henrique, the only difference between the 4.19.3 kernel that worked and the one where you didn't see corruption was CONFIG_SCSI_MQ_DEFAULT? Can you diff the two configs to be sure? What can you tell us about the SSD? Is it a SATA-attached SSD, or NVMe-attached? What I can report is my personal development laptop is running 4.19.0 (plus the ext4 patches that landed in 4.20-rc1) with CONFIG_SCSI_MQ_DEFAULT=n? (Although as others have pointed out, that shouldn't matter since my SSD is NVMe-attached, and so it doesn't go through the SCSI stack.) My laptop runs Debian unstable, and uses an encrypted LUKS partition on top of which I use LVM. I do use regular suspend-to-ram (not suspend-to-idle, since that burns way too much power; there's a kernel BZ open on that issue) since it is a laptop. 
I have also run xfstest runs using 4.19.0, 4.19.1, 4.19.2, and 4.20-rc2 with CONFIG_SCSI_MQ_DEFAULT=n; it's using the gce-xfstests[1] test appliance which means I'm using virtio-SCSI on top of LVM, and it runs a large number of regression tests, many with heavy read/write loads, but none of the file systems is mounted for more than 5-6 minutes before we unmount and then run fsck on it. We do *not* do any suspend/resumes, although we do test the file system side of suspend/resume using the freeze and thaw ioctls. There were no unusual problems noticed. [1] https://thunk.org/gce-xfstests I have also run gce-xfstests on 4.20-rc2 with CONFIG_SCSI_MQ_DEFAULT=y, with the same configuration as above --- vrtio-scsi with LVM on top. There was nothing unusual that was detected there. Bart, in #34, was the only thing which e2fsck reported this: /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower. IGNORED. That's not a file system problem; it's a potential optimization which e2fsck detected, which would eliminate a random 4k read when running random read workload against that inode. If you don't want to see this, you can use e2fsck's "-E fixes_only" option. (In reply to Jimmy.Jazz from comment #43) > I do some bisecting on the linux-master git source. > I'm at the kernel version 4.19.0-rc2-radeon-00922-gf48097d294-dirty > currently. I hope that all the fscks didn't make my system immune to this > issue :) Great! Everyone who's had his share of bisecting knows to value you effort! ;) to be short, Release 4.19.0-rc2-radeon-00922-gf48097d294-dirty A: Nothing append at first when the computer is nearly idle. B: I mounted an usb SD media first ro (default) then rw. Transfer to it some big files (cp and tar) from two different xterms. Lot of errors, stick became for the kernel read only. Transfer failed. Umount then remount the filesystem without doing an fsck and restart the transfer again. Transfer ok. umount ok. I will next declare it git bisect bad and then reboot. please see attachements. Created attachment 279655 [details]
dmesg EXT4 errors
Created attachment 279657 [details]
dumpe2fs
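(For anyone else willing to run the bisect Jens asked for, the basic loop might look like this; config handling and the install step vary by distro, so treat it as a sketch only:)

$ git bisect start
$ git bisect good v4.18
$ git bisect bad v4.19
$ cp /boot/config-$(uname -r) .config && make olddefconfig
$ make -j$(nproc) && sudo make modules_install install
(reboot into the new kernel, run the workload that normally triggers the corruption, then mark the result and rebuild:)
$ git bisect good     (or: git bisect bad)
$ git bisect log > bisect.log

Keeping the same .config for every step avoids mixing a config change into the result.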
FYI, I need 2 patches for my initramfs to generate and IMO should not interfere. drivers/tty/vt/defkeymap.map to get the fr kbd mapping usr/Makefile due to shell evaluation --- usr/Makefile.orig 2017-02-19 23:34:00.000000000 +0100 +++ usr/Makefile 2017-02-22 23:44:24.554921038 +0100 @@ -43,7 +43,7 @@ targets := $(datafile_y) # do not try to update files included in initramfs -$(deps_initramfs): ; +$(deps_initramfs): ; $(deps_initramfs): klibcdirs # We rebuild initramfs_data.cpio if: @@ -52,5 +52,6 @@ # 3) If gen_init_cpio are newer than initramfs_data.cpio # 4) arguments to gen_initramfs.sh changes $(obj)/$(datafile_y): $(obj)/gen_init_cpio $(deps_initramfs) klibcdirs - $(Q)$(initramfs) -l $(ramfs-input) > $(obj)/$(datafile_d_y) + $(Q)$(initramfs) -l $(ramfs-input) | \ + sed '2,$$s/:/\\:/g' > $(obj)/$(datafile_d_y) $(call if_changed,initfs) [quote] I don't want to break T.Tso rules, but I remember, I have encountered a similar issue when I initially tried partitionable array with major 9. At that time I switched to major 254 as explain in comment 8 and the problem didn't come up since... until the recent kernel 4.19 with mdadm 4.1 and kernel devtmpfs that switched the metadevices to major 9. Also, why? A big mystery. [/quote] Now to the fact, I was able to reboot in rescue mode, I use the world service to illustrate the process. Nothing to do with Debian. # service mdraid start # service vg0 start # cd /dev/mapper # for i in *-*; do fsck /dev/mapper/$i; done All clean except sys-scm (f word) the usb stick is clean too. I need a terminal for interactive repairs so I write the beginning by hand. Inode 58577 has extra size (103) which is invalid Fix<y>? yes Timestamp(s) on inode 58577 beyond 2310-04-04 are likely pre-1970 + 9 others Inodes that were part of corrupted orphan linked list found. Fix<y>?yes + 3 others i_size is 139685221367808, shoud be 0. i_blocks is 32523, should be 0. + 22 others Pass 2: checking directory structure Inode 58577 (/git/toolkit.git/objects/e0) has invalid mode (0150) + 9 others Unattached inode 17013 Connect to /lost+found Inode 17013 ref count is 2, should be 1 + 35 others Inode 58586 (...) has invalid mode (0122) + 5 others [...] Unattached inode 262220 Connect to /lost+found<y>? yes Inode 262220 ref count is 2, should be 1. Fix<y>? yes Pass 5: Checking group summary information Block bitmap differences: -(9252--9255) -10490 -(10577--10578) -(16585--16589) -295391 -(682164--682165) Fix<y>? yes Free blocks count wrong for group #0 (2756, counted=2768). Fix<y>? Block bitmap differences: -(9252--9255) -10490 -(10577--10578) -(16585--16589) -295391 -(682164--682165) Fix<y>? yes Free blocks count wrong for group #0 (2756, counted=2768). Fix<y>? yes Free blocks count wrong for group #9 (2784, counted=2785). Fix<y>? yes Free blocks count wrong for group #20 (14702, counted=14704). Fix<y>? yes Free blocks count wrong (718736, counted=718751). Fix<y>? yes Inode bitmap differences: -58591 Fix<y>? yes Free inodes count wrong for group #7 (6283, counted=6284). Fix<y>? yes Directories count wrong for group #7 (1133, counted=1121). Fix<y>? yes Free inodes count wrong (322025, counted=322026). Fix<y>? yes scm: ***** FILE SYSTEM WAS MODIFIED ***** scm: 71190/393216 files (0.1% non-contiguous), 854113/1572864 blocks fsck from util-linux 2.32.1 e2fsck 1.44.4 (18-Aug-2018) service vg0 stop service mdraid stop ctrl-alt-del I was able to reboot with init 1 from grub then init 4 from tty1 as root with kernel 4.19.4. Filesystems were clean. 
exit from tty1 log in again as normal user under X su - root next bisect in action I must admit, it is time consuming. [...] > > I must admit, it is time consuming. You have been warned. ;) But in the end you will be rewarded with something like this: > git bisect good 1234xx56789yy is the first bad commit ... And honors and glory will rain down on you! OK, this may be a bit exaggerated. ;) (In reply to Theodore Tso from comment #46) > So Henrique, the only difference between the 4.19.3 kernel that worked and > the one where you didn't see corruption was CONFIG_SCSI_MQ_DEFAULT? Can > you diff the two configs to be sure? The bad news is that I've seemed to have made a mistake and there are more changes than that one. The other bad news is that I got another corruption even with CONFIG_SCSI_MQ_DEFAULT=n. > What can you tell us about the SSD? Is it a SATA-attached SSD, or > NVMe-attached? It's a SATA attached SSD. I'll attach more information (dmesg, lspci, kernel config, etc). Unfortunately fsck now tells me I've got a bad magic number in super-block, so I think I better start copying some stuff over to another disk before attempting anything else. I didn't make it with 4.18.0-radeon-07013-g54dbe75bbf-dirty because the radeon module gives me a black screen. With 4.18.0-radeon-03131-g0a957467c5-dirty, ext4 filesystems were stable but 2hours later an exception followed by a sudden reboot w/o warning. Next try, immediate reboot. Also bad too. During bzImage compilation, ld returned: ld.bfd: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' ld.bfd: warning: creating a DT_TEXTREL in object Is that something suspicious for you? FIK, I was stuck many times with the following message until I realized usr/.initramfs_data.cpio.xz.d file were not removed from the directory (sig). # make CALL scripts/checksyscalls.sh DESCEND objtool CHK include/generated/compile.h usr/Makefile:48: *** motifs de cible multiples. Arrêt. make: *** [Makefile:1041: usr] Error 2 FWIW: I've installed a defconfig-4.19.3 in a VirtualBox-VM. But our bug hasn't shown up so far. I have seen the problem on two of four systems running v4.19.4. All systems are System 1: MSI B450 TOMAHAWK (MS-7C02) Ryzen 2700X Drive 1: NVME (500GB) Drive 2: SATA HDD (WD4001FAEX-00MJRA0, 4TB) Problem seen on SATA HDD, with both 4.19.3 and 4.19.4 System 2: MSI B350M MORTAR (MS-7A37) Ryzen 1700X Drive 1: SSD Samsung SSD 840 PRO 250GB Drive 2: SSD Samsung SSD 840 EVO 250GB Problem seen on both drives, with both 4.19.3 and 4.19.4 System 3: Gigabyte AB350M-Gaming 3 Ryzen 1700X Drive 1: SSD Samsung SSD 840 PRO 250GB Drive 2: SSD M4-CT256M4SSD2 (250GB) Problem not seen (yet) System 4: MSI B350M MORTAR (MS-7A37) Ryzen 1700X Drive 1: NVME (500GB) Problem not seen (yet) Default configuration was CONFIG_SCSI_MQ_DEFAULT=y. I tried with CONFIG_SCSI_MQ_DEFAULT=n on system 2 (with 4.19.4) and hit the problem again almost immediately. Created attachment 279685 [details]
debug info
Created attachment 279687 [details]
Ext4 from 4.18 (tar.gz)
I'm pretty sure the problem is not in the ext4 changes between 4.18 and 4.19, since the changes are all quite innocuous (and if it was in the ext4 code, the regression testing really should have picked it up).
But just to rule things out, I've uploaded the contents of fs/ext4 from 4.18. I've verified it can be transplanted on top of 4.19 kernel. Could the people who are experiencing problems with 4.19 try building a kernel with the 4.18 fs/ext4 directory? If you still see problems, then the problem has to be elsewhere. If you don't, then we can take a closer look at the ext4 changes (although I'd then be really puzzled why it's only showing up for some folks, but not others).
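(One way to do the transplant Ted describes, from a 4.19.x tree that is also a git checkout; using git instead of the attached tarball is only an assumption, and unpacking "Ext4 from 4.18 (tar.gz)" over fs/ext4 should be equivalent:)

$ cd linux
$ git checkout v4.19.5
$ git checkout v4.18 -- fs/ext4        (replace only the ext4 directory with the 4.18 version)
$ make olddefconfig && make -j$(nproc)
$ sudo make modules_install install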
Henrique -- what is dm-0? How is it configured? And are you using discard (either the mount option, or fstrim)? Thanks!! (In reply to Theodore Tso from comment #60) > Henrique -- what is dm-0? How is it configured? And are you using discard > (either the mount option, or fstrim)? Thanks!! dm-o is a LUKS encrypted partition that I use as /. I have fstrim running weekly with "fstrim -av" (Ubuntu's default). If you don't mind, I'll continue bisecting. But generating a kernel becomes harder with the genuine kernel. Three times a raw I was unable to compile the kernel. It fails with, arch/x86/entry/vdso/vclock_gettime-x32.o:vclock_gettime.c:fonction__vdso_gettimeofday : erreur : débordement de relocalisation : référence à « vvar_vsyscall_gtod_data » If it fails again, I will need to patch it: --- arch/x86/entry/vdso/Makefile~ 2016-10-02 23:24:33.000000000 +0000 +++ arch/x86/entry/vdso/Makefile 2016-11-16 09:35:13.406216597 +0000 @@ -97,6 +97,7 @@ CPPFLAGS_vdsox32.lds = $(CPPFLAGS_vdso.lds) VDSO_LDFLAGS_vdsox32.lds = -Wl,-m,elf32_x86_64 \ + -fuse-ld=bfd \ -Wl,-soname=linux-vdso.so.1 \ -Wl,-z,max-page-size=4096 \ -Wl,-z,common-page-size=4096 ld gold is my default. And sorry, I should have patched l1tf_vmx_mitigation too. My mistake. diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 27830880e7a7..cb4a16292aa7 100644 --- arch/x86/kernel/cpu/bugs.c +++ arch/x86/kernel/cpu/bugs.c @@ -664,10 +664,9 @@ void x86_spec_ctrl_setup_ap(void) enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH; #if IS_ENABLED(CONFIG_KVM_INTEL) EXPORT_SYMBOL_GPL(l1tf_mitigation); - +#endif enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation); -#endif static void __init l1tf_select_mitigation(void) { Neverless the ext4 issue as it seems to be doesn't make sens. I can compile packages during the test to maintain the cpu's activity on top to ease reproducing the issue. Each time I do a reboot, I do a fsck on the ext4 partitions (in both rescue mode and normal init process) and it's like for some partitions e2fsck is unable to handle (in any undetermined circumstances) 'Structure needs cleaning' issue (remember my remark about fsck -D). If that's confirmed, a corrupt fs could still be corrupt on the next reboot and misguide us. In that case, Jens Axboe 4.19.4 patch does its work. I'm bisecting the kernel on a 4.19.4 patched kernel version. The only fs that's stay corrupt after each reboot is my backup partition (sig). Could someone investigate in that direction please ? I'm using e2fsprogs 1.44.4 package. (In reply to Jason Gambrel from comment #3) > No Raid. 500gb SSD. 16gb ram with a 4gb swap file (not a swap partition). FWIW, I was running into ext4 metadata corruption every few days with 4.19 using swap files (on the ext4 / on LVM on LUKS). On a hunch, switched to a swap partition on LVM on LUKS two weeks ago, and haven't run into it since. Swap files were working fine with pre-4.19 kernels. In case it matters, I run fstrim in a weekly cronjob, with discard enabled in /etc/lvm/lvm.conf and /etc/crypttab. (In reply to Jimmy.Jazz from comment #62) > In that case, Jens Axboe 4.19.4 patch does its work. I'm bisecting the > kernel on a 4.19.4 patched kernel version. The only fs that's stay corrupt > after each reboot is my backup partition (sig). > Could someone investigate in that direction please ? How certain are you that my 4.19 patch fixes the issue completely for you? 
If 100%, can you also try with 4.19.4 + just the first hunk of that patch? In other words, only apply the part to block/blk-core.c, not the one to block/blk-mq.c Thanks! Hi. My distro is gentoo testing, I also use 4 partitions raid1 ,are two discs of 1T WD black that have never failed. These raid1 partitions use mdadm with metadata 0.90. $ lsblk /dev/md* RM RO MODEL NAME LABEL FSTYPE MOUNTPOINT SIZE PHY-SEC LOG-SEC MODE 0 0 md0 GentooBoot ext4 128M 512 512 brw-rw---- 0 0 md1 GentooSwap swap [SWAP] 4G 512 512 brw-rw---- 0 0 md2 GentooRaiz ext4 / 50G 512 512 brw-rw---- 0 0 md3 GentooHome ext4 /home 877,4G 512 512 brw-rw---- Effectively from 4.19.0 I started to have problems with the boot, the system always closed perfectly unmount all the partitions, but when booting the next time I fall in fsck and end in recovery console, ignoring these errors, restart again and I choose the kernel 4.18.20 and it does not fall in fsck, it also does not detect any error in the ext4 partitions. Sometimes these errors trigger the resynchronization of the partition that fsck detects false positives, I see it with $ cat /proc/mdstat For now I will continue using 4.18.20, the faults I have been doing since 4.19.0 4.19.1 4.19.2 4.19.3 4.19.4 and 4.19.5, given that this is something from the 4.19.x branch $ uname -a Linux pc-user 4.18.20-gentoo #1 SMP PREEMPT Sat Nov 24 14:39:41 $ eselect kernel list Available kernel symlink targets: [1] linux-4.18.20-gentoo [2] linux-4.19.4-gentoo [3] linux-4.19.5-gentoo * Regards (In reply to Jens Axboe from comment #64) > How certain are you that my 4.19 patch fixes the issue completely for you? Without your patch the failure was mostly systematic in the time. Synchronization mechanism is not trivial anyway. But statically there is hope. > If 100%, can you also try with 4.19.4 + just the first hunk of that patch? > In other words, only apply the part to block/blk-core.c, not the one to > block/blk-mq.c I understand, probably syncronize_rcu was a bit too much :). Let me 24h please. I am also experiencing ext4 corruptions with 4.19.x kernels. One way to trigger this bug that works almost every time on my system is to backup the whole FS with BorgBackup using this command: nice -19 ionice -c3 borg create -v --stats --list --filter=AME --one-file-system --exclude-caches --compression zstd --progress my-server:/borg-backup::'{hostname}-{now:%Y-%m-%d_%H:%M}' / Here are kernel messages: [ 916.082499] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318098: comm borg: bad extra_isize 35466 (inode size 256) [ 916.093908] Aborting journal on device sda1-8. [ 916.096417] EXT4-fs (sda1): Remounting filesystem read-only [ 916.096799] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318101: comm borg: bad extra_isize 35466 (inode size 256) [ 916.101544] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318103: comm borg: bad extra_isize 35466 (inode size 256) [ 916.106531] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318107: comm borg: bad extra_isize 35466 (inode size 256) [ 916.111039] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318110: comm borg: bad extra_isize 35466 (inode size 256) [ 916.115763] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318112: comm borg: bad extra_isize 35466 (inode size 256) If there is some interest, I can provide more details, but in another bug report since this one is already loaded with attached files. 
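(For reporters who do not use BorgBackup, a whole-filesystem read pass in the same spirit can be approximated with plain tar; this is only an untested approximation of the workload above, not a confirmed reproducer. The pipe is there because GNU tar may skip reading file data when writing straight to /dev/null:)

$ sudo nice -19 ionice -c3 tar --one-file-system -C / -cf - . | wc -c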
# uname -a
Linux seal 4.18.0-rc1-radeon-00048-ge1333462e3-dirty #36 SMP PREEMPT Wed Nov 28 18:30:01 CET 2018 x86_64 AMD A10-5800K APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

I finally have a promising kernel running: it compiles, doesn't crash and takes care of my filesystems. 4.18.0-rc1-radeon-00048-ge1333462e3-dirty could be a winner. This time the backup file system dm-4 could be efficiently cured and dirvish has done its work as expected. I could use it to compile the 4.19.4 kernel as J.Axboe asked me to. @T.Tso, if you still have an interest in the dmesg and fsck (quite impressive) output with that kernel version, let me know. Actually, the interaction between e2fsck 1.44.4 and kernel 4.18 differs from 4.19. A dmesg excerpt:

[12421.017028] EXT4-fs warning (device dm-4): kmmpd:191: kmmpd being stopped since filesystem has been remounted as readonly.
[12434.457445] EXT4-fs warning (device dm-4): ext4_multi_mount_protect:325: MMP interval 42 higher than expected, please wait.

The warning didn't show up with kernel 4.19, and remount is slower. No ext4 errors to see.

# git bisect log
git bisect start
# good: [94710cac0ef4ee177a63b5227664b38c95bbf703] Linux 4.18
git bisect good 94710cac0ef4ee177a63b5227664b38c95bbf703
# bad: [9ff01193a20d391e8dbce4403dd5ef87c7eaaca6] Linux 4.20-rc3
git bisect bad 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6
# bad: [9ff01193a20d391e8dbce4403dd5ef87c7eaaca6] Linux 4.20-rc3
git bisect bad 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6
# bad: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
git bisect bad 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d
# bad: [f48097d294d6f76a38bf1a1cb579aa99ede44297] dt-bindings: display: renesas: du: Document r8a77990 bindings
git bisect bad f48097d294d6f76a38bf1a1cb579aa99ede44297
# bad: [f48097d294d6f76a38bf1a1cb579aa99ede44297] dt-bindings: display: renesas: du: Document r8a77990 bindings
git bisect bad f48097d294d6f76a38bf1a1cb579aa99ede44297
# bad: [54dbe75bbf1e189982516de179147208e90b5e45] Merge tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm
git bisect bad 54dbe75bbf1e189982516de179147208e90b5e45
# bad: [0a957467c5fd46142bc9c52758ffc552d4c5e2f7] x86: i8259: Add missing include file
git bisect bad 0a957467c5fd46142bc9c52758ffc552d4c5e2f7
# bad: [958f338e96f874a0d29442396d6adf9c1e17aa2d] Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 958f338e96f874a0d29442396d6adf9c1e17aa2d
# bad: [85a0b791bc17f7a49280b33e2905d109c062a47b] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad 85a0b791bc17f7a49280b33e2905d109c062a47b
# bad: [8603596a327c978534f5c45db135e6c36b4b1425] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 8603596a327c978534f5c45db135e6c36b4b1425
# bad: [2406fb8d94fb17fee3ace0c09427c08825eacb16] Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 2406fb8d94fb17fee3ace0c09427c08825eacb16
# bad: [cd23ac8ddb7be993f88bee893b89a8b4971c3651] rcu: Add comment to the last sleep in the rcu tasks loop
git bisect bad cd23ac8ddb7be993f88bee893b89a8b4971c3651
Otherwise it's possible that a previous bad kernel had left the file system corrupted, and so a particular kernel stumbled on a corruption, but it wasn't actually *caused* by that kernel. The reason why I'm asking these question is that based on your bisect, it would *appear* that the problem was introduced by an RCU change. If you look at the output of "git log --oneline e1333462e3..cd23ac8ddb7" all of the changes are RCU related. That's a bit surprising, since given that only some users are seeing this problem. If there was a regression was introduced in the RCU subsystem, I would have expected a large number of people would have been complaining, with many more bugs than just in ext4. And there is some evidence that your file system has gotten corrupted. The warnings you report here: [12421.017028] EXT4-fs warning (device dm-4): kmmpd:191: kmmpd being stopped since filesystem has been remounted as readonly. [12434.457445] EXT4-fs warning (device dm-4): ext4_multi_mount_protect:325: MMP interval 42 higher than expected, please wait. Are caused by the MMP feature being enabled on your kernel. It's not enabled by default, and unless you have relatively exotic hardware (e.g., dual-attached SCSI disks that can be reached by two servers for failover) there is no reason to turn on the MMP feature. You can disable it via: "tune2fs -O ^mmp /dev/dm-4". (And you can enable it via "tune2fs -O mmp /dev/dm-4".) So apparently while you were running your tests, the superblock had at least one bit (the MMP feature bit) flipped by a rogue kernel. (In reply to Theodore Tso from comment #59) > > But just to rule things out, I've uploaded the contents of fs/ext4 from > 4.18. I've verified it can be transplanted on top of 4.19 kernel. Could > the people who are experiencing problems with 4.19 try building a kernel > with the 4.18 fs/ext4 directory? If you still see problems, then the > problem has to be elsewhere. If you don't, then we can take a closer look > at the ext4 changes (although I'd then be really puzzled why it's only > showing up for some folks, but not others). > I copied /fs/ext4 from tree 4.18.20 to tree 4.19.5 and compile everything from scratch the tree 4.19.5. Well, now we'll have to wait and cross our fingers every time I restart the PC. So far I had no problems, if they appear I would be posted again with data. Regarding my configuration of CONFIG_SCSI_MQ_DEFAULT it was always enabled for eons. # cat /boot/config-4.18.20-gentoo |grep CONFIG_SCSI_MQ_DEFAULT= CONFIG_SCSI_MQ_DEFAULT=y # cat /boot/config-4.19.4-gentoo |grep CONFIG_SCSI_MQ_DEFAULT= CONFIG_SCSI_MQ_DEFAULT=y # cat /boot/config-4.19.5-gentoo |grep CONFIG_SCSI_MQ_DEFAULT= CONFIG_SCSI_MQ_DEFAULT=y # eix -Ic e2fsprogs [I] sys-fs/e2fsprogs (1.44.4@07/11/18): Standard EXT2/EXT3/EXT4 filesystem utilities [I] sys-libs/e2fsprogs-libs (1.44.4@06/11/18): e2fsprogs libraries (common error and subsystem) Found 2 matches Regards If it helps, I do NOT see this bug and I've run all 4.18.y and 4.19.y kernels: CONFIG_SCSI_MQ_DEFAULT=y CONFIG_MQ_IOSCHED_DEADLINE=y rootfs on RAID-0 on 2 SSDs: cat /proc/mdstat Personalities : [raid0] md127 : active raid0 sdb1[1] sda3[0] 499341824 blocks super 1.2 256k chunks /dev/md127 on / type ext4 (rw,noatime,discard,stripe=128) (In reply to Theodore Tso from comment #69) I didn't trust the kernel enough to let it work all the night without close observation (i.e I need some rest). 
In comparison with the latest tests, I feel certain the kernel is good after one day with parallel compilations running. That's why I postponed J.Axboe's request. Actually, I'm working with 4.18 e1333462e3 and after three clean reboots the disks stayed clean. Dirvish is running today and nothing bad has happened. I can say 4.18 e1333462e3 is good.

$ uptime
17:12:44 up 3:23, 6 users, load average: 10,54, 10,99, 10,13

Also, I didn't change my .config except when the commit under test asked for it.

> how quickly do your other git bisect bad build fail ?

The builds failed after I put load on the kernel or when I backed up the system (dirvish/rsync). When activity was low I didn't observe anything suspicious. Also, the server is not just sitting idle.

To summarize:
- I jumped to 4.19 because there was no improvement with 4.20-rc3... and I feared for my data.
- From f48097d2 to 54dbe75b the radeon module didn't work (i.e. no display).
- 0a957467c5 crashed. Next try, it crashed immediately during boot. (comment 55)
- 958f338e: I missed the 'l1tf' patch (comment 62)
- From 958f338e to cd23ac8d: I missed the 'vdso' patch (comment 62)
- e1333462e3: I applied both the 'l1tf' and 'vdso' patches

With commit e1333462e3, the dm-4 partition could be cleaned efficiently (see attachment).

> And I assume you have run a forced fsck

I have run fsck /dev/dm-XX with 4.18 commit e1333462e3, first in rescue mode and then from an init script during normal boot. It was not necessary to force an fsck, unlike with 4.19 and higher releases.

> a previous bad kernel had left the file system corrupted

I thought about that too (comment 62, second paragraph). In that case, why is only 4.18 + e2fsprogs able to clean the partitions, and not more recent kernels? Isn't e2fsprogs compatible with the 4.19 branch?

> git log --oneline e1333462e3..cd23ac8ddb7

I'm using gcc (Gentoo 8.2.0-r4 p1.5) 8.2.0 and use LD=ld.bfd. My linker is gold by default. Sadly, I didn't find a way to compile it with clang.

> I would have expected a large number of people.

I understand. But race conditions are not always trivial.

> your file system has gotten corrupted.

dm-4 is marked read-only until a backup is performed. I added mmp (temporarily) to the file systems because I thought I had a multiple-remount issue at first. The report was intended to draw your attention to the following: remount,rw and remount,ro are really slow with 4.18 commit e1333462e3, and the warning has never appeared in that way on other builds. That was not observed with vanilla 4.18.x. Please, I didn't intend to misguide you. Just consider the warning a false positive. If the warning pointed to a rogue kernel, then it would be kernel 4.18 (a contradiction).

My computers are on a UPS and I do an fsck on every reboot, but force it again only when an error has been detected. Anyway, corruptions that appear and disappear all of a sudden on the majority of the file systems with such a frequency are quite remarkable. The file systems are now clean over reboots. I propose to test whether the 4.19.5 kernel stops showing corruptions. If they stop, it still opens a new question: why was fsck missing some file system corruptions?

Created attachment 279739 [details]
fsck output kernel 4.18
the fsck has been done in rescue mode w/ 4.18
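For anyone who wants to repeat the fs/ext4 transplant test suggested in comment 59 (building a 4.19.x tree with the 4.18.x ext4 code), here is a minimal sketch; it assumes the two source trees are unpacked side by side, and the config path is only an example:

# replace the 4.19.5 ext4 code with the 4.18.20 version and rebuild
rm -rf linux-4.19.5/fs/ext4
cp -a linux-4.18.20/fs/ext4 linux-4.19.5/fs/
cd linux-4.19.5
cp /boot/config-4.18.20-gentoo .config   # reuse a known-good config (example path)
make olddefconfig                        # accept defaults for any new options
make -j"$(nproc)" && make modules_install install

As has been pointed out repeatedly in this thread, run a forced fsck from a known-good kernel before judging the result, so a pre-existing corruption is not blamed on the transplanted build.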
(In reply to Laurent Bonnaud from comment #67)
> I am also experiencing ext4 corruptions with 4.19.x kernels.
>
> One way to trigger this bug that works almost every time on my system is to
> backup the whole FS with BorgBackup using this command:
>

Ouch, me too. I've already been through two hard drives and a new SATA controller. I was just about to resign myself to replacing the whole PC. My system is an older AMD Phenom, with absolutely nothing fancy going on. Boring spinning disks, no RAID, and exactly the symptom above. After upgrading to 4.19.0 everything was fine for a week, and then Borg started reporting these errors. If I boot to a rescue CD and fsck, things go back to "normal," but then after a few more days I get corruption again. IIRC I skipped 4.19.1 but had the same problem with 4.19.2, and now again on 4.19.3.

(In reply to Jens Axboe from comment #64)
> only apply the part to block/blk-core.c

@T.Tso and J.Axboe
e1333462e3 was not able to compile the 4.19.5 kernel. Long story: gcc started complaining about a missing elfutils package (it was already installed). I also fell into an old CONFIG_UNWINDER_ORC bug ("Cannot generate ORC metadata"). Compilations began to fail with a "cannot make executable" error. As unbelievable as it is, the bug was reported recently (https://lkml.org/lkml/2018/11/5/108). I'm using dev-libs/elfutils-0.175 and the kernel isn't affected by https://bugs.gentoo.org/671760

The good news: the 4.19.4 kernel with only the block/blk-core.c part of your patch applied was able to compile 4.19.5. It doesn't show any sign of ext4 corruption. I'm waiting for the next backup tomorrow.

(In reply to Jimmy.Jazz from comment #75)
> I'm waiting for the next backup tomorrow.

@J.Axboe
No need to wait. The ext4 error resurfaced, on dm-8 this time. The block/blk-core.c patch doesn't correct the issue.

[ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666: comm emerge: bad extended attribute block 1

(In reply to Jimmy.Jazz from comment #76)
> (In reply to Jimmy.Jazz from comment #75)
>
> > I'm waiting for the next backup tomorrow.
>
> @J.Axboe
> No need to wait. ext4 error resurfaced on dm-8 this time. block/blk-core.c
> patch doesn't correct the issue.
>
> [ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666:
> comm emerge: bad extended attribute block 1

Are you still confident the full patch works? It's interesting since that has RCU, and the other changes point in that direction, too.

Well, guys, this seems to pass the test: after several reboots and several hours of use, I had no more corruptions of my four mdadm raid1 partitions. The solution was to delete the ext4 folder from my 4.19.5 kernel, copy in the ext4 folder from my previous 4.18.20 kernel, and recompile the 4.19.5 tree.

$ uname -a
Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

I compared the fs/ext4 folders of the 4.18.20 and 4.19.5 trees with meld and there are several modifications; unfortunately, judging them exceeds my knowledge. At least on my side I can affirm that the problem is gone; we will see what happens with later patches.

(In reply to Néstor A. Marchesini from comment #78)
> Well, guys, this seems to pass the test, after several reboots every several
> hours of use, I had no more corruptions of my four partitions mdadm raid1.
> The solution was to delete the ext4 folder from my kernel 4.19.5 and copy
> the ext4 folder from my previous kernel 4.18.20 and recompiling the tree
> 4.19.5.
> > $ uname -a > Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64 > AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux > > I was comparing both folders /fs/ext4 of the 4.18.20 and 4.19.5 trees with > meld and there are several modifications, unfortunately it exceeds my > knowledge. > At least on this side I affirm that the problem is gone, we will see it > happen with later patches. If this is the case, perhaps you could bisect fs/ext4 between tags v4.18 and v4.19? $ git bisect start v4.19 v4.18 -- fs/ext4 (In reply to Néstor A. Marchesini from comment #78) > Well, guys, this seems to pass the test, after several reboots every several > hours of use, I had no more corruptions of my four partitions mdadm raid1. > The solution was to delete the ext4 folder from my kernel 4.19.5 and copy > the ext4 folder from my previous kernel 4.18.20 and recompiling the tree > 4.19.5. > > $ uname -a > Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64 > AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux > > I was comparing both folders /fs/ext4 of the 4.18.20 and 4.19.5 trees with > meld and there are several modifications, unfortunately it exceeds my > knowledge. > At least on this side I affirm that the problem is gone, we will see it > happen with later patches. If you can bisect it as suggested in comment 79, please mind what Ted Tso has said in comment 69, para. 2. So, after you have hit a bad kernel, make sure that your fs is OK and do the next step (compiling) with a known-as-good-kernel (4.18.20). Otherwise you might get false negatives (wrong bads). (In reply to Jens Axboe from comment #77) > (In reply to Jimmy.Jazz from comment #76) > > (In reply to Jimmy.Jazz from comment #75) > > > > > I'm waiting for the next backup tomorrow. > > > > @J.Axboe > > No need to wait. ext4 error resurfaced on dm-8 this time. block/blk-core.c > > patch doesn't correct the issue. > > > > [ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666: > > comm emerge: bad extended attribute block 1 > > Are you still confident the full patch works? It's interesting since that > has RCU, and the other changes point in that direction, too. It looks like the problem may be caused by changes in fs/ext4 (see comment 78). But I'm wondering why this only affects some (quite a few, though) and not all. Like others, I'm running 4.19.5 without any problem here, it's just nice. I fear the bug might be caused by some interaction between something new in fs/ext4 and something new elsewhere... Sounds unlikely, but it's possible. Since 4.18 ext4 seems to work on 4.19 kernel, maybe it's worth trying 4.19 ext4 on 4.18 kernel (before a bisect), just to make sure the bisect won't lead us down a false trail? (In reply to Hao Wei Tee from comment #82) > I fear the bug might be caused by some interaction between something new in > fs/ext4 and something new elsewhere... Sounds unlikely, but it's possible. > > Since 4.18 ext4 seems to work on 4.19 kernel, maybe it's worth trying 4.19 > ext4 on 4.18 kernel (before a bisect), just to make sure the bisect won't > lead us down a false trail? Interesting idea. But what works in one direction might not necessarily work the other way round. Personally, I'd rather like to be on the safe side here. So before doing this it might be wise to here what Ted Tso thinks about it, just IMO. And I don't think that bisecting just fs/ext4 would be misleading. 
If we find a bad commit there, Ted and others will look at it anyway and will see whether this alone explains the problems or whether an interaction with something else would be necessary to make sense of it. Hi, thanks for investigating the issue. I "costed" my some inodes on my ext4 rootfs , rMBP, SSD, dm-crypt disk. It appeared on 4.19.1 my case. I just wanted to add that I run btrfs / dmcrypt /samessd on the /home and that one is not affected by that issue as far as I can tell. rgds, j (In reply to Rainer Fiebig from comment #83) > Interesting idea. But what works in one direction might not necessarily work > the other way round. Exactly my point. We know that (4.19 ext4 and kernel is broken), (4.18 ext4 and kernel is working), and (4.18 ext4 and 4.19 kernel is working). If (4.19 ext4 and 4.18 kernel) is broken, then _most likely_ the bug is caused by something that changed in v4.18..v4.19. If (4.19 ext4 and 4.18 kernel) *works*, then either the bug is in something else that changed, or there is an interaction between two changes that happened in v4.18..v4.19. In any case, bisecting v4.18..v4.19 will probably give us a clue. (In reply to Hao Wei Tee from comment #85) > (In reply to Rainer Fiebig from comment #83) > > Interesting idea. But what works in one direction might not necessarily > work > > the other way round. > > Exactly my point. We know that (4.19 ext4 and kernel is broken), (4.18 ext4 > and kernel is working), and (4.18 ext4 and 4.19 kernel is working). > > If (4.19 ext4 and 4.18 kernel) is broken, then _most likely_ the bug is > caused by something that changed in v4.18..v4.19. If (4.19 ext4 and 4.18 > kernel) *works*, then either the bug is in something else that changed, or > there is an interaction between two changes that happened in v4.18..v4.19. > Sure, I understand this. I would just shy away from recommending this to others without a nod from higher powers. But of course it's up to Nestor whether he wants to try this or not. > In any case, bisecting v4.18..v4.19 will probably give us a clue. Let's hope for the best. Perhaps bisecting fs/ext4 will provide enough of a clue already and spare the poor bisecter to have to bisect the whole beast. ;) Regression testing could be carried out in a VM running on top of a ramdisk (e.g. tmpfs) to speed up the process. I guess someone with a decent amount of persistence and spare time could do that and test each individual commit between 4.18 and 4.19, however that doesn't guarantee success since the bug might be hardware related and not reproducible in a virtual environment. Or it might require obscene amounts of RAM/disk space which would be difficult, if not impossible to reproduce in a VM. I for one decided to stay on 4.18.x and not upgrade to any more recent kernels until the regression is identified and dealt with. Maybe one day someone will become truly invested in the kernel development process and we'll have proper QA/QC/unit testing/regression testing/fuzzying, so that individuals won't have to sacrifice their data and time because kernel developers are mostly busy with adding new features and usually not really concerned with performance, security and stability of their code unless they are pointed at such issues. (In reply to Jens Axboe from comment #77) > Are you still confident the full patch works? It's interesting since that > has RCU, and the other changes point in that direction, too. 4.19.4 full patched is stable. I'm just puzzled in its capability to clean a failed file system with sys-fs/e2fsprogs-1.44.4. 
To all,

Please attach a large number of read-write mounted file systems to your test system. That will increase the probability of the failure. My experience is that the issue doesn't affect a specific mountpoint over and over, but rather a random one each time. FYI, I didn't have any issue with any of the tmpfs filesystems installed. You should take that into consideration when creating your VM test environment. nilfs is stable.

(In reply to Artem S. Tashkinov from comment #87)
> Maybe one day someone will become truly invested in the kernel development
> process and we'll have proper QA/QC/unit testing/regression
> testing/fuzzying

What we have now is not proper? syzkaller bot, Linux Test Project, kernelci.org, xfstests, and more that I don't know of. Probably more than any other OS.

I think it's fair to say Linux has by far more configuration options than any other kernel out there. It's not feasible to test every single possible combination. Things will slip through the cracks, especially bugs like this one where clearly there is something wrong, but not everyone is able to reproduce it at all. Automated tests are going to miss bugs like this. We're not doing things worse than anyone else. Apple's APFS had major issues. Microsoft just had a big problem with their Windows 10 1809 rollout.

Anyway, I remember you from another post you made to LKML complaining about Linux. You really don't like the way Linux is developed. Why do you still use it? I digress.

For me, the problem started with the release of 4.19.0, and looking at the commits in the 4.19.0 tree, I see that many things in ext4 have been changed ... very many, I would say. If you search for ext4 within the list of commits you will find several, and with very important changes.

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.19

There are several massive ones; one of the most important is:

https://github.com/torvalds/linux/commit/c140f8b072d16595c83d4d16a05693e72d9b1973

This weekend I will try with git bisect, but it will be a very time-consuming task due to the large number of ext4 commits. I'm still using 4.19.5 with the ext4 folder of 4.18.20. I have not had problems so far.

(In reply to Néstor A. Marchesini from comment #91)
> For me, the problem started with the release of 4.19.0, and looking at the
> commits of the tree 4.19.0, I see that many things of ext4 have been changed
> ... very many I would say.
> If you search with ext4 within the list of commits you will find several and
> with very important changes.

There are only 32 new commits in fs/ext4 in v4.19 from v4.18. See [1], count until commit "ext4: fix check to prevent initializing reserved inodes".

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/fs/ext4?id=v4.19

> There are several massive, one of the most important is:
>
> https://github.com/torvalds/linux/commit/c140f8b072d16595c83d4d16a05693e72d9b1973

This isn't in v4.19? It only got pulled in the v4.20 merge window.

Most of the ext4 patches in v4.19 have been backported to v4.18.y. Since v4.18.20 is reported to be stable, it is quite likely that the problem lies with one or more of the patches which have _not_ been backported. This would be one of the following patches.
ext4: close race between direct IO and ext4_break_layouts()
ext4: add nonstring annotations to ext4.h
ext4: readpages() should submit IO as read-ahead
dax: remove VM_MIXEDMAP for fsdax and device dax
ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
ext4: improve code readability in ext4_iget()
ext4: handle layout changes to pinned DAX mappings
ext4: use swap macro in mext_page_double_lock
ext4: check allocation failure when duplicating "data" in ext4_remount()
ext4: fix warning message in ext4_enable_quotas()
ext4: super: extend timestamps to 40 bits
ext4: use timespec64 for all inode times
ext4: use ktime_get_real_seconds for i_dtime
ext4: use 64-bit timestamps for mmp_time
block: Define and use STAT_READ and STAT_WRITE

In terms of ext4 changes, it'd be interesting to just revert this one:

commit ac22b46a0b65dbeccbf4d458db95687e825bde90
Author: Jens Axboe <axboe@kernel.dk>
Date: Fri Aug 17 15:45:42 2018 -0700

    ext4: readpages() should submit IO as read-ahead

as that guy is generally just not trustworthy. In all seriousness, though, it shouldn't cause issues (or I would not have done it), and we already do this for readpages in general, but I guess we could have an older bug in ext4 that depends deeply on read-ahead NOT failing. Not sure how likely that is, Ted can probably comment on that.

But it's a trivial revert, and it could potentially be implicated.

BTW, if that patch is to blame, then the bug is elsewhere in ext4, as there should be no way that read-ahead failing could cause corruption.

(In reply to Jens Axboe from comment #94)
> In terms of ext4 changes, it'd be interesting to just revert this one:
>
> commit ac22b46a0b65dbeccbf4d458db95687e825bde90
> Author: Jens Axboe <axboe@kernel.dk>
> Date: Fri Aug 17 15:45:42 2018 -0700
>
> ext4: readpages() should submit IO as read-ahead
>
> as that guy is generally just not trust worthy. In all seriousness, though,
> it shouldn't cause issues (or I would not have done it), and we already do
> this for readpages in general, but I guess we could have an older bug in
> ext4 that depends deeply on read-ahead NOT failing. Not sure how likely that
> is, Ted can probably comment on that.
>
> But it's a trivial revert, and it could potentially be implicated.

Jens, could you provide the patch here, so that perhaps Jimmy and Nestor can revert it on their 4.19.x and tell us what they see? Thanks.

#94 makes me wonder if the problem may be related to https://lkml.org/lkml/2018/5/21/71. Just wondering, and I may be completely off track, but that problem is still seen against the mainline kernel.

#96: commit ac22b46a0b65 can be reverted cleanly with "git revert".

(In reply to Guenter Roeck from comment #97)
> #94 makes me wonder if the problem may be related to
> https://lkml.org/lkml/2018/5/21/71. Just wondering, and I may be completely
> off track, but that problem is still seen against the mainline kernel.
>
> #96: commit ac22b46a0b65 can be reverted cleanly with "git revert".

Yep, but perhaps some people here don't use git and/or haven't cloned the repo.

#98: Good point.

I am going to give it a try with the following on top of v4.19.5:

Revert "ext4: handle layout changes to pinned DAX mappings"
Revert "dax: remove VM_MIXEDMAP for fsdax and device dax"
Revert "ext4: close race between direct IO and ext4_break_layouts()"
Revert "ext4: improve code readability in ext4_iget()"
Revert "ext4: readpages() should submit IO as read-ahead"

Wild shot, but I figured it may be worth a try.
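For anyone who wants to test these reverts without editing files by hand, a sketch of how they could be applied on top of a stable tag; only the readpages change is named by hash in this thread, so the other commit IDs would have to be looked up by subject first (this assumes a clone that carries the stable tags):

git checkout -b ext4-reverts v4.19.5
# the readpages commit reverts cleanly, as noted above:
git revert ac22b46a0b65
# find the mainline hashes of the other suspect patches by subject line:
git log --oneline v4.18..v4.19 -- fs/ext4 fs/dax.c
# then revert them one by one, newest first:
# git revert <hash>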
So far I have been unable to reproduce the problem after reverting the patches mentioned in #99. I'll now install this kernel on a second previously affected system. I'll report back tomorrow morning. Guenter, what is your kernel config? A number of these changes are related to CONFIG_DAX. Are you building kernels with or without CONFIG_DAX enabled? Enabled: $ grep DAX .config CONFIG_NVDIMM_DAX=y CONFIG_DAX_DRIVER=y CONFIG_DAX=y CONFIG_DEV_DAX=m CONFIG_DEV_DAX_PMEM=m CONFIG_FS_DAX=y CONFIG_FS_DAX_PMD=y It doesn't look like dax is loaded, though. /dev/daxX does not exist on any of the affected systems, and lsmod doesn't show any dax modules. #101 Both 4.19.x kernels (VM/real HW):
> grep DAX .config
# CONFIG_DAX is not set
# CONFIG_FS_DAX is not set
Both kernels did *not* have the problem.
This may explain why some see the problem and others don't.
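Since a DAX-enabled .config does not necessarily mean DAX is actually in use, a quick way to collect what a given system is really doing is sketched below; it is essentially the checks from the comments above gathered in one place (the config path depends on the distribution, and Gentoo users can read /proc/config.gz instead):

grep -i dax /boot/config-"$(uname -r)"   # how the kernel was built
grep -i dax /proc/mounts                 # any filesystem mounted with the dax option?
ls /dev/dax* 2>/dev/null                 # any device-DAX devices present?
lsmod | grep -i dax                      # any dax modules loaded?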
FYI: 4.19.6 was just released, the is a DAX fixup inside: dax: Avoid losing wakeup in dax_lock_mapping_entry Maybe the courageous testers should also consider that one, if the issue is DAX related? I have seen the problem again tonight, but I am not sure if I cleaned the affected file system correctly with an older kernel before I started the test. I'll keep running with the same reverts for another day. #104: Possibly, but it doesn't explain why I see the problem only on two of four systems, all running the same kernel. (In reply to Guenter Roeck from comment #107) > #104: Possibly, but it doesn't explain why I see the problem only on two of > four systems, all running the same kernel. Right. Bye bye DAX-theory. I'm wondering by now whether I made a config-mistake somewhere and *that's* why I don't have the problem. ;) (In reply to Guenter Roeck from comment #107) > #104: Possibly, but it doesn't explain why I see the problem only on two of > four systems, all running the same kernel. Could it be hardware related like ie. blacklisted "trim" for ie. Samsung 850 Pro? Are the 4 machines absoutely equal hardware-wise (at least on the block layer)? Maybe such a quirk is needed for just another device... Running ext4 on 4.19.{5,4,3,2,1,0} with not one error with the following setup: root@marc:~ # lsblk /dev/sda NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 477G 0 disk ├─sda1 8:1 0 1G 0 part /boot ├─sda2 8:2 0 2M 0 part ├─sda3 8:3 0 2M 0 part ├─sda4 8:4 0 2M 0 part ├─sda5 8:5 0 1G 0 part ├─sda6 8:6 0 408G 0 part │ └─crypt-home 254:1 0 408G 0 crypt /home ├─sda7 8:7 0 59G 0 part / └─sda8 8:8 0 8G 0 part └─crypt-swap 254:0 0 8G 0 crypt root@marc:~ # mount | grep home /dev/mapper/crypt-home on /home type ext4 (rw,nosuid,noatime,nodiratime,quota,usrquota,grpquota) root@marc:~ # cryptsetup status crypt-home /dev/mapper/crypt-home is active and is in use. type: LUKS1 cipher: aes-xts-plain64 keysize: 512 bits key location: dm-crypt device: /dev/sda6 sector size: 512 offset: 4096 sectors size: 855633920 sectors mode: read/write root@marc:~ # egrep -i "(ext4|dax)" /boot/config-4.19.5loc64 CONFIG_DAX=y # CONFIG_DEV_DAX is not set CONFIG_EXT4_FS=y CONFIG_EXT4_USE_FOR_EXT2=y CONFIG_EXT4_FS_POSIX_ACL=y CONFIG_EXT4_FS_SECURITY=y # CONFIG_EXT4_ENCRYPTION is not set # CONFIG_EXT4_DEBUG is not set CONFIG_FS_DAX=y root@marc:~ # parted --list Modell: ATA Samsung SSD 860 (scsi) Festplatte /dev/sda: 512GB Sektorgröße (logisch/physisch): 512B/512B Partitionstabelle: gpt Disk-Flags: Nummer Anfang Ende Größe Dateisystem Name Flags 1 1049kB 1075MB 1074MB ext4 2 1075MB 1077MB 2097kB boot, esp 3 1077MB 1079MB 2097kB 4 1079MB 1081MB 2097kB 5 1081MB 2155MB 1074MB ext4 6 2155MB 440GB 438GB 7 440GB 504GB 63,4GB ext4 8 504GB 512GB 8518MB ... ... ... Modell: Linux device-mapper (crypt) (dm) Festplatte /dev/mapper/crypt-home: 438GB Sektorgröße (logisch/physisch): 512B/512B Partitionstabelle: loop Disk-Flags: Nummer Anfang Ende Größe Dateisystem Flags 1 0,00B 438GB 438GB ext4 (In reply to Marc Koschewski from comment #109) > Could it be hardware related like ie. blacklisted "trim" for ie. Samsung 850 > Pro? Are the 4 machines absoutely equal hardware-wise (at least on the block > layer)? Maybe such a quirk is needed for just another device... Marc, are you using an I/O scheduler? I'm not using an I/O scheduler: $ cat /sys/block/sda/queue/scheduler [none] (In reply to Bart Van Assche from comment #110) > (In reply to Marc Koschewski from comment #109) > > Could it be hardware related like ie. blacklisted "trim" for ie. 
Samsung > 850 > > Pro? Are the 4 machines absoutely equal hardware-wise (at least on the > block > > layer)? Maybe such a quirk is needed for just another device... > > Marc, are you using an I/O scheduler? I'm not using an I/O scheduler: > > $ cat /sys/block/sda/queue/scheduler > [none] I do: root@marc:~ # cat /sys/block/sda/queue/scheduler [mq-deadline] kyber bfq none might be relevant as well: root@marc:~ # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-4.19.5loc64 root=/dev/sda7 ro init=/sbin/openrc-init root=PARTUUID=6d19e60a-72a8-ee44-89f4-cc6f85a9436c real_root=/dev/sda7 ro resume=PARTUUID=fbc25a25-2d09-634d-9e8b-67308f2feddf real_resume=/dev/sda8 acpi_osi=Linux libata.dma=3 libata.noacpi=0 threadirqs rootfstype=ext4 acpi_sleep=s3_bios,s3_beep devtmpfs.mount=0 net.ifnames=0 vmalloc=512M noautogroup elevator=deadline libata.force=noncq nouveau.noaccel=0 nouveau.nofbaccel=1 nouveau.modeset=1 nouveau.runpm=0 nmi_watchdog=0 i915.modeset=0 cgroup_disable=memory scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y vgacon.scrollback_persistent=1 processor.ignore_ppc=1 intel_iommu=off crashkernel=128M apparmor=1 security=apparmor $ zcat /proc/config.gz |grep DAX CONFIG_DAX=m # CONFIG_FS_DAX is not set DAX I have it as a module, but I've never seen it loaded with lsmod. $ cat /sys/block/sda/queue/scheduler [none] Always use the mounting parameters in my partitions barrier=1,data=ordered $ cat /etc/fstab |grep LABEL LABEL=GentooBoot /boot ext4 noatime,noauto,barrier=1,data=ordered 0 2 LABEL=GentooSwap none swap swap 0 0 LABEL=GentooRaiz / ext4 noatime,barrier=1,data=ordered 0 1 LABEL=GentooHome /home ext4 noatime,barrier=1,data=ordered 0 2 Excellent point given by Guenter Roeck in comment 93 I would have to try to create several trees 4.19.5 and be removed one or two at a time, to isolate the fault. I still use 4.19.5 with the ext4 folder of 4.18.20 and zero problems. Regards (In reply to Néstor A. Marchesini from comment #112) > $ zcat /proc/config.gz |grep DAX > CONFIG_DAX=m > # CONFIG_FS_DAX is not set > > DAX I have it as a module, but I've never seen it loaded with lsmod. > > $ cat /sys/block/sda/queue/scheduler > [none] > > Always use the mounting parameters in my partitions barrier=1,data=ordered > > $ cat /etc/fstab |grep LABEL > LABEL=GentooBoot /boot ext4 noatime,noauto,barrier=1,data=ordered 0 > 2 > LABEL=GentooSwap none swap swap 0 0 > LABEL=GentooRaiz / ext4 noatime,barrier=1,data=ordered 0 1 > LABEL=GentooHome /home ext4 noatime,barrier=1,data=ordered 0 2 > > Excellent point given by Guenter Roeck in comment 93 > I would have to try to create several trees 4.19.5 and be removed one or two > at a time, > to isolate the fault. > I still use 4.19.5 with the ext4 folder of 4.18.20 and zero problems. > > Regards Have you given up on your plan to bisect this like suggested in comment 79? It would be only 5 steps for those 32 commits. And compile times should be rather short. If you know how to reproduce/provoke the errors it could be done within 2 hours or less. So I've gotten a query off-line about whether I'm still paying attention to this bug. The answer is that I'm absolutely paying attention. The reason why I haven't commented much is because there's not much else to say, and I'm still waiting for more information. On that front --- I am *absolutely* grateful for people who are still trying to debug this issue, especially when it may be coming at the risk of their data. However, one of the challenges is that it's very easy for reports to be either false positives or false negatives. 
False positives come from booting a kernel which might be fine, but the file system was corrupted from running a previous kernel. Remember, when you get an EXT4-fs error report, that's when the kernel discovers the file system corruption; it doesn't necessarily mean that the currently running kernel is buggy. To prevent this false positives, please run "e2fsck -fy /dev/sdXX > /tmp/log.1" to make sure the file system is clear before rebooting into the new kernel. If e2fsck -fy shows any problems that are fixed, please then run "echo 3 > /proc/sys/vm/drop_caches ; e2fsck -fn /dev/sdXX > /tmp/log.2" to make sure the file system is really clean. False negatives come from booting a kernel which is buggy, but since this bug seems to be a bit flakey, you're getting lucky/unlucky enough to such that after N hours/days, you just haven't tripped over the bug --- or you *have* tripped over the bug, but the kernel hasn't noticed the problem yet, and so it hasn't reported the EXT4-fs error yet. There's not a lot we can do to absolutely avoid all false negatives, but if you are running a kernel which you report is OK, and then a day later, it turns out you see corruption, please don't forget to submit a comment to bugzilla, saying, "my comment in #NNN, where I said a particular kernel was problem-free; turns out I have seen a problem with it." Again, my thanks for trying to figure out what's going on. Rest assure that Jens Axboe and I are both paying very close attention. This bug is a really scary one, both because of how the severity of its consequences, *and* because neither of us can reproduce it on our own systems or regression tests --- so we are utterly reliant on those people who *can* reproduce the issue to give us data. We very much want to make sure this gets fixed ASAP! Still playing. I have now seen the problem several times with the patches per #99 reverted, I am just not 100% sure if I see false positives. For those claiming that upstream developers don't care: I for my part do plan to spend as much time on this as needed to nail down the problem, though I have to admit that the comment in #87 almost made me quit (and wonder why I spend time, energy, and money running kerneltests.org). Here are two other data points, just for the record: 1. Like comment #65, I've only actually seen this corruption on three physical disks, and all of them were Western Digital Caviar Blacks. There is another disk in my system -- a different model -- that has been lucky so far; but this may be pure chance. My /dev/sda has had multiple problems, but /dev/sdb only got corrupted once even though they're the same model. 2. The corruption for me is occurring in files (contained in directories) that I haven't touched in a long time. They get backed up -- which means that they get read -- but few if any have been written recently. In addition, all of my mounts are "noatime." Normally I wouldn't expect corruption from *reading* files, which is what lead me to start swapping out disks and SATA controllers. (In reply to Artem S. Tashkinov from comment #87) > Regression testing could be carried out in a VM running on top of a ramdisk > (e.g. tmpfs) to speed up the process. > > I guess someone with a decent amount of persistence and spare time could do > that and test each individual commit between 4.18 and 4.19, however that > doesn't guarantee success since the bug might be hardware related and not > reproducible in a virtual environment. 
Or it might require obscene amounts > of RAM/disk space which would be difficult, if not impossible to reproduce > in a VM. > > I for one decided to stay on 4.18.x and not upgrade to any more recent > kernels until the regression is identified and dealt with. > > Maybe one day someone will become truly invested in the kernel development > process and we'll have proper QA/QC/unit testing/regression > testing/fuzzying, so that individuals won't have to sacrifice their data and > time because kernel developers are mostly busy with adding new features and > usually not really concerned with performance, security and stability of > their code unless they are pointed at such issues. You obviously have no idea wtf you are talking about, I suggest you go investigate just how much testing is done, continuously, on things like file system and storage. I take personal offense in in claims that developers "are not really concerned with performance and stability of their code". Here's a news flash for you - bugs happen, no matter how much testing is done on something. I have not observed the problem, but I have been thinking of maybe a more reliable way to detect a problem. btrfs has a "scrub" command that essentially verifies the checksum of every file on the disk. Now, ext4 does not have such a feature (as far as I know). How about people who are seeing this problem, do a recursive sha1sum -b of every file on the disk while in a known good state, and then do a sha1sum -c of every file on the disk to see which ones got corrupted. This might help when doing git bisect and checking that we are back to a known good file system, and in cases like comment #116, item 2. Also, I think there is a way to force a reboot to a particular kernel, using grub, so one could script and git bisect, reboot to old working kernel, fsck, then reboot to problem kernel and start next git bisect all using automated scripts. Anyway, just ideas. I think I need some education. It has been suggested suggested several times - both here and elsewhere on the web - that the problem might possibly be caused by bad drives. Yet, I don't recall a single instance where a disk error was reported in conjunction with this problem. I most definitely don't see one on my systems. Can hard drives and SSDs nowadays fail silently by reading bad data instead of reporting read (and/or write) errors ? I would find that thought quite scary. Can someone point me to related literature ? In this context, it seems odd that this presumed silent disk error would only show up with v4.19.x, but not with earlier kernels. > How about people who are seeing this problem, do a recursive sha1sum -b of > every file on the disk while in a known good state, and then do a sha1sum -c > of every file on the disk to see which ones got corrupted. FWIW, https://github.com/claudehohl/checksummer does that (and saves the checksums in a sqlite database). I have the problem again with kernel 4.19.5 (J. Axboe patches). Sorry I don't trust 4.19.x without Axboe patch because if some fs corruptions reappear they will be less violent than without the patch. OS uptime 8:51, 37 static ext4 mountpoints as they are reported by 'mount' command. 
I have checked with the same kernel 4.19.5 in rescue mode (i.e. none of the file systems are mounted). Summary:

- e2fsck -fy /dev/dm-4 > log.1
- echo 3 > /proc/sys/vm/drop_caches
- e2fsck -fn /dev/dm-4 > log.2
- reboot again in normal mode (/ is mounted)
- fsck -TRAC -r -M -p /dev/dm-X || fsck -f -TRAC -y -s /dev/dm-X
- if there is a new stable release, I compile and test the new release

Result: the system doesn't see any error (fsck) on reboot (rescue and normal boot). See attachments.

Created attachment 279771 [details]
dmesg shows errors before reboot
Created attachment 279773 [details]
logs show no error after reboot
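The clean-state check from comment 114, which the list above follows, can be wrapped in a small script for anyone testing candidate kernels. A minimal sketch; the device name is a placeholder and the filesystem must not be mounted read-write while it runs:

#!/bin/sh
DEV=/dev/dm-4                             # placeholder, adjust to the filesystem under test
e2fsck -fy "$DEV" > /tmp/log.1 2>&1       # repair pass
echo 3 > /proc/sys/vm/drop_caches         # make sure the second pass reads from disk
if e2fsck -fn "$DEV" > /tmp/log.2 2>&1; then
    echo "$DEV is clean, safe to boot the next test kernel"
else
    echo "WARNING: $DEV is still not clean, see /tmp/log.2" >&2
fi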
Please read more violent without the patch (In reply to James Courtier-Dutton from comment #118) > How about people who are seeing this problem, do a recursive sha1sum -b of > every file on the disk while in a known good state, and then do a sha1sum -c > of every file on the disk to see which ones got corrupted. > This might help when doing git bisect and checking that we are back to a > known good file system, and in cases like comment #116, item 2. That could take a lot of CPU time. On my system git status told me that about ten source files had disappeared from the kernel tree that I definitely had not deleted myself. In other words, git can be used to detect filesystem corruption. FWIW I believe this issue is affecting ZFS as well. I'm getting the occasional checksum error on a random drive of a RAID-Z configuration (five 4T WD Reds). I'd initially suspected a chipset (Intel 5400) issue as it's spread more or less evenly across the devices. However it's definitely a software issue, as it only occurs running kernel 4.19.[26] and disappears entirely with 4.18.20. On a different workstation with ZFS and Seagate drives I haven't been able to reproduce the issue. In reply to #118, from James Courtier-Dutton: While there have been a few people who have reported problems with the contents of their files, the vast majority of people are reporting problems that seem to include complete garbage being written into metadata blocks --- i.e., completely garbage in to inode table, block group descriptor, and superblocks. This is getting detected by the kernel noticing corruption, or by e2fsck running and noticing that the file system metadata is inconsistent. More modern ext4 file systems have metadata checksum turned on, but the reports from e2fsck seem to indicate that complete garbage (or, more likely, data meant for block XXX is getting written to block YYY); as such, the corruption is not subtle, so generally the kernel doesn't need checksums to figure out that the metadata blocks are nonsensical. It should be noted that ext4 has very strong checks to prevent this from happening. In particular, when a inode's logical block number is converted to a physical block number, there is block_validity checking to make sure that the physical block number for a data block does not map onto a metadata block. This prevents a corrupted extent tree from causing ext4 to try to write data meant for a data block on top of an inode table block, which would cause the sort of symptoms that some users have reported. One possible cause is that something below ext4 (e.g. the block layer, or an I/O scheduler) is scrambling the block number so that a file write meant for data block XXX is getting writen to metadata block YYY. If Eric Benoit's report in comment #126 is to believed, and he is seeing the same behavior with ZFS, then that might be consistent with a bug in the block layer. However, some people who have reported that transplanting ext4 from 4.18 onto 4.19 has caused the problem to go away. That would be an argument in favor of the problem being in ext4. Of course, both observations might be flawed (see my previous comments about false positive and negative reports). And there might be more than one bug that we are chasing at the moment. But the big question which we don't understand is why are some people seeing it, but not others. There are a huge number of variables, from kernel configs, to what I/O scheduler might be selected, etc. 
The bug also seems to be very flaky, and there is some hint that heavy I/O load is required to trigger the bug. So it might be that people who think their kernel is fine, might actually be buggy, because they simply haven't pushed their system hard enough. Or it might require heavy workloads of a specific type (e.g., Direct I/O or Async I/O), or one kind of workload racing with another type of workload. This is what makes tracking down this kind of bug really hard. To Guenter, re: #119. This is just my intuition, but this doesn't "smell" like a problem with a bad drive. There are too many reports where people have said that they don't see the problem with 4.18, but they do see it with 4.19.0 and newer kernels. The reports have been with different hardware, from HDD's to SSD's, with some people reporting NVMe-attached SSD And some reporting SATA-attached SSD's. Can hard drives and SSDs nowadays fail silently by reading bad data instead of reporting read (and/or write) errors? One of the things I've learned in my decades of storage experience is to never rule anything out --- hardware will do some very strange things. That being said.... no, normally this would be highly unlikely. Hard Drive and SSD's have strong error-correcting codes, parity and super-parity checks in their internal data paths, so silent read errors are unlikely, unless the firmware is *seriously* screwed up. In addition, the fact that some of the reports involve complete garbage getting written into the inode table, it seems more likely the problem is on the writing side rather than on the read side. One thing I would recommend is "tune2fs -e remount-ro /dev/sdXX". This will set the default mode to remount the file system read-only. So if the problem is on the read side, it makes it even more unlikely that the data will be written back to disk. Some people may prefer "tune2fs -e panic /dev/sdXX", especially on production servers. That way, when the kernel discovers a file system inconsistency, it will immediately force a reboot, and then the fsck run on reboot can fix up the problem. More importantly, by preventing the system from continuing to operate after a problem has been noticed, it avoids the problem metastasizing, making things even worse. (In reply to carlphilippreh from comment #29) > Sorry for the late response, but I have been trying to reproduce the problem > with 4.19.2 for some while now. It seems that the problem I was experiencing > only happens with 4.19.1 and 4.19.0, and it did so very frequently. I can at > least confirm that I have CONFIG_SCSI_MQ_DEFAULT=y set in 4.19 but I didn't > in 4.18. I hope that this is, at least for me, fixed for now. While I wasn't able to reproduce the bug for quite some time, it ended up coming back. I'm currently running 4.19.6 and I see invalid metadata in files that I have written using this version. Just as I was writing this, my third computer running Linux (currently at 4.19.6) is now also running into this issue. has anybody ever seen that bug within a virtual machine? 
i currently run 4.19.x only inside VMs on VMware Workstation / VMware ESXi and did not see any issues, my only phyiscal test was my homeserver which completley crahsed 4 times like because of the VMware Workstation 14 kernel-modules lasted only for a weekend (RAID10, 2 Samsung Evo 850 2 TB, 2 Samsung Evo 860 2 TB) after the last crash left a 0 byte "firewall.sh" in the nested VM i was working for hours (In reply to carlphilippreh from comment #129) > While I wasn't able to reproduce the bug for quite some time, it ended up > coming back. I'm currently running 4.19.6 and I see invalid metadata in > files that I have written using this version. I think it could be helpful if you provided your .config(s) here. And what kernel are you using: self-compiled/from your distribution (which)? If self-compiled: have you made changes to the .config? The set-up of the boxes you mentioned in comment 5 seems just right to hunt this down. ;) Created attachment 279779 [details]
Config of first computer
Created attachment 279781 [details]
Config of second computer
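Because affected and unaffected reports keep differing in I/O scheduler and blk-mq settings, it may help to post the runtime values alongside the .config. A sketch of the relevant files on a 4.18/4.19 system; the device-name globs are examples and the module parameters only exist when the corresponding modules are present:

for d in /sys/block/sd? /sys/block/nvme?n?; do
    [ -d "$d" ] || continue
    echo "$d: scheduler=$(cat "$d"/queue/scheduler), rotational=$(cat "$d"/queue/rotational)"
done
cat /sys/module/scsi_mod/parameters/use_blk_mq 2>/dev/null   # Y/N
cat /sys/module/dm_mod/parameters/use_blk_mq 2>/dev/null     # Y/N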
Doesn't the Linux kernel team have any procedures in place for when such a critical bug is found? There are many people running this "stable" 4.19 branch, many of whom are unaware of this bug. Shouldn't the stable branch be rolled back to the last known good version? Going back to 4.18 is certainly a better option, but people unaware of this bug might still be running 4.19. (In reply to Rainer Fiebig from comment #132) > (In reply to carlphilippreh from comment #129) > > While I wasn't able to reproduce the bug for quite some time, it ended up > > coming back. I'm currently running 4.19.6 and I see invalid metadata in > > files that I have written using this version. > > I think it could be helpful if you provided your .config(s) here. > > And what kernel are you using: self-compiled/from your distribution (which)? > If self-compiled: have you made changes to the .config? > > The set-up of the boxes you mentioned in comment 5 seems just right to hunt > this down. ;) I'm configuring the kernels myself. Two things that I always enable and that _might_ be related are: CONFIG_BLK_WBT / CONFIG_BLK_WBT_MQ and CONFIG_CFQ_GROUP_IOSCHED Maybe I can come up with a way to reproduce this bug more quickly. Writing a lot of (small) files and then deleting them seems like a good way so far. (In reply to jaapbuurman from comment #135) > Doesn't the Linux kernel team have any procedures in place for when such a > critical bug is found? There are many people running this "stable" 4.19 > branch, many of whom are unaware of this bug. Shouldn't the stable branch be > rolled back to the last known good version? Going back to 4.18 is certainly > a better option, but people unaware of this bug might still be running 4.19. That would mean depublishing of the 4.19 release as a whole as nobody knows _what_ exactly to roll back. And if one would know, they would fix the bug instead. I cannot remember such a scenario/bug in the past... at least continue security updates for 4.18.x would probably be a good idea Fedora 28 is already on 4.19.x i run now 4.18.20-100.fc27.x86_64 which was the last Fedora 27 update on every F28 server and until this problem is solved i refuse to run 4.19.x in production which essentially means no security fixes for a unknown amount of time (In reply to carlphilippreh from comment #136) > (In reply to Rainer Fiebig from comment #132) > > (In reply to carlphilippreh from comment #129) > > > While I wasn't able to reproduce the bug for quite some time, it ended up > > > coming back. I'm currently running 4.19.6 and I see invalid metadata in > > > files that I have written using this version. > > > > I think it could be helpful if you provided your .config(s) here. > > > > And what kernel are you using: self-compiled/from your distribution > (which)? > > If self-compiled: have you made changes to the .config? > > > > The set-up of the boxes you mentioned in comment 5 seems just right to hunt > > this down. ;) > > I'm configuring the kernels myself. Two things that I always enable and that > _might_ be related are: > > CONFIG_BLK_WBT / CONFIG_BLK_WBT_MQ > and > CONFIG_CFQ_GROUP_IOSCHED > > Maybe I can come up with a way to reproduce this bug more quickly. Writing a > lot of (small) files and then deleting them seems like a good way so far. I have these config options set and _currently_ no corruption. Having this compiled in is *probably* not what to look for. Rather people should seek for actual *usage* of these features. I use the deadline scheduler. 
root@marc:~ # egrep "(BLK_WBT|IOSCH)" /boot/config-4.19.5loc64 CONFIG_BLK_WBT=y CONFIG_BLK_WBT_SQ=y CONFIG_BLK_WBT_MQ=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_CFQ_GROUP_IOSCHED=y CONFIG_DEFAULT_IOSCHED="deadline" CONFIG_MQ_IOSCHED_DEADLINE=y CONFIG_MQ_IOSCHED_KYBER=y CONFIG_IOSCHED_BFQ=y # CONFIG_BFQ_GROUP_IOSCHED is not set Could someone gather a list of what actually is in .configs but is relevant/irrelevant? I don't want to do is but I'm not really sure to not mess is up. I mean there was "DAX enabled in the .config" talked about but I have it compiled-in but I'm not actually using is. I would, moreover, like to gather actual setting used by people who run into the bug and those who are not, like currently used schedulers, nr_requests, discard, ... (In reply to Marc Burkhardt from comment #137) > (In reply to jaapbuurman from comment #135) > > Doesn't the Linux kernel team have any procedures in place for when such a > > critical bug is found? There are many people running this "stable" 4.19 > > branch, many of whom are unaware of this bug. Shouldn't the stable branch > be > > rolled back to the last known good version? Going back to 4.18 is certainly > > a better option, but people unaware of this bug might still be running > 4.19. > > That would mean depublishing of the 4.19 release as a whole as nobody knows > _what_ exactly to roll back. And if one would know, they would fix the bug > instead. > > I cannot remember such a scenario/bug in the past... I know it sounds bad, but isn't depublishing 4.19 the best course of action right now? There's probably a lot of people running 4.19 that are completely unaware of this bug and might or might not run into this later. Data corruption issues are one of the worst, and should be addressed ASAP, even if it means temporary depublishing the latest kernel, right? (In reply to jaapbuurman from comment #140) > (In reply to Marc Burkhardt from comment #137) > > (In reply to jaapbuurman from comment #135) > > > Doesn't the Linux kernel team have any procedures in place for when such > a > > > critical bug is found? There are many people running this "stable" 4.19 > > > branch, many of whom are unaware of this bug. Shouldn't the stable branch > > be > > > rolled back to the last known good version? Going back to 4.18 is > certainly > > > a better option, but people unaware of this bug might still be running > > 4.19. > > > > That would mean depublishing of the 4.19 release as a whole as nobody knows > > _what_ exactly to roll back. And if one would know, they would fix the bug > > instead. > > > > I cannot remember such a scenario/bug in the past... > > I know it sounds bad, but isn't depublishing 4.19 the best course of action > right now? There's probably a lot of people running 4.19 that are completely > unaware of this bug and might or might not run into this later. > > Data corruption issues are one of the worst, and should be addressed ASAP, > even if it means temporary depublishing the latest kernel, right? 4.18.20 is from Nov 21st and came with 4.19.3. It lacks 3 releases of fixes parallel to 4.19.6 due to 4.18 being EOL. 4.19 is out in the wild now. You cannot "get it back" ... And people are probably more aware of a new 4.19 release pushed by the distros than a rollback of the 4.19 release. (In reply to carlphilippreh from comment #134) > Created attachment 279781 [details] > Config of second computer Thanks. It'll take a while to sift through this. 
As an alternative to 4.19 you may want to use one of the latest LTS-kernels, 4.14.84 perhaps.[1] But before compiling/installing it, make sure the fs is OK (s. comment 114). [1] https://www.kernel.org/ Another datapoint: I have observed Ext4 metadata corruption under both 4.19.1 and 4.19.4. I'm using LVM (but no RAID); the underlying drive is a 1GB SATA-attached Samsung 850 PRO SSD. I've not been able to reliably reproduce, but an rsync-based backup of my home partition runs once an hour and it usually starts reporting corruption errors within a day or two of booting a 4.19.x kernel. So far the corruption has only happened in directories that I am not actively using - as far as I know they are only being accessed by the rsync process. Since I started seeing the corruption under 4.19.x, I've run 4.18.16 for two stretches, one of which was twelve days, without any problems, so I'm quite confident it is not an issue of defective hardware. I have a weekly cron job which runs fstrim, but at least once I booted into 4.19.4 (previous boot was 4.18.16), and started seeing metadata corruption after about 36 hours, but fstrim had not run during that time. Some (possibly) relevant kernel configs: CONFIG_SCSI_MQ_DEFAULT=y # CONFIG_DM_MQ_DEFAULT is not set # CONFIG_MQ_IOSCHED_DEADLINE is not set # CONFIG_MQ_IOSCHED_KYBER is not set CONFIG_DAX_DRIVER=y CONFIG_DAX=y # CONFIG_DEV_DAX is not set # CONFIG_FS_DAX is not set $ cat /sys/block/sda/queue/scheduler [none] bfq I'm happy to report any more info about my kernel/system if it would be helpful, but unfortunately I don't have the bandwidth to do any bisection right now. (In reply to Daniel Harding from comment #143) > Another datapoint: I have observed Ext4 metadata corruption under both > 4.19.1 and 4.19.4. I'm using LVM (but no RAID); the underlying drive is a > 1GB SATA-attached Samsung 850 PRO SSD. I've not been able to reliably > reproduce, but an rsync-based backup of my home partition runs once an hour > and it usually starts reporting corruption errors within a day or two of > booting a 4.19.x kernel. So far the corruption has only happened in > directories that I am not actively using - as far as I know they are only > being accessed by the rsync process. Since I started seeing the corruption > under 4.19.x, I've run 4.18.16 for two stretches, one of which was twelve > days, without any problems, so I'm quite confident it is not an issue of > defective hardware. I have a weekly cron job which runs fstrim, but at > least once I booted into 4.19.4 (previous boot was 4.18.16), and started > seeing metadata corruption after about 36 hours, but fstrim had not run > during that time. > > Some (possibly) relevant kernel configs: > CONFIG_SCSI_MQ_DEFAULT=y > # CONFIG_DM_MQ_DEFAULT is not set > # CONFIG_MQ_IOSCHED_DEADLINE is not set > # CONFIG_MQ_IOSCHED_KYBER is not set > CONFIG_DAX_DRIVER=y > CONFIG_DAX=y > # CONFIG_DEV_DAX is not set > # CONFIG_FS_DAX is not set > > $ cat /sys/block/sda/queue/scheduler > [none] bfq > > I'm happy to report any more info about my kernel/system if it would be > helpful, but unfortunately I don't have the bandwidth to do any bisection > right now. Bisecting just fs/ext4 (comment 79) wouldn't cost much time. Just 32 commits, 5 steps. It won't get much cheaper than that. 
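For anyone taking up that suggestion, a sketch of the bisect workflow with the clean-filesystem check from comment 114 between steps; the tree location and the workload are placeholders, and the fsck should be run from a known-good kernel (e.g. 4.18.20) before marking each result:

cd linux                                    # a clone carrying the v4.18 and v4.19 tags
git bisect start v4.19 v4.18 -- fs/ext4     # only ~32 commits, about 5 steps
# at each step:
make olddefconfig && make -j"$(nproc)" && make modules_install install
# reboot into the new build, run the workload that triggers the corruption,
# then reboot into the known-good kernel, run e2fsck -fy on the test filesystems,
# and only then mark the step:
git bisect good     # or: git bisect bad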
(In reply to Reindl Harald from comment #138) > at least continue security updates for 4.18.x would probably be a good idea > > Fedora 28 is already on 4.19.x > > i run now 4.18.20-100.fc27.x86_64 which was the last Fedora 27 update on > every F28 server and until this problem is solved i refuse to run 4.19.x in > production which essentially means no security fixes for a unknown amount of > time Perhaps you can use one of the LTS-kernels, like 4.14.84. > Perhaps you can use one of the LTS-kernels, like 4.14.84
on Fedora 28?
seriously?
(In reply to Reindl Harald from comment #131)
> has anybody ever seen that bug within a virtual machine?
>
> i currently run 4.19.x only inside VMs on VMware Workstation / VMware ESXi
> and did not see any issues, my only phyiscal test was my homeserver which
> completley crahsed 4 times like because of the VMware Workstation 14
> kernel-modules lasted only for a weekend (RAID10, 2 Samsung Evo 850 2 TB, 2
> Samsung Evo 860 2 TB) after the last crash left a 0 byte "firewall.sh" in
> the nested VM i was working for hours

I've installed 4.19.x with a defconfig in a VirtualBox VM, hoping the issue would show up and I could bisect it there. I've also varied the config params that have been discussed here. But unfortunately that damn thing runs as nicely in the VM as it does on real iron - at least here. :)

(In reply to Reindl Harald from comment #146)
> > Perhaps you can use one of the LTS-kernels, like 4.14.84
>
> on Fedora 28?
> seriously?

Sorry, just trying to help. And I didn't know that one can't run LTS-kernels on Fedora 28.

(In reply to Jimmy.Jazz from comment #121)
> I have the problem again with kernel 4.19.5 (J. Axboe patches). Sorry I
> don't trust 4.19.x without Axboe patch because if some fs corruptions
> reappear they will be less violent than without the patch.

You had the issue with the full block patch applied, the one that includes both the synchronize_rcu() and the quiesce? Or just the partial one I suggested earlier?

> Result:
> The system doesn't see any error (fsck) on reboot (rescue and normal boot).

Interesting, so it didn't make it to media.

A whole block was corrupted at once, with each inode/file returning "Structure needs cleaning" and bad extra_isize errors in syslog, three hours into a plain cp -ax from ext4 to a BTRFS mdraid, on Ubuntu 18.04.1 with mainline kernel 4.19.6 in init level 1 (Ubuntu rescue mode, with almost nothing else running). I saved the corrupted block via debugfs; the filesystem mounts read-only. After dropping the disk caches the block is fine again and the files are accessible. I'm seeing if I can find the corrupted block contents anywhere else in the filesystem. The error first happened for me on 4.19.5 after running cleanly for days; now it comes constantly.

(In reply to Jens Axboe from comment #149)
> You had the issue with the full block patch applied, the one that includes
> both the synchronize_rcu() and the quiesce? Or just the partial one I
> suggested earlier?

synchronize_rcu() and the quiesce, as you asked me.

> Interesting, so it didn't make it to media.

The following tests were made on another computer, named orca, so that they are not confused with my earlier comments. Again, I can confirm it, but only with your patches applied. On orca with 4.20 and without your patch, the bug was able to entirely wipe out orca's postgres database :( It gave me the opportunity to do a full reinstall of orca from the stick.

Don't be confused by the mmp_node_name hostname on the newly created partitions; it has an easy explanation: the bootable stick used to create the filesystems has a different hostname than the final server (i.e. orca).

Please read the attached bug.orca.tar.xz tar file. You can follow the log sequence from the file creation times. I stress that the new corruption on dm-10 after the server rebooted has nothing to do with the one announced earlier in dmesg. Read dmesg-zalman.txt, dmesg-zalman-2.txt and then dumpe2fs-dm-10-after-e2fsk.txt, dmesg-after-e2fsk.txt, in that order. It shows that the dm-10 corruption was initiated during the reboot.

Created attachment 279801 [details]
new generated server
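For reports like the ones above, where a corrupted block "heals" after dropping caches, it can help to capture the block both through the page cache and from the media before rebooting. A minimal sketch, assuming the affected device is /dev/sda1 with a 4 KiB block size; the block number 123456 is only a placeholder:

# BLK=123456; DEV=/dev/sda1
# dd if=$DEV bs=4096 skip=$BLK count=1 of=/tmp/block.cached                # may be served from the page cache
# dd if=$DEV bs=4096 skip=$BLK count=1 iflag=direct of=/tmp/block.ondisk   # O_DIRECT read, bypasses the page cache
# cmp /tmp/block.cached /tmp/block.ondisk && echo "cache and media agree"
# echo 3 > /proc/sys/vm/drop_caches                                        # then re-check whether the error persists

If the two copies differ, the corruption only exists in memory; if they match and are both garbage, it has already reached the media.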
(In reply to Theodore Tso from comment #127) > > While there have been a few people who have reported problems with the > contents of their files, the vast majority of people are reporting problems > that seem to include complete garbage being written into metadata blocks --- > i.e., completely garbage in to inode table, block group descriptor, and > superblocks. This is getting detected by the kernel noticing corruption, > or by e2fsck running and noticing that the file system metadata is > inconsistent. More modern ext4 file systems have metadata checksum turned > on, but the reports from e2fsck seem to indicate that complete garbage (or, > more likely, data meant for block XXX is getting written to block YYY); as > such, the corruption is not subtle, so generally the kernel doesn't need > checksums to figure out that the metadata blocks are nonsensical. > Is it possible to determine the locality of these corruptions? I.e. Is the corruption to a contiguous page of data (e.g. 4096 bytes corrupted) or is the corruption scattered, a few bytes here, a few bytes there? From your comment about "data meant for block XXX is getting written to block YYY" can I assume this is fact, or is it still TBD? If it is contiguous data, is there any pattern to the data that would help us identify where it came from? Maybe that would help work out where the corruption was coming from. Maybe it is DMA from some totally unrelated device driver, but by looking at the data, we might determine which device driver it is? It might be some vulnerability in the kernel that some hacker is trying to exploit, but unsuccessfully, resulting in corruption. This could explain the reason why more people are not seeing the problem. Some people reporting that the corruptions are not getting persisted to disk in all cases, might imply that the corruption is happening outside the normal code paths, because the normal code path would have tagged the change as needing flushing to disk at some point. Looking at the corrupted data would also tell us if values are within expected ranges, that the normal code path would have validated. If they are outside those ranges, then it would again imply that the corrupt data is not being written by the normal ext4 code path, thus further implying that there is not a bug in the ext4 code, but something else in the kernel is writing to it by mistake. I have scanned all the comments. So far I have only seen 1 person who has this problem and have also reported what hardware they have. So, the sample size is statistically far too small to conclude that it is an AMD or a INTEL only problem. Is there anyone out there who sees this problem, and is running Intel hardware? How many people are seeing this problem? Can they each post the output of "lspci -vvv" and a dmesg showing the problem they have? This appears to be a problem that is reported by an extremely small amount of people. #154 re. Intel: start with comments 3/5/6. Status update: I have not been able to reproduce the problem with v4.19.6 minus the reverts from #99. I did see some failures, specifically exactly one per affected file system, but I attribute those to false positives (I did not run fsck as recommended prior to starting the test). stats: System 1: uptime: 18h 45m iostats: Device tps MB_read/s MB_wrtn/s MB_read MB_wrtn loop0 0.00 0.00 0.00 0 0 nvme0n1 195.54 3.27 3.96 220723 267555 sda 128.88 0.36 18.91 24659 1277907 sdb 131.40 18.85 5.07 1273780 342404 nvme0n1 and sda were previously affected. sdb is a new drive. 
System 2: uptime: 14h 56m iostats: Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn loop0 0.00 0.00 0.00 5 0 sda 26.45 538.25 87.87 28965283 4728576 sdb 108.87 2875.25 4351.42 154728917 234167724 Both sda and sdb were previously affected. My next step will be to try v4.19.6 with the following reverts: Revert "ext4: handle layout changes to pinned DAX mappings" Revert "dax: remove VM_MIXEDMAP for fsdax and device dax" Revert "ext4: close race between direct IO and ext4_break_layouts()" I have had file system corruption with 4.19 on a BTRFS file system as well. 4.19.2 Kernel. I think this has to be related. No files were actually corrupted but the Kernel set the file system read-only as soon as the error occurred. I have even tried a NEW and FRESH btrfs file system created via a LiveCD system and it happened there as well as soon as I did a btrfs send/receive operation. I am on a Thinkpad T450. lspci -vvv https://paste.pound-python.org/show/9tZLWlry0Iy7Z629VPea/ Same paste as root: https://paste.pound-python.org/show/DZTBYXQBFhcHi69OBba8/ #156: A ray of hope. Underpins Nestors findings (comment 78). Another update: I hit the problem almost immediately with the reverts from #156. [ 1826.738686] EXT4-fs error (device sda1): ext4_iget:4796: inode #7633436: comm borg: bad extra_isize 28534 (inode size 256) [ 1826.740744] Aborting journal on device sda1-8. [ 1826.747339] EXT4-fs (sda1): Remounting filesystem read-only #160: Down to five now. As next step, I am going to try v4.19.6 with the following reverts: Revert "ext4: readpages() should submit IO as read-ahead" Revert "ext4: improve code readability in ext4_iget()" Those with btrfs problems might consider reverting commit 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") and report the results. First report specifies AMD, second report is Intel, and so on. I agree more detailed system information might help find commonalities and false positives, but the cross-platform nature of the problem seemed established right from the start. AMD Phenom(tm) II X4 B50 Processor, SSHD ST1000LM014-1EJ164 [11514.358542] EXT4-fs error (device dm-0): ext4_iget:4831: inode #18288150: comm cp: bad extra_isize 49917 (inode size 256) [11514.386613] Aborting journal on device dm-0-8. [11514.389070] EXT4-fs (dm-0): Remounting filesystem read-only Errors for each of the inodes on the block follow, until I dropped filesystem caches (drop_caches 3) and accessed them again and they were fine. Corrupted block looked random binary, but not compressed. BTRFS was reporting csum errors every time I dropped caches, which makes me wonder if people having the problem are using BTRFS? There was a recent post on linux-ext4 that this might relate to a compiler bug: > After four days playing games around git bisect - real winner is > debian gcc-8.2.0-9. Upgrade it to 8.2.0-10 or use 7.3.0-30 version for > same kernel + config - does not exhibit ext4 corruption. > I think I hit this https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87859 > with 8.2.0-9 version. Can people hitting this please confirm or deny whether this compiler is in use on your system. groeck@server:~$ gcc --version gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
However, at least here: cat /proc/version Linux version 4.19.6-041906-generic (kernel@gloin) (gcc version 8.2.0 (Ubuntu 8.2.0-9ubuntu1)) #201812010432 SMP Sat Dec 1 09:34:07 UTC 2018 All 4.19.x kernels I tested were built with gcc version 8.2.1 20181025 [gcc-8-branch revision 265488]. Hereby my experience that may be related: [ 2451.982816] EXT4-fs error (device dm-1): ext4_iget:4831: inode #6029313: comm ls: bad extra_isize 5 (inode size 256) root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1 debugfs 1.43.4 (31-Jan-2017) Inode Pathname 6029313 //sabnzb ncheck: Inode checksum does not match inode while doing inode scan root@ster:/# echo 2 > /proc/sys/vm/drop_caches root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1 debugfs 1.43.4 (31-Jan-2017) Inode Pathname 6029313 //sabnzb ncheck: Inode checksum does not match inode while doing inode scan root@ster:/# echo 3 > /proc/sys/vm/drop_caches root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1 debugfs 1.43.4 (31-Jan-2017) Inode Pathname 6029313 //sabnzb Kernel v4.19.5, CPU Intel Atom D525, Debian Linux 9.6, brand new WDC WD40EFRX-68N32N0, gcc 6.3.0-18+deb9u1. Also seen with an ext4 filesystem created on Nov 21 2018. Also seen with earlier 4.19.0 kernel, and older WDC WD30EFRX-68A in same computer. Going back to v4.18.<latest> kernel solved the issues. No disk corruption shown by e2fsck. FWIW, I didn't see any problems with 4.19.0, but see it on all my systems with 4.19.3 and onward (although I *did* skip 4.19.[12]. There fore, I embarked on a git bisect in linux-stable from v4.19 to v4.19.3, which is nearing its end, *so far with every iteration marked GOOD*. Referencing https://www.spinics.net/lists/linux-ext4/msg63498.html (#164), and noting that I usually run kernels from kernel.ubuntu.com/~kernel-ppa/mainline , I did the following: smo@dell-smo:~$ cat /proc/version Linux version 4.19.0-041900-generic (kernel@tangerine) (gcc version 8.2.0 (Ubuntu 8.2.0-7ubuntu1)) #201810221809 SMP Mon Oct 22 22:11:45 UTC 2018 Then, I downloaded 4.19.3 from kernel-ppa, unpacked, and: smo@dell-smo:~/src/deb/foo/boot$ strings vmlinuz-4.19.3-041903-generic |grep 8.2.0 4.19.3-041903-generic (kernel@gloin) (gcc version 8.2.0 (Ubuntu 8.2.0-9ubuntu1)) #201811210435 SMP Wed Nov 21 09:37:20 UTC 2018 BANG, as they say: 8.2.0-9. Whereas git bisect "GOOD"s continuously (as stated, it is not complete - only almost) are not impossible, they certainly don't seem entirely normal, but: sune@jekaterina:~$ gcc --version gcc (Ubuntu 8.2.0-7ubuntu1) 8.2.0 ...on the system where I self-compile during the bisect, *could* explain it. My impression is, that a lot of affected people are on Ubuntu, and I suspect the following: * Many of the affected Ubuntu folks do indeed use kernels from kernel-ppa * Some of those, as well as non-Ubuntu-folks, may have that compiler version for other reasons, and hit the bugs on that account * Bisecting yields inconclusive results, as it seems to do for me, since the Issues is non-kernel. * Theodore T'so and Jens Axboe are unable to reproduce due to unaffected compiler versions, which also explains the no-show in regression tests. Tso, Axboe: With the two of you being completely unable to replicate, could you be enticed to either try GCC 8.2.0-9 (or, possibly, just the packages from the following URLs, and run your regression tests against those? http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19 (presumed GOOD) http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19 (presumed BAD) Best regards, Sune Mølgaard Meh, typos: 1. "(or, possibly..." 
should end the parenthesis after "...the following URLs) 2. Last link should be http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.3 To reiterate, I use gcc version "5.4.0-6ubuntu1~16.04.10" to build my kernels. Also, I build the kernels on a system not affected by the problem. It may well be that a compiler problem in gcc 8.2.0 causes additional trouble, but it is not the trouble observed on my affected systems. $ gcc --version gcc (Gentoo Hardened 7.3.0-r3 p1.4) 7.3.0 The commit 2a5cf35cd6c56b2924("block: fix single range discard merge") in linus tree may address one possible data loss, anyone who saw corruption in scsi may try this fix and see if it makes a difference. Given the merged discard request isn't removed from elevator queue, it might be possible to be submitted to hardware again. One of the reasons why this is bug hunt is so confounding. While I was looking at older reports to try to see if I could find common factors, I found Jimmy's dmesg report in #50, and this one looks different from many of the others that people have reported. In this one, the EXT4 errors are preceeded by a USB disconnect followed by disk-level errors. This is why it's important that we try very hard to filter out false positives and false negative reports. We have multiple reports which both strongly indicate that it's an ext4 bug, and others which strongly indicate it is a bug below the file system layer. And then we have ones like this which look like a USB disconnect.... [52967.931390] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [52968.985620] sd 8:0:0:2: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 [52968.985624] sd 8:0:0:2: [sdf] tag#0 Sense Key : 0x6 [current] [52968.985626] sd 8:0:0:2: [sdf] tag#0 ASC=0x28 ASCQ=0x0 [52968.985628] sd 8:0:0:2: [sdf] tag#0 CDB: opcode=0x2a 2a 00 00 cc 60 28 00 08 00 00 [52968.985630] print_req_error: I/O error, dev sdf, sector 13393960 [52968.985641] EXT4-fs warning (device sdf2): ext4_end_bio:323: I/O error 10 writing to inode 522 (offset 6132072448 size 6295552 starting block 1674501) [52968.985643] Buffer I/O error on device sdf2, logical block 1673728 [52968.985651] Buffer I/O error on device sdf2, logical block 1673729 [52968.985654] Buffer I/O error on device sdf2, logical block 1673730 [52968.985659] Buffer I/O error on device sdf2, logical block 1673731 [52968.985663] Buffer I/O error on device sdf2, logical block 1673732 ... [52968.986231] EXT4-fs warning (device sdf2): ext4_end_bio:323: I/O error 10 writing to inode 522 (offset 6132072448 size 8388608 starting block 1675013) [52969.435367] JBD2: Detected IO errors while flushing file data on sdf2-8 [52969.435407] Aborting journal on device sdf2-8. [52969.435422] JBD2: Error -5 detected when updating journal superblock for sdf2-8. [52969.441997] EXT4-fs error (device sdf2): ext4_journal_check_start:61: Detected aborted journal [52985.065239] EXT4-fs error (device sdf2): ext4_remount:5188: Abort forced by user I guess I may have been biased towards the posts mentioning the GCC bug, then, but that would lead me to think that I am not alone in conflating that one with actual ext4 or block layer bugs. I shall go ahead and reference my comment above (#169) to the Ubuntu kernel-ppa folks, and in the event that this will then preclude others from mis-attributing the GCC bug to these, I should hope to at least effect an elimination of that noise source from this Bugzilla entry. My apologies, and keep up the good work! 
Hi Sune, alas for your theory in #169, I am already using gcc 8.2.0-9 from Debian testing. % gcc --version gcc (Debian 8.2.0-9) 8.2.0 Could it an Ubuntu-specific issue? I don't think so, since there have been some people running Debian and Gentoo who have reported the problem, and one person who reported the problem was running Debian and was using gcc 8.2.0-9. I have built kernels using gcc 8.2.0-9 and used them for regression testing using gce-xfstests: % objdump -s --section .comment /build/ext4-64/vmlinux Contents of section .comment: 0000 4743433a 20284465 6269616e 20382e32 GCC: (Debian 8.2 0010 2e302d39 2920382e 322e3000 .0-9) 8.2.0. The kernel I am using on my personal development laptop was compiled using gcc 8.2.0-8: % objdump -s --section .comment /usr/lib/debug/lib/modules/4.19.0-00022-g831156939ae8/vmlinux Contents of section .comment: 0000 4743433a 20284465 6269616e 20382e32 GCC: (Debian 8.2 0010 2e302d38 2920382e 322e3000 .0-8) 8.2.0. Of course, I'm not doing anything more exciting than running chrome, mutt, emacs, and building kernels most of the time... I did a lot of tests here, the first thing was to configure with tune2fs so that in each boot I forcefully check my three partitions, the / boot the / root and the / home partition. # tune2fs -c 1 /dev/md0 # tune2fs -c 1 /dev/md2 # tune2fs -c 1 /dev/md3 I have reinstalled and compiled tree 4.19.5 and 4.19.0 from scratch, as well as the new tree 4.19.6. I have not had problems with the 4.19.5 or with the new 4.19.6, many hours of use and restarts every time .. everything perfect. But at the first boot with 4.19.0 ... corruption of the root partition. it leaves me in the console for repair, I repair it with: # fsck.ext4 -y /dev/md2 After started, I'll see /lost+found and find many folders and files in perfect condition, not corrupt, but with the numeric names # # ls -l /lost+found/ total 76 -rw-r--r-- 1 portage portage 5051 dic 10 2013 '#1057825' drwxr-xr-x 3 portage portage 4096 dic 10 2013 '#1057827' -rw-r--r-- 1 root root 2022 oct 22 03:37 '#3184673' -rw-r--r-- 1 root root 634 oct 22 03:37 '#3184674' etc... etc... So decided I started with the bisection, download only from 4.18 onwards. $ su # cd /usr/src # git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git --shallow-exclude v4.17 linux-stable # eselect kernel list Available kernel symlink targets: [1] linux-4.18.20-gentoo [2] linux-4.19.0-gentoo [3] linux-4.19.5-gentoo [4] linux-4.19.6-gentoo * [5] linux-stable # eselect kernel set 5 # eselect kernel list Available kernel symlink targets: [1] linux-4.18.20-gentoo [2] linux-4.19.0-gentoo [3] linux-4.19.5-gentoo [4] linux-4.19.6-gentoo [5] linux-stable * # ls -l total 20 lrwxrwxrwx 1 root root 12 dic 2 21:27 linux -> linux-stable drwxr-xr-x 27 root root 4096 nov 24 14:44 linux-4.18.20-gentoo drwxr-xr-x 27 root root 4096 dic 2 20:28 linux-4.19.0-gentoo drwxr-xr-x 27 root root 4096 dic 2 03:47 linux-4.19.5-gentoo drwxr-xr-x 27 root root 4096 dic 2 14:50 linux-4.19.6-gentoo drwxr-xr-x 26 root root 4096 dic 2 19:18 linux-stable # cd linux # git bisect start v4.19 v4.18 -- fs/ext4 Bisectando: faltan 16 revisiones por probar después de esto (aproximadamente 4 pasos) [863c37fcb14f8b66ea831b45fb35a53ac4a8d69e] ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa() # git bisect log # bad: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19 # good: [94710cac0ef4ee177a63b5227664b38c95bbf703] Linux 4.18 git bisect start 'v4.19' 'v4.18' '--' 'fs/ext4' Just beginning, today was Sunday and ... 
besides little experience with git :) I was also looking at the ebuilds of the gentoo-sources trees to know what patches I applied to emerge when installing the sources. $ cat /usr/portage/sys-kernel/gentoo-sources/gentoo-sources-4.18.20.ebuild |grep K_GENPATCHES_VER= K_GENPATCHES_VER="24" $ ls -lh /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz -rw-rw-r-- 1 portage portage 661K nov 21 10:13 /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz $ tar -tf /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz ./0000_README ./1000_linux-4.18.1.patch ./1001_linux-4.18.2.patch ./1002_linux-4.18.3.patch ./1003_linux-4.18.4.patch ./1004_linux-4.18.5.patch ./1005_linux-4.18.6.patch ./1006_linux-4.18.7.patch ./1007_linux-4.18.8.patch ./1008_linux-4.18.9.patch ./1009_linux-4.18.10.patch ./1010_linux-4.18.11.patch ./1011_linux-4.18.12.patch ./1012_linux-4.18.13.patch ./1013_linux-4.18.14.patch ./1014_linux-4.18.15.patch ./1015_linux-4.18.16.patch ./1016_linux-4.18.17.patch ./1017_linux-4.18.18.patch ./1018_linux-4.18.19.patch ./1019_linux-4.18.20.patch ./1500_XATTR_USER_PREFIX.patch ./1510_fs-enable-link-security-restrictions-by-default.patch ./2500_usb-storage-Disable-UAS-on-JMicron-SATA-enclosure.patch ./2600_enable-key-swapping-for-apple-mac.patch $ tar -xf /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz ./1019_linux-4.18.20.patch $ ls -lh 1019_linux-4.18.20.patch -rw-r--r-- 1 nestor nestor 164K nov 21 10:01 1019_linux-4.18.20.patch $ cat /usr/portage/sys-kernel/gentoo-sources/gentoo-sources-4.19.0.ebuild |grep K_GENPATCHES_VER= K_GENPATCHES_VER="1" $ ls -lh /usr/portage/distfiles/genpatches-4.19-1.base.tar.xz -rw-rw-r-- 1 portage portage 4,0K oct 22 08:47 /usr/portage/distfiles/genpatches-4.19-1.base.tar.xz $ tar -tf /usr/portage/distfiles/genpatches-4.19-1.base.tar.xz ./0000_README ./1500_XATTR_USER_PREFIX.patch ./1510_fs-enable-link-security-restrictions-by-default.patch ./2500_usb-storage-Disable-UAS-on-JMicron-SATA-enclosure.patch ./2600_enable-key-swapping-for-apple-mac.patch As you can see in the 4.19.0 tree do not apply patches 1000_linux-4.19.x.patch My gcc version for quite some time: $ gcc -v gcc versión 8.2.0 (Gentoo 8.2.0-r5 p1.6 Obviously something happens with the inodes, but apparently only I'm doing it now with the tree 4.19.0. If I find something I will be reporting it. Regards Hi Theodore, I am not much of a kernel developer, let alone and FS one, so your guesses would be vastly better founded than mine. I could imagine, though, that a combination of GCC version, .config and, possibly, the creation time (kernel version-wise) of the FSs in question, could create a sort of "cocktail effect". For my part, none of my FSs are < at least a year old. FWIW, I started seeing the problem specifically with 4.19.3 (4.19.0 being good, and built with 8.2.0-7), but that was after skipping 4.19.[12]. I note that the first Ubuntu kernel-ppa kernel the be built with 8.2.0-9 was 4.19.1, so if my ongoing bisect ends without any triggering of the bug I see, I shall try kernel-ppa 4.19.1 - if that exhibits the bug, then that further points to GCC, but as you say, perhaps specifically for the Ubuntu kernels. Now, as someone else stated somewhere, the only things that kernel-ppa patches, are some Ubuntu-specific build and package structure, as well as .config hte lst part being available at http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.1/0006-configs-based-on-Ubuntu-4.19.0-4.5.patch . 
As promised above, I have written the kernel-ppa team lead, Bradd Figg, and I would expect him and his team to be better at pinpointing which combination of GCC 8.2.0-9 and .config options might be problematic, but if they find that the problem goes away with a GCC upgrade, they might opt for letting that be it. Michael Orlitzky: In your report, you've indicated that you've only been seeing bugs in files that are being *read* and that these were files that were written long ago. If you reboot, or drop caches using "echo 3 > /proc/sys/vm/drop_caches" do the files stay corrupted? Some of the reports (but not others) seem to indicate the problem is happening on read, not on write. Of course, some of the reports are relating to metadata blocks getting corrupted on read, while your report is about data blocks getting reported on read. Thanks! I've been able to reproduce the issue on the other workstation I'd mentioned earlier with ZFS: NAME STATE READ WRITE CKSUM nipigon ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 wwn-0x5000c5006407e87e ONLINE 0 0 17 wwn-0x5000c5004e4e92a9 ONLINE 0 0 6 This isn't particularly hard to trigger; just a bunch of files filled with /dev/urandom (64M*14000) being read back concurrently (eight processes) over about 90 minutes. Kernel version is 4.19.6 compiled with gcc 8.2.0 (Gentoo 8.2.0-r5 p1.6). Have we any other ZFS users experiencing this? #180: Eric, would you mind sharing the script used to create the files and to read them back ? As for where problems are seen, for my part the problem is seen mostly when trying to read files created recently as part of kernel builds. The problems are reported with reads, but writes are definitely involved, at least for me. As for blaming gcc, or Ubuntu, or both, I would kindly like to remind people that I see the problem on two systems out of four running v4.19.x kernels, all with the same kernel build and configuration. Eric, re #180, could you upload your .config file for your kernel and the boot-command line? I'm interested in particular what I/O scheduler you are using. And are you using the same .config and boot command line (and other system configurations) on your other system where you were seeing the problem? Many thanks!! The commit 2a5cf35cd6c56b2924("block: fix single range discard merge") in linus tree may address one possible data loss, anyone who saw corruption in scsi may try this fix and see if it makes a difference. Given the merged discard request isn't removed from elevator queue, it might be possible to be submitted to hardware again.(In reply to Theodore Tso from comment #174) > One of the reasons why this is bug hunt is so confounding. While I was > looking at older reports to try to see if I could find common factors, I > found Jimmy's dmesg report in #50, and this one looks different from many of > the others that people have reported. In this one, the EXT4 errors are > preceeded by a USB disconnect followed by disk-level errors. > > This is why it's important that we try very hard to filter out false > positives and false negative reports. We have multiple reports which both > strongly indicate that it's an ext4 bug, and others which strongly indicate > it is a bug below the file system layer. And then we have ones like this > which look like a USB disconnect.... > > [52967.931390] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using > xhci_hcd IMO it should be a usb device reset instead of disconnect, and reset is often triggered in SCSI EH. 
Thanks, (In reply to Guenter Roeck from comment #181) > #180: Eric, would you mind sharing the script used to create the files and > to read them back ? Just a pair of trivial one-liners: for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom of=urand.$i; done for i in urand.*; do echo dd bs=1M if=$i of=/dev/null; done | parallel -j8 I'm using /dev/urandom since I have lz4 compression enabled. I imagine /dev/zero would be just as effective if you don't. Created attachment 279807 [details]
tecciztecatl linux kernel 4.19.6 .config
(In reply to Theodore Tso from comment #183) > Eric, re #180, could you upload your .config file for your kernel and the > boot-command line? I'm interested in particular what I/O scheduler you are > using. And are you using the same .config and boot command line (and other > system configurations) on your other system where you were seeing the > problem? Many thanks!! [ 0.000000] Command line: BOOT_IMAGE=/root@/boot/vmlinuz root=simcoe/root triggers=zfs radeon.dpm=1 # cat /sys/block/sd[a-d]/queue/scheduler [none] [none] [none] [none] The config between 4.18.20 and 4.19.6 are about as identical as possible, the only differences being whatever was added in 4.19 and prompted by make oldconfig. Between this machine and the other (a server) the only differences would be in specific hardware support and options suitable for that application. In terms of schedulers, block devices, and filesystem support, they're the same. (In reply to Eric Benoit from comment #185) > for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom > of=urand.$i; done Er whoops, might want to remove the echo or pipe it to parallel. I'm repeating things under 4.18.20 just for comparison. It's been about an hour now without a single checksum error reported. (In reply to Theodore Tso from comment #179) > Michael Orlitzky: In your report, you've indicated that you've only been > seeing bugs in files that are being *read* and that these were files that > were written long ago. If you reboot, or drop caches using "echo 3 > > /proc/sys/vm/drop_caches" do the files stay corrupted? Each time the corruption has been reported by the backup job that I run overnight. When I see the failed report in the morning, I reboot into SystemRescueCD (which is running 4.14.x) and then run fsck to fix things. The fsck does indeed find a bunch of corruption, and appears to fix it. The first couple of times I verified the corruption by running something like "git gc" in the affected directory, and IIRC I got the same "structure needs cleaning" error back. Before that, I hadn't touched that repo in a while. But since then, I've just been rebooting immediately and running fsck -- each time finding something wrong and (I hope) correcting it. It takes about a week for the corruption to show up, but if there's some test you need me to run I can boot back into 4.19.6 and roll the dice. (In reply to Michael Orlitzky from comment #189) > It takes about a week for the corruption to show up, but if there's some > test you need me to run I can boot back into 4.19.6 and roll the dice. Hm, interesting that it happens to my Ubuntu Bionic on bare metal Macbook Pro (SSD) within minutes - even if I don't habe much I/O load. I am doing the fsck mit 4.18.20. What could be the difference? (In reply to Eric Benoit from comment #185) > (In reply to Guenter Roeck from comment #181) > > #180: Eric, would you mind sharing the script used to create the files and > > to read them back ? > > Just a pair of trivial one-liners: > > for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom > of=urand.$i; done > > for i in urand.*; do echo dd bs=1M if=$i of=/dev/null; done | parallel -j8 > > I'm using /dev/urandom since I have lz4 compression enabled. I imagine > /dev/zero would be just as effective if you don't. Don't know if my comments are relevant as I got no reply as of now but here some info regarding this test: I ran it without errors on my /home partition wich is a dm-crypt ext4 setup using the deadline mq-scheduler and the gcc 8 compiler branch. 
The partition is mounted /dev/mapper/crypt-home on /home type ext4 (rw,nosuid,noatime,nodiratime,quota,usrquota,grpquota,errors=remount-ro) [ 0.000000] Linux version 4.19.6loc64 (marc@marc) (gcc version 8.2.0 (Gentoo Hardened 8.2.0-r4 p1.5)) #1 SMP PREEMPT Sat Dec 1 16:00:21 CET 2018 [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.6loc64 root=/dev/sda7 ro init=/sbin/openrc-init root=PARTUUID=6d19e60a-72a8-ee44-89f4-cc6f85a9436c real_root=/dev/sda7 ro resume=PARTUUID=fbc25a25-2d09-634d-9e8b-67308f2feddf real_resume=/dev/sda8 acpi_osi=Linux libata.dma=3 libata.noacpi=0 threadirqs rootfstype=ext4 acpi_sleep=s3_bios,s3_beep devtmpfs.mount=0 net.ifnames=0 vmalloc=512M noautogroup elevator=deadline libata.force=noncq nmi_watchdog=0 i915.modeset=0 cgroup_disable=memory scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y vgacon.scrollback_persistent=1 processor.ignore_ppc=1 intel_iommu=igfx_off crashkernel=128M apparmor=1 security=apparmor nouveau.noaccel=0 nouveau.nofbaccel=1 nouveau.modeset=1 nouveau.runpm=0 nouveau.debug=disp=trace,i2c=trace,bios=trace nouveau.config=NvPmShowAll=true [ 0.000000] KERNEL supported cpus: [ 0.000000] Intel GenuineIntel Might be worth getting this guy aboard - got now reply though. https://www.phoronix.com/forums/forum/software/general-linux-open-source/1063976-some-users-have-been-hitting-ext4-file-system-corruption-on-linux-4-19?p=1064826#post1064826 #136
It has been suggested that I/O-schedulers may play a role in this. So here are my settings for 4.19.x for comparison. They deviate from yours in some points but I really don't know whether this has any relevance. You may want to give it a try anyway. As I've said, 4.19.x is a nice kernel here.
> grep -i sched .config_4.19-rc5
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_SCHED_HRTICK=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
# IO Schedulers
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
CONFIG_NET_SCHED=y
# Queueing/Scheduling
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_SCHED_INFO=y
CONFIG_SCHED_TRACER=y
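Since the compiled-in options above only show what the kernel is capable of, it is also worth recording what each disk is actually using at runtime. A quick check, assuming SCSI/SATA disks named sd*:

$ for q in /sys/block/sd*/queue/scheduler; do echo "$q: $(cat $q)"; done
$ cat /sys/module/scsi_mod/parameters/use_blk_mq    # Y means the SCSI stack is using blk-mq

The scheduler shown in brackets is the one in effect for that disk.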
I am desperately trying to reproduce this in a qemu/KVM virtual machine with the configs given by those users. But until now to no avail. If anybody has seen this issue in a VM please share your .config, mount options of all filesystems, kernel command line, and possibly workload that you are running. (In reply to Rainer Fiebig from comment #192) > #136 > > It has been suggested that I/O-schedulers may play a role in this. So here's > are my settings for 4.19.x for comparison. They deviate from yours in some > points but I really don't know whether this has any relevance. You may want > to give it a try anyway. As I've said, 4.19.x is a nice kernel here. > > > grep -i sched .config_4.19-rc5 > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y > CONFIG_CGROUP_SCHED=y > CONFIG_FAIR_GROUP_SCHED=y > CONFIG_RT_GROUP_SCHED=y > CONFIG_SCHED_AUTOGROUP=y > CONFIG_SCHED_OMIT_FRAME_POINTER=y > CONFIG_SCHED_SMT=y > CONFIG_SCHED_MC=y > CONFIG_SCHED_MC_PRIO=y > CONFIG_SCHED_HRTICK=y > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set > # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set > # IO Schedulers > CONFIG_IOSCHED_NOOP=y > CONFIG_IOSCHED_DEADLINE=y > CONFIG_IOSCHED_CFQ=y > CONFIG_CFQ_GROUP_IOSCHED=y > CONFIG_DEFAULT_IOSCHED="deadline" > CONFIG_MQ_IOSCHED_DEADLINE=y > CONFIG_MQ_IOSCHED_KYBER=y > # CONFIG_IOSCHED_BFQ is not set > CONFIG_NET_SCHED=y > # Queueing/Scheduling > CONFIG_USB_EHCI_TT_NEWSCHED=y > CONFIG_SCHED_INFO=y > CONFIG_SCHED_TRACER=y Really, how come you say "these are your settings"? The settings are, what is actually being used not what has ben compiled-in or I miss anything? What's the coincidence between CONFIG_DEFAULT_IOSCHED="deadline" + CONFIG_IOSCHED_DEADLINE=y and cat /sys/block/sda/queue/scheduler mq-deadline [kyber] bfq none Please see #139 - wee need a list of what is effectively used and not what is actually possible. Bare metal or not. Intel? AMD? hugepages or nor? #193
>I am desperately trying to reproduce this in a qemu/KVM virtual machine with
>the configs given by those users. But until now to no avail.
Good luck, Mr. Glück! ;)
My VirtualBox-VM seems immune to this issue. Perhaps VMs have just the right "hardware".
#194 Hair-splitting won't help in this matter. And btw: if you're so smart - how come you haven't solved this already? (In reply to Marc Burkhardt from comment #194) > (In reply to Rainer Fiebig from comment #192) > > #136 > > > > It has been suggested that I/O-schedulers may play a role in this. So > here's > > are my settings for 4.19.x for comparison. They deviate from yours in some > > points but I really don't know whether this has any relevance. You may want > > to give it a try anyway. As I've said, 4.19.x is a nice kernel here. > > > > > grep -i sched .config_4.19-rc5 > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y > > CONFIG_CGROUP_SCHED=y > > CONFIG_FAIR_GROUP_SCHED=y > > CONFIG_RT_GROUP_SCHED=y > > CONFIG_SCHED_AUTOGROUP=y > > CONFIG_SCHED_OMIT_FRAME_POINTER=y > > CONFIG_SCHED_SMT=y > > CONFIG_SCHED_MC=y > > CONFIG_SCHED_MC_PRIO=y > > CONFIG_SCHED_HRTICK=y > > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set > > # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set > > # IO Schedulers > > CONFIG_IOSCHED_NOOP=y > > CONFIG_IOSCHED_DEADLINE=y > > CONFIG_IOSCHED_CFQ=y > > CONFIG_CFQ_GROUP_IOSCHED=y > > CONFIG_DEFAULT_IOSCHED="deadline" > > CONFIG_MQ_IOSCHED_DEADLINE=y > > CONFIG_MQ_IOSCHED_KYBER=y > > # CONFIG_IOSCHED_BFQ is not set > > CONFIG_NET_SCHED=y > > # Queueing/Scheduling > > CONFIG_USB_EHCI_TT_NEWSCHED=y > > CONFIG_SCHED_INFO=y > > CONFIG_SCHED_TRACER=y > > Really, how come you say "these are your settings"? > > The settings are, what is actually being used not what has ben compiled-in > or I miss anything? > > What's the coincidence between > > CONFIG_DEFAULT_IOSCHED="deadline" + CONFIG_IOSCHED_DEADLINE=y > > and > > cat /sys/block/sda/queue/scheduler > mq-deadline [kyber] bfq none > > Please see #139 - wee need a list of what is effectively used and not what > is actually possible. Bare metal or not. Intel? AMD? hugepages or nor? I use an allegedly wrong compiler, I use 4.19.y, I use ext4 with and without dm-crypt, I use a scheduler, .... and I am currently NOT affected by that bug even running the tests that people say triggers the bug. Just to make it clear again: I'm not a kernel dev, ok, but I use Linux for along time and I'm willing to help out what setup is *not* affected. Maybe I do something totally wrong here but I'm willing to help out as a gibe-back to the community providing me the OS I use solely for 20+ years. I think the discussion should (at this point) not gather around what your kernel is *capable* of, but just what actually is set-up to trigger the bug. I can also confirm the fs corruption issue on Fedora 29 with 4.19.5 kernel. I run it on ThinkPad T480 with NVME Samsung drive. * Workload The workload involves doing a bunch of compile sessions and/or running a VM (under KVM hypervisor) with NFS server. It usually takes anywhere from few hours to a day for the corruption to occur. 
* Symptoms - /dev/nvm0n1* entries disappear from /dev/ - unable to start any program as i get I/O errors * System Info > uname -a Linux skyline.origin 4.19.5-300.fc29.x86_64 #1 SMP Tue Nov 27 19:29:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > cat /proc/cmdline BOOT_IMAGE=/vmlinuz-4.19.5-300.fc29.x86_64 root=/dev/mapper/fedora_skyline-root ro rd.lvm.lv=fedora_skyline/root rd.luks.uuid=luks-b66e85a5-f7b1-4d87-8fab-a01687e35056 rd.lvm.lv=fedora_skyline/swap rhgb quiet LANG=en_US.UTF-8 > cat /sys/block/nvme0n1/queue/scheduler [none] mq-deadline > lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT nvme0n1 259:0 0 238.5G 0 disk ├─nvme0n1p1 259:1 0 200M 0 part /boot/efi ├─nvme0n1p2 259:2 0 1G 0 part /boot ├─nvme0n1p3 259:3 0 160G 0 part │ └─luks-b66e85a5-f7b1-4d87-8fab-a01687e35056 253:0 0 160G 0 crypt │ ├─fedora_skyline-root 253:1 0 156G 0 lvm / │ └─fedora_skyline-swap 253:2 0 4G 0 lvm [SWAP] └─nvme0n1p4 259:4 0 77.3G 0 part ├─skyline_vms-atomic_00 253:3 0 20G 0 lvm └─skyline_vms-win10_00 253:4 0 40G 0 lvm This is dumpe2fs output on the currently booted system. > dumpe2fs /dev/mapper/fedora_skyline-root dumpe2fs 1.44.3 (10-July-2018) Filesystem volume name: <none> Last mounted on: / Filesystem UUID: 410261f3-0779-455b-9642-d52800292fd7 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file h uge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 10223616 Block count: 40894464 Reserved block count: 2044723 Free blocks: 26175785 Free inodes: 9255977 First block: 0 Block size: 4096 Fragment size: 4096 Group descriptor size: 64 Reserved GDT blocks: 1024 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 Flex block group size: 16 Filesystem created: Mon Feb 19 18:48:05 2018 Last mount time: Mon Dec 3 08:07:30 2018 Last write time: Mon Dec 3 03:07:29 2018 Mount count: 137 Maximum mount count: -1 Last checked: Sat Jul 14 07:11:08 2018 Check interval: 0 (<none>) Lifetime writes: 1889 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 32 Desired extra isize: 32 Journal inode: 8 First orphan inode: 9318809 Default directory hash: half_md4 Directory Hash Seed: ad5a6f9c-6250-4dc5-84d9-4a3b14edc7b7 Journal backup: inode blocks Journal features: journal_incompat_revoke journal_64bit Journal size: 1024M Journal length: 262144 Journal sequence: 0x00508e50 Journal start: 1 (In reply to Rainer Fiebig from comment #195) > My VirtualBox-VM seems immune to this issue. Perhaps VMs have just the right > "hardware". Yeah, I've been wondering. Qemu only exposes single-queue devices (virtio_blk) so bugs in MQ can not trigger here I guess. Also "hardware" timing is much different in VMs so race conditions may not trigger with the same frequency. Also CPU assignment/scheduling may be different with respect to barriers, so memory safety problems (RCU bugs, missing barriers) may behave differently. I had no luck with vastly overcommitting vCPUs either. Oh wow, qemu/KVM does support multi-queue disks! -drive file=${DISK},cache=writeback,id=d0,if=none -device virtio-blk-pci,drive=d0,num-queues=4 Is it remotely possible this has to do with SPECTRE mitigation updates? I'm assuming everyone has these enabled. 
Anyone with this issue that doesn't? Have we seen it with AMD processors, or non-x86 even? I should note the affected systems I've mentioned are Intel Core2 era. I haven't been able to trigger it on an older AMD system, but that's using ext4. I'll pull up my sleeves and put in some effort to sort this out later today. #201: Various AMD and Intel CPUs are affected. Search for "AMD" and "Intel" in this bug. #199 That it cannot be reproduced in VMs may still offer a clue, namely that it may indeed have something to do with hardware and/or configuration. In the end there has to be a discriminating factor between those systems that have the problem and those that don't. Another update: Still working on a reliable reproducer. I have been able to reproduce the problem again with v4.19.6 and the following patches reverted. Revert "ext4: handle layout changes to pinned DAX mappings" Revert "dax: remove VM_MIXEDMAP for fsdax and device dax" Revert "ext4: close race between direct IO and ext4_break_layouts()" I used a modified version of #185, running on each drive on the affected system, plus a kernel build running in parallel. #204 So, only 2 commits left in your creative drill-down-effort, right? I hope the huge amount of time you have invested pays off and puts an end to this crisis. If so, it will still be interesting to know: Why only some and not all? fsck does i/o itself. It doesn't aggravate or trigger the issue. Also, I can't believe it is just an hdd, sata or usb hardware problem. Moreover, it affects other type of file systems zfs or nfs for instance (what about ext3 and ext2 ?). So it should imply the code they share. Three ideas I hope not so stupid. + log journal On my computers, vmlinuz is written on an ext2 /boot file system every kernel upgrade. If I remember well ext2 never failed and the file system stayed clean during the tests. If ext2 is not affected, it could involve the journal code instead. + if that's not the log, than the cache. In my case, rsync and rm are involved in the file system corruption. It could be explained like that. rsync reads inodes and blocs to compare before any write. The kernel reports an inconsistency independently if the inode/bloc is read or written from/to the cache. As expected only the changes are sent to the media. It explains that some of the corruptions never reached the media and the next reboot fsck declares a disk clean because only read i/o has been done before the reboot. + what about synchronisation As I mention in other posts, even if the issue still lurks, the patch proposed here makes the issue less intrusive. My tests were made with a vanilla kernel source from gentoo portage sys-kernel/vanilla-sources #205: Not really. Still working on the script - I'll publish it once it is even more brutal - but I have now been able to reproduce the problem even with all patches from #99 reverted. Created attachment 279827 [details]
Reproducer
To reproduce, run the attached script on each mounted file system. Also, run a linux kernel build with as much parallelism as you dare. On top of that, run a backup program such as borg. I don't know if this is all needed, but with all that I am able to reproduce the problem quite reliably, for the most part within a few minutes.
Typical log:
[ 357.330900] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519385: comm borg: bad extra_isize 4752 (inode size 256)
[ 357.351658] Aborting journal on device sda1-8.
[ 357.355728] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5805: Journal has aborted
[ 357.355745] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5805: Journal has aborted
[ 357.365397] EXT4-fs (sda1): Remounting filesystem read-only
[ 357.365942] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519388: comm borg: bad extra_isize 2128 (inode size 256)
[ 357.366167] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.371296] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.375832] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.382480] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.382486] EXT4-fs (sda1): ext4_writepages: jbd2_start: 5114 pages, ino 5273647; err -30
[ 357.384839] EXT4-fs error (device sda1): ext4_lookup:1578: inode #5513104: comm borg: deleted inode referenced: 5519390
[ 357.387331] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519392: comm borg: bad extra_isize 3 (inode size 256)
[ 357.396557] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.428824] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
[ 357.437008] EXT4-fs error (device sda1) in ext4_dirty_inode:5989: Journal has aborted
[ 357.441953] EXT4-fs error (device sda1) in ext4_dirty_inode:5989: Journal has aborted
As you can see, it took just about six minutes after boot to see the problem. Kernel version in this case is v4.19.6 with the five patches per #99 reverted.
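The attached script is not reproduced here, but as a rough, hypothetical sketch of the kind of combined load described (per-filesystem I/O, a parallel kernel build, and a backup run), something along these lines can be adapted; the paths and file counts are placeholders:

#!/bin/bash
# Hypothetical stress sketch - not the attached reproducer.
FS=/mnt/test                                    # run one instance per mounted file system
mkdir -p "$FS/stress"
(
  for i in $(seq -w 0 2000); do
    dd bs=1M count=64 if=/dev/urandom of="$FS/stress/f.$i" 2>/dev/null
  done
  for f in "$FS"/stress/f.*; do
    dd bs=1M if="$f" of=/dev/null 2>/dev/null
  done
) &
( cd ~/src/linux && make -j"$(nproc)" ) &        # kernel build with plenty of parallelism
borg create /path/to/repo::stress-{now} "$FS" &  # backup run, e.g. with borg
wait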
I am investigating the dates of the files and folders found by fsck.ext4 when repairing the partition and I find something surprising. # ls -l /lost+foud -rw-r--r-- 1 portage portage 5051 dic 10 2013 '#1057825' drwxr-xr-x 3 portage portage 4096 dic 10 2013 '#1057827' -rw-r--r-- 1 root root 2022 oct 22 03:37 '#3184673' -rw-r--r-- 1 root root 634 oct 22 03:37 '#3184674' -rw-r--r-- 1 root root 1625 oct 22 03:37 '#3184675' Many lost files appear on October 22 at 3:37hs, all with the same time and the same day belonging to the root user, then a folder of December 10, 2013 belonging to the user portage group portage, for those who do not use gentoo, only say that that user and that group is only in the system in /usr/portage the contents of this folder: # ls -l /lost+found/#1057827 drwxr-xr-x 11 portage portage 4096 dic 10 2013 vba and inside that folder vba many more folders and files, all from the same ebuid of libreoffice at the end of 2013, probably from this version, when it was updated. $ genlop -t libreoffice Fri Nov 1 00:34:35 2013 >>> app-office/libreoffice-4.1.3.2 merge time: 1 hour, 9 minutes and 14 seconds. This package for years that are no longer on my pc, when upgrade libreoffice they were deleted and now fsck finds them when scanning as if they were installed and they were corrupted, but it turns out that they were erased there for a long time and now they are found as broken and put in lost+found. So pay attention to the lost+found content of your partitions, to see if they are current files or something they had long ago and had already deleted. What I do not relate is because e2fsk.ext4 starts to detect these deleted fragments. It may be the journal of ext4 or one of its unsynchronized copies that remembers things that are no longer there and retrieves them from the liberated space? My system and partitions were created on April 10, 2012 and I never had corruption problems of this type. $ genlop -t gentoo-sources |head -n3 Wed Apr 11 23:39:02 2012 >>> sys-kernel/gentoo-sources-3.3.1 # tune2fs -l /dev/md2 |grep "Filesystem created:" Filesystem created: Tue Apr 10 16:18:28 2012 Regards #207 So the problem seems more generic. Can you reproduce it now also on those systems where you have *not* seen it yet? When I ran updatedb on 4.19.6, RETPOLINE disabled, I triggered within 2 minutes the following errors (which I never saw with 4.14.x and older): [ 117.180537] BTRFS error (device dm-8): bad tree block start, want 614367232 have 23591879 [ 117.222142] BTRFS info (device dm-8): read error corrected: ino 0 off 614367232 (dev /dev/mapper/linux-lxc sector 1216320) And ~20 minutes later (while again running updatedb and compiling the kernel): [ 1328.804705] EXT4-fs error (device dm-1): ext4_iget:4851: inode #7606807: comm updatedb: checksum invalid With debugfs I located the file of that inode, then I did an ls on it: root@ster:# ls -l /home//michel/src/linux/linux/drivers/firmware/efi/test/ ls: cannot access '/home//michel/src/linux/linux/drivers/firmware/efi/test/': Bad message (reproduces) Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches) did. Both a simple 'ls' and also 'debugfs -R 'ncheck <inode>' did show errors, which were resolved by the 'echo 3 > /proc/sys/vm/drop_caches'. See my comment #168 for 4.19.5. My next step is to try without SMP. Does anybody have suggestions what else I can try, or where I should look? What information to share? 
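When one of these checksum or extra_isize errors names an inode number, a quick way to map it to a path and to check whether the bad copy only lives in the page cache is the following; the device and inode number are taken from the report above, substitute your own:

# debugfs -R 'ncheck 7606807' /dev/dm-1    # map the inode number to a path
# ls -l <reported path>                    # reproduces "Bad message" while the cached copy is bad
# echo 3 > /proc/sys/vm/drop_caches        # drop page cache plus dentries and inodes
# ls -l <reported path>                    # if this now succeeds, the on-disk copy was fine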
I ran into this bug when using 4.19-4.19.5 and compiling Overwatch shaders. Am no longer running into it with 4.19.6 after mounting my ext4 partition as ext2. #210: I rather refrain from it. Messing up one of my systems is bad enough. Note that the problem is still spurious; it may happen a mionute into the test (or even during boot), or after an hour. I am now at a point where I still see the problem with almost all patches since 4.18.20 reverted; the only patch not reverted is the STAT_WRITE patch, because it is difficult to revert due to context changes. I'll revert that manually for the next round of tests. Here is the most recent log: [ 2228.782567] EXT4-fs error (device sda1): ext4_iget:4795: inode #6317073: comm borg: bad extra_isize 30840 (inode size 256) [ 2228.805645] Aborting journal on device sda1-8. [ 2228.814576] EXT4-fs (sda1): Remounting filesystem read-only [ 2228.815816] EXT4-fs error (device sda1): ext4_iget:4795: inode #6317074: comm borg: bad extra_isize 30840 (inode size 256) [ 2228.817360] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal [ 2228.817367] EXT4-fs (sda1): ext4_writepages: jbd2_start: 4086 pages, ino 5310565; err -30 [ 2228.819221] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal [ 2228.819227] EXT4-fs (sda1): ext4_writepages: jbd2_start: 9223372036854775745 pages, ino 5328193; err -30 ... and so on. (In reply to Guenter Roeck from comment #213) > I am now at a point where I still see the problem with almost all patches > since 4.18.20 reverted; the only patch not reverted is the STAT_WRITE patch, > because it is difficult to revert due to context changes. I'll revert that > manually for the next round of tests. Seems more and more likely that it's not a bug in ext4.. except perhaps some changes in ext4 make it easier to run into the bug. Do you think you'll be able to reproduce it with the 4.18 ext4? #214: Not yet. Still trying. This time the corrupt inode block came clearly from one of the JPG files I was checksumming (without writing) with rsync at the same time. Poor test because I was checking against BTRFS filesystem, so I don't know which fs the corrupt block came from. Also first time I hit actual corruption with filesystem mounted errors=remount-ro, somehow two blocks of inodes had multiply claimed inodes. To me this suggests that the corrupting block came from another reservation block, the kernel didn't notice that because the data structure was valid and wrote it back. If so, this would indicate it happens inside single filesystem and with metadata blocks as source as well. It seems to me like metadata blocks are remaining linked when evicted due to memory pressure. BTRFS csum errors probably from same source. Steps for reproducing would be causing evictions in large pagecache while re-accessing same inode blocks. Backup scripts do this when same block contains inodes created at different times, ie. for me it happens constantly when reading files in date-specific directories where files from different days are in same inode block so the copy command re-reads the same block after some evictions. Likely some race-condition in block reservation or the like, because otherwise it'd be crashing all the time, but the corrupt block stays in the cache. #213
>#210: I rather refrain from it. Messing up one of my systems is bad enough.
Absolutely.
You're doing a great job here!
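For anyone who wants to set up the same kind of cross-version test - a 4.19.x kernel carrying the 4.18 ext4 code - checking out fs/ext4 from the older tag on top of the newer tree is one way. A sketch; it may need manual fixups if interfaces outside fs/ext4 changed between the releases:

$ git checkout -b ext4-from-4.18 v4.19.6
$ git checkout v4.18.20 -- fs/ext4
$ git commit -am "test: fs/ext4 from v4.18.20 on top of v4.19.6"
$ make olddefconfig && make -j$(nproc)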
Oh well. It took a long time, but: v4.19.6, with fs/ext4 from v4.18.20: [15903.283340] EXT4-fs error (device sdb1): ext4_lookup:1578: inode #5137538: comm updatedb.mlocat: deleted inode referenced: 5273882 [15903.284896] Aborting journal on device sdb1-8. [15903.286404] EXT4-fs (sdb1): Remounting filesystem read-only I guess the next step will be to test v4.18.20 with my test script, to make sure that this is not a long-time lingering problem. Other than that, I am open to ideas. #211
> Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't
> resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches)
> did.
So we have pagecache corruption. Sounds like a problem in vm code then. Is anybody seeing the problem when there is no swap?
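To answer that, it would help if reporters noted whether swap was active when the corruption hit, and possibly retested with swap disabled, for example:

$ swapon --show     # or: cat /proc/swaps
# swapoff -a        # temporarily disable all swap for a test run
# swapon -a         # re-enable it afterwards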
(In reply to Ortwin Glück from comment #219) > #211 > > Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't > > resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches) > > did. > > So we have pagecache corruption. Sounds like a problem in vm code then. Is > anybody seeing the problem when there is no swap? I'm seeing this problem with and without swap. Two of the affected computers even have CONFIG_SWAP=n. #218
If 4.18.20 turns out to be OK, my idea would be to bisect between 4.18 and 4.19.
Jimmy.Jazz has already done that and the result pointed to RCU. But IIRC it did not end with a clear-cut
> git bisect bad
xyz123 is the first bad commit
With your script we now have a tool to reproduce the problem which makes the distinction between "good" and "bad" more reliable. And everybody is now also aware how important it is to ensure that the fs is OK after a bad kernel has run and that the next step should be done with a known-good kernel. So it should be possible to identify a bad commit.
Perhaps one could limit the bisect to kernel/rcu or block in a first step. And if that's inconclusive, extend the search.
But if 4.18.20 is bad, I have no clue at all - at least at the moment.
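A path-limited bisect along those lines could look like this; the path list is only a guess at the most suspicious areas and can be widened if the result is inconclusive:

$ git bisect start v4.19 v4.18 -- block kernel/rcu drivers/scsi
$ make olddefconfig && make -j$(nproc)    # build, install, boot, run the reproducer
$ git bisect good                         # or: git bisect bad, depending on the result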
Hello, Thank you all for your great work in this investigation. Just my 2 cents: as mentionned earlier by others, I think it is closely related to rsync. Or at least it is a good way to reproduce. On my machine I had the issue very often when my rsync script was activated at login. Since I deactivated this task it looks fine so far. I could easily see some issues in my rsync log file, even a text editor was reporting issues on this file. Hope this helps to find a fix quicker! Let me know if you need more information from my side. 4.18.20 seems to be ok, except that my script overburdens it a bit. [ 1088.450369] INFO: task systemd-tmpfile:31954 blocked for more than 120 seconds. [ 1088.450374] Not tainted 4.18.20+ #1 [ 1088.450375] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1088.450377] systemd-tmpfile D 0 31954 1 0x00000000 [ 1088.450380] Call Trace: [ 1088.450389] __schedule+0x3f1/0x8c0 [ 1088.450392] ? bit_wait+0x60/0x60 [ 1088.450394] schedule+0x36/0x80 [ 1088.450398] io_schedule+0x16/0x40 [ 1088.450400] bit_wait_io+0x11/0x60 [ 1088.450403] __wait_on_bit+0x63/0x90 [ 1088.450405] out_of_line_wait_on_bit+0x8e/0xb0 [ 1088.450408] ? init_wait_var_entry+0x50/0x50 [ 1088.450411] __wait_on_buffer+0x32/0x40 [ 1088.450414] __ext4_get_inode_loc+0x19f/0x3e0 [ 1088.450416] ext4_iget+0x8f/0xc30 There is a key difference, though: With 4.18.20, cfq is active. $ cat /sys/block/sd*/queue/scheduler noop deadline [cfq] noop deadline [cfq] In 4.19, cfq is not available due to commit d5038a13eca72 ("scsi: core: switch to scsi-mq by default"). I'll repeat my tests with SCSI_MQ_DEFAULT disabled on v4.19, and with it enabled on v4.18.20. We know that disabling SCSI_MQ_DEFAULT alone does not help, but maybe there is more than one problem. Any takers for a round of bisects as suggested in #221 ? (In reply to Guenter Roeck from comment #223) > There is a key difference, though: With 4.18.20, cfq is active. > > $ cat /sys/block/sd*/queue/scheduler > noop deadline [cfq] > noop deadline [cfq] > > In 4.19, cfq is not available <--- ? # grep -i cfq config-4.19.3-gentoo CONFIG_IOSCHED_CFQ=y < --- ! # CONFIG_CFQ_GROUP_IOSCHED is not set CONFIG_DEFAULT_CFQ=y CONFIG_DEFAULT_IOSCHED="cfq" did not run me into trouble. Being a production machine, reverted it to ### INFO: # uname -a Linux XXX 4.18.20-gentoo #1 SMP Wed Nov 28 12:30:28 CET 2018 x86_64 Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz GenuineIntel GNU/Linux Running Gentoo "stable" (with *very* few exceptions) # equery list gcc [IP-] [ ] sys-devel/gcc-7.3.0-r3:7.3.0 Exploiting disks directly attached @ ASUS Workstation MoBo P9D-WS : Samsung SSD, multiple S-ATA HDD fom 5000 GB up to 6 TB, ... as well as e.g. Adaptec SCSI Raid-1, 2 x WD 500 running stable till this evening. (In reply to Manfred from comment #224) > (In reply to Guenter Roeck from comment #223) > > > There is a key difference, though: With 4.18.20, cfq is active. > > > > $ cat /sys/block/sd*/queue/scheduler > > noop deadline [cfq] > > noop deadline [cfq] > > > > In 4.19, cfq is not available <--- ? > > # grep -i cfq config-4.19.3-gentoo > > CONFIG_IOSCHED_CFQ=y < --- ! > # CONFIG_CFQ_GROUP_IOSCHED is not set > CONFIG_DEFAULT_CFQ=y > CONFIG_DEFAULT_IOSCHED="cfq" > When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the I/O scheduler is just "none", and you cannot set a different scheduler. 
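Independent of the compiled-in default, the legacy and blk-mq paths can be selected per boot on the kernel command line, which makes A/B testing easier. For example, assuming SCSI/SATA disks and device-mapper in use:

scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0    # legacy request path (noop/deadline/cfq selectable)
scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1    # blk-mq path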
(In reply to Steven Noonan from comment #225) Thanks for pointing this out - forgot to mention: # grep CONFIG_SCSI_MQ_DEFAULT config-4.19.3-gentoo # CONFIG_SCSI_MQ_DEFAULT is not set < --- HTH Respectfully (In reply to Steven Noonan from comment #225) > When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the > I/O scheduler is just "none", and you cannot set a different scheduler. That's not true, you can set MQ capable schedulers. CFQ is from the legacy stack, it doesn't support MQ. But you can set none/bfq/mq-deadline/kyber, for instance. I guess I should have been more specific. With CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1), cfq is not available. That applies to any kernel version with CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1), not just to 4.19, and it doesn't apply to 4.19 if CONFIG_SCSI_MQ_DEFAULT=n (or scsi_mod.use_blk_mq=0). It is quite irrelevant if other schedulers are available if CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1). cfq is not available, and it doesn't matter if it is set as default or not. I hope this is specific enough this time. My apologies if I missed some other means to enable or disable blk_mq. (In reply to Guenter Roeck from comment #228) > I guess I should have been more specific. With CONFIG_SCSI_MQ_DEFAULT=y (or > scsi_mod.use_blk_mq=1), cfq is not available. That applies to any kernel > version with CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1), not just > to 4.19, and it doesn't apply to 4.19 if CONFIG_SCSI_MQ_DEFAULT=n (or > scsi_mod.use_blk_mq=0). > > It is quite irrelevant if other schedulers are available if > CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1). cfq is not available, > and it doesn't matter if it is set as default or not. > > I hope this is specific enough this time. My apologies if I missed some > other means to enable or disable blk_mq. My clarification was for Steven, not you. In terms of scheduler, CFQ will change the patterns a lot. For the non-mq case, I'd recommend using noop or deadline for testing, otherwise I fear we're testing a lot more than mq vs non-mq. Guenter, can you attach the .config you are running with? (In reply to Jens Axboe from comment #227) > (In reply to Steven Noonan from comment #225) > > When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the > > I/O scheduler is just "none", and you cannot set a different scheduler. > > That's not true, you can set MQ capable schedulers. CFQ is from the legacy > stack, it doesn't support MQ. But you can set none/bfq/mq-deadline/kyber, > for instance. My bad. I was basing my response on outdated information: https://mahmoudhatem.wordpress.com/2016/02/08/oracle-uek-4-where-is-my-io-scheduler-none-multi-queue-model-blk-mq/ (Also didn't want to risk turning on MQ on one of my machines just to word my response, especially if not having CFQ is somehow involved in this corruption bug!) Created attachment 279845 [details] git bisect between v4.18 and 4.19-rc1 Hello, I am able to reproduce the data corruption under Qemu, the issue usually shows itself fairly quickly (within a minute or two). Generally, the bug was very likely to appear when (un)installing packages with apt. 
I ran a bisect with the following result (full bisect log is attached): # first bad commit: [6ce3dd6eec114930cf2035a8bcb1e80477ed79a8] blk-mq: issue directly if hw queue isn't busy in case of 'none' You can revert the commit from linux v4.19 with: git revert --no-commit 8824f62246bef 6ce3dd6eec114 (did not try compiling and running the kernel myself yet) Obviously, this commit could just make the issue more prominent than it already is, especially since some are saying that CONFIG_SCSI_MQ_DEFAULT=n does not make the problem go away. The commit was added fairly early in the 4.19 merge window, though, so if v4.18 is fine, it should be one of the 67 other commits in that range. The only thing I can think of is that the people that had blk-mq off in the kernel config still had it enabled on the kernel command line (scsi_mod.use_blk_mq=1, /sys/module/scsi_mod/parameters/use_blk_mq would then be set to Y). The bad commits in the bisect log I am fairly certain of because the corruption was evident, the good ones less so since I did only limited testing (about 3-6 VM restarts and couple minutes of running apt) and did not use the reproducer script posted here. There are a few preconditions that make the errors much more likely to appear: - Ubuntu Desktop 18.10; Ubuntu Server 18.10 did not work (I guess there are a few more things installed by default like Snap packages that are mounted on startup, dpkg automatically searches for updates, etc.) - as little RAM as possible (300 MB), 256 MB did not boot - this makes sure swap is used (~200 MiB out of 472 MiB total) - drive has to be the default if=ide, virtio-blk (-drive <...>,if=virtio) and virtio-scsi (-drive file=<file>,media=disk,if=none,id=hd -device virtio-scsi-pci,id=scsi -device scsi-hd,drive=hd) did not produce corruption (I did not try setting num-queues, though) - scsi_mod.use_blk_mq=1 has to be used, no errors for me without it (Ubuntu mainline kernel 4.19.1 and later has this on by default) Before running the bisect, I tested these kernels (all Ubuntu mainline from http://kernel.ubuntu.com/~kernel-ppa/mainline/): Had FS corruption: 4.19-rc1 4.19 4.19.1 4.19.2 4.19.3 4.19.4 4.19.5 4.19.6 No corruption (yet): 4.18 4.18.20 Created attachment 279847 [details]
description of my Qemu and Ubuntu configuration
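For anyone trying to follow along, a rough sketch of a QEMU invocation matching the preconditions listed above; the image name is a placeholder and this is not the reporter's exact command (the attached configuration is authoritative):
$ qemu-system-x86_64 -enable-kvm -m 300 \
      -drive file=ubuntu-18.10-desktop.qcow2,format=qcow2,if=ide
# inside the guest, boot with scsi_mod.use_blk_mq=1 on the kernel command line
# and generate disk load, e.g. by (un)installing packages with apt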
That's awesome, that makes some sense, finally! There's a later fix for that one that is also in 4.19, but I guess that doesn't fix every failure case. I'm going to run your qemu config and see if I can reproduce, then a real fix should be imminent. Excellent. Finally getting somewhere. FWIW, I am not able to reproduce the problem (anymore) with v4.19.6 and SCSI_MQ_DEFAULT=n. At this point I am not sure if my earlier test that saw it failing was a false positive. I'll try with the two reverts suggested in #232 next. (In reply to Lukáš Krejčí from comment #232) > Created attachment 279845 [details] > git bisect between v4.18 and 4.19-rc1 > > Hello, > > I am able to reproduce the data corruption under Qemu, the issue usually > shows itself fairly quickly (within a minute or two). Generally, the bug was > very likely to appear when (un)installing packages with apt. > > I ran a bisect with the following result (full bisect log is attached): > # first bad commit: [6ce3dd6eec114930cf2035a8bcb1e80477ed79a8] blk-mq: issue > directly if hw queue isn't busy in case of 'none' > [...] Congrats! Good to see progress here. Thanks! I also feel somewhat vindicated as my idea to catch and bisect this in VM wasn't so bad after all. ;) But obviously qemu has more knobs to turn than VB - and you just turned the right ones. Great! #232 That you could see the errors so early and reliably really baffles me. This very morning I *concurrently* - ran Guenter's script for 30 minutes - compiled a kernel - did some file copying with CONFIG_SCSI_MQ_DEFAULT=y and didn't see one error. But it is a Debian-8-VM, 1024 GB RAM and the two discs attached as SATA/SSD. I guess you must have played around with the settings for a while - or did you have the idea of limiting RAM and attaching the disc as IDE right from the start? Anyway - great that you found this out! Could anyone just sum up what needs to be set to trigger the bug (as of the understanding we have now)? I use scsi_mod.use_blk_mq=y and dm_mod.use_blk_mq=y for ages but I do not see the bug. I use the mq-deadline scheduler. #232 somehow suggests it needs additional memory pressure to trigger it, doesn't it? Quite confused here... As mentioned earlier, I only ever saw the problem on two of four systems (see #57), all running the same kernel and the same version of Ubuntu. The only differences are mainboard, CPU, and attached drive types. I don't think we know for sure what it takes to trigger the problem. We have seen various guesses, from gcc version to l1tf mitigation to CPU type, broken hard drives, and whatnot. At this time evidence points to the block subsystem, with bisect pointing to a commit which relies on the state of the HW queue (empty or not) in conjunction with the 'none' io scheduler. This may suggest that drive speed and access timing may be involved. That guess may of course be just as wrong as all the others. Let's just hope that Jens will be able to track down and fix the problem. Then we may be able to get a better idea what it actually takes to trigger it. Oh, and if commit 6ce3dd6ee is indeed the culprit, you won't be able to trigger the problem with mq-deadline (or any other scheduler) active. Progress report - I've managed to reproduce it now, following the procedure from Lukáš Krejčí. I have a PC with multiple boot -Windows 10, Arch linux with kernel 4.19.x en Ubuntu Disco Dingo with 4.19.x. My Arch linux is an encrypted LVM. I can actually invoke the EXT4-fs errors on Ubuntu! 
Which is not encrypted but has cryptsetup-initramfs installed, because I make regular backups with partclone from the Arch partitions. All that is needed on Ubuntu is to run sudo update-initramfs -u cryptsetup: WARNING: The initramfs image may not contain cryptsetup binaries nor crypto modules. If that's on purpose, you may want to uninstall the 'crypsetup-initramfs' package in order to disable the cryptsetup initramfs integration and avoid this warning. You will get a warning or error that is also subscribed in this bugreport: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=901830 The problem is that when you reboot you will get all the EXT4-fs errors. I have to do a e2fsck via Arch, it reports some inode errors and when rebooting Ubuntu the problem is gone, if there never were any problems. Also when I am on ARCH cloning the Ubuntu partitions I can reproduce these errors. When mistakenly partcloning a read-only mounted partition for instance. So far I have not been able to reproduce the problem on the affected systems after reverting commits 8824f62246 and 6ce3dd6eec. I've only reproduced it that one time, but here's what I think is happening: - Usually a request is queued, inserted into the blk-mq proper - There's an optimization in place to attempt to issue to the driver directly before doing that insert. If we fail because of some resource limitation, we insert the request into blk-mq proper - But if that failure did trigger, SCSI has already setup the command. This means we now have a request in the regular blk-mq IO lists that is mergeable with other commands, but where the SG tables for IO have already been setup. - If we later do merge with this IO before dispatch, we'll only do DMA to the original part of the request. This makes the rest very unhappy... The case is different because from normal dispatch, if IO needs to be requeued, it will NEVER be merged/changed after the fact. This means that we don't have to release SG tables/mappings, we can simply reissue later. This is just a theory... If I could reproduce more reliably, I'd verify it. I'm going to spin a quick patch. With 4.19.6, setting CONFIG_SCSI_MQ_DEFAULT=n seems to resolve the issue on my system, going back to CONFIG_SCSI_MQ_DEFAULT=y makes it show up again. Indeed all schedulers in /sys/devices/virtual/block/*/queue/scheduler are none. #245: Is there a means to log the possible error case ? Created attachment 279851 [details]
dmesg with mdraid1
Created attachment 279853 [details]
4.19 fix
Here's a fix, verifying it now. It might be better to fully unprep the request after the direct issue fails, but this one should be the safe no-brainer. And at times like this, that feels prudent...
(In reply to Chris Severance from comment #248) > Created attachment 279851 [details] > Triggered by mdraid Unrelated issue. #248: Looks like you may be running my reproducer script. It tends to do that, especially on slow drives (ie anything but nvme), depending on the io scheduler used. I have seen it with "cfq", but not with "none". iostat would probably show you 90+ % iowait when it happens. That by itself does not indicate the error we are trying to track down here. (In reply to Guenter Roeck from comment #247) > #245: Is there a means to log the possible error case ? Yes, that's how I ended up verifying that this was indeed what was going on. Example: [ 235.665576] issue_direct=13, 22080, ffff9ee3da59e400 [ 235.931483] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400 [ 235.931486] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400 [ 235.931489] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400 [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) Here we see req 0xffff9ee3da59e400 being rejected, due to resource starvation. Shortly thereafter, we see us happily merging more IO into that request. Once that request finishes, ext4 gets immediately unhappy as only a small part of the request contents were valid. The rest simply contained garbage. Created attachment 279855 [details]
4.19 patch v2
Better version of the previous patch. They both solve the issue, but the latter version seems safer since it doesn't rely on whatever state that SCSI happens to maintain. If we fail direct dispatch, don't ever touch the request before dispatch.
(In reply to Michel Roelofs from comment #246) > With 4.19.6, setting CONFIG_SCSI_MQ_DEFAULT=n seems to resolve the issue on > my system, going back to CONFIG_SCSI_MQ_DEFAULT=y makes it show up again. > Indeed all schedulers in /sys/devices/virtual/block/*/queue/scheduler are > none. I can confirm here. Created attachment 279857 [details]
4.19/4.20 patch v3
Here's the one I sent upstream, also tested this one. Should be the safest of them all, as we set REQ_NOMERGE at the source, when we attempt to queue it. That'll cover all cases, guaranteed.
The folks that have seen this, please try this patch on top of 4.19 or 4.20-rc and report back.
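In case it helps anyone testing: a minimal sketch of applying the attached patch to a stable tree. The patch file name is a placeholder for whatever the attachment is saved as.
$ cd linux-4.19.6
$ patch -p1 < ../blk-mq-direct-issue-fix.patch
$ make olddefconfig && make -j"$(nproc)"
# then install and boot the patched kernel as usual for your distribution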
Over an hour now on 4.19.0 with commits 8824f62246 and 6ce3dd6eec reverted, and ZFS is happy. Plenty of IO and not a single checksum error. (In reply to Eric Benoit from comment #256) > Over an hour now on 4.19.0 with commits 8824f62246 and 6ce3dd6eec reverted, > and ZFS is happy. Plenty of IO and not a single checksum error. Can you try 4.19.x with the patch from comment #255? Thanks! So far 30 minutes on two system running the patch from #255. Prior to that, close to two hours running patch v1. No issues so far. In my case (kernel 4.19.6 and a 2-vdev/6-drive raidz1) doing an rsync to/from the same ZFS filesystem would generate ~1 error every 5s or so (on a random drive on the pool). With the patch from #255 I have been through 300GB of the rsync w/o any errors. Throughput of the rsync is identical before/after the patch. #255 feels like a good patch. About an hour now with 4.19.6 and the patch from #255 without a peep from zed. I think we have a winner! Two hours. Agreed - looks like a winner. I have patched with #255 Jens Axboe 4.19/4.20 patch v3 to my trees 4.19.0 4.19.5 4.19.6 and I can say that for the first time I am able to use 4.19.0 reading #259 reminds me that every time I updated I started having problems with rsync and I had to do it several times to get it right ... I innocently thought it was a problem in the repository. I was investigating my hdds in case they did not have real physical problems but: # smartctl -H /dev/sda SMART overall-health self-assessment test result: PASSED # smartctl -H /dev/sdb SMART overall-health self-assessment test result: PASSED Very glad that the cause of the evils has been found ...to see that it continues now. $ uname -r Linux pc-user 4.19.0-gentoo Regards (In reply to Guenter Roeck from comment #240) > As mentioned earlier, I only ever saw the problem on two of four systems > (see #57), all running the same kernel and the same version of Ubuntu. The > only differences are mainboard, CPU, and attached drive types. > > I don't think we know for sure what it takes to trigger the problem. We have > seen various guesses, from gcc version to l1tf mitigation to CPU type, > broken hard drives, and whatnot. At this time evidence points to the block > subsystem, with bisect pointing to a commit which relies on the state of the > HW queue (empty or not) in conjunction with the 'none' io scheduler. This > may suggest that drive speed and access timing may be involved. That guess > may of course be just as wrong as all the others. > > Let's just hope that Jens will be able to track down and fix the problem. > Then we may be able to get a better idea what it actually takes to trigger > it. It would indeed be nice to get a short summary *here* of what happened and why, once the dust has settled. It would also be interesting to know why all the testing in the run-up to 4.19 didn't catch it, including rc-kernels. It's imo for instance unlikely that everybody just tested with CONFIG_SCSI_MQ_DEFAULT=n. Is this possible to avoid this bug by using some command line parameter or setting some sysfs entry? Something I could use on my machines before the fix get included in my distribution? I think Jens pretty much summarized the situation in #245. To trigger the bug blk-mq must be used together with an underlying block device (such as SCSI or SATA) that is stateful after a rejected bio submit. Then it's just a matter of enough concurrent I/O. 
So a workaround is to just disable blk-mq with SCSI: scsi_mod.use_blk_mq=1 > So a workaround is to just disable blk-mq with SCSI: scsi_mod.use_blk_mq=1
I wasn't sure if that's a reliable workaround as it was being discussed before the fix was provided. Thank you for clarifying that!
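For completeness, a sketch of making such a command-line workaround persistent on a GRUB-based distribution (the value must be 0 to disable blk-mq for SCSI; adjust for your boot loader):
# in /etc/default/grub, append the parameter, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash scsi_mod.use_blk_mq=0"
$ sudo update-grub   # Debian/Ubuntu; or: grub2-mkconfig -o /boot/grub2/grub.cfg
# after a reboot, confirm that blk-mq is off for SCSI:
$ cat /sys/module/scsi_mod/parameters/use_blk_mq
N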
Of course that should be a zero to disable it: scsi_mod.use_blk_mq=0 (In reply to Rainer Fiebig from comment #263) > (In reply to Guenter Roeck from comment #240) > > As mentioned earlier, I only ever saw the problem on two of four systems > > (see #57), all running the same kernel and the same version of Ubuntu. The > > only differences are mainboard, CPU, and attached drive types. > > > > I don't think we know for sure what it takes to trigger the problem. We > have > > seen various guesses, from gcc version to l1tf mitigation to CPU type, > > broken hard drives, and whatnot. At this time evidence points to the block > > subsystem, with bisect pointing to a commit which relies on the state of > the > > HW queue (empty or not) in conjunction with the 'none' io scheduler. This > > may suggest that drive speed and access timing may be involved. That guess > > may of course be just as wrong as all the others. > > > > Let's just hope that Jens will be able to track down and fix the problem. > > Then we may be able to get a better idea what it actually takes to trigger > > it. > > It would indeed be nice to get a short summary *here* of what happened and > why, once the dust has settled. > > It would also be interesting to know why all the testing in the run-up to > 4.19 didn't catch it, including rc-kernels. It's imo for instance unlikely > that everybody just tested with CONFIG_SCSI_MQ_DEFAULT=n. As mentioned earlier: it would be nice to have a definitive list of ciscumstances that are likely to have the bug triggered so people can check if they are probably affected because the _ran_ their systems with these setting and possibly have garbage on their disks now... From what I've learned there are two types of schedulers. Legacy stack schedulers, e.g.: 1) noop 2) deadline 3) cfq MQ schedulers, e.g.: 1) none 2) mq-deadline 3) kyber 4) bfq This issue is triggered by a bug in the blk-mq (MQ schedulers). If in the output of: cat /sys/block/sda/queue/scheduler you see e.g. noop / deadline / cfq (no matter which one is selected) then you are safe. If you see none / mq-deadline / kyber / bfq, your system may be affected. Until the fix is applied, it's the safest to switch to the legacy stack schedulers using scsi_mod.use_blk_mq=0. What I'm unsure of: is there a workaround for the NVMe drives? Setting scsi_mod.use_blk_mq=0 obviously won't affect them. (In reply to Guenter Roeck from comment #241) > Oh, and if commit 6ce3dd6ee is indeed the culprit, you won't be able to > trigger the problem with mq-deadline (or any other scheduler) active. Isn't the direct-issue optimization also used when the scheduler is being bypassed by adding a scheduler to a stacked device? Not that this seems to be a common case... NVMe is hard-wired to blk_mq. NVMe drives were basically the reason for the invention of blk_mq. (In reply to Lukáš Krejčí from comment #232) > No corruption (yet): > 4.18 > 4.18.20 I actually had this bug on 4.18.20, I was running Manjaro and one day I had a serious bug that flushed my /etc/fstab into garbage (not completely garbage, but most of the content are the files I read/written), should have etckeeper not be there my system will never boot again. 
Before that I also had frequent fsck failures that deleted my NetworkManager profile, my pacman database, my pnpm repository files, some journal blocks on my Seafile client, one random dynamic library from /lib such that I can't even initiate X session (because node requires it), my Mathematica installation and many more are man pages and assets I don't bother to recover. Although I had a badly reputed SSD (ADATA SU650 960GB), and my rootfs is installed on it, however, that SSD was only purchased for roughly a month. I thought that my SSD was broken would be the reason until I came across this bug introduced from someone. Now I'm running 4.19.6-1, my PC has been running it for 12 hours now, no more EXT4 checksum and bad blocks from dmesg, so far so good. I'm on v4.19.4 without scsi_mod.use_blk_mq=0 on the commandline and CONFIG_SCSI_MQ_DEFAULT=y So far no issue on ext4 mounts cat /sys/block/sd*/queue/scheduler reports mq-deadline cat /sys/block/dm-*/dm/use_blk_mq reports 0 So, even without disabling use_blk_mq from commandline, it seams it could get disabled by the driver itself, maybe because of incompatible HW. Is it relevant to mention about checking /sys/block/dm-*/dm/use_blk_mq ? The patch discussion: https://patchwork.kernel.org/patch/10712695/ The reason why this wasn't caught in testing, and why it only affected a relatively small number of people, is that it required the following conditions to be met: 1) The drive must not have an IO scheduler configured 2) The driver must regularly have a resource starvation condition 3) The driver must maintain state in the request over multiple ->queue_rq() invocations NVMe isn't affected by this, as it doesn't meet conditions 2+3. SCSI is affected, but 99.99% of SCSI use is single queue, and the kernel defaults to using mq-deadline for single queue blk-mq devices. Hence we also needed the distro to override this fact, which largely explains why this wasn't caught earlier. This will of course be rectified, I'll write a specific regression test case for this. FWIW, the patch is queued up for inclusion, and I'll send it to Linus later today. The patch from #255 seems good to me under Qemu. Without the patch and with kernel v4.20-rc5, the bug occurred on the second restart of the VM (~4 minutes). I haven't been able to reproduce the bug with the patch applied (to the same kernel) even after 10 tries. I also reviewed the log in #58 from the person that had CONFIG_SCSI_MQ_DEFAULT=n and found that blk-mq was used as well (if I am correct). See this line from dmesg: [ 7096.208603] io schedulerFbfq registered (In reply to Steve "Stefan" Fan from comment #273) > I actually had this bug on 4.18.20, I was running Manjaro and one day I had > a serious bug that flushed my /etc/fstab into garbage (not completely > garbage, but most of the content are the files I read/written), should have > etckeeper not be there my system will never boot again. The discussion on https://patchwork.kernel.org/patch/10712695/ suggests that the bug has been in the kernel for a bit longer, it probably just got easier to hit with the patches in v4.19. Actually the patch v3 suggested seems fixing. Note that sometimes, before patch v3, the bug hit was asymptomatic, e.g. got a few bit flips (in the order of 2 bytes off 5GB of data moved) without having the correspondent ext4 corruption report in the logs (the common were MQ + high CPU pressure + high I/O pressure). Please also reconsider in not removing definitively the legacy block layer for 4.21. 
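A quick way to spot the kind of silent bit flip described above when moving large files between filesystems (the paths are examples only):
$ sha256sum /data/big-image.qcow2
$ cp /data/big-image.qcow2 /mnt/target/
$ sha256sum /mnt/target/big-image.qcow2   # the two digests must match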
(In reply to Giuseppe Ghibò from comment #278) > Please also reconsider in not removing definitively the legacy block layer > for 4.21. It's definitely still going, this highlights nicely why having two parallel stacks that need to be maintained is an issue. FWIW, timing is unfortunate, but this is the first corruption issue we've had (blk-mq or otherwise) in any kernel in the storage stack in decades as far as I can remember. Strangely, the last time it happened, it was also a merging issue... I do not understand "NVMe isn't affected by this": how come some reporters described problems with such a drive (eg comment #198)? My personal experience was also a "disappearing" NVMe SSD: after running for a few hours under 4.19.5 and powering off, next boot popped-up a BIOS error screen telling that no drive was detected. This stayed like this for ~ 30 minutes and after around 10 failed reboots (with a few minuites between each), the drive was detected again. I then became aware of the bug reports and switched back to 4.18.20. The last days on this kernel did not show any problem. Also note that I have been running 4.19-rc7 for ~1 month (was in Debian experimental for quite some time) without any issue. Any clue why it became more visible in recent 4.19 only? Thanks! what means "removing the legacy block layer for 4.21"? will this invalidate "elevator=noop" which is what you want on virtualized guests which have no clue at all about the real disks? (In reply to Damien Wyart from comment #280) > I do not understand "NVMe isn't affected by this": how come some reporters > described problems with such a drive (eg comment #198)? That particular issue looks different. This exact issue is not something that will affect nvme. > My personal experience was also a "disappearing" NVMe SSD: after running for > a few hours under 4.19.5 and powering off, next boot popped-up a BIOS error > screen telling that no drive was detected. This stayed like this for ~ 30 > minutes and after around 10 failed reboots (with a few minuites between > each), the drive was detected again. I then became aware of the bug reports > and switched back to 4.18.20. The last days on this kernel did not show any > problem. > > Also note that I have been running 4.19-rc7 for ~1 month (was in Debian > experimental for quite some time) without any issue. It's quite possible that you have drive issues. If the drive doesn't even detect, that's not a kernel problem. (In reply to Reindl Harald from comment #281) > what means "removing the legacy block layer for 4.21"? The old IO stack is being removed in 4.21. For most drivers it's already the case in 4.20. > will this invalidate "elevator=noop" which is what you want on virtualized > guests which have no clue at all about the real disks? 'none' is the equivalent on blk-mq driven devices. But we're getting off topic now. (In reply to Antonio Borneo from comment #274) > I'm on v4.19.4 without scsi_mod.use_blk_mq=0 on the commandline and > CONFIG_SCSI_MQ_DEFAULT=y > So far no issue on ext4 mounts > > cat /sys/block/sd*/queue/scheduler > reports mq-deadline > > cat /sys/block/dm-*/dm/use_blk_mq > reports 0 > > So, even without disabling use_blk_mq from commandline, it seams it could > get disabled by the driver itself, maybe because of incompatible HW. > Is it relevant to mention about checking /sys/block/dm-*/dm/use_blk_mq ? DM's 'use_blk_mq' was only relevant for request-based DM. The only DM target that uses request-based DM is multipath. 
DM core doesn't allow scsi's use_blk_mq to be on but DM's use_blk_mq to be off, meaning: you aren't using multipath. So DM's 'use_blk_mq' is irrelevant. (In reply to Jens Axboe from comment #276) > The reason why this wasn't caught in testing, and why it only affected a > relatively small number of people, is that it required the following > conditions to be met: > > 1) The drive must not have an IO scheduler configured > 2) The driver must regularly have a resource starvation condition > 3) The driver must maintain state in the request over multiple ->queue_rq() > invocations > > NVMe isn't affected by this, as it doesn't meet conditions 2+3. SCSI is > affected, but 99.99% of SCSI use is single queue, and the kernel defaults to > using mq-deadline for single queue blk-mq devices. Hence we also needed the > distro to override this fact, which largely explains why this wasn't caught > earlier. This will of course be rectified, I'll write a specific regression > test case for this. > > FWIW, the patch is queued up for inclusion, and I'll send it to Linus later > today. Thanks. The distro-override probably explains why I couldn't reproduce it in a Debian-8-VM. Given the flaky nature of this, I think it was ultimately uncovered rather quickly. And I guess that everybody has also learnt something - if perhaps only to back-up their data. ;) (In reply to Jan Steffens from comment #271) > (In reply to Guenter Roeck from comment #241) > > Oh, and if commit 6ce3dd6ee is indeed the culprit, you won't be able to > > trigger the problem with mq-deadline (or any other scheduler) active. > > Isn't the direct-issue optimization also used when the scheduler is being > bypassed by adding a scheduler to a stacked device? Not that this seems to > be a common case... Yes, it is used by dm-multipath that has a top-level IO scheduler, so the underlying devices (individual paths) don't benefit from another IO scheduler. The difference is dm-multipath _always_ issues requests directly. Whereas, the issue associated with the corruption is the use of direct issue as a fallback. Thank you, to everyone involved and for spending time figuring it out and getting this fixed. I was just an observer, but reading the comments and watching the progress. @Guenter Roeck: thank you for your great determination, not giving up, especially after a rather harsh and demotivating comment. Very much appreciated. So now that a patch is pending and working its way into the stable-queue and into the next stable release, I would like to ask: 1) Are any file systems other than ext4 and zfs, which are mentioned in here, affected by this too? 2) What about xfs? Or other file systems? 3) Why did this issue and the file system corruptions only surface because of ext4? I've not seen one person mentioning corruptions with xfs. The way I understand it, since the issue is below file system level, it should have affected more file systems, shouldn't it? I'd like to echo the sentiment towards Guenter, but also extend it to all the rest of the folks that have been diligent in testing, reporting, and trying to make sense of it all. This would affect any file system, but it's not unlikely that some would be more susceptible to it than others. Or maybe there's just a lot more folks running ext4 than xfs. I've got a reproducer now using fio, that will go into blktests to ensure we don't regress in this area again. 
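Not the actual blktests case, but a rough sketch of the kind of setup such a regression test needs: a queue-depth-limited scsi_debug device with no I/O scheduler, driven hard enough that direct issue regularly hits a resource failure, plus data verification to catch bad merges. sdX stands for whatever disk scsi_debug creates.
$ sudo modprobe scsi_debug dev_size_mb=512 max_queue=2
$ echo none | sudo tee /sys/block/sdX/queue/scheduler   # assuming the device is on blk-mq
$ sudo fio --name=verify --filename=/dev/sdX --rw=randwrite --bs=64k \
      --ioengine=libaio --iodepth=32 --size=256m --direct=1 --verify=crc32c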
(In reply to Siegfried Metz from comment #287) > Thank you, to everyone involved and for spending time figuring it out and > getting this fixed. I was just an observer, but reading the comments and > watching the progress. > > @Guenter Roeck: thank you for your great determination, not giving up, > especially after a rather harsh and demotivating comment. Very much > appreciated. > > > So now that a patch is pending and working its way into the stable-queue and > into the next stable release, I would like to ask: > > 1) Are any file systems other than ext4 and zfs, which are mentioned in > here, affected by this too? All file systems are affected. There is at least one bug open against btrfs (https://bugzilla.kernel.org/show_bug.cgi?id=201639), and one against xfs (https://bugzilla.kernel.org/show_bug.cgi?id=201897). Others were reported here. > 2) What about xfs? Or other file systems? See 1). > 3) Why did this issue and the file system corruptions only surface because > of ext4? I've not seen one person mentioning corruptions with xfs. > My best guess is that most people use ext4. The write pattern of a specific file system may also have some impact. > The way I understand it, since the issue is below file system level, it > should have affected more file systems, shouldn't it? It does. Note that of my four systems running 4.19 kernels, three were affected, not just two. The only unaffected system, at the end, was the one with an nvme drive and no other drives. The root file system on the one system I thought unaffected (with two ssd drives) is corrupted so badly that grub doesn't recognize it anymore (I only noticed after trying to reboot the system). I suspect we'll see many more reports of this problem as time goes by and people who thought they were unaffected notice that they are affected after all. Thanks a lot. However, I think @Lukáš Krejčí should get most of the credit for finding a reliable and quick reproducer, and for bisecting and finding the root cause. All I was able to do, after many false positives, was to show that the problem was not caused by ext4 after all. Closing this issue... Fix is pending, will go upstream later today. @Jens Axboe, @Guenter Roeck: Thanks for answering my questions.
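For anyone in a similar situation, a cautious first step is a read-only check from a rescue environment before letting fsck repair anything (the device name is an example, and the filesystem must be unmounted):
$ sudo e2fsck -fn /dev/sda2   # -f: force a check, -n: read-only, answer "no" to all fixes
# review the report first; an actual repair would then be: sudo e2fsck -fy /dev/sda2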
> Thanks a lot. However, I think @Lukáš Krejčí should get most of the credit
> for finding a reliable and quick reproducer, and for bisecting and finding
> the root cause.
Indeed. So a big thanks to you, @Lukáš Krejčí! :)
Is there, for end-users, a guide on what they should do until the patch makes its way into stable and their distros? Will scsi_mod.use_blk_mq=0 do the job? And also an easy to follow list for people to check whether they're affected? What's in comment #269 seem apply to many systems (which do not show corruptions). While the criteria in comment #276 are rather abstract. scsi_mod.use_blk_mq=0 will do the trick, as will just ensuring that you have a scheduler for your device. Eg for sda, check: # cat /sys/block/sda/queue/scheduler bfq [mq-deadline] none As long as that doesn't say [none], you are fine as well. Also note that this seems to require a special circumstance of timing and devices to even be possible in the first place. But I would recommend ensuring that one of the above two conditions are true, and I'd further recommend just using mq-deadline (or bfq or kyber, whatever is your preference) instead of turning scsi-mq off. Once you've ensured that after a fresh boot, I'd double check by running fsck on the file systems hosted by a SCSI/SATA device. (In reply to Christoph Anton Mitterer from comment #293) > > And also an easy to follow list for people to check whether they're > affected? What's in comment #269 seem apply to many systems (which do not > show corruptions). > While the criteria in comment #276 are rather abstract. if you want to do a further deeper and simply check of your data (of course that has not already been corrupted), beyond the filesystem, a simple procedure would be to boot with a kernel for sure not affected (if you don't have one installed, use a LiveCD, you might go back to 4.9.x in SQ for instance) and generate the md5+sha256 checksum of all your data; this can be done using the utility "rhash" (most of distros provide this package), then run into your interested dirs: rhash -v --speed --percents --md5 --sha256 -r . > mydata.rhashsums It will generate the "mydata.rhashsums" file which will be the reference point. Then reboot with your latest suspected 4.19.x kernel with multiqueue, whatever, and move/copy back and forth intensively the data (especially if you have big filedisks coming from Virtual Machines) across filesystems, and recheck the md5+sha256 checksums in the destination, using: rhash -v --speed --percents --md5 --sha256 mydata.rhashsums if there was some bit flip it will show an ERROR, otherwise OK. This seems trivial, but safe (for speeding up, you might want to generate just sha256 or md5 sums, and not both md5+sha256 which IIRC should be hash collision free). Man, I wish I had found this thread a day or so earlier - I could 100% consistently reproduce this bug within 30 minutes just by doing normal desktop stuff with any 4.20-rcX kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline. I never hit the bug with the 4.19 kernels from that same repo. I thought it was an ecryptfs bug as it manifested as ecryptfs errors (Mint 19, just encrypted /home) This is just on a Dell Latitude 3590 laptop with 16GB RAM and this SSD, very lightly loaded - my iowait barely gets over 1%. I just thought I'd mention it because it doesn't seem to fit any of the other use cases where people are hitting the bug. 
*-scsi physical id: 1 logical name: scsi0 capabilities: emulated *-disk description: ATA Disk product: TOSHIBA THNSNK12 vendor: Toshiba physical id: 0.0.0 bus info: scsi@0:0.0.0 logical name: /dev/sda version: 4101 serial: 57OB628XKLMU size: 119GiB (128GB) capabilities: gpt-1.00 partitioned partitioned:gpt configuration: ansiversion=5 guid=3593c5c3-c8cd-43f8-b4cb-d56475bb229f logicalsectorsize=512 sectorsize=4096 (In reply to Lee Revell from comment #296) > Man, I wish I had found this thread a day or so earlier - I could 100% > consistently reproduce this bug within 30 minutes just by doing normal > desktop stuff with any 4.20-rcX kernel from > http://kernel.ubuntu.com/~kernel-ppa/mainline. I never hit the bug with the > 4.19 kernels from that same repo. I thought it was an ecryptfs bug as it > manifested as ecryptfs errors (Mint 19, just encrypted /home) Initially, I thought that Ubuntu's compiler was at fault (and it may have exacerbated the problem, given some of the comments above. Since I also hit it with 4.20rc, and not with 4.19.*, even after they updated their compiler, this coinciding with the problem being narrowed down to MQ, I did a smo@dell-smo:~$ grep MQ /boot/config-4.19.6-041906-generic CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_BLK_WBT_MQ=y CONFIG_BLK_MQ_PCI=y CONFIG_BLK_MQ_VIRTIO=y CONFIG_BLK_MQ_RDMA=y CONFIG_MQ_IOSCHED_DEADLINE=m CONFIG_MQ_IOSCHED_KYBER=m CONFIG_NET_SCH_MQPRIO=m CONFIG_SCSI_MQ_DEFAULT=y # CONFIG_DM_MQ_DEFAULT is not set CONFIG_DM_CACHE_SMQ=m ...and then a smo@dell-smo:~$ grep MQ /boot/config-4.20.0-042000rc5-generic CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_BLK_WBT_MQ=y CONFIG_BLK_MQ_PCI=y CONFIG_BLK_MQ_VIRTIO=y CONFIG_BLK_MQ_RDMA=y CONFIG_MQ_IOSCHED_DEADLINE=m CONFIG_MQ_IOSCHED_KYBER=m CONFIG_NET_SCH_MQPRIO=m CONFIG_SCSI_MQ_DEFAULT=y CONFIG_DM_CACHE_SMQ=m Note that # CONFIG_DM_MQ_DEFAULT on 4.19, whereas it is set to y on 4.20 - I think that might be the real reason why kernel-ppa's 4.19's seem unaffected, as opposed to their 4.20rcs. Cetera censeo that the praises above are absolutely spot on - well done, all around! JFYI: I could now reproduce it in my Debian-8-VirtualBox-VM. There is nothing special about the VM, just 1024 MB memory, the 2 discs are attached to a SATA-controller (AHCI). The reason why I could not reproduce it there before was that I was using the default kernel .config for x86_64 created with "make defconfig". Relevant settings of default-config (good): CONFIG_MQ_IOSCHED_DEADLINE=y CONFIG_MQ_IOSCHED_KYBER=y CONFIG_SCSI_MQ_DEFAULT=y Those settings lead to this: >cat /sys/block/sd*/queue/scheduler [mq-deadline] kyber none [mq-deadline] kyber none The settings necessary to *reproduce* the bug were: # CONFIG_MQ_IOSCHED_DEADLINE is not set # CONFIG_MQ_IOSCHED_KYBER is not set CONFIG_SCSI_MQ_DEFAULT=y The latter settings lead to this: > cat /sys/block/sd*/queue/scheduler [none] [none] Guenter's script alone wouldn't do the trick (but ran it for just 5 min). But the script + a bigger file copy made the errors pop up almost immediately: > dmesg -t | grep -i ext4 EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null) EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro EXT4-fs (sdb): mounted filesystem with ordered data mode. Opts: (null) EXT4-fs (sda3): mounted filesystem with ordered data mode. 
Opts: (null) EXT4-fs error (device sda2): ext4_iget:4831: inode #275982: comm dbus-daemon: bad extra_isize 1191 (inode size 256) EXT4-fs (sda2): Remounting filesystem read-only EXT4-fs error (device sda2): ext4_iget:4831: inode #275982: comm dbus-daemon: bad extra_isize 1191 (inode size 256) EXT4-fs error (device sda2): ext4_iget:4831: inode #275982: comm dbus-daemon: bad extra_isize 1191 (inode size 256) [...] So with 4.19.x + the .config-settings necessary to reproduce the bug you had a good chance of getting into trouble. And now for something completely different... :) (In reply to Sune Mølgaard from comment #297) > Note that # CONFIG_DM_MQ_DEFAULT on 4.19, whereas it is set to y on 4.20 - I > think that might be the real reason why kernel-ppa's 4.19's seem unaffected, > as opposed to their 4.20rcs. I was using kernel-ppa's 4.19 and I was definitely affected. (In reply to Steve "Stefan" Fan from comment #273) > > Now I'm running 4.19.6-1, my PC has been running it for 12 hours now, no > more EXT4 checksum and bad blocks from dmesg, so far so good. It seems like I'm still naive and had faith that this problem is gone for the wind: it happened again today and it is catastrophic, my system has entered an unbootable state. I will audit a live CD rescue and hope for the best. 4.19.6 was not supposed to fix this issue as 4.19.7 isn't too until your distribution has the patch https://bugzilla.kernel.org/show_bug.cgi?id=201685#c255 Fedora: * Wed Dec 05 2018 Jeremy Cline <jcline@redhat.com> - 4.19.7-300 - Linux v4.19.7 - Fix CVE-2018-19406 (rhbz 1652650 1653346) * Wed Dec 05 2018 Jeremy Cline <jeremy@jcline.org> - Fix corruption bug in direct dispatch for blk-mq so the only relevant question is: had your kernel that patch and was the FS 100% clean before? (In reply to Steve "Stefan" Fan from comment #300) > (In reply to Steve "Stefan" Fan from comment #273) > > > > Now I'm running 4.19.6-1, my PC has been running it for 12 hours now, no > > more EXT4 checksum and bad blocks from dmesg, so far so good. > > It seems like I'm still naive and had faith that this problem is gone for > the wind: it happened again today and it is catastrophic, my system has > entered an unbootable state. I will audit a live CD rescue and hope for the > best. Oh and by the way, I think this is absurdly hilarious: https://imgur.com/a/cbpxTze (In reply to Reindl Harald from comment #301) > 4.19.6 was not supposed to fix this issue as 4.19.7 isn't too until your > distribution has the patch > https://bugzilla.kernel.org/show_bug.cgi?id=201685#c255 > > Fedora: > * Wed Dec 05 2018 Jeremy Cline <jcline@redhat.com> - 4.19.7-300 > - Linux v4.19.7 > - Fix CVE-2018-19406 (rhbz 1652650 1653346) > > * Wed Dec 05 2018 Jeremy Cline <jeremy@jcline.org> > - Fix corruption bug in direct dispatch for blk-mq > > so the only relevant question is: had your kernel that patch and was the FS > 100% clean before? Unfortunately, it was tainted, tainted very seriously. I might need to conduct a full system wipe, which means I will have to start from point zero again. My SSD had no issues, said SMART. In reply to comments #296, #299 and others. Sorry not to contribute more to this thread, but I had two very different machines experience "this" bug and they don't quite fit the pattern mentioned above, so the data may be of use. Both were running Kubuntu 18.04 with kernels updated from ubuntu-stock (4.15.x) to kernel-ppa/mainline/4.19.x. 1) Desktop machine (i7-5960X based). Started with an Intel 530 series SSD (SSDSC2BW240A4). 
The "bug", combined with fsck at subsequent boot, erased /usr so you can imagine nothing worked after this. The SSD had a few bad blocks (badblocks' non-destructive rw test showed three) so I replaced it with a brand new Samsung SSD 860 PRO 512GB together with a fresh new install of Kubuntu 18.04. This gave errors similar to those reported in this thread *immediately* on the first boot after the install of 4.19.x (the boot failed, so I rebooted with the stock 4.15.0-x kernel and ran fsck). I immediately downgraded the kernel, now running 4.15.0-42, it's been fine since (also 4.18.20 seemed fine, but I ran it only for a short time). What's weird is that the scheduler is single-queue cfq in this system by default: > cat /sys/block/sda/queue/scheduler noop deadline [cfq] so apparently (#298 ?) should be ok. It wasn't. Load was (presumably) relatively high during installation, after that little. 16GB of RAM was hardly touched, although there is some minor use of swap. 2) Laptop (ASUS Zenbook UX31A) running Kubuntu 18.04. After upgrade to 4.19.x from the kernel-ppa this also started giving the errors reported above. Damage was limited, and I immediately downgraded the kernel and fsckd back to relative health (reinstalled all packages to be sure). The disc, a Sandisk SSD U100 252GB, has no bad blocks (badblocks' non-destructive rw test). This is also using the single-queue cfq scheduler, not multi-queue. Please don't ask me to (further) test 4.19.x kernels on these systems, I need them for work with intact filesystems, but there's nothing odd about them: standard (k)ubuntu 18.04 "minimal" installs then upgraded with linux-ppa 4.19.x kernels. Both (different) SSDs, and both had the problem with the ubuntu-default single-queue cfq scheduler, contradicting #276. This might be a worry! I would like to thank all those on here who've been doing great work to fix this, it's very much appreciated by many. (In reply to Rob Izzard from comment #304) > In reply to comments #296, #299 and others. > > Sorry not to contribute more to this thread, but I had two very different > machines experience "this" bug and they don't quite fit the pattern > mentioned above, so the data may be of use. Both were running Kubuntu 18.04 > with kernels updated from ubuntu-stock (4.15.x) to > kernel-ppa/mainline/4.19.x. > > > 1) Desktop machine (i7-5960X based). Started with an Intel 530 series SSD > (SSDSC2BW240A4). The "bug", combined with fsck at subsequent boot, erased > /usr so you can imagine nothing worked after this. The SSD had a few bad > blocks (badblocks' non-destructive rw test showed three) so I replaced it > with a brand new Samsung SSD 860 PRO 512GB together with a fresh new install > of Kubuntu 18.04. This gave errors similar to those reported in this thread > *immediately* on the first boot after the install of 4.19.x (the boot > failed, so I rebooted with the stock 4.15.0-x kernel and ran fsck). I > immediately downgraded the kernel, now running 4.15.0-42, it's been fine > since (also 4.18.20 seemed fine, but I ran it only for a short time). > > What's weird is that the scheduler is single-queue cfq in this system by > default: > > > > cat /sys/block/sda/queue/scheduler > noop deadline [cfq] Are you really sure that your have seen this with the *4.19*-kernels that you had run? Or are you seeing this *now*? (In reply to Steve "Stefan" Fan from comment #303) > Unfortunately, it was tainted, tainted very seriously. 
I might need to > conduct a full system wipe, which means I will have to start from point zero > again. > > My SSD had no issues, said SMART. Thankfully, my system was able to breathe again, it just missed one single library file libresolv.so.2 such that it is one of the file my fsck randomly erased. Then I performed full system upgrade to prevent another missing kidney. I was able to boot into X. Well, that’s hell of a lesson to teach me how you shouldn’t be living on the edge. > Are you really sure that your have seen this with the *4.19*-kernels that > you had run? Or are you seeing this *now*? Yes, 4.19.4, installed from http://kernel.ubuntu.com/~kernel-ppa/mainline/ (the generic 64-bit versions, not the low-latency build) was the most recent problematic kernel. I have the old SSD (with what remains of the filesystem) and just checked. *Now* I'm running ubuntu's 4.15.0-42 on all the affected machines, with no problems at all. I also previously used the 4.18 releases from kernel-ppa without any problems. I did think it was my SSD that was the cause, hence the replacement, but even the new SSD had the same problem. Downgrade the kernel and the problem goes away. (In reply to Rob Izzard from comment #307) > > > Are you really sure that your have seen this with the *4.19*-kernels that > > you had run? Or are you seeing this *now*? > > Yes, 4.19.4, installed from http://kernel.ubuntu.com/~kernel-ppa/mainline/ > (the generic 64-bit versions, not the low-latency build) was the most recent > problematic kernel. I have the old SSD (with what remains of the filesystem) > and just checked. > > *Now* I'm running ubuntu's 4.15.0-42 on all the affected machines, with no > problems at all. I also previously used the 4.18 releases from kernel-ppa > without any problems. > > I did think it was my SSD that was the cause, hence the replacement, but > even the new SSD had the same problem. Downgrade the kernel and the problem > goes away. OK, I wasn't daubting that you had file corruption. What I was asking was whether you saw > cat /sys/block/sda/queue/scheduler noop deadline [cfq] with your 4.19 kernels. CONFIG_SCSI_MQ_DEFAULT=y is the default setting for 4.19. So it would have had to be set explicitly to "n" in the .config or de-activated via commandline (scsi_mod.use_blk_mq=0) to see that a/m output. So I was just wondering. Good point, you may be right, I can't check that now without going back in time, sorry! :) I must admit, at the time this happened, I was more focused on recovering from the panic it caused than anything else. I'm naturally wary to "upgrade" to 4.19 again. It's been an interesting read on here though, I've learned a lot. I can only re-emphasize the thanks from us out here in the rest of the world. (In reply to Steve "Stefan" Fan from comment #306) > > Well, that’s hell of a lesson to teach me how you shouldn’t be living on the > edge. Kernel developers would argue that the "stable" kernels published at kernel.org are indeed stable and ready to be used but what do I know? Especially when each of them get dozens of dozens of fixes during their support cycles. I'm going offtopic, so please disregard this comment. 
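On the recovery side, one way to look for silently corrupted files belonging to installed packages; the commands below are for Debian/Ubuntu-style systems, while RPM-based distributions have rpm -Va for the same purpose:
$ sudo dpkg --verify   # compares installed files against dpkg's recorded md5sums
$ sudo debsums -c      # needs the debsums package; lists changed files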
In reply to Rainer Fiebig ( #308 ) Dear Rainer, I've still got some of the truncated syslogs from the desktop system where I was running 4.19.3 from kernel.org, they all read: Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler noop registered Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler deadline registered Nov 22 09:28:35 capc85 kernel: [ 5.112712] io scheduler cfq registered (default) suggesting, but not proving, that indeed I had the problem corruption on a non-mq system. Or is the scheduler mentioned in the kernel log then overridden later? Or perhaps 4.19.3 isn't new enough to switch to mq and show that particular problem? Not sure. sorry I can't be of more help! the syslog after the corruption was a load of NULLs followed by the reboot into the stable (4.15.0-24) kernel ... (In reply to Rob Izzard from comment #311) > In reply to Rainer Fiebig ( #308 ) > > Dear Rainer, > > I've still got some of the truncated syslogs from the desktop system where I > was running 4.19.3 from kernel.org, they all read: > > Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler noop registered > Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler deadline > registered > Nov 22 09:28:35 capc85 kernel: [ 5.112712] io scheduler cfq registered > (default) > > suggesting, but not proving, that indeed I had the problem corruption on a > non-mq system. Or is the scheduler mentioned in the kernel log then > overridden later? Or perhaps 4.19.3 isn't new enough to switch to mq and > show that particular problem? Not sure. > "registered" doesn't mean it is enabled. Only that it exists. The above is logged on mq systems as well. (In reply to Rob Izzard from comment #311) > In reply to Rainer Fiebig ( #308 ) > > Dear Rainer, > > I've still got some of the truncated syslogs from the desktop system where I > was running 4.19.3 from kernel.org, they all read: > > Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler noop registered > Nov 22 09:28:35 capc85 kernel: [ 5.112690] io scheduler deadline > registered > Nov 22 09:28:35 capc85 kernel: [ 5.112712] io scheduler cfq registered > (default) > > suggesting, but not proving, that indeed I had the problem corruption on a > non-mq system. Or is the scheduler mentioned in the kernel log then > overridden later? Or perhaps 4.19.3 isn't new enough to switch to mq and > show that particular problem? Not sure. > > sorry I can't be of more help! the syslog after the corruption was a load of > NULLs followed by the reboot into the stable (4.15.0-24) kernel ... Rob, please see #227/228, it's well explained there. So the most likely scenario for you and others was imo that the kernel was configured with CONFIG_SCSI_MQ_DEFAULT=y and CONFIG_MQ_IOSCHED_DEADLINE=m CONFIG_MQ_IOSCHED_KYBER=m or # CONFIG_MQ_IOSCHED_DEADLINE is not set # CONFIG_MQ_IOSCHED_KYBER is not set And so > cat /sys/block/sd*/queue/scheduler [none] Which was - because of our little bug - like flirting with disaster - unwittingly, of course. ;) Had it been CONFIG_SCSI_MQ_DEFAULT=y and CONFIG_MQ_IOSCHED_DEADLINE=Y CONFIG_MQ_IOSCHED_KYBER=Y the result would have been cat /sys/block/sd*/queue/scheduler [mq-deadline] kyber none and you might have been off the hook - at least that's what I am seeing here. I was on the lucky side because I had # CONFIG_SCSI_MQ_DEFAULT is not set /* from old .config */ and so cat /sys/block/sd*/queue/scheduler noop deadline [cfq] I don't know why there is no mq-scheduler when the mq-schedulers are configured as modules. This is imo somewhat misleading. 
But I see this with 4.18.20, too, and 4.18 obviously doesn't have that fs-corruption issue. So long! Linux 4.20-rc6 has been released with the fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") Both fixes are also part of the stable Linux 4.19.8: 724ff9cbfe1f ("blk-mq: fix corruption with direct issue") 55cbeea76e76 ("blk-mq: punt failed direct issue to dispatch list") (In reply to Rafał Miłecki from comment #314) > Linux 4.20-rc6 has been released with the fixes: > ffe81d45322c ("blk-mq: fix corruption with direct issue") > c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") > > Both fixes are also part of the stable Linux 4.19.8: > 724ff9cbfe1f ("blk-mq: fix corruption with direct issue") > 55cbeea76e76 ("blk-mq: punt failed direct issue to dispatch list") I installed 4.20-rc6 and there has not been an incident for almost 24 hours. (In reply to Hao Wei Tee from comment #90) > (In reply to Artem S. Tashkinov from comment #87) > > Maybe one day someone will become truly invested in the kernel development > > process and we'll have proper QA/QC/unit testing/regression > > testing/fuzzying > > What we have now is not proper? syzkaller bot, Linux Test Project, > kernelci.org, xfstests, and more that I don't know of. Probably more than > any other OS. Speaking as syzkaller/syzbot developer, there are still hundreds of known bugs unfixed: https://syzkaller.appspot.com/#upstream + more in stable releases. We know how to make syzkaller find 10x of what it's finding now. But (1) most of my time is taken by dealing with all these existing bugs, (2) there is little usefulness in adding few hundreds more known bugs on top of a thousand already known bugs. There is _some_ fixing of _some_ reported bugs. For some subsystems it's actually close to perfect. But there are also some subsystems where it feels more like a coin toss if it's fixed or let drown in LKML. And for some subsystems it's more like sending to /dev/null. We need more fixing. We also need more API descriptions for syzkaller. We are very far from having complete coverage. Lots of subsystems are very complex and can be well tested only by the developers (not a single person on a side). We are seeing close to zero involvement from kernel developers in this area. We need more involvement. Systematic testing also requires investment in tooling: making then not produce false positives, not producing flakes, providing more relevant info for debugging, etc. We are seeing very little involvement from kernel developers here. Just making WARN not used for non-bugs took lots of effort and faced some confrontation. KMEMLEAK still produces false positives and is not used on syzbot. Stall/hang detectors are hard to tune, produce lots of flakes and frequently unactionable reports. We need more involvement from developers on this front too. That's not to say that we don't need better unit testing. kernelci.org does almost no testing today. xfstests/LTP are just test suites, passive entities, they don't do anything by themselves. Number of kernel bug fixes that add regression tests is very low overall. Dmitry, I'm not sure if Syzkaller could have detected this problem. And even if it could, it can't differentiate between some lint-style, "yeah, root can shoot your system in the head" versus a real "a normal user could lose data" problem. 
The claim that we need to fix all Syzkaller problems so we can concentrate on the real problems sounds nice in theory, but it's similar to a strict lint or checkpatch --warning run claiming that if we could fix all of the lint warnings, the kernel would be better --- and since we are ignoring thousands of lint warnings, we are ignoring (potentially) thousands of unfixed bugs.

If the goal is to prevent actual, real-life user data loss problems, or actual, real-life security exposures, we need to be able to distinguish between false-positive complaints and real, high-priority problems to fix. (Or companies need to commit vast amounts of headcount to addressing Syzkaller issues.)

And xfstests are *not* passive things; they are actively used, and they do catch regressions. If blktests had better coverage and better stress testing, perhaps it could have caught this.

But until Syzkaller can credibly catch this specific class of bug, not waste my time with huge numbers of "if root executes this root-only loop device ioctl which is only used by installers, it might hang the kernel" reports, and make it easy to wade through the huge numbers of low-priority issues, I only have so much time to spend on QA versus the feature development work that I am actually paid to do, and I'm going to spend the time I do have for QA in the way I think will best help users.

(In reply to Theodore Tso from comment #317)
> If blktests had better coverage and better stress
> testing, perhaps it could have caught this.

I don't think that's quite fair; I think there's a higher chance that xfstests would have caught this. It speaks to how hard (or impossible) it was to hit for most people that neither FS nor IO folks caught this in testing. I'm waiting for Omar to queue up the blktests case that DOES reproduce this, but it still requires a special VM setup for me to hit. As part of this exercise, though, I now run blktests on both test boxes and a test VM to ensure we don't hit something like this again, and to extend coverage in general.

Totally agree on the syzkaller comment. It's useful for finding mistakes in handling user input and hardening an API; it's not useful for catching this sort of issue. And that's fine, we have other tests for that.

To make it clear, my comment was not so much about this particular bug, but rather about the claim that we have syzkaller and a few other things and thus everything is great. Just having syzkaller does not make everything great.

If we take any particular non-trivial bug (especially a future bug), there is unfortunately no magic that can guarantee that this particular bug will be caught by existing measures (Meltdown is obviously a good example). However, if we do our best at discovering and fixing lots of bugs, we also fix lots of future critical bugs. A "problem" with this approach is that while fixing bugs that are hitting production hard right now has a large perceived impact for everybody, proactively fixing bugs does not have a large perceived impact for some. I would say in reality it's the other way around: a bug killed before the patch is mailed for the first time is the most impactful fix.

Re fuzzing vs unit testing: they are complementary. Unit tests can discover regressions faster, are suitable for presubmit testing, give clear reproducers, can reliably test the most complex scenarios, etc. Fuzzing, on the other hand, can give coverage not achievable with manual tests, finds corner and unexpected cases, doesn't have bias, etc. We need both.

Re critical/non-critical bugs:
It's a hard question, and I understand it involves priorities. My condensed thoughts are:

- Let's start with at least the critical ones (we don't even do this for some subsystems).
- Some developers are actually happy getting all bugs (notably, Eric Dumazet fixes everything in most parts of net right away).
- I guess for these people it can be a question of code quality ("I just don't want bugs in my part of the code, whatever they are").
- It can also be much cheaper (up to 100x) to just fix a bug rather than prove that it's not an important bug (the latter can be really, really expensive).
- We could restrict the fuzzer to not operate under root, but then we would miss lots of non-root-triggerable bugs too (e.g. you do the socket setup under root, but then the remote side gets remote code execution).

And FWIW, syzkaller can also find image corruptions, e.g. the bug where innocent unprivileged fallocate() calls blow up an ext4 image: https://groups.google.com/d/msg/syzkaller-bugs/ODpZRn8S7nU/4exgv8g3BAAJ

(In reply to Dmitry Vyukov from comment #319)
> To make it clear, my comment was not so much about this particular bug, but
> rather about the claim that we have syzkaller and a few other things and
> thus everything is great. Just having syzkaller does not make everything
> great.

I didn't say everything is great. I was just saying that this (quoted from comment #87)

> Maybe one day someone will become truly invested in the kernel development
> process and we'll have proper QA/QC/unit testing/regression testing/fuzzing,
> so that individuals won't have to sacrifice their data and time because
> kernel developers are mostly busy with adding new features and usually not
> really concerned with performance, security and stability of their code
> unless they are pointed at such issues.

is a very unfair and pessimistic view of what we currently have.
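(Jens mentions a few comments up that he now runs blktests on his test boxes as part of this exercise. For readers who want to do the same, a rough sketch of how the suite is usually invoked follows; the clone URL is the upstream blktests repository, while the device name is an assumption - point TEST_DEVS only at a scratch device, since the tests may write to it destructively.)

  git clone https://github.com/osandov/blktests.git
  cd blktests
  # blktests reads its settings from ./config; TEST_DEVS lists scratch
  # devices the tests are allowed to use.
  echo 'TEST_DEVS=(/dev/nvme1n1)' > config
  # Run the whole "block" group, or a single case such as ./check block/001
  sudo ./check block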
Created attachment 279431 [details]
dmesg 4.18.18 amdgpu.dc=0

My system was fine when I shut it down on Sunday Nov 11. Today, Nov 13, I booted 4.19.1 and built two new kernels, 4.20-rc2 and 4.18.18 (using a tmpfs, not the SSD or HDD), then booted into those kernels briefly (to test whether a different bug had been fixed). Finally I booted into 4.18.18 (setting amdgpu.dc=0 to work around my other bug), and after some moments experienced symptoms of filesystem corruption on opening an xterm:

sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
claude@eiskaffee:~$

I fixed it by extracting the relevant file from the Debian archive on a different machine and using `cat` with `bash` shell IO redirection to overwrite the corrupted shared library file on my problem machine.

Here are the relevant versions extracted from my syslog:

Nov 13 15:45:49 eiskaffee kernel: [    0.000000] Linux version 4.19.1 (claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 6 14:58:04 GMT 2018
Nov 13 18:44:12 eiskaffee kernel: [    0.000000] Linux version 4.20.0-rc2 (claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13 16:38:55 GMT 2018
Nov 13 18:45:00 eiskaffee kernel: [    0.000000] Linux version 4.18.18 (claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13 16:23:11 GMT 2018
Nov 13 18:46:13 eiskaffee kernel: [    0.000000] Linux version 4.18.18 (claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13 16:23:11 GMT 2018

mount says:

/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)

The machine in question is my production workstation, so I don't feel like testing anything that might result in data loss.
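(The repair described in the comment above - pulling a known-good copy of the library out of the Debian archive and writing it over the corrupted file with shell redirection - could look roughly like the sketch below. The package name libattr1 and the paths inside the package are assumptions based on Debian's usual layout, not details taken from the report; adjust them for your release and architecture.)

  # On a healthy Debian machine: fetch and unpack the package shipping libattr.so.1
  apt-get download libattr1
  dpkg-deb -x libattr1_*.deb /tmp/libattr

  # Move the extracted file to the affected machine (USB stick, scp, ...),
  # then overwrite the corrupted library in place; redirecting with cat
  # writes into the existing inode instead of creating a new file:
  cat /tmp/libattr/lib/x86_64-linux-gnu/libattr.so.1.1.0 > /lib/x86_64-linux-gnu/libattr.so.1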