Bug 219609
| Summary: | File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Stefan (linux-kernel) |
| Component: | NVMe | Assignee: | IO/NVME Virtual Default Assignee (io_nvme) |
| Status: | RESOLVED DOCUMENTED | | |
| Severity: | normal | CC: | akovacs, andrew, bgravato, carnil, kbusch, kernel, klaus.hader, linux, mario.limonciello, mathieu, mxilievski, regressions, reklamukibiras, rg-bugzilla.kernel.org, sam |
| Priority: | P3 | | |
| Hardware: | AMD | | |
| OS: | Linux | | |
| Kernel Version: | 6.11.5, most likely 6.5+ | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | attachment-4531-0.html | | |
| | logs.tar.bz2 (dmesg from before and after the BIOS update) | | |
Description
Stefan
2024-12-18 11:23:04 UTC
You mention the observation has occurred since kernel 6.5. Are you saying that this used to work in older kernels?

Bug 1: The oldest non-working Debian kernel is 6.3.7 (package linux-image-6.3.0-1-amd64); Debian kernel 6.3.5 (latest version of package linux-image-6.3.0-0-amd64) works. (I'm assuming it's not Debian-specific because the error also occurs in an upstream kernel, 6.11.5.) If you have patches, I could compile one of these versions and then try out the patches.

(Possible) Bug 2: Occurred with 6.1 kernels, but is very difficult to reproduce. So I'm not sure whether this error is limited to that kernel version.

Because I cannot test both bugs at the same time (the bugs occur only in the 1st M.2 socket and the PC is remote), we should first focus on Bug 1. If that bug is fixed, I would run a long-term test with the fixed kernel. (Because these are read errors, this can be done by a checksum test of existing files in the background.)

I have the same barebone (ASRock Deskmini X600) with a Ryzen 8600G CPU. I've run into similar issues. In my case I'm using btrfs on a Solidigm P44 Pro M.2 NVMe 1TB disk. After copying a large number of files (over 150K-300K files, variable sizes) to the btrfs partition and running btrfs scrub on the partition, it will report some files with checksum errors. If I put the disk in the secondary M.2 slot in the back, this problem does not occur.

RAM is 2x16GB Kingston Fury Impact DDR5 6400 SODIMM, but I've also tried a Crucial DDR5 5600 SODIMM with the same results. I ran a single memory stick, dual, different speeds, etc... all with the same result. RAM seems to not be the problem. I also had the same results with a WD NVMe SN750 500GB disk. I've tried both disks (running the same installation) on a different machine (Deskmini X300) and got no errors.

Only a few files get corrupted. On my last test, copying nearly 400K files, only 22 got corrupted. I mounted the btrfs partition with rescue=all and I was able to read the corrupted files. I compared a few to the original files and it looks like a big chunk of data in the middle of the files was altered (contiguous blocks). So it's not just a bit flip here and there... a big portion of the file gets messed up (in contiguous blocks).

The system is running Debian stable with some packages from backports, namely the kernel. I got the same results with kernels 6.10.5 and 6.11.10 (from bookworm-backports) and 6.12.6 (from testing). I also got the same results with BIOS firmware 4.03 and 4.08 (downloaded from the ASRock website). I tried different sources for the files: copying over LAN using either rsync over ssh or a restic backup restore, but also from a locally installed SATA SSD with the same files. The same files copied to the SATA disk (also btrfs) do not get corrupted. Using the secondary M.2 slot (gen4x4) also seems to be free of errors. It only happens when the disk is in the main M.2 slot (gen5x4).

I thought this could be a faulty M.2 slot on my board, but after seeing other reports of a similar problem, I'm now more convinced that this may be either a BIOS/firmware issue or a kernel issue, or a combination of both. Anyway, I thought I'd add my report here hoping it can help. I can run some more tests if needed. In terms of reproducibility, I can reproduce this fairly consistently given I copy a large enough sample of files (my "sample" is my personal files from my home dir on my older PC, which are over 700K files).
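For reference, a minimal sketch of that copy-and-scrub round trip (the paths /srv/testdata and /mnt/nvme are placeholders, not taken from this report; any sufficiently large file tree and the mount point of the affected btrfs filesystem will do):

$ rsync -a /srv/testdata/ /mnt/nvme/testdata/   # copy a large tree onto the NVMe btrfs filesystem
$ sync
$ sudo btrfs scrub start -B /mnt/nvme           # -B: run in the foreground and print a summary
$ sudo dmesg | grep -i csum                     # btrfs logs "csum failed" lines naming the affected files

Files flagged there can then be compared byte-by-byte against the originals (e.g. with cmp -l) to see whether whole contiguous ranges differ, as described above.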
Copying 150K-300K files (20-60GB of data) is usually enough to cause checksum errors on some files when running btrfs scrub (it seems to always be different files). With the disk on the secondary M.2 slot I copied all 700K+ files (twice, I think) and got no errors. I haven't tried older kernel versions. I can try 6.1.x from Debian stable, but I think this has issues with the amdgpu driver and can eventually freeze the system with some amdgpu error, so it may not be very reliable for testing. Let me know if you have any questions and I'll try to answer.

With the help of TJ from the Debian kernel team ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 ), at least a workaround could be found. The bug is triggered by the patch "nvme-pci: clamp max_hw_sectors based on DMA optimized limitation" (see https://lore.kernel.org/linux-iommu/20230503161759.GA1614@lst.de/ ), introduced in 6.3.7.

To examine the situation, I added this debug info (all files are located in `drivers/nvme/host`):

> --- core.c.orig 2025-01-03 14:27:38.220428482 +0100
> +++ core.c 2025-01-03 12:56:34.503259774 +0100
> @@ -3306,6 +3306,7 @@
>  		max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
>  	else
>  		max_hw_sectors = UINT_MAX;
> +	dev_warn(ctrl->device, "id->mdts=%d, max_hw_sectors=%d, ctrl->max_hw_sectors=%d\n", id->mdts, max_hw_sectors, ctrl->max_hw_sectors);
>  	ctrl->max_hw_sectors =
>  		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

6.3.6 (the last version without the mentioned patch and without data corruption) says:

> [ 127.196212] nvme nvme0: id->mdts=7, max_hw_sectors=1024, ctrl->max_hw_sectors=16384
> [ 127.203530] nvme nvme0: allocated 40 MiB host memory buffer.

6.3.7 (the first version with the mentioned patch and with data corruption) says:

> [ 46.436384] nvme nvme0: id->mdts=7, max_hw_sectors=1024, ctrl->max_hw_sectors=256
> [ 46.443562] nvme nvme0: allocated 40 MiB host memory buffer.

After I reverted the mentioned patch (

> --- pci.c.orig 2025-01-03 14:28:05.944819822 +0100
> +++ pci.c 2025-01-03 12:54:37.014579093 +0100
> @@ -3042,7 +3042,8 @@
>  	 * over a single page.
>  	 */
>  	dev->ctrl.max_hw_sectors = min_t(u32,
> -		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +//		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +		NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
>  	dev->ctrl.max_segments = NVME_MAX_SEGS;
>
>  	/*

), 6.11.5 (I used this version because the sources were lying around) works and says:

> [ 1.251370] nvme nvme0: id->mdts=7, max_hw_sectors=1024, ctrl->max_hw_sectors=16384
> [ 1.261168] nvme nvme0: allocated 40 MiB host memory buffer.

Thus, the corruption occurs if `ctrl->max_hw_sectors` is set to another (smaller) value than the one defined by `id->mdts`. If this is supposed to be allowed, the mentioned patch is not the (root) cause, but reverting it is at least a workaround.

I forwarded the problem by mail[1]: https://lore.kernel.org/all/401f2c46-0bc3-4e7f-b549-f868dc1834c5@leemhuis.info/

Bruno, Stefan, can we CC you on further mails regarding this? This would expose your email address to the public.

[1] Reminder: bugzilla.kernel.org is usually a bad place to report bugs, as mentioned on https://docs.kernel.org/admin-guide/reporting-issues.html

Ohh, and did anyone check whether mainline is still affected?

Even with the patch reverted, the host can still send IO that aligns to the smaller sized limits anyway, so it sounds like the patch that's been bisected to may have merely exposed an NVMe controller bug.

Hi, yes you can CC me. I didn't try the patch mentioned above.
This is my (new) daily driver and I needed to get the machine up and running as quickly as possible. I went with the workaround of putting the disk in the secondary M.2 slot (gen4 vs gen5 on the main slot). No problems so far. The latest kernel I tried was 6.12.6 and it still had the problem. I should be able to put my old disk (WD Black SN750) in the main slot and run some more tests with the mainline kernel when I get the chance.

Are all these reports using the same model of NVMe controller? Or is this happening across a variety of vendors?

My email address "linux-kernel@simg.de" can be CC'd publicly. But it is an alias, i.e. I cannot reply directly from it. That's why I prefer the bug tracker.

According to a forum of the German IT magazine c't, the bug was also noticed by several other people: https://www.heise.de/forum/c-t/Wuensch-dir-mal-wieder-was/X600-btrfs-scrub-uncorrectable-errors/thread-7689447 . (That hardware was recommended by that magazine.) Furthermore, it seems that the errors do not occur with all SSDs. I'm trying to figure out whether this has something to do with the MDTS setting (which can be queried using the `nvme id-ctrl` command). The problem also occurs in 6.13.0-rc6 (unless I revert the patch introduced in 6.3.7).

Just a few thoughts (I'm not an NVMe or kernel developer): I would not expect that reducing the MDTS (= max data transfer size) limit (which is what the patch does) should cause such errors. The only explanation is that one component still assumes that up to the amount reported by MDTS (a setting of the SSD) can be used. If that assumption is valid (the NVMe specs should answer this question), the patch is responsible for the problems. Otherwise, the root cause is the component that does not take the reduced limit into account.

While the 6.13 kernel was compiling I searched the kernel sources for the term "mdts". It seems that this setting is only used to initialize `max_hw_sectors` of the `nvme_ctrl` struct. If that is correct, the other component that causes the problem is probably some kind of firmware.

Hi, I can also reliably reproduce the data corruption with the following setup:

Deskmini X600
AMD Ryzen 7 8700G
2x 16 GB Kingston-FURY KF564S38IBK2-32
Samsung 990 Pro 2 TB NVMe SSD, latest firmware 4B2QJXD7, installed in the primary NVMe slot
Filesystem: ext4
OS: Ubuntu 24.10 with kernel 6.11.0-13.14

When copying ~60 GB of data to the NVMe, some files always get corrupted. A diff between the source and the copied files shows that contiguous chunks of < 3 MB in the middle of the files are either filled with zeros or garbage data.

Also affected: Ubuntu 24.04 with kernel 6.8.0. Not affected: Debian 12 with kernel 6.1.119-1.

The bad news: Applying the patch from comment #4 (using dma_max_mapping_size() instead of dma_opt_mapping_size() to set max_hw_sectors) to kernel 6.11.0-13.14 did not solve the problem in my case; the data corruption still occurs.

6.11.0-13.14 with patch and corruption:

> [ 1.429438] nvme nvme0: pci function 0000:02:00.0
> [ 1.433783] nvme nvme0: id->mdts=9, max_hw_sectors=4096, ctrl->max_hw_sectors=16384
> [ 1.433787] nvme nvme0: D3 entry latency set to 10 seconds
> [ 1.438308] nvme nvme0: 16/0/0 default/read/poll queues

Because it might be a firmware issue, I updated the BIOS/UEFI and installed the latest firmware blobs (version 20241210 from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ ): no success. Furthermore I found a setting where the PCIe speed could be reduced. Changing this value to Gen 3 had no effect.
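For reference, the relation between the two numbers in these log lines: nvme_mps_to_sectors() turns id->mdts into a 512-byte sector count as 1 << (mdts + page_shift - 9). Assuming the common case of MPSMIN = 0, i.e. a 4 KiB controller page (page_shift = 12), a quick shell check reproduces the values quoted above:

$ sudo nvme id-ctrl /dev/nvme0 | grep -i mdts    # the device-reported MDTS
$ echo $(( 1 << (7 + 12 - 9) ))                  # mdts=7 -> 1024 sectors (512 KiB), the Lexar case
1024
$ echo $(( 1 << (9 + 12 - 9) ))                  # mdts=9 -> 4096 sectors (2 MiB), the Samsung case
4096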
> The bad news: > Applying the patch from comment #4 (using dma_max_mapping_size() instead of > dma_opt_mapping_size() to set max_hw_sectors) > to kernel 6.11.0-13.14 did not solve the problem in my case, the data > corruption still occurs. Strange, especially because 6.1 is working. You might try to replace `dma_max_mapping_size(&pdev->dev) >> 9` by `min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9, 1024)`. This will limit max_hw_sectors to 1024 sectors, the value which works for me. I just backported the patch from 6.3.7 to 6.1.112. The corruption now also occurs in that kernel. So for me, the problem connected to the patch. If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, and now Samsung NVMe's? Unless they're all using the same 3rd party controller, like Phison or SMI, then I guess we'd have some trouble saying it's a vendor problem. Or perhaps we're now mixing multiple problems at this point, considering one patch fixes some but not others. Do these drives have volatile write caches? You can check with # cat /sys/block/nvme0n1/queue/fua A non-zero value means "yes". Replace "nvme0n1" with whatever your device is named, like nvme1n1, nvme2n1, etc... Is ext4 used in the other observations too? If not, what other filesystems are used? (In reply to Keith Busch from comment #13) > If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, > and now Samsung NVMe's? In my case it was Solidigm P44 Pro 1TB and WD Black SN750 500GB > Do these drives have volatile write caches? You can check with > > # cat /sys/block/nvme0n1/queue/fua > I get 1, so yes. > Is ext4 used in the other observations too? If not, what other filesystems > are used? In my case I was using btrfs. Running btrfs scrub gave me some checksum errors and that's how I found out files were getting corrupted... If I was on ext4 it could have taken months for me to find out... The somewhat odd thing is that the same disks on the secondary M.2 nvme slot work fine with no error. The only difference in the specs between the two M.2 slots is that one is gen5x4 (the main one, which is the one with problems) and the other is gen4x4 (this works fine, no errors). as a test, could you turn off the volatile write cache? # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0 Your write performance may be pretty bad, but it's just a temporary test to see if the problem still occurs without a volatile cache. A power cycle reverts the setting back to the default state. Sorry, depending on the nvme version, the value parameter may be "-V" (capital "V"). Hi, due to Thorstens hints, I'm trying to reply to both, the bug tracker and the mailing list. > --- Comment #13 from Keith Busch (kbusch@kernel.org) --- > If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, > and now Samsung NVMe's? The Kingston read errors may be something different. They are described in detail in messages #108 and #113 of the Debian Bug Tracker https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 With the Kington, I never saw the write errors that occur with Lexar and Samsung on newer Kernels (and which are easy to reproduce). (ATM I cannot provide test results from the Kingston SSD because the Lexar is installed, the PC is installed remotely and in use. Thus I can't swap the SSDS that often.) > # cat /sys/block/nvme0n1/queue/fua Returns "1" > --- Comment #15 from Keith Busch (kbusch@kernel.org) --- as a test, > could you turn off the volatile write cache? 
> > # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0 Had to modify that a little bit: $ nvme get-feature /dev/nvme0n1 -f 6 get-feature:0x06 (Volatile Write Cache), Current value:0x00000001 $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0 set-feature:0x06 (Volatile Write Cache), value:00000000, cdw12:00000000, save:0 $ nvme get-feature /dev/nvme0n1 -f 6 get-feature:0x06 (Volatile Write Cache), Current value:00000000 Corruptions disappear (under 6.13.0-rc6) if volatile write cache is disabled (and appear again if I turn it on with "-v 1"). But, lspci says I have a Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express]) Note the "DRAM-less". This is confirmed by https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of this, the SSD has a (*non-*volatile) SLC write cache and it uses 40 MB Host-Memory-Buffer (HMB). May there be an issue with the HMB allocation/usage ? Is the mainboard firmware involved into HMB allocation/usage ? That would explain, why volatile write caching via HMB works in the 2nd M.2 socket. BTW, controller is MaxioTech MAP1602A, which is different from the Samsung controllers. > --- Comment #14 from Bruno Gravato (bgravato@gmail.com) --- The only > difference in the specs between the two M.2 slots is that one is > gen5x4 (the main one, which is the one with problems) and the other > is gen4x4 (this works fine, no errors). AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of the CPU. On my PC, it runs in Gen4 mode (limited by SSD). The secondary M.2 socket on the rear side is probably connected to PCIe lanes which are usually used by a chipset -- but that socket works. Regards Stefan Hi, lspci says: 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] It uses volatile write cache: > cat /sys/block/nvme0n1/queue/fua > 1 Test 1: Disabling volatile write cache via nvme-cli => no corruption occurs Test 2: volatile write cache enabled, using the suggestion from comment #12 > dev->ctrl.max_hw_sectors = min_t(u32, > NVME_MAX_KB_SZ << 1, min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9, > 1024)); => corruption still occurs > [ 0.815340] nvme nvme0: id->mdts=9, max_hw_sectors=4096, > ctrl->max_hw_sectors=1024 Created attachment 307463 [details] attachment-4531-0.html Hi, I can reply via email, that's not a problem. I'll try to run some more tests when I get the chance (it's been a very busy week, sorry). Besides the volatile write cache test, any other test I should try? Regarding the M.2 slots. I believe this motherboard has no chipset. So both slots should be connected directly to the CPU (mine is Ryzen 8600G), although they might be connecting to different parts of the CPU, right? I guess that can make a difference. My disks are gen4 as well. Bruno On Thu, 9 Jan 2025 at 15:44, Stefan <linux-kernel@simg.de> wrote: > Hi, > > due to Thorstens hints, I'm trying to reply to both, the bug tracker and > the mailing list. > > > --- Comment #13 from Keith Busch (kbusch@kernel.org) --- > > If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, > > and now Samsung NVMe's? > > The Kingston read errors may be something different. They are described > in detail in messages #108 and #113 of the Debian Bug Tracker > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 > > With the Kington, I never saw the write errors that occur with Lexar and > Samsung on newer Kernels (and which are easy to reproduce). 
> > (ATM I cannot provide test results from the Kingston SSD because the > Lexar is installed, the PC is installed remotely and in use. Thus I > can't swap the SSDS that often.) > > > # cat /sys/block/nvme0n1/queue/fua > > Returns "1" > > > --- Comment #15 from Keith Busch (kbusch@kernel.org) --- as a test, > > could you turn off the volatile write cache? > > > > # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0 > Had to modify that a little bit: > > $ nvme get-feature /dev/nvme0n1 -f 6 > get-feature:0x06 (Volatile Write Cache), Current value:0x00000001 > $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0 > set-feature:0x06 (Volatile Write Cache), value:00000000, > cdw12:00000000, save:0 > $ nvme get-feature /dev/nvme0n1 -f 6 > get-feature:0x06 (Volatile Write Cache), Current value:00000000 > > Corruptions disappear (under 6.13.0-rc6) if volatile write cache is > disabled (and appear again if I turn it on with "-v 1"). > > But, lspci says I have a > > Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD > (DRAM-less) (rev 01) (prog-if 02 [NVM Express]) > > Note the "DRAM-less". This is confirmed by > https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of > this, the SSD has a (*non-*volatile) SLC write cache and it uses 40 MB > Host-Memory-Buffer (HMB). > > May there be an issue with the HMB allocation/usage ? > > Is the mainboard firmware involved into HMB allocation/usage ? That > would explain, why volatile write caching via HMB works in the 2nd M.2 > socket. > > BTW, controller is MaxioTech MAP1602A, which is different from the > Samsung controllers. > > > --- Comment #14 from Bruno Gravato (bgravato@gmail.com) --- The only > > difference in the specs between the two M.2 slots is that one is > > gen5x4 (the main one, which is the one with problems) and the other > > is gen4x4 (this works fine, no errors). > > AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of > the CPU. On my PC, it runs in Gen4 mode (limited by SSD). > > The secondary M.2 socket on the rear side is probably connected to PCIe > lanes which are usually used by a chipset -- but that socket works. > > Regards Stefan > Hi, (resending in text-only mode, because mailing lists don't like HMTL emails... sorry to those getting this twice) I can reply via email, that's not a problem. I'll try to run some more tests when I get the chance (it's been a very busy week, sorry). Besides the volatile write cache test, any other test I should try? Regarding the M.2 slots. I believe this motherboard has no chipset. So both slots should be connected directly to the CPU (mine is Ryzen 8600G), although they might be connecting to different parts of the CPU, right? I guess that can make a difference. My disks are gen4 as well. Bruno On Thu, 9 Jan 2025 at 15:44, Stefan <linux-kernel@simg.de> wrote: > > Hi, > > due to Thorstens hints, I'm trying to reply to both, the bug tracker and > the mailing list. > > > --- Comment #13 from Keith Busch (kbusch@kernel.org) --- > > If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, > > and now Samsung NVMe's? > > The Kingston read errors may be something different. They are described > in detail in messages #108 and #113 of the Debian Bug Tracker > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 > > With the Kington, I never saw the write errors that occur with Lexar and > Samsung on newer Kernels (and which are easy to reproduce). 
> > (ATM I cannot provide test results from the Kingston SSD because the > Lexar is installed, the PC is installed remotely and in use. Thus I > can't swap the SSDS that often.) > > > # cat /sys/block/nvme0n1/queue/fua > > Returns "1" > > > --- Comment #15 from Keith Busch (kbusch@kernel.org) --- as a test, > > could you turn off the volatile write cache? > > > > # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0 > Had to modify that a little bit: > > $ nvme get-feature /dev/nvme0n1 -f 6 > get-feature:0x06 (Volatile Write Cache), Current value:0x00000001 > $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0 > set-feature:0x06 (Volatile Write Cache), value:00000000, > cdw12:00000000, save:0 > $ nvme get-feature /dev/nvme0n1 -f 6 > get-feature:0x06 (Volatile Write Cache), Current value:00000000 > > Corruptions disappear (under 6.13.0-rc6) if volatile write cache is > disabled (and appear again if I turn it on with "-v 1"). > > But, lspci says I have a > > Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD > (DRAM-less) (rev 01) (prog-if 02 [NVM Express]) > > Note the "DRAM-less". This is confirmed by > https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of > this, the SSD has a (*non-*volatile) SLC write cache and it uses 40 MB > Host-Memory-Buffer (HMB). > > May there be an issue with the HMB allocation/usage ? > > Is the mainboard firmware involved into HMB allocation/usage ? That > would explain, why volatile write caching via HMB works in the 2nd M.2 > socket. > > BTW, controller is MaxioTech MAP1602A, which is different from the > Samsung controllers. > > > --- Comment #14 from Bruno Gravato (bgravato@gmail.com) --- The only > > difference in the specs between the two M.2 slots is that one is > > gen5x4 (the main one, which is the one with problems) and the other > > is gen4x4 (this works fine, no errors). > > AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of > the CPU. On my PC, it runs in Gen4 mode (limited by SSD). > > The secondary M.2 socket on the rear side is probably connected to PCIe > lanes which are usually used by a chipset -- but that socket works. > > Regards Stefan Hi, I did some more tests. At first I retrieved the following values under debian > Debian 12, Kernel 6.1.119, no corruption > cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb > 2048 > > cat /sys/class/block/nvme0n1/queue/max_sectors_kb > 1280 > > cat /sys/class/block/nvme0n1/queue/max_segments > 127 > > cat /sys/class/block/nvme0n1/queue/max_segment_size > 4294967295 To achieve the same values on Kernel 6.11.0-13, I had to make the following changes to drivers/nvme/host/pci.c > --- pci.c.org 2024-09-15 16:57:56.000000000 +0200 > +++ pci.c 2025-01-13 21:18:54.475903619 +0100 > @@ -41,8 +41,8 @@ > * These can be higher, but we need to ensure that any command doesn't > * require an sg allocation that needs more than a page of data. > */ > -#define NVME_MAX_KB_SZ 8192 > -#define NVME_MAX_SEGS 128 > +#define NVME_MAX_KB_SZ 4096 > +#define NVME_MAX_SEGS 127 > #define NVME_MAX_NR_ALLOCATIONS 5 > > static int use_threaded_interrupts; > @@ -3048,8 +3048,8 @@ > * Limit the max command size to prevent iod->sg allocations going > * over a single page. 
> */ > - dev->ctrl.max_hw_sectors = min_t(u32, > - NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9); > + //dev->ctrl.max_hw_sectors = min_t(u32, > + // NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9); > dev->ctrl.max_segments = NVME_MAX_SEGS; > > /* So basically, dev->ctl.max_hw_sectors stays zero, so that in core.c it is set to the value of nvme_mps_to_sectors(ctrl, id->mdts) (=> 4096 in my case) > if (id->mdts) > max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts); > else > max_hw_sectors = UINT_MAX; > ctrl->max_hw_sectors = > min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); But that alone was not enough: Tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still resulted in corruptions. They only went away after reverting this value back to 127 (the value from kernel 6.1). Additional logging to get the values of the following statements > (dma_opt_mapping_size(&pdev->dev) >> 9) = 256 > (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!] @Stefan, can you check which value NVME_MAX_SEGS had in your tests? It also seems to have an influence. Best regards, Matthias (In reply to mbe from comment #21) > To achieve the same values on Kernel 6.11.0-13, Please clarify: what upstream kernel does that distro-specifc version number refer to? And is that a kernel that is vanilla or close to upstream? And why use a EOL series anyway? It's best to use a fresh mainline for all testing, except when data from older kernels is required. I finally got the chance to run some more tests with some interesting and unexpected results... I put another disk (WD Black SN750) in the main M.2 slot (the problematic one), but kept my main disk (Solidigm P44 Pro) in the secondary M.2 slot (where it doesn't have any issues). I rerun my test: step 1) copy a large number of files to the WD disk (main slot), step 2) run btrfs scrub on it and expect some checksum errors To my surprise there were no errors! I tried it twice with different kernels (6.2.6 and 6.11.5) and booting from either disk (I have linux installations on both). Still no errors. I then removed the Solidigm disk from the secondary and kept the WD disk in the main M.2 slot. Rerun my tests (on kernel 6.11.5) and bang! btrfs scrub now detected quite a few checksum errors! I then tried disabling volatile write cache with "nvme set-feature /dev/nvme0 -f 6 -v 0" "nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but /sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to turn into 0? I re-run my test, but I still got checksum errors on btrfs scrub. So disabling volatile write cache (assuming I did it correctly) didn't make a difference in my case. I put the Solidigm disk back into the secondary slot, booted and rerun the test on the WD disk (main slot) just to be triple sure and still no errors. So it looks like the corruption only happens if only the main M.2 slot is occupied and the secondary M.2 slot is free. With two nvme disks (one on each M.2 slot), there were no errors at all. Stefan, did you ever try running your tests with 2 nvme disks installed on both slots? Or did you use only one slot at a time? Bruno On 15.01.25 07:37, Bruno Gravato wrote: > I finally got the chance to run some more tests with some interesting > and unexpected results... FWIW, I briefly looked into the issue in between as well and can reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main M.2 slot of my DeskMini X600 using btrfs on a mainline kernel with a config from Fedora rawhide. 
So what can we that are affected by the problem do to narrow it down? What does it mean that disabling the NVMe devices's write cache often but apparently not always helps? It it just reducing the chance of the problem occurring or accidentally working around it? hch initially brought up that swiotlb seems to be used. Are there any BIOS setup settings we should try? I tried a few changes yesterday, but I still get the "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)" message in the log and not a single line mentioning DMAR. Ciao, Thorsten [1] see start of this thread and/or https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details > I put another disk (WD Black SN750) in the main M.2 slot (the > problematic one), but kept my main disk (Solidigm P44 Pro) in the > secondary M.2 slot (where it doesn't have any issues). > I rerun my test: step 1) copy a large number of files to the WD disk > (main slot), step 2) run btrfs scrub on it and expect some checksum > errors > To my surprise there were no errors! > I tried it twice with different kernels (6.2.6 and 6.11.5) and booting > from either disk (I have linux installations on both). > Still no errors. > > I then removed the Solidigm disk from the secondary and kept the WD > disk in the main M.2 slot. > Rerun my tests (on kernel 6.11.5) and bang! btrfs scrub now detected > quite a few checksum errors! > > I then tried disabling volatile write cache with "nvme set-feature > /dev/nvme0 -f 6 -v 0" > "nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but > /sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to > turn into 0? > > I re-run my test, but I still got checksum errors on btrfs scrub. So > disabling volatile write cache (assuming I did it correctly) didn't > make a difference in my case. > > I put the Solidigm disk back into the secondary slot, booted and rerun > the test on the WD disk (main slot) just to be triple sure and still > no errors. > > So it looks like the corruption only happens if only the main M.2 slot > is occupied and the secondary M.2 slot is free. > With two nvme disks (one on each M.2 slot), there were no errors at all. > > Stefan, did you ever try running your tests with 2 nvme disks > installed on both slots? Or did you use only one slot at a time? 
$ journalctl -k | grep -i -e DMAR -e IOMMU -e AMD-Vi -e SWIOTLB AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0 iommu: Default domain type: Translated iommu: DMA domain TLB invalidation policy: lazy mode pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported pci 0000:00:01.0: Adding to iommu group 0 pci 0000:00:01.3: Adding to iommu group 1 pci 0000:00:02.0: Adding to iommu group 2 pci 0000:00:02.3: Adding to iommu group 3 pci 0000:00:03.0: Adding to iommu group 4 pci 0000:00:04.0: Adding to iommu group 5 pci 0000:00:08.0: Adding to iommu group 6 pci 0000:00:08.1: Adding to iommu group 7 pci 0000:00:08.2: Adding to iommu group 8 pci 0000:00:08.3: Adding to iommu group 9 pci 0000:00:14.0: Adding to iommu group 10 pci 0000:00:14.3: Adding to iommu group 10 pci 0000:00:18.0: Adding to iommu group 11 pci 0000:00:18.1: Adding to iommu group 11 pci 0000:00:18.2: Adding to iommu group 11 pci 0000:00:18.3: Adding to iommu group 11 pci 0000:00:18.4: Adding to iommu group 11 pci 0000:00:18.5: Adding to iommu group 11 pci 0000:00:18.6: Adding to iommu group 11 pci 0000:00:18.7: Adding to iommu group 11 pci 0000:01:00.0: Adding to iommu group 12 pci 0000:02:00.0: Adding to iommu group 13 pci 0000:03:00.0: Adding to iommu group 14 pci 0000:03:00.1: Adding to iommu group 15 pci 0000:03:00.2: Adding to iommu group 16 pci 0000:03:00.3: Adding to iommu group 17 pci 0000:03:00.4: Adding to iommu group 18 pci 0000:03:00.6: Adding to iommu group 19 pci 0000:04:00.0: Adding to iommu group 20 pci 0000:04:00.1: Adding to iommu group 21 pci 0000:05:00.0: Adding to iommu group 22 AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC AMD-Vi: Interrupt remapping enabled AMD-Vi: Virtual APIC enabled PCI-DMA: Using software bounce buffering for IO (SWIOTLB) perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank). Hi, (replying to both, the mailing list and the kernel bug tracker) Am 15.01.25 um 07:37 schrieb Bruno Gravato: > I then removed the Solidigm disk from the secondary and kept the WD > disk in the main M.2 slot. Rerun my tests (on kernel 6.11.5) and > bang! btrfs scrub now detected quite a few checksum errors! > > I then tried disabling volatile write cache with "nvme set-feature > /dev/nvme0 -f 6 -v 0" "nvme get-feature /dev/nvme0 -f 6" confirmed it > was disabled, but /sys/block/nvme0n1/queue/fua still showed 1... Was > that supposed to turn into 0? You can check this using `nvme get-feature /dev/nvme0n1 -f 6` > So it looks like the corruption only happens if only the main M.2 > slot is occupied and the secondary M.2 slot is free. With two nvme > disks (one on each M.2 slot), there were no errors at all. > > Stefan, did you ever try running your tests with 2 nvme disks > installed on both slots? Or did you use only one slot at a time? No, I only tested these configurations: 1. 1st M.2: Lexar; 2nd M.2: empty (Easy to reproduce write errors) 2. 1st M.2: Kingsten; 2nd M.2: Lexar (Difficult to reproduce read errors with 6.1 Kernel, but no issues with a newer ones within several month of intense use) I'll swap the SSD's soon. Then I will also test other configurations and will try out a third SSD. If I get corruption with other SSD's, I will check which modifications help. Note that I need both SSD's (configuration 2) in about one week and cannot change this for about 3 months (already announced this in December). Thus, if there are things I shall test with configuration 1, please inform me quickly. 
Just as remainder (for those who did not read the two bug trackers): I tested with `f3` (a utility used to detect scam disks) on ext4. `f3` reports overwritten sectors. In configuration 1 this are write errors (appear if I read again). (If no other SSD-intense jobs are running), the corruption do not occur in the last files, and I never noticed file system corruptions, only file contents is corrupt. (This is probably luck, but also has something to do with the journal and the time when file system information are written.) Am 13.01.25 um 22:01 schrieb bugzilla-daemon@kernel.org: > https://bugzilla.kernel.org/show_bug.cgi?id=219609 > > --- Comment #21 from mbe --- > Hi, > > I did some more tests. At first I retrieved the following values under debian > >> Debian 12, Kernel 6.1.119, no corruption >> cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb >> 2048 >> >> cat /sys/class/block/nvme0n1/queue/max_sectors_kb >> 1280 >> >> cat /sys/class/block/nvme0n1/queue/max_segments >> 127 >> >> cat /sys/class/block/nvme0n1/queue/max_segment_size >> 4294967295 > > To achieve the same values on Kernel 6.11.0-13, I had to make the following > changes to drivers/nvme/host/pci.c > >> --- pci.c.org 2024-09-15 16:57:56.000000000 +0200 >> +++ pci.c 2025-01-13 21:18:54.475903619 +0100 >> @@ -41,8 +41,8 @@ >> * These can be higher, but we need to ensure that any command doesn't >> * require an sg allocation that needs more than a page of data. >> */ >> -#define NVME_MAX_KB_SZ 8192 >> -#define NVME_MAX_SEGS 128 >> +#define NVME_MAX_KB_SZ 4096 >> +#define NVME_MAX_SEGS 127 >> #define NVME_MAX_NR_ALLOCATIONS 5 >> >> static int use_threaded_interrupts; >> @@ -3048,8 +3048,8 @@ >> * Limit the max command size to prevent iod->sg allocations going >> * over a single page. >> */ >> - dev->ctrl.max_hw_sectors = min_t(u32, >> - NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9); >> + //dev->ctrl.max_hw_sectors = min_t(u32, >> + // NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9); >> dev->ctrl.max_segments = NVME_MAX_SEGS; >> >> /* > > So basically, dev->ctl.max_hw_sectors stays zero, so that in core.c it is set > to the value of nvme_mps_to_sectors(ctrl, id->mdts) (=> 4096 in my case) This has the same effect as setting it to `dma_max_mapping_size(...)` >> if (id->mdts) >> max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts); >> else >> max_hw_sectors = UINT_MAX; >> ctrl->max_hw_sectors = >> min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); > > But that alone was not enough: > Tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still resulted in > corruptions. > They only went away after reverting this value back to 127 (the value from > kernel 6.1). That change was introduced in 6.3-rc1 using a patch "nvme-pci: place descriptor addresses in iod" ( https://github.com/torvalds/linux/commit/7846c1b5a5db8bb8475603069df7c7af034fd081 ) This patch has no effect for me, i.e. unmodified kernels work up to 6.3.6. The patch that triggers the corruptions is the one introduced in 6.3.7 which replaces `dma_max_mapping_size(...)` by `dma_opt_mapping_size(...)`. If I apply this change to 6.1, the corruptions also occur in that kernel. Matthias, did you checked what happens is you only modify NVME_MAX_SEGS (and leave the `dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);`) > Additional logging to get the values of the following statements >> (dma_opt_mapping_size(&pdev->dev) >> 9) = 256 >> (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!] 
> > @Stefan, can you check which value NVME_MAX_SEGS had in your tests? > It also seems to have an influence. "128", see above. Regards Stefan On Wed, 15 Jan 2025 at 10:48, Stefan <linux-kernel@simg.de> wrote: > > Stefan, did you ever try running your tests with 2 nvme disks > > installed on both slots? Or did you use only one slot at a time? > > No, I only tested these configurations: > > 1. 1st M.2: Lexar; 2nd M.2: empty > (Easy to reproduce write errors) > 2. 1st M.2: Kingsten; 2nd M.2: Lexar > (Difficult to reproduce read errors with 6.1 Kernel, but no issues > with a newer ones within several month of intense use) > > I'll swap the SSD's soon. Then I will also test other configurations and > will try out a third SSD. If I get corruption with other SSD's, I will > check which modifications help. So it may be that the reason you no longer had errors in config 2 is not because you put a different SSD in the 1st slot, but because you now have the 2nd slot also occupied, like me. If yours behaves like mine, I'd expect that if you swap the disks in config 2, that you won't have any errors as well... I'm very curious to see the result of that test! Just to recap the results of my tests: Setup 1 Main slot: Solidigm Secondary slot: (empty) Result: BAD - corruption happens Setup 2 Main slot: (empty) Secondary slot: Solidigm Result: GOOD - no corruption Setup 3 Main slot: WD Secondary slot: (empty) Result: BAD - corruption happens Setup 4 Main slot: WD Secondary slot: Solidigm Result: GOOD - no corruption (on either disk) So, in my case, it looks like the corruption only happens if I have only 1 disk installed in the main slot and the secondary slot is empty. If I have the two slots occupied or only the secondary slot occupied, there are no more errors. Bruno Hi,
Am 15.01.25 um 14:14 schrieb Bruno Gravato:
> If yours behaves like mine, I'd expect that if you swap the disks in
> config 2, that you won't have any errors as well...
yeah, I would just need to plug something into the 2nd M.2 socket. But
that can't be done remotely. I will do that on the weekend or next week.
BTW, is there a kernel parameter to ignore an NVMe/PCI device? If the
corruptions appear again after disabling the 2nd SSD, it is more likely
that it is a kernel problem, e.g. a driver writing to memory reserved
for some other driver/component. Such a bug may only occur under rare
conditions. AFAIU, the patch "nvme-pci: place descriptor addresses in
iod" from 6.3-rc1 attempts to use some space which is otherwise unused.
Unfortunately I was not able to revert that patch because later changes
depend on it.
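(One way to take an NVMe device out of the picture, sketched under the assumption that the SSD sits at PCI address 0000:02:00.0 as in the lspci output quoted earlier, would be to unbind it from the nvme driver at runtime, or to let pci-stub claim it at boot:)

# at runtime: detach the device from the nvme driver
$ echo 0000:02:00.0 | sudo tee /sys/bus/pci/devices/0000:02:00.0/driver/unbind

# at boot: let the pci-stub driver claim the device before nvme does;
# take <vendor>:<device> from `lspci -nn` and add to the kernel command line:
#   pci-stub.ids=<vendor>:<device>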
So, I now only tried out whether just `NVME_MAX_SEGS 127` helps (see
the message from Matthias). The answer is no. This only seems to be an upper
limit, because `/sys/class/block/nvme0n1/queue/max_segments` reports 33
with unmodified kernels >= 6.3.7. With older kernels, or kernels with the
reverted patch "nvme-pci: clamp max_hw_sectors based on DMA optimized
limitation" (introduced in 6.3.7), this value is 127 and the corruptions
disappear.
I guess this value somehow has to be 127. In my case it is sufficient
to revert the patch from 6.3.7. In Matthias's case, the value then
becomes 128 and additionally has to be limited using `NVME_MAX_SEGS 127`.
Regards Stefan
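(A quick way to collect the queue limits and cache settings that keep coming up in this thread in one go; nvme0/nvme0n1 are placeholders for whatever the affected device is named:)

$ grep . /sys/class/block/nvme0n1/queue/max_hw_sectors_kb \
         /sys/class/block/nvme0n1/queue/max_sectors_kb \
         /sys/class/block/nvme0n1/queue/max_segments \
         /sys/class/block/nvme0n1/queue/fua
$ sudo nvme id-ctrl /dev/nvme0 | grep -i mdts      # device-reported maximum transfer size
$ sudo nvme get-feature /dev/nvme0n1 -f 6          # current volatile write cache setting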
I don't know if it helps to narrow it down, but adding the kernel parameter nvme.io_queue_depth=2 makes the corruption disappear with an unpatched kernel (Ubuntu 6.11.0-12 in my case). Of course it is much slower with this setting.

Well, this is a real doozy. The observation appears completely dependent on PCI slot population, but it's somehow also dependent on a software alignment/granularity or queue depth choice? The whole part about the 2nd slot being used vs. unused really indicates some kind of platform anomaly rather than a kernel bug.

I'm going to ignore the 2nd slot for a moment because I can't reconcile that with the kernel size limits. Let's just consider that the kernel transfer sizing did something weird for your device, and now we introduce the queue-depth-2 observation into the picture. This now starts to sound like that O2 Micro bug where transfers that ended on page boundaries got misinterpreted by the NVMe controller. That's this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=ebefac5647968679f6ef5803e5d35a71997d20fa

Now, it may not be appropriate to just add your devices to that quirk because it only reliably works for devices with an MDTS of 5 or less, and I think your devices are larger. But they might have the same bug. It'd be weird if so many vendors implemented it incorrectly, but maybe they're using the same 3rd party controller.

(In reply to Keith Busch from comment #29)
> Now, it may not be appropriate to just add your devices to that quirk
> because it only reliably works for devices with MDTS of 5 or less, and I
> think your devices are larger.

Will give that a try, but one comment:

> But they might have the same bug. It'd be
> weird if so many vendors implemented it incorrectly, but maybe they're using
> the same 3rd party controller.

That makes it sound like you suspect a problem in the NVMe devices. But isn't it way more likely that it's something in the machine? I mean, we all seem to have the same one (ASRock Deskmini X600) and use NVMe devices that apparently work fine for everybody else, as they are not new and have been sold for a while. So it sounds more like that machine is doing something wrong, or doing something odd that exposes a kernel bug.

For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS -> iommu) prevents the problem from happening.

Hi,

Am 16.01.25 um 06:37 schrieb bugzilla-daemon@kernel.org:
> --- Comment #30 from The Linux kernel's regression tracker (Thorsten Leemhuis) ---
>> But they might have the same bug. It'd be weird if so many vendors
>> implemented it incorrectly, but maybe they're using the same 3rd
>> party controller.
>
> That makes it sound like you suspect a problem in the NVMe devices.
> But isn't it way more likely that it's something in the machine? I
> mean we all seem to have the same one (ASRock Deskmini X600) and use
> NVMe devices that apparently work fine for everybody else, as they
> are not new and sold for a while. So it sounds more like that machine
> is doing something wrong or doing something odd that exposes a kernel
> bug.

Furthermore, it seems that the corruptions occur with all SSDs under certain conditions, and the controllers are quite different.

One user from the c't forum wrote me that the corruptions only occur if networking is enabled, and that this holds for both Ethernet and WLAN. (I asked him to report his results here.)

Maybe something (kernel, firmware or even the CPU) messes up DMA transfers of different PCIe devices, e.g. due to a buffer overflow.
AFAIS, another thing that is in common: All CPU's used are from 8000 (and on this chipset-less mainbaord, all PCIe devices are connected to the CPU). Regards Stefan > Well this is a real doozy. Are all of these reports on the exact same motherboard? "ASRock Deskmini X600" > One user from the c't forum wrote me, that the corruptions only occur if network is enabled, and that this trick works with both, Ethernet and WLAN. (Is asked him to report his results here.) Has anyone contacted ASRock support? With such random results I would wonder if there is a signal integrity issue that needs to be looked at. > For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS -> > iommu) prevents the problem from happening. Can others corroborate this finding? (In reply to Mario Limonciello (AMD) from comment #33) > > Well this is a real doozy. > Are all of these reports on the exact same motherboard? "ASRock Deskmini > X600" Pretty sure that's the case. > > One user from the c't forum wrote me, that the corruptions only occur if > network is enabled, and that this trick works with both, Ethernet and > WLAN. (Is asked him to report his results here.) > Has anyone contacted ASRock support? Not that I know of. > With such random results I would > wonder if there is a signal integrity issue that needs to be looked at. FWIW, Windows apparently works fine. But I guess that might be due to some random minor details/difference or something like that. > > For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS > -> > > iommu) prevents the problem from happening. > Can others corroborate this finding? Yeah, would be good if someone could confirm my result. > --- Comment #33 from Mario Limonciello (AMD) --- >> Well this is a real doozy. > > Are all of these reports on the exact same motherboard? "ASRock Deskmini > X600" If I haven't overlooked something, all reports are from the motherboard "AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with an series 8000 Ryzen. >> One user from the c't forum wrote me, that the corruptions only occur if > network is enabled, and that this trick works with both, Ethernet and > WLAN. (Is asked him to report his results here.) > > Has anyone contacted ASRock support? With such random results I would wonder > if there is a signal integrity issue that needs to be looked at. Signal integrity does not depend on transfer size and is not improved by crosstalk of a 2nd SSD. (Corruptions disappear if a 2nd SSD is installed.) Regards Stefan I can confirm that disabling IOMMU under "Advanced\AMD CBS\NBIO Common Options" prevents the data corruption. System spec: ASRock Deskmini X600, AMD Ryzen 7 8700G On 15.01.25 09:40, Thorsten Leemhuis wrote: > On 15.01.25 07:37, Bruno Gravato wrote: >> I finally got the chance to run some more tests with some interesting >> and unexpected results... > > FWIW, I briefly looked into the issue in between as well and can > reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main > M.2 slot of my DeskMini X600 using btrfs on a mainline kernel with a > config from Fedora rawhide. > > So what can we that are affected by the problem do to narrow it down? > > What does it mean that disabling the NVMe devices's write cache often > but apparently not always helps? It it just reducing the chance of the > problem occurring or accidentally working around it? > > hch initially brought up that swiotlb seems to be used. Are there any > BIOS setup settings we should try? 
I tried a few changes yesterday, but > I still get the "PCI-DMA: Using software bounce buffering for IO > (SWIOTLB)" message in the log and not a single line mentioning DMAR. FWIW, I meanwhile became aware that it is normal that there are no lines with DMAR when it comes to AMD's IOMMU. Sorry for the noise. But there is a new development: I noticed earlier today that disabling the IOMMU in the BIOS Setup seems to prevent the corruption from occurring. Another user in the bugzilla ticket just confirmed this. Ciao, Thorsten > [1] see start of this thread and/or > https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details > >> I put another disk (WD Black SN750) in the main M.2 slot (the >> problematic one), but kept my main disk (Solidigm P44 Pro) in the >> secondary M.2 slot (where it doesn't have any issues). >> I rerun my test: step 1) copy a large number of files to the WD disk >> (main slot), step 2) run btrfs scrub on it and expect some checksum >> errors >> To my surprise there were no errors! >> I tried it twice with different kernels (6.2.6 and 6.11.5) and booting >> from either disk (I have linux installations on both). >> Still no errors. >> >> I then removed the Solidigm disk from the secondary and kept the WD >> disk in the main M.2 slot. >> Rerun my tests (on kernel 6.11.5) and bang! btrfs scrub now detected >> quite a few checksum errors! >> >> I then tried disabling volatile write cache with "nvme set-feature >> /dev/nvme0 -f 6 -v 0" >> "nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but >> /sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to >> turn into 0? >> >> I re-run my test, but I still got checksum errors on btrfs scrub. So >> disabling volatile write cache (assuming I did it correctly) didn't >> make a difference in my case. >> >> I put the Solidigm disk back into the secondary slot, booted and rerun >> the test on the WD disk (main slot) just to be triple sure and still >> no errors. >> >> So it looks like the corruption only happens if only the main M.2 slot >> is occupied and the secondary M.2 slot is free. >> With two nvme disks (one on each M.2 slot), there were no errors at all. >> >> Stefan, did you ever try running your tests with 2 nvme disks >> installed on both slots? Or did you use only one slot at a time? 
> > $ journalctl -k | grep -i -e DMAR -e IOMMU -e AMD-Vi -e SWIOTLB > AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0 > iommu: Default domain type: Translated > iommu: DMA domain TLB invalidation policy: lazy mode > pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported > pci 0000:00:01.0: Adding to iommu group 0 > pci 0000:00:01.3: Adding to iommu group 1 > pci 0000:00:02.0: Adding to iommu group 2 > pci 0000:00:02.3: Adding to iommu group 3 > pci 0000:00:03.0: Adding to iommu group 4 > pci 0000:00:04.0: Adding to iommu group 5 > pci 0000:00:08.0: Adding to iommu group 6 > pci 0000:00:08.1: Adding to iommu group 7 > pci 0000:00:08.2: Adding to iommu group 8 > pci 0000:00:08.3: Adding to iommu group 9 > pci 0000:00:14.0: Adding to iommu group 10 > pci 0000:00:14.3: Adding to iommu group 10 > pci 0000:00:18.0: Adding to iommu group 11 > pci 0000:00:18.1: Adding to iommu group 11 > pci 0000:00:18.2: Adding to iommu group 11 > pci 0000:00:18.3: Adding to iommu group 11 > pci 0000:00:18.4: Adding to iommu group 11 > pci 0000:00:18.5: Adding to iommu group 11 > pci 0000:00:18.6: Adding to iommu group 11 > pci 0000:00:18.7: Adding to iommu group 11 > pci 0000:01:00.0: Adding to iommu group 12 > pci 0000:02:00.0: Adding to iommu group 13 > pci 0000:03:00.0: Adding to iommu group 14 > pci 0000:03:00.1: Adding to iommu group 15 > pci 0000:03:00.2: Adding to iommu group 16 > pci 0000:03:00.3: Adding to iommu group 17 > pci 0000:03:00.4: Adding to iommu group 18 > pci 0000:03:00.6: Adding to iommu group 19 > pci 0000:04:00.0: Adding to iommu group 20 > pci 0000:04:00.1: Adding to iommu group 21 > pci 0000:05:00.0: Adding to iommu group 22 > AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA > PC GA_vAPIC > AMD-Vi: Interrupt remapping enabled > AMD-Vi: Virtual APIC enabled > PCI-DMA: Using software bounce buffering for IO (SWIOTLB) > perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank). > I noticed earlier today that disabling the IOMMU in the BIOS Setup seems to prevent the corruption from occurring. If you can reliably reproduce this issue, can you also experiment with turning it back on in BIOS and then using: * iommu=pt (which will do identity domain) and separately * amd_iommu=off (which will disable the IOMMU from Linux) > If I haven't overlooked something, all reports are from the motherboard "AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with an series 8000 Ryzen. For everyone responding with their system, it would be ideal to also share information about the AGESA version (sometimes reported in `dmidecode | grep AGESA`) as well as the ASRock BIOS version (/sys/class/dmi/id/bios_version). > Corruptions disappear if a 2nd SSD is installed I missed that; quite bizarre. Mario, thx for looking into this. > If you can reliably reproduce this issue Usually within ten to twenty seconds > iommu=pt Apparently[1] helps. > amd_iommu=off Apparently[1] helps, too [1] I did not try for a long time, but for two or three minutes and no corruption occurred; normally one occurs on nearly every try of "f3write -e 4" and checking the result afterwards. > it would be ideal to also share information $ grep -s '' /sys/class/dmi/id/* /sys/class/dmi/id/bios_date:12/05/2024 /sys/class/dmi/id/bios_release:5.35 /sys/class/dmi/id/bios_vendor:American Megatrends International, LLC. 
/sys/class/dmi/id/bios_version:4.08
/sys/class/dmi/id/board_asset_tag:Default string
/sys/class/dmi/id/board_name:X600M-STX
/sys/class/dmi/id/board_vendor:ASRock
/sys/class/dmi/id/board_version:Default string
/sys/class/dmi/id/chassis_asset_tag:Default string
/sys/class/dmi/id/chassis_type:3
/sys/class/dmi/id/chassis_vendor:Default string
/sys/class/dmi/id/chassis_version:Default string
/sys/class/dmi/id/modalias:dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
/sys/class/dmi/id/product_family:Default string
/sys/class/dmi/id/product_name:X600M-STX
/sys/class/dmi/id/product_sku:Default string
/sys/class/dmi/id/product_version:Default string
/sys/class/dmi/id/sys_vendor:ASRock
/sys/class/dmi/id/uevent:MODALIAS=dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:

$ sudo dmidecode | grep AGESA
String: AGESA!V9 ComboAm5PI 1.2.0.2a

Created attachment 307497 [details] logs.tar.bz2

Hi,

I ran a few tests with SSDs and BIOS settings. (I cannot do this often because the hardware is in use and installed remotely.) Kernel logs and lspci output are in the enclosed archive. An unmodified (except for an additional message) kernel 6.13-rc6 was used.

0. As reference: IOMMU and Ethernet enabled; 1st M.2: Lexar; 2nd M.2: empty; archive directory: `lexar_empty` ==> corruptions occur
1. IOMMU disabled via BIOS, Ethernet enabled; 1st M.2: Lexar; 2nd M.2: empty; archive directory: `lexar_empty.noiommu` ==> no corruptions
2. Ethernet disabled via BIOS, IOMMU enabled; 1st M.2: Lexar; 2nd M.2: empty; archive directory: `lexar_empty.noeth` ==> no corruptions
3. IOMMU and Ethernet enabled; 1st M.2: Lexar; 2nd M.2: Seagate Firecuda 520, 500 GB; archive directory: `lexar_firecuda` ==> no corruptions
4. IOMMU and Ethernet enabled; 1st M.2: Seagate Firecuda 520, 500 GB; 2nd M.2: empty; archive directory: `firecuda_empty` ==> no corruptions

The last test was a surprise because it differs from the observations reported in comment 26. Note that the kernel emits the warning

> workqueue: work disable count underflowed
> WARNING: CPU: 1 PID: 23 at kernel/workqueue.c:4317 ...

> [1] I did not try for a long time, but for two or three minutes and no
> corruption occurred; normally one occurs on nearly every try of "f3write -e 4"
> and checking the result afterwards.

I write 250 or 1000 files (1 file = 1 GB) because only about 2% of them are corrupt. The probability of errors seems to vary strongly.

Regards Stefan

Hi Team,

I have been tracking this error for weeks, made dozens of test installs, and I would like to add my recent results to this report. It happens on at least btrfs and ext4. I prefer btrfs since it takes only 2 seconds to find the bug right after installation (before reboot) with "btrfs scrub start /target", because btrfs does CRC checksumming while ext4 does not: on ext4 you need to copy & verify, manually or scripted. The errors seem 100% reproducible. Usually there are about 15-30 corrupted files. I usually test with a simple install of Linux Mint 22. It takes only ~10 min, assuming you have a bootable install stick at hand. Alternatively Ubuntu Server 24.04 or 24.10 (it does not matter), but they ask more questions. (Ubuntu Desktop does not offer btrfs.) I was the one who brought in the author of c't Magazin - Christian.
With help of c'ts support forum i found bugreport 1076372 thus Stefan and this report. I guess I was the first who found out slot M2_2 is not concerned. ( I am the person mentiond by Stefan in comment 32) I also found out: If Ubuntu Server can't update packages while installing, due to unavailable network, there might be no corrupted files . (Tested twice with no corruptions.) This was confirmed by Christian. If network is available - problem is reproducible. (Tested intensly with LAN and WLAN.) If you insert a unused dummy ssd in slot M2_2 and install on slot M2_1 - the error on slot M2_1 is gone. I am willing to do further tests if needed - please request me to do so. Best regards Ralph Systems involved: ASRock DESKMINI X-600M-STX with BIOS v4.03, v 4.04, v4.08, v3.02 AMD Ryzen 5 8600G, AMD Ryzen 7 8700G Diverse RAM Diverse NVMe-SSDs: Samsung 990 Evo - Samsung 990 PRO - Samsung 980 PRO (all with up2date firmware) + many more Ubuntu 24.04, Ubuntu 24.10 (6.11), Linux Mint 22, Mint 21.3 Edge, (6.5), Fedora40, Fedora41 + more On Dec., 7th i opened a Ticket @ ASRock support. I just updated this and pointed them here. (In reply to Ralph Gerstmann from comment #41) > I am willing to do further tests if needed Would afaics be good if at least one person could do what Mario asked for in Comment 38 (and hopefully confirming my results from Comment 39). On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote: > What does it mean that disabling the NVMe devices's write cache often > but apparently not always helps? It it just reducing the chance of the > problem occurring or accidentally working around it? For consumer NAND device you basically can't disable the volatile write cache. If you do disable it, that just means it gets flushed after every write, meaning you have to write the entire NAND (super)block for every write, causing a huge slowdown (and a lot of media wear). This will change timings a lot obviously. If it doesn't change the timing the driver just fakes it, which reputable vendors shouldn't be doing, but I would not be entirely surprised about for noname devices. > hch initially brought up that swiotlb seems to be used. Are there any > BIOS setup settings we should try? I tried a few changes yesterday, but > I still get the "PCI-DMA: Using software bounce buffering for IO > (SWIOTLB)" message in the log and not a single line mentioning DMAR. The real question would be to figure out why it is used. Do you see the pci_dbg(dev, "marking as untrusted\n"); message in the commit log if enabling the pci debug output? (I though we had a sysfs file for that, but I can't find it). Hi, an extra data point. I have the following setup: AsRock Deskmini x600, BIOS 4.08 with Secure Boot enabled. Ryzen 9 7900 Kingston Fury 2*32GB at default 5200Mhz Western Digital SN850X 1TB (main slot, secondary slot never used) Intel AX210 WiFi Ethernet is enabled, but I don't use it, I use WiFi. IOMMU is "auto", haven't touched. Running kernel 6.12.9 on Fedora 41 with btrfs. Been using this system for a couple of months, have copied my 500GB backup drive containing 1.6 million files to the nvme drive. I have also just generated 400 1GB files containing data from /dev/urandom. btrfs scrub reports no errors. On 17.01.25 09:05, Christoph Hellwig wrote: > On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote: > >> hch initially brought up that swiotlb seems to be used. Are there any >> BIOS setup settings we should try? 
I tried a few changes yesterday, but >> I still get the "PCI-DMA: Using software bounce buffering for IO >> (SWIOTLB)" message in the log and not a single line mentioning DMAR. > > The real question would be to figure out why it is used. > > Do you see the > > pci_dbg(dev, "marking as untrusted\n"); > > message in the commit log if enabling the pci debug output? By booting with 'ignore_loglevel dyndbg="file drivers/pci/* +p"' I suppose? No, that is not printed (but other debug lines from the pci code are). Side note: that "PCI-DMA: Using software bounce buffering for IO >> (SWIOTLB)" message does show up on two other AMD machines I own as well. One also has a Ryzen 8000, the other one a much older one. And BTW a few bits of the latest development in the bugzilla ticket (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ): * iommu=pt and amd_iommu=off seems to work around the problem (in addition to disabling the iommu in the BIOS setup). * Not totally sure, but it seems most or everyone affected is using a Ryzen 8000 CPU -- and now one user showed up that mentioned a DeskMini x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But that might be due to other aspects. A former colleague of mine who can reproduce the problem will later test if a different CPU line really is making a difference. Ciao, Thorsten On Fri, Jan 17, 2025 at 10:51:09AM +0100, Thorsten Leemhuis wrote:
> By booting with 'ignore_loglevel dyndbg="file drivers/pci/* +p"' I
> suppose? No, that is not printed (but other debug lines from the pci
> code are).
>
> Side note: that "PCI-DMA: Using software bounce buffering for IO
> >> (SWIOTLB)" message does show up on two other AMD machines I own as
> well. One also has a Ryzen 8000, the other one a much older one.
>
> And BTW a few bits of the latest development in the bugzilla ticket
> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):
>
> * iommu=pt and amd_iommu=off seems to work around the problem (in
> addition to disabling the iommu in the BIOS setup).
That suggests the problem is related to the dma-iommu code, and
my strong suspect is the swiotlb bounce buffering for untrusted
device. If you feel adventurous, can you try building a kernel
where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked
to always return false?
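Independent of rebuilding the kernel, it may also be possible to watch at runtime whether the swiotlb bounce buffer is actually being used. A minimal sketch, assuming CONFIG_SWIOTLB and CONFIG_DEBUG_FS are enabled and debugfs is mounted (the exact file names can differ between kernel versions):

$ sudo dmesg | grep -i swiotlb                       # reports the reserved bounce buffer, if any
$ sudo cat /sys/kernel/debug/swiotlb/io_tlb_nslabs   # total bounce slots set aside
$ sudo cat /sys/kernel/debug/swiotlb/io_tlb_used     # slots currently in use; stays at 0 if nothing bounces

If the "used" counter stays at zero while the corruption test is running, the bounce buffering itself is probably not involved.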
On 17.01.25 10:55, Christoph Hellwig wrote:
> That suggests the problem is related to the dma-iommu code, and
> my strong suspect is the swiotlb bounce buffering for untrusted
> device. If you feel adventurous, can you try building a kernel
> where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked
> to always return false?
Tried that, did not help: I still get corrupted data.
Ciao, Thorsten
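For anyone who wants to repeat the iommu=pt / amd_iommu=off experiments mentioned above, a minimal sketch assuming a GRUB-based Debian/Ubuntu-style install (the config path and update command differ on other distributions):

$ sudoedit /etc/default/grub             # append e.g. "iommu=pt" (or "amd_iommu=off") to GRUB_CMDLINE_LINUX_DEFAULT
$ sudo update-grub                       # Fedora-style systems: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
$ cat /proc/cmdline                      # confirm the parameter was picked up
$ cat /sys/kernel/iommu_groups/*/type    # with iommu=pt the default domains should show up as "identity"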
On Fri, 17 Jan 2025 at 09:51, Thorsten Leemhuis <regressions@leemhuis.info> wrote: > * Not totally sure, but it seems most or everyone affected is using a > Ryzen 8000 CPU -- and now one user showed up that mentioned a DeskMini > x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But > that might be due to other aspects. A former colleague of mine who can > reproduce the problem will later test if a different CPU line really is > making a difference. One other different aspect for that user besides the 7000 series CPU is that he's using a wifi card as well (that sits in a M.2 wifi slot just below the main M.2 disk slot), so I wonder if that may play a role? I think most of us have no wifi card installed. I think I have a M.2 wifi card on my former NUC, I'll see if it's compatible with the deskmini and try it out. The other reason could be some disk models aren't affected... I think Stefan reported no issues on a Firecuda 520. I ordered a Crucial T500 1TB yesterday. It's for another machine, but I will try it on the deskmini x600 before deploying on the other machine. I should receive it in a week or so. Bruno No corruption with IOMMU disabled in bios IOMMU enabled in bios, iommu=pt IOMMU enabled in bios, amd_iommu=off Full system spec: ASRock Deskmini X600 CPU: AMD Ryzen 7 8700G Memory: 2x 16 GB Kingston-Fury KF564S38IBK2-32, tested at different speeds from 4800 to 6400 1st M.2: Samsung 990 Pro 2 TB NVMe, latest firmware 4B2QJXD7 2nd M.2: always empty Wifi M.2: Intel AX210, enabled and connected in all tests Ethernet: enabled, but never connected in all tests cat /sys/class/dmi/id/bios_version 4.08 dmidecode | grep AGESA String: AGESA!V9 ComboAm5PI 1.2.0.2a latest SIO firmware 240522 installed I had the error from the beginning even with the original bios version 1.43 Many thanks to everyone who is now looking into the problem. Matthias Made a 3 more tests... *) Mint 22 (6.8.0) with IOMMU disabled in BIOS: No errors. I set it back to Auto before i continued with the following tests.) *) Mint 22 (6.8.0) with network disconnected: Errors (This is what i thought i saw long ago repeatedly - but since we found no errors in disconnected Ubuntu 24.10 Server i tested this again.) *) Ubuntu 24.10 (6.11.0) with network disconnected: Errors Conclusion: A missing network might prevent the failure during install - at least in Ubuntu 22.10 - but can happen anyway. Enabling network seems to raise the chance. I made dozens of installations with Mint 22 (WLAN/LAN/no net), i am pretty sure i didn't see a single one without this error - if the known conditions (4x4 NVMe SSD in Slot 1, nothing in Slot 2) are met. Both systems show the same after installation: cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb 128 cat /sys/class/block/nvme0n1/queue/max_sectors_kb 128 cat /sys/class/block/nvme0n1/queue/max_segments 33 cat /sys/class/block/nvme0n1/queue/max_segment_size 4294967295 BTW: Asrock support confirmed they forwarded this bugreport to their BIOS devolpment team. Ralph Is this even a Linux bug? Surely this would be observed in other operating systems? Until now no one could reproduce in W1N. Hi, >> What does it mean that disabling the NVMe devices's write cache >> often but apparently not always helps? It it just reducing the >> chance of the problem occurring or accidentally working around it? > > For consumer NAND device you basically can't disable the volatile > write cache. 
If you do disable it, that just means it gets flushed > after every write, meaning you have to write the entire NAND > (super)block for every write, causing a huge slowdown (and a lot of > media wear). This will change timings a lot obviously. If it > doesn't change the timing the driver just fakes it, which reputable > vendors shouldn't be doing, but I would not be entirely surprised > about for noname devices. As already mentioned, my SSD has no DRAM and uses HMB (Host memory buffer). (It has non-volatile SLC cache.) Disabling volatile write cache has no significant effect on read/write performance of large files, because the HMB size is only 40 MB. But things like file deletions may be slower. AFAIS the corruptions occur with both kinds of SSDs, the ones that have their own DRAM and the ones that use HMB. On Fri, Jan 17, 2025 at 10:31:55PM +0100, Stefan wrote: > As already mentioned, my SSD has no DRAM and uses HMB (Host memory > buffer). HMB and volatile write caches are not necessarily intertwined. A device can have both. Generally speaking, you'd expect the HMB to have SSD metadata, not user data, where a VWC usually just has user data.
The spec also requires the device maintain data integrity even with an unexpected sudden loss of access to the HMB, but that isn't the case with a VWC. >(It has non-volatile SLC cache.) Disabling volatile write cache > has no significant effect on read/write performance of large files, Devices are free to have whatever hierarchy of non-volatile caches they want without advertising that to the host, but if they're calling those "volatile" then I think something has been misinterpreted. > because the HMB size in only 40MB. But things like file deletions may be > slower. > > AFAIS the corruption occur with both kinds of SSD's, the ones that have > own DRAM and he ones that use HMB. Yeah, that was the point of the experiment. If corruption happens when it's off, then that helps rule out host buffer size/alignment (which is where this bz started) as a triggering condition. Disabling VWC is not a "fix", it's just a debug data point. If corruption goes away with it off, though, then we can't really conclude anything for this issue. On 17.01.25 10:51, Thorsten Leemhuis wrote: > On 17.01.25 09:05, Christoph Hellwig wrote: >> On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote: > And BTW a few bits of the latest development in the bugzilla ticket > (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ): > > * iommu=pt and amd_iommu=off seems to work around the problem (in > addition to disabling the iommu in the BIOS setup). > > * Not totally sure, but it seems most or everyone affected is using a > Ryzen 8000 CPU -- and now one user showed up that mentioned a DeskMini > x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But > that might be due to other aspects. A former colleague of mine who can > reproduce the problem will later test if a different CPU line really is > making a difference. My former colleague Christian Hirsch (not CCed) can reproduce the problem reliably. He today switched the CPU to a Ryzen 7 7700 and later to some Ryzen 9600X – and with those things worked just fine, e.g. no corruptions. But they came back after putting the 8600G back in. Ralph, can you please add this detail to the Asrock support ticket? Ciao, Thorsten [1] he described building a x600 machine in the c't magazine, which is the reason why I and a few others affected and CCed build their x600 systems So are all the problematic CPUs reproducing this Ryzen 8600G/Ryzen 8700G? Perhaps there is a firmware issue with those. Hi, > --- Comment #58 from Mario Limonciello (AMD) --- > So are all the problematic CPUs reproducing this Ryzen 8600G/Ryzen 8700G? > Perhaps there is a firmware issue with those. ... or even a hardware issue with those 8000 series CPU's which occurs under certain conditions, namely without a chipset. AsRock offers a few other products that use the same technology (DeskMeet, Jupiter and a mini-ITX mainboard). Are they affected too? Has anyone (ASRock and/or AMD) tested them with Linux before releasing the hardware? (Windows often uses older technologies / features). AFAIK, AsRock does not develop such products and firmware without massive support from AMD. I started another support request at http://event.asrock.com/tsd.asp . Maybe this will expedite a fix. Regards Stefan > ... or even a hardware issue with those 8000 series CPU's which occurs under certain conditions, namely without a chipset. > AsRock offers a few other products that use the same technology (DeskMeet, Jupiter and a mini-ITX mainboard). Are they affected too? 
Yes; I also want to know if this is unique to ASRock's X600M-STX or if this is happening to anyone on any other AM5 motherboards. > Has anyone (ASRock and/or AMD) tested them with Linux before releasing the hardware Yes; I can assert that AMD has tested 8600G and 8700G with Linux. You can look under "OS support" to see what OSes have been tested. https://www.amd.com/en/products/processors/desktops/ryzen/8000-series/amd-ryzen-5-8600g.html It's not out of the question that a generic AGESA firmware regression under specific circumstances has happened; but right now, all of the evidence on this thread /currently/ points to 8600G/8700G + X600M-STX. Hi, Am 20.01.25 um 17:26 schrieb bugzilla-daemon@kernel.org: > --- Comment #60 from Mario Limonciello (AMD) > (mario.limonciello@amd.com) --- Yes; I also want to know if this is > unique to ASRock's X600M-STX or if this is happening to anyone on any > other AM5 motherboards. > >> Has anyone (ASRock and/or AMD) tested them with Linux before >> releasing > the hardware > > Yes; I can assert that AMD has tested 8600G and 8700G with Linux. > You can look under "OS support" to see what OSes have been tested. sorry, my last statement about insufficient testing was obliviously misleading. I had the combination CPU + mainboard in mind -- the "them" refers to the AsRock products in the previous sentence. Of course the combination of 8x00G + 1 or 2 x Promontory 19/21 (which makes the different chipsets) has been tested and is widely used. I therefore think, a generic AM5 issue is unlikely. But the combination of 8x00G + Knoll3 (the magic SoC enabler chip) is quite new and not used often so far. And since the errors occur in many different configurations, they should be detectable by proper testing. Most likely, the corruptions are triggered either by the combination 8x00G + Knoll3 (in that case other x600 products from AsRock should be affected too) or by the combination 8x00G + X600M STX (that specific mainboard only) and may be caused by firmware, hardware or the kernel. > It's not out of the question that a generic AGESA firmware regression > under specific circumstances has happened; but right now, all of the > evidence on this thread /currently/ points to 8600G/8700G + > X600M-STX. The issues occurred with all BIOS version I tested, starting from the initial one 1.43. AGESA version of some of them are stated at https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS . Regards Stefan (In reply to Thorsten Leemhuis from comment #57) > > My former colleague Christian Hirsch (not CCed) can reproduce the > problem reliably. He today switched the CPU to a Ryzen 7 7700 and later > to some Ryzen 9600X – and with those things worked just fine, e.g. no > corruptions. But they came back after putting the 8600G back in. > > Ralph, can you please add this detail to the Asrock support ticket? > > Ciao, Thorsten Done. ASRock support came back to me today and sayd they can't reproduce. > We cannot reproduce the problem. Can you provide steps how to reproduce it? > Our test method: > CPU: 8600G > BIOS: 4.08 > OS: Ubuntu 22.04 LTS > SSD: Crucial P300 2TB (M.2_1) > We have copy a 800GB file and use F3write to create 1GB test file 600+ round. > Didn't see this problem... I answered them: Hi, Steps: Reset BIOS - E.g. We can't reproduce if IOMMU is disabled. BIOS Version does not seem to matter. CPU: 8600G and 8700G OS: Any recent Linux Kernel. 
(Details in bugreport https://bugzilla.kernel.org/show_bug.cgi?id=219609 ) SSD in M.2_1: A lot of SSDs fail probably most - but not all. For details please check the bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 ) Slot M.2_2: Must be empty. There are no problems if populated. As you see in the bugreport, there are different ways to reproduce. I personally prefer installing Linux Mint 22 on btrfs. That fails 100%. Network setup does not matter. Ubuntu Server 24.10 on btrfs installation fails sometimes. It seems with disabled network it might not fail. So make sure network is working and try more than once if you cant' reproduce. Interface LAN/WLAN does not matter. Error also happens on ext4 - i reproduced this. But since it is much easier to reproduce with btrfs scrub i always go this way and drop the installation later. Other users using ext4 on a system they don't want to reinstall use F3 to copy and verify. I suggest you to replace Crucial P300 with one of the SSDs mentioned in the bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 ) Best regards, Ralph (In reply to Ralph Gerstmann from comment #63) > > SSD: Crucial P300 2TB (M.2_1) > > We have copy a 800GB file and use F3write to create 1GB test file 600+ > round. > > Didn't see this problem... > I suggest you to replace Crucial P300 with one of the SSDs mentioned in the > bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 ) > Does anybody here have experiances with this SSD? Not the P300, but I got a Crucial T500 1TB yesterday and experimented with it and I can still reproduce the errors. Original firmware on the disk was P8CR002, I then upgraded to P8CR004, but it didn't make any difference... still getting checksum errors after copying a large amount of files and running btrfs scrub. > > > SSD: Crucial P300 2TB (M.2_1) > > > We have copy a 800GB file and use F3write to create 1GB test file 600+ > > round. > > > Didn't see this problem... > > > > I suggest you to replace Crucial P300 with one of the SSDs mentioned in the > > bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 ) > > > > Does anybody here have experiances with this SSD? > (In reply to Ralph Gerstmann from comment #63) > As you see in the bugreport, there are different ways to reproduce. > I personally prefer installing Linux Mint 22 on btrfs. > That fails 100%. Network setup does not matter. Unasked for advice from someone who had to occasionally reproduce problems in a lab setup in the last 20 years: I'd say you should point them to reproducing it using f3write and f3read, which at least for me and apparently a few others (please correct me if I'm wrong) quickly reproduces the problem without much effort (like reinstalling a distro) by the person that runs the test. New feedback from ASRock support: <snip> Hello, Got feedback from our BIOS department: Sorry, we still not able to reproduce the problem. BIOS: 4.08 with IOMMU enabled CPU: 8600G SSD: SAMSUNG 990 Pro 1TB OS: Linux Mints 22.1 installed in btrfs LAN: Connected Trasfer 100x 1G file and still not meet the problem. (Tested via f3) </snip> On Mon, Jan 20, 2025 at 03:31:28PM +0100, Thorsten Leemhuis wrote:
> My former colleague Christian Hirsch (not CCed) can reproduce the
> problem reliably. He today switched the CPU to a Ryzen 7 7700 and later
> to some Ryzen 9600X – and with those things worked just fine, e.g. no
> corruptions. But they came back after putting the 8600G back in.
So basically you need a specific board and a specific CPU, and only
one M.2 SSD in the two slots to reproduce it? Phew. I'm kinda lost on
what we could do about this on the Linux side.
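For reference, the f3-based check that several reporters in this thread use boils down to something like the following. A sketch only: the package name is the Debian/Ubuntu one and /mnt/test is a placeholder for a filesystem on the SSD in the first M.2 slot.

$ sudo apt install f3
$ f3write --end-at=250 /mnt/test    # write 250 x 1 GB files filled with a pseudo-random pattern
$ f3read /mnt/test                  # re-read them; corrupted/overwritten sectors are reported per file

On the affected configuration a small fraction of the written files typically comes back with "overwritten" sectors, matching the reports above.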
(In reply to Ralph Gerstmann from comment #67) > New feedback from ASRock support: > > Got feedback from our BIOS department: > Sorry, we still not able to reproduce the problem. > […] That sounds like we might be stuck here, as I guess we need them to reproduce the problem, as they are unlikely to fix it otherwise. Anyone any idea why they were unable to reproduce the problem? > BIOS: 4.08 with IOMMU enabled Does it maybe make a difference if the IOMMU is enabled in the BIOS Setup (the default iirc is "AUTO") > SSD: SAMSUNG 990 Pro 1TB In which slot was it? Was it the only device? > Trasfer 100x 1G file and still not meet the problem. (Tested via f3) Was that transfering a file using f3 (e.g. with some network share), or was it "transfer 100x 1G file and run f3write and f3read" in parallel? Hi, Am 28.01.25 um 08:41 schrieb Christoph Hellwig: > So basically you need a specific board and a specific CPU, and only > one M.2 SSD in the two slots to reproduce it? more generally, it dependents on which PCIe devices are used. On my PC corruptions also disappear if I disable the ethernet controller in the BIOS. Furthermore it depends on transaction sizes (that's why older kernels work), IOMMU, sometimes on volatile write cache and partially on SSD type (which may have something to do with the former things). > Puh. I'm kinda lost on what we could do about this on the Linux > side. Because it also depends on the CPU series, a firmware or hardware issue seems to be more likely than a Linux bug. ATM ASRock is still trying to reproduce the issue. (I'm in contact with them to. But they have Chinese new year holidays in Taiwan this week.) If they can't reproduce it, they have to provide an explanation why the issues are seen by so many users. Regards Stefan * Stefan (linux-kernel@simg.de) wrote: > Hi, > > Am 28.01.25 um 08:41 schrieb Christoph Hellwig: > > So basically you need a specific board and a specific CPU, and only > > one M.2 SSD in the two slots to reproduce it? > > more generally, it dependents on which PCIe devices are used. On my PC > corruptions also disappear if I disable the ethernet controller in the BIOS. > > Furthermore it depends on transaction sizes (that's why older kernels > work), IOMMU, sometimes on volatile write cache and partially on SSD > type (which may have something to do with the former things). Is there any characterisation of the corrupted data; last time I looked at the bz there wasn't. I mean, is it reliably any of: a) What's the size of the corruption? block, cache line, word, bit??? b) Position? e.g. last word in a block or something? c) Data? pile of zero's/ff's junk/etc? d) Is it a missed write, old data, or partially written block? Dave > > Puh. I'm kinda lost on what we could do about this on the Linux > > side. > > Because it also depends on the CPU series, a firmware or hardware issue > seems to be more likely than a Linux bug. > > ATM ASRock is still trying to reproduce the issue. (I'm in contact with > them to. But they have Chinese new year holidays in Taiwan this week.) > > If they can't reproduce it, they have to provide an explanation why the > issues are seen by so many users. > > Regards Stefan > > Hi, Am 28.01.25 um 13:52 schrieb Dr. David Alan Gilbert: > Is there any characterisation of the corrupted data; last time I > looked at the bz there wasn't. Yes, there is. (And I already reported it at least on the Debian bug tracker, see links in the initial message.) f3 reports overwritten sectors, i.e. 
it looks like the pseudo-random test pattern is written to wrong position. These corruptions occur in clusters whose size is an integer multiple of 2^17 bytes in most cases (about 80%) and 2^15 in all cases. The frequency of these corruptions is roughly 1 cluster per 50 GB written. Can others confirm this or do they observe a different characteristic? Regards Stefan > I mean, is it reliably any of: > a) What's the size of the corruption? > block, cache line, word, bit??? > b) Position? > e.g. last word in a block or something? > c) Data? > pile of zero's/ff's junk/etc? > > d) Is it a missed write, old data, or partially written block? > > Dave > >>> Puh. I'm kinda lost on what we could do about this on the Linux >>> side. >> >> Because it also depends on the CPU series, a firmware or hardware issue >> seems to be more likely than a Linux bug. >> >> ATM ASRock is still trying to reproduce the issue. (I'm in contact with >> them to. But they have Chinese new year holidays in Taiwan this week.) >> >> If they can't reproduce it, they have to provide an explanation why the >> issues are seen by so many users. >> >> Regards Stefan >> >> (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #69) Hi Thorsten > > > BIOS: 4.08 with IOMMU enabled > > Does it maybe make a difference if the IOMMU is enabled in the BIOS Setup > (the default iirc is "AUTO") I tested enabled vs. auto - both are with errors. > > > SSD: SAMSUNG 990 Pro 1TB > > In which slot was it? Was it the only device? I told them to use slot 1 and no device in slot 2 - but i didn't ask again to make sure they did so. > > > Trasfer 100x 1G file and still not meet the problem. (Tested via f3) > > Was that transfering a file using f3 (e.g. with some network share), or was > it "transfer 100x 1G file and run f3write and f3read" in parallel? They added a screenshot where i can see lots of OKs from f3read. Regards, Ralph (In reply to Stefan from comment #70) > On my PC > corruptions also disappear if I disable the ethernet controller in the BIOS. > Hi Stefan, i tested this too. On my system it does not matter. Errors also accure if LAN and WLAN are disabled in BIOS. (LAN was plugged in but obviously disabled.) (WLAN is not installed since we are hunting this bug.) Please test what you experienced again to verify. Regards, Ralph (In reply to Christoph Hellwig from comment #68) > On Mon, Jan 20, 2025 at 03:31:28PM +0100, Thorsten Leemhuis wrote: > > So basically you need a specific board and a specific CPU, and only > one M.2 SSD in the two slots to reproduce it? Hi Christoph, problem exists only if you place a single SSD in slot 1. problem in slot 1 disapears if you place a second SSD in slot 2. problem disapears if you place a single SSD in slot 2. Regards, Ralph > > Is there any characterisation of the corrupted data; last time I > > looked at the bz there wasn't. > > Yes, there is. (And I already reported it at least on the Debian bug > tracker, see links in the initial message.) > > f3 reports overwritten sectors, i.e. it looks like the pseudo-random > test pattern is written to wrong position. These corruptions occur in > clusters whose size is an integer multiple of 2^17 bytes in most cases > (about 80%) and 2^15 in all cases. > > The frequency of these corruptions is roughly 1 cluster per 50 GB written. > > Can others confirm this or do they observe a different characteristic? In my tests I was using real data: a backup of my files. 
On one such test I copied over 300K files, variables sizes and types totalling about 60GB. A bit over 20 files got corrupted. I tried copying the files over the network (ethernet) using rsync/ssh. I also tried restoring the files using restic (over ssh as well). And I also tried copying the files locally from a SATA disk. In all cases I got similar results with some files being corrupted. The destination nvme disk was using btrfs and running btrfs scrub after the copy detects quite a few checksum errors. I analyzed some of those corrupted files and one of them happened to be a text file (linux kernel source code). A big portion of the text was replaced with text from another file in the same directory (being text made it easy to find where it came from). So this was a contiguous block of text that was overwritten with a contiguous block of text from another file. If I remember correctly the other file was not corrupted (so the blocks weren't swapped). It looked like a certain block of text was written twice: on the correct file and on another file in the same directory. I also got some jpeg images corrupted. I was able to open and view (partially) those images and it looked like a portion of the image was repeated in a different part of it), so blocks of the same file were probably duplicated and overwritten within itself. The blocks being overwritten seemed to be different sizes on different files. Bruno Hi, just got feedback from ASRock. They asked me to make a video from the corruptions occurring on my remotely (and headless) running system. Maybe I should make video of printing out the logs that can be found an the Linux and Debian bug trackers ... Seems that ASRock is unwilling to solve the problem. Regards Stefan Am 28.01.25 um 15:24 schrieb Stefan: > Hi, > > Am 28.01.25 um 13:52 schrieb Dr. David Alan Gilbert: >> Is there any characterisation of the corrupted data; last time I >> looked at the bz there wasn't. > > Yes, there is. (And I already reported it at least on the Debian bug > tracker, see links in the initial message.) > > f3 reports overwritten sectors, i.e. it looks like the pseudo-random > test pattern is written to wrong position. These corruptions occur in > clusters whose size is an integer multiple of 2^17 bytes in most cases > (about 80%) and 2^15 in all cases. > > The frequency of these corruptions is roughly 1 cluster per 50 GB written. > > Can others confirm this or do they observe a different characteristic? > > Regards Stefan > > >> I mean, is it reliably any of: >> a) What's the size of the corruption? >> block, cache line, word, bit??? >> b) Position? >> e.g. last word in a block or something? >> c) Data? >> pile of zero's/ff's junk/etc? >> >> d) Is it a missed write, old data, or partially written block? >> >> Dave >> >>>> Puh. I'm kinda lost on what we could do about this on the Linux >>>> side. >>> >>> Because it also depends on the CPU series, a firmware or hardware issue >>> seems to be more likely than a Linux bug. >>> >>> ATM ASRock is still trying to reproduce the issue. (I'm in contact with >>> them to. But they have Chinese new year holidays in Taiwan this week.) >>> >>> If they can't reproduce it, they have to provide an explanation why the >>> issues are seen by so many users. >>> >>> Regards Stefan >>> >>> > On Fri, Jan 17, 2025 at 11:30:47AM +0100, Thorsten Leemhuis wrote: > >> Side note: that "PCI-DMA: Using software bounce buffering for IO > >>>> (SWIOTLB)" message does show up on two other AMD machines I own as > >> well. 
One also has a Ryzen 8000, the other one a much older one. The message will always show with more than 4G of memory. It only implies swiotlb is initialized, not that any device actually uses it. > >> And BTW a few bits of the latest development in the bugzilla ticket > >> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ): > >> > >> * iommu=pt and amd_iommu=off seems to work around the problem (in > >> addition to disabling the iommu in the BIOS setup).

iommu=pt calls iommu_set_default_passthrough, which sets iommu_def_domain_type to IOMMU_DOMAIN_IDENTITY. I.e. the hardware IOMMU is left on, but treated as a 1:1 mapping by Linux. amd_iommu=off sets amd_iommu_disabled, which calls disable_iommus, which from a quick read disables the hardware IOMMU. In either case we'll end up using dma-direct instead of dma-iommu. > > > > That suggests the problem is related to the dma-iommu code, and > > my strong suspect is the swiotlb bounce buffering for untrusted > > device. If you feel adventurous, can you try building a kernel > > where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked > > to always return false? > > Tried that, did not help: I still get corrupted data. .. which together with this implies that the problem only happens when using the dma-iommu code (with or without swiotlb buffering for unaligned / untrusted data), and does not happen with dma-direct.

If we assume it also is related to the optimal dma size, which the original report suggests, the values for that might be interesting. For dma-iommu this is: PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1); where IOVA_RANGE_CACHE_MAX_SIZE is 6, i.e. PAGE_SIZE << 5 or 131072 for x86_64. For dma-direct it falls back to dma_max_mapping_size, which is SIZE_MAX without swiotlb, or swiotlb_max_mapping_size, which is a bit complicated due to minimum alignment, but in this case should evaluate to: 258048, which is almost twice as big. And all this unfortunately leaves me really confused. If someone is interested in playing around with it at the risk of data corruption it would be interesting to hack hardcoded values into dma_opt_mapping_size, e.g. plug in the 131072 used by dma-iommu while using dma-direct with the above iommu disable options.

On Tue, 4 Feb 2025 at 06:12, Christoph Hellwig wrote: > > On Sun, Feb 02, 2025 at 08:32:31AM +0000, Bruno Gravato wrote: > > In my tests I was using real data: a backup of my files. > > > > On one such test I copied over 300K files, variables sizes and types > > totalling about 60GB. A bit over 20 files got corrupted. > > I tried copying the files over the network (ethernet) using rsync/ssh. > > I also tried restoring the files using restic (over ssh as well). And > > I also tried copying the files locally from a SATA disk. In all cases > > I got similar results with some files being corrupted. > > The destination nvme disk was using btrfs and running btrfs scrub > > after the copy detects quite a few checksum errors. > > So you used various different data sources, and the destination was > always the nvme device in the suspect slot. > Yes, regardless of the data source, the destination was always a single nvme disk on the main M.2 nvme slot, with the secondary M.2 nvme slot empty. I tried 3 different disks (WD, Crucial and Solidigm) with similar results. If I put any of those disks on the secondary M.2 slot (with the main slot empty) the problem doesn't occur. The one that intrigues me most is if I put 2 nvme disks in, occupying both M.2 slots, the problem doesn't occur either.
The secondary slot must be empty for the issue to happen. I didn't try using the main M.2 slot as source instead of target, to see if the problem also occurs on reading as well. I could try that if you think it's worth testing. > > I analyzed some of those corrupted files and one of them happened to > > be a text file (linux kernel source code). > > A big portion of the text was replaced with text from another file in > > the same directory (being text made it easy to find where it came > > from). > > So this was a contiguous block of text that was overwritten with a > > contiguous block of text from another file. > > If I remember correctly the other file was not corrupted (so the > > blocks weren't swapped). It looked like a certain block of text was > > written twice: on the correct file and on another file in the same > > directory. > > That's a very interesting pattern. > > > I also got some jpeg images corrupted. I was able to open and view > > (partially) those images and it looked like a portion of the image was > > repeated in a different part of it), so blocks of the same file were > > probably duplicated and overwritten within itself. > > > > The blocks being overwritten seemed to be different sizes on different > files. > > This does sound like a fairly common pattern due to SSD FTL issues, > but I still don't want to rule out swiotlb, which due to the bucketing > could maybe also lead to these, but I can't really see how. But the > fact that the affected systems seem to be using swiotlb despite no > good reason for them to do so still leaves me puzzled. > In my case the issue also occurs when both slots are in use. I use ZFS and both NVMes are in a mirror. Scrubbing after writing a larger amount of data to the mirror reports a small number of cksum errors on the disk in slot M2_1. CPU: AMD Ryzen 5 8500G NVMe (2x): WD Red SN700 4000GB Can someone who can readily reproduce this please try with 'iommu.forcedac=1 iommu.strict=1' on the kernel command line? (In reply to Mario Limonciello (AMD) from comment #81) > Can someone who can readily reproduce this please try with 'iommu.forcedac=1 > iommu.strict=1' on the kernel command line? If i boot my system with these options it doesn't find the volume group (LVM) any more. How about if you try them just individually? All tests with fresh install of linux mint 22 (not 22.1 (anyway kernels are the same)) using btrfs, (w)lan disabled: Second try: iommu.forcedac=1 iommu.strict=1 -> vg (LVM) not found First try: iommu.forcedac=1 -> vg found -> errors Firtst try: iommu.strict=1 -> vg found -> errors Third try: iommu.forcedac=1 iommu.strict=1 -> vg (LVM) not found First try: Above 4G Decoding (BIOS): Disabled -> vg found -> errors -- ¯\_(ツ)_/¯ (In reply to Ralph Gerstmann from comment #84) > All tests with fresh install of linux mint 22 Doesn't that use a 6.8 kernel that is heavily patched? I'd say that is a really bad (or maybe even unsuitable?) choice for a upstream bug report like this. Anyway, here are my results with Fedora 41 and a mainline snapshot from today build using the Fedora rawhide config: iommu.forcedac=1 iommu.strict=1 -> does not boot, hangs in the initramfs waiting for a device (either the USB stick with the crytsetup key or the NVMe SSD) iommu.forcedac=1 -> same iommu.strict=1 -> boots, but corruptions still occur Hi, after Matthias was so kind (more than me) to make a video (!) 
for the ASRock support, and after I once again referred to this thread and the many users who have the same problem, ASRock is able to reproduce the issues. Ralph, all tests in comment #40 (including the network issue) where run twice, because I did not collect logs and lspci outputs the first time. (The corruptions seem to depend on which PCIe devices / lanes (?) are used. That's why I also included the lspci outputs.) (As announced in initial message, I cannot run tests ATM and for a while.) Regards Stefan Am 03.02.25 um 19:48 schrieb Stefan: > Hi, > > just got feedback from ASRock. They asked me to make a video from the > corruptions occurring on my remotely (and headless) running system. > Maybe I should make video of printing out the logs that can be found an > the Linux and Debian bug trackers ... > > Seems that ASRock is unwilling to solve the problem. > > Regards Stefan > > > Am 28.01.25 um 15:24 schrieb Stefan: >> Hi, >> >> Am 28.01.25 um 13:52 schrieb Dr. David Alan Gilbert: >>> Is there any characterisation of the corrupted data; last time I >>> looked at the bz there wasn't. >> >> Yes, there is. (And I already reported it at least on the Debian bug >> tracker, see links in the initial message.) >> >> f3 reports overwritten sectors, i.e. it looks like the pseudo-random >> test pattern is written to wrong position. These corruptions occur in >> clusters whose size is an integer multiple of 2^17 bytes in most cases >> (about 80%) and 2^15 in all cases. >> >> The frequency of these corruptions is roughly 1 cluster per 50 GB >> written. >> >> Can others confirm this or do they observe a different characteristic? >> >> Regards Stefan >> >> >>> I mean, is it reliably any of: >>> a) What's the size of the corruption? >>> block, cache line, word, bit??? >>> b) Position? >>> e.g. last word in a block or something? >>> c) Data? >>> pile of zero's/ff's junk/etc? >>> >>> d) Is it a missed write, old data, or partially written block? >>> >>> Dave >>> >>>>> Puh. I'm kinda lost on what we could do about this on the Linux >>>>> side. >>>> >>>> Because it also depends on the CPU series, a firmware or hardware issue >>>> seems to be more likely than a Linux bug. >>>> >>>> ATM ASRock is still trying to reproduce the issue. (I'm in contact with >>>> them to. But they have Chinese new year holidays in Taiwan this week.) >>>> >>>> If they can't reproduce it, they have to provide an explanation why the >>>> issues are seen by so many users. >>>> >>>> Regards Stefan >>>> >>>> >> > OK, so if those parameters are not helping this is likely not related to lazy flush. Another thing that would be useful to try to isolate is disabling TRIM support. Some filesystems enable this by default and there are some systemd units out there that will manually run fstrim. Hello, Can someone who can reproduce this issue please try disabling TRIM and re-running? I can confirm that TRIM does not trigger the issue. My ZFS setup has autotrim off. Cron does it every two weeks or so. The issue is easy reproducable by just writing ~50GB and then scrubbing. > I can confirm that TRIM does not trigger the issue.
> The issue is easy reproducable by just writing ~50GB and then scrubbing.
Sorry, but it sounds like you're contradicting yourself. You say you can't trigger it, and you don't have TRIM enabled, but you find that you can trip it by using a manual trim command?
Can you please clarify?
(In reply to Scharel from comment #89) > I can confirm that TRIM does not trigger the issue. > My ZFS setup has autotrim off. Cron does it every two weeks or so. > The issue is easy reproducable by just writing ~50GB and then scrubbing. Having trouble parsing this. You've turned TRIM off, and there are no issues? But you could still reproduce it with a scrubbing? (In reply to Keith Busch from comment #91) > (In reply to Scharel from comment #89) > > I can confirm that TRIM does not trigger the issue. > > My ZFS setup has autotrim off. Cron does it every two weeks or so. > > The issue is easy reproducable by just writing ~50GB and then scrubbing. > > Having trouble parsing this. You've turned TRIM off, and there are no > issues? But you could still reproduce it with a scrubbing? On a re-read, I think you're saying that TRIM has nothing to do with the issue and it happens with or without it enabled. And if so, that frankly makes sense: TRIM just affects NAND stale page tracking, it has nothing to do with DMA. > And if so, that frankly makes sense: TRIM just affects NAND stale page
> tracking, it has nothing to do with DMA
I should probably add some more color to why Prathyushi and I were both asking about TRIM. There have been reports in the past that TRIM requests (specifically) were getting corrupted. So we're looking to see if this is a similar issue.
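For those re-running the test with TRIM explicitly ruled in or out, the usual sources of trim activity can be checked and exercised roughly like this (a sketch; the mount point and pool name are placeholders):

$ systemctl status fstrim.timer          # periodic trim via systemd; stop it with: sudo systemctl disable --now fstrim.timer
$ findmnt -no OPTIONS /mnt/test          # a "discard" mount option means the filesystem trims continuously
$ zpool get autotrim tank                # ZFS equivalent, for a pool named "tank"
$ sudo fstrim -v /mnt/test               # one manual trim pass, to see whether TRIM alone provokes anything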
(In reply to Mario Limonciello (AMD) from comment #93) > > And if so, that frankly makes sense: TRIM just affects NAND stale page > > tracking, it has nothing to do with DMA > > I should probably add some more color to why Prathyushi and I were both > asking about TRIM. There have been reports in the past that TRIM requests > (specifically) were getting corrupted. So we're looking to see if this is a > similar issue. By TRIM requests getting corrupted, I assume you mean the NVMe DSM list, host -> device DMA, is getting corrupted on the way? That could create these observations, but for it to be specific to a TRIM command? It shouldn't look any different than a write command's DMA payload, right?

TRIM has a start and a range field, and in the case I'm talking about it was specifically the "start" that was getting corrupted.
> It shouldn't look any different than a write command's DMA payload, right?
Yeah, I would think the same way. But 🤷. At least I want to see if that's the case, because it can give us more hints at a repro on other hardware.
Sorry that my comment was unclear. What I wanted to say is that I can provoke the issue without running TRIM. By scrubbing I meant "zfs scrub <pool>", that I use to detect the errors. Errors also only seem to happen while writing data and not with data at rest. (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #85) > (In reply to Ralph Gerstmann from comment #84) > > All tests with fresh install of linux mint 22 > > Doesn't that use a 6.8 kernel that is heavily patched? I'd say that is a > really bad (or maybe even unsuitable?) choice for a upstream bug report like > this. > afaik, the mint team does not patch kernels at all - they just follow Ubuntu kernels. "Linux Mint 22 is based on Ubuntu 24.04 and ships with kernel 6.8. All subsequent point releases will follow the Hardware Enablement (HWE) kernel series, which improves support for newer devices." $uname -a Linux mint 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux > afaik, the mint team does not patch kernels at all - they just follow Ubuntu
> kernels.
Sure, but *Ubuntu kernels* are heavily patched. They are not upstream kernels. Discussions on issues with Ubuntu kernels should be brought to Launchpad.
Kernel Bugzilla is for discussion on upstream kernels.
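For completeness, testing with an unpatched upstream kernel instead of a distribution build usually amounts to something like the following sketch (the version tag is only an example, and the config seed step assumes the distro config is present in /boot):

$ git clone --depth 1 --branch v6.13 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ cp /boot/config-"$(uname -r)" .config      # start from the running distro configuration
$ scripts/config --disable SYSTEM_TRUSTED_KEYS --disable SYSTEM_REVOCATION_KEYS   # often needed with Ubuntu configs
$ make olddefconfig && make -j"$(nproc)"
$ sudo make modules_install install
$ sudo reboot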
(In reply to Mario Limonciello (AMD) from comment #98) > > afaik, the mint team does not patch kernels at all - they just follow > Ubuntu > > kernels. > > Sure, but *Ubuntu kernels* are heavily patched. They are not upstream > kernels. Discussions on issues with Ubuntu kernels should be brought to > Launchpad. > Kernel Bugzilla is for discussion on upstream kernels. This bug was reproduced by others with upstream kernels, too. Ok, then i will take the short way and live with slot 2 and bring my system back to production, means i can not run any tests anymore. Agree? > This bug was reproduced by others with upstream kernels, too. Right; at this point everything is data as we don't have a specific commit, change or firmware that seems to be causing it. > Agree? Totally up to you what to do with your system. Since there is the workaround mentioned here of IOMMU disabled is avoiding it, you might do that for now. (In reply to Mario Limonciello (AMD) from comment #100) > > This bug was reproduced by others with upstream kernels, too. > > Right; at this point everything is data as we don't have a specific commit, > change or firmware that seems to be causing it. Problem is limited to 8X00G in combination with X600 Boards, which smells like Knoll-related and not like Linux. > > > Agree? > > Totally up to you what to do with your system. Since there is the > workaround mentioned here of IOMMU disabled is avoiding it, you might do > that for now. Why should i disable IOMMU? Workaround is - if you have a PCI4 NVME: Populate slot 2 before you populate slot 1. (In reply to Ralph Gerstmann from comment #101) > Workaround is - if you have a PCI4 NVME: > Populate slot 2 before you populate slot 1. To be more precise: Workaround is - if you have a PCI4 NVME and don't run a RAID. : Populate slot 2 before you populate slot 1. Hi, Am 07.02.25 um 20:34 schrieb bugzilla-daemon@kernel.org: > https://bugzilla.kernel.org/show_bug.cgi?id=219609 > > --- Comment #100 from Mario Limonciello (AMD) --- >> This bug was reproduced by others with upstream kernels, too. I can confirm that. It is not very likely that a Ubuntu patch causes another bug with exact the same symptoms ... > Right; at this point everything is data as we don't have a specific commit, > change or firmware that seems to be causing it. We have two *upstream kernel* commits that trigger the corruptions: Both these commits change the transfer size We have a specific firmware that introduces the corruptions: the initial one. We have a specific hardware combination that is causing the issues: ASock Deskmini X600 + AMD Ryzen 8000 series. (It seems that the bug is limited to that CPU series while it has not been tested yet whether other X600 / Knoll systems are affected too. But meanwhile ASRock is able to reproduce the corruptions.) Regards Stefan > >> Agree? > > Totally up to you what to do with your system. Since there is the workaround > mentioned here of IOMMU disabled is avoiding it, you might do that for now. > (In reply to Stefan from comment #103) > We have a specific hardware combination that is causing the issues: > ASock Deskmini X600 + AMD Ryzen 8000 series. (It seems that the bug is > limited to that CPU series while it has not been tested yet whether > other X600 / Knoll systems are affected too. But meanwhile ASRock is > able to reproduce the corruptions.) > Afaik there exist only 4 different X600 systems - all from ASRock. Afaik only the Deskmini X600 has PCI5 capability in slot 1. ... 
Hi, Am 07.02.25 um 22:06 schrieb bugzilla-daemon@kernel.org: > https://bugzilla.kernel.org/show_bug.cgi?id=219609 > > --- Comment #104 from Ralph Gerstmann --- > Afaik there exist only 4 different X600 systems - all from ASRock. > Afaik only the Deskmini X600 has PCI5 capability in slot 1. > ... it has nothing to do with the PCIe version. I have e Gen4 SSD and enforcing Gen3 via BIOS has no effect. Regards Stefan (In reply to Stefan from comment #105) > > it has nothing to do with the PCIe version. I have e Gen4 SSD and > enforcing Gen3 via BIOS has no effect. Hi, My thoughts were not about the capabilities of the inserted device or the BIOS setup. My thoughts are simply about the capabilities of the slot - because there is the obvious difference. Regards, Ralph (In reply to Mario Limonciello (AMD) from comment #93) > There have been reports in the past that TRIM request > (specifically) was getting corrupted. So we're looking to see if this is a > similar issue. Disabling trim did not change anything for me: the corruptions still occurred. (In reply to Stefan from comment #103) > Hi, > > Am 07.02.25 um 20:34 schrieb bugzilla-daemon@kernel.org: > > https://bugzilla.kernel.org/show_bug.cgi?id=219609 > > > > --- Comment #100 from Mario Limonciello (AMD) --- > >> This bug was reproduced by others with upstream kernels, too. > > I can confirm that. > > It is not very likely that a Ubuntu patch causes another bug with exact > the same symptoms ... > Just for the record - there are issues on other Linux distributions as well. I faced arbitrary reboots with Debian 12 as well as with the latest Manjaro (I guess it is kernel 6.12). CPU is a 8500G. I've seen no file or filesystem corruptions, but even freshly installed systems reboot every couple of minutes (up to a few hours). No error messages, no core dumps ... nothing > > Totally up to you what to do with your system. Since there is the > workaround > > mentioned here of IOMMU disabled is avoiding it, you might do that for now. > Unfortunately disabling IOMMU did not change anything, but moving the SSD to the lower socket solved (resp. worked around) the problem. Hi, here is a link to a new BIOS version from ASRock: http://www.simg.de/X600M-STX_4.10.zip (Cannot attach this due to the size limit. The file will be removed in a few month's.) I cannot test this ATM (as announced in December). Maybe someone want to try this. Regards Stefan (In reply to Stefan from comment #109) > here is a link to a new BIOS version from ASRock: Thx. Is this just a new version that might change things for us, or is this supposed to contain a fix for our problem? Hi, with that firmware ASRock can't reproduce the corruptions anymore. Regards Stefan Just for clarification: ASRock sent me that file and asked me to test it (which is not possible ATM) and allowed me to share it. Has anyone seen this issue with the V2.01 BIOS? All I see mentioned are updated BIOS versions. (In reply to Stefan from comment #109) > here is a link to a new BIOS version from ASRock: > http://www.simg.de/X600M-STX_4.10.zip From a quick test it seems like this is fixing the problem for me. (In reply to Alex Kovacs from comment #113) > Has anyone seen this issue with the V2.01 BIOS? All I see mentioned are > updated BIOS versions. Sorry, I did not see Stefan's comment about seeing this on all BIOS before posting my question. 
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114) > (In reply to Stefan from comment #109) > > here is a link to a new BIOS version from ASRock: > > http://www.simg.de/X600M-STX_4.10.zip > > From a quick test it seems like this is fixing the problem for me. Does this require FW version 240522 to be installed first? Created attachment 307686 [details] dmesg from before and after the bios update (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114) > From a quick test it seems like this is fixing the problem for me. TWIMC, here is the dmesg from booting with the old and the new BIOS. The old one might have used slightly different BIOS Setup settings, can't recall, sorry. There are a few new lines, like: ACPI: BGRT 0x000000008D5ED000 000038 (v01 ALASKA A M I 00000001 AMI 00010013) ACPI: WPBT 0x000000008CDC4000 000036 (v01 ALASKA A M I 00000001 MSFT 00010013) ACPI: Reserving BGRT table memory at [mem 0x8d5ed000-0x8d5ed037] ACPI: Reserving SSDT table memory at [mem 0x8cdc3000-0x8cdc3cdd] And it seems there is an additional PCIe device. Wondering if that is due to the new BIOS or some setting differences in the BIOS Setup. /me shrugs and stops investigating, as nobody might care anyway (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114) > From a quick test it seems like this is fixing the problem for me. Someone from c't magazine (which had a actile about building a system with the DeskMini X600, which at least for me was the reason why I bought it) also confirmed that the new BIOS seems to fix this. (In reply to Stefan from comment #112) > ASRock sent me that file and asked me to test it BTW, can you please ask them what they changed (they might not answer, but it's worth asking… :-D ) Ohh, and many thx for your work with this. (In reply to Alex Kovacs from comment #116) > Does this require FW version 240522 to be installed first? I find https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS somewhat confusing, but it sounds you need to install that SIO FW first, unless you already have it. > BTW, can you please ask them what they changed (they might not answer, but > it's > worth asking… :-D ) One thing that irritates me quite a bit about ASRock, is that they never share any changelogs of what they change on each new BIOS update... It's really annoying. How can someone make a decision on whether to update the BIOS or not without having a clue of what changed? I've found a few threads on sff.network forum by users trying to figure out what changed and whether it's a good idea to upgrade or not... This is true for both Deskmini X300 and X600. It's not uncommon to see comments saying you should _not_ upgrade to version X or Y, because a certain feature was removed or that performance declined... > I find > > https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS > somewhat confusing, but it sounds you need to install that SIO FW first, > unless > you already have it. It is indeed confusing, because the BIOS update says "Before updating BIOS 2.01, please update SIO firmware" and the SIO update says "Requires BIOS 2.01 or later version". So which one do you do first? IIRC, before I updated from the original BIOS version to 4.03, I think I did the SIO update first and it all went well. > And it seems there is an additional PCIe device. Wondering if that is due to > the new BIOS or some setting differences in the BIOS Setup. 
> /me shrugs and stops investigating, as nobody might care anyway

I currently have all my spare NVMe disks in use, so unfortunately I can't test the new BIOS, but I'm very interested in all the differences you may find.

Regarding BIOS settings: I guess it's now too late for you, but for others, I suggest saving your current settings to a USB pen before updating. One thing I've learned from the past is that updating the BIOS firmware on the Deskmini will usually reset all the settings and also wipe any saved profiles, so you really need to save them to a USB pen if you want to restore them after the update.

Hi,

On 19.02.25 at 15:21, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219609
>
> --- Comment #117 from The Linux kernel's regression tracker (Thorsten Leemhuis) ---
> There are a few new lines, like:
>
> ACPI: BGRT 0x000000008D5ED000 000038 (v01 ALASKA A M I 00000001 AMI 00010013)
> ACPI: WPBT 0x000000008CDC4000 000036 (v01 ALASKA A M I 00000001 MSFT 00010013)
> ACPI: Reserving BGRT table memory at [mem 0x8d5ed000-0x8d5ed037]
> ACPI: Reserving SSDT table memory at [mem 0x8cdc3000-0x8cdc3cdd]
>
> And it seems there is an additional PCIe device. Wondering if that is due to
> the new BIOS or some setting differences in the BIOS Setup.
>
> /me shrugs and stops investigating, as nobody might care anyway

That may be relevant, and I would like to clarify it before I forward your questions and thanks. Can you share your lspci output and/or compare it with the output I created (see the attachments at the beginning of this bug tracker page)?

Reason: whether the corruptions appear seems to depend on which PCI devices are present (2nd M.2 SSD; in my case the corruptions disappear if I disable the network in the BIOS). Thus, if there is a new PCI device, that may be the reason why the corruptions go away. But the underlying problem may not be resolved.

> --- Comment #119 from Bruno Gravato ---
> It is indeed confusing, because the BIOS update says "Before updating BIOS
> 2.01, please update SIO firmware" and the SIO update says "Requires BIOS 2.01
> or later version". So which one do you do first?
>
> IIRC, before I updated from the original BIOS version to 4.03, I think
> I did the SIO update first and it all went well.

If you have 4.03 you do not need to care about the SIO firmware.

AFAIR, my board came with firmware 1.43. I first updated the SIO firmware, then 2.01 and then 4.03 (and later 4.08).

Regards Stefan

(In reply to Stefan from comment #120)
> On 19.02.25 at 15:21, bugzilla-daemon@kernel.org wrote:
> > And it seems there is an additional PCIe device.

s/an/two/

> > Wondering if that is due to
> > the new BIOS or some setting differences in the BIOS Setup.
> > /me shrugs and stops investigating, as nobody might care anyway

/me wonders if downgrading the BIOS is worth it (if possible!), but decides for now that it is not.

> that may be relevant and I would like to clarify this before I forward
> your questions and thanks. Can you share your lspci output and/or
> compare it with the output I created (see the attachments at the beginning
> of this bug tracker page)

All your lspci logs miss the "SATA controller [0106]: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller [1b21:0612] (rev 02)" that is one of the two new PCI devices after the BIOS update (as can be seen in the logs I uploaded). I doubt I disabled the chip in the BIOS Setup, but it's possible that I did and forgot about it. #Sigh :-/

> If you have 4.03 you do not need to care about the SIO firmware.
> AFAIR, my board came with firmware 1.43. I first updated the SIO firmware,
> then 2.01 and then 4.03 (and later 4.08).

I think mine came with 1.43 as well. I updated the SIO, then the BIOS to 4.03, and later on to 4.08 when it came out. I think all of this was before I found out about the corruption issue.

Anyway, what I mainly wanted to point out was the fact that all BIOS settings get reset, including any saved profiles, when upgrading the BIOS firmware... The only way to preserve any settings is to save them to a USB pen and restore them after the upgrade.

Issue seems fixed with 4.10. I will verify tomorrow. Where is the change log from ASRock?

(In reply to Ralph Gerstmann from comment #123)
> Where is the change log from ASRock?

I doubt they'd publish any interesting details on what was changed. At best, they might provide "Release Notes" with the update using a vaguely worded description like "Fixed various bugs".

ASRock Support answered my question:
___
Sorry, I do not get any change log or closer information what was changed/fixed with this BIOS. The only information is, that we redefined the unused CPU PCIE lanes on BIOS 4.10.
___

Hi,

On 20.02.25 at 02:03, bugzilla-daemon@kernel.org wrote:
> --- Comment #124 from Keith Busch (kbusch@kernel.org) ---
> (In reply to Ralph Gerstmann from comment #123)
> > Where is the change log from ASRock?
>
> I doubt they'd publish any interesting details on what was changed. At best,
> they might provide "Release Notes" with the update using a vaguely worded
> description like "Fixed various bugs".

According to ASRock support, they "redefined the unused CPU PCIE lanes on BIOS 4.10." They cannot provide further information.

Regards Stefan

I will test if the issue now moved to the other slot...

I got my hands on a spare NVMe disk (WD SN850X 1TB) and ran some tests. TL;DR version: BIOS firmware 4.10 seems to prevent the corruption. Now for the details (a rough script version of this copy-and-scrub cycle is sketched below)...

Test 1:
- BIOS firmware 4.08
- M.2 slots - main: WD SN850X / secondary: empty
- installed Debian 12 on btrfs, upgraded kernel to backports (6.12.9), rebooted
- copied about 500k files / 100GB (source was a SATA disk installed on the machine, as in my previous tests)
- running btrfs scrub detects corrupted files on the NVMe disk, as expected
- deleted files and ran fstrim

BIOS upgrade:
- booted into BIOS
- made a backup of my config to a USB pen
- upgraded to BIOS firmware 4.10
- restored my BIOS settings from USB

Test 2:
- BIOS firmware 4.10
- M.2 slots - main: WD SN850X / secondary: empty
- copied the same 500k files / 100GB from the SATA disk again
- btrfs scrub returned no corruptions
- deleted files and ran fstrim

Test 3:
- same as test 2, except I swapped the disk from the main to the secondary M.2 slot
- same result

Test 4:
- put another disk in, so both NVMe M.2 slots were occupied
- still no corrupted files

So BIOS firmware 4.10 seems to have solved the problem.

Bruno

Where are you getting this mythical 4.10 BIOS? I don't see it on https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS

> --- Comment #129 from Mathieu Borderé ---
> Where are you getting this mythical 4.10 BIOS? I don't see it on
> https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS

It's not an official release. Check comment #109 for the link.
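For anyone who wants to repeat a test like the one above, here is a minimal sketch of the copy-and-scrub cycle; the source and target paths are assumptions, not the actual setup used in the tests:

  # copy a large file set onto the NVMe-backed btrfs filesystem
  rsync -a /mnt/sata-source/ /mnt/nvme-test/files/
  sync

  # read back all data and verify the btrfs checksums; -B waits for completion
  sudo btrfs scrub start -B /mnt/nvme-test
  sudo btrfs scrub status /mnt/nvme-test   # any csum errors reported here indicate corruption

  # clean up between runs, as in the tests above
  rm -rf /mnt/nvme-test/files
  sudo fstrim -v /mnt/nvme-test

Because btrfs checksums both data and metadata, a scrub pass with zero errors after the 4.10 update is a reasonably strong signal, but it only covers what was written during the test.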
The BIOS version 4.10 is now available on the official ASRock support page:

https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS

> --- Comment #131 from mbe ---
> The BIOS version 4.10 is now available on the official ASRock support page:
>
> https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS

I just noticed that today and was going to post about it here, but you beat me to it.

Anyway, just to add that I checksummed it and compared it to the version that was posted here a few weeks ago, and it is the exact same version. So for those who already upgraded to 4.10 back then, there is no need to "upgrade" again.

Thx to everyone who helped with this, much appreciated!

If there is anyone with contact to ASRock, please consider asking them to distribute the update through LVFS (hughsie brought that up in the Fediverse and I think it would be a great idea: https://mastodon.social/@hughsie/114221918449126392 )

P.S.: In an ideal world where that is not possible, we'd have some daemon yelling "your data is in danger" at people who still run the old BIOS…

Maybe this is no longer the place to post this, but since installing 4.10 my computer once rebooted spontaneously and didn't detect my 2 NVMe drives in the BIOS anymore. This happened after writing a couple of gigabytes to the drive in the secondary slot. Restarting solved it. It is a bit harsh to blame 4.10, but I went back to 4.08. Posting this just in case anyone else experiences the same issue.

On Fri, 4 Apr 2025 at 08:21, <bugzilla-daemon@kernel.org> wrote:
> --- Comment #134 from Mathieu Borderé ---
> Maybe this is no longer the place to post this, but since installing 4.10 my
> computer once rebooted spontaneously and didn't detect my 2 nvme drives in the
> BIOS anymore. This happened after writing a couple of gigabytes to the drive in
> the secondary slot. Restarting solved it. A bit harsh to blame 4.10, but went
> back to 4.08. Posting this just in case anyone else would experience the same
> issue.

I don't think that is related to 4.10. I had that (a spontaneous reboot) happen to me once or twice before, when I was using firmware 4.08.

I've also experienced some issues with the amdgpu driver crashing sometimes, plus some "glitches" in the graphics occasionally (like a quick screen "flicker", or random pixels "flashing"). When amdgpu crashes, sometimes X freezes or even the full system freezes. Other times amdgpu restarts successfully and X stays alive. This doesn't happen very often, but when it happens I get a bunch of amdgpu errors in the logs.

I think it got worse since I upgraded from kernel 6.12.9 to 6.12.12, and it was much worse with previous kernels (6.11.xx and earlier), but I have no way of reproducing it consistently and it happens too rarely (once or twice a month) to reach any conclusion.

As for the random pixels flashing or the screen flickering, I don't get any errors in the logs, so I can't rule out the possibility that it is a monitor issue.

Check your system logs from just before the reboot and see if there's any relevant message, especially related to amdgpu. Which kernel and AMD firmware versions are you using?

Bruno

The log was clean. I used to have major issues with amdgpu crashing and taking down the desktop environment with it, but that turned out to be a faulty CPU; a CPU replacement fixed those crashes.
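On the log question: one quick way to look for messages from just before such a spontaneous reboot is to read the kernel log of the previous boot, assuming journald keeps a persistent journal (the grep pattern below is only a suggestion):

  # kernel messages from the previous boot, filtered for likely suspects
  journalctl -k -b -1 | grep -iE 'amdgpu|nvme|mce'

  # if "journalctl -b -1" reports no previous boot, enable persistent logging first
  sudo mkdir -p /var/log/journal
  sudo systemctl restart systemd-journald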