Bug 219609

Summary: File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
Product: IO/Storage
Reporter: Stefan (linux-kernel)
Component: NVMe
Assignee: IO/NVME Virtual Default Assignee (io_nvme)
Status: NEW
Severity: normal
CC: bgravato, carnil, kbusch, kernel, mario.limonciello, regressions, reklamukibiras
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version: 6.11.5, most likely 6.5+
Subsystem:
Regression: No
Bisected commit-id:
Attachments: attachment-4531-0.html

Description Stefan 2024-12-18 11:23:04 UTC
Hi,

there are one or two bugs which were originally reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 . For details (logs, etc.), see there. Here, I will post a summary and try to point out the most relevant observations:

Bug 1: Write errors with Lexar NM790 NVME

* Occur since Debian kernel 6.5, but reproduced with upstream kernel 6.11.5 (the only upstream kernel I tested)
* Only occur in 1st M.2 socket (not in the 2nd one on rear side)
* Easiest way to reproduce them is to use f3 ( https://fight-flash-fraud.readthedocs.io/en/latest/usage.html ). f3 reports overwritten sectors
* The errors do not seem to occur in the last files of 500-file (= 500 GB) test runs, and I never detected file system corruption (just defective files; I probably produced more than a thousand of them). The reason for the latter observation may be that file system information is written last. (See message 113 in the Debian bug report.)
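
The distinction between "overwritten" and merely damaged sectors relies on each test sector being self-describing. A minimal Python sketch of that idea follows. It is not f3's actual on-disk format, and the names `make_sector`, `classify_sector`, `write_test_file` and `check_test_file` are hypothetical. Each 512-byte sector stores its own offset followed by filler derived from that offset, so a sector that is valid but belongs to a different offset can be told apart from garbage.

```python
import hashlib
import struct

SECTOR = 512  # bytes per test sector (assumption for this sketch)


def make_sector(offset: int) -> bytes:
    """Build one sector: 8-byte little-endian offset + filler derived from it."""
    header = struct.pack("<Q", offset)
    # SHAKE-256 yields a deterministic filler of the requested length.
    filler = hashlib.shake_256(header).digest(SECTOR - len(header))
    return header + filler


def classify_sector(data: bytes, expected_offset: int) -> str:
    """Return 'ok', 'overwritten' (valid sector of another offset) or 'corrupted'."""
    if data == make_sector(expected_offset):
        return "ok"
    (stored_offset,) = struct.unpack("<Q", data[:8])
    if data == make_sector(stored_offset):
        return "overwritten"
    return "corrupted"


def write_test_file(path: str, sectors: int) -> None:
    """Write `sectors` self-describing sectors to `path`."""
    with open(path, "wb") as f:
        for i in range(sectors):
            f.write(make_sector(i * SECTOR))


def check_test_file(path: str) -> dict:
    """Count sectors per classification in a file written by write_test_file()."""
    counts = {"ok": 0, "overwritten": 0, "corrupted": 0}
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(SECTOR)
            if len(data) < SECTOR:
                break
            counts[classify_sector(data, offset)] += 1
            offset += SECTOR
    return counts
```

For a meaningful test the file would have to be much larger than RAM, or the page cache would have to be dropped between writing and checking; otherwise the check may be served from memory rather than from the SSD.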

(Possible) Bug 2: Read errors with Kingston FURY Renegade

* Only occur in 1st M.2 socket (did not test the rear socket, because the warranty seal would have to be broken in order to remove the heat sink)
* Almost impossible to reproduce; only detected with the Debian kernel that is based on 6.1.112
* 1st occurrence: I detected it in an SSD-intensive computation (using the SSD as data cache) which produced wrong results after a few days (but not in the first days). The error could be reproduced with f3: The corruptions were massive and different files were affected in subsequent f3read runs (==> read errors). Unfortunately I did not store the f3 logs. (I still have the corrupt computation results, so it was real.)
* 2nd occurrence: A single defective sector (read error) in a multi-day attempt to reproduce the error with the same kernel (Debian 6.1.112), see message 113 in the Debian bug report

Consideration / Notes:
* These serial links (PCIe) need to be calibrated. Calibration issues would explain why the errors (dis)appear under certain conditions. But errors like this should be detected (nothing could be found in the kernel logs). Is the error correction possibly inactive? However, this still does not explain why f3 reports overwritten sectors, unless the signal errors occur during command / address transmission.
* Testing is difficult, because the machine is installed remotely and in use. ATM, until about the end of January, I can run tests for bug 1.
* On the AsRock X600M-STX mainboard (without chipset), the CPU (Ryzen 8700G) runs in SoC (system on chip) mode. Maybe someone did not test this properly ...

Regards Stefan
Comment 1 Keith Busch 2024-12-18 15:23:50 UTC
You mention the observation has occurred since kernel 6.5. Are you saying that this used to work in older kernels?
Comment 2 Stefan 2024-12-18 17:41:17 UTC
Bug 1: The oldest non-working Debian kernel is 6.3.7 (package linux-image-6.3.0-1-amd64); Debian kernel 6.3.5 (latest version of package linux-image-6.3.0-0-amd64) works. (I'm assuming it's not Debian-specific because the error also occurs in an upstream kernel (6.11.5).)

If you have patches, I could compile one of these version and then try out the patches.

(Possible) Bug 2: Occurred with 6.1 kernels, but very difficult to reproduce. So, I'm not sure whether this error is limited to this kernel version.

Because I cannot test both bugs at the same time (the bugs occur only in the 1st M.2 socket and the PC is remote), we should first focus on Bug 1. If that bug is fixed, I would run a long-term test with the fixed kernel. (Because these are read errors, this can be done by a checksum test of existing files in the background.)
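
A background checksum test of existing files could look roughly like the following Python sketch. It is only an illustration under assumptions: the helper names `sha256_of`, `build_manifest` and `verify_manifest` are hypothetical, and the manifest is kept in memory rather than stored on a separate, trusted disk.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify_manifest(root, manifest):
    """Return the sorted relative paths whose current checksum differs."""
    return sorted(
        rel
        for rel, expected in manifest.items()
        if sha256_of(os.path.join(root, rel)) != expected
    )
```

Repeated calls to `verify_manifest` only exercise the SSD if the data is not already in the page cache, so the file set would need to be larger than RAM or the cache would need to be dropped between runs.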
Comment 3 Bruno Gravato 2025-01-02 17:16:05 UTC
I have the same barebone (ASRock Deskmini X600) with Ryzen 8600G CPU.

I've run into similar issues.

In my case I'm using btrfs on a Solidigm P44 Pro M.2 nvme 1TB disk. After copying a large amount of files (over 150K-300K files, variable sizes) to the btrfs partition and running btrfs scrub on the partition, it will report some files with checksum errors.

If I put the disk in the secondary M.2 slot in the back this problem does not occur.

RAM is 2x16GB Kingston Fury Impact DDR5 6400 SODIMM, but I've also tried a Crucial DDR5 5600 SODIMM with the same results. I ran with a single memory stick, dual sticks, different speeds, etc... all with the same result. RAM does not seem to be the problem.

I also had same results with a WD nvme SN750 500GB disk.

I've tried both disks (running the same installation), on a different machine (Deskmini X300) and no errors.

Only a few files get corrupted. On my last test, copying nearly 400K files, only 22 got corrupted.

I mounted the btrfs partition with rescue=all and I was able to read the corrupted files. I compared a few to the original files and looks like a big chunk of data in the middle of the files was altered (contiguous blocks). So it's not just a bit flip here and there... it's a big portion of the file that gets messed up (in contiguous blocks).
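
To quantify how large and how contiguous the altered regions are, the original and the copied file can be compared block by block. The following Python sketch shows one way to do that; the function name `differing_ranges` and the 4 KiB block size are assumptions chosen for illustration.

```python
def differing_ranges(path_a, path_b, block_size=4096):
    """Return (offset, length) runs where the two files differ, block by block."""
    ranges = []
    run_start = None  # start offset of the current run of differing blocks
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break
            if a != b:
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                # A matching block ends the current run.
                ranges.append((run_start, offset - run_start))
                run_start = None
            offset += max(len(a), len(b))
    if run_start is not None:
        ranges.append((run_start, offset - run_start))
    return ranges
```

A result with a few long runs, rather than many isolated blocks, would match the observation that large contiguous portions of a file are affected.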

System is running Debian stable with some packages from backports, namely the kernel. I got same results with kernel 6.10.5 and 6.11.10 (from bookworm-backports) and 6.12.6 (from testing).

Also got the same results with BIOS firmware 4.03 and 4.08 (downloaded from asrock website).

I tried different sources for the files: copying over LAN using either rsync over ssh or restic backup restore, but also from a locally installed SATA SSD disk with the same files. The same files copied to the SATA disk (also btrfs) do not get corrupted.

Using the secondary M.2 slot (gen4x4) also seems to be free of errors. It only happens when the disk is in the main M.2 slot (gen5x4).

I thought this could be a faulty M.2 slot on my board, but after seeing other reports of similar problem, now I'm more convinced that this may be either BIOS firmware issue or kernel issue or a combination of both.

Anyway I thought I'd add my report here hoping it can help.

I can run some more tests if needed.

In terms of reproducibility, I can reproduce this fairly consistently given I copy a large enough sample of files (my "sample" is my personal files from my home dir in my older PC, which are over 700K files). Copying 150K-300K files (20-60GB of data) is usually enough to cause checksum errors on some files when running btrfs scrub (it seems to be always on different files). With the disk on the secondary M.2 slot I copied all 700K+ files (twice I think) and no errors.

I haven't tried older kernel versions. I can try 6.1.x from debian stable, but I think this has issues with amdgpu driver and can eventually freeze the system with some amdgpu error, so it may not be very reliable for testing.

Let me know if you have any questions and I'll try to answer.
Comment 4 Stefan 2025-01-03 14:15:17 UTC
With the help of TJ from the Debian kernel team ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 ), at least a workaround could be found.

The bug is triggered by the patch "nvme-pci: clamp max_hw_sectors based on DMA optimized limitation" (see https://lore.kernel.org/linux-iommu/20230503161759.GA1614@lst.de/ ) introduced in 6.3.7

To examine the situation, I added this debug info (all files are located in `drivers/nvme/host`):

> --- core.c.orig       2025-01-03 14:27:38.220428482 +0100
> +++ core.c    2025-01-03 12:56:34.503259774 +0100
> @@ -3306,6 +3306,7 @@
>               max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
>       else
>               max_hw_sectors = UINT_MAX;
> +     dev_warn(ctrl->device, "id->mdts=%d,  max_hw_sectors=%d, ctrl->max_hw_sectors=%d\n", id->mdts, max_hw_sectors, ctrl->max_hw_sectors);
>       ctrl->max_hw_sectors =
>               min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

6.3.6 (last version w/o mentioned patch and w/o data corruption) says:

> [  127.196212] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, 
> ctrl->max_hw_sectors=16384
> [  127.203530] nvme nvme0: allocated 40 MiB host memory buffer.

6.3.7 (first version w/ mentioned patch and w/ data corruption) says:

> [   46.436384] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, 
> ctrl->max_hw_sectors=256
> [   46.443562] nvme nvme0: allocated 40 MiB host memory buffer.

After I reverted the mentioned patch with the following change:

> --- pci.c.orig        2025-01-03 14:28:05.944819822 +0100
> +++ pci.c     2025-01-03 12:54:37.014579093 +0100
> @@ -3042,7 +3042,8 @@
>        * over a single page.
>        */
>       dev->ctrl.max_hw_sectors = min_t(u32,
> -             NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +//           NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +             NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
>       dev->ctrl.max_segments = NVME_MAX_SEGS;
>  
>       /*

6.11.5 (used this version because the sources were lying around) works and says:

> [    1.251370] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, 
> ctrl->max_hw_sectors=16384
> [    1.261168] nvme nvme0: allocated 40 MiB host memory buffer.

Thus, the corruption occurs if `ctrl->max_hw_sectors` is set to a different (smaller) value than the one defined by `id->mdts`. 

If a smaller value is supposed to be allowed, the mentioned patch is not the (root) cause, but reverting it is at least a workaround.
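
The numbers in the debug output can be reproduced with a small userspace model of the limit calculation quoted from `core.c`. This Python sketch is an approximation, not the kernel code: it assumes the controller's minimum page size is 4 KiB (`mpsmin=0`), which is consistent with `id->mdts=7` yielding 1024 sectors, and the function names are hypothetical.

```python
UINT_MAX = 0xFFFFFFFF


def mps_to_sectors(mdts, mpsmin=0):
    """Convert MDTS (a power of two, in minimum-page-size units) to 512-byte sectors.

    Assumption: the minimum page size is 2**(12 + mpsmin) bytes.
    """
    shift = mdts + (12 + mpsmin) - 9  # "- 9" converts bytes to 512-byte sectors
    value = 1 << shift
    return value if value <= UINT_MAX else UINT_MAX


def min_not_zero(a, b):
    """Smaller of a and b, treating 0 as 'no limit set'."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def effective_max_hw_sectors(transport_limit, mdts):
    """Combine the limit set by the PCI driver with the limit reported via MDTS."""
    device_limit = mps_to_sectors(mdts) if mdts else UINT_MAX
    return min_not_zero(transport_limit, device_limit)
```

With the logged values, `effective_max_hw_sectors(16384, 7)` gives 1024 sectors (512 KiB) for 6.3.6 and the patched 6.11.5, while `effective_max_hw_sectors(256, 7)` gives 256 sectors (128 KiB) for 6.3.7.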
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-08 14:42:28 UTC
I forwarded the problem by mail[1]
https://lore.kernel.org/all/401f2c46-0bc3-4e7f-b549-f868dc1834c5@leemhuis.info/

Bruno, Stefan, can we CC you on further mails regarding this? This would expose your email address to the public. 

[1] reminder, bugzilla.kernel.org is usually a bad place to report bugs, as mentioned on https://docs.kernel.org/admin-guide/reporting-issues.html
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-08 14:44:02 UTC
Oh, and did anyone check if mainline is still affected?
Comment 7 Keith Busch 2025-01-08 15:19:41 UTC
Even with the patch reverted, the host can still send IO that aligns to the smaller sized limits anyway, so it sounds like the patch that this has been bisected to may have merely exposed an NVMe controller bug.
Comment 8 Bruno Gravato 2025-01-08 15:31:22 UTC
Hi,

Yes you can CC me.

I didn't try the patch mentioned above.

This is my (new) daily driver and I needed to get the machine up and running as quickly as possible. I went with the workaround of putting the disk in the secondary M.2 slot (gen4 vs gen5 on the main slot). No problems so far.

The latest kernel I tried was 6.12.6 and it still had the problem.

I should be able to put my old disk (WD Black SN750) on the main slot and run some more tests with the mainline kernel when I get the chance.
Comment 9 Keith Busch 2025-01-08 15:35:19 UTC
Are all these reports using the same model nvme controller? Or is this happening across a variety of vendors?
Comment 10 Stefan 2025-01-08 17:25:38 UTC
My email address "linux-kernel@simg.de" can be CC'd publicly. But it is an alias, i.e. I cannot reply directly from it. That's why I prefer the bug tracker.

According to a forum of the German IT magazine c't, the bug was also noticed by several other people: https://www.heise.de/forum/c-t/Wuensch-dir-mal-wieder-was/X600-btrfs-scrub-uncorrectable-errors/thread-7689447 . (That hardware was recommended by that magazine.) Furthermore, it seems that the errors do not occur with all SSDs. I'm trying to figure out whether this has something to do with the MDTS setting (which can be queried using the `nvme id-ctrl` command). 

The problem also occurs in 6.13.0-rc6 (unless I revert the patch introduced in 6.3.7).

Just a few thoughts (I'm not an NVMe or kernel developer): I would not expect that reducing the MDTS (= max data transfer size) limit (that is what the patch does) should cause such errors. The only explanation is that one component still assumes that transfers up to the size reported by MDTS (a setting of the SSD) can be used. 

If that assumption is valid (the NVMe specs should answer this question), the patch is responsible for the problems. 

Otherwise, the root cause is the component that does not take the reduced limit into account. 

While the 6.13 kernel was compiling I searched the kernel sources for the term "mdts". It seems that this setting is only used to initialize `max_hw_sectors` of the `nvme_ctrl` struct. If that is correct, the other component that causes the problem is probably some kind of firmware.
Comment 11 mbe 2025-01-08 21:29:22 UTC
Hi,

I can also reliably reproduce the data corruption with following setup:

Deskmini X600
AMD Ryzen7 8700G
2x 16 GB Kingston-FURY KF564S38IBK2-32
Samsung 990 Pro 2 TB NVMe SSD, latest firmware 4B2QJXD7, installed on primary nvme slot
Filesystem: ext4
OS: Ubuntu 24.10 with kernel 6.11.0-13.14

When copying ~60 GB of data to the nvme, some files always get corrupted.
A diff between the source and the copied files shows that contiguous chunks of < 3 MB in the middle of the files are either filled with zeros or garbage data.

Also affected: Ubuntu 24.04 with kernel 6.8.0.
Not affected: Debian 12 with kernel 6.1.119-1

The bad news:
Applying the patch from comment #4 (using dma_max_mapping_size() instead of dma_opt_mapping_size() to set max_hw_sectors)
to kernel 6.11.0-13.14 did not solve the problem in my case, the data corruption still occurs.

6.11.0-13.14 with patch and corruption:
>[    1.429438] nvme nvme0: pci function 0000:02:00.0
>[    1.433783] nvme nvme0: id->mdts=9,  max_hw_sectors=4096,
>ctrl->max_hw_sectors=16384
>[    1.433787] nvme nvme0: D3 entry latency set to 10 seconds
>[    1.438308] nvme nvme0: 16/0/0 default/read/poll queues
Comment 12 Stefan 2025-01-08 23:45:59 UTC
Because it might be a Firmware issue, I updated the BIOS/UEFI and installed the latest firmware blobs (version 20241210 from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ ): No success. Furthermore I found a setting where PCIe speed could be reduced. Changing this value to Gen 3 had no effect.

> The bad news:
> Applying the patch from comment #4 (using dma_max_mapping_size() instead of 
> dma_opt_mapping_size() to set max_hw_sectors)
> to kernel 6.11.0-13.14 did not solve the problem in my case, the data
> corruption still occurs.

Strange, especially because 6.1 is working.

You might try to replace `dma_max_mapping_size(&pdev->dev) >> 9` by `min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9, 1024)`. This will limit max_hw_sectors to 1024 sectors, the value which works for me.

I just backported the patch from 6.3.7 to 6.1.112. The corruption now also occurs in that kernel. So for me, the problem is connected to the patch.
Comment 13 Keith Busch 2025-01-09 00:09:36 UTC
If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, and now Samsung NVMe's? Unless they're all using the same 3rd party controller, like Phison or SMI, then I guess we'd have some trouble saying it's a vendor problem. Or perhaps we're now mixing multiple problems at this point, considering one patch fixes some but not others.

Do these drives have volatile write caches? You can check with 

 # cat /sys/block/nvme0n1/queue/fua

A non-zero value means "yes". Replace "nvme0n1" with whatever your device is named, like nvme1n1, nvme2n1, etc...
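
When collecting this from several machines, the check can be scripted. The Python sketch below reads the `fua` attribute mentioned above and, as an addition not mentioned in this thread, the block layer's `queue/write_cache` attribute. The helper names and the `sysfs_root` parameter are hypothetical; the latter exists only so the functions can be pointed at a test directory.

```python
import os


def read_queue_attr(device, attr, sysfs_root="/sys/block"):
    """Return the stripped contents of /sys/block/<device>/queue/<attr>."""
    path = os.path.join(sysfs_root, device, "queue", attr)
    with open(path) as f:
        return f.read().strip()


def write_cache_summary(device, sysfs_root="/sys/block"):
    """Collect the write-cache related queue attributes of one block device."""
    return {
        "fua": int(read_queue_attr(device, "fua", sysfs_root)),
        "write_cache": read_queue_attr(device, "write_cache", sysfs_root),
    }
```

For example, `write_cache_summary("nvme0n1")` would return a dictionary with the `fua` value as an integer and the `write_cache` mode as a string.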

Is ext4 used in the other observations too? If not, what other filesystems are used?
Comment 14 Bruno Gravato 2025-01-09 03:09:49 UTC
(In reply to Keith Busch from comment #13)
> If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
> and now Samsung NVMe's? 

In my case it was Solidigm P44 Pro 1TB and WD Black SN750 500GB

> Do these drives have volatile write caches? You can check with 
> 
>  # cat /sys/block/nvme0n1/queue/fua
> 

I get 1, so yes.

> Is ext4 used in the other observations too? If not, what other filesystems
> are used?

In my case I was using btrfs. Running btrfs scrub gave me some checksum errors and that's how I found out files were getting corrupted... If I was on ext4 it could have taken months for me to find out...

The somewhat odd thing is that the same disks on the secondary M.2 nvme slot work fine with no error.

The only difference in the specs between the two M.2 slots is that one is gen5x4 (the main one, which is the one with problems) and the other is gen4x4 (this works fine, no errors).
Comment 15 Keith Busch 2025-01-09 03:47:53 UTC
As a test, could you turn off the volatile write cache?

  # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0

Your write performance may be pretty bad, but it's just a temporary test to see if the problem still occurs without a volatile cache. A power cycle reverts the setting back to the default state.
Comment 16 Keith Busch 2025-01-09 03:52:35 UTC
Sorry, depending on the nvme version, the value parameter may be "-V" (capital "V").
Comment 17 Stefan 2025-01-09 15:44:24 UTC
Hi,

Following Thorsten's hints, I'm trying to reply to both the bug tracker and
the mailing list.

> --- Comment #13 from Keith Busch (kbusch@kernel.org) ---
> If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
> and now Samsung NVMe's?

The Kingston read errors may be something different. They are described
in detail in messages #108 and #113 of the Debian Bug Tracker
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372

With the Kingston, I never saw the write errors that occur with Lexar and
Samsung on newer kernels (and which are easy to reproduce).

(ATM I cannot provide test results from the Kingston SSD because the
Lexar is installed, and the PC is installed remotely and in use. Thus I
can't swap the SSDs that often.)

> # cat /sys/block/nvme0n1/queue/fua

Returns "1"

> --- Comment #15 from Keith Busch (kbusch@kernel.org) --- as a test,
> could you turn off the volatile write cache?
>
> # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0
Had to modify that a little bit:

   $ nvme get-feature /dev/nvme0n1 -f 6
   get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
   $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0
   set-feature:0x06 (Volatile Write Cache), value:00000000, cdw12:00000000, save:0
   $ nvme get-feature /dev/nvme0n1 -f 6
   get-feature:0x06 (Volatile Write Cache), Current value:00000000

Corruptions disappear (under 6.13.0-rc6) if volatile write cache is
disabled (and appear again if I turn it on with "-v 1").

But, lspci says I have a

   Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])

Note the "DRAM-less". This is confirmed by
https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of
this, the SSD has a (*non-*volatile) SLC write cache and it uses 40 MB
Host-Memory-Buffer (HMB).

Might there be an issue with the HMB allocation/usage?

Is the mainboard firmware involved in HMB allocation/usage? That
would explain why volatile write caching via HMB works in the 2nd M.2
socket.

BTW, controller is MaxioTech MAP1602A, which is different from the
Samsung controllers.

> --- Comment #14 from Bruno Gravato (bgravato@gmail.com) --- The only
>  difference in the specs between the two M.2 slots is that one is
> gen5x4 (the main one, which is the one with problems) and the other
> is gen4x4 (this works fine, no errors).

AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of
the CPU. On my PC, it runs in Gen4 mode (limited by SSD).

The secondary M.2 socket on the rear side is probably connected to PCIe
lanes which are usually used by a chipset -- but that socket works.

Regards Stefan
Comment 18 mbe 2025-01-09 23:38:25 UTC
Hi,

lspci says:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]

It uses volatile write cache:
> cat /sys/block/nvme0n1/queue/fua 
> 1

Test 1:
Disabling volatile write cache via nvme-cli 
=> no corruption occurs

Test 2:
volatile write cache enabled, using the suggestion from comment #12
> dev->ctrl.max_hw_sectors = min_t(u32,
> NVME_MAX_KB_SZ << 1, min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9,
> 1024));

=> corruption still occurs

> [    0.815340] nvme nvme0: id->mdts=9,  max_hw_sectors=4096,
> ctrl->max_hw_sectors=1024
Comment 19 Bruno Gravato 2025-01-10 10:41:13 UTC
Created attachment 307463 [details]
attachment-4531-0.html

Hi,

I can reply via email, that's not a problem.

I'll try to run some more tests when I get the chance (it's been a very
busy week, sorry).
Besides the volatile write cache test, any other test I should try?

Regarding the M.2 slots. I believe this motherboard has no chipset. So both
slots should be connected directly to the CPU (mine is Ryzen 8600G),
although they might be connecting to different parts of the CPU, right? I
guess that can make a difference.

My disks are gen4 as well.

Bruno

Comment 20 Bruno Gravato 2025-01-10 11:17:59 UTC
Hi,

(resending in text-only mode, because mailing lists don't like HTML
emails... sorry to those getting this twice)

I can reply via email, that's not a problem.

I'll try to run some more tests when I get the chance (it's been a
very busy week, sorry).
Besides the volatile write cache test, any other test I should try?

Regarding the M.2 slots. I believe this motherboard has no chipset. So
both slots should be connected directly to the CPU (mine is Ryzen
8600G), although they might be connecting to different parts of the
CPU, right? I guess that can make a difference.

My disks are gen4 as well.

Bruno


Comment 21 mbe 2025-01-13 21:01:51 UTC
Hi,

I did some more tests. First, I retrieved the following values under Debian:

> Debian 12, Kernel 6.1.119, no corruption
> cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb 
> 2048
>
> cat /sys/class/block/nvme0n1/queue/max_sectors_kb 
> 1280
>
> cat /sys/class/block/nvme0n1/queue/max_segments
> 127
>
> cat /sys/class/block/nvme0n1/queue/max_segment_size 
> 4294967295

To achieve the same values on Kernel 6.11.0-13, I had to make the following changes to drivers/nvme/host/pci.c

> --- pci.c.org 2024-09-15 16:57:56.000000000 +0200
> +++ pci.c     2025-01-13 21:18:54.475903619 +0100
> @@ -41,8 +41,8 @@
>   * These can be higher, but we need to ensure that any command doesn't
>   * require an sg allocation that needs more than a page of data.
>   */
> -#define NVME_MAX_KB_SZ       8192
> -#define NVME_MAX_SEGS        128
> +#define NVME_MAX_KB_SZ       4096
> +#define NVME_MAX_SEGS        127
>  #define NVME_MAX_NR_ALLOCATIONS      5
> 
>  static int use_threaded_interrupts;
> @@ -3048,8 +3048,8 @@
>        * Limit the max command size to prevent iod->sg allocations going
>        * over a single page.
>        */
> -     dev->ctrl.max_hw_sectors = min_t(u32,
> -             NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +     //dev->ctrl.max_hw_sectors = min_t(u32,
> +     //      NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
>       dev->ctrl.max_segments = NVME_MAX_SEGS;
>  
>       /*

So basically, dev->ctrl.max_hw_sectors stays zero, so that in core.c it is set
to the value of nvme_mps_to_sectors(ctrl, id->mdts)  (=> 4096 in my case)

> if (id->mdts)
>   max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
> else
>   max_hw_sectors = UINT_MAX;
> ctrl->max_hw_sectors =
>   min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

But that alone was not enough: 
Tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still resulted in corruptions.
They only went away after reverting this value back to 127 (the value from kernel 6.1).

Additional logging to get the values of the following statements
> (dma_opt_mapping_size(&pdev->dev) >> 9) = 256
> (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!]
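
The very large second value equals (2^64 - 1) >> 9, which would be consistent with `dma_max_mapping_size()` reporting no limit (`SIZE_MAX` on a 64-bit system; this is an inference from the number, not something verified here). The Python sketch below models how `min_t(u32, ...)` in `pci.c` turns both logged values into a sector limit. The function names are hypothetical and the model only mimics the truncation to 32 bits.

```python
U32_MASK = 0xFFFFFFFF
SIZE_MAX = (1 << 64) - 1  # assumption: 64-bit size_t


def min_t_u32(a, b):
    """Mimic the kernel's min_t(u32, a, b): cast both operands to u32, take the minimum."""
    return min(a & U32_MASK, b & U32_MASK)


def pci_max_hw_sectors(mapping_size_bytes, nvme_max_kb_sz=8192):
    """Model min_t(u32, NVME_MAX_KB_SZ << 1, mapping_size >> 9) from pci.c."""
    return min_t_u32(nvme_max_kb_sz << 1, mapping_size_bytes >> 9)
```

In this model, a mapping size of `256 << 9` bytes (the logged `dma_opt_mapping_size()` result) gives 256 sectors, whereas `SIZE_MAX` gives 16384 sectors, matching the `ctrl->max_hw_sectors` values logged in the earlier comments.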

@Stefan, can you check which value NVME_MAX_SEGS had in your tests?
It also seems to have an influence.

Best regards, Matthias
Comment 22 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-13 21:14:55 UTC
(In reply to mbe from comment #21)

> To achieve the same values on Kernel 6.11.0-13, 

Please clarify: what upstream kernel does that distro-specific version number refer to? And is that a kernel that is vanilla or close to upstream? And why use an EOL series anyway? It's best to use a fresh mainline for all testing, except when data from older kernels is required.
Comment 23 Bruno Gravato 2025-01-15 06:38:02 UTC
I finally got the chance to run some more tests with some interesting
and unexpected results...

I put another disk (WD Black SN750) in the main M.2 slot (the
problematic one), but kept my main disk (Solidigm P44 Pro) in the
secondary M.2 slot (where it doesn't have any issues).
I reran my test: step 1) copy a large number of files to the WD disk
(main slot), step 2) run btrfs scrub on it and expect some checksum
errors
To my surprise there were no errors!
I tried it twice with different kernels (6.2.6 and 6.11.5) and booting
from either disk (I have linux installations on both).
Still no errors.

I then removed the Solidigm disk from the secondary and kept the WD
disk in the main M.2 slot.
Reran my tests (on kernel 6.11.5) and bang! btrfs scrub now detected
quite a few checksum errors!

I then tried disabling volatile write cache with "nvme set-feature
/dev/nvme0 -f 6 -v 0"
"nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but
/sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to
turn into 0?

I re-ran my test, but I still got checksum errors on btrfs scrub. So
disabling volatile write cache (assuming I did it correctly) didn't
make a difference in my case.

I put the Solidigm disk back into the secondary slot, booted and reran
the test on the WD disk (main slot) just to be triple sure and still
no errors.

So it looks like the corruption only happens if only the main M.2 slot
is occupied and the secondary M.2 slot is free.
With two nvme disks (one on each M.2 slot), there were no errors at all.

Stefan, did you ever try running your tests with 2 nvme disks
installed on both slots? Or did you use only one slot at a time?


Bruno
Comment 24 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-15 08:40:19 UTC
On 15.01.25 07:37, Bruno Gravato wrote:
> I finally got the chance to run some more tests with some interesting
> and unexpected results...

FWIW, I briefly looked into the issue in between as well and can
reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main
M.2 slot of my DeskMini X600 using btrfs on a mainline kernel with a
config from Fedora rawhide.

So what can those of us affected by the problem do to narrow it down?

What does it mean that disabling the NVMe device's write cache often
but apparently not always helps? Is it just reducing the chance of the
problem occurring or accidentally working around it?

hch initially brought up that swiotlb seems to be used. Are there any
BIOS setup settings we should try? I tried a few changes yesterday, but
I still get the "PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)" message in the log and not a single line mentioning DMAR.

Ciao, Thorsten

[1] see start of this thread and/or
https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details

$ journalctl -k | grep -i -e DMAR -e IOMMU -e AMD-Vi -e SWIOTLB
AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
pci 0000:00:01.0: Adding to iommu group 0
pci 0000:00:01.3: Adding to iommu group 1
pci 0000:00:02.0: Adding to iommu group 2
pci 0000:00:02.3: Adding to iommu group 3
pci 0000:00:03.0: Adding to iommu group 4
pci 0000:00:04.0: Adding to iommu group 5
pci 0000:00:08.0: Adding to iommu group 6
pci 0000:00:08.1: Adding to iommu group 7
pci 0000:00:08.2: Adding to iommu group 8
pci 0000:00:08.3: Adding to iommu group 9
pci 0000:00:14.0: Adding to iommu group 10
pci 0000:00:14.3: Adding to iommu group 10
pci 0000:00:18.0: Adding to iommu group 11
pci 0000:00:18.1: Adding to iommu group 11
pci 0000:00:18.2: Adding to iommu group 11
pci 0000:00:18.3: Adding to iommu group 11
pci 0000:00:18.4: Adding to iommu group 11
pci 0000:00:18.5: Adding to iommu group 11
pci 0000:00:18.6: Adding to iommu group 11
pci 0000:00:18.7: Adding to iommu group 11
pci 0000:01:00.0: Adding to iommu group 12
pci 0000:02:00.0: Adding to iommu group 13
pci 0000:03:00.0: Adding to iommu group 14
pci 0000:03:00.1: Adding to iommu group 15
pci 0000:03:00.2: Adding to iommu group 16
pci 0000:03:00.3: Adding to iommu group 17
pci 0000:03:00.4: Adding to iommu group 18
pci 0000:03:00.6: Adding to iommu group 19
pci 0000:04:00.0: Adding to iommu group 20
pci 0000:04:00.1: Adding to iommu group 21
pci 0000:05:00.0: Adding to iommu group 22
AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
AMD-Vi: Interrupt remapping enabled
AMD-Vi: Virtual APIC enabled
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
Comment 25 Stefan 2025-01-15 10:47:51 UTC
Hi,

(replying to both, the mailing list and the kernel bug tracker)

Am 15.01.25 um 07:37 schrieb Bruno Gravato:
> I then removed the Solidigm disk from the secondary and kept the WD
> disk in the main M.2 slot. Rerun my tests (on kernel 6.11.5) and
> bang! btrfs scrub now detected quite a few checksum errors!
>
> I then tried disabling volatile write cache with "nvme set-feature
> /dev/nvme0 -f 6 -v 0" "nvme get-feature /dev/nvme0 -f 6" confirmed it
> was disabled, but /sys/block/nvme0n1/queue/fua still showed 1... Was
> that supposed to turn into 0?

You can check this using `nvme get-feature /dev/nvme0n1 -f 6`

> So it looks like the corruption only happens if only the main M.2
> slot is occupied and the secondary M.2 slot is free. With two nvme
> disks (one on each M.2 slot), there were no errors at all.
>
> Stefan, did you ever try running your tests with 2 nvme disks
> installed on both slots? Or did you use only one slot at a time?

No, I only tested these configurations:

1. 1st M.2: Lexar;    2nd M.2: empty
    (Easy to reproduce write errors)
2. 1st M.2: Kingston; 2nd M.2: Lexar
    (Difficult to reproduce read errors with the 6.1 kernel, but no issues
    with newer ones within several months of intense use)

I'll swap the SSDs soon. Then I will also test other configurations and
will try out a third SSD. If I get corruption with other SSDs, I will
check which modifications help.

Note that I need both SSDs (configuration 2) in about one week and
cannot change this for about 3 months (I already announced this in December).

Thus, if there are things I shall test with configuration 1, please
inform me quickly.

Just as a reminder (for those who did not read the two bug trackers):
I tested with `f3` (a utility used to detect scam disks) on ext4.
`f3` reports overwritten sectors. In configuration 1 these are write
errors (they show up again when I read the data again).

(If no other SSD-intensive jobs are running), the corruptions do not occur
in the last files, and I never noticed file system corruption; only
file contents are corrupt. (This is probably luck, but also has something
to do with the journal and the time when file system information is
written.)


Am 13.01.25 um 22:01 schrieb bugzilla-daemon@kernel.org:
 > https://bugzilla.kernel.org/show_bug.cgi?id=219609
 >
 > --- Comment #21 from mbe ---
 > Hi,
 >
 > I did some more tests. At first I retrieved the following values under debian
 >
 >> Debian 12, Kernel 6.1.119, no corruption
 >> cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb
 >> 2048
 >>
 >> cat /sys/class/block/nvme0n1/queue/max_sectors_kb
 >> 1280
 >>
 >> cat /sys/class/block/nvme0n1/queue/max_segments
 >> 127
 >>
 >> cat /sys/class/block/nvme0n1/queue/max_segment_size
 >> 4294967295
 >
 > To achieve the same values on Kernel 6.11.0-13, I had to make the following
 > changes to drivers/nvme/host/pci.c
 >
 >> --- pci.c.org 2024-09-15 16:57:56.000000000 +0200
 >> +++ pci.c     2025-01-13 21:18:54.475903619 +0100
 >> @@ -41,8 +41,8 @@
 >>    * These can be higher, but we need to ensure that any command doesn't
 >>    * require an sg allocation that needs more than a page of data.
 >>    */
 >> -#define NVME_MAX_KB_SZ       8192
 >> -#define NVME_MAX_SEGS        128
 >> +#define NVME_MAX_KB_SZ       4096
 >> +#define NVME_MAX_SEGS        127
 >>   #define NVME_MAX_NR_ALLOCATIONS      5
 >>
 >>   static int use_threaded_interrupts;
 >> @@ -3048,8 +3048,8 @@
 >>         * Limit the max command size to prevent iod->sg allocations going
 >>         * over a single page.
 >>         */
 >> -     dev->ctrl.max_hw_sectors = min_t(u32,
 >> -             NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 >> +     //dev->ctrl.max_hw_sectors = min_t(u32,
 >> +     //      NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 >>        dev->ctrl.max_segments = NVME_MAX_SEGS;
 >>
 >>        /*
 >
 > So basically, dev->ctrl.max_hw_sectors stays zero, so that in core.c it is set
 > to the value of nvme_mps_to_sectors(ctrl, id->mdts)  (=> 4096 in my case)
This has the same effect as setting it to `dma_max_mapping_size(...)`

 >> if (id->mdts)
 >>    max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
 >> else
 >>    max_hw_sectors = UINT_MAX;
 >> ctrl->max_hw_sectors =
 >>    min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 >
 > But that alone was not enough:
 > Tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still resulted in
 > corruptions.
 > They only went away after reverting this value back to 127 (the value from
 > kernel 6.1).

That change was introduced in 6.3-rc1 using a patch "nvme-pci: place
descriptor addresses in iod" (
https://github.com/torvalds/linux/commit/7846c1b5a5db8bb8475603069df7c7af034fd081
)

This patch has no effect for me, i.e. unmodified kernels work up to 6.3.6.

The patch that triggers the corruptions is the one introduced in 6.3.7
  which replaces `dma_max_mapping_size(...)` by
`dma_opt_mapping_size(...)`. If I apply this change to 6.1, the
corruptions also occur in that kernel.

Matthias, did you check what happens if you only modify NVME_MAX_SEGS
(and leave the `dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ <<
1, dma_opt_mapping_size(&pdev->dev) >> 9);` as it is)?

 > Additional logging to get the values of the following statements
 >> (dma_opt_mapping_size(&pdev->dev) >> 9) = 256
 >> (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!]
 >
 > @Stefan, can you check which value NVME_MAX_SEGS had in your tests?
 > It also seems to have an influence.

"128", see above.

Regards Stefan
Comment 26 Bruno Gravato 2025-01-15 13:14:37 UTC
On Wed, 15 Jan 2025 at 10:48, Stefan <linux-kernel@simg.de> wrote:
> > Stefan, did you ever try running your tests with 2 nvme disks
> > installed on both slots? Or did you use only one slot at a time?
>
> No, I only tested these configurations:
>
> 1. 1st M.2: Lexar;    2nd M.2: empty
>     (Easy to reproduce write errors)
> 2. 1st M.2: Kingsten; 2nd M.2: Lexar
>     (Difficult to reproduce read errors with 6.1 Kernel, but no issues
>     with a newer ones within several month of intense use)
>
> I'll swap the SSD's soon. Then I will also test other configurations and
> will try out a third SSD. If I get corruption with other SSD's, I will
> check which modifications help.

So it may be that the reason you no longer had errors in config 2 is
not because you put a different SSD in the 1st slot, but because you
now have the 2nd slot also occupied, like me.

If yours behaves like mine, I'd expect that if you swap the disks in
config 2, that you won't have any errors as well...
I'm very curious to see the result of that test!

Just to recap the results of my tests:

Setup 1
Main slot: Solidigm
Secondary slot: (empty)
Result: BAD - corruption happens

Setup 2
Main slot: (empty)
Secondary slot: Solidigm
Result: GOOD - no corruption

Setup 3
Main slot: WD
Secondary slot: (empty)
Result: BAD - corruption happens

Setup 4
Main slot: WD
Secondary slot: Solidigm
Result: GOOD - no corruption (on either disk)

So, in my case, it looks like the corruption only happens if I have
only 1 disk installed in the main slot and the secondary slot is
empty.
If I have the two slots occupied or only the secondary slot occupied,
there are no more errors.


Bruno
Comment 27 Stefan 2025-01-15 16:26:46 UTC
Hi,

Am 15.01.25 um 14:14 schrieb Bruno Gravato:
> If yours behaves like mine, I'd expect that if you swap the disks in
> config 2, that you won't have any errors as well...

Yeah, I would just need to plug something into the 2nd M.2 socket. But
that can't be done remotely. I will do that on the weekend or next week.

BTW, is there a kernel parameter to ignore an NVMe/PCI device? If the
corruptions appear again after disabling the 2nd SSD, it is more likely
that it is a kernel problem, e.g. a driver writing to memory reserved
for some other driver/component. Such a bug may only occur under rare
conditions. AFAIU, the patch "nvme-pci: place descriptor addresses in
iod" from 6.3-rc1 attempts to use some space which is otherwise unused.
Unfortunately I was not able to revert that patch because later changes
depend on it.

So, for now I have only tried whether just `NVME_MAX_SEGS 127` helps (see
the message from Matthias). The answer is no. This only seems to be an upper
limit, because `/sys/class/block/nvme0n1/queue/max_segments` reports 33
with unmodified kernels >= 6.3.7. With older kernels, or with kernels where
the patch "nvme-pci: clamp max_hw_sectors based on DMA optimized
limitation" (introduced in 6.3.7) is reverted, this value is 127 and the
corruptions disappear.

I guess this value somehow has to be 127. In my case it is sufficient
to revert the patch from 6.3.7. In Matthias's case, the value then
becomes 128 and has to be limited additionally using `NVME_MAX_SEGS 127`.
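
As a cross-check of the 33, here is a sketch of how I understand the segment limit to be derived. The formula is an assumption from my reading of the NVMe core code, not something established in this thread: one segment per 4 KiB controller page of the maximum transfer, plus one for an unaligned start, capped by `NVME_MAX_SEGS`. The sector counts are the ones from Matthias's logging (256 with the 6.3.7 patch, 4096 without it).

```python
NVME_CTRL_PAGE_SIZE = 4096  # assumed controller page size in bytes
SECTOR_SIZE = 512


def max_segments(max_hw_sectors, nvme_max_segs):
    """Assumed derivation of queue/max_segments from the transfer size limit."""
    sectors_per_page = NVME_CTRL_PAGE_SIZE // SECTOR_SIZE
    # One segment per page, plus one because a transfer may start mid-page.
    driver_segments = max_hw_sectors // sectors_per_page + 1
    return min(driver_segments, nvme_max_segs)


print(max_segments(256, 128))   # 6.3.7 patch applied            -> 33
print(max_segments(4096, 127))  # no 128 KiB clamp, cap of 127   -> 127
print(max_segments(4096, 128))  # no 128 KiB clamp, cap of 128   -> 128
```

If this is right, `NVME_MAX_SEGS` only matters once the transfer size limit is large enough, which fits the observation that changing it alone has no effect on an unmodified kernel >= 6.3.7.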

Regards Stefan
Comment 28 mbe 2025-01-15 23:13:27 UTC
I don't know if it helps to narrow it down, but adding the kernel parameter

nvme.io_queue_depth=2

makes the corruption disappear with an unpatched kernel (Ubuntu 6.11.0-12 in my case). Of course it is much slower with this setting.
Comment 29 Keith Busch 2025-01-16 00:52:51 UTC
Well this is a real doozy. The observation appears completely dependent on PCI slot populations, but it's somehow also dependent on a software alignment/granularity or queue depth choice? The whole part with the 2nd slot used vs. unused really indicates some kind of platform anomaly rather than a kernel bug.

I'm going to ignore the 2nd slot for a moment because I can't reconcile that with the kernel size limits. Let's just consider the kernel transfer sizing did something weird for your device, and now we introduce the queue-depth 2 observation into the picture. This now starts to sound like that O2 Micro bug where transfers that ended on page boundaries got misinterpreted by the NVMe controller. That's this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=ebefac5647968679f6ef5803e5d35a71997d20fa

Now, it may not be appropriate to just add your devices to that quirk because it only reliably works for devices with MDTS of 5 or less, and I think your devices are larger. But they might have the same bug. It'd be weird if so many vendors implemented it incorrectly, but maybe they're using the same 3rd party controller.
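
For reference, a small sketch of what an MDTS value means in bytes. MDTS is a power-of-two multiplier of the controller's minimum memory page size (CAP.MPSMIN), with 0 meaning no limit. The function name is made up, and the 4 KiB minimum page size (MPSMIN = 0) is an assumption about the devices in this report.

```python
def mdts_to_bytes(mdts, mpsmin=0):
    """Maximum data transfer size in bytes, or None if MDTS reports no limit."""
    if mdts == 0:
        return None
    min_page_size = 1 << (12 + mpsmin)  # MPSMIN = 0 means 4 KiB pages
    return min_page_size << mdts


print(mdts_to_bytes(5))  # -> 131072 (128 KiB)
print(mdts_to_bytes(9))  # -> 2097152 (2 MiB, i.e. 4096 sectors of 512 bytes)
```

Under that assumption, the 4096 sectors reported in comment #21 would correspond to an MDTS of 9.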
Comment 30 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 05:37:31 UTC
(In reply to Keith Busch from comment #29)
>
> Now, it may not be appropriate to just add your devices to that quirk
> because it only reliably works for devices with MDTS of 5 or less, and I
> think your devices are larger.

Will give that a try, but one comment:

> But they might have the same bug. It'd be
> weird if so many vendors implemented it incorrectly, but maybe they're using
> the same 3rd party controller.

That makes it sound like you suspect a problem in the NVMe devices. But isn't it way more likely that it's something in the machine? I mean we all seem to have the same one (ASRock Deskmini X600) and use NVMe devices that apparently work fine for everybody else, as they are not new and have been sold for a while. So it sounds more like that machine is doing something wrong or doing something odd that exposes a kernel bug.
Comment 31 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 09:06:27 UTC
For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS -> iommu) prevents the problem from happening.
Comment 32 Stefan 2025-01-16 09:14:09 UTC
Hi,

Am 16.01.25 um 06:37 schrieb bugzilla-daemon@kernel.org:
> --- Comment #30 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) ---
>> But they might have the same bug. It'd be weird if so many vendors
>> implemented it incorrectly, but maybe they're using the same 3rd
>> party controller.
>
> That makes it sounds like you suspect a problem in the NVMe devices.
> But isn't it way more likely that it's something in the machine? I
> mean we all seem to have the same one (ASRock Deskmini X600) and use
> NVMe devices that apparently work fine for everybody else, as they
> are not new and sold for a while. So it sounds more like that machine
> is doing something wrong or doing something odd that exposes a kernel
> bug.

Furthermore it seems that the corruptions occur with all SSDs under
certain conditions and the controllers are quite different.

One user from the c't forum wrote to me that the corruptions only occur if
the network is enabled, and that this holds for both Ethernet and
WLAN. (I asked him to report his results here.)

Maybe something (kernel, firmware or even the CPU) messes up DMA
transfers of different PCIe devices, e.g. due to a buffer overflow.

AFAIS, another thing that is in common: All CPUs used are from the 8000
series (and on this chipset-less mainboard, all PCIe devices are connected to
the CPU).

Regards Stefan
Comment 33 Mario Limonciello (AMD) 2025-01-16 14:24:52 UTC
> Well this is a real doozy. 

Are all of these reports on the exact same motherboard?  "ASRock Deskmini X600"

> One user from the c't forum wrote me, that the corruptions only occur if
> network is enabled, and that this trick works with both, Ethernet and
> WLAN. (I asked him to report his results here.)

Has anyone contacted ASRock support?  With such random results I would wonder if there is a signal integrity issue that needs to be looked at.

> For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS ->
> iommu) prevents the problem from happening.

Can others corroborate this finding?
Comment 34 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 15:32:46 UTC
(In reply to Mario Limonciello (AMD) from comment #33)
> > Well this is a real doozy. 
> Are all of these reports on the exact same motherboard?  "ASRock Deskmini
> X600"

Pretty sure that's the case.
 
> > One user from the c't forum wrote me, that the corruptions only occur if
> > network is enabled, and that this trick works with both, Ethernet and
> > WLAN. (I asked him to report his results here.)
> Has anyone contacted ASRock support?

Not that I know of.

>  With such random results I would
> wonder if there is a signal integrity issue that needs to be looked at.

FWIW, Windows apparently works fine. But I guess that might be due to some random minor details/difference or something like that.
 
> > For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS ->
> > iommu) prevents the problem from happening.
> Can others corroborate this finding?

Yeah, would be good if someone could confirm my result.
Comment 35 Stefan 2025-01-16 15:35:19 UTC
> --- Comment #33 from Mario Limonciello (AMD) ---
>> Well this is a real doozy.
>
> Are all of these reports on the exact same motherboard?  "ASRock Deskmini
> X600"

If I haven't overlooked something, all reports are from the motherboard
"AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with a
Ryzen 8000 series CPU.

>> One user from the c't forum wrote me, that the corruptions only occur if
>> network is enabled, and that this trick works with both, Ethernet and
>> WLAN. (I asked him to report his results here.)
>
> Has anyone contacted ASRock support?  With such random results I would wonder
> if there is a signal integrity issue that needs to be looked at.

Signal integrity does not depend on transfer size and is not improved by
crosstalk of a 2nd SSD. (Corruptions disappear if a 2nd SSD is installed.)

Regards Stefan
Comment 36 mbe 2025-01-16 17:12:52 UTC
I can confirm that disabling IOMMU under "Advanced\AMD CBS\NBIO Common Options"
prevents the data corruption.

System spec: ASRock Deskmini X600, AMD Ryzen 7 8700G
Comment 37 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 17:29:46 UTC
On 15.01.25 09:40, Thorsten Leemhuis wrote:
> On 15.01.25 07:37, Bruno Gravato wrote:
>> I finally got the chance to run some more tests with some interesting
>> and unexpected results...
> 
> FWIW, I briefly looked into the issue in between as well and can
> reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main
> M.2 slot of my DeskMini X600 using btrfs on a mainline kernel with a
> config from Fedora rawhide.
> 
> So what can we that are affected by the problem do to narrow it down?
> 
> What does it mean that disabling the NVMe devices's write cache often
> but apparently not always helps? It it just reducing the chance of the
> problem occurring or accidentally working around it?
> 
> hch initially brought up that swiotlb seems to be used. Are there any
> BIOS setup settings we should try? I tried a few changes yesterday, but
> I still get the "PCI-DMA: Using software bounce buffering for IO
> (SWIOTLB)" message in the log and not a single line mentioning DMAR.

FWIW, I meanwhile became aware that it is normal that there are no lines
with DMAR when it comes to AMD's IOMMU. Sorry for the noise.

But there is a new development:

I noticed earlier today that disabling the IOMMU in the BIOS Setup seems
to prevent the corruption from occurring. Another user in the bugzilla
ticket just confirmed this.

Ciao, Thorsten

> [1] see start of this thread and/or
> https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details
Comment 38 Mario Limonciello (AMD) 2025-01-16 17:33:52 UTC
> I noticed earlier today that disabling the IOMMU in the BIOS Setup seems
> to prevent the corruption from occurring.

If you can reliably reproduce this issue, can you also experiment with turning it back on in BIOS and then using:
* iommu=pt
  (which will do identity domain)
and separately
* amd_iommu=off
 (which will disable the IOMMU from Linux)
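
When reporting results for these two settings, it helps to confirm that the parameter actually reached the running kernel. A minimal sketch, assuming the command line is read from /proc/cmdline; the function name and the returned labels are made up for illustration, and only the two exact spellings above are recognised.

```python
def iommu_cmdline_mode(cmdline):
    """Classify a kernel command line by the two IOMMU settings above."""
    params = cmdline.split()
    if "amd_iommu=off" in params:
        return "off"
    if "iommu=pt" in params:
        return "passthrough"
    return "default"


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(iommu_cmdline_mode(f.read()))
```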

> If I haven't overlooked something, all reports are from the motherboard
> "AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with an
> series 8000 Ryzen.

For everyone responding with their system, it would be ideal to also share information about the AGESA version (sometimes reported in `dmidecode | grep AGESA`) as well as the ASRock BIOS version (/sys/class/dmi/id/bios_version).

> Corruptions disappear if a 2nd SSD is installed

I missed that; quite bizarre.
Comment 39 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 18:25:50 UTC
Mario, thx for looking into this.

> If you can reliably reproduce this issue

Usually within ten to twenty seconds

> iommu=pt

Apparently[1] helps. 

> amd_iommu=off

Apparently[1] helps, too

[1] I did not try for a long time, but for two or three minutes and no corruption occurred; normally one occurs on nearly every try of "f3write -e 4" and checking the result afterwards.

> it would be ideal to also share information

$ grep -s '' /sys/class/dmi/id/*
/sys/class/dmi/id/bios_date:12/05/2024
/sys/class/dmi/id/bios_release:5.35
/sys/class/dmi/id/bios_vendor:American Megatrends International, LLC.
/sys/class/dmi/id/bios_version:4.08
/sys/class/dmi/id/board_asset_tag:Default string
/sys/class/dmi/id/board_name:X600M-STX
/sys/class/dmi/id/board_vendor:ASRock
/sys/class/dmi/id/board_version:Default string
/sys/class/dmi/id/chassis_asset_tag:Default string
/sys/class/dmi/id/chassis_type:3
/sys/class/dmi/id/chassis_vendor:Default string
/sys/class/dmi/id/chassis_version:Default string
/sys/class/dmi/id/modalias:dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
/sys/class/dmi/id/product_family:Default string
/sys/class/dmi/id/product_name:X600M-STX
/sys/class/dmi/id/product_sku:Default string
/sys/class/dmi/id/product_version:Default string
/sys/class/dmi/id/sys_vendor:ASRock
/sys/class/dmi/id/uevent:MODALIAS=dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
$ sudo dmidecode | grep AGESA
	String: AGESA!V9 ComboAm5PI 1.2.0.2a