Bug 219609 - File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
Summary: File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
Status: RESOLVED DOCUMENTED
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: AMD Linux
Importance: P3 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-18 11:23 UTC by Stefan
Modified: 2025-04-04 12:14 UTC
CC List: 15 users

See Also:
Kernel Version: 6.11.5, most likely 6.5+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
attachment-4531-0.html (4.47 KB, text/html)
2025-01-10 10:41 UTC, Bruno Gravato
Details
logs.tar.bz2 (63.06 KB, application/x-bzip)
2025-01-16 21:51 UTC, Stefan
Details
dmesg from before and after the bios update (39.37 KB, application/gzip)
2025-02-19 14:21 UTC, The Linux kernel's regression tracker (Thorsten Leemhuis)
Details

Description Stefan 2024-12-18 11:23:04 UTC
Hi,

There are one or two bugs that were originally reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 . For details (logs, etc.), see there. Here I will post a summary and point out the most relevant observations:

Bug 1: Write errors with Lexar NM790 NVME

* Occur since Debian kernel 6.5, but reproduced with upstream kernel 6.11.5 (the only upstream kernel I tested)
* Only occur in the 1st M.2 socket (not in the 2nd one on the rear side)
* Easiest way to reproduce them is to use f3 ( https://fight-flash-fraud.readthedocs.io/en/latest/usage.html ); f3 reports overwritten sectors (see the example after this list)
* The errors seem not to occur in the last files of 500-file (= 500 GB) test runs, and I never detected file system corruption, just defective files (I produced probably more than a thousand of them). The reason for the latter observation is maybe that file system metadata is written last. (See message 113 in the Debian bug report)
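
A typical f3 run looks like this (a sketch, not necessarily the exact invocation; the mount point and file count are examples):

   $ f3write --end-at=500 /mnt/test   # writes 500 x 1 GB test files to the mounted SSD
   $ f3read /mnt/test                 # reads them back; overwritten/corrupted sectors are reported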

(Possible) Bug 2: Read errors with Kingston FURY Renegade

* Only occur in the 1st M.2 socket (did not test the rear socket, because the warranty seal would have to be broken in order to remove the heat sink)
* Almost impossible to reproduce; I only detected it with the Debian kernel based on 6.1.112
* 1st occurrence: detected during an SSD-intensive computation (SSD used as data cache) which produced wrong results after a few days (but not in the first days). The error could be reproduced with f3: the corruptions were massive and different files were affected in subsequent f3read runs (==> read errors). Unfortunately I did not store the f3 logs. (I still have the corrupt computation results, so it was real.)
* 2nd occurrence: a single defective sector (read error) in a multi-day attempt to reproduce the error with the same kernel (Debian 6.1.112), see message 113 in the Debian bug report

Considerations / Notes:
* These serial links (PCIe) need to be calibrated. Calibration issues would explain why the errors (dis)appear under certain conditions. But errors like this should be detected (nothing could be found in the kernel logs). Is the error correction possibly inactive? However, this still does not explain why f3 reports overwritten sectors, unless the signal errors occur during command / address transmission.
* Testing is difficult, because the machine is installed remotely and in use. At the moment, until about the end of January, I can run tests for Bug 1.
* On the AsRock X600M-STX mainboard (without chipset), the CPU (Ryzen 8700G) runs in SoC (system on chip) mode. Maybe someone did not test this properly ...

Regards Stefan
Comment 1 Keith Busch 2024-12-18 15:23:50 UTC
You mention the observation has occurred since kernel 6.5. Are you saying that this used to work in older kernels?
Comment 2 Stefan 2024-12-18 17:41:17 UTC
Bug 1: The oldest non-working Debian kernel is 6.3.7 (package linux-image-6.3.0-1-amd64); Debian kernel 6.3.5 (latest version of package linux-image-6.3.0-0-amd64) works. (I'm assuming it's not Debian-specific, because the error also occurs in an upstream kernel (6.11.5).)

If you have patches, I could compile one of these versions and then try out the patches.

(Possible) Bug 2: Occurred with 6.1 kernels, but is very difficult to reproduce. So I'm not sure whether this error is limited to that kernel version.

Because I cannot test both bugs at the same time (the bugs occur only in the 1st M.2 socket and the PC is remote), we should first focus on Bug 1. If that bug is fixed, I would run a long-term test with the fixed kernel. (Because these are read errors, this can be done by a checksum test of existing files in the background, e.g. as sketched below.)
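
A minimal sketch of such a background checksum test (paths are examples):

   $ find /data -type f -exec sha256sum {} + > /var/tmp/sums.txt   # record checksums once, after writing
   $ nice sha256sum --quiet -c /var/tmp/sums.txt                   # re-verify later; only mismatches are printed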
Comment 3 Bruno Gravato 2025-01-02 17:16:05 UTC
I have the same barebone (ASRock Deskmini X600) with Ryzen 8600G CPU.

I've run into similar issues.

In my case I'm using btrfs on a Solidigm P44 Pro M.2 NVMe 1TB disk. After copying a large number of files (150K-300K files of variable sizes) to the btrfs partition and running btrfs scrub on it, scrub reports checksum errors on some files.
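
For reference, the check boils down to this (a sketch; the mount point is an example):

   $ sudo btrfs scrub start -B /mnt          # -B waits for the scrub to finish and prints statistics
   $ sudo dmesg | grep -i 'checksum error'   # affected files are named in the kernel log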

If I put the disk in the secondary M.2 slot in the back this problem does not occur.

RAM is 2x16GB Kingston Fury Impact DDR5 6400 SODIMM, but I've also tried a Crucial DDR5 5600 SODIMM with the same results. I ran a single memory stick, dual, different speeds, etc... all with the same result. RAM does not seem to be the problem.

I also had the same results with a WD SN750 500GB NVMe disk.

I've tried both disks (running the same installation) on a different machine (Deskmini X300) and got no errors.

Only a few files get corrupted. On my last test, copying nearly 400K files, only 22 got corrupted.

I mounted the btrfs partition with rescue=all and was able to read the corrupted files. I compared a few to the originals and it looks like a big chunk of data in the middle of each file was altered. So it's not just a bit flip here and there... it's a big portion of the file that gets messed up (in contiguous blocks).
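
For reference, that mount looks like this (a sketch; the device name is an example; rescue=all requires a read-only mount):

   $ sudo mount -o ro,rescue=all /dev/nvme0n1p2 /mnt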

System is running Debian stable with some packages from backports, namely the kernel. I got the same results with kernels 6.10.5 and 6.11.10 (from bookworm-backports) and 6.12.6 (from testing).

Also got the same results with BIOS firmware 4.03 and 4.08 (downloaded from the ASRock website).

I tried different sources for the files: copying over LAN using either rsync over ssh or a restic backup restore, but also from a locally installed SATA SSD with the same files. The same files copied to the SATA disk (also btrfs) do not get corrupted.

Using the secondary M.2 slot (gen4x4) also seems to be free of errors. It only happens when the disk is in the main M.2 slot (gen5x4).

I thought this could be a faulty M.2 slot on my board, but after seeing other reports of a similar problem, I'm now more convinced that this may be either a BIOS firmware issue, a kernel issue, or a combination of both.

Anyway I thought I'd add my report here hoping it can help.

I can run some more tests if needed.

In terms of reproducibility, I can reproduce this fairly consistently given a large enough sample of files (my "sample" is my personal files from my home dir on my older PC, which are over 700K files). Copying 150K-300K files (20-60GB of data) is usually enough to cause checksum errors on some files when running btrfs scrub (it seems to always be different files). With the disk in the secondary M.2 slot I copied all 700K+ files (twice, I think) with no errors.

I haven't tried older kernel versions. I can try 6.1.x from Debian stable, but I think it has issues with the amdgpu driver and can eventually freeze the system with an amdgpu error, so it may not be very reliable for testing.

Let me know if you have any questions and I'll try to answer.
Comment 4 Stefan 2025-01-03 14:15:17 UTC
With the help of TJ from the Debian kernel team ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372 ), at least a workaround could be found.

The bug is triggered by the patch "nvme-pci: clamp max_hw_sectors based on DMA optimized limitation" (see https://lore.kernel.org/linux-iommu/20230503161759.GA1614@lst.de/ ), introduced in 6.3.7.

To examine the situation, I added this debug info (all files are located in `drivers/nvme/host`):

> --- core.c.orig       2025-01-03 14:27:38.220428482 +0100
> +++ core.c    2025-01-03 12:56:34.503259774 +0100
> @@ -3306,6 +3306,7 @@
>               max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
>       else
>               max_hw_sectors = UINT_MAX;
> +     dev_warn(ctrl->device, "id->mdts=%d,  max_hw_sectors=%d, ctrl->max_hw_sectors=%d\n", id->mdts, max_hw_sectors, ctrl->max_hw_sectors);
>       ctrl->max_hw_sectors =
>               min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

6.3.6 (last version w/o mentioned patch and w/o data corruption) says:

> [  127.196212] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, ctrl->max_hw_sectors=16384
> [  127.203530] nvme nvme0: allocated 40 MiB host memory buffer.

6.3.7 (first version w/ mentioned patch and w/ data corruption) says:

> [   46.436384] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, ctrl->max_hw_sectors=256
> [   46.443562] nvme nvme0: allocated 40 MiB host memory buffer.

After I reverted the mentioned patch (

> --- pci.c.orig        2025-01-03 14:28:05.944819822 +0100
> +++ pci.c     2025-01-03 12:54:37.014579093 +0100
> @@ -3042,7 +3042,8 @@
>        * over a single page.
>        */
>       dev->ctrl.max_hw_sectors = min_t(u32,
> -             NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +//           NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +             NVME_MAX_KB_SZ << 1, dma_max_mapping_size(&pdev->dev) >> 9);
>       dev->ctrl.max_segments = NVME_MAX_SEGS;
>  
>       /*

), 6.11.5 works (I used this version because the sources were lying around) and says:

> [    1.251370] nvme nvme0: id->mdts=7,  max_hw_sectors=1024, ctrl->max_hw_sectors=16384
> [    1.261168] nvme nvme0: allocated 40 MiB host memory buffer.

Thus, the corruption occurs if `ctrl->max_hw_sectors` is set to a different (smaller) value than the one defined by `id->mdts`.

If this is supposed to be allowed, the mentioned patch is not the (root) cause, but reverting it is at least a workaround.
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-08 14:42:28 UTC
I forwarded the problem by mail[1]
https://lore.kernel.org/all/401f2c46-0bc3-4e7f-b549-f868dc1834c5@leemhuis.info/

Bruno, Stefan, can we CC you on further mails regarding this? This would expose your email address to the public.

[1] reminder, bugzilla.kernel.org is usually a bad place to report bugs, as mentioned on https://docs.kernel.org/admin-guide/reporting-issues.html
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-08 14:44:02 UTC
Ohh, and did anyone check if mainline is still affected?
Comment 7 Keith Busch 2025-01-08 15:19:41 UTC
Even with the patch reverted, the host can still send IO that aligns to the smaller size limits anyway, so it sounds like the patch that's been bisected to may have merely exposed an NVMe controller bug.
Comment 8 Bruno Gravato 2025-01-08 15:31:22 UTC
Hi,

Yes you can CC me.

I didn't try the patch mentioned above.

This is my (new) daily driver and I needed to get the machine up and running as quickly as possible. I went with the workaround of putting the disk in the secondary M.2 slot (gen4 vs gen5 on the main slot). No problems so far.

The latest kernel I tried was 6.12.6 and it still had the problem.

I should be able to put my old disk (WD Black SN750) on the main slot and run some more tests with the mainline kernel when I get the chance.
Comment 9 Keith Busch 2025-01-08 15:35:19 UTC
Are all these reports using the same model nvme controller? Or is this happening across a variety of vendors?
Comment 10 Stefan 2025-01-08 17:25:38 UTC
My email address "linux-kernel@simg.de" can be CC'd publicly. But it is an alias, i.e. I cannot reply directly from it. That's why I prefer the bug tracker.

According to a forum of the German IT magazine c't, the bug has also been encountered by several other people: https://www.heise.de/forum/c-t/Wuensch-dir-mal-wieder-was/X600-btrfs-scrub-uncorrectable-errors/thread-7689447 . (That hardware was recommended by the magazine.) Furthermore, it seems that the errors do not occur with all SSDs. I'm trying to figure out whether this has something to do with the MDTS setting (which can be queried using the `nvme id-ctrl` command, see below).
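
For reference (a sketch; the device name is an example and the exact output depends on the nvme-cli version):

   $ sudo nvme id-ctrl /dev/nvme0 | grep -i mdts
   mdts      : 7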

The problem also occurs in 6.13.0-rc6 (unless I revert the patch introduced in 6.3.7).

Just a few thoughts (I'm not an NVMe or kernel developer): I would not expect that reducing the MDTS (= max data transfer size) limit, which is what the patch does, should cause such errors. The only explanation is that some component still assumes that transfers up to the amount reported by MDTS (a setting of the SSD) can be used.

If that assumption is valid (the NVMe specs should answer this question), the patch is responsible for the problems.

Otherwise, the root cause is the component that does not take the reduced limit into account. 

While the 6.13 kernel was compiling, I searched the kernel sources for the term "mdts". It seems that this setting is only used to initialize `max_hw_sectors` of the `nvme_ctrl` struct. If that is correct, the other component that causes the problem is probably some kind of firmware.
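
For reference, the search was roughly this, in case someone wants to repeat it:

   $ git grep -in mdts drivers/nvme/host/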
Comment 11 mbe 2025-01-08 21:29:22 UTC
Hi,

I can also reliably reproduce the data corruption with the following setup:

Deskmini X600
AMD Ryzen7 8700G
2x 16 GB Kingston-FURY KF564S38IBK2-32
Samsung 990 Pro 2 TB NVMe SSD, latest firmware 4B2QJXD7, installed on primary nvme slot
Filesystem: ext4
OS: Ubuntu 24.10 with kernel 6.11.0-13.14

When copying ~60 GB of data to the NVMe, some files always get corrupted.
A diff between the source and the copied files shows that contiguous chunks of < 3 MB in the middle of the files are either filled with zeros or garbage data.

Also affected: Ubuntu 24.04 with kernel 6.8.0.
Not affected: Debian 12 with kernel 6.1.119-1

The bad news:
Applying the patch from comment #4 (using dma_max_mapping_size() instead of dma_opt_mapping_size() to set max_hw_sectors) to kernel 6.11.0-13.14 did not solve the problem in my case; the data corruption still occurs.

6.11.0-13.14 with patch and corruption:
>[    1.429438] nvme nvme0: pci function 0000:02:00.0
>[    1.433783] nvme nvme0: id->mdts=9,  max_hw_sectors=4096, ctrl->max_hw_sectors=16384
>[    1.433787] nvme nvme0: D3 entry latency set to 10 seconds
>[    1.438308] nvme nvme0: 16/0/0 default/read/poll queues
Comment 12 Stefan 2025-01-08 23:45:59 UTC
Because it might be a firmware issue, I updated the BIOS/UEFI and installed the latest firmware blobs (version 20241210 from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ ): no success. Furthermore, I found a setting where the PCIe speed can be reduced. Changing this value to Gen 3 had no effect.

> The bad news:
> Applying the patch from comment #4 (using dma_max_mapping_size() instead of 
> dma_opt_mapping_size() to set max_hw_sectors)
> to kernel 6.11.0-13.14 did not solve the problem in my case, the data
> corruption still occurs.

Strange, especially because 6.1 is working.

You might try replacing `dma_max_mapping_size(&pdev->dev) >> 9` with `min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9, 1024)`. This will limit max_hw_sectors to 1024 sectors, the value which works for me.

I just backported the patch from 6.3.7 to 6.1.112. The corruption now also occurs in that kernel. So for me, the problem is connected to the patch.
Comment 13 Keith Busch 2025-01-09 00:09:36 UTC
If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston, and now Samsung NVMe's? Unless they're all using the same 3rd-party controller, like Phison or SMI, I guess we'd have trouble saying it's a vendor problem. Or perhaps we're now mixing multiple problems at this point, considering one patch fixes some but not others.

Do these drives have volatile write caches? You can check with 

 # cat /sys/block/nvme0n1/queue/fua

A non-zero value means "yes". Replace "nvme0n1" with whatever your device is named, like nvme1n1, nvme2n1, etc...

Is ext4 used in the other observations too? If not, what other filesystems are used?
Comment 14 Bruno Gravato 2025-01-09 03:09:49 UTC
(In reply to Keith Busch from comment #13)
> If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
> and now Samsung NVMe's? 

In my case it was Solidigm P44 Pro 1TB and WD Black SN750 500GB

> Do these drives have volatile write caches? You can check with 
> 
>  # cat /sys/block/nvme0n1/queue/fua
> 

I get 1, so yes.

> Is ext4 used in the other observations too? If not, what other filesystems
> are used?

In my case I was using btrfs. Running btrfs scrub gave me some checksum errors and that's how I found out files were getting corrupted... If I had been on ext4 it could have taken months for me to find out...

The somewhat odd thing is that the same disks on the secondary M.2 nvme slot work fine with no error.

The only difference in the specs between the two M.2 slots is that one is gen5x4 (the main one, which is the one with problems) and the other is gen4x4 (this works fine, no errors).
Comment 15 Keith Busch 2025-01-09 03:47:53 UTC
As a test, could you turn off the volatile write cache?

  # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0

Your write performance may be pretty bad, but it's just a temporary test to see if the problem still occurs without a volatile cache. A power cycle reverts the setting back to the default state.
Comment 16 Keith Busch 2025-01-09 03:52:35 UTC
Sorry, depending on the nvme version, the value parameter may be "-V" (capital "V").
Comment 17 Stefan 2025-01-09 15:44:24 UTC
Hi,

due to Thorsten's hints, I'm trying to reply to both the bug tracker and
the mailing list.

> --- Comment #13 from Keith Busch (kbusch@kernel.org) ---
> If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
> and now Samsung NVMe's?

The Kingston read errors may be something different. They are described
in detail in messages #108 and #113 of the Debian Bug Tracker
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372

With the Kingston, I never saw the write errors that occur with the Lexar
and Samsung on newer kernels (and which are easy to reproduce).

(ATM I cannot provide test results from the Kingston SSD because the
Lexar is installed, and the PC is installed remotely and in use. Thus I
can't swap the SSDs that often.)

> # cat /sys/block/nvme0n1/queue/fua

Returns "1"

> --- Comment #15 from Keith Busch (kbusch@kernel.org) --- as a test,
> could you turn off the volatile write cache?
>
> # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0
Had to modify that a little bit:

   $ nvme get-feature /dev/nvme0n1 -f 6
   get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
   $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0
   set-feature:0x06 (Volatile Write Cache), value:00000000,
cdw12:00000000, save:0
   $ nvme get-feature /dev/nvme0n1 -f 6
   get-feature:0x06 (Volatile Write Cache), Current value:00000000

Corruptions disappear (under 6.13.0-rc6) if the volatile write cache is
disabled (and appear again if I turn it on with "-v 1").

But, lspci says I have a

   Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD
(DRAM-less) (rev 01) (prog-if 02 [NVM Express])

Note the "DRAM-less". This is confirmed by
https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of
this, the SSD has a (*non-*volatile) SLC write cache and it uses 40 MB
Host-Memory-Buffer (HMB).

Might there be an issue with the HMB allocation/usage?

Is the mainboard firmware involved in HMB allocation/usage? That
would explain why volatile write caching via HMB works in the 2nd M.2
socket.

BTW, the controller is a MaxioTech MAP1602A, which is different from the
Samsung controllers.

> --- Comment #14 from Bruno Gravato (bgravato@gmail.com) --- The only
>  difference in the specs between the two M.2 slots is that one is
> gen5x4 (the main one, which is the one with problems) and the other
> is gen4x4 (this works fine, no errors).

AFAIK the primary M.2 socket is connected to dedicated PCIe lanes of
the CPU. On my PC, it runs in Gen4 mode (limited by the SSD).

The secondary M.2 socket on the rear side is probably connected to PCIe
lanes which would usually go to a chipset -- but that socket works.

Regards Stefan
Comment 18 mbe 2025-01-09 23:38:25 UTC
Hi,

lspci says:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal]

It uses volatile write cache:
> cat /sys/block/nvme0n1/queue/fua 
> 1

Test 1:
Disabling volatile write cache via nvme-cli 
=> no corruption occurs

Test 2:
volatile write cache enabled, using the suggestion from comment #12
> dev->ctrl.max_hw_sectors = min_t(u32,
> NVME_MAX_KB_SZ << 1, min_t(u32, dma_max_mapping_size(&pdev->dev) >> 9,
> 1024));

=> corruption still occurs

> [    0.815340] nvme nvme0: id->mdts=9,  max_hw_sectors=4096,
> ctrl->max_hw_sectors=1024
Comment 19 Bruno Gravato 2025-01-10 10:41:13 UTC
Created attachment 307463 [details]
attachment-4531-0.html
Comment 20 Bruno Gravato 2025-01-10 11:17:59 UTC
Hi,

(resending in text-only mode, because mailing lists don't like HTML
emails... sorry to those getting this twice)

I can reply via email, that's not a problem.

I'll try to run some more tests when I get the chance (it's been a
very busy week, sorry).
Besides the volatile write cache test, any other test I should try?

Regarding the M.2 slots. I believe this motherboard has no chipset. So
both slots should be connected directly to the CPU (mine is Ryzen
8600G), although they might be connecting to different parts of the
CPU, right? I guess that can make a difference.

My disks are gen4 as well.

Bruno
Comment 21 mbe 2025-01-13 21:01:51 UTC
Hi,

I did some more tests. First, I retrieved the following values under Debian:
> Debian 12, Kernel 6.1.119, no corruption
> cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb 
> 2048
>
> cat /sys/class/block/nvme0n1/queue/max_sectors_kb 
> 1280
>
> cat /sys/class/block/nvme0n1/queue/max_segments
> 127
>
> cat /sys/class/block/nvme0n1/queue/max_segment_size 
> 4294967295

To achieve the same values on kernel 6.11.0-13, I had to make the following changes to drivers/nvme/host/pci.c:

> --- pci.c.org 2024-09-15 16:57:56.000000000 +0200
> +++ pci.c     2025-01-13 21:18:54.475903619 +0100
> @@ -41,8 +41,8 @@
>   * These can be higher, but we need to ensure that any command doesn't
>   * require an sg allocation that needs more than a page of data.
>   */
> -#define NVME_MAX_KB_SZ       8192
> -#define NVME_MAX_SEGS        128
> +#define NVME_MAX_KB_SZ       4096
> +#define NVME_MAX_SEGS        127
>  #define NVME_MAX_NR_ALLOCATIONS      5
> 
>  static int use_threaded_interrupts;
> @@ -3048,8 +3048,8 @@
>        * Limit the max command size to prevent iod->sg allocations going
>        * over a single page.
>        */
> -     dev->ctrl.max_hw_sectors = min_t(u32,
> -             NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> +     //dev->ctrl.max_hw_sectors = min_t(u32,
> +     //      NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
>       dev->ctrl.max_segments = NVME_MAX_SEGS;
>  
>       /*

So basically, dev->ctrl.max_hw_sectors stays zero, so that in core.c it is set
to the value of nvme_mps_to_sectors(ctrl, id->mdts)  (=> 4096 in my case):

> if (id->mdts)
>   max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
> else
>   max_hw_sectors = UINT_MAX;
> ctrl->max_hw_sectors =
>   min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

But that alone was not enough: tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still resulted in corruptions.
They only went away after reverting this value back to 127 (the value from kernel 6.1).

Additional logging gave the values of the following expressions:
> (dma_opt_mapping_size(&pdev->dev) >> 9) = 256
> (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!]

@Stefan, can you check which value NVME_MAX_SEGS had in your tests?
It also seems to have an influence.

Best regards, Matthias
Comment 22 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-13 21:14:55 UTC
(In reply to mbe from comment #21)

> To achieve the same values on Kernel 6.11.0-13, 

Please clarify: which upstream kernel does that distro-specific version number refer to? And is that kernel vanilla or close to upstream? And why use an EOL series anyway? It's best to use a fresh mainline kernel for all testing, except when data from older kernels is required.
Comment 23 Bruno Gravato 2025-01-15 06:38:02 UTC
I finally got the chance to run some more tests with some interesting
and unexpected results...

I put another disk (WD Black SN750) in the main M.2 slot (the
problematic one), but kept my main disk (Solidigm P44 Pro) in the
secondary M.2 slot (where it doesn't have any issues).
I reran my test: step 1) copy a large number of files to the WD disk
(main slot); step 2) run btrfs scrub on it and expect some checksum
errors.
To my surprise there were no errors!
I tried it twice with different kernels (6.2.6 and 6.11.5) and booting
from either disk (I have linux installations on both).
Still no errors.

I then removed the Solidigm disk from the secondary and kept the WD
disk in the main M.2 slot.
Reran my tests (on kernel 6.11.5) and bang! btrfs scrub now detected
quite a few checksum errors!

I then tried disabling volatile write cache with "nvme set-feature
/dev/nvme0 -f 6 -v 0"
"nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but
/sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to
turn into 0?

I reran my test, but I still got checksum errors on btrfs scrub. So
disabling volatile write cache (assuming I did it correctly) didn't
make a difference in my case.

I put the Solidigm disk back into the secondary slot, booted, and reran
the test on the WD disk (main slot) just to be triple sure and still
no errors.

So it looks like the corruption only happens if only the main M.2 slot
is occupied and the secondary M.2 slot is free.
With two nvme disks (one on each M.2 slot), there were no errors at all.

Stefan, did you ever try running your tests with 2 nvme disks
installed on both slots? Or did you use only one slot at a time?


Bruno
Comment 24 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-15 08:40:19 UTC
On 15.01.25 07:37, Bruno Gravato wrote:
> I finally got the chance to run some more tests with some interesting
> and unexpected results...

FWIW, I briefly looked into the issue in between as well and can
reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main
M.2 slot of my DeskMini X600 using btrfs on a mainline kernel with a
config from Fedora rawhide.

So what can we that are affected by the problem do to narrow it down?

What does it mean that disabling the NVMe devices's write cache often
but apparently not always helps? It it just reducing the chance of the
problem occurring or accidentally working around it?

hch initially brought up that swiotlb seems to be used. Are there any
BIOS setup settings we should try? I tried a few changes yesterday, but
I still get the "PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)" message in the log and not a single line mentioning DMAR.

Ciao, Thorsten

[1] see start of this thread and/or
https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details

$ journalctl -k | grep -i -e DMAR -e IOMMU -e AMD-Vi -e SWIOTLB
AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
pci 0000:00:01.0: Adding to iommu group 0
pci 0000:00:01.3: Adding to iommu group 1
pci 0000:00:02.0: Adding to iommu group 2
pci 0000:00:02.3: Adding to iommu group 3
pci 0000:00:03.0: Adding to iommu group 4
pci 0000:00:04.0: Adding to iommu group 5
pci 0000:00:08.0: Adding to iommu group 6
pci 0000:00:08.1: Adding to iommu group 7
pci 0000:00:08.2: Adding to iommu group 8
pci 0000:00:08.3: Adding to iommu group 9
pci 0000:00:14.0: Adding to iommu group 10
pci 0000:00:14.3: Adding to iommu group 10
pci 0000:00:18.0: Adding to iommu group 11
pci 0000:00:18.1: Adding to iommu group 11
pci 0000:00:18.2: Adding to iommu group 11
pci 0000:00:18.3: Adding to iommu group 11
pci 0000:00:18.4: Adding to iommu group 11
pci 0000:00:18.5: Adding to iommu group 11
pci 0000:00:18.6: Adding to iommu group 11
pci 0000:00:18.7: Adding to iommu group 11
pci 0000:01:00.0: Adding to iommu group 12
pci 0000:02:00.0: Adding to iommu group 13
pci 0000:03:00.0: Adding to iommu group 14
pci 0000:03:00.1: Adding to iommu group 15
pci 0000:03:00.2: Adding to iommu group 16
pci 0000:03:00.3: Adding to iommu group 17
pci 0000:03:00.4: Adding to iommu group 18
pci 0000:03:00.6: Adding to iommu group 19
pci 0000:04:00.0: Adding to iommu group 20
pci 0000:04:00.1: Adding to iommu group 21
pci 0000:05:00.0: Adding to iommu group 22
AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA
PC GA_vAPIC
AMD-Vi: Interrupt remapping enabled
AMD-Vi: Virtual APIC enabled
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
Comment 25 Stefan 2025-01-15 10:47:51 UTC
Hi,

(replying to both, the mailing list and the kernel bug tracker)

Am 15.01.25 um 07:37 schrieb Bruno Gravato:
> I then removed the Solidigm disk from the secondary and kept the WD
> disk in the main M.2 slot. Rerun my tests (on kernel 6.11.5) and
> bang! btrfs scrub now detected quite a few checksum errors!
>
> I then tried disabling volatile write cache with "nvme set-feature
> /dev/nvme0 -f 6 -v 0" "nvme get-feature /dev/nvme0 -f 6" confirmed it
> was disabled, but /sys/block/nvme0n1/queue/fua still showed 1... Was
> that supposed to turn into 0?

You can check this using `nvme get-feature /dev/nvme0n1 -f 6`

> So it looks like the corruption only happens if only the main M.2
> slot is occupied and the secondary M.2 slot is free. With two nvme
> disks (one on each M.2 slot), there were no errors at all.
>
> Stefan, did you ever try running your tests with 2 nvme disks
> installed on both slots? Or did you use only one slot at a time?

No, I only tested these configurations:

1. 1st M.2: Lexar;    2nd M.2: empty
    (Easy to reproduce write errors)
2. 1st M.2: Kingston; 2nd M.2: Lexar
    (Difficult-to-reproduce read errors with the 6.1 kernel, but no issues
    with newer ones within several months of intense use)

I'll swap the SSDs soon. Then I will also test other configurations and
will try out a third SSD. If I get corruption with other SSDs, I will
check which modifications help.

Note that I need both SSDs (configuration 2) in about one week and
cannot change this for about 3 months (I already announced this in December).

Thus, if there are things I shall test with configuration 1, please
inform me quickly.

Just as a reminder (for those who did not read the two bug trackers):
I tested with `f3` (a utility used to detect scam disks) on ext4.
`f3` reports overwritten sectors. In configuration 1 these are write
errors (they appear if I read again).

(If no other SSD-intense jobs are running,) the corruptions do not occur
in the last files, and I never noticed file system corruption; only
file contents are corrupt. (This is probably luck, but also has something
to do with the journal and the time when file system metadata is
written.)


Am 13.01.25 um 22:01 schrieb mbe:
 > I did some more tests. At first I retrieved the following values under
 > Debian [...]
 >
 > To achieve the same values on Kernel 6.11.0-13, I had to make the
 > following changes to drivers/nvme/host/pci.c [...]
 >
 > So basically, dev->ctrl.max_hw_sectors stays zero, so that in core.c it
 > is set to the value of nvme_mps_to_sectors(ctrl, id->mdts) (=> 4096 in
 > my case)

This has the same effect as setting it to `dma_max_mapping_size(...)`.

 > But that alone was not enough:
 > Tests with ctrl->max_hw_sectors=4096 and NVME_MAX_SEGS = 128 still
 > resulted in corruptions.
 > They only went away after reverting this value back to 127 (the value
 > from kernel 6.1).

That change was introduced in 6.3-rc1 by the patch "nvme-pci: place
descriptor addresses in iod" (
https://github.com/torvalds/linux/commit/7846c1b5a5db8bb8475603069df7c7af034fd081
).

This patch has no effect for me, i.e. unmodified kernels work up to 6.3.6.

The patch that triggers the corruptions is the one introduced in 6.3.7,
which replaces `dma_max_mapping_size(...)` with
`dma_opt_mapping_size(...)`. If I apply this change to 6.1, the
corruptions also occur in that kernel.

Matthias, did you check what happens if you only modify NVME_MAX_SEGS
(and leave the `dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ <<
1, dma_opt_mapping_size(&pdev->dev) >> 9);` line unchanged)?

 > Additional logging to get the values of the following statements
 >> (dma_opt_mapping_size(&pdev->dev) >> 9) = 256
 >> (dma_max_mapping_size(&pdev->dev) >> 9) = 36028797018963967 [sic!]
 >
 > @Stefan, can you check which value NVME_MAX_SEGS had in your tests?
 > It also seems to have an influence.

"128", see above.

Regards Stefan
Comment 26 Bruno Gravato 2025-01-15 13:14:37 UTC
On Wed, 15 Jan 2025 at 10:48, Stefan <linux-kernel@simg.de> wrote:
> > Stefan, did you ever try running your tests with 2 nvme disks
> > installed on both slots? Or did you use only one slot at a time?
>
> No, I only tested these configurations:
>
> 1. 1st M.2: Lexar;    2nd M.2: empty
>     (Easy to reproduce write errors)
> 2. 1st M.2: Kingsten; 2nd M.2: Lexar
>     (Difficult to reproduce read errors with 6.1 Kernel, but no issues
>     with a newer ones within several month of intense use)
>
> I'll swap the SSD's soon. Then I will also test other configurations and
> will try out a third SSD. If I get corruption with other SSD's, I will
> check which modifications help.

So it may be that the reason you no longer had errors in config 2 is
not that you put a different SSD in the 1st slot, but that you now
have the 2nd slot also occupied, like me.

If yours behaves like mine, I'd expect that if you swap the disks in
config 2, you won't have any errors either...
I'm very curious to see the result of that test!

Just to recap the results of my tests:

Setup 1
Main slot: Solidigm
Secondary slot: (empty)
Result: BAD - corruption happens

Setup 2
Main slot: (empty)
Secondary slot: Solidigm
Result: GOOD - no corruption

Setup 3
Main slot: WD
Secondary slot: (empty)
Result: BAD - corruption happens

Setup 4
Main slot: WD
Secondary slot: Solidigm
Result: GOOD - no corruption (on either disk)

So, in my case, it looks like the corruption only happens if I have
only 1 disk installed in the main slot and the secondary slot is
empty.
If both slots are occupied, or only the secondary slot is occupied,
there are no errors.


Bruno
Comment 27 Stefan 2025-01-15 16:26:46 UTC
Hi,

Am 15.01.25 um 14:14 schrieb Bruno Gravato:
> If yours behaves like mine, I'd expect that if you swap the disks in
> config 2, that you won't have any errors as well...

Yeah, I would just need to plug something into the 2nd M.2 socket. But
that can't be done remotely. I will do that at the weekend or next week.

BTW, is there a kernel parameter to ignore an NVMe/PCI device? If the
corruptions appear again after disabling the 2nd SSD, it is more likely
that it is a kernel problem, e.g. a driver writing to memory reserved
for some other driver/component. Such a bug may only occur under rare
conditions. AFAIU, the patch "nvme-pci: place descriptor addresses in
iod" from 6.3-rc1 attempts to use some space which is otherwise unused.
Unfortunately, I was not able to revert that patch because later changes
depend on it.

So, for now I only tried out whether just `NVME_MAX_SEGS 127` helps (see
the message from Matthias). The answer is no. This only seems to be an
upper limit, because `/sys/class/block/nvme0n1/queue/max_segments`
reports 33 with unmodified kernels >= 6.3.7. With older kernels, or
kernels with the patch "nvme-pci: clamp max_hw_sectors based on DMA
optimized limitation" (introduced in 6.3.7) reverted, this value is 127
and the corruptions disappear.

I guess this value somehow has to be 127. In my case it is sufficient
to revert the patch from 6.3.7. In Matthias's case, the value then
becomes 128 and has to be limited additionally using `NVME_MAX_SEGS 127`.

Regards Stefan
Comment 28 mbe 2025-01-15 23:13:27 UTC
I don't know if it helps to narrow it down, but adding the kernel parameter

nvme.io_queue_depth=2

makes the corruption disappear with an unpatched kernel (Ubuntu 6.11.0-12 in my case). Of course it is much slower with this setting.
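
For reference, on a Debian/Ubuntu system with GRUB such a parameter can be tested like this (a sketch):

   $ sudoedit /etc/default/grub      # append nvme.io_queue_depth=2 to GRUB_CMDLINE_LINUX_DEFAULT
   $ sudo update-grub && sudo reboot
   $ cat /proc/cmdline               # verify the parameter is active after reboot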
Comment 29 Keith Busch 2025-01-16 00:52:51 UTC
Well, this is a real doozy. The observation appears completely dependent on PCI slot population, but it's somehow also dependent on a software alignment/granularity or queue depth choice? The whole part with the 2nd slot used vs. unused really indicates some kind of platform anomaly rather than a kernel bug.

I'm going to ignore the 2nd slot for a moment because I can't reconcile that with the kernel size limits. Let's just consider that the kernel transfer sizing did something weird for your device, and now we introduce the queue-depth-2 observation into the picture. This now starts to sound like that O2 Micro bug where transfers that ended on page boundaries got misinterpreted by the NVMe controller. That's this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=ebefac5647968679f6ef5803e5d35a71997d20fa

Now, it may not be appropriate to just add your devices to that quirk, because it only reliably works for devices with an MDTS of 5 or less, and I think your devices are larger. But they might have the same bug. It'd be weird if so many vendors implemented it incorrectly, but maybe they're using the same 3rd-party controller.
Comment 30 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 05:37:31 UTC
(In reply to Keith Busch from comment #29)
>
> Now, it may not be appropriate to just add your devices to that quirk
> because it only reliably works for devices with MDTS of 5 or less, and I
> think your devices are larger.

Will give that a try, but one comment:

> But they might have the same bug. It'd be
> weird if so many vendors implemented it incorrectly, but maybe they're using
> the same 3rd party controller.

That makes it sound like you suspect a problem in the NVMe devices. But isn't it far more likely that it's something in the machine? I mean, we all seem to have the same one (ASRock Deskmini X600) and use NVMe devices that apparently work fine for everybody else, as they are not new and have been sold for a while. So it sounds more like this machine is doing something wrong, or doing something odd that exposes a kernel bug.
Comment 31 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 09:06:27 UTC
For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS -> iommu) prevents the problem from happening.
Comment 32 Stefan 2025-01-16 09:14:09 UTC
Hi,

Am 16.01.25 um 06:37 schrieb bugzilla-daemon@kernel.org:
> --- Comment #30 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) ---
>> But they might have the same bug. It'd be weird if so many vendors
>> implemented it incorrectly, but maybe they're using the same 3rd
>> party controller.
>
> That makes it sounds like you suspect a problem in the NVMe devices.
> But isn't it way more likely that it's something in the machine? I
> mean we all seem to have the same one (ASRock Deskmini X600) and use
> NVMe devices that apparently work fine for everybody else, as they
> are not new and sold for a while. So it sounds more like that machine
> is doing something wrong or doing something odd that exposes a kernel
> bug.

Furthermore, it seems that the corruptions occur with all SSDs under
certain conditions, and the controllers are quite different.

One user from the c't forum wrote me that the corruptions only occur if
the network is enabled, and that this trick works with both Ethernet and
WLAN. (I asked him to report his results here.)

Maybe something (kernel, firmware or even the CPU) messes up DMA
transfers of different PCIe devices, e.g. due to a buffer overflow.

AFAICS, another thing in common: all CPUs used are from the Ryzen 8000
series (and on this chipset-less mainboard, all PCIe devices are
connected to the CPU).

Regards Stefan
Comment 33 Mario Limonciello (AMD) 2025-01-16 14:24:52 UTC
> Well this is a real doozy. 

Are all of these reports on the exact same motherboard?  "ASRock Deskmini X600"

> One user from the c't forum wrote me that the corruptions only occur if
> the network is enabled, and that this trick works with both Ethernet and
> WLAN. (I asked him to report his results here.)

Has anyone contacted ASRock support?  With such random results I would wonder if there is a signal integrity issue that needs to be looked at.

> For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS ->
> iommu) prevents the problem from happening.

Can others corroborate this finding?
Comment 34 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 15:32:46 UTC
(In reply to Mario Limonciello (AMD) from comment #33)
> > Well this is a real doozy. 
> Are all of these reports on the exact same motherboard?  "ASRock Deskmini
> X600"

Pretty sure that's the case.
 
> > One user from the c't forum wrote me that the corruptions only occur if
> > the network is enabled, and that this trick works with both Ethernet and
> > WLAN. (I asked him to report his results here.)
> Has anyone contacted ASRock support?

Not that I know of.

>  With such random results I would
> wonder if there is a signal integrity issue that needs to be looked at.

FWIW, Windows apparently works fine. But I guess that might be due to some random minor detail/difference or something like that.
 
> > For me it seems disabling the IOMMU in the BIOS Setup (Advanced -> AMD CBS
> > -> iommu) prevents the problem from happening.
> Can others corroborate this finding?

Yeah, would be good if someone could confirm my result.
Comment 35 Stefan 2025-01-16 15:35:19 UTC
> --- Comment #33 from Mario Limonciello (AMD) ---
>> Well this is a real doozy.
>
> Are all of these reports on the exact same motherboard?  "ASRock Deskmini
> X600"

If I haven't overlooked something, all reports are from the motherboard
"AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with a
series 8000 Ryzen.

>> One user from the c't forum wrote me that the corruptions only occur if
>> the network is enabled, and that this trick works with both Ethernet and
>> WLAN. (I asked him to report his results here.)
>
> Has anyone contacted ASRock support?  With such random results I would wonder
> if there is a signal integrity issue that needs to be looked at.

Signal integrity does not depend on transfer size and is not improved by
crosstalk of a 2nd SSD. (Corruptions disappear if a 2nd SSD is installed.)

Regards Stefan
Comment 36 mbe 2025-01-16 17:12:52 UTC
I can confirm that disabling IOMMU under "Advanced\AMD CBS\NBIO Common Options"
prevents the data corruption.

System spec: ASRock Deskmini X600, AMD Ryzen 7 8700G
Comment 37 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 17:29:46 UTC
On 15.01.25 09:40, Thorsten Leemhuis wrote:
> [...]
> 
> hch initially brought up that swiotlb seems to be used. Are there any
> BIOS setup settings we should try? I tried a few changes yesterday, but
> I still get the "PCI-DMA: Using software bounce buffering for IO
> (SWIOTLB)" message in the log and not a single line mentioning DMAR.

FWIW, I meanwhile became aware that it is normal that there are no lines
with DMAR when it comes to AMD's IOMMU. Sorry for the noise.

But there is a new development:

I noticed earlier today that disabling the IOMMU in the BIOS Setup seems
to prevent the corruption from occurring. Another user in the bugzilla
ticket just confirmed this.

Ciao, Thorsten
Comment 38 Mario Limonciello (AMD) 2025-01-16 17:33:52 UTC
> I noticed earlier today that disabling the IOMMU in the BIOS Setup seems
> to prevent the corruption from occurring.

If you can reliably reproduce this issue, can you also experiment with turning it back on in BIOS and then using:
* iommu=pt
  (which will use an identity domain)
and separately
* amd_iommu=off
  (which will disable the IOMMU from Linux)
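
For anyone testing these: such parameters can be added persistently via the boot loader. A minimal sketch for a GRUB-based distro (the update command varies by distribution):

$ sudoedit /etc/default/grub   # append iommu=pt (or amd_iommu=off) to GRUB_CMDLINE_LINUX_DEFAULT
$ sudo update-grub             # Fedora/openSUSE: grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
$ cat /proc/cmdline            # verify the parameter is active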

> If I haven't overlooked something, all reports are from the motherboard
> "AsRock X600M-STX" (from the mini PC "ASRock Deskmini X600") with an
> 8000 series Ryzen.

For everyone responding with their system, it would be ideal to also share information about the AGESA version (sometimes reported in `dmidecode | grep AGESA`) as well as the ASRock BIOS version (/sys/class/dmi/id/bios_version).

> Corruptions disappear if a 2nd SSD is installed

I missed that; quite bizarre.
Comment 39 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-16 18:25:50 UTC
Mario, thx for looking into this.

> If you can reliably reproduce this issue

Usually within ten to twenty seconds.

> iommu=pt

Apparently[1] helps. 

> amd_iommu=off

Apparently[1] helps, too.

[1] I did not try for long, just two or three minutes, but no corruption occurred; normally one occurs on nearly every run of "f3write -e 4" plus a check of the result afterwards.
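
For anyone who wants to reproduce it the same way, the sequence is roughly this sketch (/mnt/test is a placeholder for a mountpoint on the suspect SSD):

$ f3write --end-at=4 /mnt/test   # same as "f3write -e 4": write 4 x 1 GB of test data
$ f3read /mnt/test               # verify; overwritten/corrupted sectors are reported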

> it would be ideal to also share information

$ grep -s '' /sys/class/dmi/id/*
/sys/class/dmi/id/bios_date:12/05/2024
/sys/class/dmi/id/bios_release:5.35
/sys/class/dmi/id/bios_vendor:American Megatrends International, LLC.
/sys/class/dmi/id/bios_version:4.08
/sys/class/dmi/id/board_asset_tag:Default string
/sys/class/dmi/id/board_name:X600M-STX
/sys/class/dmi/id/board_vendor:ASRock
/sys/class/dmi/id/board_version:Default string
/sys/class/dmi/id/chassis_asset_tag:Default string
/sys/class/dmi/id/chassis_type:3
/sys/class/dmi/id/chassis_vendor:Default string
/sys/class/dmi/id/chassis_version:Default string
/sys/class/dmi/id/modalias:dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
/sys/class/dmi/id/product_family:Default string
/sys/class/dmi/id/product_name:X600M-STX
/sys/class/dmi/id/product_sku:Default string
/sys/class/dmi/id/product_version:Default string
/sys/class/dmi/id/sys_vendor:ASRock
/sys/class/dmi/id/uevent:MODALIAS=dmi:bvnAmericanMegatrendsInternational,LLC.:bvr4.08:bd12/05/2024:br5.35:svnASRock:pnX600M-STX:pvrDefaultstring:rvnASRock:rnX600M-STX:rvrDefaultstring:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
$ sudo dmidecode | grep AGESA
	String: AGESA!V9 ComboAm5PI 1.2.0.2a
Comment 40 Stefan 2025-01-16 21:51:56 UTC
Created attachment 307497 [details]
logs.tar.bz2

Hi,

I ran a few tests with SSDs and BIOS settings. (I cannot do this often
because the hardware is in use and installed remotely.) Kernel logs and
lspci output are in the enclosed archive. An unmodified (except for an
additional message) kernel 6.13-rc6 was used.

0. As reference: IOMMU and ethernet enabled
    1st M.2: Lexar
    2nd M.2: empty
    Archive directory: `lexar_empty`
    ==> Corruptions occur

1. IOMMU disabled via BIOS, ethernet enabled
    1st M.2: Lexar
    2nd M.2: empty
    Archive directory: `lexar_empty.noiommu`
    ==> No corruptions

2. Ethernet disabled via BIOS, IOMMU enabled
    1st M.2: Lexar
    2nd M.2: empty
    Archive directory: `lexar_empty.noeth`
    ==> No corruptions

3. IOMMU and ethernet enabled
    1st M.2: Lexar
    2nd M.2: Seagate Firecuda 520, 500 GB
    Archive directory: `lexar_firecuda`
    ==> No corruptions

4. IOMMU and ethernet enabled
    1st M.2: Seagate Firecuda 520, 500 GB
    2nd M.2: empty
    Archive directory: `firecuda_empty`
    ==> No corruptions

The last test was a surprise because it differs from the observations
reported in comment 26.

Note that the kernel emits the warning

 > workqueue: work disable count underflowed
 > WARNING: CPU: 1 PID: 23 at kernel/workqueue.c:4317
...

> [1] I did not try for a long time, but for two or three minutes and no
> corruption occurred; normally one occurs on nearly every try of "f3write -e
> 4"
> and checking the result afterwards.

I write 250 or 1000 files (1 file = 1 GB) because only about 2% of them
are corrupt.

The probability of errors seems to vary strongly.

Regards Stefan
Comment 41 Ralph Gerstmann 2025-01-16 22:35:43 UTC
Hi Team,

I have been tracking this error for weeks, made dozens of test installs, and I would like to add my recent results to this report.

Happens on at least btrfs and ext4.
I prefer btrfs since it takes only 2 sec. to find the bug right after installation (before reboot) with "btrfs scrub start /target", because btrfs does CRC checksumming while ext4 does not: on ext4 you need to copy & verify, manually or scripted.
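
As a sketch, the whole check right after installation (assuming the installer mounts the new filesystem at /target):

$ sudo btrfs scrub start -B /target   # -B runs the scrub in the foreground
$ sudo btrfs scrub status /target     # shows the checksum error counts
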

The errors seem 100% reproducible.
Usually there are about 15-30 corrupted files.

I usually test with a simple install of Linux Mint 22.
Takes only ~10 min, assuming you have a bootable install stick at hand.
Alternatively Ubuntu Server 24.04 or 24.10, it does not matter,
but they ask more questions. (Ubuntu Desktop does not offer btrfs.)

I was the one who brought in the author of c't Magazin, Christian.
With the help of c't's support forum I found bug report 1076372 and thus Stefan and this report.
I guess I was the first to find out that slot M2_2 is not affected.
(I am the person mentioned by Stefan in comment 32.)

I also found out:

If Ubuntu Server can't update packages while installing, due to an unavailable network, there might be no corrupted files.
(Tested twice with no corruptions.)
This was confirmed by Christian.
If the network is available, the problem is reproducible.
(Tested intensively with LAN and WLAN.)

If you insert an unused dummy SSD in slot M2_2 and install on slot M2_1, the error on slot M2_1 is gone.

I am willing to do further tests if needed, just ask.

Best regards Ralph

Systems involved:
ASRock DeskMini X600M-STX
with BIOS v4.03, v4.04, v4.08, v3.02
AMD Ryzen 5 8600G, AMD Ryzen 7 8700G
Diverse RAM
Diverse NVMe SSDs: Samsung 990 Evo, Samsung 990 PRO, Samsung 980 PRO (all with up-to-date firmware) + many more
Ubuntu 24.04, Ubuntu 24.10 (6.11), Linux Mint 22, Mint 21.3 Edge (6.5), Fedora 40, Fedora 41 + more
Comment 42 Ralph Gerstmann 2025-01-16 22:56:52 UTC
On Dec. 7th I opened a ticket with ASRock support.
I just updated it and pointed them here.
Comment 43 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-17 06:45:07 UTC
(In reply to Ralph Gerstmann from comment #41)

> I am willing to do further tests if needed

It would AFAICS be good if at least one person could do what Mario asked for in comment 38 (and hopefully confirm my results from comment 39).
Comment 44 Christoph Hellwig 2025-01-17 08:05:15 UTC
On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote:
> What does it mean that disabling the NVMe devices's write cache often
> but apparently not always helps? It it just reducing the chance of the
> problem occurring or accidentally working around it?

For consumer NAND devices you basically can't disable the volatile
write cache.  If you do disable it, that just means it gets flushed
after every write, meaning you have to write the entire NAND
(super)block for every write, causing a huge slowdown (and a lot of
media wear).  This will change timings a lot, obviously.  If it doesn't
change the timing, the device just fakes it, which reputable vendors
shouldn't be doing, but I would not be entirely surprised about it
for noname devices.

> hch initially brought up that swiotlb seems to be used. Are there any
> BIOS setup settings we should try? I tried a few changes yesterday, but
> I still get the "PCI-DMA: Using software bounce buffering for IO
> (SWIOTLB)" message in the log and not a single line mentioning DMAR.

The real question would be to figure out why it is used.

Do you see the

	pci_dbg(dev, "marking as untrusted\n");

message in the kernel log when enabling the pci debug output?
(I thought we had a sysfs file for that, but I can't find it.)
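
A sketch of enabling that debug output at runtime instead, via dynamic debug (assuming CONFIG_DYNAMIC_DEBUG and a mounted debugfs):

$ echo 'file drivers/pci/* +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
$ sudo dmesg | grep -i untrusted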
Comment 45 Mathieu Borderé 2025-01-17 08:57:21 UTC
Hi, an extra data point. I have the following setup:

AsRock Deskmini x600, BIOS 4.08 with Secure Boot enabled.
Ryzen 9 7900
Kingston Fury 2*32GB at default 5200 MHz
Western Digital SN850X 1TB (main slot, secondary slot never used)
Intel AX210 WiFi

Ethernet is enabled, but I don't use it, I use WiFi. IOMMU is "auto", I haven't touched it.

Running kernel 6.12.9 on Fedora 41 with btrfs. 

Been using this system for a couple of months, have copied my 500GB backup drive containing 1.6 million files to the nvme drive. I have also just generated 400 1GB files containing data from /dev/urandom. btrfs scrub reports no errors.
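
In case someone wants to replicate that last test, a minimal sketch (/data is a placeholder for a directory on the suspect disk):

$ for i in $(seq 1 400); do dd if=/dev/urandom of=/data/f$i.bin bs=1M count=1024 status=none; done
$ sudo btrfs scrub start -B /data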
Comment 46 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-17 09:51:20 UTC
On 17.01.25 09:05, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote:
>
>> hch initially brought up that swiotlb seems to be used. Are there any
>> BIOS setup settings we should try? I tried a few changes yesterday, but
>> I still get the "PCI-DMA: Using software bounce buffering for IO
>> (SWIOTLB)" message in the log and not a single line mentioning DMAR.
> 
> The real question would be to figure out why it is used.
> 
> Do you see the
> 
>       pci_dbg(dev, "marking as untrusted\n");
> 
> message in the commit log if enabling the pci debug output?

By booting with 'ignore_loglevel dyndbg="file drivers/pci/* +p"' I
suppose? No, that is not printed (but other debug lines from the pci
code are).

Side note: that "PCI-DMA: Using software bounce buffering for IO
>> (SWIOTLB)" message does show up on two other AMD machines I own as
well. One also has a Ryzen 8000, the other one a much older one.

And BTW a few bits of the latest development in the bugzilla ticket
(https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):

* iommu=pt and amd_iommu=off seems to work around the problem (in
addition to disabling the iommu in the BIOS setup).

* Not totally sure, but it seems most if not everyone affected is using a
Ryzen 8000 CPU -- and now one user showed up who mentioned a DeskMini
x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But
that might be due to other aspects. A former colleague of mine who can
reproduce the problem will later test if a different CPU line really is
making a difference.

Ciao, Thorsten
Comment 47 Christoph Hellwig 2025-01-17 09:55:31 UTC
On Fri, Jan 17, 2025 at 10:51:09AM +0100, Thorsten Leemhuis wrote:
> By booting with 'ignore_loglevel dyndbg="file drivers/pci/* +p"' I
> suppose? No, that is not printed (but other debug lines from the pci
> code are).
> 
> Side note: that "PCI-DMA: Using software bounce buffering for IO
> (SWIOTLB)" message does show up on two other AMD machines I own as
> well. One also has a Ryzen 8000, the other one has a much older one.
> 
> And BTW a few bits of the latest development in the bugzilla ticket
> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):
> 
> * iommu=pt and amd_iommu=off seems to work around the problem (in
> addition to disabling the iommu in the BIOS setup).

That suggests the problem is related to the dma-iommu code, and
my strong suspect is the swiotlb bounce buffering for untrusted
device.  If you feel adventurous, can you try building a kernel
where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked
to always return false?
Comment 48 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-17 10:30:57 UTC
On 17.01.25 10:55, Christoph Hellwig wrote:
> On Fri, Jan 17, 2025 at 10:51:09AM +0100, Thorsten Leemhuis wrote:
>> By booting with 'ignore_loglevel dyndbg="file drivers/pci/* +p"' I
>> suppose? No, that is not printed (but other debug lines from the pci
>> code are).
>>
>> Side note: that "PCI-DMA: Using software bounce buffering for IO
>> (SWIOTLB)" message does show up on two other AMD machines I own as
>> well. One also has a Ryzen 8000, the other one has a much older one.
>>
>> And BTW a few bits of the latest development in the bugzilla ticket
>> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):
>>
>> * iommu=pt and amd_iommu=off seems to work around the problem (in
>> addition to disabling the iommu in the BIOS setup).
> 
> That suggests the problem is related to the dma-iommu code, and
> my strong suspect is the swiotlb bounce buffering for untrusted
> device.  If you feel adventurous, can you try building a kernel
> where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked
> to always return false?

Tried that, did not help: I still get corrupted data.

Ciao, Thorsten
Comment 49 Bruno Gravato 2025-01-17 13:36:44 UTC
On Fri, 17 Jan 2025 at 09:51, Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
> * Not totally sure, but it seems most if not everyone affected is using a
> Ryzen 8000 CPU -- and now one user showed up who mentioned a DeskMini
> x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But
> that might be due to other aspects. A former colleague of mine who can
> reproduce the problem will later test if a different CPU line really is
> making a difference.

One other different aspect for that user besides the 7000 series CPU
is that he's using a wifi card as well (it sits in an M.2 wifi slot
just below the main M.2 disk slot), so I wonder if that may play a
role? I think most of us have no wifi card installed. I think I have an
M.2 wifi card in my former NUC, I'll see if it's compatible with the
deskmini and try it out.

Another reason could be that some disk models aren't affected... I think
Stefan reported no issues on a Firecuda 520.

I ordered a Crucial T500 1TB yesterday. It's for another machine, but
I will try it on the deskmini x600 before deploying on the other
machine. I should receive it in a week or so.

Bruno
Comment 50 mbe 2025-01-17 18:32:33 UTC
No corruption with:
* IOMMU disabled in BIOS
* IOMMU enabled in BIOS, iommu=pt
* IOMMU enabled in BIOS, amd_iommu=off

Full system spec: 

ASRock Deskmini X600
CPU: AMD Ryzen 7 8700G
Memory: 2x 16 GB Kingston-Fury KF564S38IBK2-32, tested at different speeds from 4800 to 6400
1st M.2: Samsung 990 Pro 2 TB NVMe, latest firmware 4B2QJXD7
2nd M.2: always empty
Wifi M.2: Intel AX210, enabled and connected in all tests
Ethernet: enabled, but never connected in all tests

cat /sys/class/dmi/id/bios_version
4.08

dmidecode | grep AGESA
	String: AGESA!V9 ComboAm5PI 1.2.0.2a
	
latest SIO firmware 240522 installed

I had the error from the beginning, even with the original BIOS version 1.43.

Many thanks to everyone who is now looking into the problem.
Matthias
Comment 51 Ralph Gerstmann 2025-01-17 20:21:39 UTC
Made 3 more tests...

*) Mint 22 (6.8.0) with IOMMU disabled in BIOS: No errors.
   (I set it back to Auto before I continued with the following tests.)

*) Mint 22 (6.8.0) with network disconnected: Errors
(This is what I thought I had seen long ago repeatedly, but since we found no errors in a disconnected Ubuntu 24.10 Server install, I tested this again.)

*) Ubuntu 24.10 (6.11.0) with network disconnected: Errors

Conclusion:
A missing network might prevent the failure during install, at least in Ubuntu Server 24.10, but it can happen anyway. An enabled network seems to raise the chance.

I made dozens of installations with Mint 22 (WLAN/LAN/no net); I am pretty sure I didn't see a single one without this error, as long as the known conditions (4x4 NVMe SSD in slot 1, nothing in slot 2) are met.

Both systems show the same values after installation:
/sys/class/block/nvme0n1/queue/max_hw_sectors_kb: 128
/sys/class/block/nvme0n1/queue/max_sectors_kb: 128
/sys/class/block/nvme0n1/queue/max_segments: 33
/sys/class/block/nvme0n1/queue/max_segment_size: 4294967295
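
Side note: these can be dumped in one go, similar to the dmi listing earlier in this thread:

$ grep -s '' /sys/class/block/nvme0n1/queue/max_*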

BTW: ASRock support confirmed they forwarded this bug report to their BIOS development team.

Ralph
Comment 52 Keith Busch 2025-01-17 20:23:20 UTC
Is this even a Linux bug? Surely this would be observed in other operating systems?
Comment 53 Ralph Gerstmann 2025-01-17 20:26:09 UTC
Until now no one could reproduce it on Windows.
Comment 54 Stefan 2025-01-17 21:32:30 UTC
Hi,

>> What does it mean that disabling the NVMe device's write cache
>> often but apparently not always helps? Is it just reducing the
>> chance of the problem occurring or accidentally working around it?
>
> For consumer NAND devices you basically can't disable the volatile
> write cache.  If you do disable it, that just means it gets flushed
> after every write, meaning you have to write the entire NAND
> (super)block for every write, causing a huge slowdown (and a lot of
> media wear).  This will change timings a lot, obviously.  If it
> doesn't change the timing, the device just fakes it, which reputable
> vendors shouldn't be doing, but I would not be entirely surprised
> about it for noname devices.

As already mentioned, my SSD has no DRAM and uses HMB (Host Memory
Buffer). (It has a non-volatile SLC cache.) Disabling the volatile write cache
has no significant effect on the read/write performance of large files,
because the HMB size is only 40 MB. But things like file deletions may be
slower.
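
Side note: whether and how much HMB a drive requests can be checked with nvme-cli, e.g. as a sketch:

$ sudo nvme id-ctrl /dev/nvme0 | grep -i -e hmpre -e hmmin   # preferred/minimum HMB size, in 4 KiB units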

AFAIS the corruptions occur with both kinds of SSDs, the ones that have
their own DRAM and the ones that use HMB.

> --- Comment #49 from Bruno Gravato ---
>> * Not totally sure, but it seems most or everyone affected is
>> using a Ryzen 8000 CPU -- and now one user showed up that mentioned
>> a DeskMini x600 with a Ryzen 7000 CPU is not affected (see ticket
>> for details). But that might be due to other aspects. A former
>> colleague of mine who can reproduce the problem will later test if
>> a different CPU line really is making a difference.
>
> One other different aspect for that user besides the 7000 series CPU
> is that he's using a wifi card as well (that sits in a M.2 wifi slot
> just below the main M.2 disk slot), so I wonder if that may play a
> role? I think most of us have no wifi card installed. I think I have
> a M.2 wifi card on my former NUC, I'll see if it's compatible with
> the deskmini and try it out.
>
> The other reason could be some disk models aren't affected... I think
> Stefan reported no issues on a Firecuda 520.

Correct. To verify that the two other CPU series are not affected,
someone who can reproduce this error and who has another CPU lying
around must swap them.

> --- Comment #51 from Ralph Gerstmann ---
> A missing network might prevent the failure during install, at least
> in Ubuntu Server 24.10, but it can happen anyway. An enabled network
> seems to raise the chance.

I had to disable it in the BIOS. Just not connecting it has no effect
because drivers and firmware are still loaded.


Just for the record (I already mentioned it): I'm using the latest BIOS
version 4.08 with AGESA PI 1.2.0.2a (according to the AsRock page) and
firmware blobs version 20241210 from
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
and I can confirm that the corruptions also occur with older versions of
the BIOS/firmware.

Regards Stefan
Comment 55 Ralph Gerstmann 2025-01-17 21:49:39 UTC
> > --- Comment #51 from Ralph Gerstmann ---
> > A missing network might prevent the failure during install, at least
> > in Ubuntu Server 24.10, but it can happen anyway. An enabled network
> > seems to raise the chance.
>
> I had to disable it in the BIOS. Just not connecting it has no effect
> because drivers and firmware are still loaded.

I had a lot of different situations in which the network did not work:
tagged VLAN, unplugged cable, removed WLAN card, too lazy to enter the access key.

But one thing I never did: disable LAN or WLAN devices in the BIOS.
Comment 56 Keith Busch 2025-01-18 01:03:54 UTC
On Fri, Jan 17, 2025 at 10:31:55PM +0100, Stefan wrote:
> As already mentioned, my SSD has no DRAM and uses HMB (Host memory
> buffer). 

HMB and volatile write caches are not necessarily intertwined. A device
can have both. Generally speaking, you'd expect the HMB to have SSD
metadata, not user data, where a VWC usually just has user data. The
spec also requires the device maintain data integrity even with an
unexpected sudden loss of access to the HMB, but that isn't the case
with a VWC.

> (It has a non-volatile SLC cache.) Disabling the volatile write cache
> has no significant effect on the read/write performance of large files,

Devices are free to have whatever hierarchy of non-volatile caches they
want without advertising that to the host, but if they're calling those
"volatile" then I think something has been misinterpreted.

> because the HMB size is only 40 MB. But things like file deletions may be
> slower.
> 
> AFAIS the corruptions occur with both kinds of SSDs, the ones that have
> their own DRAM and the ones that use HMB.

Yeah, that was the point of the experiment. If corruption happens when
it's off, then that helps rule out host buffer size/alignment (which is
where this bz started) as a triggering condition. Disabling VWC is not a
"fix", it's just a debug data point. If corruption goes away with it
off, though, then we can't really conclude anything for this issue.
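
For reference, both things can be inspected with nvme-cli; a sketch:

$ sudo nvme id-ctrl /dev/nvme0 | grep vwc     # whether the drive advertises a volatile write cache
$ sudo nvme get-feature /dev/nvme0 -f 6 -H    # current state of feature 6 (volatile write cache)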
Comment 57 Thorsten Leemhuis 2025-01-20 14:31:41 UTC
On 17.01.25 10:51, Thorsten Leemhuis wrote:
> On 17.01.25 09:05, Christoph Hellwig wrote:
>> On Wed, Jan 15, 2025 at 09:40:04AM +0100, Thorsten Leemhuis wrote:

> And BTW a few bits of the latest development in the bugzilla ticket
> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):
> 
> * iommu=pt and amd_iommu=off seems to work around the problem (in
> addition to disabling the iommu in the BIOS setup).
> 
> * Not totally sure, but it seems most or everyone affected is using a
> Ryzen 8000 CPU -- and now one user showed up that mentioned a DeskMini
> x600 with a Ryzen 7000 CPU is not affected (see ticket for details). But
> that might be due to other aspects. A former colleague of mine who can
> reproduce the problem will later test if a different CPU line really is
> making a difference.

My former colleague Christian Hirsch[1] (not CCed) can reproduce the
problem reliably. He today switched the CPU to a Ryzen 7 7700 and later
to a Ryzen 9600X, and with those, things worked just fine, i.e. no
corruptions. But they came back after putting the 8600G back in.

Ralph, can you please add this detail to the Asrock support ticket?

Ciao, Thorsten

[1] he described building an x600 machine in the c't magazine, which is
the reason why I and a few others affected and CCed built our x600 systems
Comment 58 Mario Limonciello (AMD) 2025-01-20 15:15:10 UTC
So are all the problematic CPUs that reproduce this Ryzen 8600G/Ryzen 8700G?  Perhaps there is a firmware issue with those.
Comment 59 Stefan 2025-01-20 16:15:27 UTC
Hi,

> --- Comment #58 from Mario Limonciello (AMD) ---
> So are all the problematic CPUs reproducing this Ryzen 8600G/Ryzen 8700G?
> Perhaps there is a firmware issue with those.

... or even a hardware issue with those 8000 series CPUs which occurs
under certain conditions, namely without a chipset.

AsRock offers a few other products that use the same technology
(DeskMeet, Jupiter and a mini-ITX mainboard). Are they affected too?

Has anyone (ASRock and/or AMD) tested them with Linux before releasing
the hardware? (Windows often uses older technologies / features). AFAIK,
AsRock does not develop such products and firmware without massive
support from AMD.

I started another support request at http://event.asrock.com/tsd.asp .
Maybe this will expedite a fix.

Regards Stefan
Comment 60 Mario Limonciello (AMD) 2025-01-20 16:26:15 UTC
> ... or even a hardware issue with those 8000 series CPUs which occurs
> under certain conditions, namely without a chipset.

> AsRock offers a few other products that use the same technology
> (DeskMeet, Jupiter and a mini-ITX mainboard). Are they affected too?

Yes; I also want to know if this is unique to ASRock's X600M-STX or if this is happening to anyone on any other AM5 motherboards.

> Has anyone (ASRock and/or AMD) tested them with Linux before releasing
> the hardware

Yes; I can assert that AMD has tested 8600G and 8700G with Linux.  You can look under "OS support" to see what OSes have been tested.

https://www.amd.com/en/products/processors/desktops/ryzen/8000-series/amd-ryzen-5-8600g.html

It's not out of the question that a generic AGESA firmware regression under specific circumstances has happened; but right now, all of the evidence on this thread /currently/ points to 8600G/8700G + X600M-STX.
Comment 61 Stefan 2025-01-20 17:37:06 UTC
Hi,

On 20.01.25 17:26, bugzilla-daemon@kernel.org wrote:
> --- Comment #60 from Mario Limonciello (AMD) (mario.limonciello@amd.com) ---
> Yes; I also want to know if this is
> unique to ASRock's X600M-STX or if this is happening to anyone on any
> other AM5 motherboards.
>
>> Has anyone (ASRock and/or AMD) tested them with Linux before
>> releasing
> the hardware
>
> Yes; I can assert that AMD has tested 8600G and 8700G with Linux.
> You can look under "OS support" to see what OSes have been tested.

sorry, my last statement about insufficient testing was obviously
misleading. I had the combination CPU + mainboard in mind; the "them"
refers to the AsRock products in the previous sentence.

Of course the combination of 8x00G + 1 or 2 x Promontory 19/21 (which
makes up the different chipsets) has been tested and is widely used. I
therefore think a generic AM5 issue is unlikely.

But the combination of 8x00G + Knoll3 (the magic SoC enabler chip) is
quite new and not used often so far. And since the errors occur in many
different configurations, they should be detectable by proper testing.

Most likely, the corruptions are triggered either by the combination
8x00G + Knoll3 (in that case other x600 products from AsRock should be
affected too) or by the combination 8x00G + X600M STX (that specific
mainboard only) and may be caused by firmware, hardware or the kernel.

> It's not out of the question that a generic AGESA firmware regression
> under specific circumstances has happened; but right now, all of the
> evidence on this thread /currently/ points to 8600G/8700G +
> X600M-STX.

The issues occurred with all BIOS versions I tested, starting from the
initial one, 1.43. The AGESA versions of some of them are stated at
https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS .

Regards Stefan
Comment 62 Ralph Gerstmann 2025-01-20 18:15:31 UTC
(In reply to Thorsten Leemhuis from comment #57)
> 
> My former colleague Christian Hirsch (not CCed) can reproduce the
> problem reliably. He today switched the CPU to a Ryzen 7 7700 and later
> to some Ryzen 9600X – and with those things worked just fine, e.g. no
> corruptions. But they came back after putting the 8600G back in.
> 
> Ralph, can you please add this detail to the Asrock support ticket?
> 
> Ciao, Thorsten

Done.
Comment 63 Ralph Gerstmann 2025-01-22 21:38:42 UTC
ASRock support came back to me today and said they can't reproduce.

> We cannot reproduce the problem. Can you provide steps on how to reproduce it?
> Our test method:
> CPU: 8600G
> BIOS: 4.08
> OS: Ubuntu 22.04 LTS
> SSD: Crucial P300 2TB (M.2_1)
> We copied an 800GB file and used f3write to create 1GB test files, 600+ rounds.
> Didn't see this problem...

I answered them:

Hi,

Steps:

Reset the BIOS. E.g. we can't reproduce if the IOMMU is disabled.
BIOS Version does not seem to matter.
CPU: 8600G and 8700G
OS: Any recent Linux Kernel. (Details in bugreport https://bugzilla.kernel.org/show_bug.cgi?id=219609 )
SSD in M.2_1: A lot of SSDs fail, probably most, but not all.
For details please check the bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 )
Slot M.2_2: Must be empty. There are no problems if populated.

As you can see in the bug report, there are different ways to reproduce it.
I personally prefer installing Linux Mint 22 on btrfs.
That fails 100%. Network setup does not matter.

Ubuntu Server 24.10 on btrfs installation fails sometimes.
It seems that with a disabled network it might not fail. So make sure the network is working and try more than once if you can't reproduce. Interface LAN/WLAN does not matter.

The error also happens on ext4; I reproduced this.
But since it is much easier to reproduce with btrfs scrub, I always go this way and drop the installation later.
Other users using ext4 on a system they don't want to reinstall use f3 to copy and verify.

I suggest you replace the Crucial P300 with one of the SSDs mentioned in the bug report. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 )

Best regards,
Ralph
Comment 64 Ralph Gerstmann 2025-01-22 21:58:57 UTC
(In reply to Ralph Gerstmann from comment #63)

> > SSD: Crucial P300 2TB (M.2_1)
> > We have copy a 800GB file and use F3write to create 1GB test file 600+
> round.
> > Didn't see this problem...


> I suggest you to replace Crucial P300 with one of the SSDs mentioned in the
> bugreport. ( https://bugzilla.kernel.org/show_bug.cgi?id=219609 )
> 

Does anybody here have experience with this SSD?
Comment 65 Bruno Gravato 2025-01-22 22:29:24 UTC
Not the P300, but I got a Crucial T500 1TB yesterday and experimented
with it and I can still reproduce the errors.

Original firmware on the disk was P8CR002, I then upgraded to P8CR004,
but it didn't make any difference... still getting checksum errors
after copying a large amount of files and running btrfs scrub.


Comment 66 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-23 08:03:36 UTC
(In reply to Ralph Gerstmann from comment #63)

> As you see in the bugreport, there are different ways to reproduce.
> I personally prefer installing Linux Mint 22 on btrfs.
> That fails 100%. Network setup does not matter.

Unasked-for advice from someone who occasionally had to reproduce problems in a lab setup over the last 20 years:

I'd say you should point them to reproducing it using f3write and f3read, which at least for me and apparently a few others (please correct me if I'm wrong) quickly reproduces the problem without much effort (like reinstalling a distro) for the person that runs the test.
Comment 67 Ralph Gerstmann 2025-01-23 15:49:27 UTC
New feedback from ASRock support:

<snip>

Hello,

Got feedback from our BIOS department:
Sorry, we are still not able to reproduce the problem.
BIOS: 4.08 with IOMMU enabled
CPU: 8600G
SSD: SAMSUNG 990 Pro 1TB
OS: Linux Mint 22.1 installed on btrfs
LAN: Connected
Transfer 100x 1G files and still did not meet the problem. (Tested via f3)

</snip>
Comment 68 Christoph Hellwig 2025-01-28 07:41:42 UTC
On Mon, Jan 20, 2025 at 03:31:28PM +0100, Thorsten Leemhuis wrote:
> My former colleague Christian Hirsch (not CCed) can reproduce the
> problem reliably. He today switched the CPU to a Ryzen 7 7700 and later
> to some Ryzen 9600X – and with those things worked just fine, e.g. no
> corruptions. But they came back after putting the 8600G back in.

So basically you need a specific board and a specific CPU, and only
one M.2 SSD in the two slots to reproduce it?  Phew.  I'm kinda lost on
what we could do about this on the Linux side.
Comment 69 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-01-28 08:51:49 UTC
(In reply to Ralph Gerstmann from comment #67)
> New feedback from ASRock support:
> 
> Got feedback from our BIOS department:
> Sorry, we are still not able to reproduce the problem.
> […]

That sounds like we might be stuck here, as I guess we need them to reproduce the problem, as they are unlikely to fix it otherwise.

Does anyone have any idea why they were unable to reproduce the problem?

> BIOS: 4.08 with IOMMU enabled

Does it maybe make a difference if the IOMMU is enabled in the BIOS Setup (the default IIRC is "AUTO")?

> SSD: SAMSUNG 990 Pro 1TB

In which slot was it? Was it the only device?

> Transfer 100x 1G files and still did not meet the problem. (Tested via f3)

Was that transferring a file using f3 (e.g. with some network share), or was it "transfer 100x 1G file and run f3write and f3read" in parallel?
Comment 70 Stefan 2025-01-28 12:07:22 UTC
Hi,

On 28.01.25 08:41, Christoph Hellwig wrote:
> So basically you need a specific board and a specific CPU, and only
> one M.2 SSD in the two slots to reproduce it?

more generally, it depends on which PCIe devices are used. On my PC
corruptions also disappear if I disable the ethernet controller in the BIOS.

Furthermore it depends on transaction sizes (that's why older kernels
work), the IOMMU, sometimes on the volatile write cache and partially on
the SSD type (which may have something to do with the former things).

> Puh.  I'm kinda lost on what we could do about this on the Linux
> side.

Because it also depends on the CPU series, a firmware or hardware issue
seems to be more likely than a Linux bug.

ATM ASRock is still trying to reproduce the issue. (I'm in contact with
them too. But they have Chinese New Year holidays in Taiwan this week.)

If they can't reproduce it, they have to provide an explanation why the
issues are seen by so many users.

Regards Stefan
Comment 71 Dr. David Alan Gilbert 2025-01-28 12:53:04 UTC
* Stefan (linux-kernel@simg.de) wrote:
> Hi,
> 
> Am 28.01.25 um 08:41 schrieb Christoph Hellwig:
> > So basically you need a specific board and a specific CPU, and only
> > one M.2 SSD in the two slots to reproduce it?
> 
> more generally, it depends on which PCIe devices are used. On my PC
> corruptions also disappear if I disable the ethernet controller in the BIOS.
> 
> Furthermore it depends on transaction sizes (that's why older kernels
> work), the IOMMU, sometimes on the volatile write cache and partially on
> the SSD type (which may have something to do with the former things).

Is there any characterisation of the corrupted data; last time I looked at the
bz there wasn't.
I mean, is it reliably any of:
   a) What's the size of the corruption?
          block, cache line, word, bit???
   b) Position?
          e.g. last word in a block or something?
   c) Data?
          pile of zero's/ff's junk/etc?

   d) Is it a missed write, old data, or partially written block?

Dave

Comment 72 Stefan 2025-01-28 14:38:43 UTC
Hi,

On 28.01.25 13:52, Dr. David Alan Gilbert wrote:
> Is there any characterisation of the corrupted data; last time I
> looked at the bz there wasn't.

Yes, there is. (And I already reported it at least on the Debian bug
tracker, see links in the initial message.)

f3 reports overwritten sectors, i.e. it looks like the pseudo-random
test pattern is written to the wrong position. These corruptions occur in
clusters whose size is an integer multiple of 2^17 bytes in most cases
(about 80%) and of 2^15 bytes in all cases.

The frequency of these corruptions is roughly 1 cluster per 50 GB written.
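
For anyone who wants to characterise their corruptions the same way and has a known-good copy of an affected file, a simple sketch to locate the differing byte ranges:

$ cmp -l good.bin corrupt.bin | head    # 1-based offsets of differing bytes
$ cmp -l good.bin corrupt.bin | wc -l   # total number of differing bytes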

Can others confirm this or do they observe a different characteristic?

Regards Stefan


Comment 73 Ralph Gerstmann 2025-01-29 00:23:15 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #69)

Hi Thorsten

> 
> > BIOS: 4.08 with IOMMU enabled
> 
> Does it maybe make a difference if the IOMMU is enabled in the BIOS Setup
> (the default iirc is "AUTO")

I tested enabled vs. auto; both produce errors.

> 
> > SSD: SAMSUNG 990 Pro 1TB
> 
> In which slot was it? Was it the only device?

I told them to use slot 1 and no device in slot 2,
but I didn't ask again to make sure they did so.

> 
> > Trasfer 100x 1G file and still not meet the problem. (Tested via f3)
> 
> Was that transfering a file using f3 (e.g. with some network share), or was
> it "transfer 100x 1G file and run f3write and f3read" in parallel?

They added a screenshot where I can see lots of OKs from f3read.

Regards, Ralph
Comment 74 Ralph Gerstmann 2025-01-29 00:30:39 UTC
(In reply to Stefan from comment #70)

> On my PC
> corruptions also disappear if I disable the ethernet controller in the BIOS.
> 

Hi Stefan,
I tested this too.
On my system it does not matter.
Errors also occur if LAN and WLAN are disabled in the BIOS.
(LAN was plugged in but obviously disabled.)
(WLAN is not installed since we have been hunting this bug.)

Please test what you experienced again to verify.

Regards, Ralph
Comment 75 Ralph Gerstmann 2025-01-29 00:36:28 UTC
(In reply to Christoph Hellwig from comment #68)
> On Mon, Jan 20, 2025 at 03:31:28PM +0100, Thorsten Leemhuis wrote:

> 
> So basically you need a specific board and a specific CPU, and only
> one M.2 SSD in the two slots to reproduce it?  

Hi Christoph,

The problem exists only if you place a single SSD in slot 1.
The problem in slot 1 disappears if you place a second SSD in slot 2.
The problem disappears if you place a single SSD in slot 2.

Regards, Ralph
Comment 76 Bruno Gravato 2025-01-29 13:00:04 UTC
> > Is there any characterisation of the corrupted data; last time I
> > looked at the bz there wasn't.
>
> Yes, there is. (And I already reported it at least on the Debian bug
> tracker, see links in the initial message.)
>
> f3 reports overwritten sectors, i.e. it looks like the pseudo-random
> test pattern is written to wrong position. These corruptions occur in
> clusters whose size is an integer multiple of 2^17 bytes in most cases
> (about 80%) and 2^15 in all cases.
>
> The frequency of these corruptions is roughly 1 cluster per 50 GB written.
>
> Can others confirm this or do they observe a different characteristic?

In my tests I was using real data: a backup of my files.

On one such test I copied over 300K files of variable sizes and types,
totalling about 60GB. A bit over 20 files got corrupted.
I tried copying the files over the network (ethernet) using rsync/ssh.
I also tried restoring the files using restic (over ssh as well). And
I also tried copying the files locally from a SATA disk. In all cases
I got similar results with some files being corrupted.
The destination nvme disk was using btrfs and running btrfs scrub
after the copy detects quite a few checksum errors.

I analyzed some of those corrupted files and one of them happened to
be a text file (linux kernel source code).
A big portion of the text was replaced with text from another file in
the same directory (being text made it easy to find where it came
from).
So this was a contiguous block of text that was overwritten with a
contiguous block of text from another file.
If I remember correctly the other file was not corrupted (so the
blocks weren't swapped). It looked like a certain block of text was
written twice: on the correct file and on another file in the same
directory.

I also got some jpeg images corrupted. I was able to open and view
(partially) those images and it looked like a portion of the image was
repeated in a different part of it, so blocks of the same file were
probably duplicated and overwritten within itself.

The blocks being overwritten seemed to be different sizes on different files.

Bruno
Comment 77 Stefan 2025-02-03 18:48:25 UTC
Hi,

just got feedback from ASRock. They asked me to make a video of the
corruptions occurring on my remote (and headless) system.
Maybe I should make a video of printing out the logs that can be found on
the Linux and Debian bug trackers ...

Seems that ASRock is unwilling to solve the problem.

Regards Stefan


Comment 78 Christoph Hellwig 2025-02-04 06:26:13 UTC
On Fri, Jan 17, 2025 at 11:30:47AM +0100, Thorsten Leemhuis wrote:
> >> Side note: that "PCI-DMA: Using software bounce buffering for IO
> >> (SWIOTLB)" message does show up on two other AMD machines I own as
> >> well. One also has a Ryzen 8000, the other one has a much older one.

The message will always show with > 4G of memory.  It only implies swiotlb
is initialized, not that any device actually uses it.

> >> And BTW a few bits of the latest development in the bugzilla ticket
> >> (https://bugzilla.kernel.org/show_bug.cgi?id=219609 ):
> >>
> >> * iommu=pt and amd_iommu=off seems to work around the problem (in
> >> addition to disabling the iommu in the BIOS setup).

iommu=pt calls iommu_set_default_passthrough, which sets
iommu_def_domain_type to IOMMU_DOMAIN_IDENTITY.  I.e. the hardware
IOMMU is left on, but treated as a 1:1 mapping by Linux.

amd_iommu=off sets amd_iommu_disabled, which calls disable_iommus,
which from a quick read disables the hardware IOMMU.

In either case we'll end up using dma-direct instead of dma-iommu.

> > 
> > That suggests the problem is related to the dma-iommu code, and
> > my strong suspect is the swiotlb bounce buffering for untrusted
> > device.  If you feel adventurous, can you try building a kernel
> > where dev_use_swiotlb() in drivers/iommu/dma-iommu.c is hacked
> > to always return false?
> 
> Tried that, did not help: I still get corrupted data.

.. which together with this implies that the problem only happens
when using the dma-iommu code (with or without swiotlb buffering
for unaligned / untrusted data), and does not happen with
dma-direct.

If we assume it also is related to the optimal dma size, which
the original report suggests, the values for that might be
interesting.  For dma-iommu this is:

	PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);

where IOVA_RANGE_CACHE_MAX_SIZE is 6, i.e.

	PAGE_SIZE << 5 or 131072 for x86_64.

For dma-direct it falls back to dma_max_mapping_size, which is
SIZE_MAX without swiotlb, or swiotlb_max_mapping_size, which
is a bit complicated due to minimum alignment, but in this case
should evaluate to 258048, which is almost twice as big.

And all this unfortunately leaves me really confused.  If someone is
interested in playing around with it at the risk of data corruption, it would
be interesting to hack hardcoded values into dma_opt_mapping_size, e.g.
plug in the 131072 used by dma-iommu while using dma-direct with the
above iommu disable options.
Comment 79 Bruno Gravato 2025-02-04 09:13:13 UTC
On Tue, 4 Feb 2025 at 06:12, Christoph Hellwig wrote:
>
> On Sun, Feb 02, 2025 at 08:32:31AM +0000, Bruno Gravato wrote:
> > In my tests I was using real data: a backup of my files.
> >
> > On one such test I copied over 300K files, variables sizes and types
> > totalling about 60GB. A bit over 20 files got corrupted.
> > I tried copying the files over the network (ethernet) using rsync/ssh.
> > I also tried restoring the files using restic (over ssh as well). And
> > I also tried copying the files locally from a SATA disk. In all cases
> > I got similar results with some files being corrupted.
> > The destination nvme disk was using btrfs and running btrfs scrub
> > after the copy detects quite a few checksum errors.
>
> So you used various different data sources, and the destination was
> always the nvme device in the suspect slot.
>

Yes, regardless of the data source, the destination was always a
single nvme disk on the main M.2 nvme slot, with the secondary M.2
nvme slot empty.
I tried 3 different disks (WD, Crucial and Solidigm) with similar results.
If I put any of those disks on the secondary M.2 slot (with the main
slot empty) the problem doesn't occur.
The one that intrigues me most is if I put 2 nvme disks in, occupying
both M.2 slots, the problem doesn't occur either.
The secondary slot must be empty for the issue to happen.

I didn't try using the main M.2 slot as source instead of target, to
see if the problem also occurs on reading as well.
I could try that if you think it's worth testing.


> > I analyzed some of those corrupted files and one of them happened to
> > be a text file (linux kernel source code).
> > A big portion of the text was replaced with text from another file in
> > the same directory (being text made it easy to find where it came
> > from).
> > So this was a contiguous block of text that was overwritten with a
> > contiguous block of text from another file.
> > If I remember correctly the other file was not corrupted (so the
> > blocks weren't swapped). It looked like a certain block of text was
> > written twice: on the correct file and on another file in the same
> > directory.
>
> That's a very interesting pattern.
>
> > I also got some jpeg images corrupted. I was able to open and view
> > (partially) those images and it looked like a portion of the image was
> > repeated in a different part of it), so blocks of the same file were
> > probably duplicated and overwritten within itself.
> >
> > The blocks being overwritten seemed to be different sizes on different
> files.
>
> This does sound like a fairly common pattern due to SSD FTL issues,
> but I still don't want to rule out swiotlb, which due to the bucketing
> could maybe also lead to these, but I can't really see how.  But the
> fact that the affected systems seem to be using swiotlb despite no
> good reason for them to do so still leaves me puzzled.
>
Comment 80 Scharel 2025-02-04 09:53:14 UTC
In my case the issue also occurs when both slots are in use.
I use ZFS and both NVMes are in a mirror.
Scrubbing after writing a larger amount of data to the mirror reports a small number of cksum errors on the disk in slot M2_1.

CPU: AMD Ryzen 5 8500G
NVMe (2x): WD Red SN700 4000GB
Comment 81 Mario Limonciello (AMD) 2025-02-04 15:12:43 UTC
Can someone who can readily reproduce this please try with 'iommu.forcedac=1 iommu.strict=1' on the kernel command line?
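
For a quick one-off test it is enough to append them for a single boot from the boot loader menu; after booting,

$ cat /proc/cmdline

confirms they are active.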
Comment 82 Ralph Gerstmann 2025-02-04 18:22:26 UTC
(In reply to Mario Limonciello (AMD) from comment #81)
> Can someone who can readily reproduce this please try with 'iommu.forcedac=1
> iommu.strict=1' on the kernel command line?

If I boot my system with these options it doesn't find the volume group (LVM) any more.
Comment 83 Mario Limonciello (AMD) 2025-02-04 19:09:39 UTC
How about if you try them just individually?
Comment 84 Ralph Gerstmann 2025-02-05 00:03:56 UTC
All tests with a fresh install of Linux Mint 22
(not 22.1 (anyway, the kernels are the same)) using btrfs,
(w)lan disabled:

Second try: iommu.forcedac=1 iommu.strict=1 -> vg (LVM) not found
First try:  iommu.forcedac=1 -> vg found -> errors
First try:  iommu.strict=1 -> vg found -> errors
Third try:  iommu.forcedac=1 iommu.strict=1 -> vg (LVM) not found
First try:  Above 4G Decoding (BIOS): Disabled -> vg found -> errors

-- 
¯\_(ツ)_/¯
Comment 85 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-06 15:18:29 UTC
(In reply to Ralph Gerstmann from comment #84)
> All tests with fresh install of linux mint 22

Doesn't that use a 6.8 kernel that is heavily patched? I'd say that is a really bad (or maybe even unsuitable?) choice for an upstream bug report like this.

Anyway, here are my results with Fedora 41 and a mainline snapshot from today build using the Fedora rawhide config:

iommu.forcedac=1 iommu.strict=1 -> does not boot, hangs in the initramfs waiting for a device (either the USB stick with the cryptsetup key or the NVMe SSD)
iommu.forcedac=1 -> same
iommu.strict=1 -> boots, but corruptions still occur
Comment 86 Stefan 2025-02-06 16:12:15 UTC
Hi,

after Matthias was so kind (kinder than me) as to make a video (!) for
ASRock support, and after I once again referred to this thread and the
many users who have the same problem, ASRock is now able to reproduce the
issues.

Ralph, all tests in comment #40 (including the network issue) were run
twice, because I did not collect logs and lspci outputs the first time.
(The corruptions seem to depend on which PCIe devices / lanes (?) are
used. That's why I also included the lspci outputs.)

(As announced in the initial message, I cannot run tests ATM and for a while.)

Regards Stefan


Comment 87 Mario Limonciello (AMD) 2025-02-06 16:23:53 UTC
OK, so if those parameters are not helping this is likely not related to lazy flush.

Another thing that would be useful to try to isolate is disabling TRIM support.  Some filesystems enable this by default and there are some systemd units out there that will manually run fstrim.
Comment 88 Prathyushi Nangia (AMD) 2025-02-06 16:26:30 UTC
Hello,

Can someone who can reproduce this issue please try disabling TRIM and re-running?
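
A sketch of what disabling TRIM typically involves on a systemd-based distro (details vary by filesystem and distribution):

$ sudo systemctl disable --now fstrim.timer   # stop the periodic fstrim runs
$ findmnt -o TARGET,OPTIONS /                 # check that 'discard' is not among the mount options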
Comment 89 Scharel 2025-02-06 17:16:33 UTC
I can confirm that TRIM does not trigger the issue.
My ZFS setup has autotrim off. Cron does it every two weeks or so.
The issue is easily reproducible by just writing ~50GB and then scrubbing.
Comment 90 Mario Limonciello (AMD) 2025-02-06 17:19:36 UTC
> I can confirm that TRIM does not trigger the issue.
> The issue is easy reproducable by just writing ~50GB and then scrubbing.

Sorry, but it sounds like you're contradicting yourself.  You say you can't trigger it, and you don't have TRIM enabled, but you find that you can trip it by using a manual trim command?

Can you please clarify?
Comment 91 Keith Busch 2025-02-06 17:21:35 UTC
(In reply to Scharel from comment #89)
> I can confirm that TRIM does not trigger the issue.
> My ZFS setup has autotrim off. Cron does it every two weeks or so.
> The issue is easy reproducable by just writing ~50GB and then scrubbing.

Having trouble parsing this. You've turned TRIM off, and there are no issues? But you could still reproduce it with a scrubbing?
Comment 92 Keith Busch 2025-02-06 17:23:42 UTC
(In reply to Keith Busch from comment #91)
> (In reply to Scharel from comment #89)
> > I can confirm that TRIM does not trigger the issue.
> > My ZFS setup has autotrim off. Cron does it every two weeks or so.
> > The issue is easy reproducable by just writing ~50GB and then scrubbing.
> 
> Having trouble parsing this. You've turned TRIM off, and there are no
> issues? But you could still reproduce it with a scrubbing?

On a re-read, I think you're saying that TRIM has nothing to do with the issue and it happens with or without it enabled. And if so, that frankly makes sense: TRIM just affects NAND stale page tracking, it has nothing to do with DMA.
Comment 93 Mario Limonciello (AMD) 2025-02-06 17:28:21 UTC
> And if so, that frankly makes sense: TRIM just affects NAND stale page
> tracking, it has nothing to do with DMA

I should probably add some more color to why Prathyushi and I were both asking about TRIM.  There have been reports in the past that TRIM request (specifically) was getting corrupted.  So we're looking to see if this is a similar issue.
Comment 94 Keith Busch 2025-02-06 17:37:59 UTC
(In reply to Mario Limonciello (AMD) from comment #93)
> > And if so, that frankly makes sense: TRIM just affects NAND stale page
> > tracking, it has nothing to do with DMA
> 
> I should probably add some more color to why Prathyushi and I were both
> asking about TRIM.  There have been reports in the past that TRIM request
> (specifically) was getting corrupted.  So we're looking to see if this is a
> similar issue.

By TRIM requests getting corrupted, I assume you mean the NVMe DSM list, host -> device DMA, is getting corrupted on the way? That could create these observations, but for it to be specific to a TRIM command? It shouldn't look any different than a write command's DMA payload, right?
Comment 95 Mario Limonciello (AMD) 2025-02-06 17:40:39 UTC
TRIM has a start and a range field, and in the case I'm talking about it was specifically the "start" that was getting corrupted.

> It shouldn't look any different than a write command's DMA payload, right?

Yeah, I would think the same way.  But 🤷.  At least I want to see if that's the case, because it could give us more hints at a repro on other hardware.
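
For context, the DSM payload is a list of (starting LBA, block count) ranges, so a corrupted start field would deallocate the wrong region. A rough sketch of issuing a single discard range from user space (device path and offsets are placeholders; this destroys data in the given range):

  # blkdiscard sends one (offset, length) discard request; on NVMe this
  # ends up as a DSM/Deallocate with a start LBA and a block count
  sudo blkdiscard --offset 1048576 --length 1048576 /dev/nvme0n1
  # (newer util-linux may require --force if a filesystem signature is found)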
Comment 96 Scharel 2025-02-06 19:05:52 UTC
Sorry that my comment was unclear.
What I wanted to say is that I can provoke the issue without running TRIM.
By scrubbing I meant "zfs scrub <pool>", which I use to detect the errors.
Errors also only seem to happen while writing data, not with data at rest.
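
For reference, a rough sketch of that reproduction flow ("tank" and the file path are placeholders):

  zpool get autotrim tank        # confirm autotrim is off
  dd if=/dev/urandom of=/tank/testfile bs=1M count=51200   # write ~50 GB
  zpool scrub tank               # re-read everything and verify checksums
  zpool status -v tank           # lists checksum errors and affected files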
Comment 97 Ralph Gerstmann 2025-02-07 19:18:27 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #85)
> (In reply to Ralph Gerstmann from comment #84)
> > All tests with fresh install of linux mint 22
> 
> Doesn't that use a 6.8 kernel that is heavily patched? I'd say that is a
> really bad (or maybe even unsuitable?) choice for an upstream bug report
> like this.
> 

afaik, the mint team does not patch kernels at all - they just follow Ubuntu kernels.

"Linux Mint 22 is based on Ubuntu 24.04 and ships with kernel 6.8. All subsequent point releases will follow the Hardware Enablement (HWE) kernel series, which improves support for newer devices."

$ uname -a
Linux mint 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun  7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comment 98 Mario Limonciello (AMD) 2025-02-07 19:20:58 UTC
> afaik, the mint team does not patch kernels at all - they just follow Ubuntu
> kernels.

Sure, but *Ubuntu kernels* are heavily patched.  They are not upstream kernels.  Discussions on issues with Ubuntu kernels should be brought to Launchpad.
Kernel Bugzilla is for discussion on upstream kernels.
Comment 99 Ralph Gerstmann 2025-02-07 19:26:52 UTC
(In reply to Mario Limonciello (AMD) from comment #98)
> > afaik, the mint team does not patch kernels at all - they just follow
> > Ubuntu kernels.
> 
> Sure, but *Ubuntu kernels* are heavily patched.  They are not upstream
> kernels.  Discussions on issues with Ubuntu kernels should be brought to
> Launchpad.
> Kernel Bugzilla is for discussion on upstream kernels.

This bug was reproduced by others with upstream kernels, too.

OK, then I will take the short way, live with slot 2, and bring my system back to production, which means I cannot run any tests anymore.

Agree?
Comment 100 Mario Limonciello (AMD) 2025-02-07 19:34:51 UTC
> This bug was reproduced by others with upstream kernels, too.

Right; at this point everything is data as we don't have a specific commit, change or firmware that seems to be causing it.

> Agree?

Totally up to you what to do with your system.  Since the workaround mentioned here of disabling the IOMMU avoids it, you might do that for now.
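
For anyone who wants to try that workaround, a sketch of the usual way on a GRUB-based distro (amd_iommu=off is the documented kernel parameter; file locations and the regeneration command vary by distro):

  # append amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
  # then regenerate the GRUB config and reboot
  sudo update-grub               # or: grub2-mkconfig -o /boot/grub2/grub.cfg
  cat /proc/cmdline              # after the reboot, verify the parameter is set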
Comment 101 Ralph Gerstmann 2025-02-07 19:50:18 UTC
(In reply to Mario Limonciello (AMD) from comment #100)
> > This bug was reproduced by others with upstream kernels, too.
> 
> Right; at this point everything is data as we don't have a specific commit,
> change or firmware that seems to be causing it.

The problem is limited to the 8X00G in combination with X600 boards,
which smells Knoll-related rather than Linux-related.

> 
> > Agree?
> 
> Totally up to you what to do with your system.  Since the workaround
> mentioned here of disabling the IOMMU avoids it, you might do that for now.

Why should I disable the IOMMU?
The workaround, if you have a PCIe 4 NVMe, is:
populate slot 2 before you populate slot 1.
Comment 102 Ralph Gerstmann 2025-02-07 19:58:59 UTC
(In reply to Ralph Gerstmann from comment #101)

> The workaround, if you have a PCIe 4 NVMe, is:
> populate slot 2 before you populate slot 1.

To be more precise:

The workaround, if you have a PCIe 4 NVMe and don't run a RAID, is:
populate slot 2 before you populate slot 1.
Comment 103 Stefan 2025-02-07 20:25:58 UTC
Hi,

On 07.02.25 at 20:34, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219609
>
> --- Comment #100 from Mario Limonciello (AMD) ---
>> This bug was reproduced by others with upstream kernels, too.

I can confirm that.

It is not very likely that an Ubuntu patch causes another bug with exactly
the same symptoms ...

> Right; at this point everything is data as we don't have a specific commit,
> change or firmware that seems to be causing it.

We have two *upstream kernel* commits that trigger the corruptions: both
of these commits change the transfer size.

We have a specific firmware that introduces the corruptions: the initial
one.

We have a specific hardware combination that is causing the issues:
ASRock DeskMini X600 + AMD Ryzen 8000 series. (It seems that the bug is
limited to that CPU series, while it has not yet been tested whether
other X600 / Knoll systems are affected too. But meanwhile ASRock is
able to reproduce the corruptions.)

Regards Stefan


>
>> Agree?
>
> Totally up to you what to do with your system.  Since the workaround
> mentioned here of disabling the IOMMU avoids it, you might do that for now.
>
Comment 104 Ralph Gerstmann 2025-02-07 21:06:46 UTC
(In reply to Stefan from comment #103)

> We have a specific hardware combination that is causing the issues:
> ASRock DeskMini X600 + AMD Ryzen 8000 series. (It seems that the bug is
> limited to that CPU series, while it has not yet been tested whether
> other X600 / Knoll systems are affected too. But meanwhile ASRock is
> able to reproduce the corruptions.)
> 

AFAIK there exist only four different X600 systems, all from ASRock.
AFAIK only the DeskMini X600 has PCIe 5 capability in slot 1.
...
Comment 105 Stefan 2025-02-07 21:45:58 UTC
Hi,

On 07.02.25 at 22:06, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219609
>
> --- Comment #104 from Ralph Gerstmann ---
> AFAIK there exist only four different X600 systems, all from ASRock.
> AFAIK only the DeskMini X600 has PCIe 5 capability in slot 1.
> ...

it has nothing to do with the PCIe version. I have a Gen4 SSD, and
enforcing Gen3 via the BIOS has no effect.

Regards Stefan
Comment 106 Ralph Gerstmann 2025-02-07 21:52:32 UTC
(In reply to Stefan from comment #105)
> 
> it has nothing to do with the PCIe version. I have a Gen4 SSD, and
> enforcing Gen3 via the BIOS has no effect.

Hi,

My thoughts were not about the capabilities of the inserted device or the BIOS setup.
My thoughts are simply about the capabilities of the slot, because that is where the obvious difference lies.

Regards, Ralph
Comment 107 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-10 07:46:51 UTC
(In reply to Mario Limonciello (AMD) from comment #93)
> There have been reports in the past that TRIM request
> (specifically) was getting corrupted.  So we're looking to see if this is a
> similar issue.

Disabling trim did not change anything for me: the corruptions still occurred.
Comment 108 Klaus 2025-02-15 15:19:11 UTC
(In reply to Stefan from comment #103)
> Hi,
> 
> On 07.02.25 at 20:34, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=219609
> >
> > --- Comment #100 from Mario Limonciello (AMD) ---
> >> This bug was reproduced by others with upstream kernels, too.
> 
> I can confirm that.
> 
> It is not very likely that an Ubuntu patch causes another bug with exactly
> the same symptoms ...
> 

Just for the record - there are issues on other Linux distributions as well. I faced arbitrary reboots with Debian 12 as well as with the latest Manjaro (I guess it is kernel 6.12). The CPU is an 8500G. I've seen no file or filesystem corruptions, but even freshly installed systems reboot after anywhere from a couple of minutes up to a few hours.
No error messages, no core dumps ... nothing



> > Totally up to you what to do with your system.  Since the workaround
> > mentioned here of disabling the IOMMU avoids it, you might do that for now.
> 

Unfortunately disabling the IOMMU did not change anything, but moving the SSD to the lower socket solved (or rather, worked around) the problem.
Comment 109 Stefan 2025-02-19 10:55:05 UTC
Hi,

here is a link to a new BIOS version from ASRock: http://www.simg.de/X600M-STX_4.10.zip (Cannot attach this due to the size limit. The file will be removed in a few months.)

I cannot test this ATM (as announced in December).

Maybe someone wants to try this.

Regards Stefan
Comment 110 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-19 11:05:54 UTC
(In reply to Stefan from comment #109)
> here is a link to a new BIOS version from ASRock:

Thx. Is this just a new version that might change things for us, or is this supposed to contain a fix for our problem?
Comment 111 Stefan 2025-02-19 11:11:44 UTC
Hi,

with that firmware ASRock can't reproduce the corruptions anymore. 

Regards Stefan
Comment 112 Stefan 2025-02-19 11:16:02 UTC
Just for clarification: ASRock sent me that file and asked me to test it (which is not possible ATM) and allowed me to share it.
Comment 113 Alex Kovacs 2025-02-19 13:55:29 UTC
Has anyone seen this issue with the V2.01 BIOS? All I see mentioned are updated BIOS versions.
Comment 114 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-19 14:00:13 UTC
(In reply to Stefan from comment #109)
> here is a link to a new BIOS version from ASRock:
> http://www.simg.de/X600M-STX_4.10.zip 

From a quick test it seems like this is fixing the problem for me.
Comment 115 Alex Kovacs 2025-02-19 14:01:51 UTC
(In reply to Alex Kovacs from comment #113)
> Has anyone seen this issue with the V2.01 BIOS? All I see mentioned are
> updated BIOS versions.

Sorry, I did not see Stefan's comment about seeing this on all BIOS versions before posting my question.
Comment 116 Alex Kovacs 2025-02-19 14:16:39 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114)
> (In reply to Stefan from comment #109)
> > here is a link to a new BIOS version from ASRock:
> > http://www.simg.de/X600M-STX_4.10.zip 
> 
> From a quick test it seems like this is fixing the problem for me.

Does this require FW version 240522 to be installed first?
Comment 117 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-19 14:21:56 UTC
Created attachment 307686 [details]
dmesg from before and after the bios update

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114)
> From a quick test it seems like this is fixing the problem for me.

TWIMC, here is the dmesg from booting with the old and the new BIOS. The old one might have used slightly different BIOS Setup settings, can't recall, sorry. 

There are a few new lines, like:

ACPI: BGRT 0x000000008D5ED000 000038 (v01 ALASKA A M I    00000001 AMI  00010013)
ACPI: WPBT 0x000000008CDC4000 000036 (v01 ALASKA A M I    00000001 MSFT 00010013)
ACPI: Reserving BGRT table memory at [mem 0x8d5ed000-0x8d5ed037]
ACPI: Reserving SSDT table memory at [mem 0x8cdc3000-0x8cdc3cdd]

And it seems there is an additional PCIe device. Wondering if that is due to the new BIOS or some setting differences in the BIOS Setup.

/me shrugs and stops investigating, as nobody might care anyway
Comment 118 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-19 14:40:36 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #114)
> From a quick test it seems like this is fixing the problem for me.

Someone from c't magazine (which had an article about building a system with the DeskMini X600, which at least for me was the reason why I bought it) also confirmed that the new BIOS seems to fix this.

(In reply to Stefan from comment #112)
> ASRock sent me that file and asked me to test it

BTW, can you please ask them what they changed (they might not answer, but it's worth asking… :-D )

Ohh, and many thx for your work with this.

(In reply to Alex Kovacs from comment #116)
> Does this require FW version 240522 to be installed first?

I find https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS somewhat confusing, but it sounds like you need to install that SIO FW first, unless you already have it.
Comment 119 Bruno Gravato 2025-02-19 15:09:34 UTC
> BTW, can you please ask them what they changed (they might not answer, but
> it's worth asking… :-D )

One thing that irritates me quite a bit about ASRock is that they
never share any changelogs of what they change in each new BIOS
update...
It's really annoying. How can someone make a decision on whether to
update the BIOS or not without having a clue what changed?

I've found a few threads on the sff.network forum by users trying to
figure out what changed and whether it's a good idea to upgrade or
not... This is true for both the DeskMini X300 and X600. It's not
uncommon to see comments saying you should _not_ upgrade to version X
or Y, because a certain feature was removed or performance declined...

> I find
> https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS
> somewhat confusing, but it sounds like you need to install that SIO FW
> first, unless you already have it.

It is indeed confusing, because the BIOS update says "Before updating
BIOS 2.01, please update SIO firmware" and the SIO update says
"Requires BIOS 2.01 or later version".
So which one do you do first?
IIRC, before I updated from the original BIOS version to 4.03, I think
I did the SIO update first and it all went well.

> And it seems there is an additional PCIe device. Wondering if that is due to
> the new BIOS or some setting differences in the BIOS Setup.
>
> /me shrugs and stops investigating, as nobody might care anyway

I currently have all my spare nvme disks in use, so unfortunately I
can't test the new BIOS, but I'm very interested in all the
differences you may find.

Regarding BIOS settings, I guess it's now too late for you, but for
others, I suggest saving your current settings to a USB pen before
updating.
One thing I've learned from the past is that updating the BIOS firmware
on the DeskMini will usually reset all the settings and also wipe any
saved profiles, so you really need to save them to a USB pen if you
want to recover them after the update.
Comment 120 Stefan 2025-02-19 16:24:50 UTC
Hi,

On 19.02.25 at 15:21, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219609
>
> --- Comment #117 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) ---
> There are a few new lines, like:
>
> ACPI: BGRT 0x000000008D5ED000 000038 (v01 ALASKA A M I    00000001 AMI  00010013)
> ACPI: WPBT 0x000000008CDC4000 000036 (v01 ALASKA A M I    00000001 MSFT 00010013)
> ACPI: Reserving BGRT table memory at [mem 0x8d5ed000-0x8d5ed037]
> ACPI: Reserving SSDT table memory at [mem 0x8cdc3000-0x8cdc3cdd]
>
> And it seems there is an additional PCIe device. Wondering if that is due to
> the new BIOS or some setting differences in the BIOS Setup.
>
> /me shrugs and stops investigating, as nobody might care anyway

that may be relevant, and I would like to clarify this before I forward
your questions and thanks. Can you share your lspci output and/or
compare it with the output I created (see the attachments at the
beginning of the bug tracker page)?

Reason: Whether the corruptions appear seems to depend on which PCI
devices are present (2nd M.2 SSD; in my case the corruptions disappear
if I disable the network in the BIOS).

Thus, if there is a new PCI device, that may be the reason why the
corruptions go away. But the underlying problem may not be resolved.
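
One way to do such a comparison, as a rough sketch (file names are placeholders):

  lspci -nn | sort > lspci-bios-new.txt      # run once per BIOS version
  diff -u lspci-bios-old.txt lspci-bios-new.txt
  # the ACPI table differences can be compared the same way in dmesg:
  diff -u <(grep 'ACPI:' dmesg-old.txt) <(grep 'ACPI:' dmesg-new.txt)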


> --- Comment #119 from Bruno Gravato ---
> It is indeed confusing, because the BIOS update says "Before updating
> BIOS 2.01, please update SIO firmware" and the SIO update says
> "Requires BIOS 2.01 or later version". So which one do you do first?
> IIRC, before I updated from the original BIOS version to 4.03, I think
> I did the SIO update first and it all went well.

If you have 4.03 you do not need to care about SIO firmware.

AFAIR, my board came with Firmware 1.43. I first updated SIO firmware,
then 2.01 and then 4.03 (and later 4.08)

Regards Stefan
Comment 121 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-02-19 16:54:05 UTC
(In reply to Stefan from comment #120)
> On 19.02.25 at 15:21, bugzilla-daemon@kernel.org wrote:
> > And it seems there is an additional PCIe device.

s/an/two/

> > Wondering if that is due to
> > the new BIOS or some setting differences in the BIOS Setup.
> > /me shrugs and stops investigating, as nobody might care anyway

/me wonders if downgrading the BIOS is worth it (if possible!), but decides for now that it is not.
 
> that may be relevant and I would like to clarify this before I forward
> your questions and thanks. Can you share your lspci output and/or
> compare it with the output I created (see attachments at begin of the
> bug tracker page)

All your lspci logs lack the "SATA controller [0106]: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller [1b21:0612] (rev 02)", which is one of the two new PCI devices after the BIOS update (as can be seen in the logs I uploaded). I doubt I disabled the chip in the BIOS Setup, but it's possible that I did and forgot about it. #Sigh :-/
Comment 122 Bruno Gravato 2025-02-19 18:27:29 UTC
> If you have 4.03 you do not need to care about SIO firmware.
>
> AFAIR, my board came with Firmware 1.43. I first updated SIO firmware,
> then 2.01 and then 4.03 (and later 4.08)

I think mine came with 1.43 as well. I updated the SIO, then BIOS to
4.03. And later on to 4.08 when it came out.
I think all of this was before I found out about the corruption issue.

Anyway, what I mainly wanted to point out is that all BIOS settings
get reset, including any saved profiles, when upgrading the BIOS
firmware... the only way to preserve any settings is to save them to
a USB pen and restore them after the upgrade.
Comment 123 Ralph Gerstmann 2025-02-20 00:55:24 UTC
Issue seems fixed with 4.10.
I will verify tomorrow.
Where is the change log from ASRock?
Comment 124 Keith Busch 2025-02-20 01:03:31 UTC
(In reply to Ralph Gerstmann from comment #123)
> Where is the change log from ASRock?

I doubt they'd publish any interesting details on what was changed. At best, they might provide "Release Notes" with the update using a vaguely worded description like "Fixed various bugs".
Comment 125 Ralph Gerstmann 2025-02-21 14:06:32 UTC
ASRock Support answered my question:

___
Sorry, I do not get any change log or closer information what was changed/fixed with this BIOS.
The only information is, that we redefined the unused CPU PCIE lanes on BIOS 4.10.
___
Comment 126 Stefan 2025-02-21 14:09:46 UTC
Hi,

On 20.02.25 at 02:03, bugzilla-daemon@kernel.org wrote:
> --- Comment #124 from Keith Busch (kbusch@kernel.org) ---
> (In reply to Ralph Gerstmann from comment #123)
>> Where is the change log from ASRock?
>
> I doubt they'd publish any interesting details on what was changed. At best,
> they might provide "Release Notes" with the update using a vaguely worded
> description like "Fixed various bugs".

according to ASRock support, they "redefined the unused CPU PCIE lanes
on BIOS 4.10." They cannot provide further information.

Regards Stefan
Comment 127 Ralph Gerstmann 2025-02-21 15:27:22 UTC
I will test if the issue now moved to the other slot...
Comment 128 Bruno Gravato 2025-02-22 23:45:21 UTC
I got my hands on a spare nvme disk (WD SN850X 1TB) and ran some tests.

TL;DR version: BIOS firmware 4.10 seems to prevent the corruption.

Now for the details...

Test 1:
- BIOS firmware 4.08
- M.2 slots - main: WD SN850X / secondary: empty
- installed Debian 12 on btrfs, upgraded kernel to backports (6.12.9), rebooted
- copied about 500k files / 100GB (source was a SATA disk installed on
the machine as in my previous tests)
- running btrfs scrub detects corrupted files on the nvme disk, as expected
- deleted files and ran fstrim

BIOS upgrade:
- booted into BIOS
- made a backup of my config to an USB pen
- upgraded to BIOS firmware 4.10
- restored my BIOS settings from USB

Test 2:
- BIOS firmware 4.10
- M.2 slots - main: WD SN850X / secondary: empty
- copied again same 500k files / 100GB from the SATA disk
- btrfs scrub returned no corruptions
- deleted files and ran fstrim

Test 3:
- same as test 2 except I swapped the disk from main to secondary M.2 slot
- same result

Test 4:
- put another disk in, so both nvme M.2 slots occupied
- still no corrupted files

So BIOS firmware 4.10 seems to have solved the problem.


Bruno
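
For anyone repeating this kind of test, a rough sketch of the write-and-verify step (mount points are placeholders):

  cp -a /mnt/sata-source/. /mnt/nvme-test/   # copy the test data set
  sync                                       # make sure it is flushed to disk
  sudo btrfs scrub start -B /mnt/nvme-test   # -B: run in the foreground
  sudo btrfs scrub status /mnt/nvme-test     # reports checksum/read error counts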
Comment 129 Mathieu Borderé 2025-02-26 12:14:35 UTC
Where are you getting this mythical 4.10 BIOS? I don't see it on https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS
Comment 130 Bruno Gravato 2025-02-26 12:58:48 UTC
> --- Comment #129 from Mathieu Borderé ---
> Where are you getting this mythical 4.10 BIOS? I don't see it on
> https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.asp#BIOS

It's not an official release. Check comment #109 for the link.
Comment 131 mbe 2025-03-23 21:26:46 UTC
The BIOS version 4.10 is now available on the official ASRock support page:
https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS
Comment 132 Bruno Gravato 2025-03-23 21:56:59 UTC
> --- Comment #131 from mbe ---
> The BIOS version 4.10 is now available on the official ASRock support page:
>
> https://www.asrock.com/nettop/AMD/DeskMini%20X600%20Series/index.de.asp?cat=#BIOS

I just noticed that today and I was going to post about it here, but
you beat me to it.

Anyway, just to add that I checksummed it and compared it to the version
that was posted here a few weeks ago, and it's the exact same version,
so for those who already upgraded to 4.10, there is no need to "upgrade"
again.
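
For anyone who wants to verify the same thing, comparing checksums is enough (file names are placeholders):

  # identical hashes mean the two files are byte-for-byte identical
  sha256sum X600M-STX_4.10-from-comment-109.zip X600M-STX_4.10-official.zip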
Comment 133 The Linux kernel's regression tracker (Thorsten Leemhuis) 2025-03-25 08:11:38 UTC
Thx to everyone who helped with this, much appreciated!

If there is anyone in contact with ASRock, please consider asking them to distribute the update through LVFS (hughsie brought that up in the Fediverse, and I think it would be a great idea: https://mastodon.social/@hughsie/114221918449126392 )

P.S.: In an ideal world, where that is not possible, we'd have some daemon yelling "your data is in danger" at people who run the old BIOS…
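
For reference, if ASRock did publish through LVFS, applying the update would be the standard fwupd flow:

  fwupdmgr refresh       # fetch the latest firmware metadata from LVFS
  fwupdmgr get-updates   # list devices with pending firmware updates
  fwupdmgr update        # download and stage the update; reboot to apply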
Comment 134 Mathieu Borderé 2025-04-04 07:21:16 UTC
Maybe this is no longer the place to post this, but since installing 4.10 my computer once rebooted spontaneously and didn't detect my 2 nvme drives in the BIOS anymore. This happened after writing a couple of gigabytes to the drive in the secondary slot. Restarting solved it. A bit harsh to blame 4.10, but I went back to 4.08. Posting this just in case anyone else experiences the same issue.
Comment 135 Bruno Gravato 2025-04-04 11:12:53 UTC
On Fri, 4 Apr 2025 at 08:21, <bugzilla-daemon@kernel.org> wrote:
> --- Comment #134 from Mathieu Borderé ---
> Maybe this is no longer the place to post this, but since installing 4.10 my
> computer once rebooted spontaneously and didn't detect my 2 nvme drives in
> the BIOS anymore. This happened after writing a couple of gigabytes to the
> drive in the secondary slot. Restarting solved it. A bit harsh to blame
> 4.10, but I went back to 4.08. Posting this just in case anyone else
> experiences the same issue.

I don't think that is related to 4.10.

I had that (spontaneous reboot) happen to me once or twice before,
when I was using firmware 4.08.

I've also experienced some issues with the amdgpu driver crashing
sometimes, plus some "glitches" in the graphics occasionally (like a
quick screen "flicker" or random pixels "flashing").

When amdgpu crashes, sometimes X freezes or even the whole system
freezes. Other times amdgpu restarts successfully and X stays alive.
This doesn't happen very often, but when it happens I get a bunch of
amdgpu errors in the logs. I think it got worse when I upgraded from
kernel 6.12.9 to 6.12.12, and it was much worse with previous kernels
(6.11.xx and earlier), but I have no way of reproducing it
consistently, and it happens too rarely (once or twice a month) to
reach any conclusion.

As for the random pixels flashing or the screen flickering, I don't
get any errors in the logs, so I can't rule out the possibility that
it is a monitor issue.

Check your system logs from before the reboot and see if there's any
relevant message, especially related to amdgpu.
Which kernel and AMD firmware versions are you using?

Bruno
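
A rough sketch of pulling the logs from the boot before a spontaneous reboot on a systemd system (the grep pattern is just an example; this requires persistent journaling):

  journalctl -b -1 -k | grep -iE 'amdgpu|nvme|error'   # kernel log of the previous boot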
Comment 136 Mathieu Borderé 2025-04-04 12:14:33 UTC
The log was clean. I used to have major issues with amdgpu crashing and taking the desktop environment down with it; that turned out to be a faulty CPU, and a CPU replacement fixed those crashes.
