Bug 219480 - PCIE - NVME total loss of storage: Unable to change power state from D3cold to D0
Summary: PCIE - NVME total loss of storage: Unable to change power state from D3cold...
Status: NEW
Alias: None
Product: Power Management
Classification: Unclassified
Component: Other
Hardware: Intel
OS: Linux
Importance: P3 high
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-07 11:55 UTC by Nicolas BARANGER
Modified: 2025-01-05 18:07 UTC
CC List: 2 users

See Also:
Kernel Version: 6.12-rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Nicolas BARANGER 2024-11-07 11:55:18 UTC
Dear maintainer,

Here is a new bug I have encountered since Linux 6.1; I am able to reproduce it on every kernel from Linux 6.1 up to Linux 6.12-rc6 (mainline).

I have a server that uses a Broadcom PLX PEX8748 PCIe switch to connect its storage.
U.2 NVMe drives are connected to the PEX8748.

After 3 years of flawless operation in production, since upgrading to Linux 6.1 the PEX8748 completely loses the downstream NVMe drives connected to its child PCIe ports.

Topology:

PCIe root port <---- PEX8748 <---- up to 8 NVMe drives

Note that PCIe ASPM has been completely disabled in the UEFI BIOS for the last 3 years, and I never had any issue before upgrading to Linux 6.1.

Looking at the kernel log shows that the system puts the NVMe drives into the D3cold state and is never able to bring them back from it (even when trying to reset the PCIe bus and to manually rebind the pcieport and nvme drivers).

The recommended kernel options have no effect: "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off".
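For reference, the usual way to apply these options on a Debian-style system with GRUB (a sketch; file and variable names may differ on other distributions):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

  # regenerate the GRUB configuration, reboot and verify:
  update-grub
  cat /proc/cmdline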

Since Linux 6.1 the UEFI BIOS ASPM settings are also ignored, and the system performs power management on the PCIe devices where it must not.
The only way to recover the storage is to reboot the machine.
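Whether ASPM is really disabled on the links can be cross-checked with lspci, for example (a sketch, using the downstream-port and NVMe addresses of this server):

  # LnkCtl reports "ASPM Disabled" when the link does not use ASPM
  lspci -vv -s 0000:03:10.0 | grep -i aspm
  lspci -vv -s 0000:08:00.0 | grep -i aspm
  # kernel-wide ASPM policy
  cat /sys/module/pcie_aspm/parameters/policy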

Kernel logs:

kernel: nvme 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:0a:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible

This breaks the system completely: as you can see, MD has lost all 4 NVMe drives ("[____]") of the RAID5 array:

grep -A1 md127 /proc/mdstat
 md127 : active raid5 nvme1n1p1[1](F) nvme3n1p1[4](F) nvme2n1p1[2](F) nvme0n1p1[0](F)
      12830758912 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/0] [____]


Looking at sysfs, I saw that d3cold_allowed is set to 1 for the NVMe drives and for the PEX8748 downstream ports!
It seems that Windows Server keeps NVMe devices in the D0 state during ASPM or suspend; why doesn't Linux do the same when PCIe ASPM is disabled?
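For the record, a simple way to dump that policy from sysfs (a sketch; d3cold_allowed and power/control are the standard PCI power-management attributes):

  # Show the D3cold policy and runtime-PM setting of every PCI device
  for d in /sys/bus/pci/devices/*; do
      printf '%s  d3cold_allowed=%s  power/control=%s\n' \
          "${d##*/}" \
          "$(cat "$d/d3cold_allowed" 2>/dev/null)" \
          "$(cat "$d/power/control" 2>/dev/null)"
  done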


Other output I captured while reproducing on a DELL laptop (with ASPM enabled) and on the server (with ASPM disabled):

# NVMe drive never comes back to the D0 state (never wakes up again)
# Error:
 nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
 nvme nvme3: Does your device have a faulty power saving mode enabled?
 nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
 nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
 nvme nvme3: Disabling device after reset failure: -19

# After this failure, it becomes impossible to wake up the NVMe drives without rebooting

# Example (NOT FROM THIS SERVER - ON A LAPTOP with PCIe ASPM enabled):
  echo "0000:08:00.0" > /sys/bus/pci/drivers/nvme/unbind
  echo "0000:02:00.0" > /sys/bus/pci/drivers/pcieport/unbind
  setpci -s 08:00.0 CAP_PM+4.b=03 # D3hot
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
  #debian cmd# acpiexec -b "execute \_SB.PC00.PEG0.PXP._OFF"
  #debian cmd# acpiexec -b "execute \_SB.PC00.PEG0.PXP._ON"
  setpci -s 08:00.0 CAP_PM+4.b=0 # D0
  echo "0000:02:00.0" > /sys/bus/pci/drivers/pcieport/bind
  echo "0000:08:00.0" > /sys/bus/pci/drivers/nvme/bind
  
# NVMe never wakes up here:
 nvme 0000:08:00.0: not ready 65535ms after bus reset; giving up
 nvme 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
 nvme 0000:08:00.0: timed out waiting for pending transaction; performing function level reset anyway
 nvme 0000:08:00.0: not ready 1023ms after FLR; waiting
 nvme 0000:08:00.0: not ready 2047ms after FLR; waiting
 ...
 nvme 0000:08:00.0: not ready 65535ms after FLR; giving up 
 ...
 nvme 0000:08:00.0: not ready 32767ms after bus reset; waiting
 nvme 0000:08:00.0: not ready 65535ms after bus reset; giving up






While waiting for a fix in the kernel, I wrote the following dirty workaround to save production (a total loss of storage every morning on a production server is not acceptable):

Here is my workaround:
1) Identify the PCIe bus and the PCIe downstream ports (if a PCIe switch such as a PLX card is connected to the PCIe root port), and identify the PCIe bus addresses of the bound NVMe drives (see the sketch just after this list).
2) Write a script that disables the D3cold capability for the PCIe downstream ports (the PLX card child ports, if any).
3) Extend this script to disable the D3cold capability on the NVMe drives attached to the PCIe child ports.
4) Execute the script at boot (see the unit sketch after the script).
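For step 1, the addresses can be gathered for example like this (a sketch, using nvme0 as an example; the path shown is the one used in the script below):

  # Tree view of the PCIe hierarchy: root port -> PLX switch -> NVMe endpoints
  lspci -tvnn

  # Full sysfs path of one NVMe controller; the next-to-last component is the
  # downstream port of the switch it is attached to
  readlink -f /sys/class/nvme/nvme0/device
  # -> /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/0000:08:00.0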





##############################################
#!/bin/bash
#
#  
# 20241104
# Workaround for a power management issue:
# NVMe drives go to D3cold and never come back to D0
#
sleep 60 # delay at boot before touching the PCI sysfs attributes
echo "------------------------------------------------------"
echo "DISABLE D3COLD ON NVME U2"
echo "------------------------------------------------------"


_PLX_BUS='/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0'

_NVME7_PORT='0000:03:08.0' # UNUSED - FREE PORT
_NVME6_PORT='0000:03:09.0' # UNUSED - FREE PORT
_NVME4_PORT='0000:03:0a.0' # UNUSED - FREE PORT
_NVME5_PORT='0000:03:0b.0' # UNUSED - FREE PORT

_NVME0_PORT='0000:03:10.0'
_NVME1_PORT='0000:03:11.0'
_NVME2_PORT='0000:03:12.0'
_NVME3_PORT='0000:03:13.0'


_NVME7_DEV='0000:04:00.0' # NOT BOUND
_NVME6_DEV='0000:05:00.0' # NOT BOUND
_NVME4_DEV='0000:06:00.0' # NOT BOUND
_NVME5_DEV='0000:07:00.0' # NOT BOUND

_NVME0_DEV='0000:08:00.0'
_NVME1_DEV='0000:09:00.0'
_NVME2_DEV='0000:0a:00.0'
_NVME3_DEV='0000:0b:00.0'

# First, disable D3cold on the populated NVMe PCIe ports.
# For the moment, we only consider the 4 populated NVMe ports.
# num_port => 4

	echo "DISABLE D3COLD ON PLX CHILD PCIE PORT:"
	echo 0 >"$_PLX_BUS/$_NVME0_PORT/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME1_PORT/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME2_PORT/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME3_PORT/d3cold_allowed"

# Control

	echo "$_PLX_BUS/$_NVME0_PORT/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME0_PORT/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME1_PORT/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME1_PORT/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME2_PORT/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME2_PORT/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME3_PORT/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME3_PORT/d3cold_allowed"

# Second, disable D3cold on the NVMe drives connected to those ports.
# For the moment, we only consider the 4 connected NVMe drives.
# num_dev => 4

	echo "DISABLE D3COLD ON NVME DRIVE ATTACHED TO PLX CHILD PCIE PORT:"
	echo 0 >"$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed"
	echo 0 >"$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed"

# Control

	echo "$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed"
	echo "$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed:"
	cat "$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed"


#
echo -e "\n---------------------------------------------"
echo "SUCESSFULLY DISABLED D3COLD ON NVME U2"
echo "---------------------------------------------"

#######################################################
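One way to run the script at boot (step 4) is a small systemd oneshot unit along these lines (a sketch; the unit name and the script path /usr/local/sbin/disable-nvme-d3cold.sh are only examples):

  # /etc/systemd/system/disable-nvme-d3cold.service
  [Unit]
  Description=Disable D3cold on the PLX downstream ports and attached NVMe drives

  [Service]
  Type=oneshot
  ExecStart=/usr/local/sbin/disable-nvme-d3cold.sh

  [Install]
  WantedBy=multi-user.target

  # enable it once with:
  #   systemctl daemon-reload && systemctl enable disable-nvme-d3cold.service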


I hope a fix can be written. Everything worked fine for 3 years with Linux 5.x and 6.0.x; the issue first happened just after upgrading to Linux 6.1.20, and I can reproduce it on Linux 6.10.11 and also on mainline (6.12-rc6).

Thanks for your help
Kind regards
Nicolas BARANGER
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-11-09 09:19:30 UTC
This can be caused by changes in various areas of the kernel, hence please bisect which change broke things. Here are two guides:

https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 2 Nicolas BARANGER 2024-12-11 09:43:09 UTC
Dear maintainer,

Thanks for the answer.

The bisection job is still in progress, but I think this issue may be related to Intel IOMMU / DMAR: after a new failure for the same reason (using the Debian kernel), I caught the following at reboot:

kernel: DMAR: ERROR: DMA PTE for vPFN 0xbdb61 already set (to bdb61003 not 187647003)
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 6 PID: 1119 at drivers/iommu/intel/iommu.c:1920 __domain_mapping+0x375/0x4a0
kernel: Modules linked in: nfsd(+) auth_rpcgss msr nfs_acl parport_pc lockd grace ppdev lp parport loop efi_pstore configfs sunrpc ip_tables x_tables autofs4 btrfs blake2b_g>
kernel: CPU: 6 UID: 0 PID: 1119 Comm: modprobe Tainted: G        W          6.11.5+bpo-amd64 #1  Debian 6.11.5-1~bpo12+1
kernel: Tainted: [W]=WARN
kernel: Hardware name: ASUS All Series/X99-WS/IPMI, BIOS 4001 05/28/2019
kernel: RIP: 0010:__domain_mapping+0x375/0x4a0
kernel: Code: ff 48 8b 34 24 48 89 e9 48 89 c2 48 c7 c7 c8 2e 55 b0 e8 3e b2 87 ff 8b 05 48 84 4e 01 85 c0 74 09 83 e8 01 89 05 3b 84 4e 01 <0f> 0b e9 a8 fe ff ff 40 80 e5 7>
kernel: RSP: 0018:ffffb727cd1733f8 EFLAGS: 00010206
kernel: RAX: 0000000000000003 RBX: 0000000000000001 RCX: 0000000000000027
kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff890cfed21780
kernel: RBP: 0000000187647003 R08: 0000000000000000 R09: 0000000000000003
kernel: R10: ffffb727cd1731d8 R11: ffff890dfff15f68 R12: ffff88cf435ebb08
kernel: R13: ffff88d0c01a9a00 R14: ffff88cf435ebb00 R15: 000000000000001f
kernel: FS:  00007fc4cc356040(0000) GS:ffff890cfed00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fc4cb13a000 CR3: 000000010786e002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? __warn+0x80/0x120
kernel:  ? __domain_mapping+0x375/0x4a0
kernel:  ? report_bug+0x164/0x190
kernel:  ? handle_bug+0x41/0x70
kernel:  ? exc_invalid_op+0x17/0x70
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? __domain_mapping+0x375/0x4a0
kernel:  ? __domain_mapping+0x362/0x4a0
kernel:  ? __call_rcu_common.constprop.0+0xb9/0x660
kernel:  ? __domain_mapping+0x11e/0x4a0
kernel:  intel_iommu_map_pages+0xbb/0x100
kernel:  __iommu_map+0x156/0x280
kernel:  iommu_map_sg+0xc2/0x1c0
kernel:  iommu_dma_map_sg+0x238/0x420
kernel:  __dma_map_sg_attrs+0x29/0x90
kernel:  dma_map_sgtable+0x1d/0x30
kernel:  nvme_prep_rq.part.0+0xe1/0x800 [nvme]
kernel:  nvme_queue_rqs+0xa6/0x270 [nvme]
kernel:  blk_mq_flush_plug_list.part.0+0x473/0x600
kernel:  ? __blk_mq_alloc_requests+0x2ee/0x4f0
kernel:  blk_add_rq_to_plug+0xa4/0x140
kernel:  blk_mq_submit_bio+0x55a/0x700
kernel:  __submit_bio+0x9c/0x1f0
kernel:  submit_bio_noacct_nocheck+0x19a/0x3c0
kernel:  ? submit_bio_noacct+0x47/0x590
kernel:  btrfs_submit_chunk+0x193/0x660 [btrfs]
systemd[1]: Started systemd-journald.service - Journal Service.
kernel:  ? btrfs_do_readpage+0x2e4/0x6f0 [btrfs]
kernel:  btrfs_submit_bio+0x1a/0x30 [btrfs]
kernel:  submit_one_bio+0x36/0x50 [btrfs]
kernel:  btrfs_readahead+0x356/0x390 [btrfs]
kernel:  ? __pfx_end_bbio_data_read+0x10/0x10 [btrfs]
kernel:  read_pages+0x6a/0x240
kernel:  page_cache_ra_unbounded+0x105/0x180
kernel:  filemap_get_pages+0x4c4/0x620
kernel:  ? do_syscall_64+0x8e/0x190
kernel:  filemap_read+0xf2/0x360
kernel:  vfs_read+0x299/0x380
kernel:  ksys_read+0x6f/0xf0
kernel:  do_syscall_64+0x82/0x190
kernel:  ? handle_mm_fault+0xae/0x2d0
kernel:  ? do_user_addr_fault+0x379/0x670
kernel:  ? exc_page_fault+0x76/0x190
kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7fc4cbd1721d
kernel: Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d e6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d c1 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2>
kernel: RSP: 002b:00007fff35d92468 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
kernel: RAX: ffffffffffffffda RBX: 00007fff35d964f0 RCX: 00007fc4cbd1721d
kernel: RDX: 0000000000002000 RSI: 00007fff35d924a0 RDI: 0000000000000003
kernel: RBP: 00000000000ee000 R08: 00007fc4cc378020 R09: 0000000000006000
kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000ee000
kernel: R13: 0000000000000000 R14: 00007fc4cb04c010 R15: 00007fc4cb04c010
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---

I think it could be related.
Thanks again for your help.
Kind regards
Nicolas Baranger
Comment 3 Nicolas BARANGER 2025-01-05 18:07:07 UTC
Dear maintainer,

While trying to bisect the regression, I found that the bug was introduced between Linux 6.1.85 and Linux 6.1.106, most probably by commit

On my side, as this happens on a production machine, it will be difficult to build and test each of the 21 kernels between 6.1.85 and 6.1.106 to bisect, but I will try to do my best (the plan is sketched below).
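For reference, the bisection itself, run against the linux-stable tree, would look roughly like this (a sketch; every step requires building, booting and testing a kernel on the affected hardware):

  git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
  cd linux
  git bisect start v6.1.106 v6.1.85   # bad release first, then the good one
  # build, install and boot the checked-out kernel, test whether the NVMe
  # drives drop off the PEX8748, then report the result and repeat:
  git bisect good    # or: git bisect bad
  # once the offending commit is isolated:
  git bisect log
  git bisect reset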

Regards
Nicolas
