Bug 202891 - __nvme_disable_io_queues triggers WARNING in kernel/irq/chip.c:210
Summary: __nvme_disable_io_queues triggers WARNING in kernel/irq/chip.c:210
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-12 22:32 UTC by Piotr Tworek
Modified: 2020-06-21 15:36 UTC
CC List: 6 users

See Also:
Kernel Version: 5.0.1 5.1.0
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Dmesg from linux 5.0.1 (75.76 KB, text/plain)
2019-03-12 22:38 UTC, Piotr Tworek
kernel 5.0.5 trace (3.56 KB, text/plain)
2019-04-02 08:47 UTC, Daniel Exner
Full dmesg on 5.2.13 (108.88 KB, text/plain)
2019-10-06 00:08 UTC, Marcin P

Description Piotr Tworek 2019-03-12 22:32:56 UTC
After upgrading to Linux 5.0.1 I started seeing the following warning in dmesg whenever I try to put the machine to sleep:

[  156.103095] PM: suspend entry (deep)
[  156.103099] PM: Syncing filesystems ... done.
[  156.122217] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  156.124208] OOM killer disabled.
[  156.124208] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[  156.125070] printk: Suspending console(s) (use no_console_suspend to debug)
[  156.125431] wlp2s0: deauthenticating from 18:d6:c7:fc:d5:3d by local choice (Reason: 3=DEAUTH_LEAVING)
[  156.461444] WARNING: CPU: 7 PID: 169 at irq_startup+0xd6/0xe0
[  156.461446] Modules linked in: rfcomm bnep btusb btrtl btbcm btintel bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common ecdh_generic overlay squashfs kvm_amd r8822be(C) tpm_crb r8169 realtek tpm_tis tpm_tis_core amdgpu chash gpu_sched ttm
[  156.461469] CPU: 7 PID: 169 Comm: kworker/u32:6 Tainted: G         C        5.0.1 #7
[  156.461471] Hardware name: LENOVO 20MU000CPB/20MU000CPB, BIOS R0WET48W (1.16 ) 01/03/2019
[  156.461476] Workqueue: events_unbound async_run_entry_fn
[  156.461479] RIP: 0010:irq_startup+0xd6/0xe0
[  156.461481] Code: 31 f6 4c 89 ef e8 0a 2b 00 00 85 c0 75 20 48 89 ee 31 d2 4c 89 ef e8 69 db ff ff 48 89 df e8 d1 fe ff ff 89 c5 e9 57 ff ff ff <0f> 0b eb b9 0f 0b eb b5 66 90 55 48 89 fd 53 48 8b 47 38 89 f3 8b
[  156.461482] RSP: 0018:ffffaa3e0047bc28 EFLAGS: 00010002
[  156.461483] RAX: 0000000000000010 RBX: ffffa0b427beea00 RCX: 0000000000000040
[  156.461484] RDX: 0000000000000000 RSI: ffffffffad2efeb0 RDI: ffffa0b427beea18
[  156.461485] RBP: ffffa0b427beea18 R08: 0000000000000000 R09: ffffa0b42bb38c40
[  156.461486] R10: 0000000000000000 R11: ffffffffad2364c8 R12: 0000000000000001
[  156.461487] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000010
[  156.461489] FS:  0000000000000000(0000) GS:ffffa0b42fdc0000(0000) knlGS:0000000000000000
[  156.461490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.461491] CR2: 00007f6e4a053000 CR3: 00000007ade44000 CR4: 00000000003406e0
[  156.461492] Call Trace:
[  156.461497]  enable_irq+0x41/0x80
[  156.461503]  nvme_poll_irqdisable+0xd4/0x230
[  156.461506]  __nvme_disable_io_queues+0x1ae/0x1e0
[  156.461508]  ? nvme_del_queue_end+0x20/0x20
[  156.461509]  nvme_dev_disable+0x1b6/0x1d0
[  156.461512]  nvme_suspend+0x11/0x20
[  156.461515]  pci_pm_suspend+0x6e/0x1b0
[  156.461517]  ? pci_pm_suspend_noirq+0x280/0x280
[  156.461521]  dpm_run_callback+0x46/0x130
[  156.461523]  __device_suspend+0x126/0x4e0
[  156.461527]  ? __wake_up_common+0x72/0x140
[  156.461529]  ? dpm_show_time+0xc0/0xc0
[  156.461531]  async_suspend+0x15/0x80
[  156.461533]  async_run_entry_fn+0x32/0xe0
[  156.461537]  process_one_work+0x1e3/0x3e0
[  156.461540]  worker_thread+0x28/0x3c0
[  156.461541]  ? process_one_work+0x3e0/0x3e0
[  156.461544]  kthread+0x10d/0x130
[  156.461546]  ? kthread_park+0x80/0x80
[  156.461550]  ret_from_fork+0x22/0x40
[  156.461553] ---[ end trace 27b1790090a517c4 ]---
[  156.591877] ACPI: EC: interrupt blocked
[  156.639422] ACPI: Preparing to enter system sleep state S3
[  156.646539] ACPI: EC: event blocked
[  156.646540] ACPI: EC: EC stopped
[  156.646541] PM: Saving platform NVS memory
[  156.647033] Disabling non-boot CPUs ..

The computer in question is a Lenovo ThinkPad A485. I'm not sure it's important here, but it comes with two NVMe drives: a 1TB ADATA SX6000LNP and a 240GB TOSHIBA-RC100.

If required, I can bisect this to the specific commit that caused the regression, but from a quick look at the changes between 4.20 and 5.0, two commits look like likely candidates:
1. nvme-pci: remove the CQ lock for interrupt driven queues (3a7afd8ee42a68d4f24ab9c947a4ef82d4d52375)
2. nvme-pci: don't poll from irq context when deleting queues (d1ed6aa14bc418531220478604c7b12c5e98fdca)
Comment 1 Piotr Tworek 2019-03-12 22:38:47 UTC
Created attachment 281775
Dmesg from linux 5.0.1

Attaching full dmesg.
Comment 2 Daniel Exner 2019-04-02 08:47:10 UTC
Created attachment 282087
kernel 5.0.5 trace

Same issue on a Lenovo ThinkPad E485 with kernel 5.0.5.
Comment 3 Piotr Tworek 2019-05-10 19:31:04 UTC
The problem still persists in kernel 5.1.0.
Comment 4 Piotr Tworek 2019-05-18 18:42:34 UTC
I stumbled on this post on the Phoronix forums: https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1089720-amdgpu-laptop-suspend-hang?p=1091151#post1091151 and I can confirm that the change described there solves the problem on my laptop.
Comment 5 Marcin P 2019-10-06 00:08:35 UTC
Created attachment 285363
Full dmesg on 5.2.13

I get this bug a lot too, on 5.2.13.

Hardware:

* Ryzen 2700
* 64GB ECC RAM
* Samsung 970 Pro 512GB NVMe
Comment 6 Marcin P 2019-10-14 19:21:00 UTC
Happens on 5.3.6 too.
Comment 7 Bart Van Assche 2019-12-16 16:10:52 UTC
I think the kernel warning refers to the following source code:

        if (cpumask_any_and(aff, cpu_online_mask) >= nr_cpu_ids) {
                /*
                 * Catch code which fiddles with enable_irq() on a managed
                 * and potentially shutdown IRQ. Chained interrupt
                 * installment or irq auto probing should not happen on
                 * managed irqs either.
                 */
                if (WARN_ON_ONCE(force))
                        return IRQ_STARTUP_ABORT;
                /*
                 * The interrupt was requested, but there is no online CPU
                 * in it's affinity mask. Put it into managed shutdown
                 * state and let the cpu hotplug mechanism start it up once
                 * a CPU in the mask becomes available.
                 */
                return IRQ_STARTUP_ABORT;
        }
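
To make the quoted check easier to follow, here is a minimal user-space sketch of the same decision. It is not kernel code. It assumes CPU masks fit in a 64-bit word, and the names model_first_common_cpu, model_irq_startup, MODEL_PROCEED and MODEL_ABORT are hypothetical, invented for illustration. The MODEL_PROCEED branch stands in for whatever the kernel does once an online CPU is found, which the excerpt does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes for this model; not the kernel's enum. */
enum model_result { MODEL_PROCEED, MODEL_ABORT };

/*
 * Return the index of the first CPU set in both masks, or nr_cpu_ids if
 * the intersection is empty. This mirrors how the excerpt uses the
 * return value of cpumask_any_and(): ">= nr_cpu_ids" means "none found".
 */
unsigned int model_first_common_cpu(uint64_t aff, uint64_t online,
                                    unsigned int nr_cpu_ids)
{
    uint64_t both = aff & online;

    assert(nr_cpu_ids <= 64); /* assumption: masks fit in one word */
    for (unsigned int cpu = 0; cpu < nr_cpu_ids; cpu++) {
        if (both & (UINT64_C(1) << cpu))
            return cpu;
    }
    return nr_cpu_ids;
}

/*
 * Model of the quoted branch: if no CPU in the affinity mask is online,
 * startup is aborted. When the caller forces startup, a warning is
 * recorded first (standing in for WARN_ON_ONCE(force)).
 */
enum model_result model_irq_startup(uint64_t aff, uint64_t online,
                                    unsigned int nr_cpu_ids, bool force,
                                    bool *warned)
{
    if (model_first_common_cpu(aff, online, nr_cpu_ids) >= nr_cpu_ids) {
        if (force)
            *warned = true;
        return MODEL_ABORT;
    }
    return MODEL_PROCEED;
}
```

With an affinity mask of 0xF0 (CPUs 4-7) and only CPU 0 online, the model aborts in both cases and records a warning only when force is set. That forced case appears to correspond to the enable_irq() and resume_irqs() paths seen in the traces in this report.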
Comment 9 Daniel Parks 2020-06-21 15:36:00 UTC
I am also affected by this bug on kernel 5.7.2-arch1-1, with a Dell Inspiron 7405, AMD Ryzen 7 4700u, and a PC SN530 NVMe WDC 512GB SSD.

Here is the warning trace from my kernel messages:

------------[ cut here ]------------
WARNING: CPU: 6 PID: 2042 at kernel/irq/chip.c:210 irq_startup+0xdf/0xf0
Modules linked in: fuse ccm btusb btrtl btbcm btintel bluetooth uvcvideo joydev videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 mousedev video>
 pinctrl_amd acpi_tad ac drm crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw atkbd libps2 crc32c_intel x>
CPU: 6 PID: 2042 Comm: systemd-sleep Not tainted 5.6.15-arch1-1 #1
Hardware name: Dell Inc. Inspiron 7405 2n1/042J14, BIOS 1.0.0 03/19/2020
RIP: 0010:irq_startup+0xdf/0xf0
Code: f6 4c 89 e7 e8 02 45 00 00 85 c0 75 21 4c 89 e7 31 d2 4c 89 ee e8 21 c9 ff ff 48 89 ef e8 b9 fe ff ff 41 89 c4 e9 53 ff ff ff <0f> 0b eb b>
RSP: 0018:ffffb7c8c2de3d90 EFLAGS: 00010002
RAX: 0000000000000140 RBX: 0000000000000001 RCX: 0000000000000140
RDX: 0000000000000004 RSI: ffffffff8b565160 RDI: ffff976402fb0818
RBP: ffff976402fb0800 R08: 0000000000000000 R09: 0000000000000140
R10: 0000000000000000 R11: ffffffff8b453648 R12: 0000000000000001
R13: ffff976402fb0818 R14: ffff976402fb08e4 R15: 0000000000000000
FS:  00007f2aae131a80(0000) GS:ffff976407580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005634481363e6 CR3: 000000017a266000 CR4: 0000000000340ee0
Call Trace:
 resume_irqs+0xb6/0xf0
 dpm_resume_noirq+0xf/0x20
 suspend_devices_and_enter+0x338/0x8a0
 pm_suspend.cold+0x333/0x387
 state_store+0x42/0x90
 kernfs_fop_write+0xce/0x1b0
 vfs_write+0xb6/0x1a0
 ksys_write+0x67/0xe0
 do_syscall_64+0x49/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2aaf094b57
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f>
RSP: 002b:00007ffe27d04b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f2aaf094b57
RDX: 0000000000000004 RSI: 00007ffe27d04c80 RDI: 0000000000000004
RBP: 00007ffe27d04c80 R08: 000055df0939ea90 R09: 000000000000000d
R10: 000055df0939ac00 R11: 0000000000000246 R12: 0000000000000004
R13: 000055df0939a3c0 R14: 0000000000000004 R15: 00007f2aaf165700
---[ end trace b43f52b8ea5824de ]---
