Bug 217343

Summary: WARNING "thunderbolt 0000:34:00.6: interrupt for TX ring 0 is already enabled" at S4 resume
Product: Drivers Reporter: Takashi Iwai (tiwai)
Component: OtherAssignee: drivers_other
Status: RESOLVED CODE_FIX    
Severity: normal CC: b.buschinski, mario.limonciello, mika.westerberg, philipp
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: possible patch

Description Takashi Iwai 2023-04-15 09:43:14 UTC
When resuming from the hibernation on Lenovo Thinkpad T14s Gen 3 with AMD Ryzen, the kernel spews the warnings like:

[   13.774391] WARNING: CPU: 13 PID: 111 at drivers/thunderbolt/nhi.c:125 ring_interrupt_active+0x216/0x270
[   13.774399] Modules linked in: bnep(E) btusb(E) btrtl(E) btbcm(E) btintel(E) btmtk(E) bluetooth(E) ecdh_generic(E) qrtr_mhi(E) cdc_mbim(E) cdc_wdm(E) cdc_ncm(E) cdc_ether(E) usbnet(E) mii(E) dmi_sysfs(E) qrtr(E) mhi(E) snd_ctl_led(E) intel_rapl_msr(E) nls_iso8859_1(E) intel_rapl_common(E) nls_cp437(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) vfat(E) snd_hda_codec_hdmi(E) kvm_amd(E) think_lmi(E) fat(E) firmware_attributes_class(E) wmi_bmof(E) snd_hda_intel(E) snd_intel_dspcfg(E) kvm(E) snd_hda_codec(E) snd_hwdep(E) irqbypass(E) snd_hda_core(E) thinkpad_acpi(E) snd_pcm(E) ledtrig_audio(E) video(E) platform_profile(E) snd_timer(E) snd_rn_pci_acp3x(E) i2c_piix4(E) snd(E) snd_acp_config(E) snd_soc_acpi(E) snd_pci_acp3x(E) ucsi_acpi(E) soundcore(E) typec_ucsi(E) roles(E) typec(E) thermal(E) rfkill(E) fan(E) battery(E) ac(E) wmi(E) acpi_cpufreq(E) amd_pmc(E) acpi_tad(E) button(E) drm(E) fuse(E) configfs(E) efi_pstore(E) ip_tables(E) x_tables(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32_pclmul(E) crc32c_intel(E)
[   13.774456]  ghash_clmulni_intel(E) xhci_pci(E) aesni_intel(E) xhci_pci_renesas(E) nvme(E) crypto_simd(E) cryptd(E) xhci_hcd(E) nvme_core(E) serio_raw(E) t10_pi(E) sp5100_tco(E) crc64_rocksoft_generic(E) usbcore(E) ccp(E) crc64_rocksoft(E) crc64(E) i2c_hid_acpi(E) i2c_hid(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) msr(E) efivarfs(E) [last unloaded: qmi_helpers(E)]
[   13.774478] CPU: 13 PID: 111 Comm: kworker/u32:1 Tainted: G            E      6.3.0-rc6-xtest5-dirty #11 334ee2e65583e46a1a254d52b14ac04857fddcf8
[   13.774481] Hardware name: LENOVO 21CRS0K63E/21CRS0K63E, BIOS R22ET60W (1.30 ) 02/09/2023
[   13.774483] Workqueue: events_unbound async_run_entry_fn
[   13.774489] RIP: 0010:ring_interrupt_active+0x216/0x270
[   13.774492] Code: c7 d0 00 00 00 48 89 14 24 e8 96 f4 f1 ff 48 8b 14 24 4d 89 f9 45 89 f0 4c 89 e1 48 89 c6 48 c7 c7 b8 a5 50 98 e8 1a 7b 79 ff <0f> 0b e9 3c ff ff ff 0f b6 43 78 d3 e0 09 c7 e9 d4 fe ff ff 03 56
[   13.774494] RSP: 0018:ffffa801c0553d78 EFLAGS: 00010082
[   13.774496] RAX: 0000000000000000 RBX: ffff90dac11ab5c0 RCX: 0000000000000027
[   13.774497] RDX: 0000000000000000 RSI: 00000000ffff7fff RDI: ffff90ddef1624c8
[   13.774499] RBP: 0000000000000009 R08: 0000000000000000 R09: c0000000ffff7fff
[   13.774500] R10: 0000000000000006 R11: ffffa801c0553c10 R12: ffffffff9850a3af
[   13.774501] R13: 0000000000038200 R14: 0000000000000000 R15: ffffffff98486ff5
[   13.774502] FS:  0000000000000000(0000) GS:ffff90ddef140000(0000) knlGS:0000000000000000
[   13.774504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.774505] CR2: 0000000000000000 CR3: 000000014dc34000 CR4: 0000000000750ee0
[   13.774506] PKRU: 55555554
[   13.774507] Call Trace:
[   13.774510]  <TASK>
[   13.774514]  tb_ring_start+0x14e/0x240
[   13.774518]  tb_ctl_start+0x22/0x80
[   13.774522]  ? __pfx_pci_pm_restore_noirq+0x10/0x10
[   13.774526]  tb_domain_resume_noirq+0x25/0x60
[   13.774531]  dpm_run_callback+0x46/0x190
[   13.774535]  device_resume_noirq+0xe0/0x1e0
[   13.774538]  async_resume_noirq+0x1a/0x60
[   13.774541]  async_run_entry_fn+0x2e/0x110
[   13.774543]  process_one_work+0x226/0x440
[   13.774547]  worker_thread+0x2a/0x3b0
[   13.774550]  ? __pfx_worker_thread+0x10/0x10
[   13.774552]  kthread+0xd9/0x100
[   13.774556]  ? __pfx_kthread+0x10/0x10
[   13.774559]  ret_from_fork+0x2c/0x50
[   13.774565]  </TASK>
[   13.774566] ---[ end trace 0000000000000000 ]---

And once again for RX:
[   13.774605] thunderbolt 0000:34:00.6: interrupt for RX ring 0 is already enabled
[   13.774611] WARNING: CPU: 13 PID: 111 at drivers/thunderbolt/nhi.c:125 ring_interrupt_active+0x216/0x270
....

And, this seems to be happening only when the Thunderbolt driver is configured as built-in (CONFIG_USB4=y).  On the kernel with TB as a module (CONFIG_USB4=m), this bug isn't triggered.

My wild guess is that the TB stuff gets initialized on the kernel that does restore the image, and the irq setup is taken over at resuming from the image.
Adding some debug prints and enabling the debug output, the kernel shows like below just before the warning:

[   13.774312] thunderbolt 0000:34:00.6: XXX nhi_restore_noirq
[   13.774338] thunderbolt 0000:34:00.6: XXX nhi_resume_noirq
[   13.774362] thunderbolt 0000:34:00.6: enabling interrupt at register 0x38200 bit 0 (0x9 -> 0x9)

So, this is happening at the restore_noirq call, and the irqs have been already set up.

The symptom is found from 5.15 to the latest Linus tree (6.3-rc6).
(The reason starting from 5.15 is that AMD TB got fixed on that version.)

Except for the ugly interrupts, there is no other problem.
Comment 1 Takashi Iwai 2023-04-20 14:34:21 UTC
I suspect rather that this might be an AMD-specific problem.  Adding Mario to Cc.

During the debug, I noticed that the interrupt bit flip in ring_interrupt_active() isn't taken immediately on the machine.
Namely, even after the iowrite32() on REG_RING_INTERRUPT_BASE + idx, reading the same register returns the old value.

That is, the suspend / freeze ops does turn off the interrupt by writing on that register, but this alone doesn't suffice.  At the hibernation (with TB built-in), the device has been already initialized.  Even if it's (runtime-)suspended, the irq bit on the register isn't actually cleared, and at the hibernation resume, the driver checks it by reading the register, hence it complains the inconsistency, as it seems.
Comment 2 Mario Limonciello (AMD) 2023-04-20 15:02:07 UTC
My knee jerk reaction is "What is the value to this warning message?".

> Namely, even after the iowrite32() on REG_RING_INTERRUPT_BASE + idx, reading
> the same register returns the old value.

It sounds like by having it as a module it changes the timing and it takes some time for the hardware to enable (or disable) the interrupt.

If the warning should stay one option is that we can poll `REG_RING_INTERRUPT_BASE + idx` after the write to ensure it's settled.  Can you have a try with that perhaps?
Comment 3 Takashi Iwai 2023-04-20 15:08:30 UTC
(In reply to Mario Limonciello (AMD) from comment #2)
> My knee jerk reaction is "What is the value to this warning message?".
> 
> > Namely, even after the iowrite32() on REG_RING_INTERRUPT_BASE + idx,
> reading
> > the same register returns the old value.
> 
> It sounds like by having it as a module it changes the timing and it takes
> some time for the hardware to enable (or disable) the interrupt.

Right, if the TB is a module, it's not loaded on the kernel that does restore the hibernation image, hence at the hibernation resume, TB device is still in the uninitialized state.  OTOH, when it's built-in, the TB device was already initialized at hibernation resume, and it triggers the bug.

> If the warning should stay one option is that we can poll
> `REG_RING_INTERRUPT_BASE + idx` after the write to ensure it's settled.  Can
> you have a try with that perhaps?

I tried to add the 1000 times reads, but this didn't seem help, and the value persisted.  But I need to double-check.
Comment 4 Takashi Iwai 2023-04-20 15:25:17 UTC
(In reply to Takashi Iwai from comment #3)
> (In reply to Mario Limonciello (AMD) from comment #2)
> > If the warning should stay one option is that we can poll
> > `REG_RING_INTERRUPT_BASE + idx` after the write to ensure it's settled. 
> Can
> > you have a try with that perhaps?
> 
> I tried to add the 1000 times reads, but this didn't seem help, and the
> value persisted.  But I need to double-check.

I confirmed that the old value persists even after 1 second.  HOWEVER: this happens only when you turn off the bit.  Turning on the bit is effective immediately, as it seems.
Comment 5 Mario Limonciello (AMD) 2023-04-21 01:59:15 UTC
Created attachment 304168 [details]
possible patch

Can you have a try with this patch?
Comment 6 Takashi Iwai 2023-04-21 07:13:50 UTC
Yes, the patch works fine!
Comment 7 Takashi Iwai 2023-04-21 07:14:50 UTC
So feel free to take:
Tested-by: Takashi Iwai <tiwai@suse.de>

Will be looking forward to seeing your upstream submission :)
Comment 8 Mario Limonciello (AMD) 2023-05-22 21:03:50 UTC
It's part of 6.4-rc3 and trickling into stable kernels.

c4af8e3fecd0 ("thunderbolt: Clear registers properly when auto clear isn't in use")
Comment 9 Philipp 2023-06-08 20:14:36 UTC
Is this only a cosmetic/logging issue, or are there any user-visible symptoms of this bug?

Specifically, my system freezes for a minute or two when resuming from hibernate (and when printing this message) and then also the wifi is broken until I reboot, and upon shutdown/reboot it will freeze again. So hibernation leaves my system in a really weird state, but then again I know that this could be due to any number of hibernation issues. Just wondering if this specific patch would be useful for me to have.
Comment 10 Mario Limonciello (AMD) 2023-06-20 04:08:26 UTC
I don't expect these messages and your issue are related; but if you would like to confirm you can do this:
1) modify your bootloader configuration so that modprobe.blacklist=thunderbolt is on your kernel command line
2) hibernate system
3) Resume system from hibernate (ensure that the resuming kernel has same kernel command line to blacklist)

If your issue still happens it's unrelated.
Comment 11 Bernd Buschinski 2023-06-21 06:16:22 UTC
Hello, starting with kernel 6.3.4 (currently using 6.3.7) I am getting:

> [   36.445310] thunderbolt 0000:03:00.0: interrupt for TX ring 0 is already
> enabled
> [   36.445329] WARNING: CPU: 7 PID: 3577 at drivers/thunderbolt/nhi.c:137
> ring_interrupt_active+0x202/0x250
> [   36.445334] Modules linked in: snd_hda_codec_hdmi snd_ctl_led
> snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg
> snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd efivarfs
> [   36.445341] CPU: 7 PID: 3577 Comm: laptop_mode Not tainted 6.3.7-gentoo #1
> [   36.445343] Hardware name: Dell Inc. Latitude 7410/0M5G57, BIOS 1.22.0
> 03/20/2023
> [   36.445344] RIP: 0010:ring_interrupt_active+0x202/0x250
> [   36.445346] Code: 44 89 44 24 08 e8 1e 3e 9c ff 4c 89 f9 48 c7 c7 d8 f0 68
> 83 4c 8b 0c 24 44 8b 44 24 08 48 89 c6 48 8b 54 24 10 e8 ee 2c e3 fe <0f> 0b
> 45 84 f6 44 8b 54 24 0c 0f 85 06 ff ff ff 48 8b 43 08 f6 40
> [   36.445347] RSP: 0018:ffffc90000c9fcb0 EFLAGS: 00010086
> [   36.445348] RAX: 0000000000000044 RBX: ffff888100c07900 RCX:
> 0000000000000027
> [   36.445349] RDX: ffff88848d7db4c8 RSI: 0000000000000001 RDI:
> ffff88848d7db4c0
> [   36.445350] RBP: 0000000000000000 R08: 0000000000000001 R09:
> 0000000000000000
> [   36.445350] R10: ffffffff83873400 R11: ffffffff83873400 R12:
> 0000000000038200
> [   36.445351] R13: 0000000000000001 R14: 0000000000000001 R15:
> ffffffff8368efd1
> [   36.445352] FS:  00007fc83615fb80(0000) GS:ffff88848d7c0000(0000)
> knlGS:0000000000000000
> [   36.445353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   36.445354] CR2: 00007f12fabbe6d8 CR3: 0000000104d22005 CR4:
> 00000000003706e0
> [   36.445355] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   36.445355] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [   36.445356] Call Trace:
> [   36.445357]  <TASK>
> [   36.445358]  ? ring_interrupt_active+0x202/0x250
> [   36.445360]  ? __warn+0x78/0x130
> [   36.445363]  ? ring_interrupt_active+0x202/0x250
> [   36.445364]  ? report_bug+0x16b/0x1a0
> [   36.445366]  ? handle_bug+0x36/0x70
> [   36.445368]  ? exc_invalid_op+0x13/0x60
> [   36.445370]  ? asm_exc_invalid_op+0x16/0x20
> [   36.445372]  ? ring_interrupt_active+0x202/0x250
> [   36.445373]  tb_ring_start+0x14e/0x240
> [   36.445375]  tb_ctl_start+0x1a/0x70
> [   36.445376]  ? pci_pm_restore_noirq+0xb0/0xb0
> [   36.445378]  tb_domain_runtime_resume+0x10/0x30
> [   36.445381]  __rpm_callback+0x3c/0x160
> [   36.445384]  ? pci_pm_restore_noirq+0xb0/0xb0
> [   36.445386]  rpm_callback+0x54/0x60
> [   36.445388]  ? pci_pm_restore_noirq+0xb0/0xb0
> [   36.445389]  rpm_resume+0x53e/0x780
> [   36.445391]  ? do_filp_open+0x98/0x120
> [   36.445393]  ? _copy_from_iter+0x8d/0x470
> [   36.445395]  ? __kmem_cache_alloc_node+0x3b/0x410
> [   36.445397]  pm_runtime_forbid+0x35/0x40
> [   36.445399]  control_store+0x73/0x80
> [   36.445400]  kernfs_fop_write_iter+0x112/0x1a0
> [   36.445403]  vfs_write+0x224/0x3d0
> [   36.445405]  ksys_write+0x56/0xd0
> [   36.445406]  do_syscall_64+0x35/0x80
> [   36.445408]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [   36.445410] RIP: 0033:0x7fc836294a20
> [   36.445412] Code: 40 00 48 8b 15 f9 53 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff
> ff ff eb b7 0f 1f 00 80 3d c1 db 0d 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d
> 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
> [   36.445412] RSP: 002b:00007fff32579188 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [   36.445414] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007fc836294a20
> [   36.445414] RDX: 0000000000000003 RSI: 000055b6d6a45550 RDI:
> 0000000000000001
> [   36.445415] RBP: 00007fc83636b780 R08: 0000000000000000 R09:
> 0000000000000000
> [   36.445416] R10: 0000000000000000 R11: 0000000000000202 R12:
> 0000000000000003
> [   36.445416] R13: 000055b6d6a45550 R14: 0000000000000a68 R15:
> 0000000000000003
> [   36.445418]  </TASK>

I think this is related to https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.3.8&id=27b98a67a8ab21de9ae0f6a9553af9aa6aa2b43a (this bug).

Should I open a new bug for this?
Comment 13 Bernd Buschinski 2023-06-21 07:59:27 UTC
I tested v6.4-rc7 and the warning is gone. Works fine, thanks! :)
Comment 14 Mario Limonciello (AMD) 2023-06-21 14:24:26 UTC
FYI it's also landed in 6.3.9 as:
d826dc20ff05 ("thunderbolt: Mask ring interrupt on Intel hardware as well")