Bug 213079 - [bisected] IRQ problems and crashes on a PowerMac G5 with 5.12.3
Status: NEEDINFO
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-64
Hardware: PPC-64 Linux
Importance: P1 normal
Assignee: platform_ppc-64
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-15 11:58 UTC by Erhard F.
Modified: 2022-07-29 07:01 UTC (History)
CC List: 4 users

See Also:
Kernel Version: 5.12.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (5.13-rc1, PowerMac G5 11,2) (66.02 KB, text/plain)
2021-05-15 11:58 UTC, Erhard F.
kernel .config (5.13-rc1, PowerMac G5 11,2) (107.38 KB, text/plain)
2021-05-15 11:58 UTC, Erhard F.
bisect.log (2.69 KB, text/plain)
2021-06-06 18:14 UTC, Erhard F.
dmesg (5.13-rc6 + debug, PowerMac G5 11,2) (88.58 KB, text/plain)
2021-06-17 23:51 UTC, Erhard F.
dmesg (5.13-rc6 w. patch fbbefb3 reverted + debug, PowerMac G5 11,2) (88.95 KB, text/plain)
2021-06-17 23:52 UTC, Erhard F.
kernel .config (5.13-rc6, PowerMac G5 11,2) (107.34 KB, text/plain)
2021-06-17 23:54 UTC, Erhard F.
dmesg (5.13-rc6 + DEBUG_VM_PGTABLE, PowerMac G5 11,2) (63.13 KB, text/plain)
2021-06-18 23:29 UTC, Erhard F.
hackfix for MSI init (1.93 KB, application/mbox)
2021-07-05 14:11 UTC, Oliver O'Halloran
dmesg (5.14-rc6, PowerMac G5 11,2) (72.59 KB, text/plain)
2021-08-20 00:12 UTC, Erhard F.
kernel .config (5.14-rc6, PowerMac G5 11,2) (108.37 KB, text/plain)
2021-08-20 00:13 UTC, Erhard F.

Description Erhard F. 2021-05-15 11:58:01 UTC
Created attachment 296759 [details]
dmesg (5.13-rc1, PowerMac G5 11,2)

With v5.13-rc1 I get IRQ problems and crashes on my G5 sooner or later. IRQ 63 is my NVMe SSD.

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
CPU: 1 PID: 11783 Comm: emerge Tainted: G        W         5.13.0-rc1-PowerMacG5 #3
Call Trace:
[c00000000ffefae0] [c000000000549790] .dump_stack+0xe0/0x13c (unreliable)
[c00000000ffefb80] [c0000000000def44] .__report_bad_irq+0x34/0xf0
[c00000000ffefc20] [c0000000000dee2c] .note_interrupt+0x258/0x300
[c00000000ffefce0] [c0000000000db0a8] .handle_irq_event_percpu+0x64/0x90
[c00000000ffefd70] [c0000000000db118] .handle_irq_event+0x44/0x70
[c00000000ffefe00] [c0000000000e0530] .handle_fasteoi_irq+0xac/0x158
[c00000000ffefea0] [c0000000000da164] .generic_handle_irq+0x38/0x58
[c00000000ffeff10] [c000000000011674] .__do_irq+0x15c/0x238
[c00000000ffeff90] [c000000000012068] .do_IRQ+0x180/0x188
[c00000014d357d70] [c000000000011f88] .do_IRQ+0xa0/0x188
[c00000014d357e10] [c000000000007f94] hardware_interrupt_common_virt+0x1a4/0x1b0
--- interrupt: 500 at 0x3fffb07a1a9c
NIP:  00003fffb07a1a9c LR: 00003fffb07a3d08 CTR: 00003fffb074cb30
REGS: c00000014d357e80 TRAP: 0500   Tainted: G        W          (5.13.0-rc1-PowerMacG5)
MSR:  900000000000f032 <SF,HV,EE,PR,FP,ME,IR,DR,RI>  CR: 22482820  XER: 20000000
IRQMASK: 0 
GPR00: 00003fffb07a3d08 00003fffe84d07a0 00003fffb0ad1200 00003fffa8131100 
GPR04: 00003fffa9ea4bd0 a5a8b016e7fdc57d 00003fffe84d0810 00003fffb0aa7ac0 
GPR08: 00003fffb0ab3708 00003fffab4eb870 0000000000000000 0000000000000000 
GPR12: 00003fffb07b92a0 00003fffb0b8e850 00003fffe84d0a58 000000014df42388 
GPR16: 00003fffe84d0a70 ffffffffffffffff 00003fffafbf54c0 ffffffffffffffff 
GPR20: 0000000000000000 000000014df42338 000000014c677878 0000000000000000 
GPR24: 00003fffafc0b5b0 000000014c677830 00003fffafcc8a50 a5a8b016e7fdc57d 
GPR28: 00003fffa863bcc0 00003fffa8131100 00003fffa9ea4bd0 00003fffa8131100 
NIP [00003fffb07a1a9c] 0x3fffb07a1a9c
LR [00003fffb07a3d08] 0x3fffb07a3d08
--- interrupt: 500
handlers:
[<00000000370eb0ba>] .nvme_irq
[<00000000370eb0ba>] .nvme_irq
Disabling IRQ #63
Call Trace:
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 0 PID: 814 Comm: kworker/u4:2 Tainted: G        W         5.13.0-rc1-PowerMacG5 #3
Workqueue: writeback .wb_workfn (flush-254:1)
[c00000007db5ab40] [c000000000549790] .dump_stack+0xe0/0x13c (unreliable)
[c00000007db5abe0] [c0000000000680dc] .panic+0x168/0x430
[c00000007db5ac90] [c000000000811e40] .__schedule+0x80/0x840
[c00000007db5ad70] [c00000000081274c] .preempt_schedule_common+0x28/0x48
[c00000007db5adf0] [c00000000081279c] .__cond_resched+0x30/0x4c
[c00000007db5ae70] [c0000000001c6a98] .mempool_alloc+0x38/0x1a4
[c00000007db5af50] [c0000000004a1a70] .bio_alloc_bioset+0x94/0x174
[c00000007db5b000] [c000000000354840] .ext4_bio_write_page+0x314/0x480
[c00000007db5b0c0] [c0000000003334d4] .mpage_submit_page+0x70/0xa0
[c00000007db5b140] [c000000000333630] .mpage_process_page_bufs+0x12c/0x18c
[c00000007db5b1d0] [c0000000003338b8] .mpage_prepare_extent_to_map+0x1f8/0x228
[c00000007db5b320] [c000000000339088] .ext4_writepages+0x360/0xe5c
[c00000007db5b5d0] [c0000000001cee84] .do_writepages+0x54/0xa0
[c00000007db5b650] [c0000000002a49bc] .__writeback_single_inode+0x100/0x560
[c00000007db5b700] [c0000000002a53d8] .writeback_sb_inodes+0x2dc/0x4c8
[c00000007db5b880] [c0000000002a5654] .__writeback_inodes_wb+0x90/0xcc
[c00000007db5b930] [c0000000002a58c0] .wb_writeback+0x230/0x3dc
[c00000007db5ba50] [c0000000002a6790] .wb_workfn+0x380/0x460
[c00000007db5bbb0] [c0000000000890a0] .process_one_work+0x318/0x4dc
[c00000007db5bca0] [c000000000089730] .worker_thread+0x224/0x290
[c00000007db5bd60] [c000000000091200] .kthread+0x134/0x13c
[c00000007db5be10] [c00000000000bbf4] .ret_from_kernel_thread+0x58/0x64
Rebooting in 120 seconds..


 # lspci -vv -s 0001:08:00.0
0001:08:00.0 Non-Volatile memory controller: Intel Corporation SSD Pro 7600p/760p/E 6100p Series (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [NVM Express]
	Device tree node: /sys/firmware/devicetree/base/ht@0,f2000000/pci@5/pci8086,390b@0
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 63
	NUMA node: 0
	Region 0: Memory at a0000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Kernel driver in use: nvme
Comment 1 Erhard F. 2021-05-15 11:58:27 UTC
Created attachment 296761 [details]
kernel .config (5.13-rc1, PowerMac G5 11,2)
Comment 2 Erhard F. 2021-05-15 12:30:16 UTC
Hmm... This just happened on 5.12.3 as well, but without the kernel panic (yet).

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
Call Trace:
CPU: 1 PID: 43491 Comm: emerge Tainted: G        W         5.12.3-gentoo-PowerMacG5 #2
[c00000000ffefae0] [c00000000053950c] .dump_stack+0xe0/0x13c (unreliable)
[c00000000ffefb80] [c0000000000ddb68] .__report_bad_irq+0x34/0xf0
[c00000000ffefc20] [c0000000000dda50] .note_interrupt+0x250/0x2f8
[c00000000ffefce0] [c0000000000d9cf8] .handle_irq_event_percpu+0x64/0x90
[c00000000ffefd70] [c0000000000d9d68] .handle_irq_event+0x44/0x70
[c00000000ffefe00] [c0000000000df164] .handle_fasteoi_irq+0xac/0x158
[c00000000ffefea0] [c0000000000d8db8] .generic_handle_irq+0x38/0x58
[c00000000ffeff10] [c000000000011314] .__do_irq+0x15c/0x238
[c00000000ffeff90] [c00000000001fe04] .call_do_irq+0x14/0x24
[c000000056e2fd70] [c00000000001154c] .do_IRQ+0x15c/0x164
[c000000056e2fe10] [c000000000007d38] hardware_interrupt_common_virt+0x158/0x160
--- interrupt: 500 at 0x3fffb8a21520
handlers:
NIP:  00003fffb8a21520 LR: 00003fffb8a214a0 CTR: 00003fffb8ae6d20
REGS: c000000056e2fe80 TRAP: 0500   Tainted: G        W          (5.12.3-gentoo-PowerMacG5)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 42482824  XER: 20000000
IRQMASK: 0 
GPR00: 00003fffb8a214a0 00003fffdb199650 00003fffb8df7200 000000014e8ddc60 
GPR04: 00003fffb210e000 95bfd31b66b69e10 00003fffdb199478 0000000000024d50 
GPR08: 000000014cb987c0 0000000000000002 0000000000000000 0000000000000000 
GPR12: 00003fffb8ae0e50 00003fffb8eb4850 00003fffdb199a58 000000014e8ddf60 
GPR16: 00003fffdb199a70 ffffffffffffffff 0000000000000001 000000014b5d8460 
GPR20: 0000000000000000 0000000000000002 000000014e8ddf38 00003fffb6b176e8 
GPR24: 000000014c126958 00003fffb2030390 000000014b94c380 000000014b5d8460 
GPR28: 000000014c1267f0 000000014c126a60 000000014c1267f0 0000000000000000 
NIP [00003fffb8a21520] 0x3fffb8a21520
LR [00003fffb8a214a0] 0x3fffb8a214a0
--- interrupt: 500
[<000000000e5af612>] .nvme_irq
[<000000000e5af612>] .nvme_irq
Disabling IRQ #63
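
For context on the "nobody cared" / "Disabling IRQ #63" pair in the traces above: the message is emitted by the kernel's unhandled-interrupt accounting once almost every interrupt in a window went unclaimed by the registered handlers. The following is only a simplified model of that accounting, not kernel code. The thresholds (a window of 100000 interrupts, more than 99900 of them unhandled) are taken from kernel/irq/spurious.c as I understand it; the class and function names are hypothetical, and the kernel's time-based reset of the unhandled counter is left out.

```python
# Simplified model of the kernel's unhandled-interrupt accounting.
# Assumption: thresholds mirror kernel/irq/spurious.c; all names here
# are hypothetical and exist only for this illustration.

WINDOW = 100_000           # interrupts per accounting window
UNHANDLED_LIMIT = 99_900   # more than this many unhandled => disable


class IrqLine:
    def __init__(self, number):
        self.number = number
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt; return True if the line was just disabled."""
        if self.disabled:
            return False
        if not handled:
            self.unhandled += 1
        self.count += 1
        if self.count < WINDOW:
            return False
        # End of the window: decide, then reset both counters.
        too_many = self.unhandled > UNHANDLED_LIMIT
        self.count = 0
        self.unhandled = 0
        if too_many:
            self.disabled = True
            print(f"irq {self.number}: nobody cared")
            print(f"Disabling IRQ #{self.number}")
        return too_many


def simulate(number, total, handled_every):
    """Feed `total` interrupts; every `handled_every`-th one is claimed.

    handled_every=0 means no handler ever claims the interrupt."""
    line = IrqLine(number)
    for i in range(total):
        claimed = handled_every > 0 and i % handled_every == 0
        line.note_interrupt(handled=claimed)
    return line.disabled
```

In this model a line where nvme_irq never claims the interrupt is disabled after one full window, while a line where even half of the interrupts are claimed stays enabled. That matches the reading that IRQ 63 was firing without the NVMe driver seeing any work to do.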
Comment 3 Erhard F. 2021-05-15 14:50:23 UTC
Some time after the "irq 63: nobody cared" on 5.12.3:

[...]
--- interrupt: 500
[<000000000e5af612>] .nvme_irq
[<000000000e5af612>] .nvme_irq
Disabling IRQ #63
Call Trace:
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 0 PID: 105549 Comm: kworker/u4:1 Tainted: G        W         5.12.3-gentoo-PowerMacG5 #2
Workqueue:  0x0 (flush-259:0)
[c000000078dc79f0] [c00000000053950c] .dump_stack+0xe0/0x13c (unreliable)
[c000000078dc7a90] [c000000000066074] .panic+0x168/0x430
[c000000078dc7b40] [c0000000007f19f0] .__schedule+0x80/0x848
[c000000078dc7c20] [c0000000007f2270] .schedule+0xb8/0x110
[c000000078dc7ca0] [c000000000086d18] .worker_thread+0x278/0x290
[c000000078dc7d60] [c00000000008e75c] .kthread+0x134/0x13c
[c000000078dc7e10] [c00000000000b1f4] .ret_from_kernel_thread+0x58/0x64
Rebooting in 120 seconds..
Comment 4 Erhard F. 2021-06-06 18:14:30 UTC
Created attachment 297191 [details]
bisect.log

Turns out the problem was introduced between v5.11 and v5.12 by the following commit:

 # git bisect good
fbbefb320214db14c3e740fce98e2c95c9d0669b is the first bad commit
commit fbbefb320214db14c3e740fce98e2c95c9d0669b
Author: Oliver O'Halloran <oohall@gmail.com>
Date:   Tue Nov 3 15:35:07 2020 +1100

    powerpc/pci: Move PHB discovery for PCI_DN using platforms
    
    Make powernv, pseries, powermac and maple use ppc_mc.discover_phbs.
    These platforms need to be done together because they all depend on
    pci_dn's being created from the DT. The pci_dn contains a pointer to
    the relevant pci_controller so they need to be created after the
    pci_controller structures are available, but before PCI devices are
    scanned. Currently this ordering is provided by initcalls and the
    sequence is:
    
      1. PHBs are discovered (setup_arch) (early boot, pre-initcalls)
      2. pci_dn are created from the unflattended DT (core initcall)
      3. PHBs are scanned pcibios_init() (subsys initcall)
    
    The new ppc_md.discover_phbs() function is also a core_initcall so we
    can't guarantee ordering between the creation of pci_controllers and
    the creation of pci_dn's which require a pci_controller. We could use
    the postcore, or core_sync initcall levels, but it's cleaner to just
    move the pci_dn setup into the per-PHB inits which occur inside of
    .discover_phb() for these platforms. This brings the boot-time path in
    line with the PHB hotplug path that is used for pseries DLPAR
    operations too.
    
    Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
    [mpe: Squash powermac & maple in to avoid breakage those platforms,
          convert memblock allocs to use kmalloc to avoid warnings]
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20201103043523.916109-2-oohall@gmail.com
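
The ordering argument in the commit message can be illustrated with a toy model. This is a sketch only: the level names follow include/linux/init.h (with the _sync variants omitted), and the initcall names are illustrative labels rather than real kernel symbols, apart from pcibios_init, which the commit message mentions. It shows that two calls at the same level run in whatever order they were linked, while a subsys-level call always runs after both.

```python
# Toy model of initcall ordering. Assumption: levels run strictly in
# sequence and calls within one level run in link order. The names
# "discover_phbs" and "create_pci_dn" are labels for this sketch only.

LEVELS = ["core", "postcore", "arch", "subsys", "fs", "device", "late"]


def boot_order(link_order):
    """Return initcall names in execution order.

    link_order is a list of (name, level) tuples in link order.
    sorted() is stable, so calls sharing a level keep their link order.
    """
    ranked = sorted(link_order, key=lambda call: LEVELS.index(call[1]))
    return [name for name, _level in ranked]


# Same three initcalls, linked in two different orders.
order_a = boot_order([
    ("discover_phbs", "core"),
    ("create_pci_dn", "core"),
    ("pcibios_init", "subsys"),
])
order_b = boot_order([
    ("pcibios_init", "subsys"),
    ("create_pci_dn", "core"),
    ("discover_phbs", "core"),
])
```

In order_b the pci_dn step runs before PHB discovery even though both are at the core level, which is the hazard the commit avoids by moving pci_dn setup into the per-PHB init path.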
Comment 5 Oliver O'Halloran 2021-06-07 04:13:16 UTC
Hmm, it's pretty weird to see an NVMe drive using LSIs. Not too sure what to make of that. I figure there's something screwy going on with interrupt routing, but I don't have any g5 hardware to replicate this with.

Could you add "debug" to the kernel command line and post the dmesg output for a boot with the patch applied and reverted?
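
The LSI observation can be read straight off the lspci output in the description, where both the MSI and MSI-X capabilities show "Enable-" and the device has a routed legacy interrupt. A minimal sketch of that check, assuming `lspci -vv` text for a single device as input; the function name is hypothetical and the sample lines are copied from the description:

```python
import re


def interrupt_mode(lspci_text):
    """Return (mode, irq) for one device's `lspci -vv` output.

    mode is "MSI-X", "MSI" or "INTx (LSI)"; irq is the routed legacy
    IRQ number, or None if lspci printed no "routed to IRQ" line.
    """
    if re.search(r"MSI-X: Enable\+", lspci_text):
        mode = "MSI-X"
    elif re.search(r"MSI: Enable\+", lspci_text):
        mode = "MSI"
    else:
        # Neither capability is enabled, so the device falls back to
        # the legacy level-signalled interrupt pin.
        mode = "INTx (LSI)"
    match = re.search(r"routed to IRQ (\d+)", lspci_text)
    irq = int(match.group(1)) if match else None
    return mode, irq


SAMPLE = "\n".join([
    "Interrupt: pin A routed to IRQ 63",
    "Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+",
    "Capabilities: [b0] MSI-X: Enable- Count=16 Masked-",
])

result = interrupt_mode(SAMPLE)
```

For the sample lines this reports the legacy mode on IRQ 63, consistent with the NVMe drive not getting MSIs on this machine.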
Comment 6 Erhard F. 2021-06-07 06:49:37 UTC
This is already a custom-built kernel with lots of debugging options turned on (see the kernel .config attached to this bug). But of course I can add "debug" to the other kernel command line parameters.

I'll report back when I next get access to this G5, in about 2-3 weeks.
Comment 7 Erhard F. 2021-06-17 23:48:50 UTC
(In reply to Oliver O'Halloran from comment #5)
> Could you add "debug" to the kernel command line and post the dmesg output
> for a boot with the patch applied and reverted?
Ok, on top of 5.13-rc6 I reverted fbbefb3, which went fine except for the "pci-ioda.c" part, where I needed to manually apply the old code.

Here's the vanilla debug dmesg and the debug dmesg with the patch reverted.
Comment 8 Erhard F. 2021-06-17 23:51:00 UTC
Created attachment 297435 [details]
dmesg (5.13-rc6 + debug, PowerMac G5 11,2)
Comment 9 Erhard F. 2021-06-17 23:52:45 UTC
Created attachment 297437 [details]
dmesg (5.13-rc6 w. patch fbbefb3 reverted + debug, PowerMac G5 11,2)
Comment 10 Erhard F. 2021-06-17 23:54:53 UTC
Created attachment 297439 [details]
kernel .config (5.13-rc6, PowerMac G5 11,2)
Comment 11 Erhard F. 2021-06-18 23:29:49 UTC
Created attachment 297473 [details]
dmesg (5.13-rc6 + DEBUG_VM_PGTABLE, PowerMac G5 11,2)

The trace got some additional data with DEBUG_VM_PGTABLE=y, slub_debug=P and page_poison=1:

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
Call Trace:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W         5.13.0-rc6-PowerMacG5+ #2
[c00000000fff7ae0] [c00000000054eafc] .dump_stack+0xe0/0x13c (unreliable)
[c00000000fff7b80] [c0000000000e1428] .__report_bad_irq+0x34/0xf0
[c00000000fff7c20] [c0000000000e1310] .note_interrupt+0x258/0x300
[c00000000fff7ce0] [c0000000000dd58c] .handle_irq_event_percpu+0x64/0x90
[c00000000fff7d70] [c0000000000dd5fc] .handle_irq_event+0x44/0x70
[c00000000fff7e00] [c0000000000e2a14] .handle_fasteoi_irq+0xac/0x158
[c00000000fff7ea0] [c0000000000dc648] .generic_handle_irq+0x38/0x58
[c00000000fff7f10] [c000000000011688] .__do_irq+0x15c/0x238
[c00000000fff7f90] [c00000000001207c] .do_IRQ+0x180/0x188
[c0000000012db810] [c000000000011f9c] .do_IRQ+0xa0/0x188
[c0000000012db8b0] [c000000000007f94] hardware_interrupt_common_virt+0x1a4/0x1b0
--- interrupt: 500 at .power4_idle_nap+0x30/0x34
NIP:  c00000000002cc04 LR: c000000000016828 CTR: c000000000016768
REGS: c0000000012db920 TRAP: 0500   Tainted: G        W          (5.13.0-rc6-PowerMacG5+)
MSR:  9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 44082242  XER: 00000000
IRQMASK: 0 
GPR00: c0000000000167dc c0000000012dbbc0 c0000000012df700 0000000000000001 
GPR04: 0000000000000000 0000000000000000 0000000000000002 9000000000049032 
GPR08: 0000000000000001 c0000000011b3b80 0000000000000001 0000000000000016 
GPR12: 0000000044082242 c0000000023a6000 000000000014aa88 00000000ffb30100 
GPR16: 0000000001e7b8da 0000000001e7bd5f 0000000001e7b9f0 0000000001e88d8d 
GPR20: 0000000001e7bd3d 0000000001e7b98b 0000000001e7bbb2 0000000001e7b89c 
GPR24: 000000000270f700 c000000001081008 c000000000a7c02d 0000000000000000 
GPR28: c0000000012edb9c c0000000011b3b80 9000000000009032 c0000000012ed985 
NIP [c00000000002cc04] .power4_idle_nap+0x30/0x34
LR [c000000000016828] .power4_idle+0xc0/0xe8
--- interrupt: 500
[c0000000012dbbc0] [c0000000000167dc] .power4_idle+0x74/0xe8 (unreliable)
handlers:
[c0000000012dbc40] [c00000000001665c] .arch_cpu_idle+0x80/0x18c
[c0000000012dbcc0] [c00000000081f058] .default_idle_call+0x7c/0xd0
[c0000000012dbd30] [c0000000000a7bcc] .do_idle+0x128/0x140
[c0000000012dbdd0] [c0000000000a7eb4] .cpu_startup_entry+0x28/0x2c
[c0000000012dbe40] [c000000000010044] .rest_init+0x1b0/0x1bc
[c0000000012dbec0] [c0000000010047f4] .start_kernel+0x934/0x9b8
[c0000000012dbf90] [c00000000000b390] start_here_common+0x1c/0x8c
[<000000001553d54b>] .nvme_irq
[<000000001553d54b>] .nvme_irq
Disabling IRQ #63
Comment 12 Oliver O'Halloran 2021-07-05 14:11:35 UTC
Created attachment 297755 [details]
hackfix for MSI init
Comment 13 Oliver O'Halloran 2021-07-05 14:20:11 UTC
Hi,

I got a loaner G5 with an NVMe drive, but I haven't been able to replicate the crash you're seeing. However, I think that's probably because I'm only reading from the NVMe since it's NTFS formatted and I didn't want to trash someone else's files. I'm waiting for a new NVMe drive to arrive so I can do some destructive testing which should hopefully replicate the bug.

In the meanwhile, can you try the patch above? That seems to fix a bug which is causing MSIs to be unusable. I'm not 100% sure why that would matter, but it's possible the crashes are due to some other bug which doesn't appear when MSIs are in use.
Comment 14 Erhard F. 2021-07-05 16:04:58 UTC
Thanks for the patch! I will try it as soon as I get to this G5 again.

I don't know whether write access is necessary to trigger the bug. This past weekend I saw it just by running 'emerge -pv distcc' on its Gentoo partition, which only shows the flags and the version of distcc that would be installed, but does not build anything yet. Still the bug was triggered. The filesystem was ext4, but I've seen it on btrfs at other times. I'm running kernel 5.10.x LTS for the time being, which works just fine.
Comment 15 Erhard F. 2021-07-23 12:47:20 UTC
(In reply to Oliver O'Halloran from comment #13)
> In the meanwhile, can you try the patch above? That seems to fix a bug which
> is causing MSIs to be unusable. I'm not 100% sure why that would matter,
> but it's possible the crashes are due to some other bug which doesn't appear
> when MSIs are in use.
I have now had time to test your patch on top of kernel 5.13-rc6 and 5.13.4. I can't test it on top of 5.14-rc2 due to bug #213803.

Your patch seems to work fine and I no longer get these "irq 63: nobody cared" messages and crashes! However, now when building stuff the G5 sooner or later crashes with:

[...]
Kernel panic - not syncing: corrupted stack end detected inside scheduler
Call Trace:
CPU: 1 PID: 2968 Comm: powerpc64-unkno Tainted: G        W         5.13.0-rc6-PowerMacG5+ #2
[c0000000717178c0] [c0000000005412d0] .dump_stack+0xe0/0x13c (unreliable)
[c000000071717960] [c0000000000681a0] .panic+0x168/0x430
[c000000071717a10] [c000000000809ca0] .__schedule+0x80/0x840
[c000000071717af0] [c0000000000a0ea8] .do_task_dead+0x54/0x58
[c000000071717b70] [c00000000006e7b4] .do_exit+0xa14/0xa6c
[c000000071717c60] [c00000000006e89c] .do_group_exit+0x50/0xb0
[c000000071717cf0] [c00000000006e910] .__wake_up_parent+0x0/0x34
[c000000071717d60] [c000000000021530] .system_call_exception+0x1b4/0x1ec
[c000000071717e10] [c00000000000b9c4] system_call_common+0xe4/0x214
--- interrupt: c00 at 0x3fffa8092aa8
NIP:  00003fffa8092aa8 LR: 00003fffa7ff2d04 CTR: 0000000000000000
REGS: c000000071717e80 TRAP: 0c00   Tainted: G        W          (5.13.0-rc6-PowerMacG5+)
MSR:  900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 22000482  XER: 00000000
IRQMASK: 0 
GPR00: 00000000000000ea 00003fffd04ef2a0 00003fffa81b1300 0000000000000000 
GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR12: 0000000000000000 00003fffa8318c30 000000012e5ff800 00000001136b53b0 
GPR16: 00000001200cec38 00003fffddea1c68 00000001200ceb28 000000000000002f 
GPR20: 0000000000000000 00003fffa81abff8 0000000000000001 00003fffa81aaa58 
GPR24: 0000000000000000 0000000000000000 0000000000000003 0000000000000001 
GPR28: 0000000000000000 00003fffa8311c50 fffffffffffff000 0000000000000000 
NIP [00003fffa8092aa8] 0x3fffa8092aa8
LR [00003fffa7ff2d04] 0x3fffa7ff2d04
--- interrupt: c00
Rebooting in 120 seconds..


I don't know whether this is related. I'll throw more debugging stuff in, file this as a separate issue and link it here just in case.
Comment 16 Erhard F. 2021-08-20 00:12:24 UTC
Created attachment 298371 [details]
dmesg (5.14-rc6, PowerMac G5 11,2)

As there is now a fix for bug #213803, I was able to build v5.14-rc6 and took it for a test ride. Looks like the issue persists:

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 10732 Comm: emerge Tainted: G        W         5.14.0-rc6-PowerMacG5+ #2
Call Trace:
[c00000000fff7af0] [c00000000054de24] .dump_stack_lvl+0x98/0xe0 (unreliable)
[c00000000fff7b80] [c0000000000e1724] .__report_bad_irq+0x34/0xf0
[c00000000fff7c20] [c0000000000e160c] .note_interrupt+0x258/0x300
[c00000000fff7ce0] [c0000000000dd840] .handle_irq_event_percpu+0x5c/0x88
[c00000000fff7d70] [c0000000000dd8b0] .handle_irq_event+0x44/0x70
[c00000000fff7e00] [c0000000000e2d34] .handle_fasteoi_irq+0xac/0x158
[c00000000fff7ea0] [c0000000000dc8bc] .handle_irq_desc+0x34/0x54
[c00000000fff7f10] [c000000000012058] .__do_irq+0x15c/0x238
[c00000000fff7f90] [c000000000012978] .__do_IRQ+0xac/0xb4
[c00000001e9cfcf0] [c00000001e9cfd90] 0xc00000001e9cfd90
[c00000001e9cfd90] [c000000000012ac4] .do_IRQ+0x144/0x194
[c00000001e9cfe10] [c000000000008050] hardware_interrupt_common_virt+0x210/0x220
--- interrupt: 500 at 0x3fffb9b25d9c
NIP:  00003fffb9b25d9c LR: 00003fffb9b2811c CTR: 00003fffb9b25d9c
REGS: c00000001e9cfe80 TRAP: 0500   Tainted: G        W          (5.14.0-rc6-PowerMacG5+)
MSR:  900000000000f032 <SF,HV,EE,PR,FP,ME,IR,DR,RI>  CR: 22482822  XER: 20000000
IRQMASK: 0 
GPR00: 00003fffb9b28100 00003ffffd4e7550 00003fffb9ef6200 00003fffb7977790 
GPR04: 00003fffb7977790 00003fffb55e8b80 0000000000000000 00003fffb9eccac0 
GPR08: 00003fffb9b25d9c 0000000000000000 000000000000000f 0000000000000000 
GPR12: 00003fffb9b7eeb0 00003fffb9fc8890 00003ffffd4e7658 00003fffb395c548 
GPR16: 00003ffffd4e7670 ffffffffffffffff 00003fffb7902480 ffffffffffffffff 
GPR20: 0000000000000000 00003fffb395c528 000000014b8f7878 0000000000000000 
GPR24: 00003fffb7969a80 000000014b8f7830 00003fffb7a750d0 000000000000000a 
GPR28: 00003fffb7a750dc 000000000000007c 000000014b8f9420 00003fffb395c3c0 
NIP [00003fffb9b25d9c] 0x3fffb9b25d9c
LR [00003fffb9b2811c] 0x3fffb9b2811c
--- interrupt: 500
handlers:
[<c0000000015a6568>] .nvme_irq
[<c0000000015a6568>] .nvme_irq
Disabling IRQ #63
Comment 17 Erhard F. 2021-08-20 00:13:04 UTC
Created attachment 298373 [details]
kernel .config (5.14-rc6, PowerMac G5 11,2)
Comment 18 Erhard F. 2021-08-20 16:39:51 UTC
The 'hackfix for MSI init' patch also applies on top of v5.14-rc6.

But, as before, the G5 later runs into bug #213837.
Comment 19 Erhard F. 2022-07-07 10:30:22 UTC
(Luckily) I am no longer able to reproduce this. Re-tested on 5.19-rc5.

Perhaps the problem was also specific to this particular NVMe SSD. I swapped it for another one and have not seen this issue since.

I'll keep an eye on it and will close this bug if it stays like that for the next few stable kernels.
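
To keep an eye on this across kernels, a small script could scan saved dmesg output for the two signatures reported in this bug. This is a minimal sketch: the function name and the report layout are hypothetical, and the patterns only match the exact message wording quoted in the comments above.

```python
import re

# Message patterns as they appear in the traces posted in this bug.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLED = re.compile(r"Disabling IRQ #(\d+)")
PANIC = re.compile(r"Kernel panic - not syncing: (.*)")


def scan_dmesg(text):
    """Collect IRQ complaints and panic reasons from dmesg text."""
    report = {"nobody_cared": [], "disabled": [], "panics": []}
    for line in text.splitlines():
        match = NOBODY_CARED.search(line)
        if match:
            report["nobody_cared"].append(int(match.group(1)))
        match = DISABLED.search(line)
        if match:
            report["disabled"].append(int(match.group(1)))
        match = PANIC.search(line)
        if match:
            report["panics"].append(match.group(1).strip())
    return report


SAMPLE = "\n".join([
    'irq 63: nobody cared (try booting with the "irqpoll" option)',
    "handlers:",
    "[<00000000370eb0ba>] .nvme_irq",
    "Disabling IRQ #63",
    "Kernel panic - not syncing: corrupted stack end detected inside scheduler",
])

report = scan_dmesg(SAMPLE)
```

An empty report from a long uptime under load would be reasonable grounds for closing; any entry for IRQ 63 would suggest the issue is not specific to the swapped-out SSD.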
