Bug 195429
Summary: | unable to handle kernel NULL pointer dereference, mtip_irq_handler+0x262/0x3c0 [mtip32xx] | | |
---|---|---|---|
Product: | IO/Storage | Reporter: | Jozef Mikovic (jmikovic) |
Component: | Other | Assignee: | io_other |
Status: | NEW | | |
Severity: | normal | CC: | bjorn, hladky.jiri, lmusil, tom.leiming |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 4.11.0-0.rc6 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | lshw; dmidecode; mtip32xx: fix mtip_cmd_from_tag; boot messages after patch | | |
Created attachment 255867 [details]
dmidecode
Is this a regression? Sounds like it might be new in v4.11-rc1?

Since it's so reproducible, it should be easy to either bisect it or add debug to mtip_handle_irq() to figure out what's going wrong. This doesn't *look* like a PCI core problem, so I'll try to find a better category to reassign it to.

I assume that when you try to reproduce this on other machines, those machines also have Micron P320 SSDs (the devices claimed by the mtip32xx driver) in them?

Created attachment 255879 [details]
mtip32xx: fix mtip_cmd_from_tag
Hi Jozef Mikovic,
Could you verify if the attached patch fixes your issue?
Thanks,
Ming

(In reply to Bjorn Helgaas from comment #2)
> Is this a regression? Sounds like it might be new in v4.11-rc1?
>
> Since it's so reproducible, it should be easy to either bisect it or add
> debug to mtip_handle_irq() to figure out what's going wrong. This doesn't
> *look* like a PCI core problem, so I'll try to find a better category to
> reassign it to.
>
> I assume that when you try to reproduce this on other machines, those
> machines also have Micron P320 SSDs (the devices claimed by the mtip32xx
> driver) in them?

Hello Bjorn Helgaas,

Yes, we can reproduce this bug on other machines with a Micron P320h. We tested it on an IBM x3650 M4 and an IBM x3750 M4 with the same results.

(In reply to Lei Ming from comment #3)
> Created attachment 255879 [details]
> mtip32xx: fix mtip_cmd_from_tag
>
> Hi Jozef Mikovic,
>
> Could you verify if the attached patch fixes your issue?
>
> Thanks,
> Ming

Hello Lei Ming,

Sorry for the delay. I tested the patch. There is some progress, but the machine still fails.

There are some warnings at boot:

[ 0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 128/64 (20170119/tbfadt-603)
[ 0.000000] ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708)

Please see the complete output in the attachment.

Thanks,
Lukas Musil

Created attachment 255935 [details]
boot messages after patch

(In reply to Lukas Musil from comment #5)
> Sorry for the delay. I tested the patch. There is some progress, but the
> machine still fails.
>
> There are some warnings at boot:
> [ 0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in
> FADT/Gpe0Block: 128/64 (20170119/tbfadt-603)
> [ 0.000000] ACPI BIOS Warning (bug): Invalid length for
> FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708)

How exactly does it fail?

The ACPI BIOS warnings above are potential firmware issues, but nothing we can fix in Linux. They're not related to the original mtip_irq_handler NULL pointer issue.

(In reply to Bjorn Helgaas from comment #7)
> (In reply to Lukas Musil from comment #5)
> > Sorry for the delay. I tested the patch. There is some progress, but the
> > machine still fails.
> >
> > There are some warnings at boot:
> > [ 0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in
> > FADT/Gpe0Block: 128/64 (20170119/tbfadt-603)
> > [ 0.000000] ACPI BIOS Warning (bug): Invalid length for
> > FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708)
>
> How exactly does it fail?
>
> The ACPI BIOS warnings above are potential firmware issues, but nothing we
> can fix in Linux. They're not related to the original mtip_irq_handler NULL
> pointer issue.

The machine restarts immediately during boot, every time at the same place. The complete console output is above. We do not see any kernel crash (we are using remote management or a serial console). If we disable the PCIe slot with the Micron RealSSD P320h in BIOS/UEFI, the system boots normally.

I have posted three patches in the following link:

http://marc.info/?l=linux-block&m=149258785408240&w=2

which should address the two issues.

Thanks,
Ming

(In reply to Lei Ming from comment #9)
> I have posted three patches in the following link:
>
> http://marc.info/?l=linux-block&m=149258785408240&w=2
>
> which should address the two issues.
>
> Thanks,
> Ming

After applying the patches from the link above, the machine with the Micron RealSSD P320h boots correctly.

Thanks,
Lukas

Created attachment 255865 [details]
lshw

Hello,

I am getting a kernel panic on reboot after installing a 4.11 kernel. The panic occurs every time I try to install a 4.11 kernel (since rc1), but I cannot reproduce it on another machine.

[ 3.896646] BUG: unable to handle kernel NULL pointer dereference at 0000000000000170
[ 3.896652] IP: mtip_irq_handler+0x262/0x3c0 [mtip32xx]
[ 3.896653] PGD 0
[ 3.896653]
[ 3.896654] Oops: 0000 [#1] SMP
[ 3.896655] Modules linked in: ttm ata_piix drm libata crc32c_intel megaraid_sas bnx2 mtip32xx(+)
[ 3.896660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-0.rc5.git0.1.el7.x86_64 #1
[ 3.896661] Hardware name: IBM System x3650 M3 -[7945J2G]-/69Y4438, BIOS -[D6E162AUS-1.20]- 05/07/2014
[ 3.896661] task: ffffffff8fc104c0 task.stack: ffffffff8fc00000
[ 3.896664] RIP: 0010:mtip_irq_handler+0x262/0x3c0 [mtip32xx]
[ 3.896664] RSP: 0018:ffff96931b003e80 EFLAGS: 00010046
[ 3.896665] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 3.896666] RDX: 0000000000000148 RSI: 0000000000000000 RDI: ffff96930e2c3b00
[ 3.896666] RBP: ffff96931b003eb0 R08: 0000000000000004 R09: 00000000000000fe
[ 3.896667] R10: 0000000000000000 R11: 0000000000000018 R12: ffff96930f346000
[ 3.896668] R13: ffff969310015000 R14: 0000000000000000 R15: 0000000000000000
[ 3.896669] FS: 0000000000000000(0000) GS:ffff96931b000000(0000) knlGS:0000000000000000
[ 3.896670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.896670] CR2: 0000000000000170 CR3: 00000005acc09000 CR4: 00000000000006f0
[ 3.896671] Call Trace:
[ 3.896672] <IRQ>
[ 3.896677] __handle_irq_event_percpu+0x3c/0x1a0
[ 3.896678] handle_irq_event_percpu+0x32/0x80
[ 3.896679] handle_irq_event+0x3b/0x60
[ 3.896681] handle_edge_irq+0x8d/0x130
[ 3.896684] handle_irq+0xab/0x130
[ 3.896687] do_IRQ+0x48/0xd0
[ 3.896688] common_interrupt+0x93/0x93
[ 3.896691] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[ 3.896691] RSP: 0018:ffffffff8fc03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff2d
[ 3.896693] RAX: ffff96931b0195c0 RBX: ffff96931b021600 RCX: 000000000000001f
[ 3.896693] RDX: 0000000000000000 RSI: ffff96931b016dd8 RDI: 0000000000000000
[ 3.896694] RBP: ffffffff8fc03e00 R08: 0000000000000001 R09: cccccccccccccccd
[ 3.896694] R10: 0000000000000050 R11: 0000000000000018 R12: 0000000000000003
[ 3.896695] R13: 0000000000000000 R14: ffffffff8fce9c80 R15: 00000000e841a8b5
[ 3.896696] </IRQ>
[ 3.896698] ? cpuidle_enter_state+0xc0/0x260
[ 3.896699] cpuidle_enter+0x17/0x20
[ 3.896701] call_cpuidle+0x2c/0x50
[ 3.896702] do_idle+0x175/0x200
[ 3.896704] cpu_startup_entry+0x71/0x80
[ 3.896705] rest_init+0x77/0x80
[ 3.896708] start_kernel+0x4b1/0x4d2
[ 3.896710] ? set_init_arg+0x55/0x55
[ 3.896711] ? early_idt_handler_array+0x120/0x120
[ 3.896713] x86_64_start_reservations+0x24/0x26
[ 3.896714] x86_64_start_kernel+0x14c/0x16f
[ 3.896716] start_cpu+0x14/0x14
[ 3.896717] Code: 8d 90 48 01 00 00 80 e1 01 0f 84 4d ff ff ff 48 85 d2 0f 84 44 ff ff ff 49 8b 8c 24 98 00 00 00 8b 09 80 e1 01 0f 85 31 ff ff ff <48> 8b 80 70 01 00 00 48 85 c0 0f 84 21 ff ff ff 31 c9 31 f6 4c
[ 3.896741] RIP: mtip_irq_handler+0x262/0x3c0 [mtip32xx] RSP: ffff96931b003e80
[ 3.896741] CR2: 0000000000000170
[ 3.896749] ---[ end trace 6d7422721e045b62 ]---
[ 3.896750] Kernel panic - not syncing: Fatal exception in interrupt
[ 3.900483] Kernel Offset: 0xe000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
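
The oops above can be cross-checked by hand: the byte marked `<48>` in the `Code:` line starts the faulting instruction, and its memory operand should match CR2. The following is a minimal sketch of that check, assuming standard x86-64 instruction encoding. The helper names (`faulting_bytes`, `decode_mov_disp32`, `register`) are hypothetical and not part of any kernel tooling, and the decoder handles only the one encoding that appears here (`REX.W 8B /r` with a 32-bit displacement and no SIB byte). The `OOPS` string is a four-line excerpt of the log above.

```python
import re

# Excerpt of the oops text from this report (only the lines the sketch needs).
OOPS = """\
[ 3.896646] BUG: unable to handle kernel NULL pointer dereference at 0000000000000170
[ 3.896665] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 3.896670] CR2: 0000000000000170 CR3: 00000005acc09000 CR4: 00000000000006f0
[ 3.896717] Code: 8d 90 48 01 00 00 80 e1 01 0f 84 4d ff ff ff 48 85 d2 0f 84 44 ff ff ff 49 8b 8c 24 98 00 00 00 8b 09 80 e1 01 0f 85 31 ff ff ff <48> 8b 80 70 01 00 00 48 85 c0 0f 84 21 ff ff ff 31 c9 31 f6 4c
"""

# Register numbering for ModRM fields when no REX.R/REX.B extension bit is set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(oops):
    """Return the bytes of the Code: line, starting at the <xx> marker."""
    tokens = re.search(r"Code: (.*)", oops).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def decode_mov_disp32(insn):
    """Decode only `REX.W 8B /r` with mod=10 (disp32) and no SIB byte.

    Returns (destination register, base register, displacement).
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("addressing form not handled by this sketch")
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return REG64[reg], REG64[rm], disp


def register(oops, name):
    """Return the first logged value of a register such as RAX or CR2."""
    match = re.search(rf"\b{name}: ([0-9a-f]{{16}})", oops)
    return int(match.group(1), 16)


if __name__ == "__main__":
    dst, base, disp = decode_mov_disp32(faulting_bytes(OOPS))
    address = register(OOPS, base.upper()) + disp
    print(f"mov {disp:#x}(%{base}),%{dst}")
    print(f"effective address {address:#x}, CR2 {register(OOPS, 'CR2'):#x}")
```

Under these assumptions the instruction decodes to a load from offset 0x170 off RAX, and RAX is logged as zero, which agrees with the CR2 value of 0x170: the handler read a field at offset 0x170 through a NULL pointer. The sketch does not identify which structure that pointer refers to; that requires the driver source or debug info.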