Bug 195429 - unable to handle kernel NULL pointer dereference, mtip_irq_handler+0x262/0x3c0 [mtip32xx]
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-12 12:39 UTC by Jozef Mikovic
Modified: 2017-04-26 06:10 UTC
CC List: 4 users

See Also:
Kernel Version: 4.11.0-0.rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lshw (43.77 KB, text/plain)
2017-04-12 12:39 UTC, Jozef Mikovic
dmidecode (23.71 KB, text/plain)
2017-04-12 12:40 UTC, Jozef Mikovic
mtip32xx: fix mtip_cmd_from_tag (1.54 KB, patch)
2017-04-13 06:49 UTC, Lei Ming
boot messages after patch (42.70 KB, text/plain)
2017-04-20 14:09 UTC, Lukas Musil

Description Jozef Mikovic 2017-04-12 12:39:51 UTC
Created attachment 255865
lshw

Hello,
I am getting a kernel panic on reboot after installing the 4.11 kernel. The panic occurs every time I install a 4.11 kernel (since rc1), but I cannot reproduce it on another machine.

[    3.896646] BUG: unable to handle kernel NULL pointer dereference at 0000000000000170 
[    3.896652] IP: mtip_irq_handler+0x262/0x3c0 [mtip32xx] 
[    3.896653] PGD 0  
[    3.896653]  
[    3.896654] Oops: 0000 [#1] SMP 
[    3.896655] Modules linked in: ttm ata_piix drm libata crc32c_intel megaraid_sas bnx2 mtip32xx(+) 
[    3.896660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-0.rc5.git0.1.el7.x86_64 #1 
[    3.896661] Hardware name: IBM System x3650 M3 -[7945J2G]-/69Y4438, BIOS -[D6E162AUS-1.20]- 05/07/2014 
[    3.896661] task: ffffffff8fc104c0 task.stack: ffffffff8fc00000 
[    3.896664] RIP: 0010:mtip_irq_handler+0x262/0x3c0 [mtip32xx] 
[    3.896664] RSP: 0018:ffff96931b003e80 EFLAGS: 00010046 
[    3.896665] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 
[    3.896666] RDX: 0000000000000148 RSI: 0000000000000000 RDI: ffff96930e2c3b00 
[    3.896666] RBP: ffff96931b003eb0 R08: 0000000000000004 R09: 00000000000000fe 
[    3.896667] R10: 0000000000000000 R11: 0000000000000018 R12: ffff96930f346000 
[    3.896668] R13: ffff969310015000 R14: 0000000000000000 R15: 0000000000000000 
[    3.896669] FS:  0000000000000000(0000) GS:ffff96931b000000(0000) knlGS:0000000000000000 
[    3.896670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[    3.896670] CR2: 0000000000000170 CR3: 00000005acc09000 CR4: 00000000000006f0 
[    3.896671] Call Trace: 
[    3.896672]  <IRQ> 
[    3.896677]  __handle_irq_event_percpu+0x3c/0x1a0 
[    3.896678]  handle_irq_event_percpu+0x32/0x80 
[    3.896679]  handle_irq_event+0x3b/0x60 
[    3.896681]  handle_edge_irq+0x8d/0x130 
[    3.896684]  handle_irq+0xab/0x130 
[    3.896687]  do_IRQ+0x48/0xd0 
[    3.896688]  common_interrupt+0x93/0x93 
[    3.896691] RIP: 0010:cpuidle_enter_state+0xe1/0x260 
[    3.896691] RSP: 0018:ffffffff8fc03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff2d 
[    3.896693] RAX: ffff96931b0195c0 RBX: ffff96931b021600 RCX: 000000000000001f 
[    3.896693] RDX: 0000000000000000 RSI: ffff96931b016dd8 RDI: 0000000000000000 
[    3.896694] RBP: ffffffff8fc03e00 R08: 0000000000000001 R09: cccccccccccccccd 
[    3.896694] R10: 0000000000000050 R11: 0000000000000018 R12: 0000000000000003 
[    3.896695] R13: 0000000000000000 R14: ffffffff8fce9c80 R15: 00000000e841a8b5 
[    3.896696]  </IRQ> 
[    3.896698]  ? cpuidle_enter_state+0xc0/0x260 
[    3.896699]  cpuidle_enter+0x17/0x20 
[    3.896701]  call_cpuidle+0x2c/0x50 
[    3.896702]  do_idle+0x175/0x200 
[    3.896704]  cpu_startup_entry+0x71/0x80 
[    3.896705]  rest_init+0x77/0x80 
[    3.896708]  start_kernel+0x4b1/0x4d2 
[    3.896710]  ? set_init_arg+0x55/0x55 
[    3.896711]  ? early_idt_handler_array+0x120/0x120 
[    3.896713]  x86_64_start_reservations+0x24/0x26 
[    3.896714]  x86_64_start_kernel+0x14c/0x16f 
[    3.896716]  start_cpu+0x14/0x14 
[    3.896717] Code: 8d 90 48 01 00 00 80 e1 01 0f 84 4d ff ff ff 48 85 d2 0f 84 44 ff ff ff 49 8b 8c 24 98 00 00 00 8b 09 80 e1 01 0f 85 31 ff ff ff <48> 8b 80 70 01 00 00 48 85 c0 0f 84 21 ff ff ff 31 c9 31 f6 4c  
[    3.896741] RIP: mtip_irq_handler+0x262/0x3c0 [mtip32xx] RSP: ffff96931b003e80 
[    3.896741] CR2: 0000000000000170 
[    3.896749] ---[ end trace 6d7422721e045b62 ]--- 
[    3.896750] Kernel panic - not syncing: Fatal exception in interrupt 
[    3.900483] Kernel Offset: 0xe000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
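
For readers decoding the trace above, here is a minimal sketch of how the faulting instruction bytes (the byte marked with <..> in the "Code:" line) can be checked against the register dump. The helper names (reg, faulting_bytes, decode_mov_disp32) are illustrative only and not part of any kernel tooling, and the decoder deliberately handles just the one instruction form seen in this oops (a 64-bit MOV load with a 32-bit displacement and no SIB byte).

```python
import re

# Lines copied verbatim from the oops in this report.
OOPS = """\
[    3.896665] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[    3.896670] CR2: 0000000000000170 CR3: 00000005acc09000 CR4: 00000000000006f0
[    3.896717] Code: 8d 90 48 01 00 00 80 e1 01 0f 84 4d ff ff ff 48 85 d2 0f 84 44 ff ff ff 49 8b 8c 24 98 00 00 00 8b 09 80 e1 01 0f 85 31 ff ff ff <48> 8b 80 70 01 00 00 48 85 c0 0f 84 21 ff ff ff 31 c9 31 f6 4c
"""


def reg(name, text):
    """Return the value of a register printed as 'NAME: <16 hex digits>'."""
    return int(re.search(rf"\b{name}: ([0-9a-f]{{16}})", text).group(1), 16)


def faulting_bytes(text, count=7):
    """Return `count` bytes starting at the <..> marker in the Code: line."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return bytes(int(b.strip("<>"), 16) for b in code[start:start + count])


def decode_mov_disp32(insn):
    """Decode only REX.W 8B /r with mod=10 (base register + disp32)."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex == 0x48 and opcode == 0x8B, "not a 64-bit MOV r64, r/m64"
    mod, rm = modrm >> 6, modrm & 7
    assert mod == 0b10 and rm != 4, "expected base register + disp32, no SIB"
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return rm, disp  # rm == 0 means the base register is RAX


rax = reg("RAX", OOPS)
cr2 = reg("CR2", OOPS)
base, disp = decode_mov_disp32(faulting_bytes(OOPS))
print(f"base reg index={base}, disp={disp:#x}, RAX={rax:#x}, CR2={cr2:#x}")
```

The decoded load uses RAX as its base with displacement 0x170, and RAX is 0, so the computed address equals the CR2 value of 0x170. That is consistent with a field at offset 0x170 being read through a NULL pointer. Which structure and field this corresponds to cannot be determined from the log alone.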
Comment 1 Jozef Mikovic 2017-04-12 12:40:11 UTC
Created attachment 255867
dmidecode
Comment 2 Bjorn Helgaas 2017-04-12 14:10:38 UTC
Is this a regression?  Sounds like it might be new in v4.11-rc1?

Since it's so reproducible, it should be easy to either bisect it or add debug to mtip_handle_irq() to figure out what's going wrong.  This doesn't *look* like a PCI core problem, so I'll try to find a better category to reassign it to.

I assume that when you try to reproduce this on other machines, those machines also have Micron P320 SSDs (the devices claimed by the mtip32xx driver) in them?
Comment 3 Lei Ming 2017-04-13 06:49:33 UTC
Created attachment 255879
mtip32xx: fix mtip_cmd_from_tag

Hi Jozef Mikovic,

Could you verify if the attached patch fixes your issue?


Thanks,
Ming
Comment 4 Lukas Musil 2017-04-20 14:00:09 UTC
(In reply to Bjorn Helgaas from comment #2)
> Is this a regression?  Sounds like it might be new in v4.11-rc1?
> 
> Since it's so reproducible, it should be easy to either bisect it or add
> debug to mtip_handle_irq() to figure out what's going wrong.  This doesn't
> *look* like a PCI core problem, so I'll try to find a better category to
> reassign it to.
> 
> I assume that when you try to reproduce this on other machines, those
> machines also have Micron P320 SSDs (the devices claimed by the mtip32xx
> driver) in them?

Hello Bjorn Helgaas,

Yes, we can reproduce this bug on other machines with a Micron P320h. We tested it on an IBM x3650 M4 and an IBM x3750 M4 with the same results.
Comment 5 Lukas Musil 2017-04-20 14:06:01 UTC
(In reply to Lei Ming from comment #3)
> Created attachment 255879
> mtip32xx: fix mtip_cmd_from_tag
> 
> Hi Jozef Mikovic,
> 
> Could you verify if the attached patch fixes your issue?
> 
> 
> Thanks,
> Ming

Hello Lei Ming,

Sorry for the delay. I tested the patch; there is some progress, but the machine still fails.

There are some warnings at boot:
[    0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 128/64 (20170119/tbfadt-603) 
[    0.000000] ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708) 

Please see the complete output in the attachment.

Thanks, Lukas Musil
Comment 6 Lukas Musil 2017-04-20 14:09:13 UTC
Created attachment 255935
boot messages after patch
Comment 7 Bjorn Helgaas 2017-04-20 18:23:14 UTC
(In reply to Lukas Musil from comment #5)
> sorry for delay. I test the patch, there is some progress, but machine still
> fails.
> 
> There is some warnings at boot:
> [    0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in
> FADT/Gpe0Block: 128/64 (20170119/tbfadt-603) 
> [    0.000000] ACPI BIOS Warning (bug): Invalid length for
> FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708) 

How exactly does it fail?

The ACPI BIOS warnings above are potential firmware issues, but nothing we can fix in Linux.  They're not related to the original mtip_irq_handler NULL pointer issue.
Comment 8 Lukas Musil 2017-04-21 13:02:39 UTC
(In reply to Bjorn Helgaas from comment #7)
> (In reply to Lukas Musil from comment #5)
> > sorry for delay. I test the patch, there is some progress, but machine still
> > fails.
> > 
> > There is some warnings at boot:
> > [    0.000000] ACPI BIOS Warning (bug): 32/64X length mismatch in
> > FADT/Gpe0Block: 128/64 (20170119/tbfadt-603) 
> > [    0.000000] ACPI BIOS Warning (bug): Invalid length for
> > FADT/Pm1aControlBlock: 32, using default 16 (20170119/tbfadt-708) 
> 
> How exactly does it fail?
> 
> The ACPI BIOS warnings above are potential firmware issues, but nothing we
> can fix in Linux.  They're not related to the original mtip_irq_handler NULL
> pointer issue.


The machine restarts immediately during boot, every time at the same place. The complete console output is in the attachment above. We do not see any kernel crash message (we are using remote management or a serial console). If we disable the PCIe slot holding the Micron RealSSD P320h in BIOS/UEFI, the system boots normally.
Comment 9 Lei Ming 2017-04-21 15:57:49 UTC
I have posted three patches at the following link:

http://marc.info/?l=linux-block&m=149258785408240&w=2

which should address the two issues.

Thanks,
Ming
Comment 10 Lukas Musil 2017-04-26 06:10:48 UTC
(In reply to Lei Ming from comment #9)
> I have posted three patches in the following link:
> 
> http://marc.info/?l=linux-block&m=149258785408240&w=2
> 
> which should address the two issues.
> 
> Thanks,
> Ming

After applying the patches from the link above, the machine with the Micron RealSSD P320h boots correctly.

Thanks, Lukas
