Bug 194837
Summary: | VM with virtio-scsi drive often crashes during boot with kernel 4.11rc1 | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Adam Williamson (adamw) |
Component: | SCSI | Assignee: | linux-scsi (linux-scsi) |
Status: | NEW --- | ||
Severity: | blocking | CC: | linux, normand, osandov, regressions |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.11rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: |
Description
Adam Williamson
2017-03-09 23:53:52 UTC
Downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1430043 , Fedora kernel maintainers can provide info on the delta between the affected kernel builds, I hope. The changelog just says: * Mon Mar 06 2017 Laura Abbott - 4.11.0-0.rc1.git0.1 - Linux v4.11-rc1 * Mon Mar 06 2017 Laura Abbott - Disable debugging options. * Fri Mar 03 2017 Laura Abbott - 4.11.0-0.rc0.git9.1 - Linux v4.10-11319-gc82be9d so the delta is to 'v4.10-11319-gc82be9d', but I dunno what that means exactly. I seem to get a different traceback every time this happens in local testing, aside from the times it just hangs - but it always seems to hang or crash at approximately the same point in the boot process, right after the "virtio-pci 0000:00:0a.0: virtio_pci: leaving for legacy driver" messages. The last one I got was: [ 1.327442] Kernel panic - not syncing: CRED: put_cred_rcu() sees ffff938bb584c000 with usage 1953655158 [ 1.327442] [ 1.328949] CPU: 1 PID: 21 Comm: rcuos/1 Not tainted 4.11.0-0.rc1.git0.1.fc26.x86_64 #1 [ 1.328949] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 1.331157] Call Trace: [ 1.331157] dump_stack+0x63/0x84 [ 1.331157] panic+0xe4/0x22d [ 1.331157] ? __prepare_to_swait+0x52/0x80 [ 1.331157] put_cred_rcu+0xb0/0xb0 [ 1.331157] rcu_nocb_kthread+0x15f/0x500 [ 1.335141] ? get_state_synchronize_sched+0x20/0x20 [ 1.335141] kthread+0x11e/0x140 [ 1.335141] ? kthread_park+0x90/0x90 [ 1.335141] ret_from_fork+0x2c/0x40 [ 1.335141] Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 1.335141] ---[ end Kernel panic - not syncing: CRED: put_cred_rcu() sees ffff938bb584c000 with usage 1953655158 aside from that I've seen: [ 1.419027] Call Trace: [ 1.435026] <IRQ> [ 1.435026] mpage_end_io+0x39/0x60 [ 1.435026] bio_endio+0x54/0x60 [ 1.435026] blk_update_request+0x8d/0x340 [ 1.435026] blk_mq_end_request+0x1a/0x80 [ 1.435026] virtblk_request_done+0x45/0x80 [virtio_blk] [ 1.435026] __blk_mq_complete_request+0xc9/0x120 [ 1.435026] blk_mq_complete_request+0x1e/0x20 [ 1.435026] virtblk_done+0x74/0x100 [virtio_blk] [ 1.435026] vring_interrupt+0x34/0x80 [virtio_ring] [ 1.435026] __handle_irq_event_percpu+0x3f/0x1a0 [ 1.435026] handle_irq_event_percpu+0x32/0x80 [ 1.435026] handle_irq_event+0x2c/0x50 [ 1.435026] handle_edge_irq+0x6f/0x120 [ 1.435026] handle_irq+0xad/0x130 [ 1.435026] do_IRQ+0x46/0xd0 [ 1.435026] common_interrupt+0x93/0x93 [ 1.435026] RIP: 0010:native_safe_halt+0x6/0x10 [ 1.435026] RSP: 0018:ffffffffaae03de0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff7e [ 1.435026] RAX: 0000000000000000 RBX: ffffffffaae104c0 RCX: 0000000000000000 [ 1.435026] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1.435026] RBP: ffffffffaae03de0 R08: 000000001c32e304 R09: ffff883d767b1f00 [ 1.435026] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1.435026] R13: ffffffffaae104c0 R14: 0000000000000000 R15: 0000000000000000 [ 1.435026] </IRQ> [ 1.435026] default_idle+0x20/0xe0 [ 1.435026] arch_cpu_idle+0xf/0x20 [ 1.435026] default_idle_call+0x23/0x30 [ 1.435026] do_idle+0x170/0x200 [ 1.435026] cpu_startup_entry+0x71/0x80 [ 1.435026] rest_init+0x77/0x80 [ 1.435026] start_kernel+0x45a/0x47b [ 1.435026] ? early_idt_handler_array+0x120/0x120 [ 1.435026] x86_64_start_reservations+0x24/0x26 [ 1.435026] x86_64_start_kernel+0x13c/0x15f [ 1.435026] start_cpu+0x14/0x14 [ 1.435026] Code: ff c3 b8 f8 ff ff ff c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 40 84 f6 75 12 85 d2 75 21 <48> 8b 47 20 a8 01 75 66 f0 80 0f 08 eb 2a 85 d2 75 35 48 89 df [ 1.435026] RIP: page_endio+0x1a/0xa0 RSP: ffff883dbfc03d68 [ 1.435026] CR2: 0000000000000020 [ 1.435026] ---[ end trace cf261be33b5d0d05 ]--- [ 1.435026] Kernel panic - not syncing: Fatal exception in interrupt [ 1.435026] Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 1.435026] ---[ end Kernel panic - not syncing: Fatal exception in interrupt and: [ 1.530432] general protection fault: 0000 [#1] SMP [ 1.531004] Modules linked in: virtio_scsi(+) virtio_console qxl drm_kms_helper ttm drm crc32c_intel sym53c8xx serio_raw scsi_transport_spi virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg [ 1.531004] CPU: 0 PID: 314 Comm: systemd-udevd Not tainted 4.11.0-0.rc1.git0.1.fc26.x86_64 #1 [ 1.531004] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 1.531004] task: ffff9c49760c0000 task.stack: ffffbfbcc05d8000 [ 1.531004] RIP: 0010:kmem_cache_alloc+0x84/0x1b0 [ 1.531004] RSP: 0018:ffffbfbcc05dbe60 EFLAGS: 00010246 [ 1.531004] RAX: 2d316f6974726976 RBX: 00000000014000c0 RCX: 00007fcf9e251b30 [ 1.531004] RDX: 0000000000002192 RSI: 00000000014000c0 RDI: 000000000001d780 [ 1.531004] RBP: ffffbfbcc05dbe90 R08: ffff9c49bfc1d780 R09: 0000000000000077 [ 1.531004] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000014000c0 [ 1.531004] R13: ffff9c49bd098140 R14: 2d316f6974726976 R15: ffff9c49bd098140 [ 1.531004] FS: 00007fcf9f3398c0(0000) GS:ffff9c49bfc00000(0000) knlGS:0000000000000000 [ 1.531004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.531004] CR2: 00007ffe1682ffc8 CR3: 000000003617c000 CR4: 00000000000006f0 [ 1.531004] Call Trace: [ 1.531004] ? prepare_creds+0x26/0xf0 [ 1.531004] prepare_creds+0x26/0xf0 [ 1.531004] SyS_access+0x41/0x230 [ 1.531004] do_syscall_64+0x67/0x170 [ 1.531004] entry_SYSCALL64_slow_path+0x25/0x25 [ 1.531004] RIP: 0033:0x7fcf9df8651a [ 1.531004] RSP: 002b:00007ffe16830f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000015 [ 1.531004] RAX: ffffffffffffffda RBX: 0000556016def330 RCX: 00007fcf9df8651a [ 1.531004] RDX: 00316f6974726976 RSI: 0000000000000000 RDI: 00007ffe16830f90 [ 1.531004] RBP: 00007ffe16831030 R08: 0000000000001000 R09: 0000000000000077 [ 1.531004] R10: 00007fcf9e251b30 R11: 0000000000000246 R12: 00007ffe16830f90 [ 1.531004] R13: 0000556016dea880 R14: 00000000ffffffea R15: 0000556015e1baf8 [ 1.531004] Code: 49 83 78 10 00 4d 8b 30 0f 84 fe 00 00 00 4d 85 f6 0f 84 f5 00 00 00 49 63 47 20 49 8b 3f 4c 01 f0 40 f6 c7 0f 0f 85 21 01 00 00 <48> 8b 18 48 8d 4a 01 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 74 [ 1.531004] RIP: kmem_cache_alloc+0x84/0x1b0 RSP: ffffbfbcc05dbe60 [ 1.561110] ---[ end trace b9396075a1c8bec4 ]--- that one actually shows *12* general protection faults, all with the same traceback. Seeing this also (RHEL Host, KVM) Adam, did you report this here only or also to some mailinglist? (In reply to Adam Williamson from comment #1) > so the delta is to 'v4.10-11319-gc82be9d', but I dunno what that means > exactly. FWIW, most likely suspects afaics: the second batches with changes to KVM and SCSI https://git.kernel.org/torvalds/c/2d62e0768d3c28536d4cfe4c40ba1e5e8e442a93 https://git.kernel.org/torvalds/c/a3b4924b027f9a4b95ce89a914c1e0459e76f18a Only here and downstream Bugzilla. I'm not really signed up to any kernel mailing lists. Please do feel free to forward it anywhere appropriate... FWIW, the culprit is somewhere between 10851-g20b83643abbc and 11620-ga3b4924b027f Leaving the keyboard now; I'm down to 1827adb11ad2...590dce2d4934 (includes e0d072250a54669dce876d8ade70e417356aae74 Merge branch 'for-linus' of git://git.kernel.dk/linux-block ) This was fixed in v4.11-rc2. (In reply to Omar Sandoval from comment #8) > This was fixed in v4.11-rc2. Care to provide a few more details and a commit that fixes this? Because I'm still seeing crashes with 4.11-rc2; but a few details looks fishy and I more and more wonder if I'm dealing with two bugs here :-/ (In reply to Thorsten Leemhuis from comment #9) > (In reply to Omar Sandoval from comment #8) >> This was fixed in v4.11-rc2. > Care to provide a few more details and a commit that fixes this? Okay, now assuming you mean https://git.kernel.org/torvalds/c/737f98cfe7de8df7433a4d846850aa8efa44bd48 Will investigateā¦ The fix was https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df23de55615fa7a190a85f49a950ccecdd9102f3, which was part of this patch series: c01228db4ba9 Revert "scsi, block: fix duplicate bdi name registration crashes" 90f16fddcc28 block: Make del_gendisk() safer for disks without queues df23de55615f bdi: Fix use-after-free in wb_congested_put() b6f8fec4448a block: Allow bdi re-registration If you are still seeing crashes on -rc2, could you please report it to linux-block@vger.kernel.org? (In reply to Omar Sandoval from comment #11) > The fix was [...] Thx for providing details! > If you are still seeing crashes on -rc2, Yes, but different ones every few boot attempts. But I just noticed those crashes happen without virtio-scsi as well. So definitely something else now. 4.10 otoh works well; just retested to be sure. Sigh, I guess I need to take a closer look to find out what's wrong; that might take a day or two. (In reply to Omar Sandoval from comment #11) > > If you are still seeing crashes on -rc2, could you please report it to > linux-block@vger.kernel.org? FWIW, these are the crashes I see: https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG But those happen without virtio-scsi as well. Status update: Had a hunch and reverted the vhost/virtio changes (https://git.kernel.org/torvalds/c/54d7989f476ca57fc3c5cc71524c480ccb74c481 ) on top of master and now it's running fine. Will look closer now which of the changes is the culprit. Adam or somebody else: Can you please close this Bug to avoid confusion? There clearly were some bugs in virtio_scsi, but those were already fixed. But there are more bugs (at least for me): I bisectected the problem I'm still seeing with todays mainline and found that "virtio_pci: use shared interrupts for virtqueues" (5c34d002dcc7; from hch and mst) was the first bad commit for me. Filed Bug 194911 to track that. |