Bug 202283

Summary: virtio-balloon, panic on "Unable to handle kernel pointer dereference in virtual kernel address space" issue
Product: Virtualization Reporter: Colin Ian King (colin.king)
Component: kvmAssignee: virtualization_kvm
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: blocking    
Priority: P1    
Hardware: S390-64   
OS: Linux   
Kernel Version: 4.19 Subsystem:
Regression: No Bisected commit-id:

Description Colin Ian King 2019-01-15 14:47:50 UTC
s390x inside VM, hitting the following kernel panic on 4.20:

[ 0.394835] Linux version 4.20.0-2-generic (buildd@bos02-s390x-015) (gcc version 8.2.0 (Ubuntu 8.2.0-13ubuntu1)) #3-Ubuntu SMP Thu Jan 3 18:43:01 UTC 2019 (Ubuntu 4.20.0-2.3-generic 4.20.0)
[ 0.394838] setup.289988: Linux is running under KVM in 64-bit mode
[ 0.394846] setup.b050d0: The maximum memory size is 2048MB
[ 0.394866] numa.196305: NUMA mode: plain
[ 0.394893] cpu.33a262: 2 configured CPUs, 0 standby CPUs
[ 0.394999] Write protected kernel read-only data: 10464k
[ 0.395034] Zone ranges:
[ 0.395035] DMA [mem 0x0000000000000000-0x000000007fffffff]
[ 0.395037] Normal empty
[ 0.395038] Movable zone start for each node
[ 0.395039] Early memory node ranges
[ 0.395041] node 0: [mem 0x0000000000000000-0x000000007fffffff]
[ 0.395042] Initmem setup node 0 [mem 0x0000000000000000-0x000000007fffffff]
[ 0.413802] percpu: Embedded 25 pages/cpu @(____ptrval____) s62208 r8192 d32000 u102400
[ 0.413824] Built 1 zonelists, mobility grouping on. Total pages: 516096
[ 0.413825] Policy zone: DMA
[ 0.413827] Kernel command line: root=LABEL=cloudimg-rootfs
[ 0.450276] Memory: 2034164K/2097152K available (8116K kernel code, 1079K rwdata, 2344K rodata, 992K init, 772K bss, 62988K reserved, 0K cma-reserved)
[ 0.450283] random: get_random_u64 called from kmem_cache_open+0x4c/0x4f0 with crng_init=0
[ 0.450432] SLUB: HWalign=256, Order=0-3, MinObjects=0, CPUs=3, Nodes=1
[ 0.450434] ftrace: allocating 27385 entries in 107 pages
[ 0.473325] rcu: Hierarchical RCU implementation.
[ 0.473327] rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=3.
[ 0.473328] Tasks RCU enabled.
[ 0.473329] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[ 0.473330] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=3
[ 0.475371] NR_IRQS: 3, nr_irqs: 3, preallocated irqs: 3
[ 0.475402] clocksource: tod: mask: 0xffffffffffffffff max_cycles: 0x3b0a9be803b0a9, max_idle_ns: 1805497147909793 ns
[ 0.475558] printk: console [ttyS1] enabled
[ 0.475623] pid_max: default: 32768 minimum: 301
[ 0.475658] LSM: Security Framework initializing
[ 0.475660] Yama: becoming mindful.
[ 0.475677] AppArmor: AppArmor initialized
[ 0.476134] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 0.476352] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[ 0.476370] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.476377] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.476726] rcu: Hierarchical SRCU implementation.
[ 0.476927] smp: Bringing up secondary CPUs ...
[ 0.477181] smp: Brought up 1 node, 2 CPUs
[ 0.477549] devtmpfs: initialized
[ 0.477882] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.477902] futex hash table entries: 1024 (order: 6, 262144 bytes)
[ 0.478105] NET: Registered protocol family 16
[ 0.478134] audit: initializing netlink subsys (disabled)
[ 0.478183] audit: type=2000 audit(1547122707.960:1): state=initialized audit_enabled=0 res=1
[ 0.478222] Spectre V2 mitigation: execute trampolines
[ 0.479205] HugeTLB registered 1.00 MiB page size, pre-allocated 0 pages
[ 0.479623] SCSI subsystem initialized
[ 0.479856] NetLabel: Initializing
[ 0.479858] NetLabel: domain hash size = 128
[ 0.479859] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
[ 0.479871] NetLabel: unlabeled traffic allowed by default
[ 0.507362] VFS: Disk quotas dquot_6.6.0
[ 0.507378] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.507486] AppArmor: AppArmor Filesystem Enabled
[ 0.507847] NET: Registered protocol family 2
[ 0.507976] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
[ 0.507994] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
[ 0.508126] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
[ 0.508235] TCP: Hash tables configured (established 16384 bind 16384)
[ 0.508269] UDP hash table entries: 1024 (order: 3, 32768 bytes)
[ 0.508287] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
[ 0.508336] NET: Registered protocol family 1
[ 0.508343] NET: Registered protocol family 44
[ 0.508374] Unpacking initramfs...
[ 0.742412] Freeing initrd memory: 15824K
[ 0.742685] kvm-s390: SIE not available
[ 0.742706] hypfs.7f5705: The hardware system does not support hypfs
[ 0.742714] hypfs.7a79f0: Initialization of hypfs failed with rc=-61
[ 0.743110] Initialise system trusted keyrings
[ 0.743119] Key type blacklist registered
[ 0.743140] workingset: timestamp_bits=42 max_order=19 bucket_order=0
[ 0.744539] zbud: loaded
[ 0.745012] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 0.745211] fuse init (API version 7.28)
[ 0.771153] Key type asymmetric registered
[ 0.771157] Asymmetric key parser 'x509' registered
[ 0.771171] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 250)
[ 0.771203] io scheduler noop registered
[ 0.771204] io scheduler deadline registered
[ 0.771242] io scheduler cfq registered (default)
[ 0.771407] random: fast init done
[ 0.771413] hvc_iucv.691dff: The z/VM IUCV HVC device driver cannot be used without z/VM
[ 0.772923] loop: module loaded
[ 0.772991] tun: Universal TUN/TAP device driver, 1.6
[ 0.773072] device-mapper: uevent: version 1.0.3
[ 0.773140] device-mapper: ioctl: 4.39.0-ioctl (2018-04-03) initialised: dm-devel@redhat.com
[ 0.773193] cio.b5d5f6: Channel measurement facility initialized using format extended (mode autodetected)
[ 0.773515] ap.3677f7: The hardware system does not support AP instructions
[ 0.773762] NET: Registered protocol family 10
[ 0.773936] sclp_sd.ef2911: Store Data request failed (eq=2, di=3, response=0x40f0, flags=0x00, status=0, rc=-5)
[ 0.775411] Unable to handle kernel pointer dereference in virtual kernel address space
[ 0.775413] Failing address: 0000000000000000 TEID: 0000000000000483
[ 0.775414] Fault in home space mode while using kernel ASCE.
[ 0.775416] AS:0000000000d40007 R3:000000007ffd4007 S:000000007ffdb000 P:000000000000003d
[ 0.775468] Oops: 0004 ilc:3 [#1] SMP
[ 0.775470] Modules linked in:
[ 0.775473] CPU: 0 PID: 7 Comm: kworker/u6:0 Not tainted 4.20.0-2-generic #3-Ubuntu
[ 0.775474] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[ 0.775479] Workqueue: events_unbound async_run_entry_fn
[ 0.775481] Krnl PSW : 0404c00180000000 00000000008d1a9a (rb_insert_color_cached+0x1b2/0x258)
[ 0.775487] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 0.775489] Krnl GPRS: 0000000000000000 0000000000000000 000000007f841118 000000007ff92630
[ 0.775491] 0000000000000000 0000000000001358 0000000000b5e3b0 0700000000000000
[ 0.775493] 000000002e37cbb5 0000000000000001 000000007f9eb100 000000007f841100
[ 0.775494] 0000000000001358 00000000008f6e70 00000000001aeb7c 000003e00035fb40
[ 0.775501] Krnl Code: 00000000008d1a88: e30040000024 stg %r0,0(%r4)
[ 0.775501] 00000000008d1a8e: e32050000024 stg %r2,0(%r5)
[ 0.775501] #00000000008d1a94: e34020080004 lg %r4,8(%r2)
[ 0.775501] >00000000008d1a9a: e34010100024 stg %r4,16(%r1)
[ 0.775501] 00000000008d1aa0: b9040052 lgr %r5,%r2
[ 0.775501] 00000000008d1aa4: e31050080024 stg %r1,8(%r5)
[ 0.775501] 00000000008d1aaa: b90400c2 lgr %r12,%r2
[ 0.775501] 00000000008d1aae: ec46ff56007c cgij %r4,0,6,8d195a
[ 0.775518] Call Trace:
[ 0.775522] ([<00000000001aecb2>] enqueue_entity+0x552/0xae8)
[ 0.775524] [<00000000001af2ba>] enqueue_task_fair+0x72/0x5a0
[ 0.775527] [<000000000019f878>] ttwu_do_activate+0x58/0xb0
[ 0.775529] [<00000000001a0a68>] try_to_wake_up+0x1f0/0x4e8
[ 0.775532] [<00000000001f9f1c>] call_timer_fn+0x3c/0x180
[ 0.775534] [<00000000001fb104>] expire_timers.part.5+0x134/0x160
[ 0.775536] [<00000000001fb1e6>] run_timer_softirq+0xb6/0x1d8
[ 0.775538] [<00000000008eaa18>] __do_softirq+0xf8/0x368
[ 0.775541] [<0000000000175d70>] irq_exit+0xb8/0xc8
[ 0.775581] [<000000000010c380>] do_IRQ+0x78/0xb0
[ 0.775586] [<00000000008e9f7c>] ext_int_handler+0x128/0x12c
[ 0.775590] [<000000000064b6aa>] __vring_new_virtqueue+0x13a/0x200
[ 0.775592] ([<000000000064b5ba>] __vring_new_virtqueue+0x4a/0x200)
[ 0.775594] [<000000000064b818>] vring_new_virtqueue+0xa8/0xb8
[ 0.775597] [<000000000074283a>] virtio_ccw_find_vqs+0x1e2/0x920
[ 0.775599] [<0000000000651670>] init_vqs+0xf8/0x208
[ 0.775601] [<0000000000651fde>] virtballoon_probe+0x116/0x3f8
[ 0.775603] [<000000000064a8aa>] virtio_dev_probe+0x172/0x2a0
[ 0.775607] [<0000000000688c32>] really_probe+0xf2/0x400
[ 0.775609] [<000000000068926a>] driver_probe_device+0x13a/0x1a0
[ 0.775611] [<0000000000686446>] bus_for_each_drv+0x86/0xc8
[ 0.775612] [<0000000000688a94>] __device_attach+0xec/0x188
[ 0.775614] [<00000000006879a8>] bus_probe_device+0xa8/0xc0
[ 0.775616] [<00000000006832dc>] device_add+0x3ac/0x6a0
[ 0.775618] [<000000000064a4d8>] register_virtio_device+0xe0/0x158
[ 0.775620] [<0000000000741e82>] virtio_ccw_online+0x1b2/0x258
[ 0.775623] [<000000000071eff6>] ccw_device_set_online+0x116/0x530
[ 0.775625] [<000000000074330e>] virtio_ccw_auto_online+0x26/0x58
[ 0.775627] [<0000000000197ca2>] async_run_entry_fn+0x62/0x1d0
[ 0.775629] [<000000000018cd18>] process_one_work+0x278/0x4b8
[ 0.775631] [<000000000018cfa8>] worker_thread+0x50/0x4e0
[ 0.775633] [<0000000000193e5c>] kthread+0x144/0x160
[ 0.775635] [<00000000008e99a2>] kernel_thread_starter+0xa/0x10
[ 0.775637] [<00000000008e9998>] kernel_thread_starter+0x0/0x10
[ 0.775638] Last Breaking-Event-Address:
[ 0.775640] [<00000000008d1a7a>] rb_insert_color_cached+0x192/0x258
[ 0.775641]
[ 0.775644] Kernel panic - not syncing: Fatal exception in interrupt

I've successfully bisected this down to:

git bisect good
86a559787e6f5cf662c081363f64a20cad654195 is the first bad commit
commit 86a559787e6f5cf662c081363f64a20cad654195
Author: Wei Wang <wei.w.wang@intel.com>
Date: Mon Aug 27 09:32:17 2018 +0800

    virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

    Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
    support of reporting hints of guest free pages to host via virtio-balloon.
    Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
    obtained one by one from the mm free list via the regular allocation
    function.

    Host requests the guest to report free page hints by sending a new cmd id
    to the guest via the free_page_report_cmd_id configuration register. When
    the guest starts to report, it first sends a start cmd to host via the
    free page vq, which acks to host the cmd id received. When the guest
    finishes reporting free pages, a stop cmd is sent to host via the vq.
    Host may also send a stop cmd id to the guest to stop the reporting.

    VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest
    reporting.
    VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
    the reported pages are ready to be freed.

    Why does the guest free the reported pages when host tells it is ready to
    free?
    This is because freeing pages appears to be expensive for live migration.
    free_pages() dirties memory very quickly and makes the live migraion not
    converge in some cases. So it is good to delay the free_page operation
    when the migration is done, and host sends a command to guest about that.

    Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
    VIRTIO_BALLOON_CMD_ID_STOP?
    This is because live migration is usually done in several rounds. At the
    end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
    the guest to stop (or say pause) the reporting. The guest resumes the
    reporting when it receives a new command id at the beginning of the next
    round. So we need a new cmd id to distinguish between "stop reporting" and
    "ready to free the reported pages".

    TODO:
    - Add a batch page allocation API to amortize the allocation overhead.

    Signed-off-by: Wei Wang <wei.w.wang@intel.com>
    Signed-off-by: Liang Li <liang.z.li@intel.com>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Comment 1 Colin Ian King 2019-01-15 14:56:43 UTC
Also, I've tested this with today's upstream from Linus' repo, still panics.