Bug 101391

Summary: Lockup after resume (kernel BUG at include/drm/drm_mm.h:145!)
Product: Drivers
Reporter: Michael Long (harn-solo)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED UNREPRODUCIBLE
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg log

Description Michael Long 2015-07-12 09:12:10 UTC
Created attachment 182481: dmesg log

Hi,

Since kernel version 4.1 (tested with 4.1.1 and 4.1.2), the system freezes shortly (5-10 s) after resuming from s2ram. See the attached log. The bug is reproducible.

Thanks.
Comment 1 Michel Dänzer 2015-07-13 01:58:06 UTC
What was the last kernel version that didn't have this problem? Does it also happen with 4.1.0? Can you bisect?
Comment 2 Michael Long 2015-07-13 18:42:35 UTC
The last working kernel version is 4.0.x; I'm back on 4.0.8 right now. It also happens with 4.1.0. I'll try to do a bisect.
Comment 3 Michael Long 2015-08-05 17:17:16 UTC
Sorry for the delay. I tried a bisect:

git bisect start 'v4.0..v4.1' 'drivers/gpu/drm'
# good: [39a8804455fb23f09157341d3ba7db6d7ae6ee76] Linux 4.0
git bisect good 39a8804455fb23f09157341d3ba7db6d7ae6ee76
# bad: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
git bisect bad b953c0d234bc72e8489d3bf51a276c5c4ec85345
# good: [b5f1c97f944482e98e6e39208af356630389d1ea] drm/i915: vlv: fix save/restore of GFX_MAX_REQ_COUNT reg
git bisect good b5f1c97f944482e98e6e39208af356630389d1ea
# good: [2611015c7511106719bae904cac459383c55ffef] drm/exynos: mixer: add 2x scaling to mixer_graph_buffer
git bisect good 2611015c7511106719bae904cac459383c55ffef
# good: [362ff251390f3d1f8fe94666f4fc4e5876381114] drm/radeon/audio: don't enable packets until the end
git bisect good 362ff251390f3d1f8fe94666f4fc4e5876381114
# good: [d10ebb9f136669a1e9fa388fc450bf1822c93dd5] drm/exynos: fb: use drm_format_num_planes to get buffer count
git bisect good d10ebb9f136669a1e9fa388fc450bf1822c93dd5
# bad: [7c0411d2fabc2e2702c9871ffb603e251158b317] drm/radeon: partially revert "fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling"
git bisect bad 7c0411d2fabc2e2702c9871ffb603e251158b317

After finding the first bad commit I'm stuck with:

Bisecting: 12 revisions left to test after this (roughly 4 steps)
error: Your local changes to the following files would be overwritten by checkout:
	Documentation/devicetree/bindings/clock/silabs,si5351.txt
	Documentation/devicetree/bindings/net/cdns-emac.txt
	Documentation/virtual/kvm/mmu.txt
	MAINTAINERS
	Makefile
	arch/arm/boot/dts/zynq-7000.dtsi
	arch/arm/xen/enlighten.c
	arch/powerpc/kernel/mce.c
	arch/powerpc/kernel/vmlinux.lds.S
	arch/powerpc/kvm/book3s_hv.c
	arch/powerpc/mm/hugetlbpage.c
	arch/powerpc/mm/pgtable_64.c
	arch/s390/crypto/ghash_s390.c
	arch/s390/crypto/prng.c
	arch/s390/include/asm/pgtable.h
	arch/s390/net/bpf_jit_comp.c
	arch/x86/include/asm/kvm_host.h
	arch/x86/kvm/cpuid.c
	arch/x86/kvm/cpuid.h
	arch/x86/kvm/mmu.c
	arch/x86/kvm/mmu.h
	arch/x86/kvm/paging_tmpl.h
	arch/x86/kvm/svm.c
	arch/x86/kvm/vmx.c
	arch/x86/kvm/x86.c
	block/blk-core.c
	crypto/algif_aead.c
	drivers/block/nvme-scsi.c
	drivers/bluetooth/ath3k.c
	drivers/bluetooth/btusb.c
	drivers/clk/clk-si5351.c
	drivers/clk/clk.c
	drivers/clk/qcom/gcc-msm8916.c
	drivers/clk/samsung/Makefile
	drivers/clk/samsung/clk-exynos5420.c
	drivers/clk/samsung/clk-exynos5433.c
	drivers/gpu/drm/drm_plane_helper.c
	drivers/gpu/drm/i915/intel_pm.c
	drivers/gpu/drm/msm/mdp/mdp5/mdp5_plane.c
	drivers/gpu/drm/radeon/atombios_crtc.c
	drivers/gpu/drm/radeon/atombios_dp.c
	drivers/gpu/drm/radeon/cik.c
	drivers/gpu/drm/radeon/evergreen.c
	drivers/gpu/drm/radeon/evergreen_hdmi.c
	drivers/gpu/drm/radeon/ni.c
	drivers/gpu/drm/radeon/r600.c
	drivers/gpu/drm/radeon/radeon_audio.c
	drivers/gpu/drm/radeon/radeon_connectors.c
	drivers/gpu/drm/radeon/radeon_dp_auxch.c
	drivers/gpu/drm/radeon/rv770.c
	drivers/gpu/drm/radeon/si.c
	drivers/gpu/drm/vgem/Makefile
	drivers/gpu/drm/vgem/vgem_drv.c
	drivers/gpu/drm/vgem/vgem_drv.h
	drivers/hid/hid-ids.h
	drivers/hid/hid-logitech-hidpp.c
	drivers/hid/hid-sensor-hub.c
	drivers/hid/i2c-hid/i2c-hid.c
	drivers/hid/usbhid/hid-quirks.c
	drivers/hid/wacom_wac.c
	drivers/infiniband/core/cm.c
	drivers/infiniband/core/cma.c
	drivers/infiniband/hw/ocrdma/ocrdma.h
	drivers/infiniband/hw/ocrdma/ocrdma_ah.c
	drivers/infiniband/hw/ocrdma/ocrdma_hw.c
	drivers/infiniband/hw/ocrdma/ocrdma_sli.h
	drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
	drivers/input/joydev.c
	drivers/input/mouse/Kconfig
	drivers/input/mouse/alps.c
	drivers/input/mouse/elantech.c
	drivers/input/touchscreen/stmpe-ts.c
	drivers/input/touchscreen/sx8654.c
	drivers/irqchip/irq-gic-v3-its.c
	drivers/md/bitmap.c
	drivers/md/raid0.c
	drivers/md/raid5.c
	drivers/mmc/host/atmel-mci.c
	drivers/net/bonding/bond_options.c
	drivers/net/ethernet/cadence/macb.c
	drivers/net/ethernet/cadence/macb.h
	drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
	drivers/net/ethernet/rocker/rocker.c
	drivers/net/phy/phy.c
	drivers/net/usb/cdc_ncm.c
	drivers/net/vxlan.c
	drivers/pwm/pwm-img.c
	drivers/s390/crypto/ap_bus.c
	drivers/scsi/be2iscsi/be.h
	drivers/scsi/be2iscsi/be_cmds.c
	drivers/scsi/be2iscsi/be_cmds.h
	drivers/scsi/be2iscsi/be_iscsi.c
	drivers/scsi/be2iscsi/be_iscsi.h
	drivers/scsi/be2iscsi/be_main.c
	drivers/scsi/be2iscsi/be_main.h
	drivers/scsi/be2iscsi/be_mgmt.c
	drivers/scsi/be2iscsi/be_mgmt.h
	drivers/scsi/lpfc/lpfc_scsi.c
	drivers/scsi/sd.c
	drivers/scsi/storvsc_drv.c
	drivers/thermal/armada_thermal.c
	drivers/thermal/ti-soc-thermal/dra752-thermal-data.c
	drivers/thermal/ti-soc-thermal/omap5-thermal-data.c
	drivers/thermal/ti-soc-thermal/ti-bandgap.c
	drivers/thermal/ti-soc-thermal/ti-bandgap.h
	drivers/tty/hvc/hvc_xen.c
	drivers/xen/events/events_base.c
	fs/btrfs/backref.c
	fs/btrfs/extent-tree.c
	fs/btrfs/volumes.c
	fs/nfs/nfs4proc.c
	fs/nfs/write.c
	include/linux/blkdev.h
	include/linux/hid-sensor-hub.h
	include/linux/ktime.h
	include/linux/platform_data/si5351.h
	include/linux/rhashtable.h
	include/linux/skbuff.h
	include/linux/tcp.h
	include/net/inet_connection_sock.h
	include/uapi/linux/netfilter/nf_conntrack_tcp.h
	include/uapi/linux/rtnetlink.h
	include/xen/events.h
	kernel/sched/core.c
	kernel/time/hrtimer.c
	kernel/watchdog.c
	lib/rhashtable.c
	net/8021q/vlan.c
	net/bluetooth/hci_core.c
	net/bridge/br_multicast.c
	net/bridge/br_netfilter.c
	net/bridge
Aborting
Comment 4 Michel Dänzer 2015-08-06 02:12:39 UTC
(In reply to Michael Long from comment #3)
> Bisecting: 12 revisions left to test after this (roughly 4 steps)
> error: Your local changes to the following files would be overwritten by
> checkout:

You need to remove your local modifications. Save them first if you still need them. If you need the local modifications for bisecting, you can apply them again after git bisect has checked out the next commit to test.
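
A minimal sketch of that sequence, assuming the local modifications are saved with git stash (one possible way to save them; no specific command is named above). The helper names bisect_step_commands and run_bisect_step are hypothetical and only illustrate the order of the git commands.

```python
import subprocess


def bisect_step_commands(verdict, reapply=True):
    """Return the git commands for one bisect step, in order.

    verdict: "good" or "bad", the test result for the current commit.
    reapply: re-apply the saved local changes on the next commit.
    """
    if verdict not in ("good", "bad"):
        raise ValueError("verdict must be 'good' or 'bad'")
    commands = [
        ["git", "stash"],            # save and remove local modifications
        ["git", "bisect", verdict],  # record the result; git checks out the next commit
    ]
    if reapply:
        # Assumes the stash entry created above is the newest one.
        # If there was nothing to stash, this pops an older entry or fails.
        commands.append(["git", "stash", "pop"])
    return commands


def run_bisect_step(verdict, repo=".", reapply=True):
    """Run the commands in the given repository, stopping at the first failure."""
    for argv in bisect_step_commands(verdict, reapply):
        subprocess.run(argv, cwd=repo, check=True)
```

For example, run_bisect_step("bad", repo="/path/to/linux") would stash the changes, mark the current commit bad, and re-apply the changes on the commit git checks out next. The repository path is a placeholder, and re-applying can conflict if the next commit touches the same lines.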
Comment 5 Michael Long 2015-08-12 09:01:14 UTC
I managed to continue the bisect, but it didn't lead to another bad commit.

After that I restarted the bisection, this time without limiting it to a sub-tree. This time e640a280ccb9c448a3d9d522ea730ce00efa8cf0 (Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux) was the first bad commit, but continuing again did not reveal another bad commit.

I'm not really familiar with the process, so I might be doing something fundamentally wrong, but I will try again.

But I found out the following:

To experiment with KVM-based VGA passthrough, I've installed three different GPUs in my system:

AMD Radeon HD 5430 card for the host (using radeon)
AMD Radeon HD 5570 card for VM1 (using radeon) or another NVIDIA based card (using nouveau)
NVIDIA GTX 980 card for VM2 (no driver bound)

Since the GTX 980 is not well supported, it is stubbed out via the kernel parameter pci-stub.ids=...., while the other cards are bound to a driver whenever they are not used with KVM/qemu.

The problem ONLY arises when more than 2 GPUs are installed (my guess: when they are seen by vgaarb) and at least 2 GPUs are bound to a driver, e.g. radeon or nouveau.

If both cards are stubbed out, the resume problem does not occur. I could use this as a temporary workaround, but then I run into this bug: https://bugs.freedesktop.org/show_bug.cgi?id=88773

I'm fully aware that this is a corner case, but still, it works with 4.0.9.
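
To check how many GPUs are installed and which of them are currently bound to a driver (the condition described above), the PCI devices can be inspected through sysfs. This is a minimal sketch, assuming the standard Linux layout under /sys/bus/pci/devices; list_gpus is a hypothetical helper and not part of any tool mentioned in this report.

```python
import os


def list_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, driver_name_or_None) for every display controller."""
    gpus = []
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            # The "class" file holds the 24-bit PCI class code, e.g. 0x030000.
            with open(os.path.join(dev, "class")) as f:
                pci_class = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue
        if pci_class >> 16 != 0x03:  # base class 0x03 = display controller
            continue
        # "driver" is a symlink that exists only while a driver is bound.
        link = os.path.join(dev, "driver")
        driver = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
        gpus.append((addr, driver))
    return gpus


if __name__ == "__main__":
    for addr, driver in list_gpus():
        print(addr, driver or "(no driver bound)")
```

A card claimed by pci-stub should show up with that name as its driver, so it can be told apart from cards bound to radeon or nouveau.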
Comment 6 Michel Dänzer 2015-08-12 09:12:09 UTC
"No more bad commits" isn't necessarily a problem. Just keep testing the commits you get from running "git bisect good/bad" according to the test result with the previous commit, until the process is finished and it prints out which commit seems to have introduced the problem.

FWIW, it's better not to limit the bisection to a sub-tree.
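
As an illustration of why a run of "good" results after the last "bad" one is expected, here is a simplified model of bisection on a linear history. It is only a sketch: real kernel history contains merges, and git picks the commits to test with a more elaborate strategy. The function name bisect and the integer commits are made up for the example.

```python
def bisect(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits: ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    is_bad:  callable that tests one commit and returns True if it is bad.
    Returns (first_bad_commit, number_of_tests).
    """
    good, bad = 0, len(commits) - 1  # newest known-good, oldest known-bad
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return commits[bad], steps


# 14 commits: 0 is known good, 13 is known bad, 12 are untested in between.
# Commit 9 introduces the problem in this made-up history.
first_bad, steps = bisect(list(range(14)), lambda c: c >= 9)
print(first_bad, steps)  # prints "9 4"
```

In this run the tested commits are 6 (good), 9 (bad), 7 (good) and 8 (good). The last two results are both "good", yet the search still ends by naming commit 9 as the first bad one.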
Comment 7 Michael Long 2015-09-26 16:37:31 UTC
I tried several times to bisect this problem, but it is an incredibly difficult task for me. Between the bisection steps I had a variety of other hangs and freezes not related to drm.

In the meantime I've also tested the 4.2 kernel extensively, and so far the original problem is gone. Occasionally (1 out of 20 resumes) the last console output is shown instead of the KDE unlock screen, but there are no error messages or backtraces. I'll leave it at that. Thanks for your time anyway.