Bug 194731
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | drm general protection fault in drm_atomic_init | | |
| Product: | Drivers | Reporter: | Janpieter Sollie (janpieter.sollie) |
| Component: | Video (DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | high | | |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| See Also: | https://bugzilla.kernel.org/show_bug.cgi?id=194559 | | |
| Kernel Version: | 4.10.0 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Janpieter Sollie
2017-02-28 08:20:36 UTC
(In reply to Janpieter Sollie from comment #0)
> I modified drm_atomic.c a bit to be more verbose. drm_init hits a
> general protection fault when allocating my second GPU (does it work
> this way?) of the amdgpu module.
[...]
> [   77.950445] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O)

amdttm and amdkcl indicate that you're using an out-of-tree version of the amdgpu kernel module, possibly from an amdgpu-pro release? Does the problem also occur with the in-tree amdgpu driver?

---

An additional general protection fault afterwards shows that it corrupts the kernel's memory management:

```
<6>[  376.340003] general protection fault: 0000 [#3] SMP
<6>[  376.340005] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O) esp4 xfrm4_mode_tunnel ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc deflate cmac af_key w83627ehf hwmon_vid k10temp fam15h_power i2c_piix4 pcspkr
<6>[  376.340033] CPU: 16 PID: 10421 Comm: e2label Tainted: G D O 4.10.0-rc8 #12
<6>[  376.340035] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0a 01/09/2012
<6>[  376.340037] task: ffff881039be6200 task.stack: ffffc9002663c000
<6>[  376.340047] RIP: 0010:kmem_cache_alloc+0x57/0xd0
<6>[  376.340049] RSP: 0018:ffffc9002663f950 EFLAGS: 00010286
<6>[  376.340052] RAX: 0000000000000000 RBX: ffff88083bd7d000 RCX: 0000000000000574
<6>[  376.340053] RDX: 0000000000000573 RSI: 0000000001408000 RDI: 0000000000019b70
<6>[  376.340055] RBP: ffffc9002663f970 R08: ffff88183fc19b70 R09: ffffffff81454f96
<6>[  376.340056] R10: 0000000000000001 R11: ffff88203a80d550 R12: ff88183888cc0000
<6>[  376.340057] R13: ffff88083f803a00 R14: 0000000001408000 R15: 0000000000000005
<6>[  376.340060] FS:  00007f62cc6b2780(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
<6>[  376.340062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<6>[  376.340064] CR2: 00007f62cc2a8320 CR3: 0000001837444000 CR4: 00000000000406e0
<6>[  376.340065] Call Trace:
<6>[  376.340073]  scsi_execute_req_flags+0x46/0x100
<6>[  376.340076]  scsi_test_unit_ready+0x85/0x130
<6>[  376.340080]  sd_check_events+0x116/0x160
<6>[  376.340085]  disk_check_events+0x5a/0x130
<6>[  376.340088]  disk_clear_events+0x6a/0x110
<6>[  376.340092]  check_disk_change+0x32/0x70
<6>[  376.340095]  sd_open+0x69/0x130
<6>[  376.340097]  __blkdev_get+0x2ff/0x3f0
<6>[  376.340100]  blkdev_get+0x115/0x330
<6>[  376.340106]  ? _raw_spin_unlock+0x9/0x10
<6>[  376.340109]  blkdev_open+0x56/0x70
<6>[  376.340113]  do_dentry_open+0x215/0x300
<6>[  376.340115]  ? blkdev_get_by_dev+0x60/0x60
<6>[  376.340118]  vfs_open+0x4c/0x70
<6>[  376.340123]  ? may_open+0x96/0x100
<6>[  376.340127]  path_openat+0x29f/0x1350
<6>[  376.340131]  do_filp_open+0x7c/0xd0
<6>[  376.340135]  ? vma_link+0xc5/0xd0
<6>[  376.340139]  ? _raw_spin_unlock+0x9/0x10
<6>[  376.340142]  ? __alloc_fd+0xa9/0x170
<6>[  376.340145]  do_sys_open+0x121/0x200
<6>[  376.340148]  SyS_open+0x19/0x20
<6>[  376.340152]  do_syscall_64+0x63/0x180
<6>[  376.340156]  entry_SYSCALL64_slow_path+0x25/0x25
<6>[  376.340158] RIP: 0033:0x7f62cb74d760
<6>[  376.340160] RSP: 002b:00007fff591f3d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
<6>[  376.340162] RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f62cb74d760
<6>[  376.340164] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001449420
<6>[  376.340165] RBP: 0000000000000000 R08: 0000000000000021 R09: 00007f62cb79a9b0
<6>[  376.340167] R10: 0000000000000130 R11: 0000000000000246 R12: 0000000001449310
<6>[  376.340168] R13: 0000000001449240 R14: 0000000000000009 R15: 0000000001449118
<6>[  376.340170] Code: 8b 45 00 65 49 8b 50 08 65 4c 03 05 24 5f ea 7e 49 83 78 10 00 4d 8b 20 74 4a 4d 85 e4 74 45 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 c1 49 63
<1>[  376.340210] RIP: kmem_cache_alloc+0x57/0xd0 RSP: ffffc9002663f950
```

---

(In reply to Michel Dänzer from comment #1)
> amdttm and amdkcl indicate that you're using an out-of-tree version of
> the amdgpu kernel module, possibly from an amdgpu-pro release? Does the
> problem also occur with the in-tree amdgpu driver?

No, with the in-tree driver the kernel simply reboots. I'm sorry for using the amdgpu-pro module, but the native driver doesn't tell me anything: `modprobe amdgpu & cat /proc/kmsg` gives this as its last output:

```
<6>[  252.228071] amdgpu 0000:41:00.0: fence driver on ring 0 use gpu addr 0x0000000040000010, cpu addr 0xffff881831cc6010
<6>[  252.228253] amdgpu 0000:41:00.0: fence driver on ring 1 use gpu addr 0x0000000040000020, cpu addr 0xffff881831cc6020
<6>[  252.228601] amdgpu 0000:41:00.0: fence driver on ring 2 use gpu addr 0x0000000040000030, cpu addr 0xffff881831cc6030
<6>[  252.228768] amdgpu 0000:41:00.0: fence driver on ring 3 use gpu addr 0x0000000040000040, cpu addr 0xffff881831cc6040
<6>[  252.228939] amdgpu 0000:41:00.0: fence driver on ring 4 use gpu addr 0x0000000040000050, cpu addr 0xffff881831cc6050
<6>[  252.229092] [drm] probing gen 2 caps for device 1002:5a1f = 31cd02/0
<6>[  252.229096] [drm] PCIE gen 2 link speeds already enabled
<6>[  252.458236] [drm] ring test on 0 succeeded in 7 usecs
<6>[  252.460247] [drm] ring test on 1 succeeded in 1 usecs
<6>[  252.460256] [drm] ring test on 2 succeeded in 1 usecs
<6>[  252.460265] [drm] ring test on 3 succeeded in 3 usecs
<6>[  252.460272] [drm] ring test on 4 succeeded in 3 usecs
```

But this is probably related to another bug: 194559. A workaround (a very dirty one) for the amdgpu-pro driver is to comment out this line (at line 410):

```
drm_fb_helper_initial_config(&rfbdev->helper, bpp_sel);
```

After that, the pro module works fine, as I don't use a framebuffer on headless cards ;)

---

This bugzilla is only for in-tree code. Also, amdgpu-pro doesn't officially support 4.10 yet, so this is basically an unsupported configuration.

---

I know, but isn't it possible that the error is in DRM rather than in the driver? If not, sorry for reporting a wrong bug.

---

Hello, I have some other news about this bug (if anyone is still interested). I went back to the amdgpu-pro driver, as the in-tree amdgpu driver seems more complex, and I took one step forward. This is my solution:

- unload the Cape Verde card: `echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove`
- load the driver; the driver loading finishes successfully
- rescan the PCI bus: `echo 1 > /sys/bus/pci/rescan`

Then the behaviour of amdgpu and amdgpu-pro is the same: the system reboots after initialisation, even if I tell it not to restart on panic. Can somebody tell me where I should look for the Cape Verde initialization code in the in-tree driver? I may be able to fix the initialization bug. Thank you
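The remove/load/rescan sequence described above can be sketched as a small POSIX shell helper. This is a hypothetical wrapper, not part of the bug report: the function name `rebind_gpu` and the overridable sysfs root are illustrative additions; the PCI address 0000:41:00.0 is the reporter's Cape Verde card.

```shell
#!/bin/sh
# Hypothetical helper wrapping the reporter's three steps.
# The sysfs root is a parameter only so the sequence can be
# dry-run against a scratch directory; on real hardware it is /sys.
rebind_gpu() {
    sysfs="${1:-/sys}"          # sysfs root (assumption: overridable for dry runs)
    dev="${2:-0000:41:00.0}"    # the reporter's Cape Verde card

    # Step 1: hot-remove the device so the driver load skips it
    echo 1 > "$sysfs/bus/pci/devices/$dev/remove"

    # Step 2: load the driver; with the card gone, init completes
    modprobe amdgpu

    # Step 3: rescan the bus so the card reappears and is probed
    echo 1 > "$sysfs/bus/pci/rescan"
}
```

Run as root against the real /sys; the three writes correspond one-to-one to the commands in the comment above.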