Bug 194731

Summary: drm general protection fault in drm_atomic_init
Product: Drivers
Reporter: Janpieter Sollie (janpieter.sollie)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE
Severity: high
Priority: P1
Hardware: x86-64
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=194559
Kernel Version: 4.10.0
Subsystem:
Regression: No
Bisected commit-id:

Description Janpieter Sollie 2017-02-28 08:20:36 UTC
I modified drm_atomic.c a bit to make it more verbose. With the amdgpu module, drm_atomic_state_init hits a general protection fault while setting up my second GPU (does it work this way?).
Afterwards, the system is unable to allocate more kernel memory: init does not even manage to reboot, and I cannot run sudo without a segfault.
dmesg:
[   77.946291] [drm] Cannot find any crtc or sizes - going 1024x768
[   77.946575] [drm] fb mappable at 0xC0CCB000
[   77.946576] [drm] vram apper at 0xC0000000
[   77.946577] [drm] size 786432
[   77.946578] [drm] fb depth is 24
[   77.946578] [drm]    pitch is 1024
[   77.950389] kcalloc called 
[   77.950390] kcalloc2 called 
[   77.950433] kcalloc called 
[   77.950444] general protection fault: 0000 [#1] SMP
[   77.950445] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O) ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc deflate cmac af_key w83627ehf hwmon_vid k10temp fam15h_power i2c_piix4 pcspkr
[   77.950462] CPU: 16 PID: 5094 Comm: kworker/16:2 Tainted: G           O    4.10.0-rc8 #12
[   77.950463] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0a       01/09/2012
[   77.950470] Workqueue: events work_for_cpu_fn
[   77.950471] task: ffff88183c027000 task.stack: ffffc90020cfc000
[   77.950476] RIP: 0010:__kmalloc+0x6f/0x100
[   77.950477] RSP: 0018:ffffc90020cff6b0 EFLAGS: 00010286
[   77.950478] RAX: 0000000000000000 RBX: ffff8818362628ff RCX: 0000000000000897
[   77.950479] RDX: 0000000000000896 RSI: 0000000000000000 RDI: 0000000000019b70
[   77.950479] RBP: ffffc90020cff6d0 R08: ffff88183fc19b70 R09: ffffffff81417df1
[   77.950480] R10: 0000000000000300 R11: 0000000000000000 R12: ff88183626296000
[   77.950481] R13: ffff88083f803a00 R14: 00000000014080c0 R15: ffff88183602b238
[   77.950483] FS:  0000000000000000(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
[   77.950484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   77.950485] CR2: 00005590d5ce93d8 CR3: 0000000001c11000 CR4: 00000000000406e0
[   77.950485] Call Trace:
[   77.950492]  drm_atomic_state_init+0x81/0xf0
[   77.950494]  drm_atomic_state_alloc+0x53/0x80
[   77.950497]  drm_fb_helper_pan_display+0xf4/0x2a0
[   77.950501]  fb_pan_display+0xd6/0x170
[   77.950502]  bit_update_start+0x24/0x60
[   77.950504]  fbcon_switch+0x3e2/0x670
[   77.950507]  redraw_screen+0x15d/0x220
[   77.950510]  ? tty_do_resize+0x4d/0xa0
[   77.950512]  vc_do_resize+0x4c3/0x4f0
[   77.950513]  vc_resize+0x1a/0x20
[   77.950515]  fbcon_init+0x40b/0x660
[   77.950516]  visual_init+0xce/0x130
[   77.950517]  do_bind_con_driver+0x1c1/0x3a0
[   77.950519]  do_take_over_console+0x113/0x180
[   77.950520]  do_fbcon_takeover+0x51/0xb0
[   77.950522]  fbcon_event_notify+0x75a/0x880
[   77.950525]  notifier_call_chain+0x44/0x70
[   77.950526]  __blocking_notifier_call_chain+0x4e/0x80
[   77.950528]  blocking_notifier_call_chain+0x11/0x20
[   77.950529]  fb_notifier_call_chain+0x16/0x20
[   77.950531]  register_framebuffer+0x200/0x340
[   77.950533]  drm_fb_helper_initial_config+0x201/0x390
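
On x86-64, dereferencing a non-canonical address raises a general protection fault rather than a page fault, which may be relevant to the R12 value in the register dump above. The following is a minimal sketch of that check, assuming 48-bit virtual addresses (4-level paging); the function name `is_canonical48` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr is canonical for 48-bit virtual addresses,
 * i.e. bits 63..47 are either all zero or all one.
 */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}
```

Under this assumption, R12 (ff88183626296000) is not canonical, while the task pointer from the same dump (ffff88183c027000) is. The R12 value in the second fault reported later in this thread (ff88183888cc0000) fails the same check. R12 would look like an ordinary kernel pointer if shifted right by one byte, but the trace alone does not show how it got that value.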

The piece of code that prints the "kcalloc called" and "kcalloc2 called" messages:
int
drm_atomic_state_init(struct drm_device *dev, struct drm_atomic_state *state)
{
        kref_init(&(state->ref));

        /* TODO legacy paths should maybe do a better job about
         * setting this appropriately?
         */
        state->allow_modeset = true;

        state->crtcs = kcalloc(dev->mode_config.num_crtc, sizeof(struct __drm_crtcs_state), GFP_KERNEL);
        printk("kcalloc called \n");
        if (!(state->crtcs))
                goto fail;
        state->planes = kcalloc(dev->mode_config.num_total_plane,
                                sizeof(struct __drm_planes_state), GFP_KERNEL);
        printk("kcalloc2 called \n");
        if (!(state->planes))
                goto fail;
        state->dev = dev;

        DRM_DEBUG_ATOMIC("Allocated atomic state %p\n", state);

        return 0;
fail:
        printk("fail called \n");
        drm_atomic_state_default_release(state);
        return -ENOMEM;
}
EXPORT_SYMBOL(drm_atomic_state_init);
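
In the snippet above each printk runs after its kcalloc has returned, so a message only confirms that the allocation came back. The dmesg sequence ("kcalloc called", "kcalloc2 called", "kcalloc called", then the fault in __kmalloc) appears consistent with the fault happening inside the second allocation of the second call, but that is an inference. Below is a minimal userspace sketch of the same allocate, check, and unwind pattern, with a trace line before and after each allocation so the log shows which call was in flight. `struct state_model`, `state_init_model`, `trace`, and `ELEM_SIZE` are hypothetical stand-ins, and `calloc`/`free` replace `kcalloc` and the kernel release helper.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define ELEM_SIZE 16  /* placeholder element size, not the real struct size */

/* Hypothetical stand-in for the two arrays held by the atomic state. */
struct state_model {
    void *crtcs;
    void *planes;
};

/* Write one trace line to stderr; kernel code would use printk() instead. */
static void trace(const char *what, const char *when)
{
    fprintf(stderr, "%s: %s\n", what, when);
}

/* Allocate both arrays; on failure free what was allocated, return -ENOMEM. */
static int state_init_model(struct state_model *st,
                            size_t num_crtc, size_t num_plane)
{
    st->crtcs = NULL;
    st->planes = NULL;

    trace("crtcs", "before alloc");
    st->crtcs = calloc(num_crtc, ELEM_SIZE);
    trace("crtcs", "after alloc");
    if (!st->crtcs)
        goto fail;

    trace("planes", "before alloc");
    st->planes = calloc(num_plane, ELEM_SIZE);
    trace("planes", "after alloc");
    if (!st->planes)
        goto fail;

    return 0;

fail:
    free(st->crtcs);   /* free(NULL) is a no-op */
    free(st->planes);
    st->crtcs = NULL;
    st->planes = NULL;
    return -ENOMEM;
}

/* Release both arrays after a successful init. */
static void state_release_model(struct state_model *st)
{
    free(st->crtcs);
    free(st->planes);
    st->crtcs = NULL;
    st->planes = NULL;
}
```

With this layout, a "before alloc" line that is not followed by the matching "after alloc" line identifies the allocation that never returned.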
Comment 1 Michel Dänzer 2017-02-28 09:26:17 UTC
(In reply to Janpieter Sollie from comment #0)
> I modified the drm_atomic.c a bit to be more verbose.  the drm_init calls a
> general protection fault when allocating my second GPU(does it work this
> way?) of the amdgpu module.

[...]

> [   77.950445] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O)

amdttm and amdkcl indicate that you're using an out-of-tree version of the amdgpu kernel module, possibly from an amdgpu-pro release? Does the problem also occur with the in-tree amdgpu driver?
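
The "(O)" suffix in the module list is the kernel's per-module marker for an out-of-tree module, and a "+" there marks a module that is still loading. A small sketch of how such a list could be scanned for out-of-tree modules follows; the function names are made up for illustration and the parsing assumes the space-separated format shown in the oops.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if a token from the "Modules linked in:" list carries the
 * 'O' (out-of-tree) flag, e.g. "amdgpu(O+)" or "amdttm(O)".
 */
static bool module_is_out_of_tree(const char *token)
{
    const char *open = strchr(token, '(');
    const char *close;

    if (!open)
        return false;
    close = strchr(open, ')');
    if (!close)
        return false;

    /* Look for 'O' among the flag characters between the parentheses. */
    for (const char *p = open + 1; p < close; p++)
        if (*p == 'O')
            return true;
    return false;
}

/* Count the out-of-tree modules in a space-separated module list. */
static int count_out_of_tree(const char *list)
{
    char buf[1024];
    int count = 0;

    /* strtok() modifies its input, so work on a bounded copy. */
    snprintf(buf, sizeof(buf), "%s", list);
    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
        if (module_is_out_of_tree(tok))
            count++;
    return count;
}
```

Applied to the list quoted above, this flags amdgpu, amdttm, and amdkcl, and leaves amdkfd unflagged.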
Comment 2 Janpieter Sollie 2017-02-28 09:33:36 UTC
An additional general protection fault after the first one, which shows that the kernel's memory management gets corrupted:
<6>[  376.340003] general protection fault: 0000 [#3] SMP
<6>[  376.340005] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O) esp4 xfrm4_mode_tunnel ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc deflate cmac af_key w83627ehf hwmon_vid k10temp fam15h_power i2c_piix4 pcspkr
<6>[  376.340033] CPU: 16 PID: 10421 Comm: e2label Tainted: G      D    O    4.10.0-rc8 #12
<6>[  376.340035] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0a       01/09/2012
<6>[  376.340037] task: ffff881039be6200 task.stack: ffffc9002663c000
<6>[  376.340047] RIP: 0010:kmem_cache_alloc+0x57/0xd0
<6>[  376.340049] RSP: 0018:ffffc9002663f950 EFLAGS: 00010286
<6>[  376.340052] RAX: 0000000000000000 RBX: ffff88083bd7d000 RCX: 0000000000000574
<6>[  376.340053] RDX: 0000000000000573 RSI: 0000000001408000 RDI: 0000000000019b70
<6>[  376.340055] RBP: ffffc9002663f970 R08: ffff88183fc19b70 R09: ffffffff81454f96
<6>[  376.340056] R10: 0000000000000001 R11: ffff88203a80d550 R12: ff88183888cc0000
<6>[  376.340057] R13: ffff88083f803a00 R14: 0000000001408000 R15: 0000000000000005
<6>[  376.340060] FS:  00007f62cc6b2780(0000) GS:ffff88183fc00000(0000) knlGS:0000000000000000
<6>[  376.340062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<6>[  376.340064] CR2: 00007f62cc2a8320 CR3: 0000001837444000 CR4: 00000000000406e0
<6>[  376.340065] Call Trace:
<6>[  376.340073]  scsi_execute_req_flags+0x46/0x100
<6>[  376.340076]  scsi_test_unit_ready+0x85/0x130
<6>[  376.340080]  sd_check_events+0x116/0x160
<6>[  376.340085]  disk_check_events+0x5a/0x130
<6>[  376.340088]  disk_clear_events+0x6a/0x110
<6>[  376.340092]  check_disk_change+0x32/0x70
<6>[  376.340095]  sd_open+0x69/0x130
<6>[  376.340097]  __blkdev_get+0x2ff/0x3f0
<6>[  376.340100]  blkdev_get+0x115/0x330
<6>[  376.340106]  ? _raw_spin_unlock+0x9/0x10
<6>[  376.340109]  blkdev_open+0x56/0x70
<6>[  376.340113]  do_dentry_open+0x215/0x300
<6>[  376.340115]  ? blkdev_get_by_dev+0x60/0x60
<6>[  376.340118]  vfs_open+0x4c/0x70
<6>[  376.340123]  ? may_open+0x96/0x100
<6>[  376.340127]  path_openat+0x29f/0x1350
<6>[  376.340131]  do_filp_open+0x7c/0xd0
<6>[  376.340135]  ? vma_link+0xc5/0xd0
<6>[  376.340139]  ? _raw_spin_unlock+0x9/0x10
<6>[  376.340142]  ? __alloc_fd+0xa9/0x170
<6>[  376.340145]  do_sys_open+0x121/0x200
<6>[  376.340148]  SyS_open+0x19/0x20
<6>[  376.340152]  do_syscall_64+0x63/0x180
<6>[  376.340156]  entry_SYSCALL64_slow_path+0x25/0x25
<6>[  376.340158] RIP: 0033:0x7f62cb74d760
<6>[  376.340160] RSP: 002b:00007fff591f3d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
<6>[  376.340162] RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f62cb74d760
<6>[  376.340164] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001449420
<6>[  376.340165] RBP: 0000000000000000 R08: 0000000000000021 R09: 00007f62cb79a9b0
<6>[  376.340167] R10: 0000000000000130 R11: 0000000000000246 R12: 0000000001449310
<6>[  376.340168] R13: 0000000001449240 R14: 0000000000000009 R15: 0000000001449118
<6>[  376.340170] Code: 8b 45 00 65 49 8b 50 08 65 4c 03 05 24 5f ea 7e 49 83 78 10 00 4d 8b 20 74 4a 4d 85 e4 74 45 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 c1 49 63 
<1>[  376.340210] RIP: kmem_cache_alloc+0x57/0xd0 RSP: ffffc9002663f950
Comment 3 Janpieter Sollie 2017-02-28 09:41:09 UTC
(In reply to Michel Dänzer from comment #1)
> (In reply to Janpieter Sollie from comment #0)
> > I modified the drm_atomic.c a bit to be more verbose.  the drm_init calls a
> > general protection fault when allocating my second GPU(does it work this
> > way?) of the amdgpu module.
> 
> [...]
> 
> > [   77.950445] Modules linked in: amdkfd amdgpu(O+) amdttm(O) amdkcl(O)
> 
> amdttm and amdkcl indicate that you're using an out-of-tree version of the
> amdgpu kernel module, possibly from an amdgpu-pro release? Does the problem
> also occur with the in-tree amdgpu driver?

No, with the in-tree driver the machine simply reboots. I'm sorry for using the amdgpu-pro module, but the in-tree amdgpu driver doesn't tell me anything:

Running "modprobe amdgpu & cat /proc/kmsg" gives this as its last output:
<6>[  252.228071] amdgpu 0000:41:00.0: fence driver on ring 0 use gpu addr 0x0000000040000010, cpu addr 0xffff881831cc6010
<6>[  252.228253] amdgpu 0000:41:00.0: fence driver on ring 1 use gpu addr 0x0000000040000020, cpu addr 0xffff881831cc6020
<6>[  252.228601] amdgpu 0000:41:00.0: fence driver on ring 2 use gpu addr 0x0000000040000030, cpu addr 0xffff881831cc6030
<6>[  252.228768] amdgpu 0000:41:00.0: fence driver on ring 3 use gpu addr 0x0000000040000040, cpu addr 0xffff881831cc6040
<6>[  252.228939] amdgpu 0000:41:00.0: fence driver on ring 4 use gpu addr 0x0000000040000050, cpu addr 0xffff881831cc6050
<6>[  252.229092] [drm] probing gen 2 caps for device 1002:5a1f = 31cd02/0
<6>[  252.229096] [drm] PCIE gen 2 link speeds already enabled
<6>[  252.458236] [drm] ring test on 0 succeeded in 7 usecs
<6>[  252.460247] [drm] ring test on 1 succeeded in 1 usecs
<6>[  252.460256] [drm] ring test on 2 succeeded in 1 usecs
<6>[  252.460265] [drm] ring test on 3 succeeded in 3 usecs
<6>[  252.460272] [drm] ring test on 4 succeeded in 3 usecs
but this is probably related to another bug: 194559
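
Lines read from /proc/kmsg start with the log level in angle brackets, followed by the timestamp in seconds since boot; level 6 is KERN_INFO and level 1 is KERN_ALERT. A minimal sketch of splitting that prefix off a line is shown below; `struct kmsg_prefix` and `parse_kmsg_prefix` are names made up for illustration.

```c
#include <stdio.h>

/* Parsed prefix of a /proc/kmsg line such as "<6>[  252.228071] text". */
struct kmsg_prefix {
    int level;         /* syslog level: 0 (emergency) .. 7 (debug) */
    double timestamp;  /* seconds since boot */
};

/* Return 0 on success, -1 if the line does not start with "<N>[time". */
static int parse_kmsg_prefix(const char *line, struct kmsg_prefix *out)
{
    int level;
    double ts;

    /* %lf skips the padding spaces between '[' and the timestamp. */
    if (sscanf(line, "<%d>[%lf", &level, &ts) != 2)
        return -1;

    out->level = level;
    out->timestamp = ts;
    return 0;
}
```

Lines copied from plain dmesg output, which carry no angle-bracket prefix, are rejected by this parser.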
Comment 4 Janpieter Sollie 2017-02-28 10:10:22 UTC
A workaround (and a very dirty one) for the amdgpu-pro driver is to comment out this call at line 410: drm_fb_helper_initial_config(&rfbdev->helper, bpp_sel);
After that, the pro module works fine, as I don't use a framebuffer on headless cards ;)
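
A less invasive variant of the same idea would be to skip the call only for headless cards instead of removing it for all of them. The sketch below is a userspace model of that control flow only: `struct fbdev_model`, its `headless` flag, and both functions are hypothetical, and the stub merely stands in for drm_fb_helper_initial_config without reproducing the driver's real code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's fbdev state; not the real struct. */
struct fbdev_model {
    bool headless;     /* assumed flag: no display is attached to the card */
    int config_calls;  /* counts calls to the stubbed initial config */
};

/* Stub standing in for drm_fb_helper_initial_config(); only counts calls. */
static int initial_config_stub(struct fbdev_model *fb)
{
    fb->config_calls++;
    return 0;
}

/* Skip the initial fbdev configuration on headless cards, run it otherwise. */
static int fbdev_init_model(struct fbdev_model *fb)
{
    if (fb->headless)
        return 0;
    return initial_config_stub(fb);
}
```

How a driver would decide that a card is headless is left open here.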
Comment 5 Michel Dänzer 2017-03-01 01:07:59 UTC
This bugzilla is only for in-tree code. Also, amdgpu-pro doesn't officially support 4.10 yet, so this is basically an unsupported configuration.
Comment 6 Janpieter Sollie 2017-03-01 05:00:10 UTC
I know, but isn't it possible that the error is in DRM and not in the driver?
If not, sorry for reporting a wrong bug.
Comment 7 Janpieter Sollie 2017-03-07 08:10:41 UTC
Hello, I have some other news about this bug (if anyone is still interested):
I rewrote the amdgpu-pro driver, as the amdgpu driver seems more complex, and I got one step further! This is my procedure:
- remove the Cape Verde card from the PCI bus with echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove
- load the driver; the driver loading finishes successfully
- rescan the PCI bus: echo 1 > /sys/bus/pci/rescan
After that, amdgpu and amdgpu-pro behave the same: the system reboots after initialisation, even though I told it not to restart on panic.
Can somebody tell me where I should look for the Cape Verde initialization code in the in-tree driver? I may be able to fix the initialization bug.

Thank you
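
The remove and rescan steps above can also be driven from a small program. This is a minimal sketch, assuming the standard sysfs locations /sys/bus/pci/devices/<address>/remove and /sys/bus/pci/rescan; the helper names are made up for illustration, and writing to the real attributes requires root.

```c
#include <stdio.h>

/* Build "/sys/bus/pci/devices/<addr>/remove"; return 0 if it fits in buf. */
static int pci_remove_path(char *buf, size_t size, const char *addr)
{
    int n = snprintf(buf, size, "/sys/bus/pci/devices/%s/remove", addr);

    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Write value to an attribute file; return 0 on success, -1 on error. */
static int write_attr(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    /* fclose() flushes the buffer, so a failed write can surface here. */
    return fclose(f) == 0 ? 0 : -1;
}
```

The sequence would then be: write "1" to the path built for 0000:41:00.0, load the driver, and write "1" to /sys/bus/pci/rescan.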