This was filed per the request here: https://lkml.org/lkml/2019/3/21/1023, so I'm going to dump as much info as I can.

On a very specific subset of ThinkPad P50 SKUs, particularly ones that come with a Quadro M1000M chip instead of the M2000M variant, the BIOS seems to have a very nasty habit of not always resetting the secondary Nvidia GPU between full reboots if the laptop is configured in Hybrid Graphics mode. The reason this happens is unknown, but the following steps (and possibly a good bit of patience) will reproduce the issue:

1. Boot up the laptop normally in Hybrid Graphics mode
2. Make sure nouveau is loaded and that the GPU is awake
3. Allow the nvidia GPU to runtime suspend itself after being idle
4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
5. If nouveau loads up properly, reboot the machine again and go back to step 2 until you reproduce the issue

This results in some very strange behavior: the GPU is quite literally left in exactly the same state it was in when the previously booted kernel started the reboot. This has all sorts of bad side effects: for starters, it completely breaks nouveau, starting with a mysterious EVO channel failure that happens well before we've actually used the EVO channel for anything:

nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002

Later on, this causes us to time out trying to bring up the GR ctx:

------------[ cut here ]------------
nouveau 0000:01:00.0: timeout
WARNING: CPU: 0 PID: 12 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547 gf100_grctx_generate+0x7b2/0x850 [nouveau]
Modules linked in: nouveau mxm_wmi i915 crc32c_intel ttm i2c_algo_bit serio_raw drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm xhci_hcd i2c_core wmi video
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.0.0-rc5Lyude-Test+ #29
Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET82W (1.55 ) 12/18/2018
Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
RIP: 0010:gf100_grctx_generate+0x7b2/0x850 [nouveau]
Code: 85 d2 75 04 48 8b 57 10 48 89 95 28 ff ff ff e8 b4 37 0e e1 48 8b 95 28 ff ff ff 48 c7 c7 b1 97 57 a0 48 89 c6 e8 5a 38 c0 e0 <0f> 0b e9 b9 fd ff ff 48 8b 85 60 ff ff ff 48 8b 40 10 48 8b 78 10
RSP: 0018:ffffc900000b77f0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888871af8000 RCX: 0000000000000000
RDX: ffff88887f41dfe0 RSI: ffff88887f415698 RDI: ffff88887f415698
RBP: ffffc900000b78c8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888872118000
R13: 0000000000000000 R14: ffffffffa0551420 R15: ffffc900000b7818
FS:  0000000000000000(0000) GS:ffff88887f400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005644d0556ca8 CR3: 0000000002214006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 gf100_gr_init_ctxctl+0x27b/0x2d0 [nouveau]
 gf100_gr_init+0x5bd/0x5e0 [nouveau]
 gf100_gr_init_+0x61/0x70 [nouveau]
 nvkm_gr_init+0x1d/0x20 [nouveau]
 nvkm_engine_init+0xcb/0x210 [nouveau]
 nvkm_subdev_init+0xd6/0x230 [nouveau]
 nvkm_engine_ref.part.0+0x52/0x70 [nouveau]
 nvkm_engine_ref+0x13/0x20 [nouveau]
 nvkm_ioctl_new+0x12c/0x260 [nouveau]
 ? nvkm_fifo_chan_child_del+0xa0/0xa0 [nouveau]
 ? gf100_gr_dtor+0xe0/0xe0 [nouveau]
 nvkm_ioctl+0xe2/0x180 [nouveau]
 nvkm_client_ioctl+0x12/0x20 [nouveau]
 nvif_object_ioctl+0x47/0x50 [nouveau]
 nvif_object_init+0xc8/0x120 [nouveau]
 nvc0_fbcon_accel_init+0x5c/0x960 [nouveau]
 nouveau_fbcon_create+0x5a5/0x5d0 [nouveau]
 ? drm_setup_crtcs+0x27b/0xcb0 [drm_kms_helper]
 ? __lock_is_held+0x5e/0xa0
 __drm_fb_helper_initial_config_and_unlock+0x27c/0x520 [drm_kms_helper]
 drm_fb_helper_hotplug_event.part.29+0xae/0xc0 [drm_kms_helper]
 drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper]
 nouveau_fbcon_output_poll_changed+0xb8/0x110 [nouveau]
 drm_kms_helper_hotplug_event+0x2a/0x40 [drm_kms_helper]
 drm_dp_send_link_address+0x176/0x1c0 [drm_kms_helper]
 drm_dp_check_and_send_link_address+0xa0/0xb0 [drm_kms_helper]
 drm_dp_mst_link_probe_work+0xa4/0xc0 [drm_kms_helper]
 process_one_work+0x22f/0x5c0
 worker_thread+0x44/0x3a0
 kthread+0x12b/0x150
 ? wq_pool_ids_show+0x140/0x140
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x3a/0x50
irq event stamp: 22490
hardirqs last  enabled at (22489): [<ffffffff8113281d>] console_unlock+0x44d/0x5f0
hardirqs last disabled at (22490): [<ffffffff81001c03>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (22486): [<ffffffff81c00330>] __do_softirq+0x330/0x44d
softirqs last disabled at (22479): [<ffffffff810c3105>] irq_exit+0xe5/0xf0
WARNING: CPU: 0 PID: 12 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547 gf100_grctx_generate+0x7b2/0x850 [nouveau]
---[ end trace bf0976ed88b122a8 ]---
nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
nouveau 0000:01:00.0: fifo: fault 01 [WRITE] at 0000000000008000 engine 00 [GR] client 15 [HUB/SCC_NB] reason c4 [] on channel -1 [0000000000 unknown]

From which the GPU never manages to recover. Booting without nouveau loading causes issues as well, since the GPU starts sending spurious interrupts that cause other devices' IRQs to get disabled by the kernel:

irq 16: nobody cared (try booting with the "irqpoll" option)
…
handlers:
[<000000007faa9e99>] i801_isr [i2c_i801]
Disabling IRQ #16
…
serio: RMI4 PS/2 pass-through port at rmi4-00.fn03
i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
i801_smbus 0000:00:1f.4: Transaction timeout
rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-110).
i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
i801_smbus 0000:00:1f.4: Transaction timeout
rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!

Which in turn causes the touchpad and sometimes even other things to get disabled.

-------------
How I originally discovered this bug, copy-pasted from the Red Hat BZ report that we have, since that part at least shouldn't be under NDA:

So, finally, that brings me to how we actually figured out that the hardware not being power cycled on reboot was the issue. What I did was get a snapshot of a couple of MMIO registers on the nvidia GPU right before shutting down:

PMC
00000200: fc35f16d 00000003

FIFO channels
00800004: 00000001
0080000c: 00000001
...
...

FIFO uevent
00002140: 79810101

Display channels
00610490: 2d0b009b 08778ad3
006104e0: 4003009b 087712e3
006104a0: 0c05009b 08731553
006104f0: 4003009b 087707d3
006104b0: 4003009b 087710f3
00610500: 4003009b 087a18f3
006104c0: 4003009b 0877b8a3
00610510: 4003009b 08784803
006104d0: 4003009b 08784813

All of the registers shown above aren't touched by the BIOS or VBIOS, and aren't supposed to retain their values after a power cycle. They're set by nouveau once the driver loads.
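In case anyone wants to grab the same snapshots themselves, here's a minimal sketch of one way to do it from userspace (not the exact tool I used): mmap the GPU's register BAR through sysfs and read the offsets directly. It assumes the GPU is at 0000:01:00.0, that the register space is BAR0 (resource0), and that the GPU is awake; reads from a runtime-suspended GPU will just come back as all 1s. The offset list is just a subset of the registers from the dumps above.

/*
 * mmio-peek.c - dump a few 32-bit registers from the GPU's BAR0.
 * Minimal sketch; build with "cc -o mmio-peek mmio-peek.c", run as root.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define BAR0    "/sys/bus/pci/devices/0000:01:00.0/resource0"
#define MAP_LEN 0x1000000 /* 16 MiB, enough to cover the offsets below */

int main(void)
{
	/* Offsets taken from the snapshots above */
	static const uint32_t offsets[] = {
		0x000200,		/* PMC */
		0x800004, 0x80000c,	/* FIFO channels */
		0x002140,		/* FIFO uevent */
		0x610490, 0x6104a0,	/* display channels (partial list) */
	};
	unsigned int i;

	int fd = open(BAR0, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	volatile uint32_t *mmio = mmap(NULL, MAP_LEN, PROT_READ,
				       MAP_SHARED, fd, 0);
	if (mmio == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
		printf("%08x: %08x\n", offsets[i], mmio[offsets[i] / 4]);

	munmap((void *)mmio, MAP_LEN);
	close(fd);
	return 0;
}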
Normally, if we reboot the machine, the registers will look like this before nouveau is loaded:

PMC
00000200: c0012020 00000003

FIFO channels
00800004: bad0da00
0080000c: bad0da00
00800014: bad0da00
0080001c: bad0da00

FIFO uevent
00002140: bad0da00

Display channels
00610490: 48000088 00000001
006104e0: 40000088 00000001
006104a0: 40000088 00000001
006104f0: 40000088 00000001
006104b0: 40000088 00000001
00610500: 40000088 00000001
006104c0: 40000088 00000001
00610510: 40000088 00000001
006104d0: 40000088 00000001

However, when reproducing this issue we actually observe something quite different. Before nouveau loads, the MMIO registers look like this:

PMC
00000200: fc35f16d 00000003

FIFO channels
00800004: 00000001
0080000c: 00000001
...
...

FIFO uevent
00002140: 79810101

Display channels
00610490: 490a009b 08778ad3
006104e0: 4003009b 087712e3
006104a0: 4003009b 08731553
006104f0: 4003009b 087707d3
006104b0: 4003009b 087710f3
00610500: 4003009b 087a18f3
006104c0: 4003009b 0877b8a3
00610510: 4003009b 08784803
006104d0: 4003009b 08784813

Notice how, on the display channels, the right-most column still has the exact same values that I showed you in the first register snapshot. The only way this is possible is if the GPU is actually still in the same state it was in right before shutting down. We further confirmed this by attempting a workaround: making sure the GPU was powered off all the way before finishing each shutdown (a rough sketch of what that looked like is at the end of this report). Doing this made the problem go away entirely.

-----------------
As a final note, I've uploaded an sosreport for the system in question. This contains pretty much everything you guys could ask for: lspci -nnvv, dmidecode, kernel logs, etc.

You can find the lspci -nnvv output in:
sos_commands/pci/lspci_-nnvv

And you can find the dmesg output in:
sos_commands/kernel/dmesg

And a more complete system log (which includes more dmesg than the file mentioned above):
sos_commands/logs/journalctl_--no-pager_--since_-3days

First, the SOSReport from my P50 which exhibits the BIOS issue:
https://people.freedesktop.org/~lyudess/archive/03-22-2019/sosreport-LyudeP50-2018-10-05-edlxase.tar.xz

Second, an SOSReport from a P50 which does not exhibit the BIOS issue (for comparison):
https://people.freedesktop.org/~lyudess/archive/03-22-2019/vamshee-sosreport.tar.xz

Don't hesitate to let me know if you need any more information.
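-----------------
P.S. - since I mentioned the shutdown workaround above without showing it, here's roughly the shape of the thing. This is an illustrative reconstruction using generic PCI/PM helpers, not the actual patch we tested; the function name nvgpu_shutdown is made up, and error handling is omitted:

/*
 * Sketch of a pci_driver .shutdown hook that makes sure the GPU is
 * fully powered down before the reboot continues, so the next boot
 * starts from a freshly reset device.
 */
#include <linux/pci.h>
#include <linux/pm_runtime.h>

static void nvgpu_shutdown(struct pci_dev *pdev)
{
	/* Wake the GPU first so it's in a known state... */
	pm_runtime_get_sync(&pdev->dev);

	/* ...then push it all the way down into D3 before reboot. */
	pci_save_state(pdev);
	pci_set_power_state(pdev, PCI_D3hot);
}

/* Wired up via the driver's struct pci_driver:
 *	.shutdown = nvgpu_shutdown,
 */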