Bug 215678

Summary: vmwgfx: probe of 0000:00:0f.0 failed with error -16
Product: Drivers Reporter: sander44 (ionut_n2001)
Component: Video(Other) Assignee: drivers_video-other
Status: NEW ---    
Severity: high CC: linux-kernel, regressions, zackr
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.17-rc7 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
journalctl
lscpu
lspci

Description sander44 2022-03-13 17:11:39 UTC
Hi,

With the latest kernel, I notice this:

dmesg | grep vmwgfx
[    2.959200] vmwgfx 0000:00:0f.0: vgaarb: deactivate vga console
[    2.959764] vmwgfx 0000:00:0f.0: BAR 1: can't reserve [mem 0xf0000000-0xf7ffffff pref]
[    2.959766] vmwgfx: probe of 0000:00:0f.0 failed with error -16


lspci -s 0000:00:0f.0 -nnvv
00:0f.0 VGA compatible controller [0300]: VMware SVGA II Adapter [15ad:0405] (prog-if 00 [VGA controller])
        Subsystem: VMware SVGA II Adapter [15ad:0405]
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at 2140 [size=16]
        Region 1: Memory at f0000000 (32-bit, prefetchable) [size=128M]
        Region 2: Memory at fb800000 (32-bit, non-prefetchable) [size=8M]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=00 <?>
        Capabilities: [44] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel modules: vmwgfx

Driver version:
#define VMWGFX_DRIVER_DATE "20211206"

Host: Windows 11 & VMware Workstation 16 Pro 16.2.0 build-18760230
Guest: Debian 11
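
For reference, the `-16` in the probe failure is a negated errno value. A minimal sketch, assuming Linux errno numbering, that extracts the code from a log line like the one above and names it (the helper `decode_probe_error` is hypothetical and not part of any kernel tooling):

```python
import errno
import os
import re


def decode_probe_error(line):
    """Return (errno_name, description) for a 'failed with error -N' log line.

    Returns None when the line carries no such error code.
    """
    match = re.search(r"failed with error -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    # Line copied from the dmesg output in the description.
    sample = "[    2.959766] vmwgfx: probe of 0000:00:0f.0 failed with error -16"
    print(decode_probe_error(sample))
```

On Linux, 16 is EBUSY, which matches the preceding "BAR 1: can't reserve" line: the memory region is already held by something else when vmwgfx tries to reserve it.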
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-17 13:42:50 UTC
Just to be sure: it works with older kernel versions, like Linux 5.16?
Comment 2 sander44 2022-03-18 06:53:44 UTC
Yes, it's working correctly.
Comment 3 Zack Rusin 2022-03-18 12:41:36 UTC
It's a little hard to tell without the full log but this looks like the pci reservation bug that was fixed by:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/video/fbdev/core/fbmem.c?id=27599aacbaefcbf2af7b06b0029459bbf682000d
It should go in through drm-misc tree, drm-misc-next branch.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-18 13:01:03 UTC
(In reply to Zack Rusin from comment #3)
> It's a little hard to tell without the full log but this looks like the pci
> reservation bug that was fixed by:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> drivers/video/fbdev/core/fbmem.c?id=27599aacbaefcbf2af7b06b0029459bbf682000d

thx

According to https://patchwork.freedesktop.org/series/99243/ this seems to be a patch from a series. Am I right in assuming the patch you specified is enough to fix this (assuming that this bug is triggered by the "pci reservation bug")?
Comment 5 Zack Rusin 2022-03-18 14:40:17 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #4)
> (In reply to Zack Rusin from comment #3)
> > It's a little hard to tell without the full log but this looks like the pci
> > reservation bug that was fixed by:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> >
> drivers/video/fbdev/core/fbmem.c?id=27599aacbaefcbf2af7b06b0029459bbf682000d
> 
> thx
> 
> According to https://patchwork.freedesktop.org/series/99243/ this seems to
> be a patch from a series. Am I right in assuming the patch you specified is
> enough to fix this (assuming  that this bug is triggered by the "pci
> reservation bug")?

Yes, that's correct. Thomas and Javier were doing more work in those areas so there might be more related changes, but that one specific commit is enough to get platform fb drivers to release pci resources and allow drm drivers like vmwgfx to load correctly.
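
One way to check which owner still holds the range is to look at /proc/iomem in the guest (as root, since unprivileged reads show zeroed addresses). A rough sketch, assuming the usual `start-end : name` layout with two spaces of indentation per nesting level; the helper name and the sample text are made up for illustration and are not taken from the attached logs:

```python
import re

# Matches lines such as "  f0000000-f02fffff : some-owner".
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def find_overlaps(iomem_text, start, end):
    """Return (depth, start, end, name) for entries overlapping [start, end]."""
    hits = []
    for line in iomem_text.splitlines():
        m = IOMEM_LINE.match(line)
        if m is None:
            continue
        lo, hi = int(m.group(2), 16), int(m.group(3), 16)
        # Two inclusive ranges overlap when each starts before the other ends.
        if lo <= end and start <= hi:
            hits.append((len(m.group(1)) // 2, lo, hi, m.group(4)))
    return hits


# Hypothetical excerpt; "example-fb" is a placeholder owner name.
SAMPLE = """\
000a0000-000bffff : PCI Bus 0000:00
f0000000-f7ffffff : 0000:00:0f.0
  f0000000-f02fffff : example-fb
fb800000-fbffffff : 0000:00:0f.0
"""

if __name__ == "__main__":
    # BAR 1 range from the "can't reserve" message in the description.
    for depth, lo, hi, name in find_overlaps(SAMPLE, 0xF0000000, 0xF7FFFFFF):
        print("  " * depth + f"{lo:08x}-{hi:08x} : {name}")
```

On a real system the text would come from reading `/proc/iomem`; a child entry nested under the device's BAR range would point at whoever claimed it first.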
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-18 15:07:22 UTC
(In reply to sander44 from comment #2)
> Yes, it's working correctly.

@sander44: the patch to resolve this actually fixes an issue already present in 5.11; is it possible that your 5.17-rc kernel is built from a similar configuration as the older kernel that was working (see https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/plain/Documentation/admin-guide/reporting-regressions.rst for an explanation)? And was it actually 5.16? Or an even older kernel?

Side note: /me wonders why the fix for this issue wasn't merged this cycle, as it was approved weeks ago...
Comment 7 sander44 2022-03-23 06:08:32 UTC
Hi Thorsten Leemhuis,

I will try today to make a compilation with 5.17 mainline to see if it reproduces.
I will attach more logs if it reproduces.
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-23 06:20:42 UTC
(In reply to sander44 from comment #7)
>
> I will try today to make a compilation with 5.17 mainline to see if it
> reproduces.

It likely will afaics.

> I will attach more logs if it reproduces.

No need. Try to apply https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/video/fbdev/core/fbmem.c?id=27599aacbaefcbf2af7b06b0029459bbf682000d on top of 5.17. It will be applied soon and will likely get backported in roughly two weeks. Further logs likely won't change much.
Comment 9 sander44 2022-03-23 07:02:46 UTC
I started the system configuration with 5.17. 
And it seems to have worked for me now.

But I notice this:
[    3.415301] ------------[ cut here ]------------
[    3.415304] refcount_t: addition on 0; use-after-free.
[    3.415310] WARNING: CPU: 1 PID: 713 at lib/refcount.c:25 refcount_warn_saturate+0x9b/0x150
[    3.415316] Modules linked in: qrtr vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock intel_rapl_msr intel_rapl_common nls_iso8859_1 vmw_balloon crct10dif_pclmul ghash_clmulni_intel aesni_intel snd_ens1371 crypto_simd cryptd snd_ac97_codec gameport snd_rawmidi snd_seq_device snd_pcsp ac97_bus joydev input_leds snd_pcm snd_timer serio_raw snd efi_pstore soundcore vmw_vmci mac_hid ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbmouse usbhid hid vmwgfx drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core crc32_pclmul mptspi mptscsih mptbase psmouse drm e1000 scsi_transport_spi i2c_piix4 pata_acpi
[    3.415348] CPU: 1 PID: 713 Comm: Xorg Not tainted 5.17.0-mainline-vanilla-lowlatency #1
[    3.415350] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18452719.B64.2108091906 08/09/2021
[    3.415351] RIP: 0010:refcount_warn_saturate+0x9b/0x150
[    3.415353] Code: c9 c3 0f b6 1d a5 6f be 01 80 fb 01 0f 87 5e c3 6c 00 83 e3 01 75 e5 48 c7 c7 20 dd e1 84 c6 05 89 6f be 01 01 e8 77 de 68 00 <0f> 0b eb ce 0f b6 1d 7b 6f be 01 80 fb 01 0f 87 1e c3 6c 00 83 e3
[    3.415354] RSP: 0018:ffffb06840d5bbf8 EFLAGS: 00010282
[    3.415356] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[    3.415357] RDX: ffffa08875e60988 RSI: 0000000000000001 RDI: ffffa08875e60980
[    3.415357] RBP: ffffb06840d5bc00 R08: 0000000000000003 R09: fffffffffff1b468
[    3.415358] R10: 000000000000002c R11: 0000000000000001 R12: ffffa0874f206800
[    3.415358] R13: ffffa0875066fe00 R14: ffffa0875066fe00 R15: ffffa0875066fe00
[    3.415359] FS:  00007f902f498ec0(0000) GS:ffffa08875e40000(0000) knlGS:0000000000000000
[    3.415360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.415361] CR2: 000055a6d3823798 CR3: 00000001100ac000 CR4: 0000000000750ee0
[    3.415379] PKRU: 55555554
[    3.415380] Call Trace:
[    3.415381]  <TASK>
[    3.415384]  drm_gem_handle_create_tail+0x197/0x1a0 [drm]
[    3.415398]  drm_gem_handle_create+0x36/0x40 [drm]
[    3.415408]  vmw_gb_surface_reference_internal+0x9b/0x1d0 [vmwgfx]
[    3.415417]  ? vmw_gb_surface_reference_ioctl+0xa0/0xa0 [vmwgfx]
[    3.415423]  vmw_gb_surface_reference_ext_ioctl+0x14/0x20 [vmwgfx]
[    3.415428]  drm_ioctl_kernel+0xb7/0x150 [drm]
[    3.415439]  drm_ioctl+0x264/0x4b0 [drm]
[    3.415448]  ? vmw_gb_surface_reference_ioctl+0xa0/0xa0 [vmwgfx]
[    3.415454]  vmw_generic_ioctl+0xc0/0x180 [vmwgfx]
[    3.415460]  vmw_unlocked_ioctl+0x15/0x20 [vmwgfx]
[    3.415465]  __x64_sys_ioctl+0x91/0xc0
[    3.415468]  do_syscall_64+0x5c/0xc0
[    3.415471]  ? syscall_exit_to_user_mode+0x27/0x50
[    3.415472]  ? do_syscall_64+0x69/0xc0
[    3.415474]  ? syscall_exit_to_user_mode+0x27/0x50
[    3.415475]  ? do_syscall_64+0x69/0xc0
[    3.415476]  ? asm_exc_page_fault+0x8/0x30
[    3.415478]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    3.415480] RIP: 0033:0x7f902f90e397
[    3.415482] Code: 3c 1c e8 1c ff ff ff 85 c0 79 87 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 da 0d 00 f7 d8 64 89 01 48
[    3.415483] RSP: 002b:00007ffccdb90b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[    3.415484] RAX: ffffffffffffffda RBX: 00007ffccdb90b90 RCX: 00007f902f90e397
[    3.415484] RDX: 00007ffccdb90b90 RSI: 00000000c060645c RDI: 0000000000000013
[    3.415485] RBP: 00000000c060645c R08: 0000000000000013 R09: 00007f902f9ecc00
[    3.415485] R10: 000055a6d381c540 R11: 0000000000000246 R12: 0000000000000000
[    3.415486] R13: 0000000000000013 R14: 000055a6d353c110 R15: 00007ffccdb90c68
[    3.415488]  </TASK>
[    3.415489] ---[ end trace 0000000000000000 ]---
Comment 10 sander44 2022-03-23 07:03:33 UTC
Created attachment 300603 [details]
dmesg
Comment 11 sander44 2022-03-23 07:04:42 UTC
Created attachment 300604 [details]
journalctl
Comment 12 sander44 2022-03-23 07:05:06 UTC
Created attachment 300605 [details]
lscpu
Comment 13 sander44 2022-03-23 07:05:42 UTC
Created attachment 300606 [details]
lspci
Comment 14 Zack Rusin 2022-03-23 13:54:48 UTC
The new logs look like you're just using efifb; it's sysfb that's broken without the above patch. efifb, apart from some one-off bug, should be fine now.

The gem warning is unrelated. It looks like some userspace app is trying to reference something that hasn't been initialized as a surface. I think I fixed something like that recently, if you have a second and could try to reproduce on drm-tip https://cgit.freedesktop.org/drm-tip (drm-misc-next would be the second best https://cgit.freedesktop.org/drm/drm-misc ) I'd be very interested to know what userspace app triggers this to fix it (probably in a separate bug though).
Comment 15 Joe Breuer 2022-04-10 22:00:31 UTC
I'm trying to get a "headful" VM working in qemu / Proxmox, and I'm having trouble with both qxl and vmwgfx.

The qxl issue may or may not be related, it also hinges on a memory range mapping being denied: https://bugs.gentoo.org/829759#c7

The qxl kernel module correctly takes over the console during boot, "only" the X.org qxl_drv.so fails to initialize ultimately because it can't mmap() what seems to be the fb region to me.

I gave up on qxl and tried vmwgfx next. X.org would segfault on me, console is stuck on the bootloader (reFInd) messages, and I found the same kernel error messages mentioned above, which brought me here:

pci 0000:00:01.0: [15ad:0405] type 00 class 0x030000
pci 0000:00:01.0: reg 0x10: [io  0xd320-0xd32f]
pci 0000:00:01.0: reg 0x14: [mem 0xc0000000-0xc3ffffff pref]
pci 0000:00:01.0: reg 0x18: [mem 0xc5240000-0xc524ffff pref]
pci 0000:00:01.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
pci 0000:00:01.0: BAR 1: assigned to efifb
pci 0000:00:01.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[...]
pci 0000:00:01.0: vgaarb: setting as boot VGA device
pci 0000:00:01.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
pci 0000:00:01.0: vgaarb: bridge control possible
vgaarb: loaded
[...]
pci 0000:00:01.0: can't claim BAR 0 [io  0xd320-0xd32f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
[...]
pci 0000:00:01.0: BAR 0: assigned [io  0x1420-0x142f]
[...]
pci_bus 0000:01: resource 0 [io  0xd000-0xdfff]
pci_bus 0000:01: resource 1 [mem 0xc5000000-0xc51fffff]
pci_bus 0000:01: resource 2 [mem 0x800100000-0x8002fffff 64bit pref]
[...]
vmwgfx 0000:00:01.0: vgaarb: deactivate vga console
[TTM] Zone  kernel: Available graphics memory: 11974078 KiB
[TTM] Zone   dma32: Available graphics memory: 2097152 KiB
vmwgfx 0000:00:01.0: BAR 1: can't reserve [mem 0xc0000000-0xc3ffffff pref]
[TTM] Zone  kernel: Used memory at exit: 0 KiB
[TTM] Zone   dma32: Used memory at exit: 0 KiB
vmwgfx: probe of 0000:00:01.0 failed with error -16

Applying just the patch mentioned in comment #8 does NOT change this at all for me.

Then, I applied the whole series of 5 patches it belongs to:

https://patchwork.freedesktop.org/series/99243/#rev2

This still gives exactly the same kernel log messages, I still don't have a console on vmwgfx, but X.org starts correctly now.
Comment 16 Joe Breuer 2022-04-10 22:03:21 UTC
Unfortunate correction / addition: With the patches applied, X.org works SOMETIMES. I just restarted X while configuring the display manager, and it broke again:

[   800.506] (EE) Backtrace:
[   800.506] (EE) 0: /usr/bin/X (xorg_backtrace+0x4d) [0x556d7e16a9f2]
[   800.507] (EE) 1: /usr/bin/X (0x556d7e03b000+0x133206) [0x556d7e16e206]
[   800.507] (EE) 2: /lib64/libc.so.6 (0x7f6bfb380000+0x3db40) [0x7f6bfb3bdb40]
[   800.507] (EE) 3: /usr/lib64/xorg/modules/drivers/vmware_drv.so (0x7f6bfac67000+0xa52a) [0x7f6bfac7152a]
[   800.507] (EE) 4: /usr/bin/X (0x556d7e03b000+0x151ff7) [0x556d7e18cff7]
[   800.507] (EE) 5: /usr/bin/X (xf86VTEnter+0x76) [0x556d7e1859f7]
[   800.507] (EE) 6: /usr/bin/X (WakeupHandler+0xa7) [0x556d7e0b01d2]
[   800.507] (EE) 7: /usr/bin/X (WaitForSomething+0x190) [0x556d7e16860b]
[   800.507] (EE) 8: /usr/bin/X (0x556d7e03b000+0x70bb1) [0x556d7e0abbb1]
[   800.507] (EE) 9: /usr/bin/X (0x556d7e03b000+0x7479a) [0x556d7e0af79a]
[   800.507] (EE) 10: /lib64/libc.so.6 (0x7f6bfb380000+0x291ca) [0x7f6bfb3a91ca]
[   800.507] (EE) 11: /lib64/libc.so.6 (__libc_start_main+0x78) [0x7f6bfb3a9278]
[   800.507] (EE) 12: /usr/bin/X (_start+0x21) [0x556d7e075a41]
[   800.507] (EE) 
[   800.507] (EE) Segmentation fault at address 0x0
Comment 17 Zack Rusin 2022-04-11 14:37:51 UTC
(In reply to Joachim Breuer from comment #15)
> I'm trying to get a "headful" VM working in qemu / Proxmox, and I'm having
> trouble with both qxl and vmwgfx.
> 
> The qxl issue may or may not be related, it also hinges on a memory range
> mapping being denied: https://bugs.gentoo.org/829759#c7
> 
> The qxl kernel module correctly takes over the console during boot, "only"
> the X.org qxl_drv.so fails to initialize ultimately because it can't mmap()
> what seems to be the fb region to me.
> 
> I gave up on qxl and tried vmwgfx next. X.org would segfault on me, console
> is stuck on the bootloader (reFInd) messages, and I found the same kernel
> error messages mentioned above, which brought me here:
> 
> pci 0000:00:01.0: [15ad:0405] type 00 class 0x030000
> pci 0000:00:01.0: reg 0x10: [io  0xd320-0xd32f]
> pci 0000:00:01.0: reg 0x14: [mem 0xc0000000-0xc3ffffff pref]
> pci 0000:00:01.0: reg 0x18: [mem 0xc5240000-0xc524ffff pref]
> pci 0000:00:01.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
> pci 0000:00:01.0: BAR 1: assigned to efifb
> pci 0000:00:01.0: Video device with shadowed ROM at [mem
> 0x000c0000-0x000dffff]
> [...]
> pci 0000:00:01.0: vgaarb: setting as boot VGA device
> pci 0000:00:01.0: vgaarb: VGA device added:
> decodes=io+mem,owns=io+mem,locks=none
> pci 0000:00:01.0: vgaarb: bridge control possible
> vgaarb: loaded
> [...]
> pci 0000:00:01.0: can't claim BAR 0 [io  0xd320-0xd32f]: address conflict
> with PCI Bus 0000:01 [io  0xd000-0xdfff]
> [...]
> pci 0000:00:01.0: BAR 0: assigned [io  0x1420-0x142f]
> [...]
> pci_bus 0000:01: resource 0 [io  0xd000-0xdfff]
> pci_bus 0000:01: resource 1 [mem 0xc5000000-0xc51fffff]
> pci_bus 0000:01: resource 2 [mem 0x800100000-0x8002fffff 64bit pref]
> [...]
> vmwgfx 0000:00:01.0: vgaarb: deactivate vga console
> [TTM] Zone  kernel: Available graphics memory: 11974078 KiB
> [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
> vmwgfx 0000:00:01.0: BAR 1: can't reserve [mem 0xc0000000-0xc3ffffff pref]
> [TTM] Zone  kernel: Used memory at exit: 0 KiB
> [TTM] Zone   dma32: Used memory at exit: 0 KiB
> vmwgfx: probe of 0000:00:01.0 failed with error -16
> 
> Applying just the patch mentioned in comment #8 does NOT change this at all
> for me.

This looks like a different bug. What virtualization platform is this on? It's hard to tell without the full log, but it looks like the kernel has to re-enumerate PCI devices due to a BAR range conflict and the svga device doesn't acknowledge the new ranges. We fixed a bug related to this in VMware's products. I'm guessing that if you remove the PCI device that causes the BAR range conflict, the VM will work again.
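
The conflict referred to here can be read off the quoted log: BAR 0 at [io 0xd320-0xd32f] falls inside the window of PCI Bus 0000:01, [io 0xd000-0xdfff]. A small sketch that pulls both ranges out of that line and checks the overlap; the regular expression and function names are illustrative, written only against the line format shown in comment #15:

```python
import re

# Matches "can't claim BAR N [io|mem 0xA-0xB ...]: address conflict with X [io|mem 0xC-0xD".
CLAIM_RE = re.compile(
    r"can't claim BAR (\d+) \[(io|mem)\s+0x([0-9a-f]+)-0x([0-9a-f]+)[^\]]*\]: "
    r"address conflict with (.+?) \[(?:io|mem)\s+0x([0-9a-f]+)-0x([0-9a-f]+)"
)


def parse_claim_conflict(line):
    """Extract the BAR range and the conflicting range from a kernel log line."""
    m = CLAIM_RE.search(line)
    if m is None:
        return None
    return {
        "bar": int(m.group(1)),
        "kind": m.group(2),
        "bar_range": (int(m.group(3), 16), int(m.group(4), 16)),
        "other": m.group(5),
        "other_range": (int(m.group(6), 16), int(m.group(7), 16)),
    }


def ranges_overlap(a, b):
    """True when two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Line copied from the log in comment #15.
    line = ("pci 0000:00:01.0: can't claim BAR 0 [io  0xd320-0xd32f]: "
            "address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]")
    info = parse_claim_conflict(line)
    print(info["other"], ranges_overlap(info["bar_range"], info["other_range"]))
```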
Comment 18 Joe Breuer 2022-04-12 09:35:07 UTC
Hi Zack,

(In reply to Zack Rusin from comment #17)
> (In reply to Joachim Breuer from comment #15)
> > I'm trying to get a "headful" VM working in qemu / Proxmox, and I'm having
> > trouble with both qxl and vmwgfx.
> 
> This looks like a different bug. What virtualization platform is this on?
> It's hard to tell without the full log but it looks like the kernel has to
> reenumarate pci devices due to bar range conflict and the svga device
> doesn't acknowledge the new ranges. We fixed a bug related to this in
> VMware's products. I'm guessing that if you remove the PCI devices that
> causes the BAR range conflict the vm will be working again.

This is on Proxmox VE 6.3-6, i.e. their variant/fork/whatever it is of qemu/kvm.

In the meantime I can say that X also crashes with qxl and the whole kernel patch series applied, i.e. it fails differently. Without the kernel patch series, qxl does not initialize:

int fd = open("/sys/bus/pci/devices/0000:00:01.0/resource0", O_RDWR | O_CLOEXEC);
void *mem = mmap(NULL, 0x20000000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

yields MAP_FAILED with errno == EINVAL

(This is basically what the qxl X driver does to obtain the framebuffer region.)
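
For anyone who wants to repeat this check without building a C program, here is a rough Python equivalent of the two calls above. It is only a sketch: the function name is made up, and Python's `mmap` module checks the requested length against the file size itself for regular files, so an oversized length may surface as a `ValueError` from Python rather than as `EINVAL` from the kernel.

```python
import errno
import mmap
import os


def try_map_resource(path, length):
    """Try to map `length` bytes of `path` read/write and shared.

    Returns None on success, otherwise a short string naming the failure.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        mapping = mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
    except OSError as exc:
        # Failure reported by the mmap() system call itself.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    except ValueError as exc:
        # Raised by Python's own argument checks before mmap() is called.
        return f"ValueError: {exc}"
    else:
        mapping.close()
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Path and length taken from the C fragment above; needs root.
    print(try_map_resource("/sys/bus/pci/devices/0000:00:01.0/resource0",
                           0x20000000))
```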

With the kernel patch series applied, that mmap() works as expected, and the X server crashes quite a bit later within xf86InitViewport().

As a data point, it would seem that qxl requires the patch series fix similarly to vmwgfx, although for both that's not enough to get a working X display for me.

Emulated "Standard VGA" works on the same VM.

This VM is suitable for testing, so I'd be happy to try things out.

I've seen indications that "something changed" between X.org 1.20 and the 21.1.3 I'm currently running, so I'll dig into that next.
Comment 19 Joe Breuer 2022-04-12 18:12:53 UTC
Forgot to mention: The "patch series required" issue affects me on released kernel version 5.15.32.