Bug 218993

Summary: SIGBUS with amdgpu on multi-GPU system on X server with DRI3/GBM
Product: Drivers
Reporter: adaha
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED MOVED    
Severity: normal    
Priority: P3    
Hardware: All   
OS: Linux   
URL: https://gitlab.freedesktop.org/drm/amd/-/issues/3457
Kernel Version: 6.9.5-200.fc40.x86_64
Subsystem:
Regression: No
Bisected commit-id:
Attachments: trace before crash, Xvnc on Ryzen 5 7600, vkcube on Arc A380

Description adaha 2024-06-27 10:30:23 UTC
Created attachment 306503
trace before crash, Xvnc on Ryzen 5 7600, vkcube on Arc A380

I ran into a SIGBUS when using multiple GPUs and DRI with an X server that has
GPU acceleration (TigerVNC's Xvnc). This happened on a machine with:
OS: Fedora 40 running 6.9.5-200.fc40.x86_64
iGPU: Ryzen 5 7600
dGPU: RTX 4060 | Arc A380 | RX 7600

The issue occurs when the X server is configured to use an AMD rendernode, and
an application wants to use a non-AMD rendernode.

After the X server opens the AMD rendernode with gbm_create_device(), a SIGBUS
occurs once gbm_bo_map() is called, if the application is rendering on another
rendernode that is not an AMD GPU.

In my setup, /dev/dri/renderD128 is the AMD iGPU, and /dev/dri/renderD129 is an
RTX 4060.

If I run the X server with
$ Xvnc :50 -rendernode /dev/dri/renderD128

and vkcube with renderD129 on the X server
$ DISPLAY=:50 vkcube --gpu_number 1

I get the SIGBUS:
(EE) 
(EE) Backtrace:
(EE) 0: Xvnc (xorg_backtrace+0x82) [0x560c52b47d42]
(EE) 1: Xvnc (0x560c52991000+0x1b7f4c) [0x560c52b48f4c]
(EE) 2: /lib64/libc.so.6 (0x7f0c99613000+0x40710) [0x7f0c99653710]
(EE) 3: /lib64/libpixman-1.so.0 (0x7f0c99ed0000+0x8a2d0) [0x7f0c99f5a2d0]
(EE) 4: /lib64/libpixman-1.so.0 (pixman_blt+0x81) [0x7f0c99ede8d1]
(EE) 5: Xvnc (vncDRI3SyncPixmapFromGPU+0x10e) [0x560c529f303e]
(EE) 6: Xvnc (0x560c52991000+0x622c3) [0x560c529f32c3]
(EE) 7: Xvnc (dri3_pixmap_from_fds+0xcf) [0x560c52a7fdaf]
(EE) 8: Xvnc (0x560c52991000+0xf1309) [0x560c52a82309]
(EE) 9: Xvnc (Dispatch+0x426) [0x560c52ae3f56]
(EE) 10: Xvnc (dix_main+0x46a) [0x560c52af2d4a]
(EE) 11: /lib64/libc.so.6 (0x7f0c99613000+0x2a088) [0x7f0c9963d088]
(EE) 12: /lib64/libc.so.6 (__libc_start_main+0x8b) [0x7f0c9963d14b]
(EE) 13: Xvnc (_start+0x25) [0x560c529eed75]
(EE) 
(EE) Bus error at address 0x7f0c8e211000
(EE) 
Fatal server error:
(EE) Caught signal 7 (Bus error). Server aborting
(EE) 
Aborted (core dumped)

The same crash occurs when running vkcube on an Arc GPU (A380).

However, running the X server on an Arc or Nvidia GPU, and vkcube on the AMD
GPU, does not cause a crash. Neither does running the X server on AMD, and
vkcube on a different AMD GPU (iGPU & RX 7600 for example).

I've attached a stacktrace with the last call to mmap() before the crash.

Comment 1 adaha 2024-06-27 10:47:10 UTC
To clarify, the crash does not come from gbm_bo_map() directly, but from the incorrectly mapped memory, which causes a crash later in the program.

Comment 2 Artem S. Tashkinov 2024-06-27 11:10:28 UTC
This doesn't look like a kernel bug to me.

You could try asking here though: https://gitlab.freedesktop.org/drm/amd/-/issues