Bug 206461

Summary: [REGRESSION][Bisected][AMD] Parts of OpenGL windows are replaced with a white background when DRI_PRIME is used
Product: Drivers Reporter: pawel.885
Component: IOMMU Assignee: drivers_iommu
Status: RESOLVED CODE_FIX    
Severity: normal CC: alexdeucher, joro
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=206895
Kernel Version: 5.5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Screenshot of the issue
Bisect log
lspci -vv on unaffected 5.4.15 kernel
dmesg output

Description pawel.885 2020-02-07 22:56:11 UTC

    
Comment 1 pawel.885 2020-02-07 23:01:11 UTC
Created attachment 287245 [details]
Screenshot of the issue
Comment 2 pawel.885 2020-02-07 23:17:53 UTC
Description of problem:
Windows of OpenGL applications run with the DRI_PRIME=1 environment variable (hybrid graphics) are filled with white backgrounds when scaled instead of being rendered properly.

How reproducible:
Always

Steps to Reproduce:
1. Boot kernel version 5.5-rc1 or later and log in to your Xorg session (I use xmonad without a compositor).
2. Run any OpenGL application with DRI_PRIME=1 set. I used glxgears for this test (mpv with --vo=opengl and radium are known to be affected as well).
For example:

DRI_PRIME=1 glxgears

Actual results:
Half of the window is filled with white, and the white pixels do not go away as you resize the window.

Expected results:
Application is rendered correctly.

Bisect results:
be62dbf554c5b50718a54a359372c148cd9975c7 is the first bad commit
commit be62dbf554c5b50718a54a359372c148cd9975c7
Author: Tom Murphy <murphyt7@tcd.ie>
Date:   Sun Sep 8 09:56:41 2019 -0700

    iommu/amd: Convert AMD iommu driver to the dma-iommu api

    Convert the AMD iommu driver to the dma-iommu api. Remove the iova
    handling and reserve region code from the AMD iommu driver.

    Signed-off-by: Tom Murphy <murphyt7@tcd.ie>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 drivers/iommu/Kconfig     |   1 +
 drivers/iommu/amd_iommu.c | 692 +++++-----------------------------------------
 2 files changed, 68 insertions(+), 625 deletions(-)

Extra info:
dmesg prints this from radeon/amd-vi once when glxgears is started:
AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x0000 address=0xfffffff2c0 flags=0x0010]

awk -f scripts/ver_linux
Linux archy 5.5.0-1-git-10086-g41dcd67e8868 #3 SMP PREEMPT Fri, 07 Feb 2020 22:37:35 +0000 x86_64 GNU/Linux

GNU C                   9.2.0
GNU Make                4.3
Binutils                2.33.1
Util-linux              2.35.1
Mount                   2.35.1
Module-init-tools       26
E2fsprogs               1.45.5
Jfsutils                1.1.15
Reiserfsprogs           3.6.27
Xfsprogs                5.4.0
Bison                   3.5.1
Flex                    2.6.4
Linux C Library         2.30
Dynamic linker (ldd)    2.30
Linux C++ Library       6.0.27
Procps                  3.3.15
Net-tools               2.10
Kbd                     2.2.0
Console-tools           2.2.0
Sh-utils                8.31
Udev                    244
Modules Loaded          acpi_cpufreq aesni_intel agpgart ahci amdgpu asus_wmi async_memcpy async_pq async_raid6_recov async_tx async_xor battery bcache blake2b_generic btrfs ccp crc32c_generic crc32c_intel crc32_pclmul crc64 crct10dif_pclmul cryptd crypto_simd dca dm_mod dm_raid drm drm_kms_helper eeepc_wmi evdev fat fb_sys_fops ghash_clmulni_intel glue_helper gpio_amdpt gpu_sched hid hid_generic i2c_algo_bit i2c_piix4 igb input_leds ip_tables irqbypass jc42 joydev k10temp kvm kvm_amd libahci libata libcrc32c mac_hid macvlan macvtap mc md_mod mousedev mxm_wmi nls_cp437 nls_iso8859_1 pcspkr pinctrl_amd radeon raid1 raid456 raid6_pq rfkill rng_core scsi_mod sd_mod snd snd_aloop snd_hda_codec snd_hda_codec_hdmi snd_hda_core snd_hda_intel snd_hwdep snd_intel_dspcfg snd_pcm snd_rawmidi snd_timer snd_usb_audio snd_usbmidi_lib soundcore sparse_keymap syscopyarea sysfillrect sysimgblt tap ttm tun usbhid vfat vfio vfio_iommu_type1 vfio_pci vfio_virqfd vhost vhost_net wmi wmi_bmof xfs xhci_hcd xhci_pci xor x_tables

Known fixes:
Reverting commit be62dbf554c5b50718a54a359372c148cd9975c7 on the v5.5 tag or disabling the IOMMU in the BIOS resolves the issue, and glxgears is rendered correctly.
Comment 3 pawel.885 2020-02-07 23:18:51 UTC
Created attachment 287247 [details]
Bisect log
Comment 4 pawel.885 2020-02-07 23:25:28 UTC
Created attachment 287253 [details]
lspci -vv on unaffected 5.4.15 kernel
Comment 5 Joerg Roedel 2020-02-19 09:58:55 UTC
This is likely a bug in the DRM code for your GPU. The recent changes in the AMD IOMMU driver might cause sg-list entries to be merged by the DMA-API. Some drivers probably don't handle this correctly.

I've seen a similar problem with RDMA devices, caused by the same change.

Please report this issue to the developers of your GPU driver.
Comment 6 Alex Deucher 2020-02-21 19:02:01 UTC
Please attach a copy of your dmesg output.  @Joerg, can you point to what changes are required in drivers to handle this?
Comment 7 Alex Deucher 2020-02-21 19:05:24 UTC
Also, has anyone seen similar issues with other IOMMU drivers that use the dma-iommu api?  If not, that would point to a problem on the IOMMU side.
Comment 8 pawel.885 2020-02-21 19:56:29 UTC
Created attachment 287545 [details]
dmesg output
Comment 9 pawel.885 2020-02-21 20:00:46 UTC
@Alex Added dmesg log
Ticket for drm/amd is at https://gitlab.freedesktop.org/drm/amd/issues/1056

I tried the patch/hack from the RDMA thread (https://www.spinics.net/lists/linux-nfs/msg76402.html), and it helps with this issue as well, fwiw.
Comment 10 Joerg Roedel 2020-02-22 12:37:09 UTC
(In reply to Alex Deucher from comment #6)
> Please attach a copy of your dmesg output.  @Joerg, can you point to what
> changes are required in drivers to handle this?

The AMD IOMMU driver in v5.5 switched its DMA-API implementation to the common dma-iommu code. The main difference in behavior between the old implementation and dma-iommu is that dma-iommu does sg-list merging, which means it can return fewer mapped segments than requested. See this paragraph from Documentation/DMA-API.txt:

>         int
>         dma_map_sg(struct device *dev, struct scatterlist *sg,
>                    int nents, enum dma_data_direction direction)
>
> Returns: the number of DMA address segments mapped (this may be shorter
> than <nents> passed in if some elements of the scatter/gather list are
> physically or virtually adjacent and an IOMMU maps them with a single
> entry).
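
The consequence of that paragraph can be illustrated with a minimal userspace sketch. This is not kernel code and not the actual fix: struct seg, toy_map_sg() and total_len() are hypothetical names made up for this illustration, and the merge rule (coalesce only when one segment ends exactly where the next begins) is a simplifying assumption. The point it tries to show is that a caller has to walk the number of segments returned by the mapping call, not the nents it passed in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather entry (NOT struct scatterlist). */
struct seg {
    uint64_t addr; /* start address of the segment */
    uint32_t len;  /* length of the segment in bytes */
};

/*
 * Toy model of sg-list merging: copy in[] to out[], coalescing an entry into
 * the previous output entry when it starts exactly where that one ends.
 * Returns the number of entries written to out[], which may be smaller
 * than nents. out[] must have room for nents entries.
 */
static int toy_map_sg(const struct seg *in, int nents, struct seg *out)
{
    int mapped = 0;

    assert(nents >= 0);
    for (int i = 0; i < nents; i++) {
        if (mapped > 0 &&
            out[mapped - 1].addr + out[mapped - 1].len == in[i].addr) {
            /* Adjacent to the previous mapped segment: extend it. */
            out[mapped - 1].len += in[i].len;
        } else {
            /* Not adjacent: start a new mapped segment. */
            out[mapped++] = in[i];
        }
    }
    return mapped;
}

/*
 * Consumer that handles merging correctly: it walks `count` entries, where
 * `count` is the value returned by toy_map_sg(), not the original nents.
 */
static uint64_t total_len(const struct seg *mapped, int count)
{
    uint64_t total = 0;

    for (int i = 0; i < count; i++)
        total += mapped[i].len;
    return total;
}
```

In this model, three input segments of 0x1000 bytes at 0x1000, 0x2000 and 0x8000 come back as two mapped segments, because the first two are adjacent. A consumer that still walks three entries reads one entry that was never filled in, which is the kind of mistake suspected in the affected drivers.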

(In reply to Alex Deucher from comment #7)
> Also, has anyone seen similar issues with other IOMMU drivers that use
> the dma-iommu api?  If not, that would point to a problem on the IOMMU side.

There is only one other IOMMU driver on x86, the VT-d driver, and it uses its own DMA-API implementation, which doesn't do sg-entry merging.

I have also seen similar bug reports for RDMA devices, and the culprit there was also that the RDMA device driver did not correctly handle the sg-entry merging case.
Comment 11 Alex Deucher 2020-03-23 16:02:10 UTC
See https://bugzilla.kernel.org/show_bug.cgi?id=206895
Comment 12 pawel.885 2020-04-12 21:26:26 UTC
Fixed in 5.6:

commit 42e67b479eab6d26459b80b4867298232b0435e7
commit 0199172f933342d8b1011aae2054a695c25726f4

https://gitlab.freedesktop.org/drm/amd/issues/1056#note_457214