Bug 210301

Summary: *ERROR* IB test failed on gfx (-110) on Ryzen 4750u
Product: Drivers
Reporter: ar_registration
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW ---
Severity: normal
CC: alexdeucher, me
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.9.8
Subsystem:
Regression: No
Bisected commit-id:

Description ar_registration 2020-11-21 22:43:05 UTC
When booting a new ThinkPad T14 AMD (4750U), I *sometimes* get the following errors:

2020-11-21T16:45:15.53245 kern.err: [    5.934142] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
2020-11-21T16:45:15.53247 kern.err: [    5.935598] [drm:process_one_work] *ERROR* ib ring test failed (-110).
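
The negative number in parentheses is a kernel errno value; on Linux, 110 is ETIMEDOUT, so the IB test timed out. A minimal sketch for pulling such lines out of a saved log and naming the code follows. The function name, the regular expression, and the small errno table are illustrative assumptions, not part of any amdgpu tooling.

```python
import re

# Subset of Linux errno values (asm-generic); hard-coded so the
# result does not depend on the platform running this script.
LINUX_ERRNO = {5: "EIO", 12: "ENOMEM", 16: "EBUSY", 22: "EINVAL", 110: "ETIMEDOUT"}

# Matches e.g. "IB test failed on gfx (-110)"; pattern is an assumption
# based only on the two log lines quoted in this report.
IB_FAIL = re.compile(r"IB test failed on (?P<ring>\S+) \((?P<code>-?\d+)\)")


def find_ib_failures(log_text):
    """Return (ring, code, errno_name) for each IB test failure line."""
    results = []
    for line in log_text.splitlines():
        match = IB_FAIL.search(line)
        if match:
            code = int(match.group("code"))
            name = LINUX_ERRNO.get(-code, "unknown")
            results.append((match.group("ring"), code, name))
    return results


if __name__ == "__main__":
    sample = (
        "[    5.934142] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] "
        "*ERROR* IB test failed on gfx (-110).\n"
        "[    5.935598] [drm:process_one_work] *ERROR* ib ring test failed (-110).\n"
    )
    for ring, code, name in find_ib_failures(sample):
        print(f"ring={ring} code={code} errno={name}")
```

Only the first of the two quoted lines names a ring, so only that line is reported by this sketch.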

When this happens (typically 9 times out of 10), the machine eventually hangs, sometimes with a page fault and sometimes with a general protection fault, but *always* in what appear to be ZFS calls.

When the machine boots without the "*ERROR* IB test failed on gfx" message, it *never* produces segmentation faults or general protection faults, even under very heavy I/O load.

I have reproduced this problem:
- In kernels 5.8.18, 5.9.6, 5.9.7, 5.9.8
- With two different SSD
- With the firmware versions 20201118, 20201022, 20200918
- Across two BIOS updates (versions 1.09, 1.25, 1.27)

I boot with the kernel command line options amd_iommu=on iommu=soft in order to avoid *another* error (AMD-Vi: Unable to read/write to IOMMU perf counter.)
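
To confirm which IOMMU options a given boot actually used, the command line can be read back from /proc/cmdline. A rough sketch, assuming simple whitespace-separated parameters (quoted values containing spaces are not handled); the helper names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # If a parameter is repeated, the last occurrence is kept here.
        params[key] = value if sep else True
    return params


def iommu_settings(cmdline):
    """Return only the IOMMU-related parameters mentioned in this report."""
    params = parse_cmdline(cmdline)
    return {k: params[k] for k in ("amd_iommu", "iommu") if k in params}


if __name__ == "__main__":
    # /proc/cmdline exists on Linux only.
    with open("/proc/cmdline") as fh:
        print(iommu_settings(fh.read()))
```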

For example, one of the general protection faults:

2020-10-24T22:00:29.88940 kern.warn: [ 1021.160344] general protection fault, probably for non-canonical address 0x3ff800040000000: 0000 [#1] SMP NOPTI
2020-10-24T22:00:29.89005 kern.warn: [ 1021.160348] CPU: 10 PID: 12746 Comm: bonnie++ Tainted: P           O      5.8.16_1 #1
2020-10-24T22:00:29.89014 kern.warn: [ 1021.160349] Hardware name: LENOVO 20UDCTO1WW/20UDCTO1WW, BIOS R1BET40W(1.09 ) 08/07/2020
2020-10-24T22:00:29.89021 kern.warn: [ 1021.160385] RIP: 0010:dbuf_find+0x86/0x190 [zfs]
2020-10-24T22:00:29.89030 kern.warn: [ 1021.160387] Code: 7b 01 00 49 89 57 28 4a 8b 04 f0 48 85 c0 0f 84 b2 00 00 00 48 8b 0c 24 49 89 d6 eb 0d 48 8b 40 38 48 85 c0 0f 84 9c 00 00 00 <48> 39 18 75 ee 48 39 68 20 75 e8 44 38 68 50 75 e2 48 39 48 40 75
2020-10-24T22:00:29.89033 kern.warn: [ 1021.160387] RSP: 0018:ffffb3ff15a4ba28 EFLAGS: 00010206
2020-10-24T22:00:29.89041 kern.warn: [ 1021.160389] RAX: 03ff800040000000 RBX: 00000000000587ce RCX: 0000000000001fd4
2020-10-24T22:00:29.89044 kern.warn: [ 1021.160389] RDX: ffff98bc4b6dbe00 RSI: 00000000000587ce RDI: ffffffffc11be4d0
2020-10-24T22:00:29.89046 kern.warn: [ 1021.160390] RBP: ffff98bc9cfa3000 R08: 99de27c9bb992351 R09: 9ae16a3b2f90404f
2020-10-24T22:00:29.89049 kern.warn: [ 1021.160391] R10: 0000000000000000 R11: ffff98bcca928000 R12: 0000000000030fc0
2020-10-24T22:00:29.89061 kern.warn: [ 1021.160391] R13: 0000000000000000 R14: ffff98bc4b6dbe00 R15: ffffffffc11be4d0
2020-10-24T22:00:29.89065 kern.warn: [ 1021.160392] FS:  00007f739cf48740(0000) GS:ffff98bccf080000(0000) knlGS:0000000000000000
2020-10-24T22:00:29.89070 kern.warn: [ 1021.160393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-24T22:00:29.89077 kern.warn: [ 1021.160393] CR2: 00007f790e77d280 CR3: 0000000757346000 CR4: 0000000000340ee0
2020-10-24T22:00:29.89081 kern.warn: [ 1021.160394] Call Trace:
2020-10-24T22:00:29.89084 kern.warn: [ 1021.160400]  ? _cond_resched+0x15/0x30
2020-10-24T22:00:29.89088 kern.warn: [ 1021.160410]  dbuf_hold_impl_arg+0x41/0x610 [zfs]
2020-10-24T22:00:29.89091 kern.warn: [ 1021.160413]  ? spl_kmem_alloc+0xbf/0x100 [spl]
2020-10-24T22:00:29.89095 kern.warn: [ 1021.160424]  dbuf_hold_impl+0x94/0xc0 [zfs]
2020-10-24T22:00:29.89098 kern.warn: [ 1021.160434]  dbuf_hold_level+0x2b/0x60 [zfs]
2020-10-24T22:00:29.89102 kern.warn: [ 1021.160446]  dmu_tx_check_ioerr+0x35/0xd0 [zfs]
2020-10-24T22:00:29.89105 kern.warn: [ 1021.160457]  dmu_tx_count_write+0xe5/0x1a0 [zfs]
2020-10-24T22:00:29.89109 kern.warn: [ 1021.160468]  dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
2020-10-24T22:00:29.89112 kern.warn: [ 1021.160481]  zfs_write+0x410/0xe00 [zfs]
2020-10-24T22:00:29.89116 kern.warn: [ 1021.160485]  ? enqueue_entity+0xf1/0x580
2020-10-24T22:00:29.89121 kern.warn: [ 1021.160497]  zpl_write_common_iovec+0xaf/0x100 [zfs]
2020-10-24T22:00:29.89126 kern.warn: [ 1021.160509]  zpl_iter_write+0xed/0x120 [zfs]
2020-10-24T22:00:29.89129 kern.warn: [ 1021.160511]  new_sync_write+0x11b/0x1b0
2020-10-24T22:00:29.89133 kern.warn: [ 1021.160513]  vfs_write+0x1c3/0x1f0
2020-10-24T22:00:29.89136 kern.warn: [ 1021.160515]  ksys_write+0x5f/0xe0
2020-10-24T22:00:29.89140 kern.warn: [ 1021.160516]  do_syscall_64+0x44/0x70
2020-10-24T22:00:29.89145 kern.warn: [ 1021.160518]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2020-10-24T22:00:29.89150 kern.warn: [ 1021.160519] RIP: 0033:0x7f739d4f7de3
2020-10-24T22:00:29.89155 kern.warn: [ 1021.160521] Code: 75 05 48 83 c4 58 c3 e8 9b 43 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
2020-10-24T22:00:29.89160 kern.warn: [ 1021.160521] RSP: 002b:00007ffee6a2caa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2020-10-24T22:00:29.89165 kern.warn: [ 1021.160522] RAX: ffffffffffffffda RBX: 00007ffee6a2caf0 RCX: 00007f739d4f7de3
2020-10-24T22:00:29.89177 kern.warn: [ 1021.160523] RDX: 0000000000002000 RSI: 00005623e29fd000 RDI: 0000000000000003
2020-10-24T22:00:29.89188 kern.warn: [ 1021.160523] RBP: 0000000000001d41 R08: 0000000000000000 R09: 0036343732312e00
2020-10-24T22:00:29.89202 kern.warn: [ 1021.160524] R10: 0000000000000180 R11: 0000000000000246 R12: 000000000001fd40
2020-10-24T22:00:29.89207 kern.warn: [ 1021.160524] R13: 00007ffee6a2caf0 R14: 00007ffee6a2cad0 R15: 00005623e29fd000
2020-10-24T22:00:29.89212 kern.warn: [ 1021.160525] Modules linked in: ctr ccm acpi_call(O) 8021q garp mrp stp llc uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev btusb mc btrtl btbcm btintel joydev snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus tps6598x typec roles edac_mce_amd kvm_amd iwlmvm kvm mac80211 irqbypass crct10dif_pclmul ghash_clmulni_intel snd_hda_codec_realtek aesni_intel libarc4 snd_hda_codec_generic snd_hda_codec_hdmi crypto_simd cryptd snd_hda_intel glue_helper snd_intel_dspcfg rapl iwlwifi snd_hda_codec psmouse input_leds pcspkr wmi_bmof snd_hda_core snd_hwdep ipmi_devintf cfg80211 r8169 snd_rn_pci_acp3x sp5100_tco tpm_crb snd_pcm snd_pci_acp3x ccp ipmi_msghandler realtek k10temp i2c_piix4 thinkpad_acpi i2c_multi_instantiate ledtrig_audio ac tpm_tis tpm_tis_core evdev mac_hid tpm i2c_scmi rng_core acpi_cpufreq i2c_designware_platform tiny_power_button i2c_designware_core snd_seq snd_seq_device snd_timer snd soundcore
2020-10-24T22:00:29.89238 kern.warn: [ 1021.160552]  vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost tap vhost_iotlb uhid hid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm rtsx_pci_sdmmc mmc_core drm_kms_helper syscopyarea sysfillrect zcommon(PO) sysimgblt fb_sys_fops cec znvpair(PO) rc_core xhci_pci xhci_pci_renesas spl(O) xhci_hcd ehci_pci crc32_pclmul drm crc32c_intel ehci_hcd serio_raw usbcore agpgart rtsx_pci wmi button battery video
2020-10-24T22:00:29.89243 kern.warn: [ 1021.160577] ---[ end trace 17b0063aeadb5c32 ]---
2020-10-24T22:00:29.89248 kern.warn: [ 1021.160589] RIP: 0010:dbuf_find+0x86/0x190 [zfs]
2020-10-24T22:00:29.89253 kern.warn: [ 1021.160589] Code: 7b 01 00 49 89 57 28 4a 8b 04 f0 48 85 c0 0f 84 b2 00 00 00 48 8b 0c 24 49 89 d6 eb 0d 48 8b 40 38 48 85 c0 0f 84 9c 00 00 00 <48> 39 18 75 ee 48 39 68 20 75 e8 44 38 68 50 75 e2 48 39 48 40 75
2020-10-24T22:00:29.89267 kern.warn: [ 1021.160590] RSP: 0018:ffffb3ff15a4ba28 EFLAGS: 00010206
2020-10-24T22:00:29.89271 kern.warn: [ 1021.160591] RAX: 03ff800040000000 RBX: 00000000000587ce RCX: 0000000000001fd4
2020-10-24T22:00:29.89276 kern.warn: [ 1021.160591] RDX: ffff98bc4b6dbe00 RSI: 00000000000587ce RDI: ffffffffc11be4d0
2020-10-24T22:00:29.89281 kern.warn: [ 1021.160592] RBP: ffff98bc9cfa3000 R08: 99de27c9bb992351 R09: 9ae16a3b2f90404f
2020-10-24T22:00:29.89287 kern.warn: [ 1021.160592] R10: 0000000000000000 R11: ffff98bcca928000 R12: 0000000000030fc0
2020-10-24T22:00:29.89292 kern.warn: [ 1021.160593] R13: 0000000000000000 R14: ffff98bc4b6dbe00 R15: ffffffffc11be4d0
2020-10-24T22:00:29.89297 kern.warn: [ 1021.160593] FS:  00007f739cf48740(0000) GS:ffff98bccf080000(0000) knlGS:0000000000000000
2020-10-24T22:00:29.89302 kern.warn: [ 1021.160594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-24T22:00:29.89307 kern.warn: [ 1021.160594] CR2: 00007f790e77d280 CR3: 0000000757346000 CR4: 0000000000340ee0
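
The fault above is reported for a non-canonical address, and that address matches the value in RAX (03ff800040000000). On x86-64 with 48-bit virtual addresses, an address is canonical only when bits 63 down to 47 are all equal. The following sketch checks this; the 48-bit width is an assumption (4-level paging), and the function name is illustrative.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) of a 64-bit address are all 0 or all 1."""
    top = addr >> (va_bits - 1)                   # sign bit and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1      # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


if __name__ == "__main__":
    # Values taken from the register dump in this report.
    for name, value in [
        ("RAX", 0x03FF800040000000),
        ("RDX", 0xFFFF98BC4B6DBE00),
        ("FS", 0x00007F739CF48740),
    ]:
        print(name, hex(value), "canonical" if is_canonical(value) else "non-canonical")
```

Under that assumption RAX is not a valid pointer at all, while RDX and the FS base are, which would be consistent with a corrupted pointer being dereferenced inside dbuf_find rather than an ordinary stale or NULL pointer.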
Comment 1 ramast 2022-06-10 12:38:09 UTC
I had the same problem. I found that removing the iommu_v2 module from the kernel solved it.
Comment 2 Alex Deucher 2022-06-10 16:04:31 UTC
Please attach your full dmesg output.  Does iommu=pt help?  Sounds like perhaps there are sbios bugs around the IOMMU on your platform.  Can you try a newer sbios?
Comment 3 ramast 2022-06-11 02:26:18 UTC
I think mine could be a different problem, but I'll paste the output here anyway in case it helps anyone. Mine consistently happens the second time my PC goes into sleep mode (S3).

https://pastebin.com/DZ8rZuHq
Comment 4 ramast 2022-06-11 03:09:53 UTC
After reviewing the logs further, it seems I had two problems.
One is related to having two graphics cards, and that is the one resulting in the "amdgpu_ib_ring_tests" error:

https://pastebin.com/Eu7PF3en

Then, after disabling the dedicated graphics card, I got the error from my previous comment. After removing iommu_v2, I am finally error-free.