Bug 216645 - Fence fallback timer expired on ring gfx
Summary: Fence fallback timer expired on ring gfx
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-31 13:22 UTC by Martin Šušla
Modified: 2022-11-01 08:41 UTC
CC List: 1 user

See Also:
Kernel Version: 5.15.0-43-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel log created by the script in the menuentry (249.50 KB, text/plain)
2022-10-31 13:22 UTC, Martin Šušla
Kernel log interlaced with contents of /proc/interrupts polled every second (2.01 MB, text/plain)
2022-10-31 17:56 UTC, Martin Šušla

Description Martin Šušla 2022-10-31 13:22:31 UTC
Created attachment 303109
Kernel log created by the script in the menuentry

Sometimes when I run KDE System Monitor or Chrome, my laptop freezes and stays frozen until I reboot (after a while I can move the mouse cursor, but that's all I can do).
I'm using a Dell G5 SE 5505 with an AMD Ryzen 7 4800H CPU, a Radeon RX Vega 7 iGPU and an AMD Radeon RX 5600M dGPU.

I've searched through existing bugs and found that it might be related to interrupts. With that in mind, I've compiled a list of kernel parameters that might be related and tested all of them:

PW = Probably Working, NW = Not Working, NB = Not Booting
PW	pcie_port_pm=off
PW	amdgpu.msi=0
NW	amd_iommu=fullflush
NW	amd_iommu=force_isolation
NW	amd_iommu=off
NW	amd_iommu_intr=legacy
NW	amd_iommu_intr=vapic kvm-amd.avic=1
NW	iommu=off
NW	iommu=force
NW	iommu=noforce
NW	iommu=biomerge
NW	iommu=merge
NW	iommu=nomerge
NW	iommu=forcesac
NW	iommu=soft
NW	iommu=pt
NW	irqfixup
NW	irqpoll
NW	nointremap
NW	pcie_port_pm=force
NW	amdgpu.pcie_gen2=1
NW	amdgpu.pcie_gen2=0
NW	amdgpu.msi=1
NW	amdgpu.lockup_timeout=1000
NW	amdgpu.lockup_timeout=100
NW	amdgpu.aspm=1
NW	amdgpu.aspm=0
NW	amdgpu.bapm=1
NW	amdgpu.bapm=0
NW	amdgpu.ppfeaturemask=0xfff7bff7
NW	amdgpu.ppfeaturemask=0xfff7bdff
NW	amdgpu.ppfeaturemask=0xfff7bbff
NW	amdgpu.ppfeaturemask=0xfff73fff
NW	amdgpu.ppfeaturemask=0xfff3bfff
NW	amdgpu.exp_hw_support=1
NW	amdgpu.exp_hw_support=0
NW	amdgpu.forcelongtraining=0
NW	amdgpu.forcelongtraining=1
NW	amdgpu.cg_mask=0x00000000
NW	amdgpu.cg_mask=0xffffffff
NW	amdgpu.pg_mask=0xffffffff
NW	amdgpu.ngg=1
NW	amdgpu.ngg=0
NW	amdgpu.job_hang_limit=1000
NW	amdgpu.job_hang_limit=100
NW	amdgpu.lbpw=1
NW	amdgpu.lbpw=0
NW	amdgpu.gpu_recovery=1
NW	amdgpu.gpu_recovery=0
NW	amdgpu.sched_policy=2
NW	amdgpu.sched_policy=1
NW	amdgpu.sched_policy=0
NW	amdgpu.ignore_crat=0
NW	amdgpu.ignore_crat=1
NW	amdgpu.ras_enable=0
NW	amdgpu.ras_enable=1
NW	amdgpu.async_gfx_ring=0
NW	amdgpu.async_gfx_ring=1
NW	amdgpu.mcbp=1
NW	amdgpu.mcbp=0
NW	amdgpu.mes=0
NW	amdgpu.mes_kiq=1
NW	amdgpu.mes_kiq=0
NW	amdgpu.reset_method=0
NW	amdgpu.reset_method=1
NW	amdgpu.reset_method=2
NW	amdgpu.reset_method=3
NW	amdgpu.reset_method=4
NW	amdgpu.reset_method=-1
NW	idle=nomwait
NB	amdgpu.pg_mask=0x00000000
NB	amdgpu.mes=1
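
To confirm that a parameter from the list above actually reached the kernel, the booted command line can be inspected. A minimal sketch, assuming a POSIX shell; `has_param` is a hypothetical helper name, not part of any existing tool:

```shell
#!/bin/sh
# has_param PARAM [FILE]: succeed if PARAM appears as a whole word in FILE
# (default: /proc/cmdline). has_param is a hypothetical helper name.
has_param() {
	tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example: report whether amdgpu.msi=0 is on the running kernel's command line.
if [ -r /proc/cmdline ]; then
	if has_param amdgpu.msi=0; then
		echo "amdgpu.msi=0 is set"
	else
		echo "amdgpu.msi=0 is not set"
	fi
fi
```

The match is on whole words, so `amdgpu.msi=0` is not confused with `amdgpu.msi=1` or with a longer parameter that merely contains the same text.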



I've developed a script and a GRUB2 menu entry for live Kubuntu that triggers the freeze and saves the dmesg into a file called Freeze_Dell_G5_SE_5505.sh.log at the root of the drive it's being booted from.
Replace the value of the ISO variable with the path to your ISO file if it's not in the root directory of the drive and/or if it's a different version:

menuentry "Start Kubuntu 22.04.1 (64 bit) without Ubiquity and with a freezing script" {
	ISO=/kubuntu-22.04.1-desktop-amd64.iso
	set gfxpayload=keep
	loopback loop "$ISO"
	probe -u $root --set=rootid
	linux	(loop)/casper/vmlinuz   iso-scan/filename="$ISO" file=/cdrom/preseed/kubuntu.seed maybe-ubiquity quiet splash init=/bin/sh -- -c 'for script in /home/kubuntu/Desktop/Freeze_Dell_G5_SE_5505.sh ; do for autorun in /home/kubuntu/.config/autostart/${script##*/} ; do ln -fs /dev/null /etc/systemd/system/graphical.target.wants/ubiquity.service ; mkdir -p ${script%/*} ${autorun%/*} ; printf \043!_/bin/sh++print\050\051_{+\tprintf_"@1"_,_seq_-s"_"_@\050\050_@\050stty_size_\074_@t_?_sed_"s/^/\050/,_s/_/_-_1_\051_*_/"\051_-_@{\0431}_\051\051_?_sed_s/[0-9]//g+}+t\075"@\050readlink_/proc/self/fd/0\051"++d\075"@\050env_LANG\075C_udisksctl_mount_-b_/dev/disk/by-uuid/$0_-o_sync_2\076_/dev/null_?_sed_"s/^Mounted_.*_at_//g,_s/\\.@//g"\051"+[_-d_"@d"_]_\046\046_f\075oflag\075direct_??_d\075"@{0%%/*}"+sudo_dmesg_-w_?_sudo_dd_of\075"@d/@{0\043\043*/}.log"_@f_\046+i\0750+seq_28_150000_?_while_read_N_,_do+\tprint_@N+\ttimeout_3_env_DISPLAY\075:0_plasma-systemmonitor_\076_/dev/null_2\076\0461+\tn\075@N_,_while_[_0_-lt_@n_]_,_do+\t\tsleep_1+\t\tn\075@\050\050_@n_-_1_\051\051+\t\ti\075@\050\050_@i_^_1_\051\051+\t\t[_"@i"_\075_1_]_\046\046_printf_"\\33[30m\\33[47m"_??_printf_"\\33[37m\\33[40m"+\t\tprint_@n+\tdone+done++echo_END!+exit+ | tr _,?@+ \40\73\174\044\n > $script ; printf [Desktop_Entry]\nType=Application\nExec=kstart_--maximize_--_konsole_-e_  | tr _ \40 > ${autorun%.sh}.desktop ; printf $script\n >> ${autorun%.sh}.desktop ; chmod +x $script ${autorun%.sh}.desktop ; chown -R kubuntu:kubuntu /home/kubuntu ; exec /sbin/init maybe-ubiquity splash --- ; done ; done' $rootid
	initrd	(loop)/casper/initrd
}



The script generated on the live Kubuntu desktop runs KDE's System Monitor for three seconds and then waits before running it again. With each iteration, it waits one second longer than before. A parameter passed the test if the system did not freeze before the script reached a wait of 50 seconds, for five boots in a row (I'd now recommend 60, as with 50 it sometimes froze after the second boot).
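
For a rough idea of how long one such test run takes, the waits can be summed. This is only an estimate under stated assumptions: the first wait is 28 seconds (the start of the `seq 28 150000` loop in the script), each iteration costs 3 seconds of System Monitor plus the wait itself, and all other overhead is ignored. `run_time` is a hypothetical helper name:

```shell
#!/bin/sh
# run_time FIRST LAST: seconds spent on iterations with waits FIRST..LAST-1,
# i.e. the time until the script starts a wait of LAST seconds.
# Assumes 3 s of System Monitor plus N s of waiting per iteration.
run_time() {
	total=0
	n=$1
	while [ "$n" -lt "$2" ]; do
		total=$(( total + 3 + n ))
		n=$(( n + 1 ))
	done
	echo "$total"
}

run_time 28 50   # prints 913  (about 15 minutes)
run_time 28 60   # prints 1488 (about 25 minutes)
```

Under these assumptions, five boots in a row add up to roughly 76 minutes for the 50-second threshold and roughly 124 minutes for the 60-second one.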

Would someone also tell us which workaround should be used under which performance/latency requirements? ("Maybe wrong but still an" EXAMPLE: Users who need the best performance or lowest latency should use pcie_port_pm=off, users who need the best battery life should use amdgpu.msi=0.)

If you fix the issue, could you please tell the users (not just developers) what the problem was? ("Maybe wrong but still an" EXAMPLE: The driver was waiting for an interrupt, but the bus was down, so the message-signalled interrupt could not arrive and the operation timed out.)

Thanks.
Comment 1 Alex Deucher 2022-10-31 15:40:25 UTC
(In reply to Martin Šušla from comment #0)
> 
> Would someone also tell us which workaround should be used under which
> performance/latency requirements? ("Maybe wrong but still an" EXAMPLE: Users
> who need the best performance or lowest latency should use pcie_port_pm=off,
> users who need the best battery life should use amdgpu.msi=0.)
> 

You should not need to override any of the defaults other than for debugging.
Comment 2 Alex Deucher 2022-10-31 15:42:23 UTC
Are you getting interrupts on the GPU?  Check /proc/interrupts to see if you are getting interrupts for the GPU.
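
One way to run that check, as a minimal sketch: sum the per-CPU counters of every line in /proc/interrupts that mentions amdgpu, then compare two readings taken a few seconds apart. `gpu_irqs` is a hypothetical helper name, and the sketch assumes the driver's interrupts are labelled `amdgpu` in that file:

```shell
#!/bin/sh
# gpu_irqs [FILE]: for each amdgpu line in FILE (default: /proc/interrupts),
# print the IRQ label and the sum of its per-CPU counters.
# gpu_irqs is a hypothetical helper name.
gpu_irqs() {
	awk '/amdgpu/ {
		sum = 0
		# Fields from 2 onward are per-CPU counters, up to the first non-numeric field.
		for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
		print $1, sum
	}' "${1:-/proc/interrupts}"
}

# Example: two readings five seconds apart on the running system.
if [ -r /proc/interrupts ]; then
	gpu_irqs
	sleep 5
	gpu_irqs
fi
```

If the totals for a line do not grow between the two readings while the GPU is busy, no interrupts are arriving for that device.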
Comment 3 Martin Šušla 2022-10-31 17:53:04 UTC
After the message mentioned in the title appears, not even a single interrupt is registered.
Comment 4 Martin Šušla 2022-10-31 17:56:09 UTC
Created attachment 303110
Kernel log interlaced with contents of /proc/interrupts polled every second

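#! /bin/sh

# Print $1, then pad with spaces up to about one line short of a full screen,
# so the current background color fills the terminal.
print() {
	printf "$1" ; seq -s" " $(( $(stty size < $t | sed "s/^/(/; s/ / - 1 ) * /") - ${#1} )) | sed s/[0-9]//g
}
# Terminal device behind stdin; print() reads its size from here.
t="$(readlink /proc/self/fd/0)"

# Mount the partition whose UUID is passed as $1 and take its mount point;
# if that fails, write the log next to the script instead.
d="$(env LANG=C udisksctl mount -b /dev/disk/by-uuid/$1 -o sync 2> /dev/null | sed "s/^Mounted .* at //g; s/\.$//g")"
[ -d "$d" ] && f=oflag=direct || d="${0%/*}" f=oflag=direct

# Stream the kernel log and a once-per-second dump of /proc/interrupts into the log file.
(sudo dmesg -w & while sleep 1 ; do cat /proc/interrupts ; done) | sudo dd of="$d/${0##*/}.log" $f &
i=0
# N is the wait in seconds; it grows by one with every iteration.
seq 28 150000 | while read N ; do
	print $N
	# Run System Monitor for at most three seconds.
	timeout 3 env DISPLAY=:0 plasma-systemmonitor > /dev/null 2>&1
	# Count down N seconds, inverting the screen colors every second.
	n=$N ; while [ 0 -lt $n ] ; do
		sleep 1
		n=$(( $n - 1 ))
		i=$(( $i ^ 1 ))
		[ "$i" = 1 ] && printf "\33[30m\33[47m" || printf "\33[37m\33[40m"
		print $n
	done
done

echo END!
exit

# This script was used to generate the attachment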
Comment 5 Martin Šušla 2022-10-31 18:02:12 UTC
(In reply to Martin Šušla from comment #3)
> After the message mentioned in the title appears, not even a single interrupt
> is registered.

(Valid for both interrupts of the amdgpu driver.)
Comment 6 Alex Deucher 2022-10-31 21:53:03 UTC
(In reply to Martin Šušla from comment #5)
> (In reply to Martin Šušla from comment #3)
> > After the message mentioned in the title appears, not even a single
> > interrupt is registered.
> 
> (Valid for both interrupts of the amdgpu driver.)

There are two GPUs in the system.  You appear to be getting at least some interrupts.

Are you using the dGPU at all or just the APU?  You might try a newer system BIOS if there is one available for your system.
Comment 7 Martin Šušla 2022-11-01 08:41:48 UTC
Sure, the GPU (at 0000:03:00.0) being initialized before the message mentioned in the title appears is the dGPU. Line 1078 in the "Kernel log created by the script in the menuentry" confirms this: it reports 6128M of VRAM, whereas the APU (at 0000:07:00.0, line 1184) uses just 512M.
