Bug 210929
Summary: | MCE bea0000000000108 Crash on heavy/gaming workload since Kernel 5.5 | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | binarytamer (kernel) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED INVALID | ||
Severity: | normal | CC: | alexdeucher, bmisa6233, bp, christian.koenig, kernel, pmenzel+bugzilla.kernel.org |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.5 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | LSPCI for checking NV kerneldriver status; Kernel .config for checking the Userconfiguration; DMESG after crash; dmesg after crash(5.10 kernel); current lsmod(5.10 kernel) | ||
Description
binarytamer
2020-12-28 09:56:20 UTC
> GPU: NVIDIA RTX 2080 only for IOMMU
Are you using the nvidia proprietary driver? If so, try to reproduce without it.
And pls upload full dmesg.
Thx.
Created attachment 294367 [details]
LSPCI for checking NV kerneldriver status
Created attachment 294369 [details]
Kernel .config for checking the Userconfiguration
Created attachment 294371 [details]
DMESG after crash
Hi, thank you for your fast answer. I realized that I didn't describe the problem properly. The crash occurs when I play games natively on Linux with the 5700XT. My NVIDIA card is only used in a virtual machine; it has no proprietary driver, and the kernel is not configured to provide the nouveau driver either. No VM is running when a crash happens. The NVIDIA card is in power mode S3.
I attached the following:
- DMESG after the crash
- Current .config of the kernel
- Output of LSPCI to check the status of the NVIDIA card
Thanks.
A couple of observations:
[ 0.000000] Notice: NX (Execute Disable) protection missing in CPU!
Why is that? Do you have some strange setting in your BIOS which
disables NX? I'd reenable it.
[ 0.360218] efi: Error mapping PA 0xff000000 -> VA 0xff000000!
[ 0.360218] efi: Error mapping PA 0xff000000 -> VA 0xfffffffeff000000!
[ 0.360219] efi: Error mapping PA 0xfedd4000 -> VA 0xfedd4000!
[ 0.360219] efi: Error mapping PA 0xfedd4000 -> VA 0xfffffffefefd4000!
[ 0.360220] efi: Error mapping PA 0xfedc2000 -> VA 0xfedc2000!
...
this is *really* strange. It basically says that you can't map any EFI
runtime services. Should not happen either.
The MCE itself decodes to:
[ 346.622085] [Hardware Error]: System Fatal error.
[ 346.627024] [Hardware Error]: CPU:12 (17:1:2) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
[ 346.637614] [Hardware Error]: Error Addr: 0x0001ffffb50a37bc
[ 346.643497] [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x000000004d000000
[ 346.651444] [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
[ 346.659803] [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
which normally fires when some transaction times out. Which brings us
to the huuge bugzilla entry which you've already quoted, where people
complain about such things. That looks like a hardware issue and not a
kernel issue so far.
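For anyone who wants to check the flag string by hand, here is a minimal Python sketch that unpacks the status value from the log above. The bit positions follow my reading of the MCI_STATUS_* definitions in the Linux kernel's arch/x86/include/asm/mce.h, and the 6-bit extended error code field is an assumption for SMCA (Zen) systems. The names `STATUS_BITS` and `decode_status` are made up for illustration.

```python
# Bit positions of MCA_STATUS flags (assumption: taken from the
# MCI_STATUS_* macros in the Linux kernel's asm/mce.h, AMD SMCA layout).
STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error (shown as "UE" in the log)
    "En": 60,        # error reporting was enabled
    "MiscV": 59,     # MISC register is valid
    "AddrV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt
    "SyndV": 53,     # syndrome register is valid
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into named flags and error codes."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    # Assumption: extended error code is 6 bits wide on SMCA systems.
    decoded["ext_error_code"] = (status >> 16) & 0x3F
    decoded["error_code"] = status & 0xFFFF
    return decoded


decoded = decode_status(0xBEA0000000000108)
print([name for name in STATUS_BITS if decoded[name]])
print(decoded["ext_error_code"], hex(decoded["error_code"]))
```

With the value from this report, the set flags line up with the `UE|MiscV|AddrV|PCC|TCC|SyndV` string printed by the kernel, and the extended error code comes out as 0.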
> All this starts with kernel version >= 5.5 when I install kernel
> 5.4.80 (in Gentoo currently marked stable)the system is rock solid not
> a single crash.
If that is the case you could perhaps bisect the issue - see "man
git-bisect" for how to do that. And ask questions if something's not
clear still.
If 5.4.80 is good and 5.5 is bad, then that should be not too many
bisection steps. If successful, it might point us to a commit which
could be a guilty one.
HTH.
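To get a feeling for "not too many bisection steps": git bisect roughly halves the candidate range with every kernel you test, so the number of builds grows with the base-2 logarithm of the commit count. The sketch below is only that rule of thumb. The function name is made up, and the commit count of 15000 is a placeholder assumption; the real number for a range can be checked with `git rev-list --count v5.4..v5.5`. Merge-heavy history can make the actual step count differ slightly.

```python
def estimated_bisect_steps(commit_count: int) -> int:
    """Rule-of-thumb number of test builds for a range of commit_count commits.

    Each step halves the range, so this is ceil(log2(commit_count)),
    computed with integer arithmetic to avoid floating-point rounding.
    """
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    return (commit_count - 1).bit_length()


# Assumption: 15000 is a placeholder for the size of the good..bad range.
print(estimated_bisect_steps(15000))
```

Even a range of several thousand commits therefore needs only a dozen or so build-and-test rounds, which is why testing the two endpoints carefully first is worth the time.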
Hi,
I will check the NX option; I probably disabled it while checking the EFI settings for the culprit.
I read up on the git-bisect procedure and want to ask if you could check my approach, which would look like the following:
- cloning the kernel repo with:
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
- setting the good & bad tags:
  git bisect good v5.4
  git bisect bad v5.5
- building the kernel and installing it:
  make
  make modules_install
  make install
  grub-mkconfig -o /boot/grub/grub.cfg
- testing the kernel
- setting the current version to good or bad: git bisect good / bad
- repeating the steps
- when I have found a version that crashes: providing the output of git bisect log
Thank you for checking that. Unfortunately I have never done this before, so sorry for the annoyance...
(In reply to binarytamer from comment #7)
> - cloning the kernel repo with: git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
Yap, however you also need to add the stable trees too so that you can test those tags too. After the above step, you do:
$ git remote add stable git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git fetch stable
You say 5.4.80 is good so you should do
git bisect good v5.4.80
git bisect bad v5.5
But do test those two first because you need to make sure they're really good and bad respectively.
> - setting the good & bad tags: git bisect good v5.4 git bisect bad
> v5.5
>
> - building the kernel and installing it: make make modules_install
> make install grub-mkconfig -o /boot/grub/grub.cfg
>
> - testing the kernel
>
> - set the current version to good or bad: git bisect good / bad
>
> - repeat the steps
Yap, exactly. Just be careful when you do the steps because one mistake and you go off "into the weeds". Happens to me from time to time so I use a pen and paper too. :-)
> - when I found a version that crashes: provide the output of git
> bisect log
Once the bisection is done, it'll tell you "the first bad commit is... "
> Thank you for checking that unfortunately I never did this before, so
> sorry for the annoyance...
No worries, thanks for reporting and bisecting!
I have exactly the same issue, but on kernel 5.4 the issue still occurs.
Specs:
CPU: AMD Ryzen 3 3100
GPU: AMD Radeon HD 7970
RAM: 32GB 2x16GB DDR4 3200MT/s
Motherboard: ASRock B450M PRO4-F
Using linux-firmware version 20201218
Tried these kernels: 5.11rc1, 5.10.4, 5.4.80
Error:
Dec 31 17:13:19 archlinux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Dec 31 17:13:19 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc0373d30 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 31 17:13:19 archlinux kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1609409597 SOCKET 0 APIC 8 microcode 8701021
Hi,
Thank you for checking. It took a lot of time, but I am now done with bisecting the kernel versions. Here is the output:
a3511321fd004d0b2a6d81dab1837dcc6c752da4 is the first bad commit
commit a3511321fd004d0b2a6d81dab1837dcc6c752da4
Author: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu Nov 21 14:54:03 2019 +1100
    merge fix for "ftrace: Rework event_create_dir()"
    Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
I will provide additional information if needed. Hopefully this is helpful and I did the procedure correctly.
Also I will paste the bisect log:
git bisect start
# good: [9f4b26f3ea18cb2066c9e58a84ff202c71739a41] Linux 5.4.80
git bisect good 9f4b26f3ea18cb2066c9e58a84ff202c71739a41
# bad: [d5226fa6dbae0569ee43ecfc08bdcd6770fc4755] Linux 5.5
git bisect bad d5226fa6dbae0569ee43ecfc08bdcd6770fc4755
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# good: [8c39f71ee2019e77ee14f88b1321b2348db51820] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 8c39f71ee2019e77ee14f88b1321b2348db51820
# bad: [76bb8b05960c3d1668e6bee7624ed886cbd135ba] Merge tag 'kbuild-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
git bisect bad 76bb8b05960c3d1668e6bee7624ed886cbd135ba
# bad: [21b26d2679584c6a60e861aa3e5ca09a6bab0633] Merge tag '5.5-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
git bisect bad 21b26d2679584c6a60e861aa3e5ca09a6bab0633
# good: [3275a71e76fac5bc276f0d60e027b18c2e8d7a5b] Merge tag 'drm-next-5.5-2019-10-09' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect good 3275a71e76fac5bc276f0d60e027b18c2e8d7a5b
# good: [2ef4144d1ea8b181d377d0783c43032cb44889f7] Merge tag 'drm-intel-next-2019-11-01-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect good 2ef4144d1ea8b181d377d0783c43032cb44889f7
# bad: [0a6cad5df541108cfd3fbd79eef48eb824c89bdc] Merge branch 'vmwgfx-coherent' of git://people.freedesktop.org/~thomash/linux into drm-next
git bisect bad 0a6cad5df541108cfd3fbd79eef48eb824c89bdc
# good: [ad4d81dc57e2dff7cf3b55f63356f0d0017050a1] drm/amdgpu/renoir: move gfxoff handling into gfx9 module
git bisect good ad4d81dc57e2dff7cf3b55f63356f0d0017050a1
# good: [78e2ea291ead1e395864ff1583064e07b1adeb62] drm/i915/display: Fix TRANS_DDI_MST_TRANSPORT_SELECT definition
git bisect good 78e2ea291ead1e395864ff1583064e07b1adeb62
# good: [c0e21ea1d0b557bdedd5b54d529162f74e7ef407] drm/amdgpu: put flush_delayed_work at first
git bisect good c0e21ea1d0b557bdedd5b54d529162f74e7ef407
# bad: [1b34de7c3fef0c7ebb3d05acc1756bfb585279ca] drm/amd/amdgpu/sriov skip RLCG s/r list for arcturus VF.
git bisect bad 1b34de7c3fef0c7ebb3d05acc1756bfb585279ca
# good: [8fc41344138831071c5d5f51635c7eb33459e249] drm/amdgpu: disable gfxoff on original raven
git bisect good 8fc41344138831071c5d5f51635c7eb33459e249
# good: [57fb0ab2f1398d81b42a8143a40e5d209a290a48] drm/amdgpu: Update Arcturus golden registers
git bisect good 57fb0ab2f1398d81b42a8143a40e5d209a290a48
# bad: [210b3b3c7563df391bd81d49c51af303b928de4a] drm/amdgpu/gfx10: re-init clear state buffer after gpu reset
git bisect bad 210b3b3c7563df391bd81d49c51af303b928de4a
# bad: [a3511321fd004d0b2a6d81dab1837dcc6c752da4] merge fix for "ftrace: Rework event_create_dir()"
git bisect bad a3511321fd004d0b2a6d81dab1837dcc6c752da4
# first bad commit: [a3511321fd004d0b2a6d81dab1837dcc6c752da4] merge fix for "ftrace: Rework event_create_dir()"
Thanks for the support, and if there is anything I should test, just describe it and I will try.
Created attachment 294503 [details]
dmesg after crash(5.10 kernel)
my dmesg log
Created attachment 294505 [details]
current lsmod(5.10 kernel)
(In reply to binarytamer from comment #10)
> Hi,
>
> Thank you for checking. It took a lot of time, but I am now done with
> bisecting the kernel versions. Here is the output:
>
> a3511321fd004d0b2a6d81dab1837dcc6c752da4 is the first bad commit
> commit a3511321fd004d0b2a6d81dab1837dcc6c752da4
> Author: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu Nov 21 14:54:03 2019 +1100
>
> merge fix for "ftrace: Rework event_create_dir()"
Yeah, I warned you that bisection might veer off into the weeds. So this is only a build fix for:
from drivers/gpu/drm/amd/amdgpu/amdgpu_trace_points.c:29:
./include/trace/../../drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h:520:52: error: expected expression before ‘;’ token
520 | __string(ring, sched_job->base.sched->name);
which means that it is highly unlikely that this patch is really causing the MCE. And you can't revert it on top of 5.5 to check because it really is only a build fix.
Which means that you could try the bisection again. Yap, that takes a lot of time but if you do it and encounter yet another innocent commit as the first bad one, then it very likely could be that this really is a hardware issue. Just like "danknil" says in comment #9 that he/she can trigger even on 5.4.
The fact that you run a game and some transaction times out could be something GPU-related like the GPU sucking too much power or so and it resulting in a transaction timeout. Without proper equipment that is very hard to debug, unfortunately. But this is all pure speculation.
>Which means that you could try the bisection again. Yap, that takes a
>lot of time but if you do it and encounter yet another innocent commit
>as the first bad one, then it very likely could be that this really is a
>hardware issue. Just like "danknil" says in comment #9 that he/she can
>trigger even on 5.4.
I also tested some games on Windows 10 for about 4-5 hours without any issues, so I don't think it is a hardware issue.
(In reply to danknil from comment #14)
> I'm also test out some games on Windows 10 for about 4-5 hours without any
> issues, so i don't think it hardware one.
This happens only when you play games, right? I.e., when the GPU is being stressed.
And yes people have reported that they can't trigger on windoze but that doesn't mean a whole lot: it could be the windoze GPU driver doing something else, power management too or even windoze not reporting the MCEs (I doubt it but still).
In any case, see https://bugzilla.kernel.org/show_bug.cgi?id=206903. There are some ideas what to try there, you could try them.
My best idea would be to disable all power management features and see if the problem disappears.
Try amdgpu.cg_mask=0 and amdgpu.pg_mask=0 on the kernel command line.
Or amdgpu.dpm=0 to disable dynamic clock switching. If that helps, you can also use the ppfeaturemask option to further narrow down which power feature (if any) causes the issue (e.g., drop the dpm=0 and add ppfeaturemask=0x...). The mask bits are defined here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n196
>This happens only when you play games, right?
Yes.
>see https://bugzilla.kernel.org/show_bug.cgi?id=206903. There are some ideas
>what to try there, you could try them.
I already checked it and tried all the methods described (except the possible fix in the attachment).
(In reply to Alex Deucher from comment #17)
>Or amdgpu.dpm=0 to disable dynamic clock switching. If that helps, you can
>also use the ppfeaturemask option to further narrow down which power feature
>(if any) causes the issue (e.g., drop the dpm=0 and add ppfeaturemask=0x...).
>The mask bits are defined here:
I tried disabling the features separately through ppfeaturemask. I will try both the next day.
(In reply to Christian König from comment #16)
> My best idea would be to disable all power management features and see if
> the problem disappears.
>
> Try amdgpu.cg_mask=0 and amdgpu.pg_mask=0 on the kernel command line.
Okay, I'll test.
(In reply to Christian König from comment #16)
> My best idea would be to disable all power management features and see if
> the problem disappears.
>
> Try amdgpu.cg_mask=0 and amdgpu.pg_mask=0 on the kernel command line.
nothing changed :(
Hi,
I just want to update the current status of the issue on my system. It took a very long time to be sure because the crash occurs only sporadically. I tried to bisect the kernel, but I still cannot pin down a specific commit. That said, I directed my efforts to finding out which component, or which combination of components, may cause the problem. After building a test system I am sure that this MCE occurs on different AMD systems with my 5700XT GPU. On an Intel Skylake system I could not reproduce the problem. CPU, RAM, PSU and mainboard were all changed during the tests.
Then I read about a bad ASUS firmware on my 5700XT. I had already tried to update the 5700XT in the past, but the ASUS update tool always reported that no update was needed. So this time I flashed a ROM file from the ASUS update with the AMD flash tool. And the MCE is gone! No reboots on newer kernels >5.4.
@binarytamer: Thank you for reporting back with the fix for your problem. For the record, what Asus firmware was on the 5700XT, and what version did you flash?
@danknil: As this issue is very complicated, and you were able to reproduce it with Linux 5.4.80, I’d say it’s a different issue, and recommend opening a separate issue.
(In reply to Paul Menzel from comment #25)
> @binarytamer: Thank you for reporting back with the fix for your problem.
> For the record, what Asus firmware was on the 5700XT, and what version did
> you flash?
In bug 206903, Alex Deucher suggested the command below:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
Hello Paul,
Sorry for the extremely late answer. Here is the output of cat /sys/kernel/debug/dri/0/amdgpu_firmware_info:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 32, firmware version: 0x00000061
PFP feature version: 32, firmware version: 0x00000093
CE feature version: 32, firmware version: 0x00000025
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 32, firmware version: 0x0000008d
MEC2 feature version: 32, firmware version: 0x0000008d
SOS feature version: 0, firmware version: 0x00100450
ASD feature version: 0, firmware version: 0x2100004a
TA RAS feature version: 0x00000000, firmware version: 0x2100002a
TA XGMI feature version: 0x00000000, firmware version: 0x2100002a
TA HDCP feature version: 0x17000010, firmware version: 0x2100002a
TA DTM feature version: 0x12000003, firmware version: 0x2100002a
SMC feature version: 0, firmware version: 0x002a3f00
SDMA0 feature version: 50, firmware version: 0x00000023
SDMA1 feature version: 50, firmware version: 0x00000023
VCN feature version: 0, firmware version: 0x0510a00d
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: 115-D199PI0-101
Hey everybody,
I'd like to update my GPU's firmware as well, because I experience the same issues using an RX 5700 and a Ryzen 5 3600. Here's my output, but unfortunately the VBIOS version is not displayed for some reason.
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 32, firmware version: 0x00000061
PFP feature version: 32, firmware version: 0x00000093
CE feature version: 32, firmware version: 0x00000025
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 32, firmware version: 0x0000008d
MEC2 feature version: 32, firmware version: 0x0000008d
SOS feature version: 0, firmware version: 0x00100450
ASD feature version: 0, firmware version: 0x2100004a
TA RAS feature version: 0x00000000, firmware version: 0x2100002a
TA XGMI feature version: 0x00000000, firmware version: 0x2100002a
TA HDCP feature version: 0x17000010, firmware version: 0x2100002a
TA DTM feature version: 0x12000003, firmware version: 0x2100002a
SMC feature version: 0, firmware version: 0x002a3f00
SDMA0 feature version: 50, firmware version: 0x00000023
SDMA1 feature version: 50, firmware version: 0x00000023
VCN feature version: 0, firmware version: 0x0510a00d
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version:
But amdvbflash -ai outputs this:
Adapter 0 SEG=0000, BN=28, DN=00, PCIID=731F1002, SSID=381C1462)
Asic Family : Navi10
Flash Type : W25Q80 (1024 KB)
Product Name : 113-MSITV381MH.281
Bios Config File : 281.bin
Bios P/N : P/N Not Available
Bios Version : 017.001.000.049.000000
Bios Date : 11/13/19 04:23
ROM Image Type : Hybrid Images
ROM Image Details :
Image[0]: Size(59392 Bytes), Type(Legacy Image)
Image[1]: Size(44032 Bytes), Type(EFI Image)
As you have a different card – sorry, my oversight in bug 206903 – please create a separate issue.
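When comparing dumps like the two above by eye, it is easy to miss a single differing field. The following is a minimal Python sketch that parses text in the format shown in this thread and lists the entries that differ. The regular expression is an assumption derived only from the lines pasted here, not from the amdgpu source, and the function names are hypothetical.

```python
import re

# Assumption: each firmware line looks like the ones pasted in this thread,
# e.g. "SMC feature version: 0, firmware version: 0x002a3f00".
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\S+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text: str) -> dict:
    """Map each firmware name to (feature_version, firmware_version)."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = LINE_RE.match(line)
        if match:
            # base 0 accepts both "32" and "0x17000010"
            feature = int(match.group("feature"), 0)
            firmware = int(match.group("fw"), 16)
            info[match.group("name")] = (feature, firmware)
        elif line.startswith("VBIOS version:"):
            # An empty VBIOS field is stored as None.
            info["VBIOS"] = line.split(":", 1)[1].strip() or None
    return info


def differing_entries(first: dict, second: dict) -> dict:
    """Return only the entries whose values differ between two dumps."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n)) for n in names if first.get(n) != second.get(n)}


# Two short excerpts copied from the dumps in this thread.
dump_a = """SMC feature version: 0, firmware version: 0x002a3f00
TA HDCP feature version: 0x17000010, firmware version: 0x2100002a
VBIOS version: 115-D199PI0-101"""
dump_b = """SMC feature version: 0, firmware version: 0x002a3f00
TA HDCP feature version: 0x17000010, firmware version: 0x2100002a
VBIOS version:"""

print(differing_entries(parse_firmware_info(dump_a), parse_firmware_info(dump_b)))
```

For these excerpts the only difference reported is the VBIOS field, which is empty in the second dump.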