Bug 217994 - Kernel 6.5 Won't Boot With Some IEEE1394 Devices
Summary: Kernel 6.5 Won't Boot With Some IEEE1394 Devices
Status: RESOLVED DUPLICATE of bug 217993
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394
Hardware: All Linux
Importance: P1 blocking
Assignee: drivers_ieee1394
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-10 21:41 UTC by Ian Donnelly
Modified: 2024-01-16 01:45 UTC
CC List: 4 users

See Also:
Kernel Version: 6.5
Subsystem:
Regression: Yes
Bisected commit-id: dcadfd7f7c74ef9ee415e072a19bdf6c085159eb


Description Ian Donnelly 2023-10-10 21:41:04 UTC
There seems to be a bug that prevents machines from booting when certain IEEE 1394 (FireWire) devices are installed. This has been reported in multiple places but doesn't seem to have a report here.

Examples: 
https://bugzilla.suse.com/show_bug.cgi?id=1215436
https://bugzilla.redhat.com/show_bug.cgi?id=2240973

I experienced the same issue with this card: https://www.amazon.com/gp/product/B07QPDN3XK/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

It appears to use a VIA VT6307 chip, and it caused MCE error reports on my computer when I tried to boot kernel 6.5 on Fedora 38. I could boot normally with 6.4 and with Windows.

Here are those MCE errors:
```
[    0.860834] mce: [Hardware Error]: Machine check events logged
[    0.860834] microcode: CPU20: patch_level=0x0a201025
[    0.860835] microcode: CPU21: patch_level=0x0a201025
[    0.860836] microcode: CPU23: patch_level=0x0a201025
[    0.860836] microcode: CPU22: patch_level=0x0a201025
[    0.860837] mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 0: bc00080001010135
[    0.860845] fbcon: Taking over console
[    0.860847] mce: [Hardware Error]: TSC 0 ADDR fca000f0 MISC d012000000000000 IPID 1000b000000000 
[    0.860854] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1696955537 SOCKET 0 APIC b microcode a201025
[    0.860860] microcode: CPU0: patch_level=0x0a201025
[    0.861676] microcode: Microcode Update Driver: v2.2.
```
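The top bits of the status word in the `Machine Check: 0 Bank 0: bc00080001010135` line carry the architectural x86 machine-check flags. As a rough sketch, the top byte can be decoded as below. The helper name `decode_mci_status` is made up for illustration, and the bit positions are taken from the `MCI_STATUS_*` definitions in the kernel's `arch/x86/include/asm/mce.h`. It assumes the status is given as 16 hex digits.

```shell
# Decode bits 63..57 of an MCi_STATUS value given as 16 hex digits.
# Only the top byte is inspected, which avoids 64-bit shell arithmetic.
decode_mci_status() {
    status="$1"
    [ "${#status}" -eq 16 ] || { echo "expected 16 hex digits" >&2; return 1; }
    top=$(printf '%s' "$status" | cut -c1-2)   # hex digits for bits 63..56
    byte=$((0x$top))
    flags=""
    [ $((byte & 0x80)) -ne 0 ] && flags="$flags VAL"     # bit 63: entry is valid
    [ $((byte & 0x40)) -ne 0 ] && flags="$flags OVER"    # bit 62: an earlier error was overwritten
    [ $((byte & 0x20)) -ne 0 ] && flags="$flags UC"      # bit 61: uncorrected error
    [ $((byte & 0x10)) -ne 0 ] && flags="$flags EN"      # bit 60: error reporting enabled
    [ $((byte & 0x08)) -ne 0 ] && flags="$flags MISCV"   # bit 59: MISC register valid
    [ $((byte & 0x04)) -ne 0 ] && flags="$flags ADDRV"   # bit 58: ADDR register valid
    [ $((byte & 0x02)) -ne 0 ] && flags="$flags PCC"     # bit 57: processor context corrupt
    echo "${flags# }"
}

decode_mci_status bc00080001010135
```

For the value in the log this prints `VAL UC EN MISCV ADDRV`: a valid, uncorrected error with a recorded address, which matches the `ADDR fca000f0` line. It does not by itself identify the root cause.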
Comment 1 Artem S. Tashkinov 2023-10-11 14:36:16 UTC
Could you please bisect?

https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 2 Artem S. Tashkinov 2023-10-11 14:40:26 UTC
CC'ing Takashi Sakamoto, whose change has probably introduced the regression.
Comment 3 Ian Donnelly 2023-10-11 19:28:00 UTC
I'll try to get that done tonight or tomorrow.

Just `git bisect bad dcadfd7f7c74ef9ee415e072a19bdf6c085159eb` and then build, correct?
Comment 4 Ian Donnelly 2023-10-11 23:01:27 UTC
(In reply to Artem S. Tashkinov from comment #1)
> Could you please bisect?
> 
> https://docs.kernel.org/admin-guide/bug-bisect.html

Sorry, I'm not that familiar with git bisect. I tried:

```
git bisect start
git bisect bad dcadfd7f7c74ef9ee415e072a19bdf6c085159eb
git bisect good ddfcf8fb914438b422892f56717508867dfbd6af #This is the tag for kernel 6.4.15 which is the version I was previously using with no issue
```

and got `[44c026a73be8038f03dbdeef028b642880cf1511] Linux 6.4-rc3`

which doesn't seem particularly helpful. Let me know if there are other troubleshooting steps I can help with. It's my first kernel bug report, so it will be a bit of a learning experience.
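A likely reason for landing on `Linux 6.4-rc3` (an assumption, not confirmed in this thread) is that the 6.4.15 commit lives on the stable branch and is not an ancestor of dcadfd7f7c74, so git first asks for their merge base to be tested. A minimal sketch of the more usual setup follows, assuming a clone of the mainline tree that contains the `v6.4` and `v6.5` release tags; `bisect_mainline` is an illustrative helper name, not an existing tool.

```shell
# Start a bisection between two mainline release tags.
# Assumes the current directory is a clone of the mainline kernel tree.
bisect_mainline() {
    good_tag="${1:-v6.4}"   # last release known to boot
    bad_tag="${2:-v6.5}"    # first release known to fail
    # Refuse to start if either tag is missing from this repository.
    for tag in "$good_tag" "$bad_tag"; do
        if ! git rev-parse --verify --quiet "refs/tags/$tag" >/dev/null 2>&1; then
            echo "missing tag: $tag" >&2
            return 1
        fi
    done
    # Syntax: git bisect start <bad> <good>
    git bisect start "$bad_tag" "$good_tag"
}

# Example (inside the kernel tree): bisect_mainline v6.4 v6.5
```

After each checkout, build and boot the kernel, then report the result with `git bisect good` or `git bisect bad` until git names the first bad commit.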
Comment 5 Takashi Sakamoto 2023-10-12 00:23:18 UTC
Hi Ian and Artem,

I'm a maintainer of the Linux FireWire subsystem, and I'm sorry for the regression.

At present, the issue appears only when the extension card uses an ASMedia ASM1083 or ASM1085 PCIe-to-PCI bus bridge, so it is not widespread; cards built on other designs (e.g. VIA VT1615, TI chipsets, Agere FW643, and so on) are not affected.

According to the bisection by SUSE staff (Stuart Rogers and Jiri Slaby), commit dcadfd7f7c74 ("firewire: core: use union for callback of transaction completion") introduces the issue. The change executes `readl` on a register (`CYCLE_TIME`) defined in 1394 OHCI. Such register access is quite common and does nothing unusual. So the more familiar people are with low-level software, the more puzzled they are: "why does such a simple register access cause system reboots?"

In my experiments, the host system reboots even if the 1394 OHCI hardware in question is passed through to a virtual machine with PCI passthrough (vfio-pci).

Furthermore, without the PCIe-to-PCI bridge chip, the system reboot is not triggered. I purchased 1394 OHCI hardware (VIA VT6306 and VT6307) to investigate the issue, and found that it works well with the latest Linux FireWire subsystem.

In my opinion, my experiments suggest that the reboot is caused quite low in the software stack (e.g. the Linux PCI subsystem) or by the hardware itself, due to some quirk of the bridge chip. There are in fact known issues related to the bridge chip, e.g. with interrupts, DMA, and so on. I have not figured it out yet.

In my current understanding, there is a longstanding latent problem with using the bridge chip under Linux. Unfortunately, the change to the Linux FireWire subsystem exposes it.
Comment 6 Tony Vroon 2023-10-20 20:14:27 UTC
I have two of these ASMedia bridges in my system, one for FireWire:
25:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 94, IOMMU group 25
	Bus: primary=25, secondary=26, subordinate=26, sec-latency=32
	I/O behind bridge: e000-efff [size=4K] [16-bit]
	Memory behind bridge: fc900000-fc9fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Power Management version 3
	Capabilities: [80] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
	Capabilities: [c0] Subsystem: Device [0000:0000]
	Capabilities: [100] Virtual Channel

26:00.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev 80) (prog-if 10 [OHCI])
	Subsystem: VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044]
	Flags: bus master, stepping, medium devsel, latency 32, IRQ 94, IOMMU group 25
	Memory at fc900000 (32-bit, non-prefetchable) [size=2K]
	I/O ports at e000 [size=128]
	Capabilities: [50] Power Management version 2
	Kernel driver in use: firewire_ohci
	Kernel modules: firewire_ohci

And one for the sound card, which is:
27:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 03) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 24, IOMMU group 26
	Bus: primary=27, secondary=28, subordinate=28, sec-latency=32
	I/O behind bridge: d000-dfff [size=4K] [16-bit]
	Memory behind bridge: [disabled] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Power Management version 3
	Capabilities: [80] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
	Capabilities: [c0] Subsystem: Device [0000:0000]
	Capabilities: [100] Virtual Channel

28:04.0 Multimedia audio controller [0401]: C-Media Electronics Inc CMI8788 [Oxygen HD Audio] [13f6:8788]
	Subsystem: ASUSTeK Computer Inc. Virtuoso 100 (Xonar Essence STX) [1043:835c]
	Flags: bus master, medium devsel, latency 32, IRQ 24, IOMMU group 26
	I/O ports at d000 [size=256]
	Capabilities: [c0] Power Management version 2
	Kernel driver in use: snd_virtuoso
	Kernel modules: snd_virtuoso

Just in case you're looking to prove whether 1394 is involved or not, it may help to know other hardware with that specific bridge chip?
I have the MCE and failure to boot on a Fedora 6.5.7 kernel. 6.4.7 is okay.
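The PCI IDs shown in the listing above, `[1b21:1080]` for the ASMedia bridge and `[1106:3044]` for the VIA controller, can be used to check quickly whether another machine has the same pair of devices. The sketch below is only a presence check on `lspci -nn` style output; it does not verify that the controller actually sits behind the bridge, and `has_asm_via_combo` is a hypothetical helper name.

```shell
# Read `lspci -nn` style output on stdin and succeed only if both the
# ASMedia bridge [1b21:1080] and the VIA controller [1106:3044] appear.
has_asm_via_combo() {
    listing=$(cat)
    printf '%s\n' "$listing" | grep -q '\[1b21:1080\]' || return 1
    printf '%s\n' "$listing" | grep -q '\[1106:3044\]' || return 1
    return 0
}

# Example: lspci -nn | has_asm_via_combo && echo "ASM1083/1085 and VT6306/7/8 both present"
```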
Comment 7 Takashi Sakamoto 2023-10-21 13:03:12 UTC
Hi Tony Vroon,

> Just in case you're looking to prove whether 1394 is involved or
> not, it may help to know other hardware with that specific bridge
> chip? I have the MCE and failure to boot on a Fedora 6.5.7 kernel.
> 6.4.7 is okay.

Thanks for the report. However, as I posted in LKML[1], the issue
seems to appear only with the combination of the 1394 OHCI hardware in
question (i.e. VIA VT6307), the PCIe-to-PCI bridge in question (i.e.
ASM1083), and a recent AMD chipset for Ryzen machines (e.g. B450, X370,
X570). I guess you are using such an AMD chipset when you encounter the
MCE failure.

At present, I believe multiple causes underlie the issue. The Linux
kernel potentially has the problem, and the change to the firewire-ohci
module exposes it indirectly.

[1] https://lore.kernel.org/lkml/20231016155657.GA7904@workstation.local/


Regards
Comment 8 Tony Vroon 2023-10-21 21:12:02 UTC
(In reply to Takashi Sakamoto from comment #7)
> Thanks for the report. However, as I posted in LKML[1], the issue
> seems to appear just in the combination of the issued 1394 OHCI
> hardware (i.e. VIA6307), the issued PCIe-to-PCI bridge (i.e. ASM1083),
> and recent AMD chipset for Ryzen machines (I.e. B450, X370, X570). I
> guess you use such AMD chipset when encountering the MCE failure.

Absolutely right, MSI MS-7D53 (MPG X570S EDGE MAX WIFI) with an X570 chipset.
Comment 9 Takashi Sakamoto 2023-10-22 13:03:03 UTC
> Absolutely right, MSI MS-7D53 (MPG X570S EDGE MAX WIFI) with an X570 chipset.

Okay. It is what I expected.

Could I ask you to test snd_virtuoso with your Xonar sound card without loading the firewire-ohci kernel module? You can add `module_blacklist=firewire_ohci` to the kernel command line of your v6.5.7 kernel for that purpose.

Even if the above test succeeds, I think it does not immediately mean that the ASM1083 works well on the VT6306/7/8 side, since the bridge chip is used in different ways, patterns, and configurations. However, it would be helpful information for investigating the issue.

Thanks
Comment 10 Tony Vroon 2023-10-22 14:31:24 UTC
I get the reboot even with modules_blacklist=firewire_ohci it seems.
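The parameter documented in the kernel's admin guide is spelled `module_blacklist=` (singular) and takes a comma-separated list of module names, so it is worth confirming the spelling before drawing conclusions from such a test. A small sketch of that check, with `is_blacklisted` as an illustrative helper that reads `/proc/cmdline` unless a command line is passed explicitly:

```shell
# Succeed if MODULE appears in a module_blacklist= entry of the kernel
# command line. Usage: is_blacklisted MODULE [CMDLINE]
is_blacklisted() {
    module="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do              # split the command line on whitespace
        case "$arg" in
            module_blacklist=*)
                list="${arg#module_blacklist=}"
                # Wrap in commas so only whole names match.
                case ",$list," in
                    *",$module,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Example: is_blacklisted firewire_ohci && echo "firewire_ohci is blacklisted"
```

Running `lsmod | grep firewire` after boot is a complementary check that the module was in fact not loaded.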
Comment 11 Tony Vroon 2023-10-22 14:35:23 UTC
Do we want to look at the ASPM refactoring that went in with 6.5?
Comment 12 Takashi Sakamoto 2023-10-22 23:03:32 UTC
> I get the reboot even with modules_blacklist=firewire_ohci it seems.

OK. I presume that the ALSA PCM/Control character devices for your Xonar card work as expected after booting up in that case, right?

> Do we want to look at the ASPM refactoring that went in with 6.5?

Unfortunately, I can reproduce the issue with a backported firewire-ohci module on a v6.2 kernel. It occurs on my AMD X370 chipset (Gigabyte GA-AX370-Gaming 5 rev. 1.0, F51h BIOS version).
Comment 13 Mario Limonciello (AMD) 2023-11-07 21:28:54 UTC

*** This bug has been marked as a duplicate of bug 217993 ***
Comment 14 Takashi Sakamoto 2024-01-16 01:45:07 UTC
Hi,

The change to the 1394 OHCI driver, aimed at suppressing the unexpected
system reboot on AMD Ryzen machines[1], has been merged into Linux kernel
v6.7[2]. It has also been applied to the following stable and longterm
kernel releases.

* 6.6.11[3]
* 6.1.72[4]
* 5.15.147[5]
* 5.10.208[6]
* 5.4.267[7]
* 4.19.305[8]
* 4.14.336[9]

Once the downstream distribution project provides the corresponding kernel
packages, you should no longer encounter the unexpected system reboot.
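The version list above can be turned into a quick check of a running kernel. This is a sketch only: `has_firewire_fix` is a hypothetical helper, it compares plain version numbers from `uname -r`, and it cannot see distribution backports, so series not named in the list are reported as `unknown`.

```shell
# Print "yes", "no", or "unknown" depending on whether a kernel release
# string is at or above the fixed version listed for its series.
has_firewire_fix() {
    release="${1:-$(uname -r)}"
    ver="${release%%-*}"                 # drop suffixes such as -200.fc38.x86_64
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    case "$rest" in
        *.*) patch="${rest#*.}" ;;
        *)   patch=0 ;;
    esac
    minor="${minor%%[!0-9]*}"            # keep leading digits only
    patch="${patch%%[!0-9]*}"
    [ -n "$patch" ] || patch=0
    case "$major.$minor" in
        6.6)  need=11 ;;
        6.1)  need=72 ;;
        5.15) need=147 ;;
        5.10) need=208 ;;
        5.4)  need=267 ;;
        4.19) need=305 ;;
        4.14) need=336 ;;
        *)
            # v6.7 and later mainline releases contain the change.
            if [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 7 ]; }; then
                echo yes
            else
                echo unknown
            fi
            return 0
            ;;
    esac
    if [ "$patch" -ge "$need" ]; then echo yes; else echo no; fi
}

# Example: has_firewire_fix "$(uname -r)"
```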

Note that the following combination of hardware is not necessarily suitable,
depending on your use case:

* Any type of AMD Ryzen machine
* 1394 OHCI hardware consisting of:
    * Asmedia ASM1083/1085
    * VIA VT6306/6307/6308

When working with a time-aware protocol, such as audio sample processing, it
is advisable to avoid the combination. The change comes with a functional
limitation: the software stack does not provide precise hardware time
in this case.

If you choose to continue using an AMD Ryzen machine, the recommendation is
to replace the 1394 OHCI hardware with another one. Conversely, if you
choose to continue using the 1394 OHCI hardware, the recommendation is to
use a machine from a vendor other than AMD.

Thanks for your report and long patience.

[1] https://git.kernel.org/torvalds/linux/c/ac9184fbb847
[2] https://lore.kernel.org/lkml/CAHk-=widprp4XoHUcsDe7e16YZjLYJWra-dK0hE1MnfPMf6C3Q@mail.gmail.com/
[3] https://lore.kernel.org/lkml/2024011058-sheep-thrower-d2f8@gregkh/
[4] https://lore.kernel.org/lkml/2024011052-unsightly-bronze-e628@gregkh/
[5] https://lore.kernel.org/lkml/2024011541-defective-scuff-c55e@gregkh/
[6] https://lore.kernel.org/lkml/2024011532-lustiness-hybrid-fc72@gregkh/
[7] https://lore.kernel.org/lkml/2024011519-mating-tag-1f62@gregkh/
[8] https://lore.kernel.org/lkml/2024011508-shakiness-resonant-f15e@gregkh/
[9] https://lore.kernel.org/lkml/2024011046-ecology-tiptoeing-ce50@gregkh/


Thanks

Takashi Sakamoto
