Bug 218003
Description
Suroto
2023-10-12 07:44:05 UTC
Additional hardware details: S/N: Pf4DM1wG, Model Name: Lenovo V14 G4 AMN (earlier mistyped as G14).

(In reply to Suroto from comment #0)
> Hello,
>
> I have installed fedora workstation 38, with kernel default (version 6.2 ??).
> But keyboard not working.
> Then I try ubuntu 22.04.3 LTS . it's same problem.
> Then try upgrade kernel of Fedora Workstation 38 to kernel 6.6..
> keyboard still not working.
> Does hotkeys (usually laptop-specific) have problem?

Hotkeys work only for the screen contrast key.

Mario, could you take a look please?

This was reported months ago:

https://forum.manjaro.org/t/lenovo-amd-no-keyboard/141597/5
https://unix.stackexchange.com/questions/751668/lenovo-v15-g4-amn-mouse-and-keyboard-is-not-working-under-linux
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-6.2/+bug/2034477
https://askubuntu.com/questions/1480477/laptop-keyboard-and-mosuepad-dosnt-work-on-lenovo-v15-g4-amn-type-82yu

Multiple Lenovo laptops exhibit this issue and the company doesn't seem to care one bit. They could have fixed the issue themselves. Please just don't buy or recommend their products to anyone who cares about Linux.

I've no idea who to ping or what to do. I've heard that some modern laptops have some sort of I2C (?) multiplexers which need to be set up properly in order to enable keyboard input, but how it all works, again I've no clue.

I'm CC'ing Mario Limonciello because I remember he knows something about that. Maybe he could chime in and give instructions on how to debug the issue and fix it for all the affected people.

CC'ing Tamim Khan for good measure.

Thank you for your explanation. For the past 15 years I have installed Linux on various brands and types of computers. Only recently have I encountered hardware that Linux doesn't recognize. I will try to file an issue ticket with Lenovo.

Same issue with IdeaPad Slim 3 14AMN8, quite a lot of good info in https://bugzilla.kernel.org/show_bug.cgi?id=217873 but no solution.
I have tried an immense amount of kernel parameter combinations and patching the latest mainline kernel with IRQ quirks of different combinations, to no avail. The only combination of i8042 parameters that do ANYTHING is "i8042.direct i8042.dumbkbd i8042.noloop i8042.nopnp" which finally gets a working keyboard (but with an insane delay of 15-60 seconds per keypress). Dumbkbd doesn't seem strictly necessary but you receive 10-20 commas before any actual keyboard input gets through. Really hoping for a solution or any suggestions on what to try next. Can both of you guys please test with the latest 6.6-rc, and if it continues to fail attach these to the bug: 1) acpidump 2) your dmesg from 6.6-rcX 3) dmidecode And if you haven't already; please upgrade to the latest BIOS release for your systems from Lenovo. Created attachment 305213 [details]
acpidump 6.6-rc5
Created attachment 305214 [details]
dmesg 6.6-rc5
Created attachment 305215 [details]
dmidecode 6.6-rc5
Attached the requested files, running 6.6-rc5 and the latest BIOS available (L1CN32WW) Hi p0lloc, I read in bug 217873 that you already have tried to add your model to the DMI quirks to make drivers/acpi/resource.c use the MADT override info on your AMD Zen based laptops resulting in: ACPI: IRQ 1 override to edge, high(!) Showing up in the log, which is what that DMI quirk should do. So it looks like you got the patch right but this unfortunately does not help. This suggests that your IdeaPad Slim 3 14AMN8 is a whole new category of kbd IRQ bug where neither the DSDT nor the MADT override info is correct (what fun). This really shows once more that to fix this once and for all we need to find a way to readback the actually IRQ settings setup by the BIOS at boot and use those (ignoring all the tables which tell us the info since those seem to be unreliable). But for now first lets confirm that this really is an IRQ type / polarity issue. By default 6.6-rc5 uses "edge low" as trigger type and with the DMI quirk which you tested it uses "edge high" as trigger type and neither works. We have seen several Intel based laptop models which need to NOT use the MADT override so that they stick with the DSDT setting of "level low" so lets give that a try. I'll attach a quick hack which ignores all the ACPI tables and just sets IRQ1 directly to "level low". Please build a kernel with this and see if this helps. If it does not help please try "level high" by changing: u8 pol = ACPI_ACTIVE_LOW; to: u8 pol = ACPI_ACTIVE_HIGH; Note a level IRQ with the wrong polarity (low vs high) will lead to an interrupt storm which might very well lead to a non booting system. So make sure you have an entry for another kernel in your grub menu as fallback. Also if the IRQ storm is not "bad" enough it may seem as if everything works, while at the same time the CPU is very busy. 
So if things seem to work please run: cat /proc/interrupts and check that the numbers for IRQ 1 are not going crazy (each keypress should add 2 to the IRQ count one for press + one for release). Created attachment 305217 [details]
[PATCH] ACPI: resource: IRQ polarity hack
Here is the promised patch for testing on the Lenovo IdeaPad Slim 3 14AMN8.
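The /proc/interrupts check described above (each keypress should add 2 to the IRQ 1 count) can be automated roughly as follows. This is a minimal sketch, not part of any patch in this bug: the helper names `irq_count` and `looks_like_storm` are hypothetical, and the "ten times the expected count" threshold is an arbitrary assumption chosen only for illustration.

```python
def irq_count(interrupts_text, irq="1"):
    """Sum the per-CPU counters on the /proc/interrupts row for `irq`."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != irq:
            continue  # header row or a different IRQ
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # counters end where the controller name begins
            total += int(field)
        return total
    raise KeyError(f"IRQ {irq} not found")


def looks_like_storm(before, after, keypresses):
    """Flag a delta far above the expected two interrupts per keypress.

    The factor of 10 is an arbitrary, assumed threshold.
    """
    expected = 2 * keypresses
    return (after - before) > 10 * max(expected, 1)
```

To use it, read `/proc/interrupts` once, press a known number of keys, read it again, and pass both totals to `looks_like_storm`.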
Suroto for starters can you please provide the logs which Mario requested in comment 8 ? p.s. p0lloc after building the kernel with the patch dmesg should contain: ACPI: IRQ 1 override to level(!), low please check that this is present in dmesg (assuming the kernel boots at all). (In reply to Hans de Goede from comment #14) > Hi p0lloc, > > I read in bug 217873 that you already have tried to add your model to the > DMI quirks to make drivers/acpi/resource.c use the MADT override info on > your AMD Zen based laptops resulting in: > > ACPI: IRQ 1 override to edge, high(!) > > Showing up in the log, which is what that DMI quirk should do. So it looks > like you got the patch right but this unfortunately does not help. > > This suggests that your IdeaPad Slim 3 14AMN8 is a whole new category of kbd > IRQ bug where neither the DSDT nor the MADT override info is correct (what > fun). > > This really shows once more that to fix this once and for all we need to > find a way to readback the actually IRQ settings setup by the BIOS at boot > and use those (ignoring all the tables which tell us the info since those > seem to be unreliable). > > But for now first lets confirm that this really is an IRQ type / polarity > issue. > > By default 6.6-rc5 uses "edge low" as trigger type and with the DMI quirk > which you tested it uses "edge high" as trigger type and neither works. > > We have seen several Intel based laptop models which need to NOT use the > MADT override so that they stick with the DSDT setting of "level low" so > lets give that a try. > > I'll attach a quick hack which ignores all the ACPI tables and just sets > IRQ1 directly to "level low". Please build a kernel with this and see if > this helps. 
> > If it does not help please try "level high" by changing: > > u8 pol = ACPI_ACTIVE_LOW; > > to: > > u8 pol = ACPI_ACTIVE_HIGH; > > Note a level IRQ with the wrong polarity (low vs high) will lead to an > interrupt storm which might very well lead to a non booting system. So make > sure you have an entry for another kernel in your grub menu as fallback. > > Also if the IRQ storm is not "bad" enough it may seem as if everything > works, while at the same time the CPU is very busy. > > So if things seem to work please run: > > cat /proc/interrupts > > and check that the numbers for IRQ 1 are not going crazy (each keypress > should add 2 to the IRQ count one for press + one for release). Hello Hans, thank you for the swift response! I tried applying your patch but unfortunately it doesn't seem to help. Curiously I get the message "ACPI: IRQ 1 override to edge(!), high" and not "level(!), low" as you suggested. Looking at the source code, it doesn't seem that the variables "t" or "p" are ever modified, so I assume it should always show "edge" and "high"? I went ahead and tried all combinations of edge/level and high/low but sadly to no avail. This is the output in dmesg: LEVEL_SENSITIVE ACTIVE_LOW: edge(!), high LEVEL_SENSITIVE ACTIVE_HIGH: edge(!), high(!) EDGE_SENSITIVE ACTIVE_HIGH: edge, high(!) EDGE_SENSITIVE ACTIVE_LOW: no override message > I tried applying your patch but unfortunately it doesn't seem to help. Curiously I get the message "ACPI: IRQ 1 override to edge(!), high" and not "level(!), low" as you suggested. Looking at the source code, it doesn't seem that the variables "t" or "p" are ever modified, so I assume it should always show "edge" and "high"? Ah right my patch neglects to change t and p, but the (!) does show the settings are actually being changed so it does do what it should. Its unfortunate that the patch does not help. 
From your description in bug 217873 where you write "This causes an extreme input delay (15 seconds to 1 minute per key press) but at least something happens." I was hoping this would be an IRQ issue, because that sounds a lot like IRQs not working. I wonder what happens if you mix my patch which manually sets the interrupt trigger type with some of the i8042 options. E.g. try i8042.nopnp (as suggested in the dmesg) + i8042.dumbkbd + i8042.noloop note using i8042.direct generally speaking is a bad idea on laptops ... I wonder what the output of "dmesg | grep i8042" is when adding "i8042.nopnp i8042.dumbkbd i8042.noloop" to the kernel commandline ? And as said try mixing that with overriding the IRQ trigger type for starters I would go with "edge high" since that is the default override (which gets skipped on your laptop because it is AMD Zen based). (In reply to Mario Limonciello (AMD) from comment #8) > Can both of you guys please test with the latest 6.6-rc, and if it continues > to fail attach these to the bug: > > 1) acpidump > 2) your dmesg from 6.6-rcX > 3) dmidecode ----------- OK, I'll try kernel 6.6-rc. Later I will update the results. Created attachment 305228 [details]
dmesg
Created attachment 305229 [details]
dmidecode34
Suroto, thank you for the logs. To try and debug this further we really also need an acpidump though.

On Fedora can you please do:

sudo dnf install acpica-tools

and then run:

sudo acpidump -o acpidump.txt

And then attach the generated acpidump.txt file here ?

I have now tried to apply the i8042 options together with the 3 different IRQ types. Interestingly they all make the keyboard work, but with the same extreme delay as previously mentioned. It seems that i8042.direct was never needed, since it works fine with just "i8042.noloop i8042.dumbkbd i8042.nopnp".

The "dmesg | grep i8042" output reads the same for all of them:

i8042: PNP detection disabled
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input3

It seems that the AUX port is vital for the keyboard to work, since it only works whenever "AUX port at 0x60,0x64 irq 12" is displayed. This confuses me since I always thought that the AUX port was for the mouse/trackpad. I tried to add some debug prints in the i8042 code and it seems that i8042_pnp_aux_probe in i8042-acpipnpio.h is never called, which prevents the AUX port from being created. Does this mean that the AUX port could not be found or is it just the IRQ failing? Nothing else in the code seems to be returning any errors.

I want to point out a few things I noticed about p0lloc's logs.

1) This system is Mendocino based, which I want to say is the first time we've seen one of these IRQ bugs outside of Cezanne, Barcelo and Rembrandt. There is nothing fundamentally incompatible about Mendocino; I just want to mention it in case it shows any patterns in the future.

2) The GPIO controller failed to set up. It's possible that there is an IRQ storm going on because those interrupts aren't being serviced.
[ 0.276113] amd_gpio AMDI0030:00: error -EINVAL: IRQ index 0 not found
[ 0.278464] amd_gpio: probe of AMDI0030:00 failed with error -22

I do see that IRQ should be IRQ 7 from that acpidump.

Device (GPIO)
{
    Name (_HID, "AMDI0030")  // _HID: Hardware ID
    Name (_CID, "AMDI0030")  // _CID: Compatible ID
    Name (_UID, 0x00)  // _UID: Unique ID
    Method (_CRS, 0, NotSerialized)  // _CRS: Current Resource Settings
    {
        Name (RBUF, ResourceTemplate ()
        {
            Interrupt (ResourceConsumer, Level, ActiveLow, Shared, ,, )
            {
                0x00000007,
            }
            Memory32Fixed (ReadWrite,
                0xFED81500,         // Address Base
                0x00000400,         // Address Length
                )
        })
        Return (RBUF) /* \_SB_.GPIO._CRS.RBUF */
    }

    Method (_STA, 0, NotSerialized)  // _STA: Status
    {
        If ((TSOS >= 0x70))
        {
            Return (0x0F)
        }
        Else
        {
            Return (0x00)
        }
    }
}

3) I don't see any reasons that it should be on IRQ12.

Regarding Suroto's logs:

1) The system is also Mendocino based.
2) I see the same amd_gpio probe failed error.

IMO we should start out with fixing the problem causing the probe for amd_gpio to fail on both of these systems and hopefully that leads to a solution for the keyboard as well.

Maybe we can start with /proc/ioports and /proc/iomem (both captured as root) to look for any resource overlap or misassignment relative to what the kernel is declaring.

Or Hans, do these mentions jog any other ideas for you?

Re: comment #25, the following logs are from a Lenovo V15 G4 AMN, the same machine as Suroto's, as far as I can tell. They are from an official Debian kernel (6.1.0-13-amd64, 6.1.55-1), i.e., without any patches from this bug, but for ioports and iomem, these shouldn't make any difference. For completeness, I'm adding the acpidump output as well.

Created attachment 305230 [details]
dmesg-6.1.55.txt
Created attachment 305231 [details]
acpidump.txt
Created attachment 305232 [details]
dmidecode.txt
Created attachment 305233 [details]
/proc/ioports
Created attachment 305234 [details]
/proc/iomem
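One way to look for the resource overlap or misassignment mentioned in comment #25 is to compare the Memory32Fixed window from the acpidump (base 0xFED81500, length 0x400) with the ranges listed in /proc/iomem. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes the usual "start-end : name" layout of /proc/iomem captured as root (without root the addresses read as zeros).

```python
def parse_iomem(text):
    """Return (start, end, name) tuples from /proc/iomem-style text."""
    regions = []
    for line in text.splitlines():
        span, sep, name = line.partition(" : ")
        if not sep:
            continue  # not a "start-end : name" row
        start, _, end = span.strip().partition("-")
        regions.append((int(start, 16), int(end, 16), name.strip()))
    return regions


def overlapping(regions, base, length):
    """List regions that intersect the window [base, base + length - 1]."""
    last = base + length - 1
    return [r for r in regions if r[0] <= last and base <= r[1]]
```

Calling `overlapping(parse_iomem(text), 0xFED81500, 0x400)` on the attached file would list every entry that touches the GPIO controller's window; nested child entries show up as separate tuples because indentation is ignored.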
Mario, thank you for your input. I agree that the IRQ polarity and/or i8042 options seem to be a dead end here.

The GPIO controller issue indeed looks suspect so let's start with trying to fix that and then see from there.

The "error -EINVAL: IRQ index 0 not found" error comes from platform_get_irq() and looking at that function the only way it can return -EINVAL without first printing an error is by acpi_irq_get() returning -EINVAL.

Which means that either one of these 2 calls is returning -EINVAL:

int acpi_irq_get(acpi_handle handle, unsigned int index, struct resource *res)
{
	...
	rc = acpi_irq_parse_one(handle, index, &fwspec, &flags);
	if (rc)
		return rc;
	...
	rc = irq_create_fwspec_mapping(&fwspec);
	if (rc <= 0)
		return -EINVAL;

with me suspecting that it is the second call which is failing.

Created attachment 305236 [details]
[PATCH] i8259 hack allow mapping of legacy IRQs with NULL PIC
I believe that this is related to this dmesg message:
[ 0.066641] Using NULL legacy PIC
Which likely is causing mapping of IRQs under 16 to fail, which would explain both the kbd IRQ and the GPIO irq failure to work.
Can someone please give this quick hack a try and see if that helps ?
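For anyone checking whether their machine shows the same symptoms, the three dmesg messages quoted so far can be searched for in a saved log. This is a minimal sketch; the marker names and the `scan_dmesg` helper are hypothetical, and the patterns are written from the log lines quoted in this bug, so other kernel versions may word them differently.

```python
import re

# Patterns taken from the dmesg lines quoted earlier in this bug.
MARKERS = {
    "null_pic": re.compile(r"Using NULL legacy PIC"),
    "irq_index_missing": re.compile(
        r"amd_gpio \S+ error -EINVAL: IRQ index \d+ not found"),
    "gpio_probe_failed": re.compile(
        r"amd_gpio: probe of \S+ failed with error -\d+"),
}


def scan_dmesg(text):
    """Return a dict mapping each marker name to the matching log lines."""
    found = {name: [] for name in MARKERS}
    for line in text.splitlines():
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                found[name].append(line.strip())
    return found
```

A log in which all three lists are non-empty matches the pattern seen on the machines discussed here; that is a hint, not proof, that the same BIOS behaviour is involved.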
Excellent find! If it's using "NULL legacy PIC" that means that 8259 wasn't programmed properly by BIOS (Linux currently expects it). I've seen this in the past, and it's always been fixed by BIOS. If Hans' hack doesn't work, another hack idea to confirm it is to program registers for it from a "broken" Linux kernel and then kexec into "another" kernel and see if it helps. You can use any port I/O tools to write "0x68" to port "0x21" followed by "0x1" to port "0x21" to do this. Created attachment 305238 [details]
Another possible hack
I discussed this a bit with Boris, can you try this other hack? This should be tried independently of the two other hacks that Hans and I have suggested.
It would force the system down the ACPI reduced hardware path.
If this "works", please check if suspend/resume works and also share a kernel log.
(In reply to Hans de Goede from comment #23)
> Suroto, thank you for the logs. To try and debug this further we really also
> need an acpidump though.
>
> On Fedora can you please do:
>
> sudo dnf install acpica-tools
>
> and then run:
>
> sudo acpidump -o acpidump.txt
>
> And then attach the generated acpidump.txt file here ?

I ran into a problem installing acpica-tools and am still trying to solve it. I will send the file once I have successfully installed the program.

Hello Hans, Mario and Boris!

I have this laptop and applied your patches to Debian (kernel 6.5.7). dmesg will be attached.

To put it briefly:
1. Hans patch:
-- Keyboard and trackpad work
-- Suspend also works (there is a bug with multiple "nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xb6674000 flags=0x0000]" messages, but it still works)
2. Mario/Boris patch:
-- Keyboard and trackpad work
-- Suspend does not work; the machine does something like a power-off and startup instead

Created attachment 305241 [details]
kernel log with Hans patch
Created attachment 305242 [details]
kernel log with Hans patch (suspend)
Created attachment 305243 [details]
kernel log with Mario and Boris patch
Created attachment 305244 [details]
kernel log with Mario and Boris patch (suspend)
> 1. Hans patch:
> -- Keyboard and Trackpad is working
> -- Suspend also working (have a bug with multiple "nvme 0000:01:00.0: AMD-Vi:
> Event logged [IO_PAGE_FAULT domain=0x000b address=0xb6674000 flags=0x0000]"
> but still work)
Ok, so we are definitely on the right track here. As simple as my patch is I don't expect it to be able to go upstream as is though.
I'm a bit out of my depth here. The whole PIC <-> IOAPIC logic is very convoluted. Mario can you ask Boris to take a look at this issue?
I realize this is a BIOS bug, but I still think we should fix this in the kernel since chasing all BIOSes which have this and get them updated is not going to work.
Kyrylo, about the nvme suspend/resume issue, does this also happen without my patch ?
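To compare the nvme IO_PAGE_FAULT behaviour with and without a patch, the timestamps of the "AMD-Vi: Event logged [IO_PAGE_FAULT ...]" lines can be pulled out of two saved kernel logs. A minimal sketch, assuming the log lines carry the usual "[ seconds.micros]" dmesg prefix; the helper names are hypothetical.

```python
import re

# One match per log line: a dmesg timestamp followed, on the same line,
# by an AMD-Vi IO_PAGE_FAULT event.
FAULT = re.compile(
    r"\[\s*(\d+\.\d+)\][^\n]*AMD-Vi: Event logged \[IO_PAGE_FAULT")


def fault_times(log_text):
    """Return the timestamps (in seconds) of all IO_PAGE_FAULT events."""
    return [float(m.group(1)) for m in FAULT.finditer(log_text)]


def gaps(times):
    """Return the spacing between consecutive events, rounded to ms."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]
```

A single burst shows up as one timestamp (or a few very close together), while a recurring fault shows up as a long list with roughly constant gaps.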
Looking at the code more I do think that doing something like my hack is probably the way to deal with this. Maybe something like this:

diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
index 30a55207c000..2136e85dd5b0 100644
--- a/arch/x86/kernel/i8259.c
+++ b/arch/x86/kernel/i8259.c
@@ -317,6 +317,16 @@ static int probe_8259A(void)
 	if (new_val != probe_val) {
 		printk(KERN_INFO "Using NULL legacy PIC\n");
 		legacy_pic = &null_legacy_pic;
+#ifdef CONFIG_X86_IO_APIC
+		/*
+		 * Some AMD Zen based devices have the PIC disabled by the BIOS
+		 * but they still use legacy ISA IRQs attached through the IOAPIC.
+		 */
+		if (boot_cpu_has(X86_FEATURE_ZEN)) {
+			printk(KERN_INFO "Using IOAPIC for legacy PIC/ISA IRQs\n");
+			legacy_pic->nr_legacy_irqs = NR_IRQS_LEGACY;
+		}
+#endif
 	}
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);

?

(In reply to Hans de Goede from comment #42)
> > 1. Hans patch:
> > -- Keyboard and Trackpad is working
> > -- Suspend also working (have a bug with multiple "nvme 0000:01:00.0: AMD-Vi:
> > Event logged [IO_PAGE_FAULT domain=0x000b address=0xb6674000 flags=0x0000]"
> > but still work)
>
> Ok, so we are definitely on the right track here. As simple as my patch is I
> don't expect it to be able to go upstream as is though.
>
> I'm a bit out of my depth here. The whole PIC <-> IOAPIC logic is very
> convoluted. Mario can you ask Boris to take a look at this issue?
>
> I realize this is a BIOS bug, but I still think we should fix this in the
> kernel since chasing all BIOSes which have this and get them updated is not
> going to work.
>
> Kyrylo, about the nvme suspend/resume issue, does this also happen without
> my patch ?

Tested it: the issue still happens, but it is logged only once (one batch of records), not every 5 seconds as it is with the patch.

Created attachment 305246 [details]
raw kernel log 6.1.0 before suspend
Created attachment 305247 [details]
raw kernel log 6.1.0 after suspend
Created attachment 305248 [details]
run missing BIOS init sequence
#43:
Although it is simple, I'm worried that it is overzealous. It's going to apply to Epyc processors too, and those happen to be used in things like Azure. In Azure don't they use the NULL PIC path? I don't know whether the Zen X86 feature is exposed inside confidential computing VMs, though.
How about instead we just run the missing initialization sequence from the BIOS as a quirk when these systems are detected?
Have a try with this patch instead.
Just a side note, regarding Mario's patch in #47, I've noticed that the exact same BIOS update[0] applies to the following systems: IdeaPad 14s AMN8 IdeaPad 15s AMN8 IdeaPad Slim 3 14AMN8 IdeaPad Slim 3 14AMN8 1 IdeaPad Slim 3 15AMN8 IdeaPad Slim 3 15AMN8 1 Lenovo V14 G4 AMN Lenovo V14 G4 AMN 1 Lenovo V14 G4 AMN 2 Lenovo V15 G4 AMN Lenovo V15 G4 AMN 1 Lenovo V15 G4 AMN 2 [0] https://pcsupport.lenovo.com/de/en/products/laptops-and-netbooks/lenovo-v-series-laptops/lenovo-v15-g4-amn/82yu/downloads/driver-list/component?name=BIOS%2FUEFI&id=5AC6A815-321D-440E-8833-B07A93E0428C So I'm assuming they're all affected by this bug. I'm not sure how to get their DMI_PRODUCT_NAME, maybe they're already covered by "82Y" and "82X". Anyone know? > So I'm assuming they're all affected by this bug. I'm not sure how to get > their DMI_PRODUCT_NAME, maybe they're already covered by "82Y" and "82X". > Anyone know? @Mark - can you help here? > Just a side note, regarding Mario's patch in #47, I've noticed that the exact > same BIOS update[0] applies to the following systems: Let's see if this patch works for the ones I quirked so far. If it does work, I'd rather move the DMI detection into the error path in which case we can probably change it to "82" to cover all the Lenovo systems from this year. Created attachment 305249 [details] dmesg-6.6.0-rc6.txt Mario's patch in comment #47 seems to work (see dmesg) on my machine (V15 G4 AMN, model number 82YU). The keyboard works, suspend-resume also works (only tried suspend to ram). Created attachment 305250 [details]
run missing bios init sequence (v2)
OK thanks! Can you see if this version works still too?
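On the question of how to find a machine's DMI_PRODUCT_NAME: on a running system the DMI strings are exported under /sys/class/dmi/id/, so the value can be read there and compared against a candidate quirk string such as "82". The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the vendor check against "LENOVO" is an assumption rather than something taken from the attached patch, and the match is done as a substring test on the assumption that the quirk table uses the kernel's substring-style DMI_MATCH rather than an exact match.

```python
from pathlib import Path

DMI_ROOT = Path("/sys/class/dmi/id")


def read_dmi_field(field, root=DMI_ROOT):
    """Read one DMI identification string exported through sysfs."""
    return (Path(root) / field).read_text().strip()


def substring_match(value, needle):
    """Substring test, assumed here to mirror a non-exact DMI match."""
    return needle in value


def would_match_quirk(sys_vendor, product_name, needle="82"):
    """Hypothetical check: vendor is Lenovo and product name contains needle."""
    return (substring_match(sys_vendor, "LENOVO")
            and substring_match(product_name, needle))
```

Usage would be `would_match_quirk(read_dmi_field("sys_vendor"), read_dmi_field("product_name"))`. Note that under the substring assumption a short needle such as "82" also matches product names that merely contain "82" somewhere, which is one reason a vendor-provided model list would be more reliable.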
> [ 1244.309246] PM: hibernation: hibernation entry Also, I notice in your log you tested hibernate (suspend to disk), not suspend-to-ram. > [ 0.355620] Low-power S0 idle used by default for system suspend Can you please test suspend to idle (the default for your system). Created attachment 305251 [details]
acpidump 6.6.rc6
Mario,
Sorry for the late response; the acpidump file is attached.
(In reply to Mario Limonciello (AMD) from comment #51) > Created attachment 305250 [details] > run missing bios init sequence (v2) > > OK thanks! Can you see if this version works still too? First of all let me say that I'm happy that we seem to be heading to a conclusion with a fix here. 2 comments about the v2 patch: 1. Are all models starting with 82 AMD based laptops? I thought the 82 indicated the introduction year of the model. So this can also run on Intel based models now ? IOW I think this maybe needs a if (boot_cpu_has(X86_FEATURE_ZEN)) check ? 2. I'm wondering if it would not be better to re-run the PIC detection after force-enabling it. So something like this: bool retried = false; /* ... */ raw_spin_lock_irqsave(&i8259A_lock, flags); retry: outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ outb(probe_val, PIC_MASTER_IMR); new_val = inb(PIC_MASTER_IMR); if (new_val != probe_val) { if (!retried && boot_cpu_has(X86_FEATURE_ZEN) && dmi_check_system(amd_force_enable_pic_table)) { retried = true; goto retry; } printk(KERN_INFO "Using NULL legacy PIC\n"); ... At least it seems to me that re-doing the check after the force enable to see if the force-enable actually worked is maybe a good idea ? (In reply to David Lazar from comment #48) > Just a side note, regarding Mario's patch in #47, I've noticed that the > exact same BIOS update[0] applies to the following systems: > > IdeaPad 14s AMN8 > IdeaPad 15s AMN8 > IdeaPad Slim 3 14AMN8 > IdeaPad Slim 3 14AMN8 1 > IdeaPad Slim 3 15AMN8 > IdeaPad Slim 3 15AMN8 1 > Lenovo V14 G4 AMN > Lenovo V14 G4 AMN 1 > Lenovo V14 G4 AMN 2 > Lenovo V15 G4 AMN > Lenovo V15 G4 AMN 1 > Lenovo V15 G4 AMN 2 > > [0] > https://pcsupport.lenovo.com/de/en/products/laptops-and-netbooks/lenovo-v- > series-laptops/lenovo-v15-g4-amn/82yu/downloads/driver-list/ > component?name=BIOS%2FUEFI&id=5AC6A815-321D-440E-8833-B07A93E0428C > > So I'm assuming they're all affected by this bug. 
I'm not sure how to get > their DMI_PRODUCT_NAME, maybe they're already covered by "82Y" and "82X". > Anyone know? FYI, After BIOS update , keyboard still is not working (In reply to Mario Limonciello (AMD) from comment #52) I finally managed to test the v2 patch (from comment #51). The keyboard and touchpad work fine. > Also, I notice in your log you tested hibernate (suspend to disk), not > suspend-to-ram. > > > [ 0.355620] Low-power S0 idle used by default for system suspend > > Can you please test suspend to idle (the default for your system). I was imprecise with my choice of words: what I had tested was `systemctl hybrid-sleep`, which I assumed was equivalent to suspend-to-ram unless the battery runs empty while suspended. That appears not to be the case, so this time I tested all three: 1. systemctl hibernate -- "suspend to disk", this seems to work fine. 2. systemctl hybrid-sleep -- also seems to work fine 3. systemctl suspend -- "suspend to ram", had some issues: the first attempt managed to suspend, but upon resume the file system was reporting IO errors, to the point that even "ls" didn't work anymore. I couldn't get a dmesg output in this state. (Interestingly enough, my ssh connection resumed fine, presumably because it was already in RAM). I tried again, this time running in background `dmesg -w > dmesg-6.6.0-rc6-v2-suspend.txt`, so I'd at least capture output up to the crash (attached). This time, resume seemed to work, so I tried again, and this third time, the network did not resume properly, so I couldn't continue my ssh session, but the file system seemed to be ok, so I got some dmesg output (dmesg-6.6.0-rc6-v2-suspend-2.txt). There's a kernel stack trace in there, with this message: > [ 804.018115] Hardware became unavailable upon resume. This could be a > software issue prior to suspend or a hardware issue. I'll attach the logs right away. Created attachment 305252 [details]
dmesg-6.6.0-rc6-v2-hibernate.txt
systemctl hibernate -- seemed to work fine
Created attachment 305253 [details]
dmesg-6.6.0-rc6-v2-hybrid-sleep.txt
systemctl hybrid-sleep -- seemed to work fine
Created attachment 305254 [details]
dmesg-6.6.0-rc6-v2-suspend.txt
systemctl suspend -- this one seemed to resume fine, but see next log
Created attachment 305255 [details]
dmesg-6.6.0-rc6-v2-suspend-2.txt
systemctl suspend -- network did not come back up from sleep
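Since the distinction between hibernate, hybrid-sleep and suspend mattered for these tests, it can help to record which memory sleep state the kernel will actually use. On Linux the available states are listed in /sys/power/mem_sleep with the active one in square brackets (for example "[s2idle] deep"). A minimal sketch with a hypothetical helper name:

```python
def active_sleep_mode(mem_sleep_text):
    """Return the bracketed entry from /sys/power/mem_sleep, or None."""
    for token in mem_sleep_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None
```

Calling it on the contents of /sys/power/mem_sleep before running `systemctl suspend` makes it clear whether a given log was captured with suspend to idle or with a deeper state.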
#54 1. I had a similar concern while adjusting v2. HOWEVER I think it's mitigated by the fact the DMI check ONLY runs in the failure path now. So even though it technically applies to all laptops from "82" family for Lenovo only ones with a problem will run it. I would much rather we don't use a Zen heuristic because this has nothing to do with it being an AMD bug. It happens to be a BIOS bug on some Lenovo AMD systems. I don't see this issue on Mendocino "reference hardware", nor on some other vendor's Mendocino hardware I've seen. The only way we'll have a definitively correct list is if Mark (Lenovo) can provide it. 2. So you mean like the PIC malfunctions specifically on the quirked systems? Seems unlikely to me. Another idea is that we can put the DMI check and writing to those registers somewhere else that runs "earlier" than probe_8259A() runs (earlier than device_initcall I guess). I'm not sure a good place for that though. Nothing in arch_initcall() stands out to me. #56 OK thanks! #59 I would say that this doesn't look fine. I see NVME IOMMU page faults on both suspend and resume, which means there is still "something" wrong. I *think* it's a different issue. If it was just resume I would say it's similar to one that happened on a lot of other Lenovo systems which was another BIOS bug. By chance was your "time to resume" 10+ seconds? Can you please try (separately) iommu=pt and amd_iommu=off? (In reply to Mario Limonciello (AMD) from comment #61) > #54 > > 1. I had a similar concern while adjusting v2. HOWEVER I think it's > mitigated by the fact the DMI check ONLY runs in the failure path now. So > even though it technically applies to all laptops from "82" family for > Lenovo only ones with a problem will run it. So what I'm wondering is if the magic: outb(0x68, PIC_MASTER_IMR); outb(0x1, PIC_MASTER_IMR); sequence is something AMD specific or some generic PIC enable sequence ? 
If this is AMD specific then adding an boot_cpu_has(X86_FEATURE_ZEN) check seems to make sense. If this is not AMD specific then I'm thinking to maybe make the fix much more generic (continued below) > I would much rather we don't use a Zen heuristic because this has nothing to > do with it being an AMD bug. It happens to be a BIOS bug on some Lenovo AMD > systems. See above, this may not be an AMD bug, but if the magic enable sequence is AMD specific then IMHO it should still be guarded by some AMD check. > 2. So you mean like the PIC malfunctions specifically on the quirked > systems? Seems unlikely to me. It just generally seems like a good idea to re-check things to ensure that the enable sequence actually worked. Also since the probe path is already poking the PIC_MASTER_IMR register anyway doing the enable-sequence magic should not do anything on systems where there really is no PIC. I quick duckduckgo shows that the "Using NULL legacy PIC" message has only ever been seen in the wild (before now) on virtual machines either oracle vms (virtualbox?) or WSL2. Which presumably really don't have anything at the PIC io addresses in the specific VM config which triggers this. Also note that such configs really should set the acpi-reduced-hw flag at which point probe_8259A() will not be called at all, because the legacy_pic pointer is already set to the NULL PIC early on then. So what I'm thinking is that since the Lenovo BIOS setup does work under Windows we may see more of this and a more generic non DMI quirked fix would be better (with the retry). Specifically I'm thinking about doing something like this: bool retried = false; /* ... 
*/ raw_spin_lock_irqsave(&i8259A_lock, flags); retry: outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ outb(probe_val, PIC_MASTER_IMR); new_val = inb(PIC_MASTER_IMR); if (new_val != probe_val) { if (!retried) { printk(KERN_INFO "No PIC trying to force enable PIC\n"); outb(0x68, PIC_MASTER_IMR); outb(0x1, PIC_MASTER_IMR); retried = true; goto retry; } printk(KERN_INFO "Using NULL legacy PIC\n"); ... This is non AMD specific and should be harmless in the case where there really is no PIC: If there really is nothing at the PIC_MASTER_IMR IO address this just does 4 extra writes (2 enable sequence + 2 for retry probe) + 1 extra read and then continues as before. And if there actually is a PIC there it should now be enabled and things should work the same as when the BIOS itself had actually bothered to enable the PIC. Created attachment 305257 [details]
dmesg-6.6.0-rc6-v2-iommu-pt.txt
This is `systemctl suspend` with `iommu=pt`. The system appeared to resume fine (to my untrained eyes), and the resume time was under one second.
Created attachment 305258 [details]
dmesg-6.6.0-rc6-v2-amd-iommu-off.txt
This is `systemctl suspend` with `amd_iommu=off`. Again, resuming was fast and appeared to work ok.
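The probe-then-retry control flow proposed in comment #62 can be modelled in user space to check the claimed cost when no PIC exists. This is a toy under loud assumptions, not kernel code: `FakePic` is hypothetical, the rule that the 0x68/0x01 write pair "wakes" a dormant PIC is invented for the model (whether real hardware behaves that way is exactly what is being debated here), and the default probe value 0xFB is assumed.

```python
PIC_MASTER_IMR, PIC_SLAVE_IMR = 0x21, 0xA1


class FakePic:
    """Toy PIC: the master IMR latches writes only while 'enabled'."""

    def __init__(self, present, enabled):
        self.present, self.enabled = present, enabled
        self.imr, self._last = 0xFF, None
        self.writes = self.reads = 0

    def outb(self, value, port):
        self.writes += 1
        if not self.present or port != PIC_MASTER_IMR:
            return
        if self.enabled:
            self.imr = value
        elif (self._last, value) == (0x68, 0x01):
            self.enabled = True  # invented rule: this pair wakes the PIC
        self._last = value

    def inb(self, port):
        self.reads += 1
        if self.present and self.enabled and port == PIC_MASTER_IMR:
            return self.imr
        return 0xFF


def probe(pic, probe_val=0xFB):
    """Return (pic_found, retried), retrying once after a force-enable."""
    retried = False
    while True:
        pic.outb(0xFF, PIC_SLAVE_IMR)        # mask all of the second PIC
        pic.outb(probe_val, PIC_MASTER_IMR)
        if pic.inb(PIC_MASTER_IMR) == probe_val:
            return True, retried
        if retried:
            return False, retried            # "Using NULL legacy PIC"
        pic.outb(0x68, PIC_MASTER_IMR)       # force-enable attempt
        pic.outb(0x01, PIC_MASTER_IMR)
        retried = True
```

In this model an absent PIC costs 6 writes and 2 reads instead of 2 writes and 1 read, which agrees with the "4 extra writes + 1 extra read" estimate given in comment #62.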
#62
> See above, this may not be an AMD bug, but if the magic enable sequence is
> AMD specific then IMHO it should still be guarded by some AMD check.

No; it is not an AMD specific sequence. No magic here, this behaves like a standard PIC. If anything; "the magic base address" is specific to Lenovo's BIOS.

It's setting the 8259 vector base to 0x68 and then configuring it to run in 8086 mode. You can see some similar code examples here:
https://wiki.osdev.org/8259_PIC#Code_Examples

> Which presumably really don't have anything at the PIC io addresses in the
> specific VM config which triggers this.

I think we need the author of this commit to confirm. When we agree on a patch we need them in the "To" line.
e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately")

> Also note that such configs really should set the acpi-reduced-hw flag at
> which point probe_8259A() will not be called at all, because the legacy_pic
> pointer is already set to the NULL PIC early on then.

ACPI hardware reduced has lots of other non-obvious implications. As you noticed from the above testing suspend/resume got broken by using ACPI hardware reduced.

> This is non AMD specific and should be harmless in the case where there
> really is no PIC:

So the problem with this approach is that it's attempting to program a vector base that might not work for all systems. It /happens/ to work on these Lenovo ones, but I don't know about others.

#63
OK then there is a secondary issue with the BIOS configuration and usage of the IOMMU.
Let's treat it separately from this issue.

David can you check if reverting e179f6914152 fixes this issue? I would think it's still broken without the missing sequence.

(In reply to Mario Limonciello (AMD) from comment #65)
> #62
>
> It's setting the 8259 vector base to 0x68 and then configuring it to run in
> 8086 mode. You can see some similar code examples here:
>
> https://wiki.osdev.org/8259_PIC#Code_Examples

Thanks this is useful, this does show though that your init sequence seems to be wrong, the proper init sequence is:

    outb_pic(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init, missing */
    outb_pic(ISA_IRQ_VECTOR(0), PIC_MASTER_IMR); /* your 0x68 */
    outb_pic(0x04, PIC_MASTER_IMR); /* missing */
    outb_pic(0x01, PIC_MASTER_IMR);

But this is all done by init_8259A(), so for the force-init you can just call init_8259A(false). Except that you need to unlock the i8259A_lock first and relock it after.

This also illustrates why I think the retry is a good idea, that will test if a re-init is sufficient or if the PIC is disabled at some deeper level.

What I believe happens with your patch now is that since you make probe() succeed then either:

1. init_8259A() fixes things up properly, so it would be enough with the quirk to just make probe() succeed; or
2. The PIC really is disabled at some lower level, but Linux re-routes the IRQs to the IOAPIC anyway so this does not really matter. This is basically what my hack does, add mappings for the Legacy IRQs to make them work and then rely on the IOAPIC to handle them.

> I think we need the author of this commit to confirm. When we agree on a
> patch we need them in the "To" line.
> e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic
> appropriately")

Ack.

> > This is non AMD specific and should be harmless in the case where there
> > really is no PIC:
>
> So the problem with this approach is that it's attempting to program a
> vector base that might not work for all systems. It /happens/ to work on
> these Lenovo ones, but I don't know about others.

See above, the kernel resets the vector base to its own value in init_8259A(). But since it re-inits anyways the only thing which we really need to do is make probe_8259A() not set the NULL PIC, although I would prefer to make probe_8259A() call init_8259A() when we hit the bug and then have it retry the probe.

> See above, the kernel resets the vector base to its own value in
> init_8259A(). But since it re-inits anyways the only thing which we really
> need to do is make probe_8259A() not set the NULL PIC, although I would
> prefer to make probe_8259A() call init_8259A() when we hit the bug and then
> have it retry the probe.

Ah it seems to me that given the kernel resets the vector base anyway; what the probe code as it stands today was "really" doing was checking if PIC was enabled by the BIOS already or not.

In that case I think that reverting e179f6914152 will fix this issue too.

> 1. init_8259A() fixes things up properly, so it would be enough with the
> quirk to just make probe() succeed; or

Another option is that we revert e179f6914152 and come up with another way to fix kexec for HyperV.

(In reply to Mario Limonciello (AMD) from comment #66)
> David can you check if reverting e179f6914152 fixes this issue? I would
> think it's still broken without the missing sequence.

The code has changed since 2014, when that commit was made, so it no longer reverts cleanly, and I had to do it by hand. The kernel is still compiling, and I'll post the results when that's done, but I wanted to confirm that you just want me to remove the probing code. This is what probe_8259A() looks in my test:

    static int probe_8259A(void)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&i8259A_lock, flags);

        outb(0xff, PIC_MASTER_IMR); /* mask all of 8259A-1 */
        outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */

        raw_spin_unlock_irqrestore(&i8259A_lock, flags);

        return nr_legacy_irqs();
    }

Basically, it never touches legacy_pic.

> OK then there is a secondary issue with the BIOS configuration and usage of
> the IOMMU.
> Let's treat it separately from this issue.

Should I file a separate bug for that?

> This is what probe_8259A() looks in my test:
> Basically, it never touches legacy_pic.

Yup, you did what I had in mind. I think that will do the trick.

> Should I file a separate bug for that?

Yes, please file it separately.

Created attachment 305261 [details]
dmesg-6.6.0-rc6-rev-pic-probe.txt

(In reply to Mario Limonciello (AMD) from comment #66)
> David can you check if reverting e179f6914152 fixes this issue? I would
> think it's still broken without the missing sequence.

Indeed, reverting e179f6914152 makes the keyboard work. I've also attached the dmesg output (running with amd_iommu=off, and tried a suspend/resume, it worked).

> Yes, please file it separately.

Filed bug 218024.

> Indeed, reverting e179f6914152 makes the keyboard work. I've also attached
> the dmesg output (running with amd_iommu=off, and tried a suspend/resume, it
> worked).
Great, thanks!
So that reaffirms the above findings from Hans that the magic sequence isn't needed, it's just a way to claim that the PIC really is there even if the BIOS didn't program it.
The problem is the method that e179f6914152 used to detect the PIC doesn't work here.
I see three distinct approaches:
1. We revert e179f6914152 and store a variable to indicate whether a PIC was found and pass this information on to future kexec runs. This should preserve the intent of e179f6914152.
2. We revert e179f6914152 and as part of the shutdown sequence or kexec sequence disable the PIC.
3. We go back to a quirk for detection on Zen. All AMD Zen SoCs do have a PIC; whether the BIOS programs it or not appears to be a decision made by the BIOS vendor.
(In reply to Mario Limonciello (AMD) from comment #49)
> > So I'm assuming they're all affected by this bug. I'm not sure how to get
> > their DMI_PRODUCT_NAME, maybe they're already covered by "82Y" and "82X".
> > Anyone know?
>
> @Mark - can you help here?
>
> > Just a side note, regarding Mario's patch in #47, I've noticed that the exact
> > same BIOS update[0] applies to the following systems:
>
> Let's see if this patch works for the ones I quirked so far. If it does
> work, I'd rather move the DMI detection into the error path in which case we
> can probably change it to "82" to cover all the Lenovo systems from this
> year.

Does this link with the other Mendocino bug I just commented on (218024) or is it different?

For getting device IDs - suggest using PSREF. e.g:
https://psref.lenovo.com/Product/IdeaPad/IdeaPad_Slim_3_14AMN8
Go to the model tab and click on the filter icon next to "Machine Type" and that will show what models that platform has.

I couldn't find what an Ideapad 14s was though - doesn't seem to exist. I don't know that model line up at all I'm afraid.

They're two separate BIOS bugs on the same laptop (family).

Created attachment 305264 [details]
fix probing GPIO controller

I started a thread on LKML about this issue: https://lore.kernel.org/all/878r7z4kb4.ffs@tglx/.
Can someone affected please attach the requested items from tglx to this issue?

Also - I've got another patch that I think should help fix the GPIO controller probing. Can you try this (alone) and let me know if it helps that issue? If it helps fix the keyboard as well, that would be a pleasant surprise.

Created attachment 305266 [details]
no-pic-probe.tar.gz

Attached is the information that Thomas Gleixner requested in the LKML thread[0].

> 1) dmesg with 'apic=verbose' on the command line
> 2) /proc/interrupts
> 3) /sys/kernel/debug/irq/irqs/{0..15}
>
> Two versions of that please:
>
> 1) Unpatched kernel
> 2) Patched kernel

[0] https://lore.kernel.org/all/878r7z4kb4.ffs@tglx/

I managed to force what I think the problem here is on one of my machines by hardcoding NULL PIC. The patch in #75 isn't enough to fix things. Let's see what tglx has to say about this situation.

I expect this series fixes this issue, can you guys please test.
https://lore.kernel.org/linux-kernel/20231020033806.88063-1-mario.limonciello@amd.com/

Created attachment 305269 [details]
mario-v3.tar.gz

Attached is the dmesg, /proc/interrupts, and /sys/kernel/debug/irq/irqs from Mario's patch series linked from comment #78. The keyboard and mouse work fine.

Likely unrelated, but maybe worth noting: the display was blank with this kernel (although I could see the desktop fine over VNC). I suspect it's caused by me using the kernel from: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git. If I understand correctly, this is the tree we're supposed to use for submitting patches to the IRQ subsystem.

I'll also try the patch series against 6.6.0-rc6, which I've been using so far, to make sure the blank screen is not caused by the patch.

Thanks, all the interrupts look like I would expect.

> Likely unrelated, but maybe worth noting: the display was blank with this
> kernel (although I could see the desktop fine over VNC). I suspect it's
> caused by me using the kernel from:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git. If I understand
> correctly, this is the tree we're supposed to use for submitting patches to
> the IRQ subsystem.

Yeah that should be unrelated; it's because the x86 tree didn't pick up other stuff in 6.6-rcs. I see the UCSI NULL pointer bug too which got fixed in later RCs.

> I'll also try the patch series against 6.6.0-rc6, which I've been using so
> far, to make sure the blank screen is not caused by the patch.

Thanks, hopefully everything should be good there. If you pair it with your NVME suspend fix patch (or iommu=pt) I'd expect you're in good shape.

(In reply to Mario Limonciello (AMD) from comment #80)
> > Likely unrelated, but maybe worth noting: the display was blank with this
> > kernel (although I could see the desktop fine over VNC). I suspect it's
> > caused by me using the kernel from:
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git. If I understand
> > correctly, this is the tree we're supposed to use for submitting patches to
> > the IRQ subsystem.
>
> Yeah that should be unrelated; it's because the x86 tree didn't pick up
> other stuff in 6.6-rcs. I see the UCSI NULL pointer bug too which got fixed
> in later RCs.

Just to close the loop on this: everything works fine when running off of 6.6.0-rc6. Thanks for your help, Mario. Looking forward to the discussion on the LKML.

Hey I have about 30 computers to dump with linux that have the problem from this bug. I saw that a patch has been made, but I just need confirmation (if possible) that it is in live linux kernel (at least the one from arch repositories).

(In reply to Martin FILLON from comment #82)
> but I just need confirmation (if
> possible) that it is in live linux kernel (at least the one from arch
> repositories).

Hans and Mario have identified the issue, but the fix is still being discussed, so the patch isn't yet available in any kernels. If you're fine compiling your own kernel, you can apply the patches in comment #78 (but see also bug 218024). Otherwise, hang tight, it's being worked on.

Created attachment 305291 [details]
tglx-v4.tar.gz

I've tested Thomas' patch from [0], and the keyboard and mouse work fine.

[0] https://lore.kernel.org/all/87v8avawe0.ffs@tglx/

Attached is the usual debug info.

I have tested the patch on my Lenovo laptop. Now the keyboard and mouse seem to work fine. I am requesting the team to release this fix as soon as possible to help the many users who are stuck without keyboard and trackpad.
This is the patch https://bugzilla.kernel.org/attachment.cgi?id=305236 and install instructions are here https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-6.2/+bug/2034477/comments/64
Akshay

#85, you tested the wrong patch. Here is the correct patch:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=38d54ecfe293ed8bb26d05e6f0270a0aaa6656c6

The patch from #86 works without problems on my IdeaPad Slim 3 14AMN8 Ryzen 3 7320U.
Kernel: 6.5.9-arch2-1, EndeavourOS
A big thank you to Mario and Hans, I had a great first experience compiling my own kernel.

I have had the same problem for a long time. I'm using a Lenovo Thinkbook 15 Gen2 ARE 20VG006XTX. The only solutions I found that worked are these:
https://lucraymond.net/2021/07/09/fixing-suspend-resume-on-lenovo-thinkbook-15-g2-are-laptop-with-amd-in-linux/
https://lucraymond.net/2022/10/04/linux-fixing-suspend-resume-on-amd-renoir-lenovo-thinkbook-g2-are-on-kernel-5-19-6-0-and-up/

mem_sleep_default=deep doesn't work. Only amd_iommu=soft (for older kernels) and amd_iommu=off (kernel 6.0+) work.
Does this (https://www.phoronix.com/news/Linux-6.6-Fixes-9-Lenovo-Laptop) fix this issue for my laptop? Please help me, I've been waiting a very long time to get rid of this issue.

This is the distro and kernel info that I'm using now:
OS: TUXEDO OS 2 x86_64
Host: LENOVO LNVNB161216
Kernel: 6.5.0-10006-tuxedo
Uptime: 16 mins
Packages: 2177 (dpkg), 67 (flatpak)
Shell: bash 5.1.16
Resolution: 1920x1080
DE: Plasma 5.27.8 (Wayland)
WM: kwin_wayland_wr
Theme: [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze [GTK2/3]
Terminal: konsole
CPU: AMD Ryzen 5 4500U with Radeon Graphics (6) @ 2.375GHz
GPU: AMD ATI Renoir
Memory: 4314MiB / 6790MiB

You have a different issue, please open a separate one.
(In reply to Mario Limonciello (AMD) from comment #89)
> You have a different issue, please open a separate one.

Okay, but I have a similar issue: when I suspend my laptop and wake it up, it hangs on a black screen and I have to force a shutdown by holding down the power button, unless I add amd_iommu=off to my grub file. If it really is a different issue, I'm gonna open a new issue. Could you help me?

#86. Fixed. Thank you very much to Mario, Hans

(In reply to Mario Limonciello (AMD) from comment #89)
> You have a different issue, please open a separate one.

I opened a different issue, but I was not sure which context I should choose for this, so I chose Kernel. If anyone can help, I'd appreciate it.

I forgot to add a link to the issue.
https://bugzilla.kernel.org/show_bug.cgi?id=218092

Note the fixes for this have landed in 6.5.10 now, which should be available through your distro for Arch and Fedora users soon.

Thanks Mario and Hans for their precious efforts on this issue. Is there any distros using 6.6 kernel officially? And if not, can I download the kernel from and apply it to my own OS?
- Emir

> Is there any distros using 6.6 kernel officially? And if not, can I download
> the kernel from and apply it to my own OS?

The fixes are also available in 6.5.10, and 6.5.10 is currently available in Fedora updates-testing. So you can install Fedora 38 with a USB keyboard + mouse attached to the laptop and then do:

sudo dnf --enablerepo=updates-testing update 'kernel*'

Reboot into the new kernel and then the keyboard and touchpad should work.

Note coming Tuesday Fedora 39 gets released, so you may want to wait till Tuesday and then download and install F39 right away to avoid having to upgrade later.

To download Fedora go to: https://fedoraproject.org/

I've tried install Fedora 39 from bootable ISO and there are not working mouse and keyboard in installation area. Lenovo V15 G4 AMN

> I've tried install Fedora 39 from bootable ISO and there are not working
> mouse and keyboard in installation area. Lenovo V15 G4 AMN
This is expected, F39's images use 6.5.6 as kernel and the fixes for this are only available in 6.5.10 and later.
So you need to either install 6.5.10 for your current distribution (most distributions have some mainline kernel repository somewhere) or install Fedora 39 using a USB keyboard + mouse, then install all updates, and after that things should work.

Yeah, after updating the kernel everything is okay. Thanks!