Bug 197953
Description
Guenter Roeck
2017-11-21 21:59:48 UTC
Created attachment 260767 [details]
dmesg
Comment on attachment 260767 [details]
dmesg
dmesg output taken after making the following change to avoid excessive logging.
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -30,7 +30,11 @@ void handle_bad_irq(struct irq_desc *desc)
{
unsigned int irq = irq_desc_get_irq(desc);
+#if 1
+ WARN_ONCE(1, "irq %d is bad\n", irq);
+#else
print_irq_desc(irq, desc);
+#endif
kstat_incr_irqs_this_cpu(desc);
ack_bad_irq(irq);
}
Comment on attachment 260767 [details]
dmesg
dmesg output taken after making the following change to avoid excessive logging.
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -30,7 +30,11 @@ void handle_bad_irq(struct irq_desc *desc)
{
unsigned int irq = irq_desc_get_irq(desc);
+#if 1
+ WARN_ONCE(1, "irq %d is bad\n", irq);
+#else
print_irq_desc(irq, desc);
+#endif
kstat_incr_irqs_this_cpu(desc);
ack_bad_irq(irq);
}
Created attachment 260769 [details]
dmidecode
Created attachment 260771 [details]
grep -H 15 /sys/bus/acpi/devices/*/status
Created attachment 260773 [details]
lspci -vv -nk
On the first glance ACPI tables are broken in different ways. Though I need to dig documentation to understand if it's so or not. While continue looking at DSDT, I have got one question to you (or maybe two): - is this device have latest and greatest BIOS (coreboot) flashed? - is the device itself a prototype or actual device on market just with new software on top? To me it looks like there are two pins that are misconfigured by the firmware in a way that they have a) interrupt triggering enabled and b) the line is active all the time. Just to be sure, can you also attach content of /sys/kernel/debug/pinctrl/*/pins to the bug? A similar problem is also seen on Celes (Samsung Chromebook 3), though there is it not an interrupt storm but stops after a while. (GOOGLE Celes, BIOS Google_Celes.7287.92.104 08/20/2017) [ 0.662722] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.662728] ->handle_irq(): ffffffff87eb4091, [ 0.662736] handle_bad_irq+0x0/0x1e8 [ 0.662739] ->irq_data.chip(): ffffffff88aab200, [ 0.662742] 0xffffffff88aab200 [ 0.662746] ->action(): (null) [ 0.662748] IRQ_NOPROBE set [ 0.662751] unexpected IRQ trap at vector ab [ 0.662875] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.662879] ->handle_irq(): ffffffff87eb4091, [ 0.662887] handle_bad_irq+0x0/0x1e8 [ 0.662890] ->irq_data.chip(): ffffffff88aab200, [ 0.662893] 0xffffffff88aab200 [ 0.662896] ->action(): (null) [ 0.662899] IRQ_NOPROBE set [ 0.662902] unexpected IRQ trap at vector ab [ 0.662948] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.662951] ->handle_irq(): ffffffff87eb4091, [ 0.662954] handle_bad_irq+0x0/0x1e8 [ 0.662957] ->irq_data.chip(): ffffffff88aab200, [ 0.662959] 0xffffffff88aab200 [ 0.662962] ->action(): (null) [ 0.662964] IRQ_NOPROBE set [ 0.662966] unexpected IRQ trap at vector ab [ 0.662942] i8042: Warning: Keylock active [ 0.662990] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.662993] ->handle_irq(): ffffffff87eb4091, [ 0.662996] handle_bad_irq+0x0/0x1e8 [ 0.662999] ->irq_data.chip(): ffffffff88aab200, [ 0.663000] 0xffffffff88aab200 [ 0.663003] ->action(): (null) [ 0.663005] IRQ_NOPROBE set [ 0.663008] unexpected IRQ trap at vector ab [ 0.663027] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.663031] ->handle_irq(): ffffffff87eb4091, [ 0.663034] handle_bad_irq+0x0/0x1e8 [ 0.663037] ->irq_data.chip(): ffffffff88aab200, [ 0.663039] 0xffffffff88aab200 [ 0.663042] ->action(): (null) [ 0.663044] IRQ_NOPROBE set [ 0.663047] unexpected IRQ trap at vector ab [ 0.663063] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.663065] ->handle_irq(): ffffffff87eb4091, [ 0.663068] handle_bad_irq+0x0/0x1e8 [ 0.663071] ->irq_data.chip(): ffffffff88aab200, [ 0.663073] 0xffffffff88aab200 [ 0.663076] ->action(): (null) [ 0.663078] IRQ_NOPROBE set [ 0.663080] unexpected IRQ trap at vector ab [ 0.663128] irq 171, desc: ffff92727a3d0400, depth: 1, count: 0, unhandled: 0 [ 0.663131] ->handle_irq(): ffffffff87eb4091, [ 0.663134] handle_bad_irq+0x0/0x1e8 [ 0.663137] ->irq_data.chip(): ffffffff88aab200, [ 0.663139] 0xffffffff88aab200 [ 0.663142] ->action(): (null) [ 0.663144] IRQ_NOPROBE set [ 0.663147] unexpected IRQ trap at vector ab Created attachment 260793 [details]
/sys/kernel/debug/pinctrl/*/pins
We need also /sys/kernel/debug/irq_domain_mapping when IRQ domain debug option is set in kernel configuration. If you can get from both affected machines, it would be great. BIOS: I don't know; I'll need to check. The HW itself is a prototype, but HW-wise it should be close if not the same as the actually shipping device (Acer Chromebook R11). The kernel fails to boot with release units in the lab. Also, the BIOS is known to have an unfixed problem with fixed interrupt numbers (see https://bugs.chromium.org/p/chromium/issues/detail?id=704198; as far as I know the fix did not find its way into a release BIOS). I have no idea if the fix posted with that bug fixes this problem. I'll try to repeat the test with a release unit and post the result here. [Note that #9 was with a release unit] Created attachment 260795 [details]
/sys/kernel/debug/irq_domain_mapping
Can you apply below patch (it's mangled, though I think it's could be copied'n'pasted to the ->probe() function manually) and put the result here? --- a/drivers/pinctrl/intel/pinctrl-cherryview.c +++ b/drivers/pinctrl/intel/pinctrl-cherryview.c @@ -1605,6 +1605,9 @@ static int chv_gpio_probe(struct chv_pinctrl *pctrl, int irq) offset += range->npins; } +dev_info(pctrl->dev, "INTSTAT: %x", readl(pctrl->regs + CHV_INTSTAT)); +dev_info(pctrl->dev, "INTMASK: %x", readl(pctrl->regs + CHV_INTMASK)); + /* Do not add GPIOs that can only generate GPEs to the IRQ domain */ for (i = 0; i < pctrl->community->npins; i++) { const struct pinctrl_pin_desc *desc; Here you are: [ 1.620833] cherryview-pinctrl INT33FF:00: INTSTAT: 2 [ 1.620856] cherryview-pinctrl INT33FF:00: INTMASK: e [ 1.661506] cherryview-pinctrl INT33FF:01: INTSTAT: 21 [ 1.661531] cherryview-pinctrl INT33FF:01: INTMASK: 12f [ 1.706562] cherryview-pinctrl INT33FF:02: INTSTAT: 0 [ 1.706586] cherryview-pinctrl INT33FF:02: INTMASK: 0 [ 1.736480] cherryview-pinctrl INT33FF:03: INTSTAT: 0 [ 1.736504] cherryview-pinctrl INT33FF:03: INTMASK: 1 Created attachment 260797 [details]
Another debug patch to test
Please, apply this one and share results (from dmesg).
P.S. Very interesting bug, indeed.
With the patch in #16, the bad interrupt message is gone, but I get an endless sequence of: [ 51.705063] chv_gpio_irq_handler: 201673 callbacks suppressed [ 51.705082] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705158] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705223] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705283] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705367] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705434] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705489] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705545] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705629] cherryview-pinctrl INT33FF:01: 20 12f [ 51.705685] cherryview-pinctrl INT33FF:01: 20 12f Keyboard and touchpad both work. Relevant data from /proc/interrupts is now: 135: 0 0 1 0 chv-gpio 20 TS3A227E 180: 0 0 0 0 chv-gpio 9 ACPI:Event 182: 1 1 5 3 chv-gpio 11 i8042 183: 0 0 0 0 chv-gpio 12 ELAN0000:00 184: 0 0 0 0 chv-gpio 13 ELAN0001:00 300: 0 0 0 0 chv-gpio 46 max98090_interrupt 304: 0 0 0 0 chv-gpio 50 INT33BB:01 cd meaning the "0" entries are gone. Created attachment 260799 [details]
Updated dmesg after applying #16
After some more initialization: [ 501.795021] chv_gpio_irq_handler: 479851 callbacks suppressed [ 501.795046] cherryview-pinctrl INT33FF:01: 20 12f [ 501.795055] cherryview-pinctrl INT33FF:00: 2 e [ 501.795087] cherryview-pinctrl INT33FF:00: 2 e [ 501.795135] cherryview-pinctrl INT33FF:01: 20 12f [ 501.795147] cherryview-pinctrl INT33FF:00: 2 e [ 501.795175] cherryview-pinctrl INT33FF:00: 2 e [ 501.795194] cherryview-pinctrl INT33FF:01: 20 12f [ 501.795226] cherryview-pinctrl INT33FF:01: 20 12f [ 501.795231] cherryview-pinctrl INT33FF:00: 2 e [ 501.795258] cherryview-pinctrl INT33FF:00: 2 e [ 506.796027] chv_gpio_irq_handler: 482149 callbacks suppressed I'll attach another dmesg. Created attachment 260801 [details]
Another dmesg, taken after initialization is complete
One bug that I found in the driver kinda fixed by the patch. The issue we have now is due to coreboot default settings: === From Linux OS === pin 6 (GPIO_DFX_8) GPIO 0x58008200 0x05c00001 Note very first 5 in the line which means interrupt line chosen, which becomes BIT(5) in INTSTAT and in INTMASK, i.e. 0x20 with unmasked IRQ 0x12f & 0x20 == 0x20: "Interrupt Mask (Interrupt_Mask): Mask bits for the 16 interrupts. Interrupt masked when 0." === From Coreboot sources === GPI(trig_edge_low, L5, NA, non_maskable, en_rx_data, NA, NA), It looks like some frequency is on DFX_8 pin. [ 501.795231] cherryview-pinctrl INT33FF:00: 2 e Same story with HDA, I suppose. pin 37 (MF_HDA_DOCKENB) GPIO 0x24c08201 0x05c00003 GPI(trig_edge_both, L2, P_1K_H, non_maskable, en_edge_detect, NA, NA), Pierre, you might be interested in the ALSA related issues here: 1. Potential dead lock. 2. HDA has two GPIO pins dedicated for interrupts, but it looks like driver only requests one by some reason. Sorry, if not. Feel free to remove from Cc in that case. Follow-up to #7, #12: I have confirmed that the problem is seen on production units in the lab with latest firmware. For #21 .. 23: Andy - are you going to send a patch with a fix for the driver problem ? Also, is there anything we can do to work around the problem, such as disabling an interrupt if an interrupt is received but there is no registered handler ? Thanks, Guenter Commit bcb48cca23ec9 ("pinctrl: cherryview: Do not mask all interrupts in probe") may explain why the problem is not seen with older kernels. From the dmesg output, if I read it right we have this: cherryview-pinctrl INT33FF:01: 20 12f I guess first one is status and second is mask. Now this seems to indicate the reason is intr_line 5. Looking at your pins file for INT33FF:01 there is this: pin 6 (GPIO_DFX_8) GPIO 0x58008200 0x05c00001 the 5 in the first dword is the intr_line. This pin seems to be triggering from falling edges. We can't see it here, though. One idea is that you could request this GPIO from the driver through sysfs and see if it stops interrupting: # cat /sys/kernel/debug/pinctrl/INT33FF\:01/gpio-ranges GPIO ranges handled: 0: INT33FF:01 GPIOS [397 - 405] PINS [0 - 8] 9: INT33FF:01 GPIOS [406 - 418] PINS [15 - 27] 22: INT33FF:01 GPIOS [419 - 430] PINS [30 - 41] 34: INT33FF:01 GPIOS [431 - 442] PINS [45 - 56] 46: INT33FF:01 GPIOS [443 - 455] PINS [60 - 72] # echo $((397+6)) > /sys/class/gpio/export [ 46.731362] cherryview-pinctrl INT33FF:01: request pin 6 (GPIO_DFX_8) for INT33FF:01:403 Created attachment 260829 [details]
Fix for wrong offset in /proc/interrupts
Guenter, please apply the fix, it should reveal the real problematic pins.
I was playing with the driver as well and I see that INT33FF:00 always reports pending 0x02 in interrupt routine. Trying to get mapping for the pin gives 0 and it never gets ACKed properly. If I roughly do: if (!irq) { // reset corresponding INTSTAT bit } else { generic_handle_irq(irq); } then it starts working much better. I'll test the patch from #27 Monday - I don't have access to the system right now. (In reply to Dmitry Torokhov from comment #28) > I was playing with the driver as well and I see that INT33FF:00 always > reports pending 0x02 in interrupt routine. Trying to get mapping for the pin > gives 0 and it never gets ACKed properly. If I roughly do: > > if (!irq) { > // reset corresponding INTSTAT bit > } else { > generic_handle_irq(irq); > } > > then it starts working much better. Unfortunately interrupt 0 is also valid so if anything is using that, it will suddenly stop working. I think the best option is to identify the misconfigured pins (with the help of Andy's latest patc) and then update coreboot to make sure they don't generate interrupts (disable interrupt trigger flags for example from padcfg1). IIRC, I saw the same interrupt storm issue when running Gallium on a Cyan device in legacy mode (which I understand as not using Coreboot), I could run maybe 5mn before the system ran out of disk space due to the system logs... With the patch from #27 applied, plus debug log messages, I get the following: [ 45.198033] chv_gpio_irq_handler: 532644 callbacks suppressed [ 45.198036] cherryview-pinctrl INT33FF:01: irq 48: mapped to irq 0, offset -1 [ 45.198048] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198067] cherryview-pinctrl INT33FF:01: irq 48: 20 12f [ 45.198070] cherryview-pinctrl INT33FF:00: irq 49: 2 6 [ 45.198076] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198094] cherryview-pinctrl INT33FF:01: irq 48: mapped to irq 0, offset -1 [ 45.198101] cherryview-pinctrl INT33FF:00: irq 49: 2 6 [ 45.198108] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198127] cherryview-pinctrl INT33FF:00: irq 49: 2 6 [ 45.198138] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198145] cherryview-pinctrl INT33FF:01: irq 48: 20 12f [ 45.198156] cherryview-pinctrl INT33FF:01: irq 48: mapped to irq 0, offset -1 [ 45.198164] cherryview-pinctrl INT33FF:00: irq 49: 2 6 [ 45.198169] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198188] cherryview-pinctrl INT33FF:00: irq 49: 2 6 [ 45.198197] cherryview-pinctrl INT33FF:00: irq 49: mapped to irq 0, offset -1 [ 45.198212] cherryview-pinctrl INT33FF:01: irq 48: 20 12f [ 45.198221] cherryview-pinctrl INT33FF:01: irq 48: mapped to irq 0, offset -1 Log messages generated with: pending = readl(pctrl->regs + CHV_INTSTAT); + dev_info_ratelimited(pctrl->dev, "irq %d: %lx %x\n", + irq_desc_get_irq(desc), pending, readl(pctrl->regs + CHV_INTMASK)); The offending interrupts are no longer listed in /proc/interrupts, there is no log output unless I generate it myself. Note that the patch passes the negative offset to irq_find_mapping() which doesn't accept negative values as parameter; this seems wrong. The following code snippet solves (works around) the problem for me. for_each_set_bit(intr_line, &pending, pctrl->community->nirqs) { unsigned irq, offset; offset = pctrl->intr_lines[intr_line]; + if (offset == CHV_INVALID_HWIRQ) { + u32 value; + + dev_warn_ratelimited(pctrl->dev, + "Disabling chained interrupt line %d\n", + intr_line); + + value = readl(pctrl->regs + CHV_INTMASK); + value &= ~BIT(intr_line); + chv_writel(value, pctrl->regs + CHV_INTMASK); + chv_writel(BIT(intr_line), pctrl->regs + CHV_INTSTAT); + continue; + } + Resulting log output is: [ 0.187869] cherryview-pinctrl INT33FF:01: Disabling chained interrupt line 5 [ 0.847305] cherryview-pinctrl INT33FF:01: Disabling chained interrupt line 0 [ 19.164825] cherryview-pinctrl INT33FF:00: Disabling chained interrupt line 1 Would that be acceptable as add-on patch (maybe with a stronger error message) ? Guenter Hmm, the patch from Andy should expose the offending interrupts in /proc/interrupts. Note, we can't just go an disable anything because the same thing controls direct interrupts to the IO-APIC. I still think this is configuration issue and the coreboot you are running misconfigures pins to trigger interrupts for pins that are floating or so. That's why we need to identify those pins and then implement a fix to the correct place which I think is still coreboot, not the driver. Did you already try to request GPIO 403 from sysfs (without any patches)? Does it change anything? Snippet from /proc/interrupt with Andy's patch applied: 135: 1 0 0 0 chv-gpio 20 TS3A227E 180: 0 0 0 0 chv-gpio 9 ACPI:Event 182: 1 2 4 3 chv-gpio 11 i8042 183: 0 0 0 0 chv-gpio 12 ELAN0000:00 184: 0 0 0 0 chv-gpio 13 ELAN0001:00 300: 0 0 0 0 chv-gpio 46 max98090_interrupt 304: 0 0 0 0 chv-gpio 50 INT33BB:01 cd Unfortunately I don't see the 'bad' interrupts. Requesting gpio pin 403 with or without Andy's patch applied does not make a difference. I understand that this is a (coreboot ?) configuration issue. I'd just like to have a workaround in place to at least have a working image. In that sense, Andy's patch makes the situation even worse: It generates an endless amount of interrupts with no notice, warning, or any other output whatsoever. I think the easiest workaround is to add: chv_writel(0, pctrl->regs + CHV_INTMASK); to chv_gpio_probe() right before CHV_INTSTAT is cleared. That should mask all interrupts. #36: Maybe, but that would revert bcb48cca23ec9 ("pinctrl: cherryview: Do not mask all interrupts in probe"). Presumably you had a reason for applying that patch, which makes me hesitate to revert it. On another note, what happens if the BIOS/coreboot enables interrupts but Linux doesn't load/support the necessary driver(s) ? Even if the coreboot problem is fixed, wouldn't we end up with the same problem ? Created attachment 260905 [details]
Mask non-interrupt pins
So it looks like the pin that gives me trouble is pin 1 in SouthWest community, which is FST_SPI_D0. It is configured as a native function, not interrupt, but still is enabled in INTMASK register. If I mask all pins that are not configured as interrupts, then everything seems happy here.
Logs with my patch: [ 0.087998] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 34 is not configured as interrupt, resetting interrupt mask [ 0.088024] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 37 is not configured as interrupt, resetting interrupt mask [ 0.088072] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 95 is not configured as interrupt, resetting interrupt mask [ 0.088713] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 0 is not configured as interrupt, resetting interrupt mask [ 0.088749] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 15 is not configured as interrupt, resetting interrupt mask [ 0.089921] cherryview-pinctrl INT33FF:03: chv_gpio_probe: pin 0 is not configured as interrupt, resetting interrupt mask Patch in #39 works for me. Log output on my system: [ 0.172618] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 34 is not configured as interrupt, resetting interrupt mask [ 0.172631] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 37 is not configured as interrupt, resetting interrupt mask [ 0.172681] cherryview-pinctrl INT33FF:00: chv_gpio_probe: pin 95 is not configured as interrupt, resetting interrupt mask [ 0.177315] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 0 is not configured as interrupt, resetting interrupt mask [ 0.177344] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 6 is not configured as interrupt, resetting interrupt mask [ 0.177363] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 15 is not configured as interrupt, resetting interrupt mask [ 0.177381] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 18 is not configured as interrupt, resetting interrupt mask [ 0.177391] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 19 is not configured as interrupt, resetting interrupt mask [ 0.177406] cherryview-pinctrl INT33FF:01: chv_gpio_probe: pin 21 is not configured as interrupt, resetting interrupt mask [ 0.185332] cherryview-pinctrl INT33FF:03: chv_gpio_probe: pin 0 is not configured as interrupt, resetting interrupt mask and no more spurious interrupts. This is with "pinctrl: cherryview: Adjust interrupt line mapping" applied as well. That won't solve the other problem I mentioned, though: If Linux doesn't support all required drivers for which interrupts are enabled by the BIOS or coreboot, we might be back to having an interrupt storm. This leads to a follow-up question: Per commit bcb48cca23ec9, interrupts can not be masked because other entities, specifically "SCI and other direct interrupts", may need to have those interrupts to be enabled. If so, why don't the respective drivers request and enable those interrupts ? (In reply to Guenter Roeck from comment #40) > Patch in #39 works for me. Log output on my system: > > and no more spurious interrupts. This is with "pinctrl: cherryview: Adjust > interrupt line mapping" applied as well. I can't see this patch in this bugzilla or the mainline, can you provide a pointer? I need to run some audio tests on Cyan and would like to apply all suggested fixes. Thanks! #41: The patch is titled " Fix for wrong offset in /proc/interrupts " in attachments. Guenter meant "Mask non-interrupt pins". (In reply to Guenter Roeck from comment #40) > That won't solve the other problem I mentioned, though: If Linux doesn't > support all required drivers for which interrupts are enabled by the BIOS or > coreboot, we might be back to having an interrupt storm. The way it works (if using direct interrupts) is that the GPIO hardware needs to have the interrupt enabled in INTMASK and the it needs to be routed to the IO-APIC (there is a bit somewhere that BIOS/coreboot sets). Once properly done, the interrupt is routed to the IO-APIC where it is masked by default. The GPIO driver never gets interrupted by those. This means that we don't get spurious interrupts at all and drivers using the IO-APIC interrupt then enables the interrupt when needed using standard APIs. At least that's how I remember it worked. So it is important that the coreboot/BIOS configures all pins and masks properly to prevent any spurious interrupts. > This leads to a follow-up question: Per commit bcb48cca23ec9, interrupts can > not be masked because other entities, specifically "SCI and other direct > interrupts", may need to have those interrupts to be enabled. If so, why > don't the respective drivers request and enable those interrupts ? It depends on the system. Your system may not use SCI at all (there is "acpi" entry in /proc/interrupts). SCI for one is requested and enabled using standard APIs. (In reply to Mika Westerberg from comment #44) > (In reply to Guenter Roeck from comment #40) > > That won't solve the other problem I mentioned, though: If Linux doesn't > > support all required drivers for which interrupts are enabled by the BIOS > or > > coreboot, we might be back to having an interrupt storm. > > The way it works (if using direct interrupts) is that the GPIO hardware > needs to have the interrupt enabled in INTMASK and the it needs to be routed > to the IO-APIC (there is a bit somewhere that BIOS/coreboot sets). Once > properly done, the interrupt is routed to the IO-APIC where it is masked by > default. The GPIO driver never gets interrupted by those. This means that we > don't get spurious interrupts at all and drivers using the IO-APIC interrupt > then enables the interrupt when needed using standard APIs. > > At least that's how I remember it worked. It seems my memory about this is bit rusty :/ Checked from the documentation and if I understand it correctly the GPIO driver always gets interrupt when any of the pins are configured to trigger an interrupt with the exception of GPEs (there is another register GPIO_ROUT in PCU/PMC that controls whether GPE or SMI is generated). So for things to work properly the BIOS/coreboot should pass GPIOs to drivers using ACPI Gpio* resources and by default make sure they are masked. So is Dmitry's patch an acceptable workaround ? (In reply to Guenter Roeck from comment #46) > So is Dmitry's patch an acceptable workaround ? You mean for the mainline Linux? No, I think coreboot should be fixed instead. This is ridiculous. We essentially broke shipped hardware (Cyan used to work fine) and now require BIOS update to address a regression? While new BIOS is a possibility I have no idea how you expect to communicate this to users who do not run ChromeOS and thus will not get updated firmware automatically. That said, the patch above is broken; I did not actually read interrupt config before checking and masking it :/ (In reply to Dmitry Torokhov from comment #48) > This is ridiculous. We essentially broke shipped hardware (Cyan used to work > fine) and now require BIOS update to address a regression? While new BIOS is > a possibility I have no idea how you expect to communicate this to users who > do not run ChromeOS and thus will not get updated firmware automatically. What? Please point me to the regression we caused by a kernel change and we will fix it. Which commit broke this device we are talking about here? My understanding was that this is *new* hardware based on CYAN design and the GPIO has never worked any better there. Please correct me if I'm mistaken. No, we are talking about bog standard Acer Chromebook R11 (Cyan) with mainline kernel. The patch: ommit bcb48cca23ec9852739e4a464307fa29515bbe48 Author: Mika Westerberg <mika.westerberg@linux.intel.com> Date: Mon Aug 22 14:42:52 2016 +0300 pinctrl: cherryview: Do not mask all interrupts in probe breaks it badly. I'm in the middle of reading one document we never saw before (I will discuss my findings with Mika tomorrow) and I can just summarize that Mika's patch *reveals* the issue in Coreboot. In similar way as previously with that absolute IRQ numbering scheme for keyboard. It does not matter what the patch reveals, we still have to deal with it. The hardware has been in users hands for couple of years now. Something is telling me we will anticipate more DMI quirks for those broken platforms. Created attachment 260915 [details]
Mask all interrupts the community is able to generate (not GPEs)
Can you try the attached (untested) patch? Here we mask all interrupts the community can generate but do not touch GPEs. That should still keep GPEs running and at the same time prevent the interrupt storm. Pretty much similar to Dmitry's but more brutal :)
Andy, you were actually right, it was PCIE_CLKREQ3B/AUDIO_CODEC_IRQ pin that was set up as interrupt and was not masked by coreboot for some reason. I think the best way forward is to mask all interrupts Strago boards with 1.0 BIOS version and hopefully we'll sort it out in updated coreboot at some time. Created attachment 260931 [details]
Adjust interrupt line mapping (v2)
This should eventually unveil the culprit pins.
Guenter, can you please test this one (w/o any mask patch applied!)?
With added debug info, the patch from #57 results in: [ 171.732040] chv_gpio_irq_handler: 507297 callbacks suppressed [ 171.732052] cherryview-pinctrl INT33FF:00: 2 e [ 171.732058] cherryview-pinctrl INT33FF:01: intr_line=5: No irq mapping found [ 171.732098] cherryview-pinctrl INT33FF:01: 20 12f [ 171.732111] cherryview-pinctrl INT33FF:01: intr_line=5: No irq mapping found [ 171.732125] cherryview-pinctrl INT33FF:00: intr_line=1: No irq mapping found [ 171.732153] cherryview-pinctrl INT33FF:00: 2 e [ 171.732168] cherryview-pinctrl INT33FF:01: 20 12f [ 171.732180] cherryview-pinctrl INT33FF:01: intr_line=5: No irq mapping found [ 171.732201] cherryview-pinctrl INT33FF:00: intr_line=1: No irq mapping found [ 171.732225] cherryview-pinctrl INT33FF:01: 20 12f [ 171.732237] cherryview-pinctrl INT33FF:00: 2 e [ 171.732248] cherryview-pinctrl INT33FF:00: intr_line=1: No irq mapping found [ 171.732267] cherryview-pinctrl INT33FF:01: intr_line=5: No irq mapping found [ 171.732278] cherryview-pinctrl INT33FF:00: 2 e [ 171.732289] cherryview-pinctrl INT33FF:00: intr_line=1: No irq mapping found [ 171.732318] cherryview-pinctrl INT33FF:00: 2 e [ 171.732340] cherryview-pinctrl INT33FF:01: 20 12f [ 171.732345] cherryview-pinctrl INT33FF:00: intr_line=1: No irq mapping found [ 171.732371] cherryview-pinctrl INT33FF:01: intr_line=5: No irq mapping found [ 171.732376] cherryview-pinctrl INT33FF:00: 2 e which probably match the two pins mentioned by Dmitry in #56. The patch in #55 should be BIT(pctrl->community->nirqs). Everything looks good with both #55 (fixed) and #57 applied. Pins should be visible through/prof/interrupts. Can you share what is there now? Created attachment 260945 [details]
Quirk masking all interrupts in Intel_Strago based systems
This is an updated patch. Now we limit interrupt masking to Intel_Strago based systems.
Can you test that it still works?
#59: No, the pins are not there. This is what I have: 135: 0 0 0 0 chv-gpio 20 TS3A227E 180: 0 0 0 0 chv-gpio 9 ACPI:Event 182: 3 2 2 2 chv-gpio 11 i8042 183: 0 0 0 0 chv-gpio 12 ELAN0000:00 184: 0 0 0 0 chv-gpio 13 ELAN0001:00 300: 0 0 0 0 chv-gpio 46 max98090_interrupt 304: 0 0 0 0 chv-gpio 50 INT33BB:01 cd The patch in #60 doesn't work as-is. need_valid_mask is true if there is no dmi match, so the patch should be "if (!need_valid_mask) { ... }". With that change made, it works. (In reply to Guenter Roeck from comment #62) > The patch in #60 doesn't work as-is. > > need_valid_mask is true if there is no dmi match, so the patch should be "if > (!need_valid_mask) { ... }". With that change made, it works. Right, sorry about that. I just sent a fixed version upstream. Created attachment 260979 [details]
Adjust interrupt line mapping (v3)
v3 of the patch, please, provide /proc/interrupts when it's storming.
P.S. When we fill the cache we need to actually map BIOS configured interrupt at ->probe() stage. Otherwise the condition in the IRQ handler just false for unconfigured ones. I guess I didn't miss again something.
Created attachment 260981 [details] 260979: Adjust interrupt line mapping (v4) Typo fixed, sorry. See comment #64 for details. This is with the patch from #65 applied: 135: 0 0 0 1 chv-gpio 20 TS3A227E 180: 0 0 0 0 chv-gpio 9 ACPI:Event 182: 2 3 4 1 chv-gpio 11 i8042 183: 0 0 0 0 chv-gpio 12 ELAN0000:00 184: 0 0 0 0 chv-gpio 13 ELAN0001:00 300: 0 0 0 0 chv-gpio 46 max98090_interrupt 304: 0 0 0 0 chv-gpio 50 INT33BB:01 cd With added debug messages: [ 101.710013] chv_gpio_irq_handler: 524332 callbacks suppressed [ 101.710027] cherryview-pinctrl INT33FF:01: No interrupt handler for intr_line=5 [ 101.710044] cherryview-pinctrl INT33FF:00: No interrupt handler for intr_line=1 [ 101.710079] cherryview-pinctrl INT33FF:00: No interrupt handler for intr_line=1 [ 101.710111] cherryview-pinctrl INT33FF:01: No interrupt handler for intr_line=5 Somehow those unassigned interrupts hide pretty well. Should I add some debug logging somewhere ? My gosh, in the line in probe it should be ctx->irq_line = value ? intsel : CHV_INVALID_HWIRQ; (intsel instead of i) Can you give a try? Also, make sense to add these debug prints into handler (it seems you have this in your debug patch): dev_info_ratelimited(pctrl->dev, "%s: intr_line %u\n", __func__, intr_line); for (offset = 0; offset < pctrl->community->npins; offset++) { ctx = &pctrl->saved_pin_context[offset]; if (ctx->irq_line == intr_line) { irq = irq_find_mapping(gc->irq.domain, offset); dev_info_ratelimited(pctrl->dev, "%s: irq %u\n", __func__, irq); This looks better: 168: 989950 982234 990177 982452 chv-gpio 53 177: 1552615 1560150 1552186 1560079 chv-gpio 6 180: 0 0 0 0 chv-gpio 9 ACPI:Event 182: 4 3 6 5 chv-gpio 11 i8042 183: 0 0 0 0 chv-gpio 12 ELAN0000:00 184: 0 0 0 0 chv-gpio 13 ELAN0001:00 300: 0 0 0 0 chv-gpio 46 max98090_interrupt 304: 0 0 0 0 chv-gpio 50 INT33BB:01 cd I had to ratelimit print_irq_desc() in handle_bad_irq() to be able to boot. Yes, now it looks as expected, eventually. Pin 95 is #53 in the array. Though, Mika had sent a patch to eliminate that additional level of mapping and in new code it will be #95 in the /proc/interrupts. Because of his patch I need to rebase mine on top of his and send upstream. For final testing I'm going to attach both patches here. Created attachment 260983 [details]
gpio / ACPI: Drop unnecessary ACPI GPIO to Linux GPIO translation
Mika's patch (in upstream, though no available via any public git so far)
Created attachment 260985 [details]
Adjust interrupt line mapping (v5)
v5 of the patch. Dependent to "gpio / ACPI: Drop unnecessary ACPI GPIO to Linux GPIO translation".
Guenter, thanks for you patience and this long run of tests! I would really appreciate if you can perform a final test against attached patches. @Andy: My pleasure. Thanks for working on the fix! With both patches from #71 and #72 applied: 149: 0 0 0 0 chv-gpio 34 TS3A227E 168: 3984415 3924842 3924250 3983649 chv-gpio 53 182: 0 0 0 0 chv-gpio 67 i8042 219: 4475311 4535099 4535657 4476168 chv-gpio 6 224: 4440692 4500476 4501026 4441557 chv-gpio 11 225: 0 0 1 1 chv-gpio 12 226: 2 2 3 3 chv-gpio 13 228: 0 0 0 0 chv-gpio 15 ACPI:Event 231: 0 0 0 0 chv-gpio 18 ELAN0000:00 232: 0 0 0 0 chv-gpio 19 ELAN0001:00 390: 0 0 0 0 chv-gpio 77 max98090_interrupt 394: 0 0 0 0 chv-gpio 81 INT33BB:01 cd So we now have five unhandled interrupts, three of which are stuck. Again, I need --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -30,7 +30,8 @@ void handle_bad_irq(struct irq_desc *desc) { unsigned int irq = irq_desc_get_irq(desc); - print_irq_desc(irq, desc); + if (printk_ratelimit()) + print_irq_desc(irq, desc); kstat_incr_irqs_this_cpu(desc); ack_bad_irq(irq); } to reduce the flood of messages as seen in #1. Interestingly, I don't see all extra interrupts if I also apply "pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems". In that case, the output of /proc/interrupts is: 149: 0 0 0 0 chv-gpio 34 TS3A227E 182: 0 0 0 0 chv-gpio 67 i8042 225: 22 25 26 24 chv-gpio 12 228: 0 0 0 0 chv-gpio 15 ACPI:Event 231: 22 25 26 24 chv-gpio 18 ELAN0000:00 232: 0 0 0 0 chv-gpio 19 ELAN0001:00 390: 0 0 0 0 chv-gpio 77 max98090_interrupt 394: 0 0 0 0 chv-gpio 81 INT33BB:01 cd Above comment should have been "With both patches from #70 and #71 applied". Sorry for the confusion. I just noticed that there is something new: [ 102.830210] handle_bad_irq: 12 callbacks suppressed [ 102.830225] irq 225, desc: ffff880159bab5c8, depth: 1, count: 0, unhandled: 0 [ 102.830261] ->handle_irq(): ffffffff9df093ae, [ 102.830279] handle_bad_irq+0x0/0x2f9 [ 102.830302] ->irq_data.chip(): ffffffff9eec6d00, [ 102.830315] chv_gpio_irqchip+0x0/0x110 [ 102.830339] ->action(): (null) [ 102.830354] IRQ_NOPROBE set [ 102.830370] unexpected IRQ trap at vector e1 This is with "pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems" applied in addition to the other two patches. v4 works (with intsel vs. i fix) with older code, v5 needs more to do since another translation gone there are gaps now and one of way to solve is to reintroduce it via struct saved_pin_context (add a pin number to it and one more condition to the handler). Another way is convert it to a radix tree. Unfortunately I'll not able to look at this till week after next. It might be Mika will have time to see what we can do here. Out of curiosity, has this been fixed upstream? I have the two patches from November 28 in my tree, wondering if I can replace them with something newer, ideally a cherry-pick from upstream so that this fix goes away when I update the kernel. Thanks! Yes. d2b3c353595a pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems 11bca0a83f83 genirq: Guard handle_bad_irq log messages One issue still left (not show stopper though) with wrong GPIO number in the /proc/interrupts when storm occurs. The keyboard on my Acer Chromebook cb3-431 still does not work with kernel 4.14.11. Has this been fixed upstream and will it be in a later release? Otherwise should I compile the latest kernel and submit a bug report? Thank you The second patch 11bca0a83f83 ("genirq: Guard handle_bad_irq log message") is not in stable as of 4.14.11, the first one is there - committed by GKH on 12/29. Are both required? The second patch is just a backup and should not be required. I'll have to re-test to check if I can reproduce the keyboard issue with chromeos-4.14. Maybe there is yet another problem. Since I don't know what I am doing, I tested with all three patches and the keyboard works fine. d2b3c353595a pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems 11bca0a83f83 genirq: Guard handle_bad_irq log messages and [PATCH v5] pinctrl: cherryview: Adjust interrupt line mapping (from the attachment). I'll try if the first one alone works fine. |