Bug 194945

Summary: pinctrl cherryview changes broke keyboard on Intel_Strago platforms (ACER Chromebook CYAN)
Product: Drivers Reporter: Adam S Levy (theadamlevy)
Component: Input DevicesAssignee: Mika Westerberg (mika.westerberg)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: andy.shevchenko, barnacs, bbaude, chrisjohgorman, dmitry.torokhov, grygorii.strashko, jpujades, linux, lvuksta, mail, mika.westerberg, mildred-bug.kernel, t.powa, theadamlevy, william
Priority: P1 Keywords: trivial
Hardware: Intel   
OS: Linux   
URL: https://bbs.archlinux.org/viewtopic.php?id=222481
Kernel Version: 4.9 and onward Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 199463    
Attachments: dmidecode
acpidump
dmesg from kernel 4.8.14 (keyboard working)
dmesg from kernel 4.10.4 (keyboard not working)
cat /proc/interrupts on kernel 4.8.14
cat /proc/interrupts on kernel 4.10.4 (keyboard not working)
dmesg from kernel 4.10.4 (keyboard not working), w/ kernel cmdline args
dmesg from kernel 4.8.14 (keyboard working) w/ kernel cmdline args
Patch to quirk pinctrl-cherryview on Acer Chromebook
dmesg from patched 4.11.0-rc3 kernel (keyboard working)
dmidecode output for the C301SA
dmidecode output for Samsung Chromebook 3 (CELES)
Add DMI_PRODUCT_FAMILY match support
Update the quirk to include Intel_Strago based machines
dmidecode for lenovo n22
proc/interupts for lenovo n22 4.11
Update the quirk to include Intel_Strago based machines (v2)
dmidecode from chromebook Cyan
patch for chromebook cyan kernel v4.11.5
test cases for 9cb6bbf3d32269d1e1193b10cb0658f7defed127
test for broken keyboard with dyn gpio irqs
fix using devm_irq_alloc_descs()
attachment-29207-0.html
Logs from 4.15 on Asus C300s - broken keyboard
dmidecode, Asus C300S
dmidecode, Asus C300S, UEFI firmware
/proc/interrupts, Asus C300S, 4.15.15
kernel config, Arch Linux 4.15.15 default
attachment-1920-0.html

Description Adam S Levy 2017-03-20 06:40:42 UTC
Problem:
The built-in keyboard does not work on kernels 4.9.0 and onward. Keystrokes appear to only be registered when the trackpad is being touched.

Research:
I performed a bisection and identified commit 

47c950d1020226179d278297c85ba6a988ee398b

https://github.com/torvalds/linux/commit/47c950d1020226179d278297c85ba6a988ee398b

I made an Arch Linux forum post here some weeks ago when I first discovered the issue on my own computer.

https://bbs.archlinux.org/viewtopic.php?id=222481

Others have commented with the same problem. You can find lspci and dmesg output.

It appears that GalliumOS, a linux distro for Chromebooks has a small one line patch that one user has reported fixes the issue. 

https://github.com/GalliumOS/linux/blob/v4.9.4/galliumos/diffs/increase-cherryview-num-irqs.diff

I haven't yet had a change to apply the patch and recompile the latest kernel to test this myself but I should have time this coming week.
Comment 1 Andy Shevchenko 2017-03-20 21:50:39 UTC
Mika, I think this is candidate to be configured through DMI strings or alike since (it looks like) we have same _HID for those.

@Adam, can you attach the output of dmidecode tool?
Comment 2 Andy Shevchenko 2017-03-20 21:58:43 UTC
This is also interesting...

[    2.556347] genirq: Flags mismatch irq 182. 00002002 (ELAN0000:00) vs. 00000080 (i8042)
[    2.556416] elan_i2c i2c-ELAN0000:00: cannot register irq=182
[    2.569716] elan_i2c: probe of i2c-ELAN0000:00 failed with error -1
Comment 3 Andy Shevchenko 2017-03-20 22:00:28 UTC
@Adam, can you also collect and attach information about ACPI tables?
You need to run
% acpidump -o tables.dat
and attach the file.
Comment 4 Adam S Levy 2017-03-20 23:44:09 UTC
Created attachment 255375 [details]
dmidecode

Acer Chromebook 14 CB3-431-C5FM
Linux Kernel 4.8.14-1
Problem does not exist on this version of the kernel
Comment 5 Adam S Levy 2017-03-20 23:44:48 UTC
Created attachment 255377 [details]
acpidump

Acer Chromebook 14 CB3-431-C5FM
Linux Kernel 4.8.14-1
Problem does not exist on this version of the kernel
Comment 6 Adam S Levy 2017-03-20 23:45:10 UTC
@Andy, 
I've attached output from '# dmidecode' and '# acpidump -o tables.dat' as requested. 

Note that this output was generated on my Acer Chromebook 14, CB3-431-C5FM, running Arch Linux's distribution of Linux Kernel 4.8.14-1, on which the problem does not present.

Let me know if you need this output from a version of the kernel with the problem and I can reboot into a 4.9.x kernel.

Thank you
Comment 7 Mika Westerberg 2017-03-21 09:59:29 UTC
Can you also attach dmesg and /proc/interrupts with and without the commit reverted?
Comment 8 Adam S Levy 2017-03-21 17:30:59 UTC
Created attachment 255407 [details]
dmesg from kernel 4.8.14 (keyboard working)
Comment 9 Adam S Levy 2017-03-21 17:31:40 UTC
Created attachment 255409 [details]
dmesg from kernel 4.10.4 (keyboard not working)
Comment 10 Adam S Levy 2017-03-21 17:32:50 UTC
Created attachment 255411 [details]
cat /proc/interrupts on kernel 4.8.14
Comment 11 Adam S Levy 2017-03-21 17:33:46 UTC
Created attachment 255413 [details]
cat /proc/interrupts on kernel 4.10.4 (keyboard not working)
Comment 12 Adam S Levy 2017-03-21 17:37:39 UTC
@Mika

Note that for the attached files I used kernel binaries as distributed by Arch Linux from before and after the offending commit was introduced. Let me know if you need me to instead revert the commit on the latest kernel source tree and recompile. I hope this is sufficient though because compiling the kernel takes a while on this chromebook, and that's currently all I have to work on for this week.
Comment 13 Adam S Levy 2017-03-21 17:48:36 UTC
I also just confirmed that keystrokes are registered only while interacting with the touchpad.
Comment 14 Andy Shevchenko 2017-03-21 20:51:39 UTC
Adam, sorry for asking again, but can you, please, attach as well dmesg for non-working case with "ignore_loglevel initcall_debug pinctrl_cherryview.dyndbg" added to kernel command line?
Comment 15 Adam S Levy 2017-03-21 22:30:35 UTC
Created attachment 255415 [details]
dmesg from kernel 4.10.4 (keyboard not working), w/ kernel cmdline args

dmesg on non working kernel with added kernel command line opts:
ignore_loglevel initcall_debug pinctrl_cherryview.dyndbg
Comment 16 Adam S Levy 2017-03-21 22:32:15 UTC
Created attachment 255417 [details]
dmesg from kernel 4.8.14 (keyboard working) w/ kernel cmdline args

dmesg on working kernel with added kernel command line options:
ignore_loglevel initcall_debug pinctrl_cherryview.dyndbg
Comment 17 Adam S Levy 2017-03-21 22:33:12 UTC
Hopefully this covers what you need. Let me know what else I can do to help.

Thank you
Comment 18 Mika Westerberg 2017-03-22 11:31:14 UTC
Yes, it does thanks.

So the working one has this in /proc/interrupts:

182:      79200         61         59         56  chv-gpio   11  i8042
183:     353803        141        164        299  chv-gpio   12  ELAN0000:00

Whereas non-working one has this instead:

182:        132        135        128        133  chv-gpio   12  i8042

There i8042 is the keyboard driver. Now it seems that this fails to work because coreboot hardcodes Linux IRQ numbers in their ACPI tables.

https://github.com/coreboot/coreboot/blob/master/src/mainboard/google/cyan/onboard.h

Shows this:

/* ToDO: change kbd irq to gpio bank index */
#define BOARD_I8042_IRQ			182

Now this extremely fragile and I'm surprised it even works at all. A little change in the Linux IRQ numbering (which are all virtual) will break this.

To be honest, I'm not sure how to fix this properly. Maybe we can quirk this particular machine to expose all GPIOS as IRQ capable but then any numberin change in IRQ subsystem will still this machine.
Comment 19 Mika Westerberg 2017-03-22 14:43:46 UTC
Created attachment 255427 [details]
Patch to quirk pinctrl-cherryview on Acer Chromebook

Can you try if the attached patch fixes the issue for you?
Comment 20 Dmitry Torokhov 2017-03-22 17:41:58 UTC
In the meantime I filed https://bugs.chromium.org/p/chromium/issues/detail?id=704198 to see if firmware can be fixed.
Comment 21 Adam S Levy 2017-03-25 17:47:30 UTC
@Mika,

Your patch appears to work. I applied it to the latest kernel. I've attached dmesg output and also used the above mentioned debug kernel command line parameters.

Please let me know what else I can do to help support the kernel on this platform. Feel free to reach out if you want something tested on this platform.
Comment 22 Adam S Levy 2017-03-25 17:48:44 UTC
Created attachment 255537 [details]
dmesg from patched 4.11.0-rc3 kernel (keyboard working)
Comment 23 Adam S Levy 2017-03-26 00:22:29 UTC
@Mika

GalliumOS patches this issue by increasing the nirqs from 8 to 9 in the struct chv_community north_community.

I linked to their patch in my first post, so I assume you saw it. Would you be willing to explain why you did not find that solution acceptable and wrote a different patch? 

I ask purely for my own learning. Apologies if this is not the correct forum for that.
Comment 24 Andy Shevchenko 2017-03-26 14:58:44 UTC
I'm not Mika, though I can answer to your question. The datasheet of that IP (unfortunately I can't find a chapter about it in public documentation for Atom Z8000 series) restricts use of pins like it's described in the driver.
Comment 25 Mika Westerberg 2017-03-27 09:29:38 UTC
Yes, some of the GPIO pins are only available as GPE sources (General Purpose ACPI Event) which means that those cannot ever generate interrupts. The nr_irqs holds the number interrupts the community has.

There are couple of problems with that patch:
1. It does this unconditionally for all machines meaning that it might break machines that actually need to use GPEs.
2. If there are another hard-coded IRQ number in the coreboot, it will not work if it is part of another GPIO community as the numbering is still different what the coreboot expects.

The solution that I attached is not perfect either but that's the best I could come up with. I'll send it upstream soon. Thanks for testing.
Comment 26 Luke 2017-04-17 00:08:36 UTC
This happens on the Asus C301SA as well. The same method fixes it; I've tested this by changing the function call that sets the bool to 

> bool need_valid_mask = false;

Unfortunately I am not familiar with dmi_system_id structure syntax, so I cannot fix this completely myself. I will attach my dmidecode output.
Comment 27 Luke 2017-04-17 00:15:52 UTC
Created attachment 255899 [details]
dmidecode output for the C301SA
Comment 28 Adam S Levy 2017-04-17 19:08:28 UTC
The proposed and accepted patch (https://www.spinics.net/lists/linux-gpio/msg21533.html) has been modified to also apply to the Asus C301SA here: https://aur.archlinux.org/cgit/aur.git/commit/?h=linux-galliumos-braswell&id=2f9c7e92761c3205938c8d69a2c327d16b8cd495

Comments can be found here:
https://aur.archlinux.org/pkgbase/linux-galliumos-braswell/

Waiting to hear back from user regarding whether this new patch still works.
Comment 29 Dmitry Torokhov 2017-04-17 21:20:17 UTC
I wonder if we should not add support for System Information/Family identifier to the DMI code and match for Intel_Strago as I think all Strago boards will exhibit this issue.
Comment 30 Luke 2017-04-18 04:06:19 UTC
The patch for the Asus on the Arch User Repository worked.
Comment 31 Barna Csorogi 2017-04-19 15:30:11 UTC
Created attachment 255923 [details]
dmidecode output for Samsung Chromebook 3 (CELES)

A user on AUR reported that Samsung Chromebook 3 (CELES) is also affected.

Attached dmidecode output. This one has unspecified System Information/Family so I'm not sure how to go about a universal fix.

Altering the patch to also match for this specific system has been confirmed to work:
https://aur.archlinux.org/cgit/aur.git/commit/?h=linux-galliumos-braswell&id=e19e452336b7900ddb0c3fb1fcdc9c79b57fee23
Comment 32 Mika Westerberg 2017-04-20 07:01:10 UTC
Dmitry, unfortunately the Samsung Chromebook 3 (CELES) does not have that Family information.

I don't see much other option than add all these machines to the quirk table of the driver :-(
Comment 33 Dmitry Torokhov 2017-04-20 18:58:35 UTC
OK, I think you will need add quite a few more entries for various "Strago"s then, we have quite a few of them. Ultimately we have Banon, Celes, Cyan, Edgar, Kefka, Locke, Reks, Relm, Sabin, Setzer, Terra, Tifa, Ultima, and Wizpig shat are based on Strago reference design.

OTOH I think only Celes and Cyan (Acer R11 - what you called Cyan is Acer Chromebook 14 - Edgar) are missing family information, I think the rest of them have it set as "Intel_Strago".
Comment 34 Mika Westerberg 2017-04-21 09:33:09 UTC
In that case then it makes sense to check for the family information string. I'll look how we can add the support for to the DMI matching functionality and then add the family string to the quirk in the driver.
Comment 35 Mika Westerberg 2017-04-24 12:52:44 UTC
Created attachment 255981 [details]
Add DMI_PRODUCT_FAMILY match support
Comment 36 Mika Westerberg 2017-04-24 12:53:36 UTC
Created attachment 255983 [details]
Update the quirk to include Intel_Strago based machines
Comment 37 Mika Westerberg 2017-04-24 12:55:39 UTC
Please test the attached two patches. They now extend the matching to Intel_Strago family of machines.

Note that I'm still adding BIOS date there. Maybe we should instead use some other version string instead.
Comment 38 Brent Baude 2017-04-25 20:44:48 UTC
Mika, I added those two patches to a 4.11.0-0.rc8 and I still observe the same behaviour as described (you must engage the trackpad to be able to register anything with the keyboard).  I'm on a Lenovo Chromebook N22-20.  

I'll attach the dmidecode first and if you want anything else, I'll follow up.  Happy to try patches here.
Comment 39 Brent Baude 2017-04-25 20:46:13 UTC
Created attachment 256027 [details]
dmidecode for lenovo n22
Comment 40 Brent Baude 2017-04-25 20:47:17 UTC
Created attachment 256029 [details]
proc/interupts for lenovo n22 4.11
Comment 41 Andy Shevchenko 2017-04-25 21:27:35 UTC
Brent, because in the patches we have no entry for Intel_Strago (Reks) platform. You may try to add yourself and try again:
	{
		.ident = "Intel_Strago (Reks)",
		.matches = {
			DMI_MATCH(DMI_SYS_VENDOR, "GOOGLE"),
			DMI_MATCH(DMI_PRODUCT_FAMILY, "Intel_Strago"),
			DMI_MATCH(DMI_BIOS_DATE, "06/24/2016"),
		},
	},
Comment 42 Dmitry Torokhov 2017-04-25 22:20:36 UTC
I'm afraid we need to drop the date matching that I proposed - all of them have different dates. If/when we get BIOS fix we will augment the quirk like:

bool need_valid_mask = !dmi_check_system(chv_no_valid_mask) ||
                       (<compare dmi_get_system_info(DMI_BIOS_DATE) with some
                         date, hopefully in 2017 ;) >);
Comment 43 Mika Westerberg 2017-04-26 09:23:55 UTC
Created attachment 256053 [details]
Update the quirk to include Intel_Strago based machines (v2)
Comment 44 Mika Westerberg 2017-04-26 09:25:43 UTC
Now the quirk should cover all Intel_Strago systems. Can you check that it works for you? You need both patches:

Add DMI_PRODUCT_FAMILY match support
Update the quirk to include Intel_Strago based machines (v2)
Comment 45 Brent Baude 2017-04-26 13:45:02 UTC
@Andy, with the Reks addition, I can confirm this fixes the problem on the Lenevo N22.
Comment 46 Mildred 2017-04-30 10:05:32 UTC
Created attachment 256143 [details]
dmidecode from chromebook Cyan

It seems that there is a mismatch between the Edgar and Cyan models. I have a Cyan model and the DMI product name (from dmidecode) show "Cyan" and not "Edgar". Also, it doesn't have a product family (not specified)

Attaching the dmidecode output of the cyan model.
Comment 47 Andy Shevchenko 2017-04-30 12:40:27 UTC
(In reply to Mildred from comment #46)
> It seems that there is a mismatch between the Edgar and Cyan models. I have
> a Cyan model and the DMI product name (from dmidecode) show "Cyan" and not
> "Edgar". Also, it doesn't have a product family (not specified)

Yeah...
So, take latest Mika's patches (see attachments to this bug) and add the following:

{
	.ident = "Intel_Strago (Cyan)",
	.matches = {
		DMI_MATCH(DMI_SYS_VENDOR, "GOOGLE"),
		DMI_MATCH(DMI_PRODUCT_NAME, "Cyan"),
	},
},

Confirm if it helps.
Comment 48 Mildred 2017-06-21 10:41:28 UTC
Created attachment 257117 [details]
patch for chromebook cyan kernel v4.11.5

Sorry, I did not test it before today. With the changes made, it works very nicely. No more issue with the inputs. Testes on fedora 26 with f26 kernel config and vanilla kernel v4.11.5. Attached the exact patch
Comment 49 A 2017-07-03 22:16:18 UTC
Observed another affected Intel_Strago device in the wild, HP Chromebook 11 G5 (Setzer). It does not match Mika's Strago family predicate because the vendor field is "HP" and not "GOOGLE".


# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
9 structures occupying 540 bytes.
Table at 0x7CC1B020.
...
Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: HP
	Product Name: Setzer
	Version: 1.0
	Serial Number: 123456789
	UUID: Not Settable
	Wake-up Type: Reserved
	SKU Number: Not Specified
	Family: Intel_Strago
...


Adding this appears to work:

{
	.ident = "Intel_Strago based Chromebooks (HP)",
	.matches = {
		DMI_MATCH(DMI_SYS_VENDOR, "HP"),
		DMI_MATCH(DMI_PRODUCT_FAMILY, "Intel_Strago"),
	},
},
Comment 50 Chris Gorman 2017-09-08 23:06:23 UTC
*** Bug 196837 has been marked as a duplicate of this bug. ***
Comment 51 Chris Gorman 2017-09-26 22:17:06 UTC
(In reply to Chris Gorman from comment #50)
> *** Bug 196837 has been marked as a duplicate of this bug. ***

Between 4.13.0-rc7 and 4.13.0 this bug has resurfaced.  For some reason the quirk does not fix the keyboard any more.  I did a git bisect for bug 196837, which is a duplicate of this one, which found

dc749a09ea5e413564115dee742c3bc958248707

and backed me off to 4.13.0-rc5. 

The galliumos patch to drivers/pinctrl/intel/pinctrl-cherryview.c that increased the number of irqs in the north_community fails now in 4.14-rc2.
Comment 52 Mika Westerberg 2017-09-27 08:27:41 UTC
You are saying commit dc749a09ea5e413564115dee742c3bc958248707 broke the keyboard again in the mainline kernel? The hack done by galliumos is not something we care about, though.
Comment 53 Chris Gorman 2017-09-27 13:01:19 UTC
(In reply to Mika Westerberg from comment #52)
> You are saying commit dc749a09ea5e413564115dee742c3bc958248707 broke the
> keyboard again in the mainline kernel? The hack done by galliumos is not
> something we care about, though.

In a word, yes I think so.  This is my first crack at using git bisect so I may have made a mistake.  In case I made an error, here's the git bisect log.

git bisect start
# good: [cc4a41fe5541a73019a864883297bd5043aa6d98] Linux 4.13-rc7
git bisect good cc4a41fe5541a73019a864883297bd5043aa6d98
# good: [cc4a41fe5541a73019a864883297bd5043aa6d98] Linux 4.13-rc7
git bisect good cc4a41fe5541a73019a864883297bd5043aa6d98
# bad: [e7d0c41ecc2e372a81741a30894f556afec24315] Merge tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad e7d0c41ecc2e372a81741a30894f556afec24315
# good: [6c51e67b64d169419fb13318035bb442f9176612] Merge branch 'x86-syscall-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 6c51e67b64d169419fb13318035bb442f9176612
# good: [bf1d6b2c76eda86159519bf5c427b1fa8f51f733] Merge tag 'staging-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good bf1d6b2c76eda86159519bf5c427b1fa8f51f733
# good: [fe9e31383e9a271a9b404488704e00acd1747ee3] Merge tag 'regulator-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect good fe9e31383e9a271a9b404488704e00acd1747ee3
# bad: [70b8e9eb3b50d8bded63f808b09c4844ef63c3b8] Merge tag 'gpio-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
git bisect bad 70b8e9eb3b50d8bded63f808b09c4844ef63c3b8
# good: [1865af212dfa0819ca21c7e5c18c2a75202c1827] pinctrl: aspeed: Fix ast2500 strap register write logic
git bisect good 1865af212dfa0819ca21c7e5c18c2a75202c1827
# bad: [1253baaafff03c695e49d6c63956f4ccf12dc0c8] gpio: twl6040: remove unneeded forward declaration
git bisect bad 1253baaafff03c695e49d6c63956f4ccf12dc0c8
# bad: [ca1f3ae3154ad6d08caa740c99be0d86644a4e44] gpio: msic: fix error return code in platform_msic_gpio_probe()
git bisect bad ca1f3ae3154ad6d08caa740c99be0d86644a4e44
# bad: [9c07409c34ddf1804721cee21ca835733e165bde] gpio: tegra: Use platform_get_irq()
git bisect bad 9c07409c34ddf1804721cee21ca835733e165bde
# bad: [dbd1dad2ab8ffca57e0aa386df0d7ec621c26ca8] gpio: rcar: add gen[123] fallback compatibility strings
git bisect bad dbd1dad2ab8ffca57e0aa386df0d7ec621c26ca8
# bad: [dc749a09ea5e413564115dee742c3bc958248707] gpiolib: allow gpio irqchip to map irqs dynamically
git bisect bad dc749a09ea5e413564115dee742c3bc958248707
Comment 54 Mika Westerberg 2017-09-27 13:07:52 UTC
If you revert dc749a09ea5e413564115dee742c3bc958248707 does the keyboard start working again?
Comment 55 Chris Gorman 2017-09-27 13:31:37 UTC
(In reply to Mika Westerberg from comment #54)
> If you revert dc749a09ea5e413564115dee742c3bc958248707 does the keyboard
> start working again?

If I revert that patch my kernel build fails with a kernel newer than 4.13.0-rc5.  I will try a build of 4.13.0-rc5 with and without that patch to see if it's the cause.  I just assumed that running git bisect necessarily found the error.
Comment 56 Mika Westerberg 2017-09-27 13:59:13 UTC
I just tried to revert it on top of v4.14-rc2 and my kernel builds just fine.
Comment 57 Chris Gorman 2017-09-27 14:14:08 UTC
(In reply to Mika Westerberg from comment #56)
> I just tried to revert it on top of v4.14-rc2 and my kernel builds just fine.

I will try 4.14-rc2 then, both with and without to see if I can get my keyboard back.
Comment 58 Chris Gorman 2017-09-27 15:28:00 UTC
(In reply to Chris Gorman from comment #57)
> (In reply to Mika Westerberg from comment #56)
> > I just tried to revert it on top of v4.14-rc2 and my kernel builds just
> fine.
> 
> I will try 4.14-rc2 then, both with and without to see if I can get my
> keyboard back.

4.14-rc2 keyboard works with a revert of dc749a09ea5e413564115dee742c3bc958248707

Thanks
Comment 59 Mika Westerberg 2017-09-28 08:35:10 UTC
I just sent an email in the original thread where the patch was discussed. Let's see if we can figure out how to fix this.
Comment 60 Grygorii Strashko 2017-09-28 15:06:36 UTC
Hi

Do you have CONFIG_SPARSE_IRQ enabled? 

is it possible to enable CONFIG_IRQ_DOMAIN_DEBUG and
dump:
- cat /proc/interrupts
- cat /sys/kernel/debug/irq_domain_mapping
for working and non-working cases?
Comment 61 Chris Gorman 2017-09-28 22:02:47 UTC
(In reply to Grygorii Strashko from comment #60)
> Hi
> 
> Do you have CONFIG_SPARSE_IRQ enabled? 
> 
> is it possible to enable CONFIG_IRQ_DOMAIN_DEBUG and
> dump:
> - cat /proc/interrupts
> - cat /sys/kernel/debug/irq_domain_mapping
> for working and non-working cases?
Hi Grygorii,

Yes, I have CONFIG_SPARCE_IRQ enabled.  It is selected by CONFIG_X86=y which I have selected as I have an x86 machine. 

I am going to try to attach my test cases as the output is long.  They are with and without dc749a09ea5e413564115dee742c3bc958248707 and are a cat of /proc/interrupts and /sys/kernel/debug/irq_domain_mapping as requested.  Hope this helps.  Let me know if you need anything further.
Comment 62 Chris Gorman 2017-09-28 22:06:32 UTC
Created attachment 258629 [details]
test cases for 9cb6bbf3d32269d1e1193b10cb0658f7defed127

/proc/interrupts and  /sys/kernel/debug/irq_domain_mapping with and without  9cb6bbf3d32269d1e1193b10cb0658f7defed127 applied.
Comment 63 Andy Shevchenko 2017-09-29 12:06:14 UTC
Ideally it should be fixed in the BIOS... in the first place.

The BIOS expects *exact* GPIO number on OS side, i.e. 182.
The mentioned patch naturally changes the numbering scheme for OS side IRQ lines.
Comment 64 Grygorii Strashko 2017-09-29 17:45:23 UTC
Created attachment 258639 [details]
test for broken keyboard with dyn gpio irqs

with CONFIG_SPARCE_IRQ enabled there are no guarantee of constant Linux IRQ numbering in general. It depends on Kernel build and boot sequence.
But here is the mix of dynamic and static Linux IRQ mappings.
Note. From logs there are 193 chv-gpio IRQ mapped and only 5 are in use - sizeof(struct irq_desc) == 256b which means 48K wasted memory.

I think, It can be fixed by passing base Linux IRQ numbers for each gpiochip (from logs I see 4 chv-gpio instances) in
 gpiochip_irqchip_add(chip, &chv_gpio_irqchip, base_irqs,
				   handle_bad_irq, IRQ_TYPE_NONE);

or by using irq_domain_add_legacy(). In the last case driver should allocate IRQ descriptors by itself and GPIO IRQ infrastructure can't be used (at least as is, need to check more).

Chris,
I think below ugly test diff should make it work using approach 1 above, but Linux IRQ numbers should not be defined in code (have no idea how to code them for x86 - in ARM we have DT) and such HACK should be allied for this HW. Sry, I'm not familiar with x86, so can't provide precise fix :(
Comment 65 Grygorii Strashko 2017-09-29 18:19:06 UTC
Created attachment 258641 [details]
fix using devm_irq_alloc_descs()

this might be a way to fix it.
Not tested in any way.
Comment 66 Andy Shevchenko 2017-09-29 19:00:40 UTC
Comment on attachment 258641 [details]
fix using devm_irq_alloc_descs()

When adding a diff to Bugzilla, be sure it has correct mime type.
Comment 67 Chris Gorman 2017-09-29 21:04:19 UTC
 > Chris,
> I think below ugly test diff should make it work using approach 1 above, but
> Linux IRQ numbers should not be defined in code (have no idea how to code
> them for x86 - in ARM we have DT) and such HACK should be allied for this
> HW. Sry, I'm not familiar with x86, so can't provide precise fix :(

Hi Grygorii,

The attachment 258639 [details] works fixes it.  I will try the other one as well.  Thanks for your help.
Comment 68 Chris Gorman 2017-09-29 22:24:51 UTC
The attachment 258641 [details] works as well.  Thanks again for all your help.
Comment 69 Adam S Levy 2017-10-03 17:07:38 UTC
Hi all,

I am commenting to note that my Acer CB3-431 is again affected by this bug as well. I have yet to try the patch but it seems likely to work. Let me know if there is anything I can do that would help promote this patch getting into the mainline kernel. 4.13 is effectively broken for my platform until this gets resolved.

Thank you
Comment 70 Luke 2017-11-13 03:00:29 UTC
My Asus is again having these issues upon updating to kernel 4.13. I don't recall the exact error, but it's about an unexpected irq trap. What information will be of use?
Comment 72 Chris Gorman 2017-11-13 20:31:06 UTC
Grygorii's patch applies and works with 4.13.  (I haven't tested 4.13.1 onwards, but suspect it will work with them too.)  The patch was put into v4.14-rc5, so will be available with v4.14-rc6 onwards.
Comment 73 Guenter Roeck 2017-11-15 00:41:11 UTC
I get a large number of bad interrupts when running an image based on 4.14 on cyan (more specifically it is a prototype of chromeos-4.14, ie 4.14 with Chrome OS patches on top of it).

[   60.089234] ->handle_irq():  ffffffffbe2f803f, 
[   60.090455] 0xffffffffbf2af380
[   60.090510] handle_bad_irq+0x0/0x2e5
[   60.090522] ->irq_data.chip(): ffffffffbf2af380, 
[   60.090553]    IRQ_NOPROBE set
[   60.090584] ->handle_irq():  ffffffffbe2f803f, 
[   60.090590] handle_bad_irq+0x0/0x2e5
[   60.090596] ->irq_data.chip(): ffffffffbf2af380, 
[   60.090602] 0xffffffffbf2af380
[   60.090608] ->action():           (null)
[   60.090779] handle_bad_irq+0x0/0x2e5

and so on. The output is that messy because there are just too many bad interrupts.

/proc/interrupts shows:

115:      78302      77589      77478      76348  chv-gpio    0
 135:          0          1          0          0  chv-gpio   20  TS3A227E
 171:     196025     196802     196579     197818  chv-gpio    0
 180:          0          0          0          0  chv-gpio    9  ACPI:Event
 182:          3          1          1          5  chv-gpio   11  i8042
 183:          0          0          0          0  chv-gpio   12  ELAN0000:00
 184:          0          0          0          0  chv-gpio   13  ELAN0001:00
 300:          0          0          0          0  chv-gpio   46  max98090_interrupt
 304:          0          0          0          0  chv-gpio   50  INT33BB:01 cd

Notice interrupts 115 and 171; those are the "bad" ones.

Reverting the patch from #71 fixes that problem, but then the keypad doesn't work. Any idea what might be wrong ?
Comment 74 Chris Gorman 2017-11-20 19:49:55 UTC
Sorry for the delay in replying Guenter.  In short, I don't know what's wrong with the interrupts.  I suspect that the complaint about hardcoded IRQ numbers in coreboot may be the reason for some of this.  I get the same problem as you with the handle_irq() on my Banon machine when I plug in or remove the headphones.  (It will work without throwing all the bad interrupts if I start the machine with the headphones plugged in.)  Without Grygorii's patch my keyboard fails to function too.
Comment 75 Grygorii Strashko 2017-11-20 20:08:18 UTC
Sry, I can help with above issues - it's some hw/fw issue.
In  Guenter Roeck report - there is interrupt storm from pins 0 of two chv instances and you can see it because with my patch all IRQs mapped and have handlers assigned, as result 
- chv_gpio_irq_handler() checks CHV_INTSTAT
- see bit 0 set
- tries to call handler which is handle_bad_irq() for non requested pins/irqs.
- IRQ not asked properly in source as it's not requested and no proper handler installed.
- IRQ re-triggered

without my patch - irqs not mapped and most probably they are still fired, but 
handle_bad_irq() not called from chv_gpio_irq_handler(). It can be checked by adding simple printf in chv_gpio_irq_handler() before executing loop with my patch reverted (or by checking ret from irq_find_mapping()&generic_handle_irq().

Summary:
- those IRQs (pin0) for some reasons configured and unmasked in HW, but SW knows nothing about that.

note - I could be mistaken in some details as I do not know HW.
Comment 76 Guenter Roeck 2017-11-20 21:02:56 UTC
I can confirm #75 - I see that several interrupts are enabled when the pinctrl driver starts, even though they are never requested.
Can we just disable the offending interrupt(s) in that situation ? I know that is not a real fix, but dumping error messages like crazy in handle_bad_irq() isn't very helpful.
Comment 77 Mika Westerberg 2017-11-21 10:03:16 UTC
Guenther,

Can you revert these two commits:

845e405e5e6c9dc9ed10306a4b5bfeaefebc2e84 ("pinctrl: cherryview: fix issues caused by dynamic gpio irqs mapping")
dc749a09ea5e413564115dee742c3bc958248707 ("gpiolib: allow gpio irqchip to map irqs dynamically")

and see if anything changes?

I remember that previously when we called gpiochip_irqchip_add() from the driver passing handle_bad_irq, the IRQ core will effectively mask all IRQs with that handler set. To me it seems that that does not happen anymore.
Comment 78 Guenter Roeck 2017-11-21 19:10:45 UTC
#77: Reverting the two patches does not make a difference. The interrupts are still allocated and occurring.

 115:    2257869    2348083    2257871    2349328  chv-gpio    0
 135:          0          0          1          0  chv-gpio   20  TS3A227E
 171:    2941113    2849723    2940718    2849389  chv-gpio    0
 180:          0          0          0          0  chv-gpio    9  ACPI:Event
 182:         45         40         40         44  chv-gpio   11  i8042
 183:         32         60         66         51  chv-gpio   12  ELAN0000:00
 184:          0          0          0          0  chv-gpio   13  ELAN0001:00
 300:          0          0          0          0  chv-gpio   46  max98090_interrupt
 304:          0          0          0          0  chv-gpio   50  INT33BB:01 cd

handle_bad_irq() is still called repeatedly.
Comment 79 Andy Shevchenko 2017-11-21 19:25:26 UTC
(In reply to Guenter Roeck from comment #78)
> #77: Reverting the two patches does not make a difference. The interrupts
> are still allocated and occurring.
> 
>  115:    2257869    2348083    2257871    2349328  chv-gpio    0
>  135:          0          0          1          0  chv-gpio   20  TS3A227E
>  171:    2941113    2849723    2940718    2849389  chv-gpio    0

The suspicious part is 0 in the pin number there
So, for me it looks like some other issue.

If Mika agrees I would rather recommend to split this into separate bug report with the following information attached:

1. % acpidump -o tables.dat # tables.dat is point of interest
2. % lspci -vv -nk # output of the command
3. % dmidecode # output of the command
4. % grep -H 15 /sys/bus/acpi/devices/*/status # output of the command
5. % dmesg # when kernel command line has the 'ignore_loglevel
initcall_debug' added

Also it might make sense to enable GPIO and PINCTRL debug in the kernel and, perhaps, add "apic=debug" to the kernel command line as well.
Comment 80 Mika Westerberg 2017-11-21 19:40:10 UTC
Yes, I agree. This is different issue.
Comment 81 Guenter Roeck 2017-11-21 22:08:27 UTC
Submitted https://bugzilla.kernel.org/show_bug.cgi?id=197953
Comment 82 Dmitry Torokhov 2018-01-06 18:43:10 UTC
*** Bug 198375 has been marked as a duplicate of this bug. ***
Comment 83 Grygorii Strashko 2018-01-06 18:55:27 UTC
Created attachment 273445 [details]
attachment-29207-0.html

Thank you for your message. I am currently ooo, so my replies can be delayed.
I will be returning on Jan 8
Comment 84 Josep Pujadas-Jubany 2018-01-24 16:41:27 UTC
Similar problem with Acer TravelMate B117-M (SSD or EMMC disk).
Upgraded the BIOS system from 1.07 to 1.11, 1.13 and 1.15 (the latest)
Lubuntu 16.04 LTS HWE 64 bit (up-to-date)
Tested on 4-5 computers

They work fine with 4.10.0-33 or 4.10.0-42 kernels.
They show the messages at start-up when using 4.13.0-26 or 4.13.0-31 kernels.

Curiously, if I turn off the system, connect/disconnect the battery (battery reset hole at the bottom of the laptop) and turn on the system, NO errors.

At next boot, errors reappear.

The systems seem to work without problems, but I can see many interrupts for chv-gpio as explained at #c73

No more IRQ errors with kernel 4.14.15-041415-generic

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.15/
Comment 85 Josep Pujadas-Jubany 2018-01-25 16:17:09 UTC
"They show the messages at start-up when using 4.13.0-26 or 4.13.0-31 kernels." I'm sorry, the messages were:

unexpected irq trap at vector 73

Tested on HP Convertible x360 11-ab0XX with same results. No more messages about IRQ errors with kernel 4.14.15-041415-generic
Comment 86 Luke 2018-02-15 23:38:02 UTC
(In reply to Mika Westerberg from comment #71)
> This is fixed in the mainline with following commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> drivers/pinctrl/intel/pinctrl-cherryview.
> c?id=845e405e5e6c9dc9ed10306a4b5bfeaefebc2e84

(In reply to Dmitry Torokhov from comment #82)
> *** Bug 198375 has been marked as a duplicate of this bug. ***

My machine still does not work, I still cannot use a kernel above 4.12. Will someone take a look at this?
Comment 87 William Osler 2018-02-22 06:59:17 UTC
Created attachment 274339 [details]
Logs from 4.15 on Asus C300s - broken keyboard

Hi all,

Not sure if this is the same issue, but 4.13 also broke the keyboard on my Asus Chromebook C300s, which is also in the Braswell family. Unfortunately I can confirm that none of the fixes currently in 4.15.3 fix the issue, so there's something that still seems to be missing. 4.12 still works reliably for me.

I tried booting with some of the kernel parameters mentioned earlier in this thread, but unfortunately it seems with 4.15 that even USB keyboards no longer work for me, which is a problem since my disk is encrypted. I was able to do some temporary stuff with a key file embedded in my initramfs, just to get it to boot so I could capture the log output to disk. I used the kernel parameters mentioned earlier in this thread. This is my first time using Bugzilla, so please let me know if the attachment didn't make it through.

Thanks for all the work on this!
Comment 88 William Osler 2018-02-22 07:00:24 UTC
Created attachment 274341 [details]
dmidecode, Asus C300S

Here's the dmidecode output for my platform.
Comment 89 Andy Shevchenko 2018-02-22 19:06:02 UTC
(In reply to William Osler from comment #88)
> Created attachment 274341 [details]
> dmidecode, Asus C300S

@Luke, seems one more Asus C30xx platform has issues. They have a platform name in common, i.e. "Terra". Perhaps Dmitry can check if there any related difference in coreboot for that exact platform in comparison to the rest of Intel_Strago.
Comment 90 Dmitry Torokhov 2018-03-10 00:21:38 UTC
Hmm, I just got a Terra and booted vanilla 4.15 on it (in addition to the internal kernel based on 4.14.24), both keyboard and touchpad work fine and I do not see interrupt storm from the audio pin either. This is with Chrome OS userspace and kernel config though.

I do not recall seeing anything significantly different from other Stragos in coreboot either.
Comment 91 William Osler 2018-04-07 23:01:53 UTC
A few updates on this:

As of 4.15.15, keyboard is still broken on my Asus C300S. Touchpad and audio both work fine. Previously, I reported also having issues with USB keyboards, but this is resolved; USB keybaords work fine, and I'm typing this message with one right now. I've also updated my firmware from an older coreboot to the latest full UEFI firmware from MrChromebox. I'll attatch my new dmidecode output, but it doesn't appear much different at a glance.

Since it sounds like it could have something to do with kernel config, I'm also attaching my config. This is the unmodified kernel shipped by Arch Linux currently. if you have a link or can upload the "Chrome OS ... kernel config", I'd be glad to give it a try and see if that can resolve things.

Also, I'm uploading /proc/interrupts, since it sounds like some of these changes might be related to interrupt storms. I'm not quite sure how to interpret it, but I bet there's somebody here who can :)
Comment 92 William Osler 2018-04-07 23:04:01 UTC
Created attachment 275155 [details]
dmidecode, Asus C300S, UEFI firmware

dmidecode output on the new UEFI firmware for Asus C300S
Comment 93 William Osler 2018-04-07 23:04:57 UTC
Created attachment 275157 [details]
/proc/interrupts, Asus C300S, 4.15.15
Comment 94 William Osler 2018-04-07 23:06:29 UTC
Created attachment 275159 [details]
kernel config, Arch Linux 4.15.15 default

Here's the config of my currently running kernel
Comment 95 Dmitry Torokhov 2018-04-10 00:43:09 UTC
Can you try making CONFIG_PINCTRL_CHERRYVIEW=y ? I have no idea what happens when pin control is loaded late in the boot process.
Comment 96 William Osler 2018-04-10 04:41:06 UTC
Oh my goodness, now I just feel silly. That was exactly it, everything is working perfectly now, thank you!

For Arch users, you'll need to do the following to rebuild your kernel with the required option:

1) Install the `asp` package
2) `asp checkout linux`
3) `cd` into the relevant directory for your platform
4) Edit `config`, set `CONFIG_PINCTRL_CHERRYVIEW=y`
5) `makepkg --skipchecksums`
6) Install the new packages

It's probably worth blacklisting the linux package in your pacman config to be sure that you don't update to a broken version.

I'm going to open a bug on the Arch issue tracker and see if the mantainers are willing to make this a default for the config.
Comment 97 Andy Shevchenko 2019-01-04 16:44:09 UTC
Eight month no new comments and we have established fix in the few kernel releases already, thus, close it.

Anybody experiencing this issue feel free to reopen with sufficient data provided (see comments for the details).
Comment 98 Grygorii Strashko 2019-01-04 16:54:37 UTC
Created attachment 280267 [details]
attachment-1920-0.html

I am not checking e-mail from this account anymore.