Bug 43282 - APEI dmesg flood.
Summary: APEI dmesg flood.
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Huang Ying
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-23 14:21 UTC by Pawel Sikora
Modified: 2012-07-26 23:13 UTC
CC List: 5 users

See Also:
Kernel Version: 3.4.0
Tree: Mainline
Regression: Yes


Attachments
kernel log. (626.36 KB, text/plain)
2012-05-23 14:21 UTC, Pawel Sikora
Provide debug information (call chain) (711 bytes, patch)
2012-05-25 06:29 UTC, Huang Ying
acpi dump. (163.69 KB, application/octet-stream)
2012-05-25 08:19 UTC, Pawel Sikora
[BUGFIX 1/2] Turn flags into bools (2.02 KB, patch)
2012-05-30 00:31 UTC, Huang Ying
[BUGFIX 2/2] Disable GHES for too many firmware error (2.37 KB, patch)
2012-05-30 00:32 UTC, Huang Ying
Check GAR in init time instead of runtime (3.57 KB, patch)
2012-06-07 05:41 UTC, Huang Ying
Check GAR in map time (2.91 KB, patch)
2012-06-07 08:57 UTC, Huang Ying
Fixup common BIOS bug (1.14 KB, patch)
2012-06-08 09:39 UTC, Jean Delvare

Description Pawel Sikora 2012-05-23 14:21:31 UTC
Created attachment 73362 [details]
kernel log.

Hi,

On my Opterons I see an ACPI/APEI flood in the kernel log:

[Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]

Can I fix this hardware problem? Can I reduce this flood?
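
For readers decoding the message, here is a minimal Python sketch that splits the bracketed GAR fields. The field order (address, bit width, bit offset, encoded access width, space ID) is an assumption; it is consistent with the HEST disassembly in comment 19, but the thread does not state it. The names `GAR_RE` and `parse_gar` are illustrative only.

```python
import re

# Assumed layout of the bracketed part: [0x<address>/<bit width>/<bit offset>/
# <encoded access width>/<space id>]. This order is an assumption, not
# something the bug report states.
GAR_RE = re.compile(r"GAR \[0x([0-9a-fA-F]+)/(\d+)/(\d+)/(\d+)/(\d+)\]")


def parse_gar(line):
    """Return the GAR fields found in a log line, or None if there are none."""
    match = GAR_RE.search(line)
    if match is None:
        return None
    bit_width, bit_offset, access_code, space_id = (
        int(group) for group in match.groups()[1:]
    )
    return {
        "address": int(match.group(1), 16),
        "bit_width": bit_width,
        "bit_offset": bit_offset,
        "access_size_code": access_code,
        "space_id": space_id,
    }


if __name__ == "__main__":
    msg = ("[Firmware Bug]: APEI: Invalid bit width + offset in GAR "
           "[0xdfec6110/32/0/1/0]")
    print(parse_gar(msg))
```

Under that assumption, the reported register is 32 bits wide at offset 0 with an encoded access width of 1.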
Comment 1 Zhang Rui 2012-05-25 02:17:11 UTC
Ying,
could you please have a look at this problem?
Comment 2 Huang Ying 2012-05-25 06:29:38 UTC
Created attachment 73387 [details]
Provide debug information (call chain)
Comment 3 Huang Ying 2012-05-25 06:52:07 UTC
Hi, Pawel,

Can you try the debug patch attached with the following title?

Provide debug information (call chain)


And can you provide the result of the following command:

acpidump > acpi.dump
Comment 4 Pawel Sikora 2012-05-25 08:18:10 UTC
[   50.971079] ------------[ cut here ]------------
[   50.971091] WARNING: at drivers/acpi/apei/apei-base.c:593 apei_check_gar+0xed/0x100()
[   50.971094] Hardware name: H8DGU
[   50.971097] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   50.971100] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid0 dm_mod uvesafb autofs4 dummy ide_cd_mod cdrom ata_generic pata_acpi sp5100_tco pata_atiixp ide_pci_generic psmouse serio_raw i2c_piix4 pcspkr k10temp amd64_edac_mod edac_core powernow_k8 igb freq_table evdev mperf microcode edac_mce_amd hwmon i2c_core atiixp ide_core dca processor button ext3 jbd mbcache sd_mod crc_t10dif raid1 md_mod ahci libahci libata scsi_mod usbhid hid ohci_hcd ssb mmc_core pcmcia pcmcia_core ehci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
[   50.971186] Pid: 3529, comm: sfx Not tainted 3.4.0-dirty #4
[   50.971188] Call Trace:
[   50.971191]  <NMI>  [<ffffffff8104d7da>] warn_slowpath_common+0x7a/0xb0
[   50.971203]  [<ffffffff8104d8b1>] warn_slowpath_fmt+0x41/0x50
[   50.971208]  [<ffffffff812e40ad>] apei_check_gar+0xed/0x100
[   50.971212]  [<ffffffff812e420a>] apei_read+0x2a/0xb0
[   50.971216]  [<ffffffff812e763e>] ghes_read_estatus+0x2e/0x180
[   50.971221]  [<ffffffff814c4994>] ? _raw_spin_lock+0x34/0x40
[   50.971224]  [<ffffffff812e7d6f>] ghes_notify_nmi+0x9f/0x230
[   50.971229]  [<ffffffff814c6459>] nmi_handle.isra.1+0x79/0xc0
[   50.971233]  [<ffffffff814c63e0>] ? __die+0xf0/0xf0
[   50.971236]  [<ffffffff814c65a8>] do_nmi+0x108/0x350
[   50.971240]  [<ffffffff814c5a9c>] end_repeat_nmi+0x1a/0x1e
[   50.971243]  <<EOE>>
[   50.971245] ---[ end trace 4853387a071be848 ]---
Comment 5 Pawel Sikora 2012-05-25 08:19:41 UTC
Created attachment 73389 [details]
acpi dump.

# acpidump >acpi.dump
Wrong checksum for generic table!

pmtools-20110323-1.x86_64
Comment 6 Huang Ying 2012-05-30 00:31:02 UTC
Created attachment 73460 [details]
[BUGFIX 1/2] Turn flags into bools
Comment 7 Huang Ying 2012-05-30 00:32:11 UTC
Created attachment 73461 [details]
[BUGFIX 2/2] Disable GHES for too many firmware error
Comment 8 Huang Ying 2012-05-30 00:33:25 UTC
Hi, Pawel,

Thanks for the information. Can you try the attached patches with the following titles?

[BUGFIX 1/2] Turn flags into bools
[BUGFIX 2/2] Disable GHES for too many firmware error
Comment 9 Pawel Sikora 2012-05-30 07:23:40 UTC
(In reply to comment #8)
> Hi, Pawel,
> 
> Thanks for your information.  Can you try the patches attached with the
> following title
> 
> [BUGFIX 1/2] Turn flags into bools
> [BUGFIX 2/2] Disable GHES for too many firmware error

With these patches I see 10 APEI messages in dmesg:

[   50.782360] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   51.670227] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   52.644522] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   52.921203] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   52.934217] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   53.093343] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   53.280076] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   53.281892] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   53.309941] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[   53.593970] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
Comment 10 Huang Ying 2012-05-30 13:02:40 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > Hi, Pawel,
> > 
> > Thanks for your information.  Can you try the patches attached with the
> > following title
> > 
> > [BUGFIX 1/2] Turn flags into bools
> > [BUGFIX 2/2] Disable GHES for too many firmware error
> 
> with these patches i see in dmesg 10 apei logs:

Yes. This is the intended behavior: after 10 failed attempts, we disable the APEI GHES (Generic Hardware Error Source). I think this fixes your issue. Do you agree?
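
The behavior described above can be sketched roughly as follows. This is an illustration in Python, not the attached patch: `ErrorSource`, `poll` and `MAX_FIRMWARE_ERRORS` are hypothetical names, and the only detail taken from the thread is the limit of 10 failures.

```python
from dataclasses import dataclass

# The thread only says "after 10 failed attempts"; the constant name is made up.
MAX_FIRMWARE_ERRORS = 10


@dataclass
class ErrorSource:
    """Illustrative stand-in for a GHES; not a kernel structure."""
    enabled: bool = True
    firmware_errors: int = 0


def poll(source, read_status):
    """Poll once. Return the status, or None if disabled or the read failed."""
    if not source.enabled:
        return None
    try:
        return read_status()
    except ValueError as err:
        # Count the firmware error, log it, and give up after the limit.
        source.firmware_errors += 1
        print(f"[Firmware Bug]: {err}")
        if source.firmware_errors >= MAX_FIRMWARE_ERRORS:
            source.enabled = False
        return None
```

With a reader that always fails, the message is printed 10 times and the source is then left disabled, which matches the 10 lines seen in comment 9.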
Comment 11 Jean Delvare 2012-06-06 09:55:02 UTC
I see the same here, on an Asus Z8NA-D6 board (Intel Xeon 5500 series), kernel 3.4.1.
Comment 12 Jean Delvare 2012-06-06 13:26:33 UTC
This is a regression, as kernel 3.3.8 doesn't exhibit this behavior. I'm not sure that limiting the number of errors logged as suggested in comment #7 is the proper fix... The code was already there in kernel 3.3 but it did not complain, so I suspect some change actually broke the code.

Should I bisect it?
Comment 13 Huang Ying 2012-06-07 00:33:51 UTC
(In reply to comment #12)
> This is a regression, as kernel 3.3.8 doesn't exhibit this behavior. I'm not
> sure that limiting the number of errors logged as suggested in comment #7 is
> the proper fix... The code was already there in kernel 3.3 but it did not
> complain, so I suspect some change actually broke the code.
> 
> Should I bisect it?

Yes. You can bisect it. I think the error message comes from a patch that added bit width checking for the ACPI Generic Address Structure. The following commit is suspect:

15afae604651d4e17652d2ffb56f5e36f991cfef

But I think the patches in comment 7 are needed anyway, because there will be genuinely broken firmware, and we should not flood dmesg with that.
Comment 14 Huang Ying 2012-06-07 05:41:24 UTC
Created attachment 73520 [details]
Check GAR in init time instead of runtime
Comment 15 Huang Ying 2012-06-07 05:43:59 UTC
Hi, Pawel and Jean,

Found another issue regarding this bug report. We should not do GAR checking at run time for each read/write. Please try the patch in comment #14. There should now be only one error report, at init time.
Comment 16 Jean Delvare 2012-06-07 07:49:49 UTC
I confirm that reverting 15afae604651d4e17652d2ffb56f5e36f991cfef makes the error messages go away.
Comment 17 Jean Delvare 2012-06-07 07:58:24 UTC
The patch in comment #14 doesn't build on top of kernel 3.4.1. Checking the settings at initialization time would be a good idea though: if the firmware is broken, it won't magically fix itself at run time.
Comment 18 Huang Ying 2012-06-07 08:57:11 UTC
Created attachment 73522 [details]
Check GAR in map time

Sorry, please try this updated patch instead. I added checking at map time (part of init time), but did not remove the checking at run time, so that we can catch a missing init-time check in case we forget to add one in the future.
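
As a rough illustration of that split (a Python sketch under assumed names, not the patch itself), the same check can run once when the register is mapped and again on each read as a safety net. `Register`, `check`, `map_register` and `read_register` are hypothetical.

```python
from typing import NamedTuple


class Register(NamedTuple):
    """Illustrative register description; field names are assumptions."""
    bit_width: int
    bit_offset: int
    access_bits: int


def check(reg):
    """Reject a register whose bits do not fit in one access."""
    if reg.bit_width + reg.bit_offset > reg.access_bits:
        raise ValueError("Invalid bit width + offset in GAR")


def map_register(reg):
    """Init/map-time path: a bad register is rejected here, once."""
    check(reg)
    return reg


def read_register(reg, read_fn):
    """Run-time path: the check is kept to catch a forgotten map-time check."""
    check(reg)
    return read_fn()
```

If mapping fails, the caller never goes on to poll the register, so the error is reported once instead of on every read.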
Comment 19 Jean Delvare 2012-06-07 09:12:57 UTC
I disassembled the ACPI code and found the problem in the HEST table:

[03Ch 0060  12]         Error Status Address : <Generic Address Structure>
[03Ch 0060   1]                     Space ID : 00 (SystemMemory)
[03Dh 0061   1]                    Bit Width : 20
[03Eh 0062   1]                   Bit Offset : 00
[03Fh 0063   1]         Encoded Access Width : 01 (Byte Access:8)
[040h 0064   8]                      Address : 00000000BF7B4480
(...)
[07Ch 0124  12]         Error Status Address : <Generic Address Structure>
[07Ch 0124   1]                     Space ID : 00 (SystemMemory)
[07Dh 0125   1]                    Bit Width : 20
[07Eh 0126   1]                   Bit Offset : 00
[07Fh 0127   1]         Encoded Access Width : 01 (Byte Access:8)
[080h 0128   8]                      Address : 00000000BF7B4690

So the BIOS defines two 32-bit registers with byte access width, which doesn't look right. The registers defined in all other tables are OK.

So indeed this is a firmware bug, but given that addresses are aligned on a 32-bit boundary, it seems clear that a 32-bit access is the proper thing to do. So, would it make sense to fix-up this kind of firmware oddity at run time?
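
To make the mismatch concrete, here is a small Python sketch of the arithmetic. It assumes the kernel check is the one its message implies (bit width plus bit offset must fit in one access) and uses the ACPI encoding of the access width field, where codes 1 to 4 mean 8, 16, 32 and 64 bits. The function names are illustrative.

```python
def access_bit_width(access_size_code):
    """Map the encoded access width (1..4) to 8, 16, 32 or 64 bits."""
    if not 1 <= access_size_code <= 4:
        raise ValueError(f"invalid access size code {access_size_code}")
    return 1 << (access_size_code + 2)


def gar_is_consistent(bit_width, bit_offset, access_size_code):
    """True if the register's bits fit within a single access."""
    return bit_width + bit_offset <= access_bit_width(access_size_code)


if __name__ == "__main__":
    # Values from the HEST entries above: Bit Width 0x20, Bit Offset 0,
    # Encoded Access Width 1 (byte access).
    print(gar_is_consistent(0x20, 0, 1))  # 32 bits do not fit in 8
    print(gar_is_consistent(0x20, 0, 3))  # 32 bits fit in a 32-bit access
```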
Comment 20 Jean Delvare 2012-06-07 15:03:11 UTC
The patch from comment #18 works fine. But I think all checks should be moved to init/map time and removed from run time, as run-time checks are more costly and are now redundant.
Comment 21 Jean Delvare 2012-06-07 15:49:33 UTC
Note that I just submitted a request to the Asus technical support to fix the issue in the BIOS. I have no idea if they'll follow up though (the board model is already 3 or 4 years old). Given that several other boards are affected, and Linux was working OK on them before, I think we want to have a workaround in the kernel anyway.
Comment 22 Jean Delvare 2012-06-08 09:39:20 UTC
Created attachment 73543 [details]
Fixup common BIOS bug

I propose to fix up this specific register definition bug. The 3 reports I am aware of have the exact same issue, so I guess the bug was in a reference BIOS implementation and more systems are affected. Fixing it up that way is cheap and will prevent regressions.

Originally I planned to write more generic fixup code, but it was quite complex and unreadable, so unless someone insists on seeing it, I think more simple code to fix the one problem at hand is preferable.
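
A minimal sketch of such a targeted fixup, written in Python for illustration only: it follows the reasoning in comment 19 (a 32-bit register at offset 0 on a 4-byte-aligned address is read with a 32-bit access) and is not the attached patch, which may differ. `fixup_access_bits` is a hypothetical name.

```python
def fixup_access_bits(address, bit_width, bit_offset, access_bits):
    """Widen an undersized access for the one pattern seen in the reports.

    Only a 32-bit register at bit offset 0 on a 4-byte-aligned address is
    adjusted; every other definition is returned unchanged.
    """
    if (bit_width == 32 and bit_offset == 0
            and address % 4 == 0 and access_bits < 32):
        return 32
    return access_bits


if __name__ == "__main__":
    # Addresses taken from the HEST dump in comment 19 and the original report.
    for addr in (0xBF7B4480, 0xBF7B4690, 0xDFEC6110):
        print(hex(addr), fixup_access_bits(addr, 32, 0, 8))
```

Keeping the condition this narrow means a register that is misdescribed in any other way still fails the check and is reported.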
Comment 23 Florian Mickler 2012-07-01 09:45:08 UTC
A patch referencing this bug report has been merged in Linux v3.5-rc5:

commit 34ddeb035d704eafdcdb3cbc781894300136c3c4
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Jun 12 11:20:19 2012 +0800

    ACPI, APEI, Avoid too much error reporting in runtime
Comment 24 Jean Delvare 2012-07-01 20:55:22 UTC
I'm only half convinced by the above commit. For one thing, the run time checks are still present, while they seem completely redundant to me now. For another, this solves the log flood but not the actual problem that APEI no longer works on the affected boards. Sure, this is a BIOS bug in the first place, but this is also a regression from the user's perspective. Hopefully my patch in comment #22 is at least scheduled to be merged in kernel 3.6.
Comment 25 Len Brown 2012-07-14 15:03:05 UTC
patch in comment #22 staged for 3.6-merge
Comment 26 Len Brown 2012-07-26 23:13:40 UTC
patch in comment #22 shipped 3.6-merge
should be picked up for stable 3.5 and 3.4.

commit f712c71f7b2b43b894d1e92e1b77385fcad8815f
Author: Jean Delvare <jdelvare@suse.de>
Date:   Tue Jun 12 10:43:28 2012 +0200

    ACPI, APEI: Fixup common access width firmware bug
