Created attachment 73362
kernel log
Hi,
On my Opterons I see an ACPI/APEI flood in the kernel log:
[Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
Can I fix this hardware problem? Can I reduce this flood?
Ying, will you please have a look at this problem?
Created attachment 73387
Provide debug information (call chain)
Hi, Pawel,
Can you try the attached debug patch with the following title?
Provide debug information (call chain)
And can you provide the output of the following command:
acpidump > acpi.dump
[ 50.971079] ------------[ cut here ]------------
[ 50.971091] WARNING: at drivers/acpi/apei/apei-base.c:593 apei_check_gar+0xed/0x100()
[ 50.971094] Hardware name: H8DGU
[ 50.971097] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 50.971100] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid0 dm_mod uvesafb autofs4 dummy ide_cd_mod cdrom ata_generic pata_acpi sp5100_tco pata_atiixp ide_pci_generic psmouse serio_raw i2c_piix4 pcspkr k10temp amd64_edac_mod edac_core powernow_k8 igb freq_table evdev mperf microcode edac_mce_amd hwmon i2c_core atiixp ide_core dca processor button ext3 jbd mbcache sd_mod crc_t10dif raid1 md_mod ahci libahci libata scsi_mod usbhid hid ohci_hcd ssb mmc_core pcmcia pcmcia_core ehci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
[ 50.971186] Pid: 3529, comm: sfx Not tainted 3.4.0-dirty #4
[ 50.971188] Call Trace:
[ 50.971191] <NMI> [<ffffffff8104d7da>] warn_slowpath_common+0x7a/0xb0
[ 50.971203] [<ffffffff8104d8b1>] warn_slowpath_fmt+0x41/0x50
[ 50.971208] [<ffffffff812e40ad>] apei_check_gar+0xed/0x100
[ 50.971212] [<ffffffff812e420a>] apei_read+0x2a/0xb0
[ 50.971216] [<ffffffff812e763e>] ghes_read_estatus+0x2e/0x180
[ 50.971221] [<ffffffff814c4994>] ? _raw_spin_lock+0x34/0x40
[ 50.971224] [<ffffffff812e7d6f>] ghes_notify_nmi+0x9f/0x230
[ 50.971229] [<ffffffff814c6459>] nmi_handle.isra.1+0x79/0xc0
[ 50.971233] [<ffffffff814c63e0>] ? __die+0xf0/0xf0
[ 50.971236] [<ffffffff814c65a8>] do_nmi+0x108/0x350
[ 50.971240] [<ffffffff814c5a9c>] end_repeat_nmi+0x1a/0x1e
[ 50.971243] <<EOE>>
[ 50.971245] ---[ end trace 4853387a071be848 ]---
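For readers following the trace, the bracketed tuple in the warning can be unpacked into named fields. The sketch below assumes the field order is address / bit width / bit offset / access size code / space ID. That order is an assumption about how the kernel formats this message, although it is consistent with the HEST disassembly posted later in this thread. The names `Gar` and `parse_gar` are hypothetical helpers, not kernel or acpica identifiers.

```python
import re
from typing import NamedTuple


class Gar(NamedTuple):
    """Fields of a Generic Address Register as printed in the log (assumed order)."""
    address: int
    bit_width: int
    bit_offset: int
    access_size_code: int
    space_id: int


# Matches e.g. "GAR [0xdfec6110/32/0/1/0]"
GAR_RE = re.compile(r"GAR \[(0x[0-9a-fA-F]+)/(\d+)/(\d+)/(\d+)/(\d+)\]")


def parse_gar(line: str) -> Gar:
    """Extract the GAR tuple from a kernel log line (hypothetical helper)."""
    match = GAR_RE.search(line)
    if match is None:
        raise ValueError("no GAR tuple found in line")
    addr, *rest = match.groups()
    return Gar(int(addr, 16), *(int(field) for field in rest))


gar = parse_gar(
    "[Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]"
)
print(gar)
```

Under that assumed order, the reported register is 32 bits wide at offset 0 with access size code 1.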
Created attachment 73389
acpi dump
# acpidump >acpi.dump
Wrong checksum for generic table!
pmtools-20110323-1.x86_64
Created attachment 73460
[BUGFIX 1/2] Turn flags into bools
Created attachment 73461
[BUGFIX 2/2] Disable GHES for too many firmware error
Hi, Pawel,
Thanks for your information. Can you try the attached patches with the following titles?
[BUGFIX 1/2] Turn flags into bools
[BUGFIX 2/2] Disable GHES for too many firmware error
(In reply to comment #8)
> Hi, Pawel,
>
> Thanks for your information. Can you try the patches attached with the
> following title
>
> [BUGFIX 1/2] Turn flags into bools
> [BUGFIX 2/2] Disable GHES for too many firmware error
With these patches I see 10 APEI messages in dmesg:
[ 50.782360] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 51.670227] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 52.644522] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 52.921203] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 52.934217] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 53.093343] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 53.280076] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 53.281892] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 53.309941] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
[ 53.593970] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0xdfec6110/32/0/1/0]
(In reply to comment #9)
> (In reply to comment #8)
> > Hi, Pawel,
> >
> > Thanks for your information. Can you try the patches attached with the
> > following title
> >
> > [BUGFIX 1/2] Turn flags into bools
> > [BUGFIX 2/2] Disable GHES for too many firmware error
>
> with these patches i see in dmesg 10 apei logs:
Yes. This is the intended behavior: after 10 failed attempts, we disable the APEI GHES (Generic Hardware Error Source). I think this fixes your issue. Do you agree?
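The "disable after 10 failures" behavior described above can be pictured with a small model. This is only a sketch of the idea, not the contents of the attached patches, which are not shown in this thread. The class `ErrorSource`, its `poll` method, and the constant name `MAX_FW_ERRORS` are hypothetical; only the threshold of 10 comes from the comment.

```python
class ErrorSource:
    """Hypothetical stand-in for a GHES error source that can be switched off."""

    MAX_FW_ERRORS = 10  # threshold taken from the comment above

    def __init__(self) -> None:
        self.fw_errors = 0
        self.disabled = False
        self.log: list[str] = []

    def poll(self, read_ok: bool) -> None:
        """Simulate one notification; read_ok=False models a failed GAR read."""
        if self.disabled or read_ok:
            return
        self.fw_errors += 1
        self.log.append("[Firmware Bug]: APEI: Invalid bit width + offset in GAR")
        if self.fw_errors >= self.MAX_FW_ERRORS:
            # Stop servicing this source so the log is no longer flooded.
            self.disabled = True


src = ErrorSource()
for _ in range(50):
    src.poll(read_ok=False)
print(len(src.log), src.disabled)
```

The design trade-off is the one raised later in the thread: this caps the log noise, but the error source stops working on the affected boards.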
I see the same here, on an Asus Z8NA-D6 board (Intel Xeon 5500 series), kernel 3.4.1.
This is a regression, as kernel 3.3.8 doesn't exhibit this behavior. I'm not sure that limiting the number of errors logged as suggested in comment #7 is the proper fix... The code was already there in kernel 3.3 but it did not complain, so I suspect some change actually broke the code. Should I bisect it?
(In reply to comment #12)
> This is a regression, as kernel 3.3.8 doesn't exhibit this behavior. I'm not
> sure that limiting the number of errors logged as suggested in comment #7 is
> the proper fix... The code was already there in kernel 3.3 but it did not
> complain, so I suspect some change actually broke the code.
>
> Should I bisect it?
Yes, you can bisect it. I think the error message comes from a patch about ACPI Generic Address Structure bit width checking. The following commit is suspicious:
15afae604651d4e17652d2ffb56f5e36f991cfef
But I think the patches in comment 7 are needed anyway, because there will be genuinely broken firmware, and we should not flood dmesg with that.
Created attachment 73520
Check GAR in init time instead of runtime
Hi, Pawel and Jean,
I found another issue regarding this bug report: we should not do GAR checking at run time for each read/write. Please try the patch in comment #14. There should now be only one error report, at init time.
I confirm that reverting 15afae604651d4e17652d2ffb56f5e36f991cfef makes the error messages go away.
The patch in comment #14 doesn't build on top of kernel 3.4.1. Checking the settings at initialization time would be a good idea though: if the firmware is broken, it won't magically fix itself at run time.
Created attachment 73522
Check GAR in map time
Sorry, please try this updated patch instead. I added checking at map time (part of init time), but did not remove the checking at run time, so that we can catch a missing init-time check in case we forget to add one in the future.
I disassembled the ACPI code and found the problem in the HEST table:
[03Ch 0060 12] Error Status Address : <Generic Address Structure>
[03Ch 0060 1] Space ID : 00 (SystemMemory)
[03Dh 0061 1] Bit Width : 20
[03Eh 0062 1] Bit Offset : 00
[03Fh 0063 1] Encoded Access Width : 01 (Byte Access:8)
[040h 0064 8] Address : 00000000BF7B4480
(...)
[07Ch 0124 12] Error Status Address : <Generic Address Structure>
[07Ch 0124 1] Space ID : 00 (SystemMemory)
[07Dh 0125 1] Bit Width : 20
[07Eh 0126 1] Bit Offset : 00
[07Fh 0127 1] Encoded Access Width : 01 (Byte Access:8)
[080h 0128 8] Address : 00000000BF7B4690
So the BIOS defines two 32-bit registers with byte access width, which doesn't look right. The registers defined in all other tables are OK.
So indeed this is a firmware bug, but given that addresses are aligned on a 32-bit boundary, it seems clear that a 32-bit access is the proper thing to do. So, would it make sense to fix up this kind of firmware oddity at run time?
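To make the mismatch concrete, here is a minimal sketch of the consistency check that the error message implies. It assumes the disassembler prints Bit Width in hexadecimal (0x20 = 32, matching the "32-bit registers" reading above) and that the access size code maps to a width of 8, 16, 32, or 64 bits for codes 1 to 4, as the ACPI Generic Address Structure encoding defines. The function names are hypothetical, and the exact kernel logic may differ.

```python
def access_bit_width(access_size_code: int) -> int:
    """Map the ACPI access size code (1=byte .. 4=qword) to a width in bits."""
    if not 1 <= access_size_code <= 4:
        raise ValueError("access size code must be between 1 and 4")
    return 1 << (access_size_code + 2)


def gar_is_consistent(bit_width: int, bit_offset: int, access_size_code: int) -> bool:
    """True if the register field fits inside a single access of the declared size."""
    return bit_width + bit_offset <= access_bit_width(access_size_code)


# Values from the HEST entries above: Bit Width 0x20, Bit Offset 0, byte access.
print(gar_is_consistent(0x20, 0, 1))  # 32 bits cannot fit in an 8-bit access
# The same register declared with dword access would pass the check.
print(gar_is_consistent(0x20, 0, 3))
```

With byte access the field would need 32 bits out of an 8-bit access, which is why the check rejects it.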
The patch from comment #18 works fine. But I think all checks should be moved to init/map time and removed from run time, as run-time checks are more costly and now redundant.
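The argument for validating once at map time can be sketched as follows: the register description is checked when it is set up, and later reads only shift and mask. This is a simplified model under stated assumptions, not the patch from this thread. The class `MappedRegister` and its methods are hypothetical, and `read` takes an already-fetched raw value instead of touching memory.

```python
class MappedRegister:
    """Hypothetical register wrapper that validates its layout once, at map time."""

    def __init__(self, bit_width: int, bit_offset: int, access_size_code: int) -> None:
        if not 1 <= access_size_code <= 4:
            raise ValueError("Invalid access size code in GAR")
        access_width = 1 << (access_size_code + 2)
        if bit_width + bit_offset > access_width:
            # Reported once here, instead of on every read.
            raise ValueError("Invalid bit width + offset in GAR")
        self.bit_offset = bit_offset
        self.mask = (1 << bit_width) - 1

    def read(self, raw: int) -> int:
        """Extract the field from a raw access value; no validation on this path."""
        return (raw >> self.bit_offset) & self.mask


reg = MappedRegister(bit_width=8, bit_offset=4, access_size_code=3)
print(hex(reg.read(0xABCD)))
```

A layout like the one in the HEST dump (width 32, offset 0, byte access) is rejected in the constructor, so the hot path never sees it.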
Note that I just submitted a request to the Asus technical support to fix the issue in the BIOS. I have no idea if they'll follow up though (the board model is already 3 or 4 years old). Given that several other boards are affected, and Linux was working OK on them before, I think we want to have a workaround in the kernel anyway.
Created attachment 73543
Fixup common BIOS bug
I propose to fix up this specific register definition bug. The 3 reports I am aware of have the exact same issue, so I guess the bug was in a reference BIOS implementation and more systems are affected. Fixing it up that way is cheap and will prevent regressions. Originally I planned to write more generic fixup code, but it was quite complex and unreadable, so unless someone insists on seeing it, I think simpler code to fix the one problem at hand is preferable.
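The kind of narrow fixup proposed here can be illustrated with a short sketch. It covers only the case described in this thread (a 32-bit register at offset 0, on a 32-bit aligned address, declared with a narrower access width), and it is an illustration of the idea, not the attached patch, whose exact conditions are not shown here. The function name `fixed_access_bit_width` is hypothetical.

```python
def fixed_access_bit_width(
    address: int, bit_width: int, bit_offset: int, access_size_code: int
) -> int:
    """Return the access width in bits, widened for the known firmware oddity."""
    width = 1 << (access_size_code + 2)  # 1=byte(8) .. 4=qword(64)
    # Assumed fixup condition: a whole, aligned 32-bit register whose declared
    # access width is too small is treated as a 32-bit access.
    if bit_width == 32 and bit_offset == 0 and address % 4 == 0 and width < 32:
        return 32
    return width


# Addresses taken from the HEST dump and the kernel log in this thread.
for addr in (0xBF7B4480, 0xBF7B4690, 0xDFEC6110):
    print(hex(addr), fixed_access_bit_width(addr, 32, 0, 1))
```

Keeping the condition this specific means unrelated malformed registers still fail validation, which fits the stated preference for simple code over a generic fixup.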
A patch referencing this bug report has been merged in Linux v3.5-rc5:
commit 34ddeb035d704eafdcdb3cbc781894300136c3c4
Author: Huang Ying <ying.huang@intel.com>
Date: Tue Jun 12 11:20:19 2012 +0800
ACPI, APEI, Avoid too much error reporting in runtime
I'm only half convinced by the above commit. For one thing, the run time checks are still present, while they seem completely redundant to me now. For another, this solves the log flood but not the actual problem that APEI no longer works on the affected boards. Sure, this is a BIOS bug in the first place, but this is also a regression from the user's perspective. Hopefully my patch in comment #22 is at least scheduled to be merged in kernel 3.6.
patch in comment #22 staged for 3.6-merge
The patch in comment #22 shipped in the 3.6 merge; it should be picked up for stable 3.5 and 3.4.
commit f712c71f7b2b43b894d1e92e1b77385fcad8815f
Author: Jean Delvare <jdelvare@suse.de>
Date: Tue Jun 12 10:43:28 2012 +0200
ACPI, APEI: Fixup common access width firmware bug