Bug 96681

Summary: Macbook sbs hangs in acpi_battery_get_state
Product: ACPI Reporter: Chris Bainbridge (chris.bainbridge)
Component: Power-BatteryAssignee: platform_x86_64 (platform_x86_64)
Severity: normal CC: aaron.lu, chris.bainbridge, lenb
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.0.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg

Description Chris Bainbridge 2015-04-15 10:04:14 UTC
Commit 7bc5a2b has introduced an intermittent boot failure on an IvyBridge Macbook Pro 13 (MacBookPro10,2). The failure rate is around 12%. This was diagnosed by doing a bisect with many boots for each kernel, then verifying failure on 7bc5a2b, and then booting 50 times again with the prior commit (9faf613) to verify that commit was ok (50 boots is necessary to get a high probability >99% of hitting this).

Example hang: http://imgur.com//QHWJTYS

The failure has been observed up to and including 4.0.

The Apple BIOS is MBP102.88Z.0106.B07.1501071215 01/07/2015

The git log for the bisected commit shows that it altered some Apple-specific boot code:

commit 7bc5a2b
Author: Matthew Garrett <matthew.garrett@nebula.com>
Date:   Sat Sep 20 13:19:47 2014 +0200

    ACPI: Support _OSI("Darwin") correctly
    Apple hardware queries _OSI("Darwin") in order to determine whether the
    system is running OS X, and changes firmware behaviour based on the
    answer.  The most obvious difference in behaviour is that Thunderbolt
    hardware is forcibly powered down unless the system is running OS X. The
    obvious solution would be to simply add Darwin to the list of supported
    _OSI strings, but this causes problems.
    Recent Apple hardware includes two separate methods for checking _OSI
    strings. The first will check whether Darwin is supported, and if so
    will exit. The second will check whether Darwin is supported, but will
    then continue to check for further operating systems. If a further
    operating system is found then later firmware code will assume that the
    OS is not OS X.  This results in the unfortunate situation where the
    Thunderbolt controller is available at boot time but remains powered
    down after suspend.
    The easiest way to handle this is to special-case it in the
    Linux-specific OSI handling code. If we see Darwin, we should answer
    true and then disable all other _OSI vendor strings.
    The next problem is that the Apple PCI _OSC method has the following
    if (LEqual (0x01, OSDW ()))
      if (LAnd (LEqual (Arg0, GUID), NEXP)
        (do stuff)
    NEXP is a value in high memory and is presumably under the control of
    the firmware. No methods sets it. The methods that are called in the "do
    stuff" path are dummies. Unless there's some additional firmware call in
    early boot, there's no way for this call to succeed - and even if it
    does, it doesn't do anything.
    The easiest way to handle this is simply to ignore it. We know which
    flags would be set, so just set them by hand if the platform is running
    in Darwin mode.
Comment 1 Chris Bainbridge 2015-04-15 14:51:42 UTC
The hang seems to last about 242 seconds, after which boot continues as normal. Relevant dmesg (full dmesg attached):

[    0.298529] efifb: probing for efifb
[    0.298557] efifb: framebuffer at 0x90000000, mapped to 0xffffc9000aa00000, using 7360k, total 7360k
[    0.298562] efifb: mode is 1680x1050x32, linelength=7168, pages=1
[    0.298565] efifb: scrolling: redraw
[    0.298569] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.302314] Console: switching to colour frame buffer device 210x65
[    0.305819] fb0: EFI VGA frame buffer device
[    0.305946] ACPI: AC Adapter [ADP1] (on-line)
[    0.306073] input: Lid Switch as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0D:00/input/input0
[    0.306117] ACPI: Lid Switch [LID0]
[    0.306161] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input1
[    0.306191] ACPI: Power Button [PWRB]
[    0.306233] input: Sleep Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input2
[    0.306262] ACPI: Sleep Button [SLPB]
[    0.306304] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
[    0.306331] ACPI: Power Button [PWRF]
[    0.308103] Monitor-Mwait will be used to enter C-1 state
[    0.308127] Monitor-Mwait will be used to enter C-2 state
[    0.308149] Monitor-Mwait will be used to enter C-3 state
[    0.308178] ACPI: acpi_idle registered with cpuidle
[    0.309966] ACPI: SBS HC: EC = 0xffff88026540c6c0, offset = 0x20, query_bit = 0x10
[    0.755049] random: nonblocking pool is initialized
[    1.293994] tsc: Refined TSC clocksource calibration: 2494.336 MHz
[    2.295215] Switched to clocksource tsc
[  244.707578] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[  244.708082] Non-volatile memory driver v1.3
[  244.708128] Linux agpgart interface v0.103
[  244.708194] [drm] Initialized drm 1.1.0 20060810
[  244.708823] [drm] Memory usable by graphics device = 2048M
[  244.708846] checking generic (90000000 730000) vs hw (90000000 10000000)
[  244.708870] fb: switching to inteldrmfb from EFI VGA
[  244.708905] Console: switching to colour dummy device 80x25
[  244.708974] [drm] Replacing VGA console driver
[  244.761413] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[  244.761418] [drm] Driver supports precise vblank timestamp query.
[  244.761507] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[  245.491720] ACPI: Video Device [IGPU] (multi-head: yes  rom: no  post: no)
[  245.492141] acpi device:42: registered as cooling_device4
[  245.492235] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input4
[  245.492266] [drm] Initialized i915 1.6.0 20150130 for 0000:00:02.0 on minor 0
[  245.493077] loop: module loaded
[  245.493178] ahci 0000:00:1f.2: version 3.0
[  245.493380] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x1 impl SATA mode
[  245.493391] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part ems apst
[  245.493995] scsi host0: ahci
[  245.494071] scsi host1: ahci
[  245.494141] scsi host2: ahci
[  245.494228] scsi host3: ahci
[  245.494300] scsi host4: ahci
[  245.494367] scsi host5: ahci
Comment 2 Chris Bainbridge 2015-04-15 14:52:02 UTC
Created attachment 174091 [details]
Comment 3 Chris Bainbridge 2015-04-17 16:48:10 UTC
With a bit of debugging and a lot of reboots I narrowed this down to the SBS driver and acpi_battery_get_state:

[    0.326725] ACPI: SBS HC: EC = 0xffff88026542d6c0, offset = 0x20, query_bit = 0x10
[    0.326779] acpi_sbs_init
[    0.326811] acpi_sbs_add
[    0.326823] disable_sbs_manager
[    0.326835] /disable_sbs_manager
[    0.326848] acpi_charger_add
[    0.326860] acpi_ac_get_present
[    0.328878] /acpi_charger_add
[    0.328916] acpi_battery_add
[    0.328931] acpi_battery_read
[    0.328942] acpi_battery_get_info
[    0.350929] /acpi_battery_get_info
[    0.350963] acpi_battery_get_state
[    0.842409] random: nonblocking pool is initialized
[    1.303538] tsc: Refined TSC clocksource calibration: 2494.366 MHz
[    2.304756] Switched to clocksource tsc
[  242.763646] /acpi_battery_get_state
[  242.763684] /acpi_battery_read
[  242.763696] /acpi_sbs_add
[  242.763731] /acpi_sbs_init

The fact that this problem has appeared after the _OSI("Darwin") change is probably due to http://mjg59.dreamwidth.org/29954.html :

"Apple do something very strange with their ACPI battery interface. If you're running anything that doesn't claim to be Darwin, Apple expose an ACPI Control Method battery. Control Method interfaces abstract the complexity away into either ACPI bytecode or system management traps - the OS simply calls an ACPI method, magic happens and it gets an answer back. ***If you claim to be Darwin, Apple remove that interface and instead expose the raw ACPI Smart Battery System interface. This provides an i2c bus over which the OS must then speak the Smart Battery System protocol, allowing it to directly communicate with the battery.*** 

Linux has support for this, but it seems that this wasn't working so well and hadn't been for years. Loading the driver resulted in modprobe hanging until a timeout occurred, and most accesses to the battery would (a) take forever and (b) probably fail."

So now that Linux is claiming to be Darwin, the previously hidden ACPI SBS is exposed, resulting in bugs. I noticed the same hang on `modprobe sbs`

 [  138.131540] ACPI: SBS HC: EC = 0xffff880265bcf780, offset = 0x20, query_bit
= 0x10
[  138.131708] acpi_sbs_init
[  138.131735] acpi_sbs_add
[  138.131738] disable_sbs_manager
[  138.131739] /disable_sbs_manager
[  138.131741] acpi_charger_add
[  138.131742] acpi_ac_get_present
[  138.133375] /acpi_charger_add
[  138.133382] acpi_battery_add
[  138.133383] acpi_battery_read
[  138.133384] acpi_battery_get_info
[  138.156800] /acpi_battery_get_info
[  138.156808] acpi_battery_get_state
[  281.119010] /acpi_battery_get_state
[  281.119018] /acpi_battery_read
[  281.119032] /acpi_sbs_add
[  281.119073] /acpi_sbs_init

So it should be possible to reproduce just by loading and unloading the driver, avoiding the many reboots. Unfortunately attempting to `modprobe -r sbs` currently results in a NULL pointer dereference in acpi_ec_put_query_handler.
Comment 4 Chris Bainbridge 2015-04-17 16:58:24 UTC
This might be a dupe of bug #94651
Comment 5 Aaron Lu 2015-04-21 06:16:47 UTC
Let's track it there then.

*** This bug has been marked as a duplicate of bug 94651 ***