Bug 16244

Summary: Battery insert while system is asleep causes sometimes a crash
Product: ACPI Reporter: Maxim Levitsky (maximlevitsky)
Component: Power-BatteryAssignee: Lan Tianyu (tianyu.lan)
Status: CLOSED CODE_FIX    
Severity: normal CC: andi-bz, astarikovskiy, hmh, lenb, ming.m.lin, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35-rc3 Subsystem:
Regression: No Bisected commit-id:
Attachments: acpidemp
patch-16244-do-not-invoke-power_supply_changed-twice
patch-16244-do-not-invoke-power_supply_changed-twice
debug patch
debug patch v2
the kernel log
debug patch
another dmesg of a crash with the debug patch
battery_debug.patch

Description Maxim Levitsky 2010-06-18 00:30:55 UTC
Backtrace:

<6>[ 2763.330466] [drm] nouveau 0000:01:00.0: 0xB6F9: parsing clock script 1
<6>[ 2763.533828] ata3: SATA link down (SStatus 0 SControl 300)
<6>[ 2763.552678] ata2: SATA link down (SStatus 0 SControl 300)
<6>[ 2763.570498] ata4.00: configured for UDMA/33
<0>[ 2763.650740] ------------[ cut here ]------------
<2>[ 2763.650751] kernel BUG at /home/maxim/software/kernel/linux-2.6/kernel/workqueue.c:312!
<0>[ 2763.650762] invalid opcode: 0000 [#1] PREEMPT SMP 
<0>[ 2763.650772] last sysfs file: /sys/power/state
<4>[ 2763.650779] CPU 0 
<4>[ 2763.650784] Modules linked in: lirc_ene0100 af_packet iwl3945 iwlcore uvcvideo videodev mac80211 snd_hda_codec_realtek v4l2_compat_ioctl32 usb_storage uhci_hcd snd_hda_intel usb_libusual ehci_hcd snd_hda_codec joydev usbcore cpufreq_powersave tg3 cfg80211 cpufreq_conservative snd_hwdep r852 sdhci_pci sdhci psmouse cpufreq_userspace snd_pcm acpi_cpufreq mperf iTCO_wdt iTCO_vendor_support mmc_core libphy sm_common nand video coretemp ac nand_ids nand_ecc mtd snd_page_alloc sg battery lirc_dev serio_raw evdev nouveau ttm drm_kms_helper drm i2c_algo_bit
<4>[ 2763.650886] 
<4>[ 2763.650893] Pid: 145, comm: kacpi_notify Not tainted 2.6.35-rc3+ #73 Nettiling/Aspire 5720     
<4>[ 2763.650904] RIP: 0010:[<ffffffff81056424>]  [<ffffffff81056424>] queue_work_on+0x44/0x50
<4>[ 2763.650923] RSP: 0018:ffff88006ffa7cd0  EFLAGS: 00010213
<4>[ 2763.650931] RAX: ffff88006c273908 RBX: ffff88006fc195a0 RCX: 0000000000000000
<4>[ 2763.650941] RDX: ffff88006c273900 RSI: ffff88006fc195a0 RDI: 0000000000000000
<4>[ 2763.650950] RBP: ffff88006ffa7cd0 R08: 0000000000000000 R09: 0000000000000000
<4>[ 2763.650959] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88006c273900
<4>[ 2763.650968] R13: 0000000000000081 R14: ffffffff81211889 R15: ffff88006ffb0000
<4>[ 2763.650979] FS:  0000000000000000(0000) GS:ffff880002400000(0000) knlGS:0000000000000000
<4>[ 2763.650989] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 2763.650998] CR2: 00007f9073d926b0 CR3: 000000000155d000 CR4: 00000000000006f0
<4>[ 2763.651007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 2763.651016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 2763.651026] Process kacpi_notify (pid: 145, threadinfo ffff88006ffa6000, task ffff88006ffb0000)
<0>[ 2763.651036] Stack:
<4>[ 2763.651040]  ffff88006ffa7cf0 ffffffff81056479 ffff88006c273898 ffff88006fc3e000
<4>[ 2763.651052] <0> ffff88006ffa7d00 ffffffff810564c3 ffff88006ffa7d30 ffffffff812c6d1e
<4>[ 2763.651065] <0> ffff88006fc3e000 0000000000000081 ffffffff81211889 ffff88006c273800
<0>[ 2763.651079] Call Trace:
<4>[ 2763.651089]  [<ffffffff81056479>] queue_work+0x29/0x60
<4>[ 2763.651099]  [<ffffffff810564c3>] schedule_work+0x13/0x20
<4>[ 2763.651111]  [<ffffffff812c6d1e>] power_supply_changed+0x1e/0x70
<4>[ 2763.651124]  [<ffffffff81211889>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 2763.651140]  [<ffffffffa006b6de>] acpi_battery_notify+0x76/0x7e [battery]
<4>[ 2763.651152]  [<ffffffff812148da>] acpi_device_notify+0x14/0x16
<4>[ 2763.651163]  [<ffffffff81223244>] acpi_ev_notify_dispatch+0x62/0x7a
<4>[ 2763.651175]  [<ffffffff812118ad>] acpi_os_execute_deferred+0x24/0x31
<4>[ 2763.651186]  [<ffffffff81055eb0>] worker_thread+0x220/0x390
<4>[ 2763.651197]  [<ffffffff81055e5e>] ? worker_thread+0x1ce/0x390
<4>[ 2763.651209]  [<ffffffff8105ace0>] ? autoremove_wake_function+0x0/0x40
<4>[ 2763.651221]  [<ffffffff81055c90>] ? worker_thread+0x0/0x390
<4>[ 2763.651231]  [<ffffffff8105a85e>] kthread+0xae/0xc0
<4>[ 2763.651244]  [<ffffffff81003b94>] kernel_thread_helper+0x4/0x10
<4>[ 2763.651255]  [<ffffffff8105a7b0>] ? kthread+0x0/0xc0
<4>[ 2763.651265]  [<ffffffff81003b90>] ? kernel_thread_helper+0x0/0x10
<0>[ 2763.651273] Code: 75 29 44 8b 46 20 45 85 c0 75 24 48 8b 06 48 63 ff 48 89 d6 48 03 04 fd 20 3a 5b 81 48 89 c7 e8 03 ff ff ff b8 01 00 00 00 c9 c3 <0f> 0b eb fe 8b 3d 92 d8 55 00 eb d4 55 48 89 f2 48 89 e5 48 8b 
<1>[ 2763.651351] RIP  [<ffffffff81056424>] queue_work_on+0x44/0x50
<4>[ 2763.651362]  RSP <ffff88006ffa7cd0>
<4>[ 2763.655505] ---[ end trace 67498c368c7e7da0 ]---
<0>[ 2763.655514] Kernel panic - not syncing: Fatal exception
<4>[ 2763.655523] Pid: 145, comm: kacpi_notify Tainted: G      D     2.6.35-rc3+ #73
<4>[ 2763.655532] Call Trace:
<4>[ 2763.655543]  [<ffffffff8139fb38>] panic+0x90/0x10a
<4>[ 2763.655554]  [<ffffffff8100737c>] oops_end+0xcc/0xe0
<4>[ 2763.655563]  [<ffffffff81007586>] die+0x56/0x90
<4>[ 2763.655574]  [<ffffffff81004364>] do_trap+0xc4/0x170
<4>[ 2763.655585]  [<ffffffff81211889>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 2763.655597]  [<ffffffff81211889>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 2763.655607]  [<ffffffff81004810>] do_invalid_op+0x90/0xb0
<4>[ 2763.655618]  [<ffffffff81056424>] ? queue_work_on+0x44/0x50
<4>[ 2763.655629]  [<ffffffff813a41d3>] ? _raw_spin_unlock_irqrestore+0x33/0x60
<4>[ 2763.655640]  [<ffffffff810304be>] ? __wake_up+0x4e/0x70
<4>[ 2763.655651]  [<ffffffff81003a15>] invalid_op+0x15/0x20
<4>[ 2763.655662]  [<ffffffff81211889>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 2763.655673]  [<ffffffff81056424>] ? queue_work_on+0x44/0x50
<4>[ 2763.655684]  [<ffffffff81056479>] queue_work+0x29/0x60
<4>[ 2763.655694]  [<ffffffff810564c3>] schedule_work+0x13/0x20
<4>[ 2763.655704]  [<ffffffff812c6d1e>] power_supply_changed+0x1e/0x70
<4>[ 2763.655716]  [<ffffffff81211889>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 2763.655730]  [<ffffffffa006b6de>] acpi_battery_notify+0x76/0x7e [battery]
<4>[ 2763.655742]  [<ffffffff812148da>] acpi_device_notify+0x14/0x16
<4>[ 2763.655752]  [<ffffffff81223244>] acpi_ev_notify_dispatch+0x62/0x7a
<4>[ 2763.655764]  [<ffffffff812118ad>] acpi_os_execute_deferred+0x24/0x31
<4>[ 2763.655775]  [<ffffffff81055eb0>] worker_thread+0x220/0x390
<4>[ 2763.655785]  [<ffffffff81055e5e>] ? worker_thread+0x1ce/0x390
<4>[ 2763.655797]  [<ffffffff8105ace0>] ? autoremove_wake_function+0x0/0x40
<4>[ 2763.655808]  [<ffffffff81055c90>] ? worker_thread+0x0/0x390
<4>[ 2763.655818]  [<ffffffff8105a85e>] kthread+0xae/0xc0
<4>[ 2763.655830]  [<ffffffff81003b94>] kernel_thread_helper+0x4/0x10
<4>[ 2763.655840]  [<ffffffff8105a7b0>] ? kthread+0x0/0xc0
<4>[ 2763.655851]  [<ffffffff81003b90>] ? kernel_thread_helper+0x0/0x10


linux-acpi-next/release tree was used when this backtrace happened.
Note that I just recently fixed suspend/resume on my notebook, therefore I don't know if this is a regression.

Already happened twice.
Comment 1 Maxim Levitsky 2010-06-18 00:31:49 UTC
Created attachment 26833 [details]
acpidemp
Comment 2 Zhang Rui 2010-06-30 08:05:14 UTC
Created attachment 26978 [details]
patch-16244-do-not-invoke-power_supply_changed-twice

please try this patch and see if it helps.
Comment 3 Zhang Rui 2010-06-30 08:06:46 UTC
please re-open the bug report if the patch doesn't help.
Comment 4 Maxim Levitsky 2010-06-30 23:17:25 UTC
I can't test this patch now. really sorry.. but I will test as soon as I can.

Big thanks for this.
Comment 5 Alexey Starikovskiy 2010-07-06 08:45:32 UTC
Rui,
Please move first half of the patch two lines down, or you might derefence null pointer.
Comment 6 Zhang Rui 2010-07-06 08:51:20 UTC
Created attachment 27034 [details]
patch-16244-do-not-invoke-power_supply_changed-twice

how about this one?
Comment 7 Len Brown 2010-07-07 02:24:49 UTC
marking bug as RESOLVED as there is a proposed fix in comment #6
Comment 8 Maxim Levitsky 2010-07-26 08:45:45 UTC
I finally have time to continue my work in linux.

I applied that patch (+ WARN_ON to see if this condition triggers).

I see if this problem still exits, and hope that this  patch fixes it.

Patch isn't upstream yet?
Comment 9 Len Brown 2010-07-26 23:56:40 UTC
patch is upstream in 2.6.35-rc6 post git1

commit 153e500f516329f439856f52ccbf61d1fd1a946a
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Wed Jul 7 09:11:57 2010 +0800

    ACPI battery: don't invoke power_supply_changed twice when battery is hot-added

closed.
Comment 10 Maxim Levitsky 2010-07-31 16:18:36 UTC
Unfortunately just got another kernel kaboom:
This is latest git.
Same thing, system was suspended, then I decided to move it to another room.
I inserted the battery, disconnected power, and turned the system on....

<2>[ 1434.270090] kernel BUG at /home/maxim/software/kernel/linux-2.6/kernel/workqueue.c:312!
<0>[ 1434.270096] invalid opcode: 0000 [#1] PREEMPT SMP 
<0>[ 1434.270101] last sysfs file: /sys/power/state
<4>[ 1434.270105] CPU 0 
<4>[ 1434.270107] Modules linked in: ir_lirc_codec lirc_dev ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ene_ir ir_rc5_decoder ir_nec_decoder ir_core nvidia(P) af_packet nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc iwl3945 iwlcore uvcvideo mac80211 videodev usb_storage v4l2_compat_ioctl32 snd_hda_codec_realtek usb_libusual cpufreq_powersave cpufreq_conservative uhci_hcd snd_hda_intel sdhci_pci cpufreq_userspace ehci_hcd acpi_cpufreq mperf sdhci usbcore snd_hda_codec iTCO_wdt cfg80211 snd_hwdep mmc_core r852 tg3 joydev snd_pcm iTCO_vendor_support sm_common nand coretemp nand_ids nand_ecc psmouse battery ac video libphy mtd snd_page_alloc serio_raw evdev sg [last unloaded: ir_core]
<4>[ 1434.270169] 
<4>[ 1434.270173] Pid: 145, comm: kacpi_notify Tainted: P            2.6.35-rc6+ #98 Nettiling/Aspire 5720     
<4>[ 1434.270179] RIP: 0010:[<ffffffff81056364>]  [<ffffffff81056364>] queue_work_on+0x44/0x50
<4>[ 1434.270191] RSP: 0018:ffff88007f19bcd0  EFLAGS: 00010213
<4>[ 1434.270195] RAX: ffff88007e374908 RBX: ffff88007f8195a0 RCX: 0000000000000000
<4>[ 1434.270200] RDX: ffff88007e374900 RSI: ffff88007f8195a0 RDI: 0000000000000000
<4>[ 1434.270204] RBP: ffff88007f19bcd0 R08: 0000000000000000 R09: 0000000000000000
<4>[ 1434.270209] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007e374900
<4>[ 1434.270214] R13: ffff88007f0a4000 R14: 0000000000000080 R15: ffff88007f1a0000
<4>[ 1434.270220] FS:  0000000000000000(0000) GS:ffff880002400000(0000) knlGS:0000000000000000
<4>[ 1434.270225] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 1434.270230] CR2: 00007f22b96a56b0 CR3: 0000000001559000 CR4: 00000000000006f0
<4>[ 1434.270234] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1434.270239] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 1434.270245] Process kacpi_notify (pid: 145, threadinfo ffff88007f19a000, task ffff88007f1a0000)
<0>[ 1434.270250] Stack:
<4>[ 1434.270252]  ffff88007f19bcf0 ffffffff810563b9 ffff88007e374898 ffff88004f0b1800
<4>[ 1434.270258] <0> ffff88007f19bd00 ffffffff81056403 ffff88007f19bd30 ffffffff812c49ee
<4>[ 1434.270265] <0> ffff88004f0b1800 ffff88007f0a4000 0000000000000080 ffff88007e374800
<0>[ 1434.270272] Call Trace:
<4>[ 1434.270277]  [<ffffffff810563b9>] queue_work+0x29/0x60
<4>[ 1434.270283]  [<ffffffff81056403>] schedule_work+0x13/0x20
<4>[ 1434.270290]  [<ffffffff812c49ee>] power_supply_changed+0x1e/0x70
<4>[ 1434.270299]  [<ffffffffa007c6e8>] acpi_battery_notify+0x7d/0x86 [battery]
<4>[ 1434.270307]  [<ffffffff81211629>] ? acpi_os_execute_deferred+0x0/0x31
<4>[ 1434.270313]  [<ffffffff8121466e>] acpi_device_notify+0x14/0x16
<4>[ 1434.270320]  [<ffffffff81222ff8>] acpi_ev_notify_dispatch+0x62/0x7a
<4>[ 1434.270326]  [<ffffffff8121164d>] acpi_os_execute_deferred+0x24/0x31
<4>[ 1434.270332]  [<ffffffff81055df0>] worker_thread+0x220/0x390
<4>[ 1434.270338]  [<ffffffff81055d9e>] ? worker_thread+0x1ce/0x390
<4>[ 1434.270345]  [<ffffffff8105ac00>] ? autoremove_wake_function+0x0/0x40
<4>[ 1434.270351]  [<ffffffff81055bd0>] ? worker_thread+0x0/0x390
<4>[ 1434.270356]  [<ffffffff8105a77e>] kthread+0xae/0xc0
<4>[ 1434.270363]  [<ffffffff81003bd4>] kernel_thread_helper+0x4/0x10
<4>[ 1434.270369]  [<ffffffff8105a6d0>] ? kthread+0x0/0xc0
<4>[ 1434.270374]  [<ffffffff81003bd0>] ? kernel_thread_helper+0x0/0x10
<0>[ 1434.270378] Code: 75 29 44 8b 46 20 45 85 c0 75 24 48 8b 06 48 63 ff 48 89 d6 48 03 04 fd 60 f7 5a 81 48 89 c7 e8 03 ff ff ff b8 01 00 00 00 c9 c3 <0f> 0b eb fe 8b 3d 92 96 55 00 eb d4 55 48 89 f2 48 89 e5 48 8b 
<1>[ 1434.270416] RIP  [<ffffffff81056364>] queue_work_on+0x44/0x50
<4>[ 1434.270422]  RSP <ffff88007f19bcd0>
<4>[ 1434.270427] ---[ end trace 76a5caa882a7a1de ]---
Comment 11 Zhang Rui 2010-08-16 07:57:56 UTC
Created attachment 27460 [details]
debug patch

please apply this patch on top of the latest git tree, say 2.6.36-rc1 and attach the dmesg output after resumed.
Comment 12 Zhang Rui 2010-08-16 08:00:02 UTC
BTW: the crash should go away after applying this patch, but that's not a fix, I just need to get the dmesg output after resume. :)
Comment 13 Maxim Levitsky 2010-08-16 08:33:25 UTC
Just one question, should I do ordinary suspend/resume, or try to reproduce the bug by inserting/removing the battery.
BTW I still can't reproduce that reliably.

Thanks!
Comment 14 Zhang Rui 2010-08-16 08:47:12 UTC
Created attachment 27463 [details]
debug patch v2

then please try this patch.
and insert the battery before resume, for several times, and attach the dmesg output until you get "Rui: bug reproduced!" in dmesg.
Comment 15 Maxim Levitsky 2010-08-25 00:05:27 UTC
I tried that.
Don't see the 'Rui: bug reproduced!' at all.
I do see a message about invoking power_supply_changed twice.

I post a proper log soon. Don't have much time now
Comment 16 Maxim Levitsky 2010-09-05 09:37:42 UTC
System blew up again, now with your debug patch.
Comment 17 Maxim Levitsky 2010-09-05 09:38:20 UTC
Created attachment 29032 [details]
the kernel log
Comment 18 Zhang Rui 2010-10-08 08:43:28 UTC
Created attachment 32802 [details]
debug patch

please apply this patch on top of the latest git kernel and see if it helps.
If no, please attach the full log when the problem occurs.
Comment 19 Zhang Rui 2010-12-27 01:22:50 UTC
bug closed as there is no response from the bug reporter.
please feel free to re-open it if the problem still exists in the latest upstream kernel after applying the debug patch.
Comment 20 Henrique de Moraes Holschuh 2011-03-12 18:59:37 UTC
I am also observing hangs on 2.6.35.11 when battery is either plugged or unplugged (i.e. "plugged" state changes) during S3 sleep.  My box (a ThinkPad T43) is known good, has stable firmware, and works perfectly with 2.6.32.y and earlier kernels.

I will reopen the bug after I check things a little more.
Comment 21 Henrique de Moraes Holschuh 2011-03-15 02:00:54 UTC
I do not have enough access rights to reopen this bug.  Please reopen it.

The "debug" patch cures the problem, after a typo is fixed so that it compiles.

The printk is never triggered here, but the system does not hang anymore, i.e.,  unsafe use of delayed work queues in the power supply class is to blame.

This really needs to be fixed for the 2.6.35-longterm, regardless of whether it has been fixed directly or indirectly in mainline.  This bug can cause data loss, it managed to hang the box here with unflushed buffers (maybe ext4 in 2.6.35 does not flush delayed allocation buffers on PM sync, or the hang happened after stuff started untawing and syslog tried to write data).

Cc'ing 2.6.35-longterm maintainer, to make sure it does not fall through the cracks.
Comment 22 Maxim Levitsky 2011-04-01 04:48:33 UTC
Created attachment 52892 [details]
another dmesg of a crash with the debug patch

Well that still happens, really
Comment 23 Lan Tianyu 2011-07-25 07:27:01 UTC
Created attachment 66522 [details]
battery_debug.patch

Please try this patch.
Comment 24 Henrique de Moraes Holschuh 2011-07-25 10:35:57 UTC
Lan:

I did track it down to: "unsafe use of delayed work queues in the power supply class is to blame" (see comment #21).

THAT _also_ needs to be fixed, otherwise you will still get a crash if AC state is different on suspend/resume.
Comment 25 Zhang Rui 2012-01-18 02:12:51 UTC
Tianyu,

what's the current status of this bug?
Comment 26 Lan Tianyu 2012-01-18 02:24:42 UTC
Wait for feedback of the debug path in the comment #23.

hi Maxim:
     Since the patch has been accepted by upstream. Now you can test whether this bug still exits in the newest kernel or not.
Comment 27 Maxim Levitsky 2012-01-18 23:11:49 UTC
I don't get it anymore, but it always was very rare. Thats why I had no chance to find out what triggers it.
But yes, I guess it can be closed, at least until I get it again.
Comment 28 Zhang Rui 2012-01-19 00:46:59 UTC
(In reply to comment #27)
> I don't get it anymore, but it always was very rare. Thats why I had no
> chance
> to find out what triggers it.
> But yes, I guess it can be closed, at least until I get it again.

Good to know. Bug closed