Bug 10265

Summary: oops after resume from acpi S3
Product: ACPI Reporter: Matthias Bläsing (matthias.blaesing)
Component: ACPICA-CoreAssignee: Zhang Rui (rui.zhang)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: acpi-bugzilla, bunk
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24.3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: output of acpi dump
Debug Log around the ooops
syslog around the oops
cat /proc/acpi/event
gziped (decompressed) 11 MB syslog around oops
debug patch
oops with the patch
flush kacpi_notify_wq before removing notify handler

Description Matthias Bläsing 2008-03-16 13:15:41 UTC
Latest working kernel version: 2.6.23
Earliest failing kernel version: 2.6.24
Distribution: Debian SID
Hardware Environment: Asus M6700NE
Software Environment: ?!

Steps to reproduce:

I tried a kernel with out the /proc/acpi/event entry - this works as expected after the resume (at least on a short test). But enableing the entry results in this, after I resume from S3:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip: 00000000 *pde = 35e1e067 *pte = 00000000 
Oops: 0000 [#1] 
Modules linked in: ac thermal fan button ipw2200 ieee80211 ieee80211_crypt tg3 binfmt_misc radeon drm rfcomm l2cap bluetooth ppdev lp ipv6 acpi_cpufreq cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace fuse dm_crypt dm_mod md_mod bay dock uinput sbp2 loop usbhid pcmcia nsc_ircc irda snd_intel8x0 parport_pc parport crc_ccitt snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_intel8x0m snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss pcspkr rtc snd_seq_midi_event snd_seq snd_pcm ohci1394 snd_seq_device snd_timer firmware_class ieee1394 yenta_socket rsrc_nonstatic pcmcia_core snd soundcore snd_page_alloc i2c_i801 ehci_hcd rng_core uhci_hcd usbcore video output battery asus_acpi processor intel_agp agpgart joydev evdev unix

Pid: 47, comm: kacpi_notify Not tainted (2.6.24.3 #4)
EIP: 0060:[<00000000>] EFLAGS: 00010282 CPU: 0
EIP is at _stext+0x3feff000/0x14
EAX: f7c1cc5c EBX: f7cc0c04 ECX: 00000000 EDX: 00000080
ESI: f5c5aa30 EDI: c02167e0 EBP: 00000000 ESP: f7cbff7c
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process kacpi_notify (pid: 47, ti=f7cbe000 task=f7c72550 task.ti=f7cbe000)
Stack: c021bd97 f75d9b40 f75d9b48 c0216802 f76db488 f7c374c0 c0123277 f7c72aa0 
       f7c72550 c02d89df f7c374c0 f7c374c0 c012386a f7c374c8 c012391e 00000000 
       f7c72550 c0125d3a f7cbffc4 f7cbffc4 f7c374c0 c012386a 00000000 c0125bda 
Call Trace:
 [<c021bd97>] acpi_ev_notify_dispatch+0x4c/0x55
 [<c0216802>] acpi_os_execute_notify+0x22/0x2b
 [<c0123277>] run_workqueue+0x6b/0xdf
 [<c02d89df>] schedule+0x1f0/0x20a
 [<c012386a>] worker_thread+0x0/0xc2
 [<c012391e>] worker_thread+0xb4/0xc2
 [<c0125d3a>] autoremove_wake_function+0x0/0x33
 [<c012386a>] worker_thread+0x0/0xc2
 [<c0125bda>] kthread+0x36/0x5c
 [<c0125ba4>] kthread+0x0/0x5c
 [<c010476f>] kernel_thread_helper+0x7/0x10
 =======================
Code:  Bad EIP value.
EIP: [<00000000>] _stext+0x3feff000/0x14 SS:ESP 0068:f7cbff7c
---[ end trace a8374706f1380296 ]---

Before 2.6.24 it worked.

Thanks in advance

Matthias
Comment 1 Zhang Rui 2008-03-17 00:06:58 UTC
First, please try the lastest kernel and see if it still hangs.
Second, as this is a regression, it would be great if you can do a git bisect to see which commit causes this oops.

BTW: please attach the output of acpidump using the latest pmtools here:
http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/
Comment 2 Matthias Bläsing 2008-03-17 08:51:10 UTC
Created attachment 15315 [details]
output of acpi dump
Comment 3 Matthias Bläsing 2008-03-17 09:20:42 UTC
I tested with head of git and got the same results:

BUG: unable to handle kernel paging request at f885e127
IP: [<f885e127>]
*pde = 37c18067 *pte = 00000000 
Oops: 0000 [#1] 
Modules linked in: thermal fan button ipw2200 ieee80211 ieee80211_crypt tg3 binfmt_misc radeon drm rfcomm l2cap bluetooth ppdev lp ipv6 acpi_cpufreq cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace fuse dm_crypt crypto_blkcipher dm_mod md_mod bay dock video output uinput sbp2 loop snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm usbhid snd_seq_dummy pcmcia snd_seq_oss snd_seq_midi nsc_ircc snd_rawmidi snd_seq_midi_event snd_seq irda parport_pc parport crc_ccitt snd_timer snd_seq_device snd pcspkr rtc ohci1394 soundcore snd_page_alloc firmware_class ieee1394 yenta_socket rsrc_nonstatic pcmcia_core i2c_i801 ehci_hcd uhci_hcd rng_core usbcore battery asus_acpi processor intel_agp agpgart joydev evdev unix [last unloaded: ac]

Pid: 45, comm: kacpi_notify Not tainted (2.6.25-rc6 #2)
EIP: 0060:[<f885e127>] EFLAGS: 00010286 CPU: 0
EIP is at 0xf885e127
EAX: f7c1cc5c EBX: f756f59c ECX: f7dcecc0 EDX: 00000080
ESI: f4cee770 EDI: c020a217 EBP: 00000000 ESP: f7ccdf90
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process kacpi_notify (pid: 45, ti=f7ccc000 task=f7cbe230 task.ti=f7ccc000)
Stack: c020f099 f76f8540 f76f8548 c020a234 f7c3ba40 c0121fc7 f7c3ba40 c01224fa 
       f7c3ba48 c01225a6 00000000 f7cbe230 c012454d f7ccdfc4 f7ccdfc4 f7c3ba40 
       c01224fa 00000000 c01243f3 c01243bd 00000000 c0105207 f7c41efc 00000000 
Call Trace:
 [<c020f099>] acpi_ev_notify_dispatch+0x4c/0x55
 [<c020a234>] acpi_os_execute_deferred+0x1d/0x25
 [<c0121fc7>] run_workqueue+0x63/0xca
 [<c01224fa>] worker_thread+0x0/0xb7
 [<c01225a6>] worker_thread+0xac/0xb7
 [<c012454d>] autoremove_wake_function+0x0/0x2b
 [<c01224fa>] worker_thread+0x0/0xb7
 [<c01243f3>] kthread+0x36/0x5b
 [<c01243bd>] kthread+0x0/0x5b
 [<c0105207>] kernel_thread_helper+0x7/0x10
 =======================
Code:  Bad EIP value.
EIP: [<f885e127>] 0xf885e127 SS:ESP 0068:f7ccdf90
---[ end trace 89ea4c9e06c68987 ]---

I tried a bisect, but was faced with errors, that make it difficult to decide/test whether the problem is the acpi part or some other subsystem. I have to look into this some more.
Comment 4 Matthias Bläsing 2008-03-19 09:32:17 UTC
Ok - the bisection is just frustrating - the testing of the acpi functionality with a crashing kernel or without the networking subsystem (the acpid tells me, that it can't open the event socket). So I'm a little stuck. Not sure where to go on.
Comment 5 Zhang Rui 2008-03-19 20:31:10 UTC
Does this happen occasionally or it happens every time back from S3?
please set CONFIG_ACPI_DEBUG,
"echo 0x88000011f > /sys/modules/acpi/parameters/debug_level"
"echo 0x44 > /sys/modules/acpi/parameters/debug_layer"
and attach the dmesg output after this oops.
Comment 6 Matthias Bläsing 2008-03-20 03:25:21 UTC
Created attachment 15353 [details]
Debug Log  around the ooops

When I tried to reproduce the ooops to get the debug output, I noticed, that it happend approximately every 2-3 times I put the notebook into S3. I had to repeat the suspend-resume cyle three time, to get the log.

The attached log gives some context to the oops. The oops is logged in line 2978. After that the battery level and multimedia keys aren't reported anymore.
Comment 7 Zhang Rui 2008-03-21 00:17:15 UTC
hah, too many debug messages are popped up.
could you please re-do the test, but
"echo 0x44 > /sys/modules/acpi/parameters/debug_layer"
BEFORE "echo 0x88000011f > /sys/modules/acpi/parameters/debug_level"? :)
Comment 8 Matthias Bläsing 2008-03-21 01:27:34 UTC
Created attachment 15365 [details]
syslog around the oops

You won't like this, but at least at a first glance I don't see any change in the  verbosity. On "good" thing though, the oops happend directly on the first try. But I can't see any usable pattern of circumstances, when it happens and when not.
Comment 9 Matthias Bläsing 2008-03-30 04:02:33 UTC
It seams, that it only happens, when /proc/acpi/events is activated. But I have to have it running, as the asus-laptop/asus-acpi module generates acpi events for the multimedia keys. These seem not to be put into the input layer (I can for example get the sleep button key without using the acpid -- they are reported by hal). If this would be the case I might be able to get rid of acpid and by that get rid of the bug.
Comment 10 Zhang Rui 2008-03-30 20:36:00 UTC
matthias, thanks for debugging.

(In reply to comment #7)
> "echo 0x44 > /sys/modules/acpi/parameters/debug_layer"
> BEFORE "echo 0x88000011f > /sys/modules/acpi/parameters/debug_level"? :)
                       ~~~~
                        typo

Please first kill acpid, and attach the result of "cat /proc/acpi/events" after s3 resume.
And then please "echo 0x04 > /sys/modules/acpi/parameters/debug_layer"
"echo 0x8800001f > /sys/modules/acpi/parameters/debug_level"
and attach the dmesg after oops.
IMO, this may be a battery driver bug that we should fix it in kernel rather than stop using acpid. :)
Comment 11 Matthias Bläsing 2008-03-31 10:13:36 UTC
hey,

some evidence for the battery hypothesis is, that I tried to remove the battery module before sending the notebook into S3 and at least it came up again three times.

I will attach the "cat /proc/acpi/events" output and the syslog of the debug run.
Comment 12 Matthias Bläsing 2008-03-31 10:15:47 UTC
Created attachment 15535 [details]
cat /proc/acpi/event
Comment 13 Matthias Bläsing 2008-03-31 10:16:46 UTC
Created attachment 15536 [details]
gziped (decompressed) 11 MB syslog around oops
Comment 14 Zhang Rui 2008-04-01 00:43:55 UTC
could you please attach the content of /etc/acpi/sleep.sh?
I suspect this script pokes the battery sysfs I/F and causes the oops when resuming from S3.
Please clear CONFIG_ACPI_SYSFS_POWER and see if it helps.
Comment 15 Zhang Rui 2008-04-01 00:47:37 UTC
Created attachment 15548 [details]
debug patch

Please apply this debug patch and set CONFIG_ACPI_SYSFS_POWER
and attach the dmesg output after the oops, you don't need to set the debug parameter this time. :)
Comment 16 Matthias Bläsing 2008-04-01 09:22:29 UTC
Created attachment 15552 [details]
oops with the patch

This is the result with the patch - the other test has to wait.
Comment 17 Matthias Bläsing 2008-04-01 13:54:50 UTC
I had a look at the debian acpi script - and wow! - there is a ton of things that are done at resume. I had a closer look and noticed these lines in a file called 72-acpi-pain.sh (yep - it causing the pain ;-):

modprobe -r battery
modprobe battery

I commented these two lines out and voila - five suspend-resume cylcles worked as expected. This is no fix for the core of the bug, but at my system works as expected.
Comment 18 Zhang Rui 2008-04-01 17:53:56 UTC
Created attachment 15557 [details]
flush kacpi_notify_wq before removing notify handler

Please try this patch and see if it fixes the problem. :)
Comment 19 Matthias Bläsing 2008-04-03 13:19:37 UTC
Tried - and it seams to work :-)

I think I see other problems (usb seems to fail at resume after some cycles, but this does not need to be correlated). So currently I can put my notebook to sleep (actually I used it the whole day this way). I'll have a closer look tomorrow, but I think this looks good.

This feels just great - thanks!
Comment 20 Matthias Bläsing 2008-04-04 14:42:17 UTC
Not sure whether this is related to this bug, but I played with kernel 2.6.25-rc8 , applied the above patch and - well - I got no ooops, but otherwise I have a feeling (not sure who to put it into a good bugreport), that the acpi subsystem got some problems when moving from 2.6.23 to 2.6.24 (see this bug) and got even more on the way to 2.6.25. The battery reporting is flaky after a resume or after removing the battery module and reinsrting it (hal reports the battery absent and after a second present and this goes on for some seconds). The multimedia keys are reported with a delay of several seconds (maybe caused by switching away from the deprecated asus-acpi module to the asus-laptop module did this ...)

With respect to the battery problem - running the acpi utility in direct runs without pause longer than needed to recall the command:

mblaesing@prometheus:~$ acpi -a
     Battery 1: charged, 100%, rate information unavailable.
  AC Adapter 1: on-line
mblaesing@prometheus:~$ acpi -a
     Battery 1: charged, 100%, rate information unavailable.
     Battery 2: charged, 100%, rate information unavailable.
  AC Adapter 1: on-line
mblaesing@prometheus:~$ acpi -a
     Battery 1: charged, 100%, rate information unavailable.
  AC Adapter 1: on-line
mblaesing@prometheus:~$ acpi -a
     Battery 1: charged, 100%, rate information unavailable.
  AC Adapter 1: on-line
mblaesing@prometheus:~$ acpi -a
     Battery 1: charged, 100%, rate information unavailable.
     Battery 2: charged, 100%, rate information unavailable.
  AC Adapter 1: on-line
mblaesing@prometheus:~$ 

I thing Battery 1 has some defect lithium cells, but that never lead to this flaky behaviour. I'll run again with 2.6.23 tomorrow.
Comment 21 Zhang Rui 2008-04-04 19:29:01 UTC
seems this bug is fixed by the patch in comment #18.
And the problem you reported is alway exists with or without this patch, right?So I think we can mark this bug as PATCH_AVAILABLE now.
Please open another bug for the problem you described in comment#20. :)
As this is a regression, it would be great if you can use git-bisect to narrow down the problem to specific version/commit.

The reason of this oops is that ACPI gets several battery notification after resume, and queues them for deffered execution, which only keeps a pointer to the notify handler in kacpi_notify_wq. User space script is invoked at this time, which removes battery module/notify handler and causes the oops.

Patch in comment#18 flush kacpi_notify_wq before removing the notify handler. 
It is one of the patch series in bug #9772.
Comment 22 Len Brown 2008-10-16 22:45:39 UTC
patch in comment #18 shipped in 2.6.26 -- closed:

Author: Zhang Rui <rui.zhang@intel.com>  2008-04-29 02:34:42
Committer: Len Brown <len.brown@intel.com>  2008-04-29 02:34:42
Parent: f1241c87a16c4fe9f4f51d6ed3589f031c505e8d (Add down_timeout and change ACPI to use it)
Branches: many (28)
Follows: v2.6.25
Precedes: v2.6.26-rc1

    flush kacpi_notify_wq before removing notify handler