Bug 10265
Summary: | oops after resume from acpi S3 | ||
---|---|---|---|
Product: | ACPI | Reporter: | Matthias Bläsing (matthias.blaesing) |
Component: | ACPICA-Core | Assignee: | Zhang Rui (rui.zhang) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | acpi-bugzilla, bunk |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24.3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216 | ||
Attachments: |
output of acpi dump
Debug Log around the ooops syslog around the oops cat /proc/acpi/event gziped (decompressed) 11 MB syslog around oops debug patch oops with the patch flush kacpi_notify_wq before removing notify handler |
Description
Matthias Bläsing
2008-03-16 13:15:41 UTC
First, please try the lastest kernel and see if it still hangs. Second, as this is a regression, it would be great if you can do a git bisect to see which commit causes this oops. BTW: please attach the output of acpidump using the latest pmtools here: http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ Created attachment 15315 [details]
output of acpi dump
I tested with head of git and got the same results: BUG: unable to handle kernel paging request at f885e127 IP: [<f885e127>] *pde = 37c18067 *pte = 00000000 Oops: 0000 [#1] Modules linked in: thermal fan button ipw2200 ieee80211 ieee80211_crypt tg3 binfmt_misc radeon drm rfcomm l2cap bluetooth ppdev lp ipv6 acpi_cpufreq cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace fuse dm_crypt crypto_blkcipher dm_mod md_mod bay dock video output uinput sbp2 loop snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm usbhid snd_seq_dummy pcmcia snd_seq_oss snd_seq_midi nsc_ircc snd_rawmidi snd_seq_midi_event snd_seq irda parport_pc parport crc_ccitt snd_timer snd_seq_device snd pcspkr rtc ohci1394 soundcore snd_page_alloc firmware_class ieee1394 yenta_socket rsrc_nonstatic pcmcia_core i2c_i801 ehci_hcd uhci_hcd rng_core usbcore battery asus_acpi processor intel_agp agpgart joydev evdev unix [last unloaded: ac] Pid: 45, comm: kacpi_notify Not tainted (2.6.25-rc6 #2) EIP: 0060:[<f885e127>] EFLAGS: 00010286 CPU: 0 EIP is at 0xf885e127 EAX: f7c1cc5c EBX: f756f59c ECX: f7dcecc0 EDX: 00000080 ESI: f4cee770 EDI: c020a217 EBP: 00000000 ESP: f7ccdf90 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 Process kacpi_notify (pid: 45, ti=f7ccc000 task=f7cbe230 task.ti=f7ccc000) Stack: c020f099 f76f8540 f76f8548 c020a234 f7c3ba40 c0121fc7 f7c3ba40 c01224fa f7c3ba48 c01225a6 00000000 f7cbe230 c012454d f7ccdfc4 f7ccdfc4 f7c3ba40 c01224fa 00000000 c01243f3 c01243bd 00000000 c0105207 f7c41efc 00000000 Call Trace: [<c020f099>] acpi_ev_notify_dispatch+0x4c/0x55 [<c020a234>] acpi_os_execute_deferred+0x1d/0x25 [<c0121fc7>] run_workqueue+0x63/0xca [<c01224fa>] worker_thread+0x0/0xb7 [<c01225a6>] worker_thread+0xac/0xb7 [<c012454d>] autoremove_wake_function+0x0/0x2b [<c01224fa>] worker_thread+0x0/0xb7 [<c01243f3>] kthread+0x36/0x5b [<c01243bd>] kthread+0x0/0x5b [<c0105207>] kernel_thread_helper+0x7/0x10 ======================= Code: Bad EIP value. EIP: [<f885e127>] 0xf885e127 SS:ESP 0068:f7ccdf90 ---[ end trace 89ea4c9e06c68987 ]--- I tried a bisect, but was faced with errors, that make it difficult to decide/test whether the problem is the acpi part or some other subsystem. I have to look into this some more. Ok - the bisection is just frustrating - the testing of the acpi functionality with a crashing kernel or without the networking subsystem (the acpid tells me, that it can't open the event socket). So I'm a little stuck. Not sure where to go on. Does this happen occasionally or it happens every time back from S3? please set CONFIG_ACPI_DEBUG, "echo 0x88000011f > /sys/modules/acpi/parameters/debug_level" "echo 0x44 > /sys/modules/acpi/parameters/debug_layer" and attach the dmesg output after this oops. Created attachment 15353 [details]
Debug Log around the ooops
When I tried to reproduce the ooops to get the debug output, I noticed, that it happend approximately every 2-3 times I put the notebook into S3. I had to repeat the suspend-resume cyle three time, to get the log.
The attached log gives some context to the oops. The oops is logged in line 2978. After that the battery level and multimedia keys aren't reported anymore.
hah, too many debug messages are popped up. could you please re-do the test, but "echo 0x44 > /sys/modules/acpi/parameters/debug_layer" BEFORE "echo 0x88000011f > /sys/modules/acpi/parameters/debug_level"? :) Created attachment 15365 [details]
syslog around the oops
You won't like this, but at least at a first glance I don't see any change in the verbosity. On "good" thing though, the oops happend directly on the first try. But I can't see any usable pattern of circumstances, when it happens and when not.
It seams, that it only happens, when /proc/acpi/events is activated. But I have to have it running, as the asus-laptop/asus-acpi module generates acpi events for the multimedia keys. These seem not to be put into the input layer (I can for example get the sleep button key without using the acpid -- they are reported by hal). If this would be the case I might be able to get rid of acpid and by that get rid of the bug. matthias, thanks for debugging. (In reply to comment #7) > "echo 0x44 > /sys/modules/acpi/parameters/debug_layer" > BEFORE "echo 0x88000011f > /sys/modules/acpi/parameters/debug_level"? :) ~~~~ typo Please first kill acpid, and attach the result of "cat /proc/acpi/events" after s3 resume. And then please "echo 0x04 > /sys/modules/acpi/parameters/debug_layer" "echo 0x8800001f > /sys/modules/acpi/parameters/debug_level" and attach the dmesg after oops. IMO, this may be a battery driver bug that we should fix it in kernel rather than stop using acpid. :) hey, some evidence for the battery hypothesis is, that I tried to remove the battery module before sending the notebook into S3 and at least it came up again three times. I will attach the "cat /proc/acpi/events" output and the syslog of the debug run. Created attachment 15535 [details]
cat /proc/acpi/event
Created attachment 15536 [details]
gziped (decompressed) 11 MB syslog around oops
could you please attach the content of /etc/acpi/sleep.sh? I suspect this script pokes the battery sysfs I/F and causes the oops when resuming from S3. Please clear CONFIG_ACPI_SYSFS_POWER and see if it helps. Created attachment 15548 [details]
debug patch
Please apply this debug patch and set CONFIG_ACPI_SYSFS_POWER
and attach the dmesg output after the oops, you don't need to set the debug parameter this time. :)
Created attachment 15552 [details]
oops with the patch
This is the result with the patch - the other test has to wait.
I had a look at the debian acpi script - and wow! - there is a ton of things that are done at resume. I had a closer look and noticed these lines in a file called 72-acpi-pain.sh (yep - it causing the pain ;-): modprobe -r battery modprobe battery I commented these two lines out and voila - five suspend-resume cylcles worked as expected. This is no fix for the core of the bug, but at my system works as expected. Created attachment 15557 [details]
flush kacpi_notify_wq before removing notify handler
Please try this patch and see if it fixes the problem. :)
Tried - and it seams to work :-) I think I see other problems (usb seems to fail at resume after some cycles, but this does not need to be correlated). So currently I can put my notebook to sleep (actually I used it the whole day this way). I'll have a closer look tomorrow, but I think this looks good. This feels just great - thanks! Not sure whether this is related to this bug, but I played with kernel 2.6.25-rc8 , applied the above patch and - well - I got no ooops, but otherwise I have a feeling (not sure who to put it into a good bugreport), that the acpi subsystem got some problems when moving from 2.6.23 to 2.6.24 (see this bug) and got even more on the way to 2.6.25. The battery reporting is flaky after a resume or after removing the battery module and reinsrting it (hal reports the battery absent and after a second present and this goes on for some seconds). The multimedia keys are reported with a delay of several seconds (maybe caused by switching away from the deprecated asus-acpi module to the asus-laptop module did this ...) With respect to the battery problem - running the acpi utility in direct runs without pause longer than needed to recall the command: mblaesing@prometheus:~$ acpi -a Battery 1: charged, 100%, rate information unavailable. AC Adapter 1: on-line mblaesing@prometheus:~$ acpi -a Battery 1: charged, 100%, rate information unavailable. Battery 2: charged, 100%, rate information unavailable. AC Adapter 1: on-line mblaesing@prometheus:~$ acpi -a Battery 1: charged, 100%, rate information unavailable. AC Adapter 1: on-line mblaesing@prometheus:~$ acpi -a Battery 1: charged, 100%, rate information unavailable. AC Adapter 1: on-line mblaesing@prometheus:~$ acpi -a Battery 1: charged, 100%, rate information unavailable. Battery 2: charged, 100%, rate information unavailable. AC Adapter 1: on-line mblaesing@prometheus:~$ I thing Battery 1 has some defect lithium cells, but that never lead to this flaky behaviour. I'll run again with 2.6.23 tomorrow. seems this bug is fixed by the patch in comment #18. And the problem you reported is alway exists with or without this patch, right?So I think we can mark this bug as PATCH_AVAILABLE now. Please open another bug for the problem you described in comment#20. :) As this is a regression, it would be great if you can use git-bisect to narrow down the problem to specific version/commit. The reason of this oops is that ACPI gets several battery notification after resume, and queues them for deffered execution, which only keeps a pointer to the notify handler in kacpi_notify_wq. User space script is invoked at this time, which removes battery module/notify handler and causes the oops. Patch in comment#18 flush kacpi_notify_wq before removing the notify handler. It is one of the patch series in bug #9772. patch in comment #18 shipped in 2.6.26 -- closed: Author: Zhang Rui <rui.zhang@intel.com> 2008-04-29 02:34:42 Committer: Len Brown <len.brown@intel.com> 2008-04-29 02:34:42 Parent: f1241c87a16c4fe9f4f51d6ed3589f031c505e8d (Add down_timeout and change ACPI to use it) Branches: many (28) Follows: v2.6.25 Precedes: v2.6.26-rc1 flush kacpi_notify_wq before removing notify handler |