Bug 10483 - lockup in sbshc.c in wait_transaction_complete - Acer TravelMate 4502WLMi - 2.6.25 regression
Summary: lockup in sbshc.c in wait_transaction_complete - Acer TravelMate 4502WLMi - 2...
Status: CLOSED DUPLICATE of bug 9998
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Battery (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Alexey Starikovskiy
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-19 06:07 UTC by Oldřich Jedlička
Modified: 2008-10-16 22:34 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.25
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Patch for drivers/acpi (sbs.c, sbshc.c) to report method entry/return and locking (13.97 KB, patch)
2008-04-19 06:09 UTC, Oldřich Jedlička
Details | Diff
Configuration used for 2.6.25 (52.43 KB, text/plain)
2008-04-19 06:10 UTC, Oldřich Jedlička
Details
lspci -v (7.41 KB, text/plain)
2008-04-19 06:10 UTC, Oldřich Jedlička
Details
/proc/cpuinfo (452 bytes, text/plain)
2008-04-19 06:11 UTC, Oldřich Jedlička
Details
Problem is here: /var/log/messages (340.74 KB, text/plain)
2008-04-19 06:16 UTC, Oldřich Jedlička
Details
2.6.26-rc1: /var/log/messages (118.26 KB, text/plain)
2008-05-05 14:38 UTC, Oldřich Jedlička
Details
2.6.26-rc6: /var/log/messages (143.61 KB, application/x-tgz)
2008-06-14 14:14 UTC, Oldřich Jedlička
Details
2.6.26-rc6: config (52.90 KB, text/plain)
2008-06-14 14:18 UTC, Oldřich Jedlička
Details
Serialize SBSHC alarms (2.24 KB, patch)
2008-06-27 02:28 UTC, Alexey Starikovskiy
Details | Diff
try the debug patch (1.92 KB, patch)
2008-08-10 06:54 UTC, ykzhao
Details | Diff
2.6.26: /var/log/messages (46.72 KB, application/x-tgz)
2008-08-10 13:24 UTC, Oldřich Jedlička
Details
try the debug patch (545 bytes, patch)
2008-08-14 02:12 UTC, ykzhao
Details | Diff
Patch for wait_transaction_complete (694 bytes, patch)
2008-08-16 06:31 UTC, Oldřich Jedlička
Details | Diff
[UPDATE] Patch for drivers/acpi (sbs.c, sbshc.c) to report method entry/return and locking (14.29 KB, patch)
2008-08-16 06:45 UTC, Oldřich Jedlička
Details | Diff
2.6.26: /var/log/messages with workaround (31.02 KB, application/x-tgz)
2008-08-16 07:15 UTC, Oldřich Jedlička
Details

Description Oldřich Jedlička 2008-04-19 06:07:52 UTC
Latest working kernel version: 2.6.23
Earliest failing kernel version: 2.6.24
Distribution: Gentoo
Hardware Environment: notebook Acer TravelMate 4502WLMi, Pentium M 1.6GHz, ATI
Mobility Radeon 9700, two monitors (panel+LCD)
Software Environment: N/A
Problem Description:

Reading of /proc/acpi/battery/BAT0/state ends in an endless loop. The first symptom was that the hald didn't want to finish startup and I figured out that the problem is in reading the battery status.

I patched (attached) sbs.c and sbshc.c to report method entry/return and also locking and I found that the execution ends in wait_transaction_complete (sbshc.c) in an endless wait_event_timeout - really strange.
Comment 1 Oldřich Jedlička 2008-04-19 06:09:25 UTC
Created attachment 15809 [details]
Patch for drivers/acpi (sbs.c, sbshc.c) to report method entry/return and locking
Comment 2 Oldřich Jedlička 2008-04-19 06:10:15 UTC
Created attachment 15810 [details]
Configuration used for 2.6.25
Comment 3 Oldřich Jedlička 2008-04-19 06:10:49 UTC
Created attachment 15811 [details]
lspci -v
Comment 4 Oldřich Jedlička 2008-04-19 06:11:30 UTC
Created attachment 15812 [details]
/proc/cpuinfo
Comment 5 Oldřich Jedlička 2008-04-19 06:16:53 UTC
Created attachment 15813 [details]
Problem is here: /var/log/messages
Comment 6 Oldřich Jedlička 2008-04-24 04:02:17 UTC
Note: I again tried 2.6.24 (once) and it seems it boots-up well, so the regression is probably only in 2.6.25. I do not remember completely why I didn't use 2.6.24, but it was maybe because the battery status didn't get updated correctly (and also because of bug #9910). I will have a look at the latest 2.6.24 and provide some details at the end of the week. Thanks.
Comment 7 Oldřich Jedlička 2008-04-28 10:36:28 UTC
So the problem is in 2.6.24 already. I saw a timeout in reading from wait_transaction_complete after 4 minutes and 55 seconds. There are two questions:

1. Why the operation timed-out?
Some problem in EC code? I can provide more logs, but which kernel parameter enables EC logging?

2. Why it took so long (295 seconds instead of 1 second)?
Maybe some milliseconds->jiffies translation problem?
Comment 8 ykzhao 2008-05-05 00:32:42 UTC
Hi
   From the log in comment #5 it seems that OSPM will always call the function of smb_check_done after the lockup. But there is no wakeup event for the SBS HC controller. It looks inconsistent.
   >SBS HC: smb_check_done, entry
   >Apr 19 14:31:08 oldium ACPI: SBS HC: smb_check_done, return
   Will you please try the latest kernel and see whether the problem can be reproduced? Please attach the output of dmesg.
   Note: please apply the patch in comment #1.
   thanks.
   
Comment 9 Oldřich Jedlička 2008-05-05 00:47:36 UTC
I will try it with latest rc (2.6.26-rc1) in the evening. Logs from 2.6.25 doesn't contain the wake-up, because I didn't wait this long to get it (it might come). Anyway, I will wait for wake-up with 2.6.26-rc1.
Comment 10 Oldřich Jedlička 2008-05-05 14:38:07 UTC
Created attachment 16030 [details]
2.6.26-rc1: /var/log/messages

I started `watch -n 5 "cat /proc/acpi/battery/BAT0/state"` to see some output. I also unplugged the power cable (smbus_alarm at 23:17:44) and plugged it back (something around 23:20:09 or 23:20:24, I forgot to check the clock).

Just ask for more input, I will help as I can.
Comment 11 Oldřich Jedlička 2008-05-05 14:48:33 UTC
There is no long wait for timeout as can be seen in previous kernels. It looks like the battery status reading works. Anyway, there are still some "timed-out" results.

I will check tomorrow if the battery status changes in time (battery discharging), because I've seen the same "remaining capacity" value over this test - but the test was not long and there was no CPU load. I remember some problems with value stalling in 2.6.24, so I will check this tomorrow.
Comment 12 Oldřich Jedlička 2008-05-18 14:56:08 UTC
I've tried 2.6.26-rc2 (without my patch): no updates of /proc/acpi/battery/BAT0/state for more than 1 hour running on battery (the file had the same content). Do you need logs from 2.6.26-rc2? I think the problem should be visible (hopefully) in 2.6.26-rc1 logs already.

Please say if you need more logs from me, possibly with more logging options.
Comment 13 Oldřich Jedlička 2008-06-14 14:13:21 UTC
I've tried 2.6.26-rc6 now and the timeout doesn't expire again. See attached logs.
Comment 14 Oldřich Jedlička 2008-06-14 14:14:18 UTC
Created attachment 16476 [details]
2.6.26-rc6: /var/log/messages
Comment 15 Oldřich Jedlička 2008-06-14 14:18:11 UTC
Created attachment 16477 [details]
2.6.26-rc6: config
Comment 16 Alexey Starikovskiy 2008-06-15 14:35:41 UTC
thanks for report
Comment 17 Oldřich Jedlička 2008-06-26 14:03:25 UTC
I tried 2.6.26-rc8, but no change. Is the patch you mention in PATCH_ALREADY_AVAILABLE resolution in the kernel already? May I try the patch myself?
Comment 18 Alexey Starikovskiy 2008-06-27 01:23:29 UTC
Oldrich,
I misunderstood your comment #14. I read it that you don't have problem with -rc6...
Comment 19 Alexey Starikovskiy 2008-06-27 02:28:02 UTC
Created attachment 16644 [details]
Serialize SBSHC alarms

Please check if this patch helps.
Comment 20 Oldřich Jedlička 2008-07-03 14:07:49 UTC
Sorry for the late response, but I'm a little bit bussy. The patch makes no difference, `cat /proc/acpi/battery/BAT0/state` doesn't return (I waited two minutes).
Comment 21 ykzhao 2008-08-10 06:54:13 UTC
Created attachment 17168 [details]
try the debug patch

Will you please try the debug patch and see whether the problem still exists?
Comment 22 Oldřich Jedlička 2008-08-10 13:24:00 UTC
Created attachment 17171 [details]
2.6.26: /var/log/messages

No success. The phenomenon is that the wait_transaction_complete (sbshc.c) uses wait_event_timeout with timeout 1000 [ms] that never expires. I don't know if there could be some conflict with something that smb_check_done does (call to smb_hc_read -> ec_read).

The attached log shows the problem in 2.6.26 (your patch plus my patch). The reading succeded once (`cat /proc/acpi/battery/BAT0/state`), but the second try failed - timeout didn't expire. Just to note: my former run succeded four times.
Comment 23 ykzhao 2008-08-14 02:12:47 UTC
Created attachment 17237 [details]
try the debug patch

Will you please try the debug patch?
In this debug patch even the timeout expires, OS will try to check the SMbus status again to avoid the bogus timeout.

Of course the patch in comment #21 and your patch are still required.

After the test, please attach the output of dmesg.
Thanks.
Comment 24 Oldřich Jedlička 2008-08-16 06:23:13 UTC
(In reply to comment #23)
> In this debug patch even the timeout expires, OS will try to check the SMbus
> status again to avoid the bogus timeout.

The problem is that the timeout NEVER expires (at least it looks like it never expires, maybe it will in an hour).

I think the core of the problem is hidden in the wait_event_timeout macro (include/linux/wait.h). It tries to execute a check function and if it fails, it tries to sleep for a "timeout" time. It can be woken up, in this case the check function is called again and if it fails, the sleep _continues_ where it stopped. The smb_check_done takes some time to execute, but this time is not counted in this "timeout". And it looks like the sleep is interrupted quite fast, so the rest of the timeout gets lower only slightly. I didn't verify/measure this hypothesis, but I rewrote the wait_transaction_complete function to really take at most "timeout" time and it works.

Question is why the sleep is interrupted so often, I didn't investigate this.

I will attach patch and (my) debug output.
Comment 25 Oldřich Jedlička 2008-08-16 06:31:38 UTC
Created attachment 17275 [details]
Patch for wait_transaction_complete

This patch workarounds the problem with "endless" wait_transaction_complete execution in sbshc.c.
Comment 26 Oldřich Jedlička 2008-08-16 06:45:56 UTC
Created attachment 17276 [details]
[UPDATE] Patch for drivers/acpi (sbs.c, sbshc.c) to report method entry/return and locking
Comment 27 Oldřich Jedlička 2008-08-16 07:15:40 UTC
Created attachment 17277 [details]
2.6.26: /var/log/messages with workaround

The /var/log/messages file still contains some "timed-out" lines, but it is not a big problem now, because the possible lockup doesn't take more than 1 second (plus minus).
Comment 28 ykzhao 2008-08-17 23:33:12 UTC
Will you please confirm whether the lockup issue disappears after applying the patch in comment #25? If the problem disappears, please send your patch to ACPI mail list and more people are involved with the discussion.
The patch in comment #25 is also reasonable, in which the polling mode is used and bogus smbus timeout is also avoided. 

Maybe what you said in comment #24 is right. But in fact the weired is that smb_check_done is called so many times. When the function of wait_event_timeout is called, there are two possible wakeup events. One is timeout and the other is smbus alarm event. Only after the process is waked will the function of smb_check_done be called. 

Thanks.
Comment 29 Oldřich Jedlička 2008-08-21 14:16:55 UTC
Thanks, I will write to the list after my vacation.
Comment 30 Adrian Bunk 2008-09-06 01:01:17 UTC
fixed by commit 266feefeb9ea2d846ac82eb6db1a54b230364ba4
Comment 31 Oldřich Jedlička 2008-09-11 11:46:13 UTC
I'm sorry, but I have to reopen the bug. The commited patch doesn't fix anything.

My problem is `wait_event_timeout` method itself. Moving discussion to ACPI mailing list.
Comment 32 Oldřich Jedlička 2008-09-15 11:39:43 UTC
How to mark this bug that the code fix (patch) is available in bug 9998? The patch from there is sufficient for me.
Comment 33 Alexey Starikovskiy 2008-09-15 11:42:28 UTC
Easy, mark as duplicate of that bug.
Thanks for your report and testing.

*** This bug has been marked as a duplicate of bug 9998 ***

Note You need to log in before you can comment on or make changes to this bug.