Bug 13109

Summary: High latency on /sys/class/thermal
Product: ACPI Reporter: Rafael J. Wysocki (rjw)
Component: ECAssignee: Alexey Starikovskiy (astarikovskiy)
Status: CLOSED WILL_NOT_FIX    
Severity: normal CC: acpi-bugzilla, lenb, rui.zhang, tiagosbatista
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc1-1002-gac91f91 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: The result of booting with the new dsdt
The the output with loglevel increased
patch: enable EC debug
dmesg with ec debug patch
patch that appears to have solved the issue (based on #12011)
dmesg of vanilla + patches from attachments 21061 and 21057
debug sbs notify
try to query EC until it clears query bit
dmesg with all patches applied as requested
try to query EC until it clears query bit
dmesg with 21371, 21061, 21057
Run SBS callback under mutex
make the callback save the previous state
Run SBS callback under mutex #2
dsdt with the new sbs patch
Run SBS callback under mutex #3
dmesg with 21711
debug sbs mutexes
dmesg with previous patch

Description Rafael J. Wysocki 2009-04-16 21:41:31 UTC
Subject    : High latency on /sys/class/thermal
Submitter  : "Tiago Simões Batista" <tiagosbatista@gmail.com>
Date       : 2009-04-11 14:56
References : http://marc.info/?l=linux-kernel&m=123946182301248&w=4
Handled-By : Zhang Rui <rui.zhang@intel.com>

This entry is being used for tracking a regression from 2.6.29.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Zhang Rui 2009-04-17 01:01:51 UTC
Tiago, please do the test I mentioned at
http://marc.info/?l=linux-kernel&m=123967517427048&w=4
and attach the test result here.
Comment 2 Tiago 2009-04-17 18:17:26 UTC
Created attachment 21033 [details]
The result of booting with the new dsdt

Sorry for the delay, my workload spiked abnormaly this week...

Attatched is the dmesg with the dsdt you supplied, using a vanilla kernel checked out today. The lines that show the latency I originally described are:

[  223.173279] ACPI: Execute Method [\_TZ_.THRM._TMP] (Node f7016378)
[  225.374267] ACPI: Execute Method [\_SB_.PCI0.LPC0.EC0_._Q80] (Node f70143d8)
[  225.374467] ACPI: Execute Method [\_TZ_.THRM._TMP] (Node f7016378)
[  231.139302]  thermal-0263 [00] thermal_get_temperatur: Temperature is 3222 dK
[  231.150163]  thermal-0263 [00] thermal_get_temperatur: Temperature is 3222 dK
[  231.150178] ACPI: Execute Method [\_TZ_.PFN0._STA] (Node f7016168)
[  231.150228] ACPI: Execute Method [\_TZ_.PFN0._STA] (Node f7016168)

The last time I tried, I could not recover from s3 sleep, I will try this again today using this dsdt as soon as I finish writing this, and I will report the result.
Comment 3 Tiago 2009-04-17 18:31:55 UTC
Still no luck on the s3 front, I never had much luck wit standby on this laptop... I am compiling a new tuxonice kernel as I write, if this time I succeed in resumming from suspend, I'll post back.
Comment 4 Tiago 2009-04-17 19:11:14 UTC
Created attachment 21034 [details]
The the output with loglevel increased

Ok, I was finding it strange I did not get the output from the dsdt as I expected.

I increased the acpi loglevel to:
Description              	Hex        SET
ACPI_LV_INIT             	0x00000001 [*]
ACPI_LV_DEBUG_OBJECT     	0x00000002 [*]
ACPI_LV_INFO             	0x00000004 [*]
ACPI_LV_INIT_NAMES       	0x00000020 [*]
ACPI_LV_PARSE            	0x00000040 [*]
ACPI_LV_LOAD             	0x00000080 [*]
ACPI_LV_DISPATCH         	0x00000100 [*]
ACPI_LV_EXEC             	0x00000200 [ ]
ACPI_LV_NAMES            	0x00000400 [*]
ACPI_LV_OPREGION         	0x00000800 [*]
ACPI_LV_BFIELD           	0x00001000 [*]
ACPI_LV_TABLES           	0x00002000 [*]
ACPI_LV_VALUES           	0x00004000 [*]
ACPI_LV_OBJECTS          	0x00008000 [*]
ACPI_LV_RESOURCES        	0x00010000 [*]
ACPI_LV_USER_REQUESTS    	0x00020000 [*]
ACPI_LV_PACKAGE          	0x00040000 [*]
ACPI_LV_ALLOCATIONS      	0x00100000 [ ]
ACPI_LV_FUNCTIONS        	0x00200000 [ ]
ACPI_LV_OPTIMIZATIONS    	0x00400000 [ ]
ACPI_LV_MUTEX            	0x01000000 [ ]
ACPI_LV_THREADS          	0x02000000 [ ]
ACPI_LV_IO               	0x04000000 [ ]
ACPI_LV_INTERRUPTS       	0x08000000 [ ]

and the debug layer to
Description              	Hex        SET
ACPI_LV_INIT             	0x00000001 [*]
ACPI_LV_DEBUG_OBJECT     	0x00000002 [*]
ACPI_LV_INFO             	0x00000004 [*]
ACPI_LV_INIT_NAMES       	0x00000020 [*]
ACPI_LV_PARSE            	0x00000040 [*]
ACPI_LV_LOAD             	0x00000080 [*]
ACPI_LV_DISPATCH         	0x00000100 [*]
ACPI_LV_EXEC             	0x00000200 [ ]
ACPI_LV_NAMES            	0x00000400 [*]
ACPI_LV_OPREGION         	0x00000800 [*]
ACPI_LV_BFIELD           	0x00001000 [*]
ACPI_LV_TABLES           	0x00002000 [*]
ACPI_LV_VALUES           	0x00004000 [*]
ACPI_LV_OBJECTS          	0x00008000 [*]
ACPI_LV_RESOURCES        	0x00010000 [*]
ACPI_LV_USER_REQUESTS    	0x00020000 [*]
ACPI_LV_PACKAGE          	0x00040000 [*]
ACPI_LV_ALLOCATIONS      	0x00100000 [ ]
ACPI_LV_FUNCTIONS        	0x00200000 [ ]
ACPI_LV_OPTIMIZATIONS    	0x00400000 [ ]
ACPI_LV_MUTEX            	0x01000000 [ ]
ACPI_LV_THREADS          	0x02000000 [ ]
ACPI_LV_IO               	0x04000000 [ ]
ACPI_LV_INTERRUPTS       	0x08000000 [ ]

I know, this is probably too much information... anyway, the latency is shown on line 114, there the timestamp jumps from 1544 to 1566. I hope this helps you!

If you need any more information, just ask.
Comment 5 Tiago 2009-04-17 19:12:52 UTC
Sorry about the noise... The debug layer should read:
Description              	Hex        SET
ACPI_UTILITIES           	0x00000001 [*]
ACPI_HARDWARE            	0x00000002 [*]
ACPI_EVENTS              	0x00000004 [*]
ACPI_TABLES              	0x00000008 [*]
ACPI_NAMESPACE           	0x00000010 [*]
ACPI_PARSER              	0x00000020 [*]
ACPI_DISPATCHER          	0x00000040 [*]
ACPI_EXECUTER            	0x00000080 [*]
ACPI_RESOURCES           	0x00000100 [*]
ACPI_CA_DEBUGGER         	0x00000200 [*]
ACPI_OS_SERVICES         	0x00000400 [*]
ACPI_CA_DISASSEMBLER     	0x00000800 [*]
ACPI_COMPILER            	0x00001000 [*]
ACPI_TOOLS               	0x00002000 [*]
ACPI_BUS_COMPONENT       	0x00010000 [*]
ACPI_AC_COMPONENT        	0x00020000 [*]
ACPI_BATTERY_COMPONENT   	0x00040000 [*]
ACPI_BUTTON_COMPONENT    	0x00080000 [*]
ACPI_SBS_COMPONENT       	0x00100000 [*]
ACPI_FAN_COMPONENT       	0x00200000 [*]
ACPI_PCI_COMPONENT       	0x00400000 [*]
ACPI_POWER_COMPONENT     	0x00800000 [*]
ACPI_CONTAINER_COMPONENT 	0x01000000 [*]
ACPI_SYSTEM_COMPONENT    	0x02000000 [*]
Comment 6 Zhang Rui 2009-04-20 03:53:49 UTC
>[ 1544.436138] evregion-0401 [00] ev_address_space_dispa: Handler f710ffa0
>(@c0343441) Address 0000000000000058 [EmbeddedControl]
>[ 1566.328052]  exfldio-0535 [00] ex_field_datum_io     : Value Read
>0000000000000031, Width 1

this confirms my conjecture at the beginning.

this may be a regression caused by the recent EC MSI changes.
Alexey, can you look at this issue please?
Comment 7 Zhang Rui 2009-04-20 08:47:41 UTC
Created attachment 21057 [details]
patch: enable EC debug

the patch is for another bug report, but you can use it to enable the EC debug option.
please rebuild your kernel with this patch applied, and redo the test in
http://marc.info/?l=linux-kernel&m=123959178026836&w=4
but without "acpi.debug_level=0x07 acpi.debug_layer=0x04000018" this time.
Comment 8 Alexey Starikovskiy 2009-04-20 12:14:15 UTC
please check if last patch in #12011 makes any difference.
Comment 9 Tiago 2009-04-20 20:26:29 UTC
Created attachment 21060 [details]
dmesg with ec debug patch

Attached is what remains of my dmesg with the ec debug patch. I tried the cat command, but I am not even shure that the command shows here, there is just too much output for me to make anything of it.

I will try to apply the other mentioned patch. I tried it earlier but it conflicted with the current kernel tree. I will now update my tree, try to solve the conflicts, and I will post back the new patch to make shure the conflicts are well resolved. Oh, and any more results I get!

I had to host the attachment on mediafire, I hope thats ok...
Comment 10 Tiago 2009-04-20 22:13:59 UTC
Created attachment 21061 [details]
patch that appears to have solved the issue (based on #12011)

This is a patch based on the last one from #12011, I merged it as I belived it was planned by the author. Please review it!

I belive you have solved the problem, the latency is gone on my vanilla kernel with this patch. I will now recompile without it to make shure this is really the one that solves it. As far as I have seen, nothing is broken by this. 

A big thank you for every one that had the patience to put up with me :)

 time for ((i=0; i < 100; i++)) do cat /sys/class/thermal/thermal_zone0/temp >/dev/null ;done

real	0m0.471s
user	0m0.015s
sys	0m0.087s


Thank you all!
Comment 11 Tiago 2009-04-21 12:26:43 UTC
Ok, after a few reboots, I am experiencing a problem.

Most of the times when I boot, I jave a sloggish mouse, ans the latency on sys/thermal is huge. The other times, everything works flawlessly...

I will try to compile a current kernel without the patch I attached (I am at work now) to try to figure out if the problem presists on the mainline kernel.

Here is a part of the log when the mouse is sluggish 

Apr 21 13:15:21 xato [   84.738315] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 1 bytes away.
Apr 21 13:15:28 xato [   92.056939] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 2 bytes away.
Apr 21 13:15:29 xato [   92.567151] psmouse.c: resync failed, issuing reconnect request
Apr 21 13:15:32 xato [   96.006025] Unable to query Synaptics hardware.
Apr 21 13:15:35 xato [   98.522918] input: PS/2 Synaptics TouchPad as /class/input/input8
Comment 12 Tiago 2009-04-22 16:20:14 UTC
I can now confirm that the problem with the psmouse is present only if the last patch is present.
Comment 13 Tiago 2009-04-22 17:10:59 UTC
Ok, I started again, now with the knowledge that the culprit is the ec driver. The bisect was fast and fruitful!

I ended up reverting commit
34ff4dbccccce54c83b1234d39b7ad9e548a75dd

and this seems to have solved the problem (only 1 boot on the new kernel).

Now, the patch I reverted was intended to minimize the delay on the ec poll, right? So, how does increasing poll time reduces the latency I was experiencing?
Comment 14 Rafael J. Wysocki 2009-04-25 21:24:47 UTC
Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=21061
Notify-Also : Zhang Rui <rui.zhang@intel.com>
Comment 15 Alexey Starikovskiy 2009-05-07 22:50:26 UTC
Could you please attach the dmesg output from run with patch from comment #10 and comment #7?

patch 34ff4dbccccce54c83b1234d39b7ad9e548a75dd changes spin-delay of 500microsec with a sleep of no-less-than 1ms. So, it should increase ec poll time.

also, could you check how fast you get new ACPI interrupts (cat /proc/interrupts)?
Comment 16 Tiago 2009-05-08 16:04:28 UTC
Created attachment 21278 [details]
dmesg of vanilla + patches from attachments 21061 and 21057

Atached part of the dmesg after booting an updated vanilla kernel with the requested patches. The whole file was 15M and still, it did not get the initial kernel startup, if that is what you are interested in, I can try to catch it...

The number of acpi interrupts per second, sampled over a one minute timespan, with gnome power manager disabled:

vanilla: 1.9
vanilla + patches requested above: 231.1 (sluggish mouse)
vanilla - 34ff4dbccccce54c83b1234d39b7ad9e548a75dd + tuxonice (compiled on apr 27): 53.6
Comment 17 Alexey Starikovskiy 2009-05-08 20:24:58 UTC
Created attachment 21281 [details]
debug sbs notify

it looks like your machine tries to send too many 0x20 queries, which are alarms for SBS (battery and AC charger in your case).
Your DSDT has _Q20 method which tries to clear alarm bit if notification comes from SBS Manager, but this method should not be executed if SBS drivers are loaded -- sbs.ko. Problem is that Query bit in status register is not cleared after we issue single query command, so we either should try to implement this _Q20 in sbshc.c (by ignoring alarms coming from manager), or cycle query command until query bit is cleared.
Comment 18 Alexey Starikovskiy 2009-05-08 20:47:04 UTC
Created attachment 21282 [details]
try to query EC until it clears query bit

please check both patches on top of previous two and attach dmesg here.
Comment 19 Tiago 2009-05-09 19:03:46 UTC
Created attachment 21288 [details]
dmesg with all patches applied as requested

This is looking good, the dmesg from the first boot with all patches is attached, and I am now seeing 6.06 acpi interrupts per second. No more sluggish mouse!

I would say this solved it, but not yet, I still want to reboot a few times before I can make that claim!
Comment 20 Tiago 2009-05-11 10:27:35 UTC
I have now used the same kernel for some time, and an updated version with the same set of pathes. I can now say with confidence that all is well.

As far as I am concerned, this bug is solved if all the patches make it to mainline. 

Thank you all

Tiago
Comment 21 Rafael J. Wysocki 2009-05-13 10:36:03 UTC
Patch : http://bugzilla.kernel.org/attachment.cgi?id=21282
Comment 22 Len Brown 2009-05-16 02:49:43 UTC
alexey, do you have a patch here that is ready to submit to the list?
Comment 23 Alexey Starikovskiy 2009-05-16 06:47:51 UTC
Created attachment 21371 [details]
try to query EC until it clears query bit

Please check if this patch alone on top of 21061 and 21057 still helps.
Comment 24 Tiago 2009-05-17 02:45:13 UTC
Created attachment 21379 [details]
dmesg with 21371, 21061, 21057

This does not solve the problem!

I still see a lot of interrupts, I probably made a mistake collecting results because I am seeing over 1000 interrupts per second. Yet the mouse is unusable and the keyboard is hardly usable, I had to press each key for about half a second in order to register a keystroke!

I will try to apply the sbs path on top of this, I have tried something like this before, and it seems to have solved the problem...
Comment 25 Tiago 2009-05-17 03:04:57 UTC
I applied 21281 on top of the patches mentioned on my last comment. As I expected, the mouse became usable again, and the interrupt count came down to 16.6 per second.

This just crossed my mind... Could this be caused by a faulty battery? Should I test without the battery?
Comment 26 Alexey Starikovskiy 2009-05-17 06:32:50 UTC
Tiago,
Could you please remove the #24 patch now? Please also check that you still have notifications about AC adapter insertion/removal.
Comment 27 Tiago 2009-05-17 13:38:42 UTC
I assume that you meant to remove patches 21371, 21061, 21057, and leave only the sbs patch.

As you expected, I lost the AC notifications. I did not realize this before, because I seldom unplug my laptop as my battery is near its end of life.

But the ec patches do have a positive effect if applied on top of the sbs patch. with all patches, the time to perform 100 reads of /sys/class/thermal/thermal_zone0/temp is about 0.6 seconds, but with just the sbs patch, the time is around 1.2 seconds, so patches 21061 and 21371 do have a positive effect if the system is not receiving a lot of interrupts!
Comment 28 Alexey Starikovskiy 2009-05-28 23:23:01 UTC
Created attachment 21613 [details]
Run SBS callback under mutex

Here is one more patch to try.
Comment 29 Tiago 2009-05-29 18:58:43 UTC
Created attachment 21627 [details]
make the callback save the previous state

No go!

I applied your patch on a vanilla kernel, and recompiled with each of the supplied patches in order, I tried all but the patch sbshc patch.

After all the patches are applied, I still can not use the touchpad, and I get no notification on ac removal.

I did however try a little tweak on top of your patch that did give me the notifications back, but not the touchpad. The patch I attached applies on top of yours.

If I read the  suggested patch correctly, the mutex_trylock() may miss some events, and given that I am seeing a lot of events per second, I tried to convert saved* to static, and initialize it with random values. this may cause extra events on bootup, but at that time, I do not think any userland program will act upon them.

I am planning to further test the patch I attached, but I welcome your opinion, as I do not like the way I initialized the variables, and I am not sure how static variables are seen by kernel programmers...
Comment 30 Alexey Starikovskiy 2009-05-29 21:07:10 UTC
Created attachment 21629 [details]
Run SBS callback under mutex #2

Then probably we should not clear the alarm bit until the notification is served.
Please attach dmesg output with DEBUG in ec.c enabled if this patch does not help, as it is hard to guess what is going on... :(
Comment 31 Tiago 2009-05-30 08:59:54 UTC
Created attachment 21630 [details]
dsdt with the new sbs patch

This one does give me the ac adapter notifications back, but at the cost of my touchpad. If I have to choose, I take the touchpad.

One new thing I never noticed before, but is the same with this patch and the previous, and I expect it to be the same even with no sbs patch. After some time, some of the touchpad features come back, I can actually move the pointer and tapp, but every advanced feature as scrooling or corner tapping never recovers.

I got the attached dmesg right after I noticed that the touchpad had returned (the cpu went down to idle). These two lines keep appearing in the syslog if I try to move the pointer before things "settle down":

May 30 09:37:06 xato [  224.105345] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 2 bytes away.
May 30 09:37:06 xato [  224.618568] psmouse.c: resync failed, issuing reconnect request


Hope this helps you!
Comment 32 Alexey Starikovskiy 2009-06-02 14:17:42 UTC
Created attachment 21711 [details]
Run SBS callback under mutex #3

Please check if this added throttling helps.
Please also attach files directly to bug report, it helps later.
Comment 33 Tiago 2009-06-02 23:32:00 UTC
Created attachment 21718 [details]
dmesg with 21711

Ok, this patch seems to solve all but the charger notification. Low latency on the /sys interface, and touchapad working as well as expected. Please note that this is the first boot, I need further testing before making a final assessment as I found some discrepancies between reboots earlier.


The dmesg of this boot is attached. Earlier I did not attach in line because the files, even compressed were over 1M. I was trying to supply as much information as possilbe, much of it were repeated lines, but hidden among them I was hopping for a clue to the solution. If that happens again, I will truncate or split the file to avoid loosing data on external (crappy) sites!



I believe the problem is that throttled events may never be noticed.

on smbus_alarm() you have:

        if (alarm_is_pending ||
            time_before(jiffies, alarm_time + msecs_to_jiffies(1000)))
                goto unlock;

witch means we may skip some interrupts. But on the callback you have:

        u8 saved_charger_state = sbs->charger_present;
        u8 saved_battery_state;
        /* check if some parallel SBS access is going on */
        mutex_lock(&sbs->lock);
        acpi_ac_get_present(sbs);
        if (sbs->charger_present != saved_charger_state) {

You save the charger_state on the beginning of the _current event_ handler, read the current state, and compare it with the saved state.

Can the sbs->charger_present be updated by anything other than the acpi_ac_get_present(sbs) call? if so, when you get the mutex, the state of sbs->charger_present may already have been updated.
Comment 34 Tiago 2009-06-03 09:54:29 UTC
Created attachment 21725 [details]
debug sbs mutexes

Ok, forget my previous post! After sleeping on it, I applied the attached patch. I will attach the dmesg next.

The result is rather strange, I was expecting a lot of "_________ sbshc about to lock mutex" in my log messages, but no such thing. Only once... Can there be a deadlock somewhere?

I hope this helps you... It just puzzled me even more!
Comment 35 Tiago 2009-06-03 09:56:24 UTC
Created attachment 21726 [details]
dmesg with previous patch
Comment 36 Alexey Starikovskiy 2009-06-03 18:19:16 UTC
Actually, mutex should be locked only if alarm bit is set. And it should be set only when either AC charger or battery have changed state (have something to report).
Thus, you should see attempt to lock mutex in response to AC connect/disconnect.
You may also check if something has deadlocked by reading either /proc/acpi/battery/*/* or sysfs battery files. In case of a deadlock this read will never return.
Comment 37 Rafael J. Wysocki 2009-06-07 12:36:18 UTC
Ignore-Patch : http://bugzilla.kernel.org/attachment.cgi?id=21282
Comment 38 Rafael J. Wysocki 2009-06-07 12:41:38 UTC
Ignore-Patch : http://bugzilla.kernel.org/attachment.cgi?id=21061
Comment 39 Zhang Rui 2009-08-10 01:02:28 UTC
hi, Alexey,
any updates on this bug?
Comment 40 Alexey Starikovskiy 2009-08-12 22:53:29 UTC
Zhang, we may close this bug as "will not fix", as sufficient workaround (set HZ to 1000 exists). As I understand, reporter is not interested in any further investigation of the issue.
Comment 41 Zhang Rui 2009-08-13 00:50:16 UTC
done. thanks. :)