Subject : High latency on /sys/class/thermal Submitter : "Tiago Simões Batista" <tiagosbatista@gmail.com> Date : 2009-04-11 14:56 References : http://marc.info/?l=linux-kernel&m=123946182301248&w=4 Handled-By : Zhang Rui <rui.zhang@intel.com> This entry is being used for tracking a regression from 2.6.29. Please don't close it until the problem is fixed in the mainline.
Tiago, please do the test I mentioned at http://marc.info/?l=linux-kernel&m=123967517427048&w=4 and attach the test result here.
Created attachment 21033 [details] The result of booting with the new dsdt Sorry for the delay, my workload spiked abnormaly this week... Attatched is the dmesg with the dsdt you supplied, using a vanilla kernel checked out today. The lines that show the latency I originally described are: [ 223.173279] ACPI: Execute Method [\_TZ_.THRM._TMP] (Node f7016378) [ 225.374267] ACPI: Execute Method [\_SB_.PCI0.LPC0.EC0_._Q80] (Node f70143d8) [ 225.374467] ACPI: Execute Method [\_TZ_.THRM._TMP] (Node f7016378) [ 231.139302] thermal-0263 [00] thermal_get_temperatur: Temperature is 3222 dK [ 231.150163] thermal-0263 [00] thermal_get_temperatur: Temperature is 3222 dK [ 231.150178] ACPI: Execute Method [\_TZ_.PFN0._STA] (Node f7016168) [ 231.150228] ACPI: Execute Method [\_TZ_.PFN0._STA] (Node f7016168) The last time I tried, I could not recover from s3 sleep, I will try this again today using this dsdt as soon as I finish writing this, and I will report the result.
Still no luck on the s3 front, I never had much luck wit standby on this laptop... I am compiling a new tuxonice kernel as I write, if this time I succeed in resumming from suspend, I'll post back.
Created attachment 21034 [details] The the output with loglevel increased Ok, I was finding it strange I did not get the output from the dsdt as I expected. I increased the acpi loglevel to: Description Hex SET ACPI_LV_INIT 0x00000001 [*] ACPI_LV_DEBUG_OBJECT 0x00000002 [*] ACPI_LV_INFO 0x00000004 [*] ACPI_LV_INIT_NAMES 0x00000020 [*] ACPI_LV_PARSE 0x00000040 [*] ACPI_LV_LOAD 0x00000080 [*] ACPI_LV_DISPATCH 0x00000100 [*] ACPI_LV_EXEC 0x00000200 [ ] ACPI_LV_NAMES 0x00000400 [*] ACPI_LV_OPREGION 0x00000800 [*] ACPI_LV_BFIELD 0x00001000 [*] ACPI_LV_TABLES 0x00002000 [*] ACPI_LV_VALUES 0x00004000 [*] ACPI_LV_OBJECTS 0x00008000 [*] ACPI_LV_RESOURCES 0x00010000 [*] ACPI_LV_USER_REQUESTS 0x00020000 [*] ACPI_LV_PACKAGE 0x00040000 [*] ACPI_LV_ALLOCATIONS 0x00100000 [ ] ACPI_LV_FUNCTIONS 0x00200000 [ ] ACPI_LV_OPTIMIZATIONS 0x00400000 [ ] ACPI_LV_MUTEX 0x01000000 [ ] ACPI_LV_THREADS 0x02000000 [ ] ACPI_LV_IO 0x04000000 [ ] ACPI_LV_INTERRUPTS 0x08000000 [ ] and the debug layer to Description Hex SET ACPI_LV_INIT 0x00000001 [*] ACPI_LV_DEBUG_OBJECT 0x00000002 [*] ACPI_LV_INFO 0x00000004 [*] ACPI_LV_INIT_NAMES 0x00000020 [*] ACPI_LV_PARSE 0x00000040 [*] ACPI_LV_LOAD 0x00000080 [*] ACPI_LV_DISPATCH 0x00000100 [*] ACPI_LV_EXEC 0x00000200 [ ] ACPI_LV_NAMES 0x00000400 [*] ACPI_LV_OPREGION 0x00000800 [*] ACPI_LV_BFIELD 0x00001000 [*] ACPI_LV_TABLES 0x00002000 [*] ACPI_LV_VALUES 0x00004000 [*] ACPI_LV_OBJECTS 0x00008000 [*] ACPI_LV_RESOURCES 0x00010000 [*] ACPI_LV_USER_REQUESTS 0x00020000 [*] ACPI_LV_PACKAGE 0x00040000 [*] ACPI_LV_ALLOCATIONS 0x00100000 [ ] ACPI_LV_FUNCTIONS 0x00200000 [ ] ACPI_LV_OPTIMIZATIONS 0x00400000 [ ] ACPI_LV_MUTEX 0x01000000 [ ] ACPI_LV_THREADS 0x02000000 [ ] ACPI_LV_IO 0x04000000 [ ] ACPI_LV_INTERRUPTS 0x08000000 [ ] I know, this is probably too much information... anyway, the latency is shown on line 114, there the timestamp jumps from 1544 to 1566. I hope this helps you! If you need any more information, just ask.
Sorry about the noise... The debug layer should read: Description Hex SET ACPI_UTILITIES 0x00000001 [*] ACPI_HARDWARE 0x00000002 [*] ACPI_EVENTS 0x00000004 [*] ACPI_TABLES 0x00000008 [*] ACPI_NAMESPACE 0x00000010 [*] ACPI_PARSER 0x00000020 [*] ACPI_DISPATCHER 0x00000040 [*] ACPI_EXECUTER 0x00000080 [*] ACPI_RESOURCES 0x00000100 [*] ACPI_CA_DEBUGGER 0x00000200 [*] ACPI_OS_SERVICES 0x00000400 [*] ACPI_CA_DISASSEMBLER 0x00000800 [*] ACPI_COMPILER 0x00001000 [*] ACPI_TOOLS 0x00002000 [*] ACPI_BUS_COMPONENT 0x00010000 [*] ACPI_AC_COMPONENT 0x00020000 [*] ACPI_BATTERY_COMPONENT 0x00040000 [*] ACPI_BUTTON_COMPONENT 0x00080000 [*] ACPI_SBS_COMPONENT 0x00100000 [*] ACPI_FAN_COMPONENT 0x00200000 [*] ACPI_PCI_COMPONENT 0x00400000 [*] ACPI_POWER_COMPONENT 0x00800000 [*] ACPI_CONTAINER_COMPONENT 0x01000000 [*] ACPI_SYSTEM_COMPONENT 0x02000000 [*]
>[ 1544.436138] evregion-0401 [00] ev_address_space_dispa: Handler f710ffa0 >(@c0343441) Address 0000000000000058 [EmbeddedControl] >[ 1566.328052] exfldio-0535 [00] ex_field_datum_io : Value Read >0000000000000031, Width 1 this confirms my conjecture at the beginning. this may be a regression caused by the recent EC MSI changes. Alexey, can you look at this issue please?
Created attachment 21057 [details] patch: enable EC debug the patch is for another bug report, but you can use it to enable the EC debug option. please rebuild your kernel with this patch applied, and redo the test in http://marc.info/?l=linux-kernel&m=123959178026836&w=4 but without "acpi.debug_level=0x07 acpi.debug_layer=0x04000018" this time.
please check if last patch in #12011 makes any difference.
Created attachment 21060 [details] dmesg with ec debug patch Attached is what remains of my dmesg with the ec debug patch. I tried the cat command, but I am not even shure that the command shows here, there is just too much output for me to make anything of it. I will try to apply the other mentioned patch. I tried it earlier but it conflicted with the current kernel tree. I will now update my tree, try to solve the conflicts, and I will post back the new patch to make shure the conflicts are well resolved. Oh, and any more results I get! I had to host the attachment on mediafire, I hope thats ok...
Created attachment 21061 [details] patch that appears to have solved the issue (based on #12011) This is a patch based on the last one from #12011, I merged it as I belived it was planned by the author. Please review it! I belive you have solved the problem, the latency is gone on my vanilla kernel with this patch. I will now recompile without it to make shure this is really the one that solves it. As far as I have seen, nothing is broken by this. A big thank you for every one that had the patience to put up with me :) time for ((i=0; i < 100; i++)) do cat /sys/class/thermal/thermal_zone0/temp >/dev/null ;done real 0m0.471s user 0m0.015s sys 0m0.087s Thank you all!
Ok, after a few reboots, I am experiencing a problem. Most of the times when I boot, I jave a sloggish mouse, ans the latency on sys/thermal is huge. The other times, everything works flawlessly... I will try to compile a current kernel without the patch I attached (I am at work now) to try to figure out if the problem presists on the mainline kernel. Here is a part of the log when the mouse is sluggish Apr 21 13:15:21 xato [ 84.738315] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 1 bytes away. Apr 21 13:15:28 xato [ 92.056939] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 2 bytes away. Apr 21 13:15:29 xato [ 92.567151] psmouse.c: resync failed, issuing reconnect request Apr 21 13:15:32 xato [ 96.006025] Unable to query Synaptics hardware. Apr 21 13:15:35 xato [ 98.522918] input: PS/2 Synaptics TouchPad as /class/input/input8
I can now confirm that the problem with the psmouse is present only if the last patch is present.
Ok, I started again, now with the knowledge that the culprit is the ec driver. The bisect was fast and fruitful! I ended up reverting commit 34ff4dbccccce54c83b1234d39b7ad9e548a75dd and this seems to have solved the problem (only 1 boot on the new kernel). Now, the patch I reverted was intended to minimize the delay on the ec poll, right? So, how does increasing poll time reduces the latency I was experiencing?
Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de> Patch : http://bugzilla.kernel.org/attachment.cgi?id=21061 Notify-Also : Zhang Rui <rui.zhang@intel.com>
Could you please attach the dmesg output from run with patch from comment #10 and comment #7? patch 34ff4dbccccce54c83b1234d39b7ad9e548a75dd changes spin-delay of 500microsec with a sleep of no-less-than 1ms. So, it should increase ec poll time. also, could you check how fast you get new ACPI interrupts (cat /proc/interrupts)?
Created attachment 21278 [details] dmesg of vanilla + patches from attachments 21061 and 21057 Atached part of the dmesg after booting an updated vanilla kernel with the requested patches. The whole file was 15M and still, it did not get the initial kernel startup, if that is what you are interested in, I can try to catch it... The number of acpi interrupts per second, sampled over a one minute timespan, with gnome power manager disabled: vanilla: 1.9 vanilla + patches requested above: 231.1 (sluggish mouse) vanilla - 34ff4dbccccce54c83b1234d39b7ad9e548a75dd + tuxonice (compiled on apr 27): 53.6
Created attachment 21281 [details] debug sbs notify it looks like your machine tries to send too many 0x20 queries, which are alarms for SBS (battery and AC charger in your case). Your DSDT has _Q20 method which tries to clear alarm bit if notification comes from SBS Manager, but this method should not be executed if SBS drivers are loaded -- sbs.ko. Problem is that Query bit in status register is not cleared after we issue single query command, so we either should try to implement this _Q20 in sbshc.c (by ignoring alarms coming from manager), or cycle query command until query bit is cleared.
Created attachment 21282 [details] try to query EC until it clears query bit please check both patches on top of previous two and attach dmesg here.
Created attachment 21288 [details] dmesg with all patches applied as requested This is looking good, the dmesg from the first boot with all patches is attached, and I am now seeing 6.06 acpi interrupts per second. No more sluggish mouse! I would say this solved it, but not yet, I still want to reboot a few times before I can make that claim!
I have now used the same kernel for some time, and an updated version with the same set of pathes. I can now say with confidence that all is well. As far as I am concerned, this bug is solved if all the patches make it to mainline. Thank you all Tiago
Patch : http://bugzilla.kernel.org/attachment.cgi?id=21282
alexey, do you have a patch here that is ready to submit to the list?
Created attachment 21371 [details] try to query EC until it clears query bit Please check if this patch alone on top of 21061 and 21057 still helps.
Created attachment 21379 [details] dmesg with 21371, 21061, 21057 This does not solve the problem! I still see a lot of interrupts, I probably made a mistake collecting results because I am seeing over 1000 interrupts per second. Yet the mouse is unusable and the keyboard is hardly usable, I had to press each key for about half a second in order to register a keystroke! I will try to apply the sbs path on top of this, I have tried something like this before, and it seems to have solved the problem...
I applied 21281 on top of the patches mentioned on my last comment. As I expected, the mouse became usable again, and the interrupt count came down to 16.6 per second. This just crossed my mind... Could this be caused by a faulty battery? Should I test without the battery?
Tiago, Could you please remove the #24 patch now? Please also check that you still have notifications about AC adapter insertion/removal.
I assume that you meant to remove patches 21371, 21061, 21057, and leave only the sbs patch. As you expected, I lost the AC notifications. I did not realize this before, because I seldom unplug my laptop as my battery is near its end of life. But the ec patches do have a positive effect if applied on top of the sbs patch. with all patches, the time to perform 100 reads of /sys/class/thermal/thermal_zone0/temp is about 0.6 seconds, but with just the sbs patch, the time is around 1.2 seconds, so patches 21061 and 21371 do have a positive effect if the system is not receiving a lot of interrupts!
Created attachment 21613 [details] Run SBS callback under mutex Here is one more patch to try.
Created attachment 21627 [details] make the callback save the previous state No go! I applied your patch on a vanilla kernel, and recompiled with each of the supplied patches in order, I tried all but the patch sbshc patch. After all the patches are applied, I still can not use the touchpad, and I get no notification on ac removal. I did however try a little tweak on top of your patch that did give me the notifications back, but not the touchpad. The patch I attached applies on top of yours. If I read the suggested patch correctly, the mutex_trylock() may miss some events, and given that I am seeing a lot of events per second, I tried to convert saved* to static, and initialize it with random values. this may cause extra events on bootup, but at that time, I do not think any userland program will act upon them. I am planning to further test the patch I attached, but I welcome your opinion, as I do not like the way I initialized the variables, and I am not sure how static variables are seen by kernel programmers...
Created attachment 21629 [details] Run SBS callback under mutex #2 Then probably we should not clear the alarm bit until the notification is served. Please attach dmesg output with DEBUG in ec.c enabled if this patch does not help, as it is hard to guess what is going on... :(
Created attachment 21630 [details] dsdt with the new sbs patch This one does give me the ac adapter notifications back, but at the cost of my touchpad. If I have to choose, I take the touchpad. One new thing I never noticed before, but is the same with this patch and the previous, and I expect it to be the same even with no sbs patch. After some time, some of the touchpad features come back, I can actually move the pointer and tapp, but every advanced feature as scrooling or corner tapping never recovers. I got the attached dmesg right after I noticed that the touchpad had returned (the cpu went down to idle). These two lines keep appearing in the syslog if I try to move the pointer before things "settle down": May 30 09:37:06 xato [ 224.105345] psmouse.c: TouchPad at isa0060/serio4/input0 lost synchronization, throwing 2 bytes away. May 30 09:37:06 xato [ 224.618568] psmouse.c: resync failed, issuing reconnect request Hope this helps you!
Created attachment 21711 [details] Run SBS callback under mutex #3 Please check if this added throttling helps. Please also attach files directly to bug report, it helps later.
Created attachment 21718 [details] dmesg with 21711 Ok, this patch seems to solve all but the charger notification. Low latency on the /sys interface, and touchapad working as well as expected. Please note that this is the first boot, I need further testing before making a final assessment as I found some discrepancies between reboots earlier. The dmesg of this boot is attached. Earlier I did not attach in line because the files, even compressed were over 1M. I was trying to supply as much information as possilbe, much of it were repeated lines, but hidden among them I was hopping for a clue to the solution. If that happens again, I will truncate or split the file to avoid loosing data on external (crappy) sites! I believe the problem is that throttled events may never be noticed. on smbus_alarm() you have: if (alarm_is_pending || time_before(jiffies, alarm_time + msecs_to_jiffies(1000))) goto unlock; witch means we may skip some interrupts. But on the callback you have: u8 saved_charger_state = sbs->charger_present; u8 saved_battery_state; /* check if some parallel SBS access is going on */ mutex_lock(&sbs->lock); acpi_ac_get_present(sbs); if (sbs->charger_present != saved_charger_state) { You save the charger_state on the beginning of the _current event_ handler, read the current state, and compare it with the saved state. Can the sbs->charger_present be updated by anything other than the acpi_ac_get_present(sbs) call? if so, when you get the mutex, the state of sbs->charger_present may already have been updated.
Created attachment 21725 [details] debug sbs mutexes Ok, forget my previous post! After sleeping on it, I applied the attached patch. I will attach the dmesg next. The result is rather strange, I was expecting a lot of "_________ sbshc about to lock mutex" in my log messages, but no such thing. Only once... Can there be a deadlock somewhere? I hope this helps you... It just puzzled me even more!
Created attachment 21726 [details] dmesg with previous patch
Actually, mutex should be locked only if alarm bit is set. And it should be set only when either AC charger or battery have changed state (have something to report). Thus, you should see attempt to lock mutex in response to AC connect/disconnect. You may also check if something has deadlocked by reading either /proc/acpi/battery/*/* or sysfs battery files. In case of a deadlock this read will never return.
Ignore-Patch : http://bugzilla.kernel.org/attachment.cgi?id=21282
Ignore-Patch : http://bugzilla.kernel.org/attachment.cgi?id=21061
hi, Alexey, any updates on this bug?
Zhang, we may close this bug as "will not fix", as sufficient workaround (set HZ to 1000 exists). As I understand, reporter is not interested in any further investigation of the issue.
done. thanks. :)