This bug has been previously reported on the Debian bug tracker; please have a look at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614256 as it contains much more information.

With the ondemand governor, the time the CPU takes to ramp up to its maximum frequency is noticeable to the user and impacts overall performance. Setting sampling_rate to sampling_rate_min makes the CPU perform much faster transitions.

On this processor, an AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (down freq: 1000 MHz / up freq: 2200 MHz), I have:

/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate:109000
/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate_min:10900
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:109000

This simple command solves the problem:

cat sampling_rate_min >| sampling_rate

So I suppose the default values are not optimal. In contrast, I have an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz, and the three values mentioned above are all equal to 10000 on that processor.

I did some dirty benchmarking using the following command:

dd if=/dev/zero of=/dev/null bs=300k count=1000

Here are my results in GB/s (several runs) for the AMD CPU:

performance                                                   : 7.61 / 7.61 / 7.61 / 7.56 / 7.37
ondemand (no tweaking)                                        : 4.83 / 4.68 / 4.73 / 5.14 / 5.51 / 5.37
ondemand (sampling_rate = sampling_rate_min, i.e. default/10) : 7.00 / 7.07 / 7.03 / 7.06 / 7.02 / 7.01 / 7.04

Please see the original Debian bug report for more information. Thanks
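For reference, the one-line workaround above can be wrapped in a small script. This is a minimal sketch, not an official tool: the function name is made up, and it takes the sysfs directory as an argument (on this kernel the tunables live under /sys/devices/system/cpu/cpufreq/ondemand; newer kernels may use per-policy paths), which also makes it testable outside of /sys:

```shell
#!/bin/sh
# Copy sampling_rate_min into sampling_rate for a given ondemand sysfs
# directory. Passing the directory as an argument keeps this testable.
apply_min_sampling_rate() {
    dir=$1
    # Only act if both tunables are present and writable.
    [ -r "$dir/sampling_rate_min" ] && [ -w "$dir/sampling_rate" ] || return 1
    cat "$dir/sampling_rate_min" > "$dir/sampling_rate"
}

# Typical invocation (requires root):
#   apply_min_sampling_rate /sys/devices/system/cpu/cpufreq/ondemand
```

Note that the change is not persistent; it has to be reapplied on every boot, e.g. from an init script.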
dd if=/dev/zero of=/dev/null bs=300k count=1000

should fully utilize a core, but the duration of the process is a bit short:

307200000 bytes (307 MB) copied, 0.052651 seconds, 5.8 GB/s

real 0m0.056s
user 0m0.000s
sys  0m0.056s

If the frequency is checked every 100 ms but the process only takes 50 ms, it can happen that it's not switched up at all. I remember the polling interval with the old userspace governors was much higher.

This is a very specific micro-benchmark that doesn't say much about real workloads. Users won't notice whether a process ends in 50 or 70 ms; if there are more of them running in parallel and the core stays utilized for a longer time, the frequency will get switched up permanently. And since you want to save power, it makes sense not to switch the frequency up on such a tiny peak. On the latest machines there are deep sleep states; there you want to finish processes as quickly as possible.

> In contrast, I have an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz

The main difference is MSR-based vs. IO-based frequency switching. The latter takes longer -> longer latency, and this is calculated into the sampling_rate value. Newer AMDs also use MSR-based switching, export lower latency values (some even 0), and you get the same (min) sampling_rate.

You may want to try to find a "real-world" workload which takes a minute or so and prove a performance loss of >2%; that should be hard, especially with the latest improvements (counting IO as load). For theoretical worst-case performance losses on your HW you can also use cpufreq-bench from the cpufrequtils package.
(In reply to comment #1)
> Users won't notice whether the process ends in 50 or 70 ms
> And you want to save power, therefore it makes
> sense to not switch frequency up on this tiny peak.

I don't agree. We don't care if the frequency switches up here; it's a drop in the ocean. What power saving really is, IMHO, is this:

cpufreq stats: 2.20 GHz:2.50%, [...], 1000 MHz:97.34%

> You may want to try to find a "real-world" workload which takes a minute or so
> and prove a performance loss of >2%, that should be hard. Especially with
> latest improvements (count IO as load).

On my AMD CPU, the dd command takes more than 120 ms to execute, and about 65 ms with the workaround. Nearly 2x. Of course the benchmark is dirty, but I'm sure we can find practical issues. What about shell scripts spawning many small IO processes one after another? So I tested the following command within a shell script:

for i in {000..999} ; do dd if=/dev/zero of=file$i bs=1M count=1 ; done

With the following results (zsh time cmd):

performance:
0,27s user 4,17s system 43% cpu 10,229 total
0,28s user 4,17s system 41% cpu 10,740 total
0,31s user 4,12s system 41% cpu 10,564 total

ondemand:
0,72s user 9,76s system 70% cpu 14,944 total
0,70s user 9,74s system 64% cpu 16,256 total
0,63s user 8,64s system 61% cpu 15,037 total

ondemand with workaround:
0,46s user 5,49s system 49% cpu 12,095 total
0,43s user 5,58s system 48% cpu 12,281 total
0,43s user 5,52s system 48% cpu 12,358 total

And on a larger scale (running the loop 6 times within the script):

performa: 1,87s user 24,97s system 40% cpu 1:06,06 total
ondemand: 4,49s user 58,64s system 70% cpu 1:30,02 total
workarnd: 2,46s user 32,89s system 48% cpu 1:12,83 total

The issue is clearly visible, with >>2% overhead. It seems the gap between each process creation is long enough for ondemand to switch the frequency down, but afterwards the governor is too slow to get back up soon enough, resulting in an overall performance cost.
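For anyone who wants to repeat the spawn-many-small-dd test, here is a sketch that runs it in a scratch directory so it can be repeated safely (the function name and the workdir/count parameters are this sketch's own; COUNT=1000 matches the run above, a smaller count just shortens the run):

```shell
#!/bin/sh
# Spawn many short-lived dd processes, one after another, each writing
# a 1 MiB file. Each dd is too short for ondemand to ramp up, which is
# the behaviour under discussion.
run_dd_loop() {
    workdir=$1 count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/zero of="$workdir/file$i" bs=1M count=1 2>/dev/null
        i=$((i + 1))
    done
}

# Example: time run_dd_loop "$(mktemp -d)" 1000
```

Comparing wall-clock time of this loop under performance, plain ondemand, and ondemand with the sampling_rate workaround reproduces the comparison above.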
We know we can sample faster by setting sampling_rate to its minimum, so in the end the only question is: what is the real cost of sampling faster, and is it outweighed by the performance benefit? I'm not a specialist, so I may be wrong...

> For theoretical worst case performance losses for your HW you can also use
> cpufreq-bench from the cpufrequtils package.

I don't have it on Debian (version 007). I will try it if you or someone else think it's necessary to complete the results above.
> For theoretical worst case performance losses for your HW you can also use
> cpufreq-bench from the cpufrequtils package.

Did it. Using the provided config file (example.cfg, but with high prio):

sleep = 50000
load = 50000
cpu = 0
priority = high
output = /var/log/cpufreq-bench
sleep_step = 50000
load_step = 50000
cycles = 20
rounds = 40
verbose = 0
governor = ondemand

I got the following results:

#round load sleep performance powersave percentage
0 50000 50000 51231 72195 70.962
1 100000 100000 99372 131896 75.341
2 150000 150000 150344 207348 72.508
3 200000 200000 201254 252881 79.585
4 250000 250000 252787 314064 80.489
5 300000 300000 289421 384662 75.241
6 350000 350000 349900 444644 78.692
7 400000 400000 398016 500908 79.459
8 450000 450000 441566 559675 78.897
9 500000 500000 475505 596761 79.681
10 550000 550000 543846 679182 80.074
11 600000 600000 588735 736504 79.936
12 650000 650000 649588 805318 80.662
13 700000 700000 687249 828287 82.972
14 750000 750000 744053 910069 81.758
15 800000 800000 784194 949034 82.631
16 850000 850000 837451 1000410 83.711
17 900000 900000 895482 1060572 84.434
18 950000 950000 942109 1120819 84.055
19 1000000 1000000 988791 1177261 83.991
20 1050000 1050000 1035069 1219008 84.911
21 1100000 1100000 1090936 1290393 84.543
22 1150000 1150000 1116174 1322882 84.374
23 1200000 1200000 1182898 1384354 85.448
24 1250000 1250000 1245290 1461707 85.194
25 1300000 1300000 1279248 1504184 85.046
26 1350000 1350000 1334856 1568016 85.130
27 1400000 1400000 1359270 1619033 83.956
28 1450000 1450000 1427805 1696784 84.148
29 1500000 1500000 1476888 1743625 84.702
30 1550000 1550000 1527646 1798479 84.941
31 1600000 1600000 1571870 1851467 84.899
32 1650000 1650000 1622833 1900892 85.372
33 1700000 1700000 1677034 1956607 85.711
34 1750000 1750000 1723148 2010195 85.720
35 1800000 1800000 1774814 2064853 85.954
36 1850000 1850000 1823337 2137830 85.289
37 1900000 1900000 1873975 2184400 85.789
38 1950000 1950000 1937433 2296903 84.350
39 2000000 2000000 1965560 2282347 86.120

Not so good, right? Even at the end, with the 2-second workload...
Great, thanks! I know this runs for a while..., but could you let it run (overnight?) with different sampling_rate values (this was the default, 109 ms?). Best: min, default, and one or two in between (above was

I agree that it would make sense to hardcode latency values in powernow-k8, at least for some families. Even if the latency is wrong then, it should be set in a way that lets ondemand pick the best sampling_rate values later.
(In reply to comment #4)
> Great, thanks!
> I know this runs for a while..., but could you let it run (over night?) with
> different sampling_rate values (this was default, 109ms?).
> Best: min, default and one or 2 in between (above was

Yes sir!

Strangely, I got better results this time (for sampling_rate = 100900). I changed kernels in the meantime, from 2.6.37.x to 2.6.38; I don't know if that's the reason. Also, I must say that the machine is not perfectly and absolutely quiet. For example, every 5 minutes, about 45 rrdtool PNGs are being generated...

Then I had to modify the benchmark source code a little, because the kernel does not keep the sampling_rate value in memory when the governor is changed, i.e. at line 157 of benchmark.c:

    /* set the powersave governor which activates P-State switching
     * again */
    if (set_cpufreq_governor(config->governor, config->cpu) != 0)
        return;

    int slen = strlen(config->sampling_rate);
    if (sysfs_write_file(0, "ondemand/sampling_rate",
                         config->sampling_rate, slen) != slen)
        return;

I hope this modification takes effect immediately, because if the kernel waits for the previous sampling cycle to finish before using the new value, that might be a problem... But let's see the bench!
sampling rate -> 10900

#round load sleep performance powersave percentage
0 50000 50000 50379 57822 87.128
1 100000 100000 96123 105471 91.136
2 150000 150000 157853 165049 95.640
3 200000 200000 195221 208529 93.618
4 250000 250000 240717 259260 92.848
5 300000 300000 295529 304502 97.053
6 350000 350000 351005 355713 98.676
7 400000 400000 404862 407786 99.283
8 450000 450000 455725 458553 99.383
9 500000 500000 506461 509685 99.367
10 550000 550000 552422 558971 98.828
11 600000 600000 599962 609144 98.493
12 650000 650000 653394 650651 100.422
13 700000 700000 742302 759985 97.673
14 750000 750000 743909 749063 99.312
15 800000 800000 839556 848876 98.902
16 850000 850000 836299 852711 98.075
17 900000 900000 878341 906728 96.869
18 950000 950000 958657 955908 100.288
19 1000000 1000000 996672 1008523 98.825
20 1050000 1050000 1115971 1128368 98.901
21 1100000 1100000 1066765 1096360 97.301
22 1150000 1150000 1208905 1248820 96.804
23 1200000 1200000 1201914 1207109 99.570
24 1250000 1250000 1252482 1291425 96.984
25 1300000 1300000 1311825 1310604 100.093
26 1350000 1350000 1356576 1346045 100.782
27 1400000 1400000 1396160 1414234 98.722
28 1450000 1450000 1460034 1456336 100.254
29 1500000 1500000 1473712 1505785 97.870
30 1550000 1550000 1541369 1544064 99.825
31 1600000 1600000 1595234 1563091 102.056
32 1650000 1650000 1737228 1768357 98.240
33 1700000 1700000 1688818 1674172 100.875
34 1750000 1750000 1728056 1730448 99.862
35 1800000 1800000 1749468 1788727 97.805
36 1850000 1850000 1830809 1821587 100.506
37 1900000 1900000 1968621 2001197 98.372
38 1950000 1950000 1956487 1959974 99.822
39 2000000 2000000 1989418 1991125 99.914

sampling rate -> 25000

#round load sleep performance powersave percentage
0 50000 50000 48866 63812 76.578
1 100000 100000 95406 103630 92.064
2 150000 150000 159273 170228 93.564
3 200000 200000 184444 209933 87.858
4 250000 250000 246598 254528 96.885
5 300000 300000 286551 291895 98.169
6 350000 350000 359588 378732 94.945
7 400000 400000 392532 402556 97.510
8 450000 450000 445735 449587 99.143
9 500000 500000 531572 535932 99.186
10 550000 550000 549339 597997 91.863
11 600000 600000 593279 604690 98.113
12 650000 650000 641693 635087 101.040
13 700000 700000 692703 673405 102.866
14 750000 750000 720045 719564 100.067
15 800000 800000 840078 798726 105.177
16 850000 850000 833332 922140 90.369
17 900000 900000 846550 858108 98.653
18 950000 950000 938212 1009257 92.961
19 1000000 1000000 965626 943793 102.313
20 1050000 1050000 992188 1043123 95.117
21 1100000 1100000 1079752 1053940 102.449
22 1150000 1150000 1112577 1066276 104.342
23 1200000 1200000 1178584 1186980 99.293
24 1250000 1250000 1219567 1209520 100.831
25 1300000 1300000 1275331 1295114 98.472
26 1350000 1350000 1290260 1271751 101.455
27 1400000 1400000 1317887 1355029 97.259
28 1450000 1450000 1392389 1417351 98.239
29 1500000 1500000 1450419 1470584 98.629
30 1550000 1550000 1494379 1527781 97.814
31 1600000 1600000 1578152 1574044 100.261
32 1650000 1650000 1564408 1613680 96.947
33 1700000 1700000 1641712 1658964 98.960
34 1750000 1750000 1704641 1703238 100.082
35 1800000 1800000 1854979 1908299 97.206
36 1850000 1850000 1930030 1965809 98.180
37 1900000 1900000 2006221 2002679 100.177
38 1950000 1950000 1991565 2066377 96.380
39 2000000 2000000 1937709 1963756 98.674

sampling rate -> 50000

#round load sleep performance powersave percentage
0 50000 50000 50461 57215 88.195
1 100000 100000 101266 108726 93.139
2 150000 150000 150922 161746 93.308
3 200000 200000 201366 220502 91.322
4 250000 250000 251854 270302 93.175
5 300000 300000 302107 324696 93.043
6 350000 350000 352833 377245 93.529
7 400000 400000 402723 430434 93.562
8 450000 450000 453385 475389 95.371
9 500000 500000 516310 523196 98.684
10 550000 550000 527734 541273 97.499
11 600000 600000 603932 623068 96.929
12 650000 650000 654171 672086 97.334
13 700000 700000 722544 721662 100.122
14 750000 750000 755333 780769 96.742
15 800000 800000 784639 771908 101.649
16 850000 850000 857070 866353 98.929
17 900000 900000 896301 921364 97.280
18 950000 950000 953981 977734 97.571
19 1000000 1000000 986557 945493 104.343
20 1050000 1050000 1066578 1082070 98.568
21 1100000 1100000 1118159 1137267 98.320
22 1150000 1150000 1171215 1181797 99.105
23 1200000 1200000 1248940 1218559 102.493
24 1250000 1250000 1274655 1293596 98.536
25 1300000 1300000 1354501 1327079 102.066
26 1350000 1350000 1350517 1379971 97.866
27 1400000 1400000 1403327 1425986 98.411
28 1450000 1450000 1490654 1501879 99.253
29 1500000 1500000 1516141 1532396 98.939
30 1550000 1550000 1603324 1576549 101.698
31 1600000 1600000 1624517 1720622 94.415
32 1650000 1650000 1700588 1712018 99.332
33 1700000 1700000 1709934 1732689 98.687
34 1750000 1750000 1765423 1822591 96.863
35 1800000 1800000 1814231 1839975 98.601
36 1850000 1850000 1888431 1920897 98.310
37 1900000 1900000 1918348 1942091 98.777
38 1950000 1950000 1990487 2016755 98.698
39 2000000 2000000 2029399 2055636 98.724

sampling rate -> 75000

#round load sleep performance powersave percentage
0 50000 50000 50667 84951 59.643
1 100000 100000 108964 131021 83.165
2 150000 150000 152263 171764 88.647
3 200000 200000 202738 231244 87.673
4 250000 250000 253296 281282 90.050
5 300000 300000 323773 349455 92.651
6 350000 350000 380587 409458 92.949
7 400000 400000 434598 459378 94.606
8 450000 450000 451185 462619 97.529
9 500000 500000 544721 565812 96.272
10 550000 550000 557344 584730 95.316
11 600000 600000 600266 629588 95.343
12 650000 650000 710363 735434 96.591
13 700000 700000 706661 740431 95.439
14 750000 750000 755650 772399 97.832
15 800000 800000 802048 842788 95.166
16 850000 850000 853532 882385 96.730
17 900000 900000 902382 923021 97.764
18 950000 950000 1025231 1058781 96.831
19 1000000 1000000 1084996 1114325 97.368
20 1050000 1050000 1054255 1085418 97.129
21 1100000 1100000 1147697 1168613 98.210
22 1150000 1150000 1237486 1255271 98.583
23 1200000 1200000 1189043 1246567 95.385
24 1250000 1250000 1248880 1282136 97.406
25 1300000 1300000 1303318 1323350 98.486
26 1350000 1350000 1360049 1375888 98.849
27 1400000 1400000 1394339 1424795 97.862
28 1450000 1450000 1423447 1499351 94.938
29 1500000 1500000 1626024 1664063 97.714
30 1550000 1550000 1568529 1585058 98.957
31 1600000 1600000 1613514 1651092 97.724
32 1650000 1650000 1665502 1684689 98.861
33 1700000 1700000 1745951 1789933 97.543
34 1750000 1750000 1755412 1809875 96.991
35 1800000 1800000 1818859 1853292 98.142
36 1850000 1850000 1962896 1996294 98.327
37 1900000 1900000 2015061 2053166 98.144
38 1950000 1950000 1958055 1991440 98.324
39 2000000 2000000 2059918 2077839 99.138

sampling rate -> 100900

#round load sleep performance powersave percentage
0 50000 50000 49871 84547 58.987
1 100000 100000 96912 116817 82.961
2 150000 150000 146427 170140 86.063
3 200000 200000 201665 222451 90.656
4 250000 250000 252238 279322 90.304
5 300000 300000 302541 329129 91.922
6 350000 350000 353094 385873 91.505
7 400000 400000 399234 437468 91.260
8 450000 450000 453888 491621 92.325
9 500000 500000 504262 539304 93.502
10 550000 550000 547418 593890 92.175
11 600000 600000 589834 628469 93.853
12 650000 650000 648893 698749 92.865
13 700000 700000 704968 756190 93.226
14 750000 750000 750982 792786 94.727
15 800000 800000 798755 842458 94.812
16 850000 850000 842453 872535 96.552
17 900000 900000 897170 934559 95.999
18 950000 950000 956476 980354 97.564
19 1000000 1000000 1001540 1026438 97.574
20 1050000 1050000 1037185 1068218 97.095
21 1100000 1100000 1108554 1134449 97.717
22 1150000 1150000 1157416 1185222 97.654
23 1200000 1200000 1181416 1208771 97.737
24 1250000 1250000 1255804 1281740 97.977
25 1300000 1300000 1307917 1329915 98.346
26 1350000 1350000 1313132 1365145 96.190
27 1400000 1400000 1409645 1422965 99.064
28 1450000 1450000 1460223 1456055 100.286
29 1500000 1500000 1510886 1522465 99.239
30 1550000 1550000 1554302 1566502 99.221
31 1600000 1600000 1577039 1618269 97.452
32 1650000 1650000 1644825 1684849 97.625
33 1700000 1700000 1668999 1729190 96.519
34 1750000 1750000 1747671 1793127 97.465
35 1800000 1800000 1785283 1828590 97.632
36 1850000 1850000 1863788 1885203 98.864
37 1900000 1900000 1887764 1940127 97.301
38 1950000 1950000 1955864 1980122 98.775
39 2000000 2000000 1971118 2033591 96.928

This last one is surprising because it is very different from the previous run. Strange enough that I ran the test from last time again:

for i in {000..999} ; do dd if=/dev/zero of=file$i bs=1M count=1 ; done

To make it short, I got these average results (total exec time):

performance            : 4.6s
ondemand (sr = 109000) : 8.2s
ondemand (sr = 10900)  : 5.8s

Decreasing sampling_rate is still good for performance, but the total time is much shorter in every case. Maybe because of the new RCU pathname lookup in 2.6.38? For information, this was done on a software RAID5 + ext4.

> I agree that it would make sense to hardcode latency values in powernow-k8 at
> least for some families. Even latency is wrong then, it should get set in a
> way that ondemand takes best sampling rate values later.

Kind of an auto-adaptive sampling rate? What I did for Debian is tweak the init.d script coming from the cpufrequtils package like this: if the sampling_rate is > 100000, set it to sampling_rate_min (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614256).

Hope this helps!
Fabien C.
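The init.d tweak just described can be expressed as a small helper. This is a sketch of the decision only (the function name is made up; the 100000 µs threshold, i.e. 100 ms, is the one chosen for the Debian script):

```shell
#!/bin/sh
# Decide which sampling_rate (in microseconds) to use: fall back to
# sampling_rate_min whenever the default exceeds the 100 ms threshold.
pick_sampling_rate() {
    current=$1 min=$2
    if [ "$current" -gt 100000 ]; then
        echo "$min"
    else
        echo "$current"
    fi
}

# On the AMD box from this report:
pick_sampling_rate 109000 10900   # -> 10900
# On the Intel box, the default is already fine:
pick_sampling_rate 10000 10000    # -> 10000
```

The init script would read both values from the ondemand sysfs directory, pass them through this check, and write the result back to sampling_rate.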
(In reply to comment #5)
> sampling rate -> 100900

Oops, I made a mistake: this should have been 109000, not 100900... I ran the last part again.

Using the *original* benchmark source code:

#round load sleep performance powersave percentage
0 50000 50000 53543 73043 73.304
1 100000 100000 98023 126766 77.326
2 150000 150000 159160 210128 75.744
3 200000 200000 196206 249704 78.575
4 250000 250000 243273 306993 79.244
5 300000 300000 292181 361470 80.831
6 350000 350000 343698 394660 87.087
7 400000 400000 416815 484761 85.984
8 450000 450000 462757 545664 84.806
9 500000 500000 520991 598003 87.122
10 550000 550000 535377 596937 89.687
11 600000 600000 583483 648846 89.926
12 650000 650000 628524 698827 89.940
13 700000 700000 695536 757017 91.879
14 750000 750000 728124 804866 90.465
15 800000 800000 780291 857666 90.978
16 850000 850000 877801 967767 90.704
17 900000 900000 857772 949362 90.353
18 950000 950000 975308 1083946 89.978
19 1000000 1000000 1057874 1146442 92.275
20 1050000 1050000 1004248 1092633 91.911
21 1100000 1100000 1137662 1235654 92.070
22 1150000 1150000 1177812 1299840 90.612
23 1200000 1200000 1237668 1349017 91.746
24 1250000 1250000 1277581 1398822 91.333
25 1300000 1300000 1346782 1470433 91.591
26 1350000 1350000 1372685 1514791 90.619
27 1400000 1400000 1425458 1555385 91.647
28 1450000 1450000 1488907 1614644 92.213
29 1500000 1500000 1534220 1660851 92.376
30 1550000 1550000 1589945 1705595 93.219
31 1600000 1600000 1623821 1774659 91.500
32 1650000 1650000 1682260 1819625 92.451
33 1700000 1700000 1718704 1855909 92.607
34 1750000 1750000 1800496 1916997 93.923
35 1800000 1800000 1834767 1969702 93.149
36 1850000 1850000 1836771 1991248 92.242
37 1900000 1900000 1934180 2091324 92.486
38 1950000 1950000 1946207 2129832 91.378
39 2000000 2000000 2021666 2189178 92.348

Using the *modified* benchmark source code:

sampling rate -> 109000

#round load sleep performance powersave percentage
0 50000 50000 51888 67155 77.267
1 100000 100000 100547 134729 74.629
2 150000 150000 150290 189240 79.417
3 200000 200000 202431 245879 82.330
4 250000 250000 250859 316895 79.161
5 300000 300000 303353 355627 85.301
6 350000 350000 342444 403122 84.948
7 400000 400000 398541 455542 87.487
8 450000 450000 445787 505192 88.241
9 500000 500000 494594 558546 88.550
10 550000 550000 550286 624562 88.107
11 600000 600000 599732 648617 92.463
12 650000 650000 658792 709540 92.848
13 700000 700000 692467 761048 90.989
14 750000 750000 734578 815476 90.080
15 800000 800000 795283 874258 90.967
16 850000 850000 853177 932069 91.536
17 900000 900000 908659 984574 92.290
18 950000 950000 944224 1020172 92.555
19 1000000 1000000 991434 1057693 93.736
20 1050000 1050000 1030782 1105324 93.256
21 1100000 1100000 1082488 1163559 93.033
22 1150000 1150000 1157796 1242472 93.185
23 1200000 1200000 1195277 1278660 93.479
24 1250000 1250000 1242131 1352067 91.869
25 1300000 1300000 1284598 1391095 92.344
26 1350000 1350000 1342151 1428773 93.937
27 1400000 1400000 1384506 1490919 92.863
28 1450000 1450000 1444801 1566643 92.223
29 1500000 1500000 1489048 1602071 92.945
30 1550000 1550000 1547435 1661750 93.121
31 1600000 1600000 1600977 1699581 94.198
32 1650000 1650000 1646659 1744241 94.405
33 1700000 1700000 1706900 1808868 94.363
34 1750000 1750000 1744235 1864876 93.531
35 1800000 1800000 1782070 1899208 93.832
36 1850000 1850000 1833140 1956952 93.673
37 1900000 1900000 1913469 2021041 94.677
38 1950000 1950000 1938900 2078972 93.262
39 2000000 2000000 1988934 2111682 94.187
> Decreasing sampling_rate is still good for performance, but the total time is
> much shorter in every case

Thanks, so there may be a fix/improvement in 2.6.38? This will still take some time, as it is rather time-intensive. Also, the priority is not that high, as there is no real regression. But I am going to revisit the issue in some weeks and may come up with some fine tuning.
(In reply to comment #7)
> > Decreasing sampling_rate is still good for performance, but the total time
> > is much shorter in every case
> Thanks, so there may be a fix/improvement in 2.6.38?

Well, the "for" loop is much faster, but that should not be related to cpufreq *only*. Still, the benchmark results are better, e.g. for the 2-second workload:

2.6.37.x: 86.120% efficiency
2.6.38:   92.348% efficiency

> But I am going to revisit the issue in some weeks and may come up with some
> fine tuning.

Great! Thanks. Don't hesitate to ask if you need more data.
It's great that the kernel bugzilla is back. Thomas, what's the current status of this bug? Just in case: can you please verify whether the problem still exists in the latest upstream kernel?
Bug closed, as there is no response from the bug reporter. Please feel free to reopen it if the problem still exists in the latest upstream kernel.
I had another quick look at it. The transition latency could be set statically per CPU family. But that would mean tuning per family, and even then there might be huge differences between CPUs (depending on how big the steps are; old AMDs switch in 100 MHz steps internally, iirc). It would also mean per-CPU-family code maintenance/tuning of old HW. As it can be tuned manually (see the description), it can at least be documented...
@Thomas: I agree. Setting it statically is not ideal, but I think that is not necessary either. I see the very same problem described here on a laptop with an AMD Turion 64 X2: smooth video playback that would only use ~50% CPU is impossible without the described workaround ("cat sampling_rate_min >| sampling_rate"). Another laptop with an Intel Pentium M (single-core) gets "sampling_rate = sampling_rate_min" as the default. So I guess there must already be some kind of CPU-family-related exception for older AMD dual-cores causing this bug!

I can imagine that fixing this bug could cause further regressions. But I still think it's worth it (although the Turion laptop is not mine :-). It would be fairer to fix systems that work best with sampling_rate_min and potentially hurt systems that cannot handle it, since the latter simply have wrong values for sampling_rate_min!

The solution for Debian (end of comment #5) sounds reasonable to me, since >100 ms is way too slow for video frame rendering that happens every 33 to 40 ms.

If you agree, then please reopen this bug. Otherwise "CLOSED DOCUMENTED" seems OK to me too. Then distros must take care of optimal defaults themselves, via CPU family, blacklists, or similar. And affected people could help by (re)opening downstream bugs like https://bugs.launchpad.net/bugs/326149
Apart from the "AMD Turion 64 X2" mentioned above, I now have a desktop system with an "AMD Athlon 64 X2" that also has sampling_rate = 109000, which is 109 ms and way too high!

@kernel-devs: please set the default for sampling_rate to X*sampling_rate_min with X=2 or 3 (e.g. 2*10900 = 21800 = 21.8 ms). This will allow smooth video playback and far better desktop responsiveness while still saving power.
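For clarity, the proposed default boils down to one multiplication; the sketch below just spells it out (the function name and the factor of 2 are this comment's proposal, not anything the kernel does today):

```shell
#!/bin/sh
# Proposed default: a small multiple of sampling_rate_min (microseconds)
# instead of the latency-derived value, which yields 109 ms on this HW.
suggested_sampling_rate() {
    min=$1 factor=$2
    echo $(( min * factor ))
}

# With this hardware's sampling_rate_min of 10900 us and X=2:
suggested_sampling_rate 10900 2   # -> 21800 (21.8 ms)
```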