Bug 30712 - Slow transitioning AMD ondemand CPU because of wrong sampling_rate
Summary: Slow transitioning AMD ondemand CPU because of wrong sampling_rate
Status: CLOSED DOCUMENTED
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Thomas Renninger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-07 21:54 UTC by Yill Din
Modified: 2012-09-23 01:11 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.37
Subsystem:
Regression: No
Bisected commit-id:


Description Yill Din 2011-03-07 21:54:02 UTC
This bug was previously reported on the Debian bug tracker; please have a look at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614256 as it contains much more information. 

Using the ondemand governor, the time the CPU takes to rise to its maximum frequency is noticeable to the user and impacts overall performance. Setting sampling_rate to sampling_rate_min makes the CPU perform much faster transitions. 

On this processor: 
 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
 Down freq: 1000MHz / Up freq: 2200MHz

I have: 
 /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate:109000
 /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate_min:10900
 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:109000

This simple command (run in /sys/devices/system/cpu/cpufreq/ondemand/) solves the problem: 
 cat sampling_rate_min >| sampling_rate

So I suppose the default values are not optimal. In contrast, I have an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz, and the three values mentioned above are all equal to 10000 on that processor. 

I did some dirty benchmarking using the following command: 
 dd if=/dev/zero of=/dev/null bs=300k count=1000

Here are my results in GB/s (several averages) for the AMD CPU:
--------------
performance :
7.61 / 7.61 / 7.61 / 7.56 / 7.37

ondemand (no-tweaking) :
4.83 / 4.68 / 4.73 / 5.14 / 5.51 / 5.37

ondemand (sampling_rate = sampling_rate_min, i.e. default/10) :
7.00 / 7.07 / 7.03 / 7.06 / 7.02 / 7.01 / 7.04
--------------

Please see the original Debian bug report for more information. 

Thanks
Comment 1 Thomas Renninger 2011-03-08 09:17:41 UTC
dd if=/dev/zero of=/dev/null bs=300k count=1000
should fully utilize a core, but the duration of the process is a bit short:
307200000 bytes (307 MB) copied, 0.052651 seconds, 5.8 GB/s
real    0m0.056s
user    0m0.000s
sys     0m0.056s

If the frequency is checked every 100ms but the process only takes 50ms, it can happen that the frequency is not switched up at all.
I remember the polling interval with old userspace governors was much higher.

This is a very specific micro-benchmark that does not tell much about real-world workloads.
Users won't notice whether a process ends in 50 or 70 ms. If there are more of them in parallel and the core stays utilized for a longer time, the frequency will get switched up permanently. And since you want to save power, it makes sense not to switch the frequency up on such a tiny peak.
On the latest machines there are deep sleep states; there you want to finish processes as quickly as possible.

> In contrast, I have an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz
The main difference is MSR-based vs. IO-based frequency switching.
The latter takes longer, which means a higher latency, and that latency is factored into the sampling_rate value. Newer AMDs also use MSR-based switching, export lower latency values (some even 0), and end up with the same (minimum) sampling_rate.
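
For reference, here is roughly how the governor derives these defaults around 2.6.37 (a simplified sketch of the GOV_START path in drivers/cpufreq/cpufreq_ondemand.c, not a verbatim excerpt):

 /* cpuinfo_transition_latency is reported in nanoseconds */
 unsigned int latency = policy->cpuinfo.transition_latency / 1000; /* ns -> us */
 if (latency == 0)
 	latency = 1;

 /* MIN_LATENCY_MULTIPLIER is 100: 109 us * 100 = 10900 (your sampling_rate_min) */
 min_sampling_rate = max(min_sampling_rate,
 			MIN_LATENCY_MULTIPLIER * latency);

 /* LATENCY_MULTIPLIER is 1000: 109 us * 1000 = 109000 (your default sampling_rate) */
 dbs_tuners_ins.sampling_rate = max(min_sampling_rate,
 				   latency * LATENCY_MULTIPLIER);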

You may want to try to find a "real-world" workload that takes a minute or so and prove a performance loss of >2%; that should be hard, especially with the latest improvements (counting IO as load).
For theoretical worst case performance losses for your HW you can also use cpufreq-bench from the cpufrequtils package.
Comment 2 Yill Din 2011-03-15 04:59:11 UTC
(In reply to comment #1)

> Users won't notice whether a process ends in 50 or 70 ms

> And you want to save power, therefore it makes
> sense to not switch frequency up on this tiny peak.

I don't agree. We don't care if the frequency switches up here; it's a drop in the ocean. What power saving really means, IMHO, is this: 
 cpufreq stats: 2.20 GHz:2.50%, [...], 1000 MHz:97.34%

> You may want to try to find a "real-world" workload that takes a minute or so
> and prove a performance loss of >2%; that should be hard, especially with the
> latest improvements (counting IO as load).

On my AMD CPU, the dd command takes more than 120ms to execute, and about 65ms with the workaround: nearly 2x. Of course the benchmark is dirty, but I'm sure we can find practical issues. What about shell scripts spawning many small IO processes one after another? 

So I tested the following command within a shell script: 
 for i in {000..999} ; do dd if=/dev/zero of=file$i bs=1M count=1 ; done

With the following results (zsh time cmd): 
 performance: 
 0,27s user 4,17s system 43% cpu 10,229 total
 0,28s user 4,17s system 41% cpu 10,740 total
 0,31s user 4,12s system 41% cpu 10,564 total

 ondemand: 
 0,72s user 9,76s system 70% cpu 14,944 total
 0,70s user 9,74s system 64% cpu 16,256 total
 0,63s user 8,64s system 61% cpu 15,037 total

 ondemand with workaround: 
 0,46s user 5,49s system 49% cpu 12,095 total
 0,43s user 5,58s system 48% cpu 12,281 total
 0,43s user 5,52s system 48% cpu 12,358 total

And on a larger scale (doing the loop 6 times within the script): 
 performa: 1,87s user 24,97s system 40% cpu 1:06,06 total
 ondemand: 4,49s user 58,64s system 70% cpu 1:30,02 total
 workarnd: 2,46s user 32,89s system 48% cpu 1:12,83 total


The issue is clearly visible, with overhead well above 2%. It seems the gap between each process creation is long enough for ondemand to switch the frequency down, but afterwards the governor is too slow to get back up soon enough, resulting in an overall performance cost. 

We know we can sample faster by setting sampling_rate to its minimum, so finally the only question is: what is the real cost of sampling faster, and does it outweigh the performance benefit? 
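
A back-of-the-envelope upper bound, assuming the 109 us transition latency is the dominant per-switch cost: even if every single sample at sampling_rate_min triggered a frequency transition, the time spent switching would be about

 109 us / 10900 us ≈ 1%

of CPU time, which seems cheap compared to the losses measured above. 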

I'm not a specialist, so I may be wrong... 

> For theoretical worst case performance losses for your HW you can also use
> cpufreq-bench from the cpufrequtils package.

I don't have it on Debian (cpufrequtils version 007). I will try it if you or someone else thinks it's necessary to complete the results above.
Comment 3 Yill Din 2011-03-24 17:30:10 UTC
> For theoretical worst case performance losses for your HW you can also use
> cpufreq-bench from the cpufrequtils package.

Did it. 

Using the provided config file (example.cfg, but with high prio): 
sleep = 50000
load = 50000
cpu = 0
priority = high
output = /var/log/cpufreq-bench
sleep_step = 50000
load_step = 50000
cycles = 20
rounds = 40
verbose = 0
governor = ondemand
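
(How I read the tool's output below: "load" and "sleep" are the busy/idle phases per cycle in microseconds, growing by load_step/sleep_step each round; "performance" and "powersave" are the measured calculation times under the performance governor and under the governor being tested; "percentage" is performance/powersave, so closer to 100% means less loss.) 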

I got the following results: 

#round load sleep performance powersave percentage
0 50000 50000 51231 72195 70.962
1 100000 100000 99372 131896 75.341
2 150000 150000 150344 207348 72.508
3 200000 200000 201254 252881 79.585
4 250000 250000 252787 314064 80.489
5 300000 300000 289421 384662 75.241
6 350000 350000 349900 444644 78.692
7 400000 400000 398016 500908 79.459
8 450000 450000 441566 559675 78.897
9 500000 500000 475505 596761 79.681
10 550000 550000 543846 679182 80.074
11 600000 600000 588735 736504 79.936
12 650000 650000 649588 805318 80.662
13 700000 700000 687249 828287 82.972
14 750000 750000 744053 910069 81.758
15 800000 800000 784194 949034 82.631
16 850000 850000 837451 1000410 83.711
17 900000 900000 895482 1060572 84.434
18 950000 950000 942109 1120819 84.055
19 1000000 1000000 988791 1177261 83.991
20 1050000 1050000 1035069 1219008 84.911
21 1100000 1100000 1090936 1290393 84.543
22 1150000 1150000 1116174 1322882 84.374
23 1200000 1200000 1182898 1384354 85.448
24 1250000 1250000 1245290 1461707 85.194
25 1300000 1300000 1279248 1504184 85.046
26 1350000 1350000 1334856 1568016 85.130
27 1400000 1400000 1359270 1619033 83.956
28 1450000 1450000 1427805 1696784 84.148
29 1500000 1500000 1476888 1743625 84.702
30 1550000 1550000 1527646 1798479 84.941
31 1600000 1600000 1571870 1851467 84.899
32 1650000 1650000 1622833 1900892 85.372
33 1700000 1700000 1677034 1956607 85.711
34 1750000 1750000 1723148 2010195 85.720
35 1800000 1800000 1774814 2064853 85.954
36 1850000 1850000 1823337 2137830 85.289
37 1900000 1900000 1873975 2184400 85.789
38 1950000 1950000 1937433 2296903 84.350
39 2000000 2000000 1965560 2282347 86.120

Not so good, right? Even at the end, with the 2-second workload...
Comment 4 Thomas Renninger 2011-03-24 21:58:24 UTC
Great, thanks!
I know this runs for a while..., but could you let it run (overnight?) with different sampling_rate values (this was the default, 109ms?).
Best: min, default, and one or two in between (above was 

I agree that it would make sense to hardcode latency values in powernow-k8, at least for some families. Even if the latency is wrong then, it should get set in a way that lets ondemand pick the best sampling_rate values later.
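
Purely as an illustration of the idea (a hypothetical sketch, not actual powernow-k8 code; the family check and the 5000 ns value are made-up placeholders):

 /* Hypothetical: override the BIOS/ACPI-reported latency for families
  * known to switch quickly, so that ondemand later derives a faster
  * default sampling_rate from it. */
 if (boot_cpu_data.x86 >= 0x10)		/* placeholder family check */
 	policy->cpuinfo.transition_latency = 5000;	/* ns, placeholder */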
Comment 5 Yill Din 2011-03-28 01:47:43 UTC
(In reply to comment #4)
> Great, thanks!
> I know this runs for a while..., but could you let it run (overnight?) with
> different sampling_rate values (this was the default, 109ms?).
> Best: min, default, and one or two in between (above was 

Yes sir!

Strangely, I got better results this time (for sampling_rate = 100900). I changed kernels in the meantime, from 2.6.37.x to 2.6.38; I don't know if that's the reason. 

Also, I must say that the machine is not perfectly and absolutely idle. For example, every 5 minutes, about 45 rrdtool PNGs are generated... 

Then I had to modify the benchmark source code a little, because the kernel does not keep the sampling_rate value in memory when the governor is changed. 

i.e. line 157 @ benchmark.c: 
 /* set the powersave governor which activates P-State switching
  * again */
 if (set_cpufreq_governor(config->governor, config->cpu) != 0)
 	return;
 
 /* re-apply the configured sampling_rate, which the governor switch
  * above has just reset to the kernel default */
 int slen = strlen(config->sampling_rate);
 if (sysfs_write_file(0, "ondemand/sampling_rate", config->sampling_rate, slen) != slen)
 	return;

I hope this modification takes effect immediately, because if the kernel waits for the previous sampling cycle to finish before using the new value, that might be a problem... 

But let's see the bench! 


sampling rate -> 10900

#round load sleep performance powersave percentage
0 50000 50000 50379 57822 87.128
1 100000 100000 96123 105471 91.136
2 150000 150000 157853 165049 95.640
3 200000 200000 195221 208529 93.618
4 250000 250000 240717 259260 92.848
5 300000 300000 295529 304502 97.053
6 350000 350000 351005 355713 98.676
7 400000 400000 404862 407786 99.283
8 450000 450000 455725 458553 99.383
9 500000 500000 506461 509685 99.367
10 550000 550000 552422 558971 98.828
11 600000 600000 599962 609144 98.493
12 650000 650000 653394 650651 100.422
13 700000 700000 742302 759985 97.673
14 750000 750000 743909 749063 99.312
15 800000 800000 839556 848876 98.902
16 850000 850000 836299 852711 98.075
17 900000 900000 878341 906728 96.869
18 950000 950000 958657 955908 100.288
19 1000000 1000000 996672 1008523 98.825
20 1050000 1050000 1115971 1128368 98.901
21 1100000 1100000 1066765 1096360 97.301
22 1150000 1150000 1208905 1248820 96.804
23 1200000 1200000 1201914 1207109 99.570
24 1250000 1250000 1252482 1291425 96.984
25 1300000 1300000 1311825 1310604 100.093
26 1350000 1350000 1356576 1346045 100.782
27 1400000 1400000 1396160 1414234 98.722
28 1450000 1450000 1460034 1456336 100.254
29 1500000 1500000 1473712 1505785 97.870
30 1550000 1550000 1541369 1544064 99.825
31 1600000 1600000 1595234 1563091 102.056
32 1650000 1650000 1737228 1768357 98.240
33 1700000 1700000 1688818 1674172 100.875
34 1750000 1750000 1728056 1730448 99.862
35 1800000 1800000 1749468 1788727 97.805
36 1850000 1850000 1830809 1821587 100.506
37 1900000 1900000 1968621 2001197 98.372
38 1950000 1950000 1956487 1959974 99.822
39 2000000 2000000 1989418 1991125 99.914


sampling rate -> 25000

#round load sleep performance powersave percentage
0 50000 50000 48866 63812 76.578
1 100000 100000 95406 103630 92.064
2 150000 150000 159273 170228 93.564
3 200000 200000 184444 209933 87.858
4 250000 250000 246598 254528 96.885
5 300000 300000 286551 291895 98.169
6 350000 350000 359588 378732 94.945
7 400000 400000 392532 402556 97.510
8 450000 450000 445735 449587 99.143
9 500000 500000 531572 535932 99.186
10 550000 550000 549339 597997 91.863
11 600000 600000 593279 604690 98.113
12 650000 650000 641693 635087 101.040
13 700000 700000 692703 673405 102.866
14 750000 750000 720045 719564 100.067
15 800000 800000 840078 798726 105.177
16 850000 850000 833332 922140 90.369
17 900000 900000 846550 858108 98.653
18 950000 950000 938212 1009257 92.961
19 1000000 1000000 965626 943793 102.313
20 1050000 1050000 992188 1043123 95.117
21 1100000 1100000 1079752 1053940 102.449
22 1150000 1150000 1112577 1066276 104.342
23 1200000 1200000 1178584 1186980 99.293
24 1250000 1250000 1219567 1209520 100.831
25 1300000 1300000 1275331 1295114 98.472
26 1350000 1350000 1290260 1271751 101.455
27 1400000 1400000 1317887 1355029 97.259
28 1450000 1450000 1392389 1417351 98.239
29 1500000 1500000 1450419 1470584 98.629
30 1550000 1550000 1494379 1527781 97.814
31 1600000 1600000 1578152 1574044 100.261
32 1650000 1650000 1564408 1613680 96.947
33 1700000 1700000 1641712 1658964 98.960
34 1750000 1750000 1704641 1703238 100.082
35 1800000 1800000 1854979 1908299 97.206
36 1850000 1850000 1930030 1965809 98.180
37 1900000 1900000 2006221 2002679 100.177
38 1950000 1950000 1991565 2066377 96.380
39 2000000 2000000 1937709 1963756 98.674


sampling rate -> 50000

#round load sleep performance powersave percentage
0 50000 50000 50461 57215 88.195
1 100000 100000 101266 108726 93.139
2 150000 150000 150922 161746 93.308
3 200000 200000 201366 220502 91.322
4 250000 250000 251854 270302 93.175
5 300000 300000 302107 324696 93.043
6 350000 350000 352833 377245 93.529
7 400000 400000 402723 430434 93.562
8 450000 450000 453385 475389 95.371
9 500000 500000 516310 523196 98.684
10 550000 550000 527734 541273 97.499
11 600000 600000 603932 623068 96.929
12 650000 650000 654171 672086 97.334
13 700000 700000 722544 721662 100.122
14 750000 750000 755333 780769 96.742
15 800000 800000 784639 771908 101.649
16 850000 850000 857070 866353 98.929
17 900000 900000 896301 921364 97.280
18 950000 950000 953981 977734 97.571
19 1000000 1000000 986557 945493 104.343
20 1050000 1050000 1066578 1082070 98.568
21 1100000 1100000 1118159 1137267 98.320
22 1150000 1150000 1171215 1181797 99.105
23 1200000 1200000 1248940 1218559 102.493
24 1250000 1250000 1274655 1293596 98.536
25 1300000 1300000 1354501 1327079 102.066
26 1350000 1350000 1350517 1379971 97.866
27 1400000 1400000 1403327 1425986 98.411
28 1450000 1450000 1490654 1501879 99.253
29 1500000 1500000 1516141 1532396 98.939
30 1550000 1550000 1603324 1576549 101.698
31 1600000 1600000 1624517 1720622 94.415
32 1650000 1650000 1700588 1712018 99.332
33 1700000 1700000 1709934 1732689 98.687
34 1750000 1750000 1765423 1822591 96.863
35 1800000 1800000 1814231 1839975 98.601
36 1850000 1850000 1888431 1920897 98.310
37 1900000 1900000 1918348 1942091 98.777
38 1950000 1950000 1990487 2016755 98.698
39 2000000 2000000 2029399 2055636 98.724


sampling rate -> 75000

#round load sleep performance powersave percentage
0 50000 50000 50667 84951 59.643
1 100000 100000 108964 131021 83.165
2 150000 150000 152263 171764 88.647
3 200000 200000 202738 231244 87.673
4 250000 250000 253296 281282 90.050
5 300000 300000 323773 349455 92.651
6 350000 350000 380587 409458 92.949
7 400000 400000 434598 459378 94.606
8 450000 450000 451185 462619 97.529
9 500000 500000 544721 565812 96.272
10 550000 550000 557344 584730 95.316
11 600000 600000 600266 629588 95.343
12 650000 650000 710363 735434 96.591
13 700000 700000 706661 740431 95.439
14 750000 750000 755650 772399 97.832
15 800000 800000 802048 842788 95.166
16 850000 850000 853532 882385 96.730
17 900000 900000 902382 923021 97.764
18 950000 950000 1025231 1058781 96.831
19 1000000 1000000 1084996 1114325 97.368
20 1050000 1050000 1054255 1085418 97.129
21 1100000 1100000 1147697 1168613 98.210
22 1150000 1150000 1237486 1255271 98.583
23 1200000 1200000 1189043 1246567 95.385
24 1250000 1250000 1248880 1282136 97.406
25 1300000 1300000 1303318 1323350 98.486
26 1350000 1350000 1360049 1375888 98.849
27 1400000 1400000 1394339 1424795 97.862
28 1450000 1450000 1423447 1499351 94.938
29 1500000 1500000 1626024 1664063 97.714
30 1550000 1550000 1568529 1585058 98.957
31 1600000 1600000 1613514 1651092 97.724
32 1650000 1650000 1665502 1684689 98.861
33 1700000 1700000 1745951 1789933 97.543
34 1750000 1750000 1755412 1809875 96.991
35 1800000 1800000 1818859 1853292 98.142
36 1850000 1850000 1962896 1996294 98.327
37 1900000 1900000 2015061 2053166 98.144
38 1950000 1950000 1958055 1991440 98.324
39 2000000 2000000 2059918 2077839 99.138


sampling rate -> 100900

#round load sleep performance powersave percentage
0 50000 50000 49871 84547 58.987
1 100000 100000 96912 116817 82.961
2 150000 150000 146427 170140 86.063
3 200000 200000 201665 222451 90.656
4 250000 250000 252238 279322 90.304
5 300000 300000 302541 329129 91.922
6 350000 350000 353094 385873 91.505
7 400000 400000 399234 437468 91.260
8 450000 450000 453888 491621 92.325
9 500000 500000 504262 539304 93.502
10 550000 550000 547418 593890 92.175
11 600000 600000 589834 628469 93.853
12 650000 650000 648893 698749 92.865
13 700000 700000 704968 756190 93.226
14 750000 750000 750982 792786 94.727
15 800000 800000 798755 842458 94.812
16 850000 850000 842453 872535 96.552
17 900000 900000 897170 934559 95.999
18 950000 950000 956476 980354 97.564
19 1000000 1000000 1001540 1026438 97.574
20 1050000 1050000 1037185 1068218 97.095
21 1100000 1100000 1108554 1134449 97.717
22 1150000 1150000 1157416 1185222 97.654
23 1200000 1200000 1181416 1208771 97.737
24 1250000 1250000 1255804 1281740 97.977
25 1300000 1300000 1307917 1329915 98.346
26 1350000 1350000 1313132 1365145 96.190
27 1400000 1400000 1409645 1422965 99.064
28 1450000 1450000 1460223 1456055 100.286
29 1500000 1500000 1510886 1522465 99.239
30 1550000 1550000 1554302 1566502 99.221
31 1600000 1600000 1577039 1618269 97.452
32 1650000 1650000 1644825 1684849 97.625
33 1700000 1700000 1668999 1729190 96.519
34 1750000 1750000 1747671 1793127 97.465
35 1800000 1800000 1785283 1828590 97.632
36 1850000 1850000 1863788 1885203 98.864
37 1900000 1900000 1887764 1940127 97.301
38 1950000 1950000 1955864 1980122 98.775
39 2000000 2000000 1971118 2033591 96.928


This last one is surprising because it is very different from the previous run. 

Strange enough that I ran the test from last time again: 
 for i in {000..999} ; do dd if=/dev/zero of=file$i bs=1M count=1 ; done

To make it short, I got these average results (total execution time): 
 performance : 4.6s
 ondemand (sr = 109000) : 8.2s
 ondemand (sr = 10900) : 5.8s

Decreasing sampling_rate is still good for performance, but the total time is much shorter in every case. Maybe because of the new "RCU pathname lookup" in 2.6.38? For information, this was done on a software RAID5 + ext4. 

> I agree that it would make sense to hardcode latency values in powernow-k8 at
> least for some families. Even latency is wrong then, it should get set in a
> way
> that ondemand takes best sampling rate values later.

Kind of an auto-adaptive sampling rate? 

What I did for Debian is tweak the init.d script from the cpufrequtils package like this: if sampling_rate is > 100000, set it to sampling_rate_min (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614256). 

Hope this helps! 

Fabien C.
Comment 6 Yill Din 2011-03-28 05:28:09 UTC
(In reply to comment #4)
(In reply to comment #5)

> sampling rate -> 100900

Oops, I made a mistake: this should have been 109000, not 100900... so I ran the last part again. 

Using the *original* benchmark source code: 

#round load sleep performance powersave percentage
0 50000 50000 53543 73043 73.304
1 100000 100000 98023 126766 77.326
2 150000 150000 159160 210128 75.744
3 200000 200000 196206 249704 78.575
4 250000 250000 243273 306993 79.244
5 300000 300000 292181 361470 80.831
6 350000 350000 343698 394660 87.087
7 400000 400000 416815 484761 85.984
8 450000 450000 462757 545664 84.806
9 500000 500000 520991 598003 87.122
10 550000 550000 535377 596937 89.687
11 600000 600000 583483 648846 89.926
12 650000 650000 628524 698827 89.940
13 700000 700000 695536 757017 91.879
14 750000 750000 728124 804866 90.465
15 800000 800000 780291 857666 90.978
16 850000 850000 877801 967767 90.704
17 900000 900000 857772 949362 90.353
18 950000 950000 975308 1083946 89.978
19 1000000 1000000 1057874 1146442 92.275
20 1050000 1050000 1004248 1092633 91.911
21 1100000 1100000 1137662 1235654 92.070
22 1150000 1150000 1177812 1299840 90.612
23 1200000 1200000 1237668 1349017 91.746
24 1250000 1250000 1277581 1398822 91.333
25 1300000 1300000 1346782 1470433 91.591
26 1350000 1350000 1372685 1514791 90.619
27 1400000 1400000 1425458 1555385 91.647
28 1450000 1450000 1488907 1614644 92.213
29 1500000 1500000 1534220 1660851 92.376
30 1550000 1550000 1589945 1705595 93.219
31 1600000 1600000 1623821 1774659 91.500
32 1650000 1650000 1682260 1819625 92.451
33 1700000 1700000 1718704 1855909 92.607
34 1750000 1750000 1800496 1916997 93.923
35 1800000 1800000 1834767 1969702 93.149
36 1850000 1850000 1836771 1991248 92.242
37 1900000 1900000 1934180 2091324 92.486
38 1950000 1950000 1946207 2129832 91.378
39 2000000 2000000 2021666 2189178 92.348


Using the *modified* benchmark source code: 

sampling rate -> 109000

#round load sleep performance powersave percentage
0 50000 50000 51888 67155 77.267
1 100000 100000 100547 134729 74.629
2 150000 150000 150290 189240 79.417
3 200000 200000 202431 245879 82.330
4 250000 250000 250859 316895 79.161
5 300000 300000 303353 355627 85.301
6 350000 350000 342444 403122 84.948
7 400000 400000 398541 455542 87.487
8 450000 450000 445787 505192 88.241
9 500000 500000 494594 558546 88.550
10 550000 550000 550286 624562 88.107
11 600000 600000 599732 648617 92.463
12 650000 650000 658792 709540 92.848
13 700000 700000 692467 761048 90.989
14 750000 750000 734578 815476 90.080
15 800000 800000 795283 874258 90.967
16 850000 850000 853177 932069 91.536
17 900000 900000 908659 984574 92.290
18 950000 950000 944224 1020172 92.555
19 1000000 1000000 991434 1057693 93.736
20 1050000 1050000 1030782 1105324 93.256
21 1100000 1100000 1082488 1163559 93.033
22 1150000 1150000 1157796 1242472 93.185
23 1200000 1200000 1195277 1278660 93.479
24 1250000 1250000 1242131 1352067 91.869
25 1300000 1300000 1284598 1391095 92.344
26 1350000 1350000 1342151 1428773 93.937
27 1400000 1400000 1384506 1490919 92.863
28 1450000 1450000 1444801 1566643 92.223
29 1500000 1500000 1489048 1602071 92.945
30 1550000 1550000 1547435 1661750 93.121
31 1600000 1600000 1600977 1699581 94.198
32 1650000 1650000 1646659 1744241 94.405
33 1700000 1700000 1706900 1808868 94.363
34 1750000 1750000 1744235 1864876 93.531
35 1800000 1800000 1782070 1899208 93.832
36 1850000 1850000 1833140 1956952 93.673
37 1900000 1900000 1913469 2021041 94.677
38 1950000 1950000 1938900 2078972 93.262
39 2000000 2000000 1988934 2111682 94.187
Comment 7 Thomas Renninger 2011-03-29 15:29:11 UTC
> Decreasing sampling_rate is still good for performance, but the total time is
> much shorter in every cases
Thanks, so there may be a fix/improvement in 2.6.38?

It will still take some time, as this is rather time-intensive.
Also, the priority is not that high, as there is no real regression.
But I am going to revisit the issue in a few weeks and may come up with some fine-tuning.
Comment 8 Yill Din 2011-03-29 16:55:11 UTC
(In reply to comment #7)
> > Decreasing sampling_rate is still good for performance, but the total time
> is
> > much shorter in every cases
> Thanks, so there may be a fix/improvement in 2.6.38?

Well, the "for" loop is much faster, but that should not be related to cpufreq *only*. Yet, the benchmark is better, i.e. the 2 seconds workload: 
2.6.37.x: 86.120% efficiency
2.6.38:   92.348% efficiency

> But I am going to revisit the issue in a few weeks and may come up with some
> fine-tuning.

Great! Thanks. 

Don't hesitate to ask if you need some more data.
Comment 9 Zhang Rui 2012-01-18 03:21:10 UTC
It's great that the kernel bugzilla is back.

Thomas, what's the current status of this bug?

Justincase,
Can you please verify if the problem still exists in the latest upstream kernel?
Comment 10 Zhang Rui 2012-05-24 07:54:17 UTC
Bug closed as there is no response from the bug reporter.
Please feel free to reopen it if the problem still exists in the latest upstream kernel.
Comment 11 Thomas Renninger 2012-05-24 08:28:28 UTC
I had another quick look at it.
The transition latency could be set statically per CPU family.
But that would mean tuning per family, and even then there might be huge differences between CPUs (depending on how big the steps are; old AMDs switch in 100MHz steps internally, IIRC).
It would also mean per-CPU-family code maintenance/tuning for old HW.
As it can be tuned manually (see the description), this can be closed as documented...
Comment 12 Oliver Joos 2012-05-25 10:53:41 UTC
@Thomas: I agree. Setting it statically is not ideal, but I think this is not necessary.

I see the very same problem described here on a laptop with an AMD Turion 64 X2. Smooth video playback that would only use ~50% CPU is impossible without the described workaround "cat sampling_rate_min >| sampling_rate". Another laptop with an Intel Pentium M (single-core) gets "sampling_rate = sampling_rate_min" as the default. So I guess there must already be some kind of CPU-family-related exception for older AMD dual-cores causing this bug!

I can imagine that fixing this bug could cause further regressions. But I still think it's worth it (although the Turion laptop is not mine :-). It would be fairer to fix systems that work best with sampling_rate_min and potentially hurt systems that cannot handle it, since the latter just have wrong values for sampling_rate_min! The solution for Debian (end of comment #5) sounds reasonable to me, since >100ms is way too slow for video frames rendered every 33...40ms.

If you agree then please reopen this bug. Otherwise "CLOSED DOCUMENTED" seems ok to me too. Then distros must care themselves about optimal defaults, via CPU family, blacklists or similar. And affected people could help by (re)opening downstream bugs like https://bugs.launchpad.net/bugs/326149
Comment 13 Oliver Joos 2012-09-23 01:11:09 UTC
Apart from the "AMD Turion 64 X2" mentioned above, I now have a desktop system with an "AMD Athlon 64 X2" that also has sampling_rate = 109000, which is 109ms and way too high!

@kernel-devs: please set the default for sampling_rate to X*sampling_rate_min with X=2 or 3 (e.g. 2*10900 = 21800 = 21.8ms). This will allow smooth video playback and far better desktop responsiveness while still saving power.
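
In ondemand terms that would amount to something like this (illustrative only, not a real patch; reusing the names from the default calculation sketched in comment #1):

 /* Illustrative: cap the default sampling_rate at a small multiple of
  * the minimum instead of transition latency * 1000. */
 dbs_tuners_ins.sampling_rate = min(latency * LATENCY_MULTIPLIER,
 				   3 * min_sampling_rate);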
