The Gentoo guys told me to take it elsewhere, so here I am :) (http://bugs.gentoo.org/show_bug.cgi?id=122844) Whether you have a K6, K6-II, K6-II+ or K6-III processor, selecting it as the processor type in the kernel config always results in -march=k6. I tried changing the Makefile to compile with -march=k6-2 (on a K6-2) and it worked just fine. Why not let GCC use the matching -march value? While I'm at it, if Pentium II is selected, why -march=i686 -mcpu=pentium2 instead of just -march=pentium2?
There is no reason for -march=k6-2 in the kernel, since the only difference compared to -march=k6 is the availability of 3DNow!, and you can't use floating point in the kernel. In the Pentium case, gcc knows how to tune best for each of the CPUs, but there's also nothing -march changes that would make a difference inside the kernel.
I'm sorry, I was mistaken. I thought the K6-2 had a couple of extra registers to play with, but they are only for storing data for 3DNow! instructions. But I wonder if GCC's -march=k6-2 really is just -march=k6 plus 3DNow! instructions? Digging around, I found this: "they just added 16 wait states to the execution of the LOOPcc and thus caused it to slow to the speed of a Pentium. AMD didn't just do this however. They added a special case (speculation, might be coincidence) for the DEC (E)CX; Jcc combination, which is semantically equivalent with the LOOPcc instruction, but this semantic equivalency and the loop being faster on Intels caused the loop instruction to always be used. Nobody used the DEC/Jcc combo. They kept the original speed for this combo and specified in their optimization manuals that this was the preferred method over the loopcc instruction." (http://www.mega-tokyo.com/osfaq2/index.php/The%20IA32%20Architecture%20Family) So -march=k6 should make use of LOOPcc, but -march=k6-2 would hopefully go with DEC/Jcc. I'll have to try and dig around in GCC to find out if that is so. But I'm really not good at this.. :) I think(?) the K6 and the K6-2 have the same amount of cache (64KiB), so there probably shouldn't be any difference in optimizing with that in mind.
If you look at the gcc sources, you see that gcc doesn't know about any differences between k6 and k6-2 except for 3dnow.
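A quick way to check this from the outside, without reading the gcc sources: gcc defines the preprocessor macro __3dNOW__ when 3DNow! code generation is enabled, so a tiny C file can report whether a given -march value turns it on. This is only a sketch. The function name has_3dnow is made up for illustration, and it assumes a gcc build that still accepts -march=k6 and -march=k6-2 (with -m32 on a 64-bit host).

```c
/* Reports at compile time whether gcc enabled 3DNow! for this build.
 * has_3dnow is a hypothetical helper name, not something from the kernel. */
int has_3dnow(void)
{
#ifdef __3dNOW__
    return 1;   /* macro defined: 3DNow! code generation is enabled */
#else
    return 0;   /* macro not defined: no 3DNow! for this -march value */
#endif
}
```

Building it once with -march=k6 and once with -march=k6-2 shows whether the macro differs between the two. Dumping all predefined macros with "gcc -m32 -march=k6-2 -dM -E - < /dev/null" and diffing against the -march=k6 output gives a wider view of the same thing.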
I stand corrected. I dug around and found this, which strengthens your claim: http://gcc.gnu.org/ml/gcc/2003-02/msg01518.html Still, the LOOPcc wait states versus the DEC/Jcc combination puzzle me. If the GCC people didn't take them into account, they must have had a reason. If they did, I should be able to see it by compiling some suitable code (time, perhaps?) and comparing the output for different -march values. The difference in cache size between the processors is accounted for (I guess) by using different -O options; it's nothing that -march or -mtune is "aware" of.
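For the comparison, a simple countdown loop would do as test input, since that is the kind of code where a compiler could pick either LOOPcc or DEC/Jcc. The sketch below is only an assumption about what "suitable code" might look like. The function name sum_countdown is made up, and what gcc actually emits depends on the version and the options used.

```c
/* Sums an array by counting the index down to zero, the loop shape that
 * maps naturally onto either LOOP or DEC/Jcc on x86.
 * sum_countdown is a hypothetical test function, not kernel code. */
unsigned int sum_countdown(const unsigned int *values, unsigned int count)
{
    unsigned int total = 0;

    while (count != 0) {
        count--;                  /* decrement the loop counter */
        total += values[count];   /* accumulate the current element */
    }
    return total;
}
```

Compiling the file twice, with "gcc -m32 -O2 -march=k6 -S" and "gcc -m32 -O2 -march=k6-2 -S" (using -o to name the two outputs differently), and then diffing the two .s files shows directly whether the loop instructions differ between the two settings.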