Bug 15392
Summary: | [bisected] The kernel does not start up. | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Kristóf Ralovich (kristof.ralovich) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | CLOSED INVALID | ||
Severity: | normal | CC: | akpm, florian, hjl.tools, hpa, kristof.ralovich, rjw, suresh.b.siddha |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 2.6.33 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 14885 | ||
Attachments: |
This is the configuration file I compiled the kernel with.
the .config for 2.6.33-rc1 (should be the same as for 2.6.33)
System.map
System.map
dmesg for 2.6.32-good
Description
Kristóf Ralovich
2010-02-25 06:52:03 UTC
Created attachment 25206 [details]
This is the configuration file I compiled the kernel with.
I'll reassign this to x86_64. The problem could of course lie elsewhere, hard to tell.

Can you test 2.6.33-rc1, please?

I will give it a try tonight after work.

I have checked with 2.6.33-rc1, the issue is the same as with 2.6.33.

I have checked again, and the last lines the kernel prints are the following (for both 2.6.33 and 2.6.33-rc1):

    Decompressing Linux. Parsing ELF...done. Booting the kernel.

At this point the output stops and the machine reboots in a few seconds.

Created attachment 25304 [details]
the .config for 2.6.33-rc1 (should be the same as for 2.6.33)
The issue is still there with 2.6.33.1.

On Monday 22 March 2010, RALOVICH, Kristóf wrote:
> The issue is still there with 2.6.33-rc1, 2.6.33 and 2.6.33.1 too.
I took a look: I tested 2.6.33.1 with bz2 instead of LZMA, the same issue is still there.

The issue is still there with 2.6.33.2.

what is your kernel commandline?

Also, could you do a "git bisect" between 2.6.32 and 2.6.33, or if that is not possible try the 2.6.33 release candidates?

(In reply to comment #11)
> what is your kernel commandline?

I'll post it tonight when I get to that machine.

(In reply to comment #12)
> Also, could you do a "git bisect" between 2.6.32 and 2.6.33, or if that is not
> possible try the 2.6.33 release candidates?

I have tried 2.6.33-rc1 (see comment #5 and comment #6) and that fails already. I am afraid I currently don't have time to do further bisecting between 2.6.32 and 2.6.33-rc1.

Summary:
2.6.32 - works fine
2.6.32.y - works fine (stable tree)
2.6.33-rc1 - fails
2.6.33.1 - fails
2.6.33.2 - fails
2.6.34-rc3 - fails

My guess is that the regression occurred somewhere between 2.6.32 and 2.6.33-rc1.

That's a merge window... without a git bisect there isn't much additional to go on.

have you specified the panic=x parameter on your kernel commandline? (reboot on panic) if yes, leaving that out might print some error... other than that a git-bisect might be the easiest way to get to the culprit...

On Friday 09 April 2010, RALOVICH, Kristóf wrote:
> The issue still exists!
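
For readers following along, the bisection requested above could look roughly like the sketch below. It is an illustration, not something posted in this report: it assumes a clone of the mainline kernel tree carrying the usual release tags (`v2.6.32`, `v2.6.33-rc1`), and that each step is built, installed and boot-tested by hand. The helper name `bisect_plan` is made up for this example, and it only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch: print the git bisect commands for a known-good and a
# known-bad revision. Nothing is executed against a repository here.
# bisect_plan is a hypothetical helper name, not part of git.
bisect_plan() {
    good=$1
    bad=$2
    echo "git bisect start"
    echo "git bisect bad $bad"
    echo "git bisect good $good"
    echo "# build, install and boot the revision git checks out, then run one of:"
    echo "git bisect good   # the kernel booted"
    echo "git bisect bad    # the machine rebooted after 'Booting the kernel.'"
    echo "# repeat until git names the first bad commit, then:"
    echo "git bisect reset"
}

# Tags assumed to match the last working and first failing versions reported.
bisect_plan v2.6.32 v2.6.33-rc1
```

Each round roughly halves the remaining range of commits, which is why a merge window with thousands of commits still only needs a dozen or so test boots.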
I have bisected the failure; the regression is introduced by the following commit:

74e081797bd9d2a7d8005fe519e719df343a2ba8 x86-64: align RODATA kernel section to 2MB with CONFIG_DEBUG_RODATA

did you double check that if you disable config_debug_rodata the kernel boots? (or check that reverting that commit on top of current git (if possible) boots?) also you can put a '[bisected]'

... in the bug title, if you verified that this is the culprit. i have added suresh siddha to the cc'list .. cheers, Flo

p.s. sorry for not finishing the last comment... somehow i fat-fingered my browser...

I compiled a 2.6.33.2 kernel with CONFIG_DEBUG_RODATA disabled, and it works for me. So this option has something to do with the kernel not starting up.

Kristof, I am not able to reproduce the problem with your failing config. Perhaps some difference in the linker etc. is hitting the bug somewhere. Can you please provide me your failing kernel's System.map, and also check whether earlyprintk=vga shows any messages on the console before it reboots? Thanks.

Created attachment 26045 [details]
System.map
A failing System.map. This is for the kernel (74e081797bd9d2a7d8005fe519e719df343a2ba8) introducing the regression. Don't let the filename fool you, this is for 74e081797bd9d2a7d8005fe519e719df343a2ba8 not for 2.6.32-rc4!
Created attachment 26046 [details]
System.map
In comment #23 I was wrong about attachment (id=26045). These are the correct System.map files, for the last good (b9af7c0d44b8bb71e3af5e94688d076414aa8c87) and first regressed (74e081797bd9d2a7d8005fe519e719df343a2ba8) revisions.

Kristof, I didn't find anything wrong with the System.map-2.6.32-bad. No luck again today for me in reproducing the failure here. Also, fedora 13 kernels (using 2.6.33.x) etc have this CONFIG_DEBUG_RODATA enabled with no similar failure reports. There must be something unique on your setup that exposes this problem. Can you please help me get the below info, to evaluate my next steps.

a) Kernel log of your system for the successful boot case.

b) mark_rodata_ro() function gets called quite a bit later in the boot. If there is something wrong in mark_rodata_ro() code, then we should be seeing quite a few kernel messages before it crashes. Can you please remove "quiet" boot param and add something like "vga=6 earlyprintk=vga" and check if you see any kernel log for the failure case.

c) For the failing case, can you please apply just this change to static_protections() in arch/x86/mm/pageattr.c and see if it changes anything.

-#if defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)
+#if 0 && defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)

Created attachment 26074 [details]
dmesg for 2.6.32-good
(In reply to comment #25)
> Kristof, I didn't find anything wrong with the System.map-2.6.32-bad. No luck
> again today for me in reproducing the failure here. Also, fedora 13 kernels
> (using 2.6.33.x) etc have this CONFIG_DEBUG_RODATA enabled with no similar
> failure reports. There must be something unique on your setup that exposes this
> problem. Can you please help me get the below info, to evaluate my next steps.
>
> a) Kernel log of your system for the successful boot case.

See previous comment for dmesg on the last good revision.

> b) mark_rodata_ro() function gets called quite a bit later in the boot. If
> there is something wrong in mark_rodata_ro() code, then we should be seeing
> quite a few kernel messages before it crashes. Can you please remove "quiet"
> boot param and add something like "vga=6 earlyprintk=vga" and check if you see
> any kernel log for the failure case.

I do not have "quiet" on my kernel command line. Adding "vga=6 earlyprintk=vga" changes the console resolution, but no additional text is printed beyond what I originally reported.

> c) For the failing case, can you please apply just this change to
> static_protections() in arch/x86/mm/pageattr.c and see if it changes anything.
>
> -#if defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)
> +#if 0 && defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)

I applied this on the first bad revision, but it did not help.

Thanks Kristof. It is interesting that we don't see any kernel log messages in the failing case. Seems like we have some problem in grub loading the kernel itself. Can you please

a) comment out the mark_rodata_ro() call in init_post() in init/main.c and see if that changes anything. If the problem is with grub loading it, this should also fail.

b) Can you please provide your failing vmlinux and bzImage so that I can try a few things here?
I am hoping to find the root cause in the next couple of days, otherwise I will post a patch which reverts that commit :(

(In reply to comment #28)
> Thanks Kristof. It is interesting that we don't see any kernel log messages in
> the failing case. Seems like we have some problem in grub loading the kernel
> itself. Can you please
>
> a) comment out the mark_rodata_ro() call in init_post() in init/main.c and see
> if that changes anything. If the problem is with grub loading it, this should
> also fail.

I commented that line out, but the failure is still there.

> b) Can you please provide your failing vmlinux and bzImage so that I can try a
> few things here?

look here: http://people.freedesktop.org/~tade/kernel_debugging/

> I am hoping to find the root cause in the next couple of days, otherwise I will
> post a patch which reverts that commit :(

I am using standard debian grub:
ii grub 0.97-47lenny2 GRand Unified Bootloader (Legacy version)
ii grub-common 1.96+20080724-16 GRand Unified Bootloader, version 2 (common files)

Kristof, It is the linker issue why you are seeing this issue.

On my working kernel, if I do "readelf -l my-vmlinux" I see this for the first entry:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x00000000008ec000 0x00000000008ec000  R E    200000

And for your kernel, I see this:

  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000001000 0xffffffff81000000 0x0000000001000000
                 0x000000000056e000 0x000000000056e000  R E    1000

If you see the last column which is Align, mine has 2MB alignment and yours has 4k. I talked to HJ and he says when the linker uses 4K alignment, kernel might have a problem if it uses 2MB alignment.

What linker (ld --version) are you using? Have you recompiled your linker or perhaps debian linker is using 4k as the max-page-size?
HJ says, debian might have two linkers (ld and ld.gold) and they might be using different page-sizes.

I can reproduce your problem if I add this to (arch/x86/Makefile) my kernel compilation:

LDFLAGS_vmlinux += -z max-page-size=0x1000

Can you check if the kernel boots if you add this to the arch/x86/Makefile:

LDFLAGS_vmlinux += -z max-page-size=0x200000

Also HJ pointed me to this ld.gold bug now http://sourceware.org/bugzilla/show_bug.cgi?id=11490

Depending on your results and my discussion with HJ, we can see if we can have a workaround in the kernel or we need to address this in the linker.

(In reply to comment #31)
> Kristof, It is the linker issue why you are seeing this issue.
>
> On my working kernel, if I do "readelf -l my-vmlinux" I see this for the first
> entry:
>
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>   LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
>                  0x00000000008ec000 0x00000000008ec000  R E    200000
>
> And for your kernel, I see this:
>
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>   LOAD           0x0000000000001000 0xffffffff81000000 0x0000000001000000
>                  0x000000000056e000 0x000000000056e000  R E    1000
>
> If you see the last column which is Align, mine has 2MB alignment and yours has
> 4k. I talked to HJ and he says when the linker uses 4K alignment, kernel might
> have a problem if it uses 2MB alignment.
>
> What linker (ld --version) are you using? Have you recompiled your linker or
> perhaps debian linker is using 4k as the max-page-size? HJ says, debian might
> have two linkers (ld and ld.gold) and they might be using different
> page-sizes.
>
> I can reproduce your problem if I add this to (arch/x86/Makefile) my kernel
> compilation:
>
> LDFLAGS_vmlinux += -z max-page-size=0x1000
>
> Can you check if the kernel boots if you add this to the arch/x86/Makefile:
>
> LDFLAGS_vmlinux += -z max-page-size=0x200000
>
> Also HJ pointed me to this ld.gold bug now
> http://sourceware.org/bugzilla/show_bug.cgi?id=11490
>
> Depending on your results and my discussion with HJ, we can see if we can have
> a workaround in the kernel or we need to address this in the linker.

Mea culpa - I always had this feeling in the back of my head that gold is still experimental. Right now I don't have too much time, but I purged gold, and linked the kernel with regular ld and the problem is gone for me. As soon as I get around to it, I'll try gold + LDFLAGS_vmlinux flag.

I will not have any time soon to look into experimenting with linking a kernel with gold. I am confirming that 74e081797bd9d2a7d8005fe519e719df343a2ba8 built with gold exhibits the problem and the same kernel and config built with ld works for me. Thanks for your input finding the source of the problem.
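
The alignment check that settled this report can be scripted. The following is a minimal sketch under stated assumptions, not something posted by the participants: it expects the output of `readelf -lW` on stdin (wide mode, one line per program header, with Align as the last column) and a `vmlinux` in the build directory. The function names `first_load_align` and `first_load_is_2mb_aligned` are made up for this example.

```shell
#!/bin/sh
# Sketch: inspect the first LOAD program header of a kernel image.
# Input is `readelf -lW` output on stdin; both function names are hypothetical.

# Print the Align column (last field) of the first LOAD line.
first_load_align() {
    awk '$1 == "LOAD" { print $NF; exit }'
}

# Succeed only if the first LOAD segment is aligned to at least 2MB (0x200000),
# the value reported for the working kernel in this thread.
first_load_is_2mb_aligned() {
    align=$(first_load_align)
    [ -n "$align" ] || return 1
    # readelf may or may not print a 0x prefix; normalise before comparing.
    align=${align#0x}
    [ $((0x$align)) -ge $((0x200000)) ]
}

# Typical use (the vmlinux path is an assumption):
#   ld --version | head -n 1
#   readelf -lW vmlinux | first_load_is_2mb_aligned || echo "first LOAD segment is not 2MB aligned"
```

A header like the working kernel's (Align 200000) passes and one like the failing kernel's (Align 1000) does not. `ld --version` only shows which linker `ld` currently resolves to, so it is worth running in the same environment that builds the kernel.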