Latest working kernel version: 2.6.26 (Debian Kernel)
Earliest failing kernel version: 2.6.28, perhaps 2.6.27 (never tested)
Distribution: Debian Lenny
Hardware Environment:
  Mainboard: GA-MA790FX-DQ6
  CPU: AMD Phenom 9350e
  Memory: 4 GB DDR2-800
  Harddisks: 3 * WD1600YS (RE 160GB), 3 * WD1000FYPS (RE2-GP 1TB)
  SATA-Controllers: AMD SB600 (onboard), JMicron 363 (onboard), Promise TX4 (PCI)
Software Environment: Debian Lenny, vanilla kernel 2.6.28 (also git3, git4 and git5), built with make-kpkg (my .config will be attached), md-raid1, lvm2.

# /etc/fstab: static file system information.
#
# <file system>        <mount point>   <type>       <options>                               <dump> <pass>
proc                   /proc           proc         defaults                                0      0
/dev/mapper/vg00-root  /               ext3         data=journal,errors=remount-ro          0      1
/dev/md0               /boot           ext3         defaults,data=journal                   0      2
/dev/mapper/vg00-home  /home           ext3         defaults,data=journal                   0      1
/dev/mapper/vg01-srv   /srv            ext3         defaults,data=journal,usrquota,grpquota 0      1
/dev/mapper/vg00-tmp   /tmp            ext3         defaults,data=journal                   0      1
/dev/mapper/vg00-usr   /usr            ext3         defaults,data=journal                   0      1
/dev/mapper/vg00-var   /var            ext3         defaults,data=journal                   0      1
/dev/mapper/vg00-swap  none            swap         sw                                      0      0
/dev/sr0               /media/cdrom0   udf,iso9660  user,noauto                             0      0

Problem Description:
I compiled a kernel for two machines. On the first machine the kernel works without problems (GA-MA790FX-DS5, 4 harddisks, Debian Lenny, md-raid1 and md-raid6, lvm2, ext4). On the second machine (described above) the kernel panics on startup.
Here are the 2 snippets from the panic (the whole panic is attached):

------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:240 __ioremap_caller+0x173/0x2f5()
Hardware name: GA-MA790FX-DQ6
Pid: 1, comm: swapper Not tainted 2.6.28-git5-1 #1
Call Trace:
 [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff8020c9ee>] apic_timer_interrupt+0xe/0x20
 [<ffffffff8023d93c>] vprintk+0x287/0x2b3
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff80226717>] __ioremap_caller+0x173/0x2f5
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80209051>] _stext+0x51/0x120
 [<ffffffff802e80c6>] create_proc_entry+0x7d/0x92
 [<ffffffff80270e01>] register_irq_proc+0x94/0xac
 [<ffffffff8096d613>] kernel_init+0x119/0x16b
 [<ffffffff8020ceba>] child_rip+0xa/0x20
 [<ffffffff8096d4fa>] kernel_init+0x0/0x16b
 [<ffffffff8020ceb0>] child_rip+0x0/0x20
---[ end trace f77463f06c1843b4 ]---

------------[ cut here ]------------
WARNING: at kernel/smp.c:333 smp_call_function_mask+0x3b/0x1e3()
Hardware name: GA-MA790FX-DQ6
Pid: 0, comm: swapper Tainted: G M W 2.6.28-git5-1 #1
Call Trace:
 <#MC> [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff803efc88>] vsnprintf+0x79a/0x7e2
 [<ffffffff8023d2ff>] try_acquire_console_sem+0x10/0x31
 [<ffffffff8025b46f>] smp_call_function_mask+0x3b/0x1e3
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff8025d50f>] crash_kexec+0xee/0xf7
 [<ffffffff8021d012>] native_smp_send_stop+0x1a/0x26
 [<ffffffff806f4aca>] panic+0x95/0x136
 [<ffffffff8025179d>] notifier_call_chain+0x29/0x4c
 [<ffffffff8023cf43>] oops_enter+0x9/0x10
 [<ffffffff80217d0f>] mce_log+0x0/0x7d
 [<ffffffff802180ab>] do_machine_check+0x31f/0x3cf
 [<ffffffff806f7125>] machine_check+0x15/0x20
 [<ffffffff802123a7>] default_idle+0x27/0x3b
 <<EOE>> [<ffffffff8021253c>] c1e_idle+0xe1/0xe5
 [<ffffffff8020b064>] cpu_idle+0x4a/0x8b
 [<ffffffff8096dbfe>] start_kernel+0x35f/0x36b
 [<ffffffff8096d390>] x86_64_start_kernel+0xd9/0xdf
---[ end trace f77463f06c1843b4 ]---

Steps to reproduce:
Just boot kernel 2.6.28 (here git5), compiled with the attached config, on a GA-MA790FX-DQ6 with a Phenom 9350e and 4 GB of memory.

If you think there is a problem with the CPU or the memory, I can substitute them temporarily.

Thanks in advance,
Marcus
Created attachment 19634
.config of my kernel (2.6.28-git5)
Created attachment 19635
The whole kernel panic
Your hardware is dying:

HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: fe0000080005001b
TSC 2f219167c2 ADDR 20000554 MISC c008000001000000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
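For readers without mcelog at hand, the status word in the message above can be split into its generic fields by hand. The following is only a rough sketch: the flag positions are the architectural x86 MCi_STATUS bits, while the function name `decode_status` and the field names in the returned dictionary are made up for this illustration. Bits 16-31 are model-specific, so their meaning has to be looked up in the CPU vendor's documentation; the sketch extracts them without interpreting them.

```python
# Flag bit positions in an x86 MCi_STATUS register (architectural layout).
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and raw code fields.

    Hypothetical helper for illustration; it does not replace mcelog.
    """
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,              # bits 0-15: MCA error code
        "model_specific": (status >> 16) & 0xFFFF,  # bits 16-31: vendor-defined
    }


# Status word copied from the panic message (Bank 4).
decoded = decode_status(0xFE0000080005001B)
print(decoded["flags"])
print(hex(decoded["error_code"]), hex(decoded["model_specific"]))
```

The decoded flags say that the bank reported a valid, uncorrected error with address and MISC information present. What the error code itself stands for is left to mcelog or the vendor manual.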
Hello Marcin,

I have seen that message but didn't believe it (I don't want my hardware to die). So I thought there must be a kernel bug. Older kernels worked without any noticeable problem. I will try to capture the same message with a working Debian kernel. It will take me an hour because I'm at home at the moment.

This is what mcelog says:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 0 TSC 2f219167c2
STATUS 0 MCGSTATUS 0

At first I thought that perhaps the CPU is dying. But then I found this message on the LKML:
http://lkml.indiana.edu/hypermail/linux/kernel/0605.1/2085.html

I will replace the memory and hope that kernel 2.6.28 then boots up properly. If this doesn't help, I'll replace the CPU and then the mainboard. Either way, I'll let you know what happens.

Thank you for your efforts.
Marcus

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12357
>
> ------- Comment #3 from marcin.slusarz@gmail.com 2009-01-05 12:31 -------
> Your hardware is dying:
>
> HARDWARE ERROR
> CPU 0: Machine Check Exception: 4 Bank 4: fe0000080005001b
> TSC 2f219167c2 ADDR 20000554 MISC c008000001000000
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> Kernel panic - not syncing: Machine check
I replaced the memory. Now the kernel panic is gone. The problems with smp.c and the "Machine Check Exception" are resolved. But there is still a problem with ioremap.c and the netsc520 (the whole bootup process is attached as minicom.2.6.28.cap.gz):

------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:240 __ioremap_caller+0x173/0x2f5()
Hardware name: GA-MA790FX-DQ6
Pid: 1, comm: swapper Not tainted 2.6.28-git5-1 #1
Call Trace:
 [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff80226cd5>] __change_page_attr_set_clr+0x16b/0x867
 [<ffffffff80251648>] down_trylock+0x28/0x2e
 [<ffffffff80251648>] down_trylock+0x28/0x2e
 [<ffffffff8023d2ff>] try_acquire_console_sem+0x10/0x31
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff80226717>] __ioremap_caller+0x173/0x2f5
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80209051>] _stext+0x51/0x120
 [<ffffffff802e80c6>] create_proc_entry+0x7d/0x92
 [<ffffffff80270e01>] register_irq_proc+0x94/0xac
 [<ffffffff8096d613>] kernel_init+0x119/0x16b
 [<ffffffff8020ceba>] child_rip+0xa/0x20
 [<ffffffff8096d4fa>] kernel_init+0x0/0x16b
 [<ffffffff8020ceb0>] child_rip+0x0/0x20
---[ end trace fefc41b665ffb5f9 ]---

When booting kernel 2.6.26, none of the problems above appear, even if the replaced memory is used (minicom.2.6.26.cap.gz is also attached).

Best regards,
Marcus
Created attachment 19669
minicom.2.6.26.cap.gz
Created attachment 19670
minicom.2.6.28.cap.gz
> But there is still a problem with ioremap.c and the netsc520 (the whole
> bootup process is attached as minicom.2.6.28.cap.gz):

Well, it's not a real problem. The warning just tells you that the resource which is accessed is in the RAM address space.

BIOS-e820: 0000000000100000 - 000000007fee0000 (usable)
NetSc520 flash device: 0x100000 at 0x200000

The driver is for an evaluation board and has a hard-coded address for the FLASH chip. On your machine there is definitely no such device at this address, and the ioremap code complains correctly that this access is wrong.

Actually this driver is complete crap. The ioremap succeeds despite the warning, and the driver somehow pretends that it found a flash device:

Creating 4 MTD partitions on "netsc520 Flash Bank":
0x00000000-0x000c0000 : "NetSc520 boot kernel"

That means it poked in the ioremapped RAM.

Please disable the driver for now. I'll look into fixing this along with some others of the same category, as there is trouble waiting.

@Venki: I wonder why the remap succeeds at all.

Thanks,
tglx
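The address overlap described above can be checked with a few lines of arithmetic. This is a minimal sketch, not the kernel's actual ioremap check: the helper names `ranges_overlap` and `window_in_ram` are hypothetical, and it assumes that the driver line reads as "size 0x100000 at physical address 0x200000" and that the e820 end address is exclusive.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if half-open ranges [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Usable RAM region from the boot log (end treated as exclusive: assumption).
E820_USABLE = [(0x0000000000100000, 0x000000007FEE0000)]

# Flash window as printed by the driver (size at base: assumption).
FLASH_BASE = 0x200000
FLASH_SIZE = 0x100000


def window_in_ram(base, size, usable=E820_USABLE):
    """True if the window [base, base + size) touches any usable RAM region."""
    return any(ranges_overlap(base, base + size, lo, hi) for lo, hi in usable)


flash_in_ram = window_in_ram(FLASH_BASE, FLASH_SIZE)
print(flash_in_ram)
```

Under these assumptions the flash window from 0x200000 to 0x300000 lies entirely inside the usable RAM region, which matches the reason given for the warning.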