12357 – 2.6.28 kernel panic on GA-MA790FX-DQ6

Summary: 2.6.28 kernel panic on GA-MA790FX-DQ6

Status:	CLOSED INVALID

Alias:	None

Product:	Other
Classification:	Unclassified
Component:	Modules
Hardware:	All Linux

Importance:	P1 high
Assignee:	platform_x86_64@kernel-bugs.osdl.org

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-01-04 03:07 UTC by Marcus Husar
Modified:	2010-01-25 14:28 UTC
CC List:	4 users

See Also:
Kernel Version:	2.6.28 vanilla kernel
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
.config of my kernel (2.6.28-git5) (13.94 KB, application/x-gzip) 2009-01-04 03:11 UTC, Marcus Husar
The whole kernel panic (8.97 KB, application/x-gzip) 2009-01-04 03:12 UTC, Marcus Husar
minicom.2.6.26.cap.gz (9.26 KB, application/x-gzip) 2009-01-05 16:37 UTC, Marcus Husar
minicom.2.6.28.cap.gz (12.30 KB, application/x-gzip) 2009-01-05 16:38 UTC, Marcus Husar

Description Marcus Husar 2009-01-04 03:07:27 UTC

Latest working kernel version: 2.6.26 (Debian Kernel)
Earliest failing kernel version: 2.6.28, perhaps 2.6.27 (never tested)
Distribution: Debian Lenny

Hardware Environment:
Mainboard: GA-MA790FX-DQ6, CPU: AMD Phenom 9350e, Memory: 4 GB DDR2-800, Hard disks: 3 * WD1600YS (RE 160GB), 3 * WD1000FYPS (RE2-GP 1TB), SATA controllers: AMD SB600 (onboard), JMicron 363 (onboard), Promise TX4 (PCI)

Software Environment:
Debian Lenny, vanilla kernel 2.6.28 (also git3, git4 and git5), built with make-kpkg (my .config will be attached), md-raid1, lvm2.

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
/dev/mapper/vg00-root /         ext3    data=journal,errors=remount-ro 0       1
/dev/md0        /boot           ext3    defaults,data=journal 0       2
/dev/mapper/vg00-home /home     ext3    defaults,data=journal 0       1
/dev/mapper/vg01-srv  /srv      ext3    defaults,data=journal,usrquota,grpquota 0       1
/dev/mapper/vg00-tmp  /tmp      ext3    defaults,data=journal 0       1
/dev/mapper/vg00-usr  /usr      ext3    defaults,data=journal 0       1
/dev/mapper/vg00-var  /var      ext3    defaults,data=journal 0       1
/dev/mapper/vg00-swap none      swap    sw              0       0
/dev/sr0        /media/cdrom0   udf,iso9660 user,noauto 0       0

Problem Description:
I compiled a kernel for two machines. On the first machine the kernel works without problems (GA-MA790FX-DS5, 4 hard disks, Debian Lenny, md-raid1 and md-raid6, lvm2, ext4). On the second machine (described above) the kernel panics on startup. Here are two snippets from the panic log (the whole log is attached):

------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:240 __ioremap_caller+0x173/0x2f5()
Hardware name: GA-MA790FX-DQ6
Pid: 1, comm: swapper Not tainted 2.6.28-git5-1 #1
Call Trace:
 [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff8020c9ee>] apic_timer_interrupt+0xe/0x20
 [<ffffffff8023d93c>] vprintk+0x287/0x2b3
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff80226717>] __ioremap_caller+0x173/0x2f5
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80209051>] _stext+0x51/0x120
 [<ffffffff802e80c6>] create_proc_entry+0x7d/0x92
 [<ffffffff80270e01>] register_irq_proc+0x94/0xac
 [<ffffffff8096d613>] kernel_init+0x119/0x16b
 [<ffffffff8020ceba>] child_rip+0xa/0x20
 [<ffffffff8096d4fa>] kernel_init+0x0/0x16b
 [<ffffffff8020ceb0>] child_rip+0x0/0x20
---[ end trace f77463f06c1843b4 ]---

------------[ cut here ]------------
WARNING: at kernel/smp.c:333 smp_call_function_mask+0x3b/0x1e3()
Hardware name: GA-MA790FX-DQ6
Pid: 0, comm: swapper Tainted: G   M    W  2.6.28-git5-1 #1
Call Trace:
 <#MC>  [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff803efc88>] vsnprintf+0x79a/0x7e2
 [<ffffffff8023d2ff>] try_acquire_console_sem+0x10/0x31
 [<ffffffff8025b46f>] smp_call_function_mask+0x3b/0x1e3
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff8025d50f>] crash_kexec+0xee/0xf7
 [<ffffffff8021d012>] native_smp_send_stop+0x1a/0x26
 [<ffffffff806f4aca>] panic+0x95/0x136
 [<ffffffff8025179d>] notifier_call_chain+0x29/0x4c
 [<ffffffff8023cf43>] oops_enter+0x9/0x10
 [<ffffffff80217d0f>] mce_log+0x0/0x7d
 [<ffffffff802180ab>] do_machine_check+0x31f/0x3cf
 [<ffffffff806f7125>] machine_check+0x15/0x20
 [<ffffffff802123a7>] default_idle+0x27/0x3b
 <<EOE>>  [<ffffffff8021253c>] c1e_idle+0xe1/0xe5
 [<ffffffff8020b064>] cpu_idle+0x4a/0x8b
 [<ffffffff8096dbfe>] start_kernel+0x35f/0x36b
 [<ffffffff8096d390>] x86_64_start_kernel+0xd9/0xdf
---[ end trace f77463f06c1843b4 ]---

Steps to reproduce: Just boot kernel 2.6.28 (here git5), compiled with the attached config, on a GA-MA790FX-DQ6 with a Phenom 9350e and 4 GB of memory.

If you think there is a problem with the CPU or the memory, I can swap them out temporarily.

Thanks in advance,
Marcus

Comment 1 Marcus Husar 2009-01-04 03:11:56 UTC

Created attachment 19634
.config of my kernel (2.6.28-git5)

Comment 2 Marcus Husar 2009-01-04 03:12:37 UTC

Created attachment 19635
The whole kernel panic

Comment 3 Marcin Slusarz 2009-01-05 12:31:15 UTC

Your hardware is dying:


HARDWARE ERROR
CPU 0: Machine Check Exception:                4 Bank 4: fe0000080005001b
TSC 2f219167c2 ADDR 20000554 MISC c008000001000000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
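
For readers who want to see why this report points at hardware, the status word in the message above can be split into its flag bits by hand. The following is only a rough sketch, not a replacement for mcelog: the function name `decode_mci_status` is made up for illustration, the flag positions are the architectural x86 MCi_STATUS bits as I understand them, and bits 31:16 are vendor-specific, so they are treated here as one opaque field.

```python
# Architectural flag bits of an x86 MCi_STATUS register (bit position -> name).
# Assumption: positions follow the generic x86 machine-check layout.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error-code fields.

    Hypothetical helper for illustration only.
    """
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "vendor_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16, opaque here
    }


# Status word taken from the "Bank 4" line in the panic output.
decoded = decode_mci_status(0xFE0000080005001B)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["vendor_specific_code"]))
```

With UC and PCC both set, the kernel has no safe way to continue, which matches the panic. The set MISCV and ADDRV bits agree with the MISC and ADDR values printed on the following line of the log.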

Comment 4 Marcus Husar 2009-01-05 13:24:19 UTC

Hello Marcin,

I have seen that message but didn't believe it (I don't want my hardware
to die). So I thought there must be a kernel bug. Older kernels worked
without any noticeable problem.

I will try to capture the same message with a working Debian kernel. It
will take me an hour because I'm at home at the moment.

That is what mcelog says:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 0 TSC 2f219167c2
STATUS 0 MCGSTATUS 0

Now I thought, perhaps the CPU is dying. But then I found this message
from the LKML:
http://lkml.indiana.edu/hypermail/linux/kernel/0605.1/2085.html

I will replace the memory and hope that kernel 2.6.28 then boots
properly. If this doesn't help, I'll replace the CPU and then the mainboard.

Either way, I'll let you know what happens. Thank you for your efforts.

Marcus

Comment 5 Marcus Husar 2009-01-05 16:35:29 UTC

I replaced the memory. The kernel panic is gone now. The problems with
smp.c and the "Machine Check Exception" are resolved.

But there is still a problem with ioremap.c and the netsc520 (the whole
bootup process is attached as minicom.2.6.28.cap.gz):

------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:240 __ioremap_caller+0x173/0x2f5()
Hardware name: GA-MA790FX-DQ6
Pid: 1, comm: swapper Not tainted 2.6.28-git5-1 #1
Call Trace:
 [<ffffffff8023cdfe>] warn_slowpath+0xd3/0x10d
 [<ffffffff80226cd5>] __change_page_attr_set_clr+0x16b/0x867
 [<ffffffff80251648>] down_trylock+0x28/0x2e
 [<ffffffff80251648>] down_trylock+0x28/0x2e
 [<ffffffff8023d2ff>] try_acquire_console_sem+0x10/0x31
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff806f4bb9>] printk+0x4e/0x56
 [<ffffffff80226717>] __ioremap_caller+0x173/0x2f5
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80996bca>] init_netsc520+0x0/0xfd
 [<ffffffff80996bfd>] init_netsc520+0x33/0xfd
 [<ffffffff80209051>] _stext+0x51/0x120
 [<ffffffff802e80c6>] create_proc_entry+0x7d/0x92
 [<ffffffff80270e01>] register_irq_proc+0x94/0xac
 [<ffffffff8096d613>] kernel_init+0x119/0x16b
 [<ffffffff8020ceba>] child_rip+0xa/0x20
 [<ffffffff8096d4fa>] kernel_init+0x0/0x16b
 [<ffffffff8020ceb0>] child_rip+0x0/0x20
---[ end trace fefc41b665ffb5f9 ]---

When booting kernel 2.6.26, none of the problems above appear, even when
the replaced memory is used (minicom.2.6.26.cap.gz is also attached).

Best regards,
Marcus

Comment 6 Marcus Husar 2009-01-05 16:37:55 UTC

Created attachment 19669
minicom.2.6.26.cap.gz

Comment 7 Marcus Husar 2009-01-05 16:38:18 UTC

Created attachment 19670
minicom.2.6.28.cap.gz

Comment 8 Thomas Gleixner 2009-01-14 02:59:22 UTC

> But there is still a problem with ioremap.c and the netsc520 (the whole
> bootup process is attached as minicom.2.6.28.cap.gz):

Well, it's not a real problem. The warning just tells you that the
resource which is accessed is in the RAM address space.

BIOS-e820: 0000000000100000 - 000000007fee0000 (usable)
NetSc520 flash device: 0x100000 at 0x200000

The driver is for an evaluation board and has a hard coded address for
the FLASH chip. On your machine there is definitely no such device at
this address and the ioremap code complains correctly that this access
is wrong.
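
The overlap can be checked with simple arithmetic on the two log lines quoted above. This is a minimal sketch, assuming the e820 end address is exclusive and the flash line means "0x100000 bytes starting at 0x200000"; `ranges_overlap` is a hypothetical helper, not kernel code.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True if half-open ranges [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Usable RAM range from the BIOS-e820 line (end treated as exclusive, an assumption).
ram_start, ram_end = 0x00100000, 0x7FEE0000

# Flash window from the driver's log line: size 0x100000 at address 0x200000.
flash_start = 0x200000
flash_end = flash_start + 0x100000

# The flash window lies entirely inside usable RAM, hence the ioremap warning.
print(ranges_overlap(flash_start, flash_end, ram_start, ram_end))
```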

Actually this driver is complete crap. The ioremap succeeds despite
the warning and the driver somehow pretends that it found a flash
device:

Creating 4 MTD partitions on "netsc520 Flash Bank":
0x00000000-0x000c0000 : "NetSc520 boot kernel"

That means it poked in the ioremapped ram.

Please disable the driver for now. I will look into fixing this, along
with some other drivers of the same category, as there is trouble waiting.
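
One way to confirm whether the driver is built into a given kernel is to look for its option in the .config. The sketch below assumes the option is named CONFIG_MTD_NETSC520; that name is my reading of the driver's Kconfig entry and is not stated in this thread, so verify it against your source tree. `option_state` is a hypothetical helper.

```python
def option_state(config_text, symbol):
    """Return the value of a Kconfig symbol in .config-style text.

    Returns "n" when the symbol is marked "is not set" or is absent.
    Hypothetical helper for illustration only.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {symbol} is not set":
            return "n"
        if line.startswith(f"{symbol}="):
            return line.split("=", 1)[1]
    return "n"  # symbol not mentioned: treated as disabled


# Made-up sample text, not the reporter's attached .config.
sample = "CONFIG_MTD=y\nCONFIG_MTD_NETSC520=y\n"
print(option_state(sample, "CONFIG_MTD_NETSC520"))
```

If the result is "y" or "m", the driver can be turned off in the MTD section of the kernel configuration before rebuilding.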

@Venki: I wonder why the remap succeeds at all.

Thanks,

	tglx
