Bug 8890

Summary: Kernel oops in ext3_get_inode_block
Product: File System
Component: ext3
Reporter: Sebastien Koechlin (seb.kernel)
Assignee: Andrew Morton (akpm)
Status: REJECTED INVALID
Severity: high
Priority: P1
CC: seb.kernel
Hardware: All
OS: Linux
Kernel Version: 2.6.22.2
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
- Kernel Oops in syslog
- New oops unable to handle kernel NULL pointer dereference at virtual address 00000040
- New oops unable to handle kernel NULL pointer dereference at virtual address 00000040
- .config
- New Oops: invalid opcode: 0000
- New Oops: kernel BUG at mm/slab.c:2980! invalid opcode: 0000

Description Sebastien Koechlin 2007-08-15 06:01:48 UTC
Distribution: Debian sarge

Hardware Environment: 
- Celeron 2.4GHz, 2GB
- 2x Promise Technology, Inc. PDC20268
- VIA VT6420 SATA RAID Controller

Software Environment:
- Software raid and LVM, ext3

Problem Description:
The following kernel oops occurred while committing changes to a mailbox in mutt. The mailbox is 815MB. /tmp may have been full when the oops occurred (I don't know).

Steps to reproduce:
Did not try
Comment 1 Sebastien Koechlin 2007-08-15 06:06:37 UTC
Created attachment 12391 [details]
Kernel Oops in syslog
Comment 2 Sebastien Koechlin 2007-08-16 04:36:00 UTC
The same oops occurred this morning, with the same stack trace, while unzipping a ~200MB zip file on another filesystem.

None of the filesystems are full.
Comment 3 Andrew Morton 2007-08-16 09:56:25 UTC
Can we see the first oops trace please?

The trace you've attached here is for the second oops:

Aug 15 12:37:40 barberine kernel: [135275.066924] EFLAGS: 00010216   (2.6.22.2.skc2 #2)

See that "#2" there?

It's important that we see the first oops which occurs after the
machine boots.

Also, something seems to have fed your oops trace through ksymoops,
which is no longer needed.  Maybe syslog did it, maybe you did it manually?
Comment 4 Sebastien Koechlin 2007-08-16 10:55:31 UTC
Sorry, I have nothing logged on disk this time, but I think you have both oopses in my attached file:
[135275.066789] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000050
[135275.066898] Oops: 0000 [#1]
and
[135275.071823] kernel BUG at fs/jbd/transaction.c:272!
[135275.071905] invalid opcode: 0000 [#2]

I did not run ksymoops; syslog probably did. Do you want me to try to stop it?
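
As a reading aid for the trace lines quoted above: the kernel numbers each oops since boot with a bracketed counter such as `[#1]`, so the first oops in a saved log can be picked out mechanically. This is a minimal Python sketch, assuming the log lines look like the ones quoted in this comment; the name `oops_sequence` is hypothetical and not part of any kernel tool.

```python
import re

# Matches the bracketed oops counter, e.g. "Oops: 0000 [#1]".
# The leading "[135275.066898]" timestamps do not match because they lack "#".
OOPS_COUNTER = re.compile(r"\[#(\d+)\]")


def oops_sequence(log_lines):
    """Return (counter, line) pairs for every line carrying an oops counter."""
    found = []
    for line in log_lines:
        match = OOPS_COUNTER.search(line)
        if match:
            found.append((int(match.group(1)), line.strip()))
    return found


# Sample lines copied from this comment.
sample = [
    "[135275.066789] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000050",
    "[135275.066898] Oops: 0000 [#1]",
    "[135275.071823] kernel BUG at fs/jbd/transaction.c:272!",
    "[135275.071905] invalid opcode: 0000 [#2]",
]

sequence = oops_sequence(sample)
# The pair with the lowest counter is the first oops after boot.
first_counter, first_line = min(sequence)
print(first_counter, first_line)
```

Later oopses often run on state already damaged by the first one, which is why the lowest counter is the one worth reporting.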

The system stops responding quickly (in less than a minute) after this error, and network NAT also stops working. Yesterday I had time to do SysRq Sync, tErm, kIll and Unmount on the console. Today, however, I was connected remotely over ssh and could not reach the console. Syslog sent the oops to my ssh terminal; the first lines were lost (the buffer was too small), but I compared it with the attachment and the last part was the same:
For #1:
 EIP: [ext3_get_inode_block+36/254] ext3_get_inode_block+0x24/0xfe SS:ESP 0068:ce437bec
For #2:
 Assertion failure in journal_start() at fs/jbd/transaction.c:272

What would help you? .config? dmesg at startup?
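
The EIP line quoted above gives the same location twice: once in brackets as `symbol+offset/size` in decimal, and once as `symbol+0xoffset/0xsize` in hexadecimal. The bracketed form looks like an annotation added by the logging daemon, but that is an assumption. A small Python sketch, with the hypothetical helper `parse_locations`, can confirm that both forms point to the same place:

```python
import re

# "[symbol+36/254]": decimal offset and size, as in the bracketed form.
DECIMAL_FORM = re.compile(r"\[(\w+)\+(\d+)/(\d+)\]")
# "symbol+0x24/0xfe": hexadecimal offset and size.
HEX_FORM = re.compile(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)")


def parse_locations(line):
    """Return (decimal_form, hex_form) as (symbol, offset, size) tuples.

    Either element is None when that form is absent from the line.
    """
    dec = DECIMAL_FORM.search(line)
    hexa = HEX_FORM.search(line)
    dec_loc = (dec.group(1), int(dec.group(2)), int(dec.group(3))) if dec else None
    hex_loc = (hexa.group(1), int(hexa.group(2), 16), int(hexa.group(3), 16)) if hexa else None
    return dec_loc, hex_loc


# Line copied from this comment.
eip_line = " EIP: [ext3_get_inode_block+36/254] ext3_get_inode_block+0x24/0xfe SS:ESP 0068:ce437bec"

dec_loc, hex_loc = parse_locations(eip_line)
print(dec_loc, hex_loc, dec_loc == hex_loc)
```

Here both forms resolve to offset 36 (0x24) into a 254-byte (0xfe) function, so the extra annotation adds no information to the trace.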
Comment 5 Sebastien Koechlin 2007-08-29 03:53:52 UTC
A new oops occurred last night; I don't know whether it is related.
Bad hardware? Wrong gcc version? (Debian sarge 3.3.5-3)
Comment 6 Sebastien Koechlin 2007-08-29 03:55:40 UTC
Created attachment 12609 [details]
New oops unable to handle kernel NULL pointer dereference at virtual address 00000040
Comment 7 Sebastien Koechlin 2007-08-29 04:30:24 UTC
Created attachment 12610 [details]
New oops unable to handle kernel NULL pointer dereference at virtual address 00000040

Here is the full dmesg from startup with oops.

I do not have any binary named ksymoops on the computer, just klogd and syslogd, but klogd complains at startup:
Aug 25 16:09:09 barberine kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Aug 25 16:09:09 barberine kernel: Cannot find map file.
Aug 25 16:09:09 barberine kernel: No module symbols loaded - kernel modules not enabled.

Do you want me to install the right /boot/System.map?
Comment 8 Sebastien Koechlin 2007-08-29 04:32:30 UTC
Created attachment 12611 [details]
.config

Linux barberine 2.6.22.3.skc4 #4 Thu Aug 16 10:00:51 CEST 2007 i686 GNU/Linux

Gnu C                  3.3.5
Gnu make               3.80
binutils               2.15
util-linux             2.12p
mount                  2.12p
module-init-tools      3.2-pre1
e2fsprogs              1.37
nfs-utils              1.0.6
Linux C Library        2.3.2
Dynamic linker (ldd)   2.3.2
Procps                 3.2.1
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.2.1
udev                   056
seb barberine:/usr/src/linux [1093]% zgrep CONFIG_KALLSYMS /proc/config.gz
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
Comment 9 Sebastien Koechlin 2007-09-06 02:29:07 UTC
Created attachment 12723 [details]
New Oops: invalid opcode: 0000

Upgraded kernel to 2.6.22.5
Upgraded GCC to 4.1.1
Comment 10 Andrew Morton 2007-09-06 02:49:03 UTC
These oopses are all different and point at memory corruption in various
places (page allocator, slab).

I'd be suspecting faulty hardware, or some bug in some piece
of code (probably a driver) which few other people use.
Comment 11 Sebastien Koechlin 2007-09-06 02:52:14 UTC
Created attachment 12725 [details]
New Oops: kernel BUG at mm/slab.c:2980! invalid opcode: 0000

(The last file was generated from messages on the console; here is the oops trace from kern.log.)

Upgraded kernel to 2.6.22.5
Upgraded GCC to 4.1

Linux barberine 2.6.22.5.skc6 #6 Thu Aug 30 19:57:36 CEST 2007 i686 GNU/Linux

Gnu C                  4.1.2 (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21))
Gnu make               3.81
binutils               2.17
util-linux             2.12r
mount                  2.12r
module-init-tools      3.3-pre2
e2fsprogs              1.40-WIP
Linux C Library        2.3.6
Dynamic linker (ldd)   2.3.6
Procps                 3.2.7
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
udev                   105
Comment 12 Sebastien Koechlin 2007-09-07 04:23:36 UTC
There is something strange:
Each time, I have to do a hard reboot with the reboot switch, so the partitions are not unmounted.
At boot, the ext3 journal is used to recover, but the RAID arrays are all detected as synchronized:

[  156.411425] md: considering sdb2 ...
[  156.411505] md:  adding sdb2 ...
[  156.411574] md: sda2 has different UUID to sdb2
[  156.411646] md: hdi3 has different UUID to sdb2
[  156.411721] md:  adding hde2 ...
[  156.411788] md: created md0
[  156.411852] md: bind<hde2>
[  156.411941] md: bind<sdb2>
[  156.412013] md: running: <sdb2><hde2>
[  156.412308] raid1: raid set md0 active with 2 out of 2 mirrors
(same for md1)
[  156.413494] md: ... autorun DONE.
[  156.464550] EXT3-fs: INFO: recovery required on readonly filesystem.
[  156.464625] EXT3-fs: write access will be enabled during recovery.
[  157.525557] kjournald starting.  Commit interval 5 seconds
[  157.525652] EXT3-fs: md0: orphan cleanup on readonly fs
[  157.574160] EXT3-fs: md0: 9 orphan inodes deleted
[  157.575451] EXT3-fs: recovery complete.
[  157.600352] EXT3-fs: mounted filesystem with ordered data mode.
[  157.600449] VFS: Mounted root (ext3 filesystem) readonly.

They really are:

[176180.076542] md: data-check of RAID array md0
[176180.076560] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[176180.076572] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
...
[176309.260073] md: md0: data-check done.
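
The boot messages above report how many mirrors each array came up with. This short Python sketch pulls that count out of a saved dmesg, assuming the `raid1: raid set ... active with N out of M mirrors` wording shown in this comment; the helper names `mirror_counts` and `degraded_arrays` are hypothetical.

```python
import re

# Wording taken from the boot log quoted in this comment.
RAID1_ACTIVE = re.compile(
    r"raid1: raid set (\w+) active with (\d+) out of (\d+) mirrors"
)


def mirror_counts(dmesg_lines):
    """Map array name -> (active mirrors, expected mirrors)."""
    counts = {}
    for line in dmesg_lines:
        match = RAID1_ACTIVE.search(line)
        if match:
            name, active, expected = match.groups()
            counts[name] = (int(active), int(expected))
    return counts


def degraded_arrays(counts):
    """Return the sorted names of arrays running with fewer mirrors than expected."""
    return sorted(
        name for name, (active, expected) in counts.items() if active < expected
    )


# Lines copied from this comment.
sample = [
    "[  156.412013] md: running: <sdb2><hde2>",
    "[  156.412308] raid1: raid set md0 active with 2 out of 2 mirrors",
]

counts = mirror_counts(sample)
print(counts)
print(degraded_arrays(counts))
```

The mirror count only describes array membership. Whether the mirrors hold identical data is what the data-check run shown above verifies.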
Comment 13 Sebastien Koechlin 2007-09-08 02:50:17 UTC
You are right; last night memtest found an error.

Sorry for the disturbance.