Bug 10484 - Boot Oops+hang in 2.6.25-rc and 2.6.25-final kernels
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All Linux
Importance: P1 normal
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks: 9832
Reported: 2008-04-19 06:51 UTC by Nicolas Mailhot
Modified: 2008-09-23 04:18 UTC

See Also:
Kernel Version: 2.6.25
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
oops screen capture (127.50 KB, image/jpeg)
2008-04-19 06:52 UTC, Nicolas Mailhot
dmesg on the same kernel after a non-oopsing boot (30.43 KB, text/plain)
2008-04-19 06:53 UTC, Nicolas Mailhot
system lspci (32.14 KB, text/plain)
2008-04-19 06:56 UTC, Nicolas Mailhot
source code and disassembly of failing function (6.06 KB, text/plain)
2008-04-22 08:18 UTC, Chuck Ebbert
Another screen capture, this time on a vanilla kernel (134.10 KB, image/jpeg)
2008-04-24 14:45 UTC, Nicolas Mailhot
patch to fix oops. (765 bytes, patch)
2008-04-27 21:59 UTC, Neil Brown

Description Nicolas Mailhot 2008-04-19 06:51:26 UTC
Latest working kernel version: 2.6.24-rc5.mm1 has been confirmed to work fine
Earliest failing kernel version: N/A unfortunately. I don't reboot often enough to notice it
Distribution: Fedora Devel
Hardware Environment: CK804 + AMD X2
Software Environment: early udev boot
Problem Description:

See attached picture. As the kernel output scrolls very fast at this point, it took me weeks to get a usable screen capture.

Steps to reproduce:

Boot. This almost always results in a hang. Pressing Shift+Page Up repeatedly at boot time reduces the probability of a hang.

See also https://bugzilla.redhat.com/show_bug.cgi?id=441765
Comment 1 Nicolas Mailhot 2008-04-19 06:52:37 UTC
Created attachment 15814
oops screen capture
Comment 2 Nicolas Mailhot 2008-04-19 06:53:42 UTC
Created attachment 15815
dmesg on the same kernel after a non-oopsing boot
Comment 3 Nicolas Mailhot 2008-04-19 06:56:27 UTC
Created attachment 15816
system lspci
Comment 4 Roland Kletzing 2008-04-20 15:50:17 UTC
So, if this happens with the Fedora kernel, which is a distro-specific kernel that may contain several patches, does the same happen with a vanilla kernel?
Can you check whether vanilla 2.6.25 makes a difference?
Comment 5 Nicolas Mailhot 2008-04-21 02:29:15 UTC
I probably won't have access to this particular system before the end of the week, sorry. I have already spent an awful lot of time just getting a good Oops capture.
Comment 6 Roland Kletzing 2008-04-21 16:12:55 UTC
Thank you for your time.
Nevertheless, additional input would be much appreciated.
I'm not sure whether you can get a "vanilla kernel rpm" for Fedora, which would save you the compile effort and time; for SUSE there is one.

Does this happen with a 32-bit or a 64-bit kernel?
Comment 7 Nicolas Mailhot 2008-04-21 23:52:56 UTC
64-bit kernel
Comment 8 Chuck Ebbert 2008-04-22 08:18:45 UTC
Created attachment 15841
source code and disassembly of failing function

mddev is used once after being stored here:

2087         mddev_t *mddev = rdev->mddev;

Later on rdev->mddev is used but it is no longer equal to mddev -- something has changed it. We then try to unlock using a bad address.
Comment 9 Nicolas Mailhot 2008-04-24 14:45:35 UTC
Created attachment 15902
Another screen capture, this time on a vanilla kernel

It seems a vanilla kernel such as
http://koji.fedoraproject.org/koji/taskinfo?taskID=581601
fails the same way
Comment 10 Chuck Ebbert 2008-04-26 17:50:29 UTC
The oops is on line 2099 in drivers/md/md.c:

2099                 mddev_unlock(rdev->mddev);

rdev is NULL but it was a valid address upon entry to the function.
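
The hazard described in comments 8 and 10 can be illustrated with a minimal user-space sketch. This is not the attached patch and not the md.c source: struct array, struct member, array_lock, array_unlock, detach_store and attr_store are all hypothetical stand-ins. The sketch assumes a store handler that may clear the back-pointer while the lock is held, and shows one defensive pattern: unlock through the pointer that was cached before locking instead of re-reading the field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's array and member-device types. */
struct array  { int locked; };
struct member { struct array *owner; };

/* Toy lock: returns 0 on success, -1 if already held. */
static int array_lock(struct array *a)
{
    if (a->locked)
        return -1;
    a->locked = 1;
    return 0;
}

static void array_unlock(struct array *a)
{
    /* Unlocking through a NULL or stale pointer is the failure mode
     * discussed above; here it trips an assertion instead of oopsing. */
    assert(a != NULL && a->locked);
    a->locked = 0;
}

/* Assumed handler: detaches the member from its array, so the
 * back-pointer changes while the caller still holds the lock. */
static int detach_store(struct member *m)
{
    m->owner = NULL;
    return 0;
}

int attr_store(struct member *m, int (*store)(struct member *))
{
    struct array *a = m->owner;   /* read the back-pointer once */
    int rv;

    if (a == NULL)
        return -1;
    rv = array_lock(a);
    if (rv == 0) {
        rv = store(m);            /* may change m->owner */
        array_unlock(a);          /* cached pointer, not m->owner */
    }
    return rv;
}
```

Calling attr_store with detach_store leaves m->owner set to NULL and the array unlocked. Writing array_unlock(m->owner) in the same place would pass NULL to the unlock routine.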
Comment 11 Roland Kletzing 2008-04-27 14:08:25 UTC
So this oops is in the md/raid code?

Nicolas, are you using software RAID or LVM volumes?
Comment 12 Nicolas Mailhot 2008-04-27 14:23:44 UTC
I'm using LVM over md:

# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md0 : active raid1 sda1[0] sdb1[1]
      2096384 blocks [2/2] [UU]
      
md1 : active raid1 sda3[0] sdb3[1]
      288856640 blocks [2/2] [UU]
      
unused devices: <none>

# /sbin/pvdisplay 
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               VolGroup00
  PV Size               275,48 GB / not usable 38,56 MB
  Allocatable           yes 
  PE Size (KByte)       65536
  Total PE              4407
  Free PE               2211
  Allocated PE          2196
  PV UUID               5vhc8L-w0Jt-cTIo-Hswk-NtbK-eWuP-d3q6J1
Comment 13 Neil Brown 2008-04-27 21:59:54 UTC
Created attachment 15938
patch to fix oops.

This patch will probably fix the problem.

I'll submit it for the -stable series.
Comment 14 Roland Kletzing 2008-04-28 13:35:42 UTC
If you like, you can try the patch and report the result here.

If not, you can wait until it appears upstream and then try the vanilla kernel from the Fedora project.
Comment 15 Alan 2008-09-23 04:18:11 UTC
Verified applied
