Bug 6561

Summary: 2.6.16 kernels unstable with Adaptec 2100 SCSI RAID
Product: Drivers Reporter: Dave R (meherenow)
Component: I2O Assignee: Alan (alan)
Status: CLOSED CODE_FIX    
Severity: high CC: akpm, bunk, kernel, Markus.Lidel
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.16 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Bugfix for 2.6.16
Using bonnie++ on a system running the newly patched kernel

Description Dave R 2006-05-15 14:32:44 UTC
Most recent kernel where this bug did not occur: 2.6.15
Distribution: Fedora Core 4 and 5 (also appears on gentoo with 2.6.16)
Hardware Environment: i686, Adaptec 2100S SCSI RAID card
Software Environment: Fedora Core 4 with 2.6.16 errata kernel, or FC5 default
Problem Description:

The 2.6.16 kernel appears to be unstable when used with a RAID array supported
by the i2o driver such as the Adaptec 2100S.

Steps to reproduce:

1) Install an OS using the 2.6.15 kernel or below.
2) Upgrade to 2.6.16 and watch the kernel oops

OR

1) Attempt to install an OS using the 2.6.15 kernel and watch the kernel oops

I have reported this in detail at the Red Hat bugzilla (link below), but after
further investigation (and help from others) I believe it to be an upstream
(i.e. here) bug, so I've raised it here as well.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189570

There is considerable debug and traceback info available at bugzilla.redhat.com.
Comment 1 Andrew Morton 2006-05-15 15:35:21 UTC
OK, I see from the RH report that Markus is working on getting the appropriate
hardware.

But 2.6.15 used to work.  Unfortunately we put a lot of changes into
that driver between 2.6.15 and 2.6.16.

I was unable to locate an oops trace in that RH report.  Maybe I missed
it.  Do we have one?
Comment 2 Dave R 2006-05-16 00:10:23 UTC
Thanks for the feedback.

You're quite right that 2.6.15 used to (and indeed still does) work; the problem
definitely surfaced in 2.6.16.

Unfortunately I don't think I have a full kernel oops. I think the only way I
can get one is by using serial port logging, and I don't have another machine
that has a serial port.

I think I've taken a few pictures of the screen at the critical point, but I
won't even be able to get those uploaded until the weekend.

If there's another way to log that kind of information I'm all ears.
Comment 3 Andrew Morton 2006-05-16 00:27:56 UTC
Digital photos work well.  You can email them to me if you like
and I'll attach them to the bugzilla report.

Comment 4 Dave R 2006-05-16 01:22:11 UTC
That's fine. I'm working off-site at the moment, but as soon as I get back at
the weekend I'll upload the images I took last time.
Comment 5 Dave R 2006-05-18 16:33:10 UTC
A kernel oops and some very useful test results are now available at:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189570

Hope that helps.
Comment 6 Markus Lidel 2006-06-07 08:43:31 UTC
Created attachment 8271
Bugfix for 2.6.16

Changes:
- Fixed memory corruption caused by accessing memory after free
- Fixed locking of struct i2o_exec_wait in Executive-OSM
- Removed LCT Notify in i2o_exec_probe(), which caused memory to be freed
during first enumeration
- Added missing locking in i2o_exec_lct_notify()
- Removed put_device() of the I2O controller in i2o_iop_remove(), which caused
the controller structure to be freed too early
- Fixed size of mempool in i2o_iop_alloc()
- Fixed access to memory after free in i2o_msg_get()
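
For readers unfamiliar with this bug class, the "access to memory after free"
items above can be illustrated with a minimal userspace sketch. This is not the
code from the patch: struct wait_entry, wait_entry_alloc() and
wait_entry_finish() are hypothetical names invented for the example, and
malloc()/free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record; NOT the real struct i2o_exec_wait. */
struct wait_entry {
    int complete;   /* set once a reply has arrived */
    int status;     /* result carried by the reply */
};

/* Allocate a zeroed entry; returns NULL on allocation failure. */
struct wait_entry *wait_entry_alloc(void)
{
    return calloc(1, sizeof(struct wait_entry));
}

/*
 * Buggy ordering, shown only as a comment for contrast:
 *
 *     free(w);
 *     return w->status;    reads memory that was already released
 *
 * Safe ordering: copy everything still needed, then free.
 */
int wait_entry_finish(struct wait_entry *w)
{
    int status = w->status; /* copy before the memory goes away */

    free(w);
    return status;          /* w must not be touched from here on */
}
```

The point is only the ordering: any field still needed is copied out before
the structure is released, and the pointer is not dereferenced afterwards.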
Comment 7 Dan Carpenter 2006-06-07 15:31:43 UTC
+	list_add(&wait->list, &i2o_exec_wait_list);

I'm a newbie, so please forgive me if this is obvious.  Shouldn't that be under
a spin_lock_irqsave()?  If one context adds something to the list while another
is deleting something from it at the same time, couldn't that trigger the
BUG() in list_del()?
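
To illustrate the concern, here is a minimal userspace sketch of a shared list
whose insertions and removals all happen under one lock. It rests on several
assumptions: a pthread mutex stands in for spin_lock_irqsave(), struct
list_node imitates the kernel's struct list_head, and wait_list,
wait_list_lock and the wait_list_*() helpers are hypothetical names, not
identifiers from the I2O driver or from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical stand-ins for i2o_exec_wait_list and the lock guarding it. */
static struct list_node wait_list = { &wait_list, &wait_list };
static pthread_mutex_t wait_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert n right after head; the caller must hold wait_list_lock. */
static void node_add(struct list_node *n, struct list_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n from its list; the caller must hold wait_list_lock. */
static void node_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

/* Add an entry while holding the lock, so no remover sees a half-linked node. */
void wait_list_insert(struct list_node *n)
{
    pthread_mutex_lock(&wait_list_lock);
    node_add(n, &wait_list);
    pthread_mutex_unlock(&wait_list_lock);
}

/* Remove an entry under the same lock that protects insertion. */
void wait_list_remove(struct list_node *n)
{
    pthread_mutex_lock(&wait_list_lock);
    node_del(n);
    pthread_mutex_unlock(&wait_list_lock);
}

/* Count the entries; walking the list also requires the lock. */
size_t wait_list_count(void)
{
    size_t count = 0;
    struct list_node *p;

    pthread_mutex_lock(&wait_list_lock);
    for (p = wait_list.next; p != &wait_list; p = p->next)
        count++;
    pthread_mutex_unlock(&wait_list_lock);
    return count;
}
```

In kernel code the same shape would use a spinlock rather than a mutex if the
list is also touched from interrupt context. Whether that applies to the list
in the patch is exactly the question being asked here.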

Comment 8 Markus Lidel 2006-06-08 05:13:56 UTC
Hello,

Which BUG() do you mean?

Hmmm, at the moment I don't see a problem, but I have probably overlooked something.

Best regards,


Markus Lidel
Comment 9 Ivan Karpukhin 2006-06-08 11:36:33 UTC
Hardware: Adaptec SCSI RAID 2015S
Kernel: 2.6.16 without new patches from Markus Lidel

Result: complete filesystem crash
Comment 10 Dave R 2006-06-09 02:27:20 UTC
[root@luggage ~]# uname -a
Linux luggage.darkglobe.int 2.6.16.20-withi2opatchi2opatch #1 Wed Jun 7 22:56:06
BST 2006 i686 athlon i386 GNU/Linux
[root@luggage ~]#

Works for me!

I'm stress testing it now, not expecting any problems but fingers crossed.
Comment 11 Dave R 2006-06-09 06:41:05 UTC
Created attachment 8280
Using bonnie++ on a system running the newly patched kernel

[dave@luggage ~]$ uname -a
Linux luggage.darkglobe.int 2.6.16.20-withi2opatchi2opatch #1 Wed Jun 7
22:56:06 BST 2006 i686 athlon i386 GNU/Linux
[dave@luggage ~]$ /usr/sbin/bonnie++ -s 4096 -r 1024 -n 5 -x 10 | tee bonnierun.log

Creates the following log...

P.S. The short answer is that everything appears to be stable.

Please submit this to the mainline kernel ASAP. Many thanks to all involved!
Comment 12 Daniel Drake 2006-06-09 15:41:12 UTC
Another downstream bug at http://bugs.gentoo.org/show_bug.cgi?id=136088
(nothing interesting to add at this time)
Comment 13 Adrian Bunk 2006-06-10 11:25:51 UTC
The patch from this bug has been merged into Linus' tree (and will therefore be
in 2.6.17).