Bug 9901

Summary: kernel panic in stex modules (?)
Product: IO/Storage
Reporter: Nikolay S. Rybaloff (dairinin)
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: alan, bunk, Sergey.Belyashov
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.24
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: lspci, cpuinfo, .config, slabinfo

Description Nikolay S. Rybaloff 2008-02-06 09:40:13 UTC
Latest working kernel version: 2.6.23-r6
Earliest failing kernel version: 2.6.24
Distribution: Gentoo
Hardware Environment: Core2D E6600, Asus p5B Dlx, 2G DDR2 667, Promise ST EX4350
Software Environment: GCC 4.2.3/4.1.2, CFLAGS="-O2"

Problem Description:
The problem is frequent kernel panics within the same module. I can't say what it is, but it looks related to DMA and the Promise driver.
Memory, the first suspect, is OK: 8 hours of memtest passed without errors.
Previously, kernel 2.6.23-gentoo-r6 compiled with GCC 4.1.2 worked just fine; after upgrading to GCC 4.2.2 the bug appeared. Upgrading to 2.6.24 didn't solve the problem. Switching back to GCC 4.1.2 made things better for a while: crashes became less frequent and I thought the compiler was the cause. But today the system crashed again with the same symptoms.
Sorry, I couldn't save the crash log, so I'll provide a screen "shot":
http://img238.imageshack.us/my.php?image=p2030030ki1.jpg

Steps to reproduce:
Boot, start the FTP server, and load the RAID with heavy input; within some hours it will crash. With pure reads the system can run for several days; a heavy write load kills it much more easily.
Comment 1 Nikolay S. Rybaloff 2008-02-06 09:42:33 UTC
Forgot to mention, system is amd64
Comment 2 Nikolay S. Rybaloff 2008-02-06 09:47:54 UTC
Created attachment 14724 [details]
lspci, cpuinfo, .config, slabinfo
Comment 3 Anonymous Emailer 2008-02-06 10:15:20 UTC
Reply-To: akpm@linux-foundation.org

The supertrak driver has regressed in 2.6.24.  And

commit 9cb83c7529d929c00f37d821daed1942a1b20602
Author: FUJITA Tomonori <tomof@acm.org>
Date:   Tue Oct 16 11:24:32 2007 +0200

    [SCSI] add use_sg_chaining option to scsi_host_template
    
looks a likely candidate.

And this:

commit d3f46f39b7092594b498abc12f0c73b0b9913bde
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Tue Jan 15 11:11:46 2008 -0600

    [SCSI] remove use_sg_chaining

from 2.6.25 looks to be a likely fix for it.  Should it be backported?
Comment 4 Anonymous Emailer 2008-02-06 10:27:00 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Wed, 2008-02-06 at 10:15 -0800, Andrew Morton wrote:
> The supertrak driver has regressed in 2.6.24.  And
> 
> commit 9cb83c7529d929c00f37d821daed1942a1b20602
> Author: FUJITA Tomonori <tomof@acm.org>
> Date:   Tue Oct 16 11:24:32 2007 +0200
> 
>     [SCSI] add use_sg_chaining option to scsi_host_template
>     
> looks a likely candidate.
> 
> And this:
> 
> commit d3f46f39b7092594b498abc12f0c73b0b9913bde
> Author: James Bottomley <James.Bottomley@HansenPartnership.com>
> Date:   Tue Jan 15 11:11:46 2008 -0600
> 
>     [SCSI] remove use_sg_chaining
> 
> from 2.6.25 looks to be a likely fix for it.  Should it be backported?

If the patch you identify is the culprit, mine can't be the fix ... and
it should also be present in git head.

The BUG_ON is here, isn't it?

static inline void
dma_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
	     int direction)
{
	BUG_ON(!valid_dma_direction(direction));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	dma_ops->unmap_sg(hwdev, sg, nents, direction);
}

stex only does scsi_dma_unmap(), so something looks to have tampered
with the cmnd->sc_data_direction somehow ... and I can't see how.

James
Comment 5 Anonymous Emailer 2008-02-06 17:02:25 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

On Wed, 06 Feb 2008 12:26:39 -0600
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> If the patch you identify is the culprit, mine can't be the fix ... and
> it should also be present in git head.
> 
> The BUG_ON is here: isn't it?
> 
> static inline void
> dma_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
>            int direction)
> {
>       BUG_ON(!valid_dma_direction(direction));
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       dma_ops->unmap_sg(hwdev, sg, nents, direction);
> }
> 
> stex only does scsi_dma_unmap(), so something looks to have tampered
> with the cmnd->sc_data_direction somehow ... and I can't see how.

Surely something changes cmnd->sc_data_direction after the mapping is set up; otherwise we would have hit the same BUG_ON in dma_map_sg before ever reaching dma_unmap_sg:

static inline int
dma_map_sg(struct device *hwdev, struct scatterlist *sg, int nents, int direction)
{
	BUG_ON(!valid_dma_direction(direction));
	return dma_ops->map_sg(hwdev, sg, nents, direction);
}
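
The argument above can be sketched as a small user-space model. This is only an illustration under assumptions, not kernel code: the enum values mirror the kernel's enum dma_data_direction, valid_dma_direction() is modelled on the check behind the BUG_ON, and struct fake_cmnd, run_command() and demo() are hypothetical names standing in for a command whose sc_data_direction is passed to both the map and the unmap call.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the kernel's enum dma_data_direction. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

/* Model of the check behind BUG_ON(!valid_dma_direction(direction)). */
static bool valid_dma_direction(int direction)
{
	return direction == DMA_BIDIRECTIONAL ||
	       direction == DMA_TO_DEVICE ||
	       direction == DMA_FROM_DEVICE;
}

/* Hypothetical stand-in for the one field of the SCSI command that matters. */
struct fake_cmnd {
	int sc_data_direction;
};

enum trip_point { TRIP_NONE, TRIP_AT_MAP, TRIP_AT_UNMAP };

/*
 * Walk one command through map -> (optional corruption) -> unmap and
 * report where the direction check would fire first.
 */
static enum trip_point run_command(struct fake_cmnd *cmd, bool corrupt,
				   int garbage)
{
	if (!valid_dma_direction(cmd->sc_data_direction))
		return TRIP_AT_MAP;	/* dma_map_sg would BUG first */
	if (corrupt)
		cmd->sc_data_direction = garbage;	/* overwritten in flight */
	if (!valid_dma_direction(cmd->sc_data_direction))
		return TRIP_AT_UNMAP;	/* only dma_unmap_sg BUGs */
	return TRIP_NONE;
}

/* Usage: the three cases discussed in the thread. */
static void demo(void)
{
	struct fake_cmnd ok = { DMA_TO_DEVICE };
	struct fake_cmnd hit = { DMA_TO_DEVICE };
	struct fake_cmnd bad = { 7 };

	assert(run_command(&ok, false, 0) == TRIP_NONE);
	assert(run_command(&hit, true, 0x5a) == TRIP_AT_UNMAP);
	assert(run_command(&bad, false, 0) == TRIP_AT_MAP);
}
```

In this model a panic that shows up only at unmap time corresponds to the second case: the direction was valid when the command was mapped and was overwritten before completion.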
Comment 6 Sergey Belyashov 2008-04-17 02:49:38 UTC
I have the same bug in 2.6.24.
http://foto.mail.ru/mail/belyashov_sa/Misc/2.html

It occurs when I try to write more than 10 GB.

04:0e.0 RAID bus controller: Promise Technology, Inc. 80333 [SuperTrak EX8350/EX16350], 80331 [SuperTrak EX8300/EX16300]
platform is x86_64
Comment 7 Alan 2009-03-24 04:34:03 UTC
Is this still occurring as of 2.6.28, or is it now cured?