Bug 14563 - SCSI tape driver: Spurious EIO and kernel BUG
Summary: SCSI tape driver: Spurious EIO and kernel BUG
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-09 08:29 UTC by Joe Breuer
Modified: 2012-06-14 16:51 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.29.4, 2.6.30.3, 2.6.31.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
BUG console capture from 2.6.30.3 while writing to tape (16.78 KB, text/plain)
2009-11-09 08:29 UTC, Joe Breuer
Details
BUG console capture from 2.6.30.3 while writing to tape (2.34 KB, text/plain)
2009-11-09 08:29 UTC, Joe Breuer
Details
BUG console capture from 2.6.30-gentoo-r5 while writing to tape (2.74 KB, text/plain)
2009-11-09 08:29 UTC, Joe Breuer
Details
Hardware overview created using lshw under 2.6.27.10 (28.15 KB, text/plain)
2009-11-09 08:30 UTC, Joe Breuer
Details
dmesg output from affected kernel 2.6.29.4 (immediately after boot) (30.23 KB, text/plain)
2009-11-09 08:30 UTC, Joe Breuer
Details
dmesg output from non-affected kernel 2.6.27.10 (after a successful backup/test-restore run) (30.18 KB, text/plain)
2009-11-09 08:31 UTC, Joe Breuer
Details
Console output with st DEBUG enabled, 2.6.30.3, variable tape block size (5.75 KB, text/plain)
2009-11-17 10:57 UTC, Joe Breuer
Details
Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k (8.57 KB, text/plain)
2009-11-17 10:57 UTC, Joe Breuer
Details
Console output with st DEBUG enabled, 2.6.27.10, fixed tape block size 512k (2.94 KB, text/plain)
2009-11-18 15:38 UTC, Joe Breuer
Details
Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k, DIO turned off (see comment) (3.41 KB, text/plain)
2009-11-18 16:54 UTC, Joe Breuer
Details
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19 (4.82 KB, text/plain)
2009-11-23 12:10 UTC, Joe Breuer
Details
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19, really this time (12.67 KB, text/plain)
2009-11-25 09:47 UTC, Joe Breuer
Details

Description Joe Breuer 2009-11-09 08:29:01 UTC
Created attachment 23702 [details]
BUG console capture from 2.6.30.3 while writing to tape

Overview:

Performing tape backups using amanda 2.6.0-p2 [using 512kB block size] leads to spurious 'EIO' errors when writing to the tape; most of the time a kernel BUG is hit shortly thereafter.


Steps to reproduce:

Using amanda 2.6.0-p2 to perform tape backups to an HP C5683A DDS-4 tape drive; connected to a Adaptec 29160 Ultra160 SCSI adapter / aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs.


Actual results:

Amanda ALWAYS (on affected kernels, see below) complains about an EIO while writing to the tape a few (< 1000) MB into the tape; most of the time a kernel BUG is displayed and the system becomes unresponsive.

See attachments BUG-*.txt for a selection. Unfortunately, I could not save readable BUG output from 2.6.31.5 due to a management console problem; if there is great demand I can certainly reproduce the BUG output with working console capture.


Expected results:

Amanda should correctly write to tape during backup as it does using kernels <= 2.6.27.10.


Affected versions:

The oldest kernel I tried which DOES exhibit the problem is 2.6.29.4; other kernels exhibiting this problem are 2.6.30.3 and 2.6.31.5.


Non-affected versions:

A distribution-patched kernel based on version 2.6.23 [2.6.23-gentoo-r8] used to work correctly since February 2008; the newest kernel I tried which did not exhibit this problem is 2.6.27.10.

I did not try any 2.6.28 kernel.


Additional information:

See lshw-2.6.27.10.txt for a complete hardware overview; and dmesg-2.6.27.10.txt/dmesg-2.6.29.4.txt for dmesg output on a working/non-working kernel, respectively.
Comment 1 Joe Breuer 2009-11-09 08:29:23 UTC
Created attachment 23703 [details]
BUG console capture from 2.6.30.3 while writing to tape
Comment 2 Joe Breuer 2009-11-09 08:29:45 UTC
Created attachment 23704 [details]
BUG console capture from 2.6.30-gentoo-r5 while writing to tape
Comment 3 Joe Breuer 2009-11-09 08:30:20 UTC
Created attachment 23705 [details]
Hardware overview created using lshw under 2.6.27.10
Comment 4 Joe Breuer 2009-11-09 08:30:55 UTC
Created attachment 23706 [details]
dmesg output from affected kernel 2.6.29.4 (immediately after boot)
Comment 5 Joe Breuer 2009-11-09 08:31:33 UTC
Created attachment 23707 [details]
dmesg output from non-affected kernel 2.6.27.10 (after a successful backup/test-restore run)
Comment 6 Kai Mäkisara 2009-11-16 19:19:33 UTC
On Mon, 9 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
,,,
> Overview:
> 
> Performing tape backups using amanda 2.6.0-p2 [using 512kB block size] leads
> to
> spurious 'EIO' errors when writing to the tape; most of the time a kernel BUG
> is hit shortly thereafter.
> 

Can you try this with debugging enabled in st? (Edit st.c to
'#define DEBUG 1' and recompile the module.) This might give more 
information about the source of the EIO and the code path it uses.

...
> Amanda should correctly write to tape during backup as it does using kernels
> <=
> 2.6.27.10.
> 
> 
> Affected versions:
> 
> The oldest kernel I tried which DOES exhibit the problem is 2.6.29.4; other
> kernels exhibiting this problem are 2.6.30.3 and 2.6.31.5.
> 

There were major changes to the code performing the SCSI commands at 
2.6.29. The changes were mostly done by Fujita Tomonori and I added him to 
the cc list.

Kai
Comment 7 Anonymous Emailer 2009-11-17 05:15:02 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

> > Performing tape backups using amanda 2.6.0-p2 [using 512kB block size]
> leads to
> > spurious 'EIO' errors when writing to the tape; most of the time a kernel
> BUG
> > is hit shortly thereafter.
> > 
> 
> Can you try this with debugging enabled in st? (Edit st.c to
> '#define DEBUG 1' and recompile the module.) This might give more 
> information about the source of the EIO and the code path it uses.

Sorry about that bug.

Looks like that st drivers messes up memory management but the log
doesn't give enough information.

As Kai suggested, please enable the debug option and send the new log.
Comment 8 Joe Breuer 2009-11-17 10:56:33 UTC
I've recreated the problem in 2.6.30.3 with '#define DEBUG 1' in st.c.

Attached is the console log with a few annotations interspersed: st-debug-2.6.30.3-mtblk0.txt.

The buffer sizes of 4096 and 516096 look most suspicious to me, as amanda uses 512kB (== 524288 bytes) blocks. I tried again with 'mt setblk 524288', but the same result: EIO followed by BUG. The corresponding console log is attached as st-debug-2.6.30.3-mtblk512k.txt.
Comment 9 Joe Breuer 2009-11-17 10:57:30 UTC
Created attachment 23814 [details]
Console output with st DEBUG enabled, 2.6.30.3, variable tape block size
Comment 10 Joe Breuer 2009-11-17 10:57:52 UTC
Created attachment 23815 [details]
Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k
Comment 11 Anonymous Emailer 2009-11-18 09:18:05 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

On Tue, 17 Nov 2009 10:56:36 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
> 
> 
> 
> 
> 
> --- Comment #8 from Joachim Breuer <linux-kernel@jmbreuer.net>  2009-11-17
> 10:56:33 ---
> I've recreated the problem in 2.6.30.3 with '#define DEBUG 1' in st.c.
> 
> Attached is the console log with a few annotations interspersed:
> st-debug-2.6.30.3-mtblk0.txt.

Thanks for the logs.

> The buffer sizes of 4096 and 516096 look most suspicious to me, as amanda
> uses
> 512kB (== 524288 bytes) blocks.

Can you get the log with a working kernel (2.6.27?) and '#define DEBUG
1' in st.c? Then we could see what are different with the same
commands.


> I tried again with 'mt setblk 524288', but the
> same result: EIO followed by BUG. The corresponding console log is attached
> as
> st-debug-2.6.30.3-mtblk512k.txt.

Looks like that the kernel crashes in both logs right after dio
transfer? What happens with the kernel module parameter 'try_rdio = 0
try_wdio=0'?

Thanks,
Comment 12 Joe Breuer 2009-11-18 15:37:50 UTC
I'll attach the console log of a working dump to tape (roughly 11 GB in several (tape) files) below.

After that, I'll try with 'try_rdi0=0 try_wdio=0'.
Comment 13 Joe Breuer 2009-11-18 15:38:25 UTC
Created attachment 23824 [details]
 Console output with st DEBUG enabled, 2.6.27.10, fixed tape block size 512k
Comment 14 Joe Breuer 2009-11-18 15:39:44 UTC
Ah sorry typo, I'll of course try with 'try_rdio=0 try_wdio=0'.
Comment 15 Joe Breuer 2009-11-18 16:53:53 UTC
Allright! You hit something on the head!

With 'try_rdio=0 try_wdio=0' as parameters to the st module, writing to tape from amanda works correctly in 2.6.30.3.

I'll attach the console log.
Comment 16 Joe Breuer 2009-11-18 16:54:44 UTC
Created attachment 23826 [details]
 Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k, DIO turned off (see comment)
Comment 17 Anonymous Emailer 2009-11-19 07:51:05 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

On Wed, 18 Nov 2009 16:53:53 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
> 
> 
> 
> 
> 
> --- Comment #15 from Joachim Breuer <linux-kernel@jmbreuer.net>  2009-11-18
> 16:53:53 ---
> Allright! You hit something on the head!
> 
> With 'try_rdio=0 try_wdio=0' as parameters to the st module, writing to tape
> from amanda works correctly in 2.6.30.3.

Thanks a lot for the logs. They are very helpful.

I think that I found one bug about dio. There might be other bugs that
cause this issue though.

Can you try the patch that I've just sent.

Thanks,
Comment 18 Joe Breuer 2009-11-23 12:10:04 UTC
Created attachment 23877 [details]
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19

The patch you sent changes the behavior: I still get the EIO after (in this case) 239 I/O requests while writing to the tape; but the machine no longer hard-crashes with a BUG.

[I've removed the try_rdio=0/try_wdio=0 module parameters for this test.]

Attached is the console log as usual with my comments interspersed; if I read this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128." direct I/O is used at least once before the problem occurs.
Comment 19 Kai Mäkisara 2009-11-23 20:12:38 UTC
On Mon, 23 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
> 
> 
> 
> 
> 
> --- Comment #18 from Joachim Breuer <linux-kernel@jmbreuer.net>  2009-11-23
> 12:10:04 ---
> Created an attachment (id=23877)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=23877)
> Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA
> Tomonori
> 2009-11-19
> 
> The patch you sent changes the behavior: I still get the EIO after (in this
> case) 239 I/O requests while writing to the tape; but the machine no longer
> hard-crashes with a BUG.
> 
> [I've removed the try_rdio=0/try_wdio=0 module parameters for this test.]
> 
> Attached is the console log as usual with my comments interspersed; if I read
> this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128."
> direct I/O is used at least once before the problem occurs.
> 
Are you sure that you have used the patched st? The log does not show the 
the debugging output from Fujita's patch. It also shows the problem 
occurring before the patch:

st0: Block size: 524288, buffer size: 516096 (0 blocks).

Kai
Comment 20 Anonymous Emailer 2009-11-25 00:21:50 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

On Mon, 23 Nov 2009 12:10:06 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
> 
> 
> 
> 
> 
> --- Comment #18 from Joachim Breuer <linux-kernel@jmbreuer.net>  2009-11-23
> 12:10:04 ---
> Created an attachment (id=23877)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=23877)
> Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA
> Tomonori
> 2009-11-19
> 
> The patch you sent changes the behavior: I still get the EIO after (in this
> case) 239 I/O requests while writing to the tape; but the machine no longer
> hard-crashes with a BUG.
> 
> [I've removed the try_rdio=0/try_wdio=0 module parameters for this test.]
> 
> Attached is the console log as usual with my comments interspersed; if I read
> this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128."
> direct I/O is used at least once before the problem occurs.

As Kai said, it doesn't look like a kernel log with the patch
applied. With the patch, the kernel becomes more verbose.

Can you check again?

Thanks,
Comment 21 Joe Breuer 2009-11-25 09:47:00 UTC
Created attachment 23929 [details]
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19, really this time

Sorry about that. Insufficient sleep, insufficient coffee :-/

Attached is the log when really, actually, using FUJITA Tomonori's patch from 2009-11-19.

The two-fold good news:

a) amflush worked correctly, no EIO, no BUG
I'll still read this tape back in for verification and post a quick comment about the result.

b) DIO apparently was used: "st0: Number of r/w requests 38867, dio used in 138, pages 17664."
Comment 22 Joe Breuer 2009-11-25 16:51:21 UTC
OK, reading back the tape from #21 yields no verify errors; it seems writing with FUJITA Tomonori's patch works correctly for me.
Comment 23 Anonymous Emailer 2009-11-26 00:27:03 UTC
Reply-To: fujita.tomonori@lab.ntt.co.jp

On Wed, 25 Nov 2009 16:51:25 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14563
> 
> 
> 
> 
> 
> --- Comment #22 from Joachim Breuer <linux-kernel@jmbreuer.net>  2009-11-25
> 16:51:21 ---
> OK, reading back the tape from #21 yields no verify errors; it seems writing
> with FUJITA Tomonori's patch works correctly for me.

Thanks a lot!

And really sorry about the bug.

Note You need to log in before you can comment on or make changes to this bug.