Bug 14563
Created attachment 23703 [details]
BUG console capture from 2.6.30.3 while writing to tape
Created attachment 23704 [details]
BUG console capture from 2.6.30-gentoo-r5 while writing to tape
Created attachment 23705 [details]
Hardware overview created using lshw under 2.6.27.10
Created attachment 23706 [details]
dmesg output from affected kernel 2.6.29.4 (immediately after boot)
Created attachment 23707 [details]
dmesg output from non-affected kernel 2.6.27.10 (after a successful backup/test-restore run)
On Mon, 9 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 ,,, > Overview: > > Performing tape backups using amanda 2.6.0-p2 [using 512kB block size] leads > to > spurious 'EIO' errors when writing to the tape; most of the time a kernel BUG > is hit shortly thereafter. > Can you try this with debugging enabled in st? (Edit st.c to '#define DEBUG 1' and recompile the module.) This might give more information about the source of the EIO and the code path it uses. ... > Amanda should correctly write to tape during backup as it does using kernels > <= > 2.6.27.10. > > > Affected versions: > > The oldest kernel I tried which DOES exhibit the problem is 2.6.29.4; other > kernels exhibiting this problem are 2.6.30.3 and 2.6.31.5. > There were major changes to the code performing the SCSI commands at 2.6.29. The changes were mostly done by Fujita Tomonori and I added him to the cc list. Kai Reply-To: fujita.tomonori@lab.ntt.co.jp > > Performing tape backups using amanda 2.6.0-p2 [using 512kB block size] > leads to > > spurious 'EIO' errors when writing to the tape; most of the time a kernel > BUG > > is hit shortly thereafter. > > > > Can you try this with debugging enabled in st? (Edit st.c to > '#define DEBUG 1' and recompile the module.) This might give more > information about the source of the EIO and the code path it uses. Sorry about that bug. Looks like that st drivers messes up memory management but the log doesn't give enough information. As Kai suggested, please enable the debug option and send the new log. I've recreated the problem in 2.6.30.3 with '#define DEBUG 1' in st.c. Attached is the console log with a few annotations interspersed: st-debug-2.6.30.3-mtblk0.txt. The buffer sizes of 4096 and 516096 look most suspicious to me, as amanda uses 512kB (== 524288 bytes) blocks. I tried again with 'mt setblk 524288', but the same result: EIO followed by BUG. The corresponding console log is attached as st-debug-2.6.30.3-mtblk512k.txt. Created attachment 23814 [details]
Console output with st DEBUG enabled, 2.6.30.3, variable tape block size
Created attachment 23815 [details]
Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k
Reply-To: fujita.tomonori@lab.ntt.co.jp On Tue, 17 Nov 2009 10:56:36 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 > > > > > > --- Comment #8 from Joachim Breuer <linux-kernel@jmbreuer.net> 2009-11-17 > 10:56:33 --- > I've recreated the problem in 2.6.30.3 with '#define DEBUG 1' in st.c. > > Attached is the console log with a few annotations interspersed: > st-debug-2.6.30.3-mtblk0.txt. Thanks for the logs. > The buffer sizes of 4096 and 516096 look most suspicious to me, as amanda > uses > 512kB (== 524288 bytes) blocks. Can you get the log with a working kernel (2.6.27?) and '#define DEBUG 1' in st.c? Then we could see what are different with the same commands. > I tried again with 'mt setblk 524288', but the > same result: EIO followed by BUG. The corresponding console log is attached > as > st-debug-2.6.30.3-mtblk512k.txt. Looks like that the kernel crashes in both logs right after dio transfer? What happens with the kernel module parameter 'try_rdio = 0 try_wdio=0'? Thanks, I'll attach the console log of a working dump to tape (roughly 11 GB in several (tape) files) below. After that, I'll try with 'try_rdi0=0 try_wdio=0'. Created attachment 23824 [details]
Console output with st DEBUG enabled, 2.6.27.10, fixed tape block size 512k
Ah sorry typo, I'll of course try with 'try_rdio=0 try_wdio=0'. Allright! You hit something on the head! With 'try_rdio=0 try_wdio=0' as parameters to the st module, writing to tape from amanda works correctly in 2.6.30.3. I'll attach the console log. Created attachment 23826 [details]
Console output with st DEBUG enabled, 2.6.30.3, fixed tape block size 512k, DIO turned off (see comment)
Reply-To: fujita.tomonori@lab.ntt.co.jp On Wed, 18 Nov 2009 16:53:53 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 > > > > > > --- Comment #15 from Joachim Breuer <linux-kernel@jmbreuer.net> 2009-11-18 > 16:53:53 --- > Allright! You hit something on the head! > > With 'try_rdio=0 try_wdio=0' as parameters to the st module, writing to tape > from amanda works correctly in 2.6.30.3. Thanks a lot for the logs. They are very helpful. I think that I found one bug about dio. There might be other bugs that cause this issue though. Can you try the patch that I've just sent. Thanks, Created attachment 23877 [details]
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19
The patch you sent changes the behavior: I still get the EIO after (in this case) 239 I/O requests while writing to the tape; but the machine no longer hard-crashes with a BUG.
[I've removed the try_rdio=0/try_wdio=0 module parameters for this test.]
Attached is the console log as usual with my comments interspersed; if I read this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128." direct I/O is used at least once before the problem occurs.
On Mon, 23 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 > > > > > > --- Comment #18 from Joachim Breuer <linux-kernel@jmbreuer.net> 2009-11-23 > 12:10:04 --- > Created an attachment (id=23877) > --> (http://bugzilla.kernel.org/attachment.cgi?id=23877) > Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA > Tomonori > 2009-11-19 > > The patch you sent changes the behavior: I still get the EIO after (in this > case) 239 I/O requests while writing to the tape; but the machine no longer > hard-crashes with a BUG. > > [I've removed the try_rdio=0/try_wdio=0 module parameters for this test.] > > Attached is the console log as usual with my comments interspersed; if I read > this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128." > direct I/O is used at least once before the problem occurs. > Are you sure that you have used the patched st? The log does not show the the debugging output from Fujita's patch. It also shows the problem occurring before the patch: st0: Block size: 524288, buffer size: 516096 (0 blocks). Kai Reply-To: fujita.tomonori@lab.ntt.co.jp On Mon, 23 Nov 2009 12:10:06 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 > > > > > > --- Comment #18 from Joachim Breuer <linux-kernel@jmbreuer.net> 2009-11-23 > 12:10:04 --- > Created an attachment (id=23877) > --> (http://bugzilla.kernel.org/attachment.cgi?id=23877) > Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA > Tomonori > 2009-11-19 > > The patch you sent changes the behavior: I still get the EIO after (in this > case) 239 I/O requests while writing to the tape; but the machine no longer > hard-crashes with a BUG. > > [I've removed the try_rdio=0/try_wdio=0 module parameters for this test.] > > Attached is the console log as usual with my comments interspersed; if I read > this correctly "st0: Number of r/w requests 239, dio used in 1, pages 128." > direct I/O is used at least once before the problem occurs. As Kai said, it doesn't look like a kernel log with the patch applied. With the patch, the kernel becomes more verbose. Can you check again? Thanks, Created attachment 23929 [details]
Console output, 2.6.30.3, fixed tape block size 512k, patch by FUJITA Tomonori 2009-11-19, really this time
Sorry about that. Insufficient sleep, insufficient coffee :-/
Attached is the log when really, actually, using FUJITA Tomonori's patch from 2009-11-19.
The two-fold good news:
a) amflush worked correctly, no EIO, no BUG
I'll still read this tape back in for verification and post a quick comment about the result.
b) DIO apparently was used: "st0: Number of r/w requests 38867, dio used in 138, pages 17664."
OK, reading back the tape from #21 yields no verify errors; it seems writing with FUJITA Tomonori's patch works correctly for me. Reply-To: fujita.tomonori@lab.ntt.co.jp On Wed, 25 Nov 2009 16:51:25 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14563 > > > > > > --- Comment #22 from Joachim Breuer <linux-kernel@jmbreuer.net> 2009-11-25 > 16:51:21 --- > OK, reading back the tape from #21 yields no verify errors; it seems writing > with FUJITA Tomonori's patch works correctly for me. Thanks a lot! And really sorry about the bug. |
Created attachment 23702 [details] BUG console capture from 2.6.30.3 while writing to tape Overview: Performing tape backups using amanda 2.6.0-p2 [using 512kB block size] leads to spurious 'EIO' errors when writing to the tape; most of the time a kernel BUG is hit shortly thereafter. Steps to reproduce: Using amanda 2.6.0-p2 to perform tape backups to an HP C5683A DDS-4 tape drive; connected to a Adaptec 29160 Ultra160 SCSI adapter / aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs. Actual results: Amanda ALWAYS (on affected kernels, see below) complains about an EIO while writing to the tape a few (< 1000) MB into the tape; most of the time a kernel BUG is displayed and the system becomes unresponsive. See attachments BUG-*.txt for a selection. Unfortunately, I could not save readable BUG output from 2.6.31.5 due to a management console problem; if there is great demand I can certainly reproduce the BUG output with working console capture. Expected results: Amanda should correctly write to tape during backup as it does using kernels <= 2.6.27.10. Affected versions: The oldest kernel I tried which DOES exhibit the problem is 2.6.29.4; other kernels exhibiting this problem are 2.6.30.3 and 2.6.31.5. Non-affected versions: A distribution-patched kernel based on version 2.6.23 [2.6.23-gentoo-r8] used to work correctly since February 2008; the newest kernel I tried which did not exhibit this problem is 2.6.27.10. I did not try any 2.6.28 kernel. Additional information: See lshw-2.6.27.10.txt for a complete hardware overview; and dmesg-2.6.27.10.txt/dmesg-2.6.29.4.txt for dmesg output on a working/non-working kernel, respectively.