Bug 5117 - Panic when accessing scsi-tapedrives with 4G-remap
Summary: Panic when accessing scsi-tapedrives with 4G-remap
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: i386 Linux
Importance: P2 blocking
Assignee: Mike Anderson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-23 12:53 UTC by Johan van Baarlen
Modified: 2008-03-10 20:42 UTC

See Also:
Kernel Version: 2.6.12.5
Subsystem:
Regression: ---
Bisected commit-id:



Description Johan van Baarlen 2005-08-23 12:53:34 UTC
Distro: FC3 x86_64
Kernel: mainstream 2.6.12.2, 2.6.12.3, 2.6.12.5
Hardware: Asus A8N-E (bios 1006), Athlon64-X2 4400+, 4Gbyte PC3200 ram
Video: PCI-Express NVidia GF6200. Module nvidia.ko taints kernel, but its 
presence has no influence on this bug (verified by using a clean kernel without 
nvidia-module)
Scsi-controllers: 53c875, 53c895 (one or two installed makes no difference)
Scsi-devices: DLT7000 (on 895-internal), DLT4000 (on 875-internal), DLT4000 (on 
875-external)

As long as I do not enable the PCI address-space remapping in the BIOS, 
everything is fine: the tape drives run at full speed (well, the old DLTs aren't 
really quick, but they do run at >95% of their rated raw speeds) and survive 
writing/rewinding/reading the same tape for days without interruption (total 
transfer >1.5Tbyte of data).
But then I only have 3Gigs of memory available, which is a bit of a waste, and a 
lot of apps are touching swapspace because of the missing 1G.

By enabling the bios-options to remap memory (software and hardware remap), 
linux can see and use (almost) 4Gbyte: 
Memory: 4038248k/5242880k available (2341k kernel code, 154876k reserved, 1166k 
data, 212k init). Apparently the memoryhole for PCI- and PCI-express cards on 
this board is a full gigabyte, and this 'missing' 1G is added on top of 4G. 
Works like a charm, stable as a rock, until I try to use the tapedrives: 
sometimes the driver just locks up on first access (requiring a reboot to 
revive it), sometimes it works fine for a few minutes, but I can (almost 
always) get the kernel to break down immediately with a very simple command:

To reproduce: dd if=/dev/zero bs=256k count=1000 > /dev/st0
The system doesn't even have time to log the errors.
'star' gives similar errors, only later - output below is from star from disk 
to DLT7000/53c895 with 256k blocksize. Happens on DLT4000 with 53c875 as well, 
also with external DLT4000 on 53c895. 
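
For reference, the two figures in the boot-time 'Memory:' line quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; reading the second figure as the top of the physical address range is an assumption based on the description of the 1G hole being added on top of 4G, not something confirmed in this report.

```python
# Figures copied from the "Memory:" line in the report (values in kB).
KIB_PER_GIB = 1024 * 1024

available_kib = 4038248   # memory the kernel reports as available
span_kib = 5242880        # second figure; assumed to be the top of physical memory


def kib_to_gib(kib):
    """Convert a kB figure from the boot log to GiB."""
    return kib / KIB_PER_GIB


span_gib = kib_to_gib(span_kib)
available_gib = kib_to_gib(available_kib)
above_4g_gib = span_gib - 4.0   # portion of the range that lies above 4 GiB

print(f"address span : {span_gib:.2f} GiB")
print(f"available    : {available_gib:.2f} GiB")
print(f"above 4 GiB  : {above_4g_gib:.2f} GiB")
```

Under that assumption the span comes out at exactly 5 GiB, which fits a 1 GiB PCI hole with the displaced memory remapped above the 4 GiB boundary.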

Results from syslog, repeated many times, always ending in a kernel panic or 
hard lockup within several minutes. I didn't capture the panics, but the 
'normal' repetitive error goes like this:
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039706e8)
flags:0x2000000c mapping:ffff810132201570 mapcount:2 count:0
Call Trace:
<ffffffff80131360>{default_wake_function+0} 
<ffffffff80160a00>{bad_page+112}
<ffffffff801611e5>{free_hot_cold_page+133} 
<ffffffff8801a9af>{:st:sgl_unmap_user_pages+79}
<ffffffff8801a9db>{:st:release_buffering+27} 
<ffffffff8801f97e>{:st:st_write+2718}
<ffffffff8017ce65>{vfs_write+229} 
<ffffffff8017cfc3>{sys_write+83}
<ffffffff8010eb36>{system_call+126} 
Trying to fix it up, but a reboot is needed

Repeating a lot of times until the system locks (often) or panics (sometimes) 
or recovers from panic by rebooting (happened twice).

The 'page' keeps increasing, but 'mapping' stays the same. After a reboot, same 
errors are logged, but different 'mapping' (constant) and 'page' (increasing). 
Not sure if this is helpful, but anyway:
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4000)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4038)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4070)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40a8)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40e0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4118)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4150)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4188)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e41c0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e41f8)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4230)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4268)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e42a0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e42d8)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4310)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4348)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4380)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e43b8)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e43f0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4428)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4460)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4498)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e44d0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4508)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4540)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4578)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e45b0)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e45e8)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4620)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4658)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4690)
flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
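
The observation that 'page' keeps increasing can be made more precise by computing the step between consecutive values. The sketch below parses the first four lines of the excerpt above; the helper names are made up for illustration. A constant step would suggest that adjacent page descriptors are being released one after another, but that interpretation is an assumption and has not been confirmed by anyone in this report.

```python
import re

# First four "Bad page state" lines copied from the syslog excerpt above.
LOG_EXCERPT = """\
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4000)
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4038)
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4070)
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40a8)
"""

PAGE_RE = re.compile(r"page ([0-9a-f]{16})\)")


def page_addresses(log_text):
    """Return the 'page' values from the log text as integers, in order."""
    return [int(m.group(1), 16) for m in PAGE_RE.finditer(log_text)]


def strides(addresses):
    """Return the differences between consecutive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


addrs = page_addresses(LOG_EXCERPT)
deltas = strides(addrs)
print(sorted(set(deltas)))  # prints [56], i.e. a constant step of 0x38 bytes
```

Pasting the full excerpt into LOG_EXCERPT gives the same check over all logged lines.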




I have tried a lot of combinations already, but it makes no difference: one 
controller installed, both controllers installed, different PCI-slots, etc.
The blocksize used seems to make a difference: dd'ing 256kbyte blocks crashes 
instantly, while 64k blocks seem to make it through.
sym53c8xx has different options for DMA addressing (32bit, 40/48bit, 64bit); I 
tried them all, no difference, and went back to the default 40/48bit.
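
To narrow down the blocksize dependence, something like the following could be used to step through several block sizes, issuing one write() call per block in the same way dd does. This is a minimal sketch: the function name is hypothetical, and the demo writes to a scratch file so that it is safe to run anywhere. On the affected machine the path would be the tape device (/dev/st0 in this report). Only 64k and 256k were actually tried here, so the behaviour at intermediate sizes is unknown.

```python
import os
import tempfile


def write_zero_blocks(path, block_size, count):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`.

    Each block is issued as a single write() call. Returns the total
    number of bytes the kernel reported as written.
    """
    buf = bytes(block_size)
    fd = os.open(path, os.O_WRONLY)
    total = 0
    try:
        for _ in range(count):
            total += os.write(fd, buf)
    finally:
        os.close(fd)
    return total


# Demo against a scratch file; replace `scratch` with the tape device
# path to run the same sweep against a drive.
fd, scratch = tempfile.mkstemp()
os.close(fd)
try:
    for size_kib in (64, 128, 256):
        written = write_zero_blocks(scratch, size_kib * 1024, 4)
        print(f"{size_kib}k blocks: wrote {written} bytes")
finally:
    os.remove(scratch)
```

Starting at the size known to work and increasing from there would show at which block size the errors first appear.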

From the above syslog dump, the problem seems to be in the SCSI tape driver 
'st', but since this is so widely used it is very hard to imagine a bug in that 
module. 
It only occurs with the 4G-remapping enabled, so I am getting the feeling that 
this is going to be a nasty combination to find. Looks like something is 
fouling the release of a (series of) buffers, but this requires a big dive into 
a section of the code for which I really don't have enough experience.


Please help, this one is driving me crazy - the machine needs to go online in 
about a week, and I'd really like to find and fix the source of these problems 
so that it can be operational with the full 4G of ram available, and all 
tapedrives singing.
Comment 1 Andrew Morton 2005-08-25 17:25:50 UTC

Begin forwarded message:

Date: Tue, 23 Aug 2005 12:53:38 -0700
From: bugme-daemon@kernel-bugs.osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap


 http://bugzilla.kernel.org/show_bug.cgi?id=5117

            Summary: Panic when accessing scsi-tapedrives with 4G-remap
     Kernel Version: 2.6.12.5


Could this purely be a highmem problem?   Is the zero-copy DMA feature of
st.c known to work OK with x86 highmem?

Comment 2 Anonymous Emailer 2005-08-25 19:22:36 UTC
Reply-To: osst@riede.org

On 08/25/2005 08:24:17 PM, Andrew Morton wrote:
> 
> Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives  
> with 4G-remap
> 
> 
>  http://bugzilla.kernel.org/show_bug.cgi?id=5117
> 
>             Summary: Panic when accessing scsi-tapedrives with 4G-remap
>      Kernel Version: 2.6.12.5
> 
> 
> Could this purely be a highmem problem?   Is the zero-copy DMA feature of
> st.c known to work OK with x86 highmem?

I don't know about Kai, but I don't have the hardware to test that...

Sorry, Willem Riede.


Comment 3 Anonymous Emailer 2005-08-26 09:33:30 UTC
Reply-To: Kai.Makisara@kolumbus.fi

On Thu, 25 Aug 2005, Andrew Morton wrote:

> 
> 
> Begin forwarded message:
> 
> Date: Tue, 23 Aug 2005 12:53:38 -0700
> From: bugme-daemon@kernel-bugs.osdl.org
> To: bugme-new@lists.osdl.org
> Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap
> 
> 
>  http://bugzilla.kernel.org/show_bug.cgi?id=5117
> 
>             Summary: Panic when accessing scsi-tapedrives with 4G-remap
>      Kernel Version: 2.6.12.5
> 
> 
> Could this purely be a highmem problem?   Is the zero-copy DMA feature of
> st.c known to work OK with x86 highmem?

It is _not_ known _not_ to work ;-) I.e., I have received neither any 
success nor any failure reports. I have not tested it because I don't have 
any machine with enough memory (and I have not hacked a kernel to use 
highmem with 512 MB of memory). I hope someone seeing this thread and 
using highmem with tape can comment on this subject.

Comment 4 Anonymous Emailer 2005-08-28 03:35:27 UTC
Reply-To: Kai.Makisara@kolumbus.fi

On Fri, 26 Aug 2005, Kai Makisara wrote:

> On Thu, 25 Aug 2005, Andrew Morton wrote:
> 
> > 
> > 
> > Begin forwarded message:
> > 
> > Date: Tue, 23 Aug 2005 12:53:38 -0700
> > From: bugme-daemon@kernel-bugs.osdl.org
> > To: bugme-new@lists.osdl.org
> > Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap
> > 
> > 
> >  http://bugzilla.kernel.org/show_bug.cgi?id=5117
> > 
> >             Summary: Panic when accessing scsi-tapedrives with 4G-remap
> >      Kernel Version: 2.6.12.5
> > 
> > 
> > Could this purely be a highmem problem?   Is the zero-copy DMA feature of
> > st.c known to work OK with x86 highmem?
> 
> It is _not_ known _not_ to work ;-) I.e., I have received neither any 
> success nor any failure reports. I have not tested it because I don't have 
> any machine with enough memory (and I have not hacked a kernel to use 
> highmem with 512 MB of memory). I hope someone seeing this thread and 
> using highmem with tape can comment on this subject.
> 
OK. I booted my test i386 machine with highmem=384m and did some tests. I 
also added a counter to st.c to count the highmem pages used for zero-copy 
DMA. I could not get dd to use highmem but with tar that succeeded. No 
extra messages were found in syslog during these tests.

BUT, at the same time I remembered that the system in the Bugzilla report 
was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.

Comment 5 Anonymous Emailer 2005-08-28 05:23:35 UTC
Reply-To: osst@riede.org

On 08/28/2005 06:40:04 AM, Kai Makisara wrote:
> On Fri, 26 Aug 2005, Kai Makisara wrote:
> 
> > On Thu, 25 Aug 2005, Andrew Morton wrote:
> >
> > > Could this purely be a highmem problem? Is the zero-copy DMA feature of
> > > st.c known to work OK with x86 highmem?
> >
> > It is _not_ known _not_ to work ;-) I.e., I have received neither any
> > success nor any failure reports. I have not tested it because I don't have
> 
> > any machine with enough memory (and I have not hacked a kernel to use
> > highmem with 512 MB of memory). I hope someone seeing this thread and
> > using highmem with tape can comment on this subject.
> >
> OK. I booted my test i386 machine with highmem=384m and did some tests. I
> also added a counter to st.c to count the highmem pages used for zero-copy
> DMA. I could not get dd to use highmem but with tar that succeeded. No
> extra messages were found in syslog during these tests.
> 
> BUT, at the same time I remembered that the system in the Bugzilla report
> was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.

True. I do have a machine running FC4 x86_64, but it doesn't have 4G of  
memory, so it doesn't do the 4G-remap that the bug report refers to.

Don't know if there is any way to make it do so (other than buy more memory).

Regards, Willem Riede.


Comment 6 Anonymous Emailer 2005-08-28 06:21:11 UTC
Reply-To: arjan@infradead.org


> > OK. I booted my test i386 machine with highmem=384m and did some tests. I
> > also added a counter to st.c to count the highmem pages used for zero-copy
> > DMA. I could not get dd to use highmem but with tar that succeeded. No
> > extra messages were found in syslog during these tests.
> > 
> > BUT, at the same time I remembered that the system in the Bugzilla report
> > was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.
> 
> True. I do have a machine running FC4 x86_64, but it doesn't have 4G of  
> memory, so it doesn't do the 4G-remap that the bug report refers to.
> 
> Don't know if there is any way to make it do so (other than buy more memory).


note that some (tyan) amd64 motherboards have a really broken bios wrt
remap, and the only way to get those systems stable is to disable the
memory remap in that bios. If you don't, all kinds of funky stuff can and
does happen.


Comment 7 Johan van Baarlen 2005-08-29 00:10:34 UTC
Please don't get me started on broken bioses... I had a lot of fun with 4G on an 
Asus A8V (Via K8T800): without remap, 3.5G is available and everything works; 
with remap, all PCI-cards and onboard devices stop functioning. Nice.
The A8N is much better, at least all onboard devices work without a hitch :-) 

But back to the topic at hand: I've just flashed it to 1008-bios, and you can 
probably guess it did make a difference - now I can't access anything on the 
second scsi-controller anymore. cat /proc/scsi/scsi shows all devices, but as 
soon as I try to do mt -f /dev/st1 status (or /dev/st2), the driver seems to 
lock up and only a full reboot gets it going again. 

Without remap, everything still works fine. Later today I'm gonna try some 
other controllers and initially only test one at a time, to see if I can 
pinpoint the troublemaker a bit more precisely.
Comment 8 Johan van Baarlen 2005-10-01 03:14:07 UTC
Been a while, but we're up to 2.6.13.2 by now... Didn't see a panic on this 
kernel, but something is definitely not right. The driver locks up completely 
way too often: doing a simple 'mt -f /dev/st0 status' right after a warm or 
cold reboot is enough to send it into non-responsive hell, no way to break, no 
error messages, no nothing. The rest of the system is running, I can still log 
on, access network, disks, etcetera, just no responses on any of the controllers.
Happens without the 4G-remap too, so probably a slightly different issue, but 
still a very bad one. Does anybody have a clue how to start debugging this? 
This renders the system unusable, which is a very bad thing.
If more info is required, PLEASE tell me how to produce it so there is at least 
a glimpse of hope of resolving this matter.
Comment 9 Johan van Baarlen 2005-10-13 06:33:21 UTC
2.6.13.3, still happening... This is getting critical, I need that machine 
stable! Anybody got a clue?
Comment 10 Matthew Wilcox 2005-10-13 08:36:53 UTC
As far as I am aware this really isn't a sym2 problem.
Reassigning to tape
Comment 11 Natalie Protasevich 2007-10-11 00:31:51 UTC
What is the current state of this problem? Any update, please?
Thanks.
Comment 12 Natalie Protasevich 2008-03-10 20:42:56 UTC
Closing the bug, no recent activity. Johan, if you still have the problem with recent kernels, please reopen.
