Distro: FC3 x86_64
Kernel: mainline 2.6.12.2, 2.6.12.3, 2.6.12.5
Hardware: Asus A8N-E (BIOS 1006), Athlon64-X2 4400+, 4 GB PC3200 RAM
Video: PCI-Express NVidia GF6200. The nvidia.ko module taints the kernel, but its presence has no influence on this bug (verified with a clean kernel without the nvidia module).
SCSI controllers: 53c875, 53c895 (one or two installed makes no difference)
SCSI devices: DLT7000 (on 895 internal), DLT4000 (on 875 internal), DLT4000 (on 875 external)

As long as I do not enable the PCI address space remapping in the BIOS, everything is fine: the tape drives run at full speed (well, the old DLTs aren't really quick, but they do run at >95% of their rated raw speeds) and survive writing/rewinding/reading the same tape for days without interruption (total transfer >1.5 TB of data). But I only have 3 GB of memory available, which is a bit of a waste, and a lot of apps are touching swap space because of the missing 1 GB.

By enabling the BIOS options to remap memory (software and hardware remap), Linux can see and use (almost) 4 GB:

Memory: 4038248k/5242880k available (2341k kernel code, 154876k reserved, 1166k data, 212k init)

Apparently the memory hole for PCI and PCI-Express cards on this board is a full gigabyte, and this 'missing' 1 GB is added on top of 4 GB. Works like a charm, stable as a rock, until I try to use the tape drives: sometimes the driver just locks up on first access (requiring a reboot to revive it), sometimes it works fine for a few minutes, but I can (almost always) get the kernel to break down immediately with a very simple command.

To reproduce:

dd if=/dev/zero bs=256k count=1000 > /dev/st0

The system doesn't even have time to log the errors. 'star' gives similar errors, only later; the output below is from star writing from disk to the DLT7000/53c895 with a 256k block size. It happens on the DLT4000 with the 53c875 as well, and also with the external DLT4000 on the 53c895.
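As a quick sanity check of the boot-time "Memory:" line quoted above, the figures can be converted to GiB. This is only a sketch: it assumes the second figure is the top of the physical address range the kernel sees (not the amount of installed RAM), and the variable names are made up for illustration.

```python
# Figures copied from the "Memory:" line in the report, in KiB.
available_kib = 4038248   # memory the kernel reports as available
span_kib = 5242880        # assumed: top of the physical address range

KIB_PER_GIB = 1024 * 1024
INSTALLED_GIB = 4         # installed RAM according to the report

span_gib = span_kib / KIB_PER_GIB
available_gib = available_kib / KIB_PER_GIB
above_4g_gib = span_gib - INSTALLED_GIB   # part of the range lying above 4 GiB

print(f"address span:        {span_gib:.2f} GiB")
print(f"available to kernel: {available_gib:.2f} GiB")
print(f"range above 4 GiB:   {above_4g_gib:.2f} GiB")
```

Under that assumption the span comes out at exactly 5 GiB, which fits the description of a 1 GB hole being relocated on top of 4 GB.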
Results from syslog, repeated many times, always ending in a kernel panic or hard lockup within several minutes. I didn't capture the panics, but the 'normal' repetitive error goes like this:

Bad page state at free_hot_cold_page (in process 'star', page ffff8100039706e8)
flags:0x2000000c mapping:ffff810132201570 mapcount:2 count:0
Call Trace:
  <ffffffff80131360>{default_wake_function+0}
  <ffffffff80160a00>{bad_page+112}
  <ffffffff801611e5>{free_hot_cold_page+133}
  <ffffffff8801a9af>{:st:sgl_unmap_user_pages+79}
  <ffffffff8801a9db>{:st:release_buffering+27}
  <ffffffff8801f97e>{:st:st_write+2718}
  <ffffffff8017ce65>{vfs_write+229}
  <ffffffff8017cfc3>{sys_write+83}
  <ffffffff8010eb36>{system_call+126}
Trying to fix it up, but a reboot is needed

This repeats many times until the system locks up (often), panics (sometimes), or recovers from the panic by rebooting (happened twice). The 'page' value keeps increasing, but 'mapping' stays the same. After a reboot, the same errors are logged, but with a different 'mapping' (constant) and 'page' (increasing).
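The observation that 'page' keeps increasing while 'mapping' stays constant can be checked mechanically. The sketch below parses four lines copied from the longer dump in this report and prints the step between consecutive 'page' values; the regular expression and the function name are illustrative assumptions, not part of any kernel tool.

```python
import re

# Four consecutive lines copied from the syslog dump in this report.
LOG = """\
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4000) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4038) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4070) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40a8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
"""

# Captures the 'page' pointer and the 'mapping' pointer of each entry.
PATTERN = re.compile(
    r"page ([0-9a-f]{16})\)\s+flags:0x[0-9a-f]+\s+mapping:([0-9a-f]{16})"
)

def summarize(log_text):
    """Return (set of steps between consecutive 'page' values, set of 'mapping' values)."""
    matches = PATTERN.findall(log_text)
    pages = [int(page, 16) for page, _ in matches]
    mappings = {mapping for _, mapping in matches}
    strides = {b - a for a, b in zip(pages, pages[1:])}
    return strides, mappings

strides, mappings = summarize(LOG)
print("strides: ", sorted(hex(s) for s in strides))
print("mappings:", sorted(mappings))
```

For these lines the step is a constant 0x38 (56 bytes) and there is a single 'mapping' value. A constant step of that size would be consistent with adjacent page descriptors being released one after another, but that interpretation is an assumption and has not been checked against the kernel source for this build.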
Not sure if this is helpful, but anyway:

Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4000) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4038) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4070) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40a8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e40e0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4118) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4150) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4188) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e41c0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e41f8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4230) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4268) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e42a0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e42d8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4310) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4348) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4380) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e43b8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e43f0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4428) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4460) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4498) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e44d0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4508) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4540) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4578) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e45b0) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e45e8) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4620) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4658) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0
Bad page state at free_hot_cold_page (in process 'star', page ffff8100039e4690) flags:0x2000000c mapping:ffff8101383c5570 mapcount:2 count:0

I have tried a lot of combinations already, but it makes no difference: one controller installed, both controllers installed, different PCI slots, etc. The block size used does seem to make a difference: dd'ing 256 kbyte blocks crashes instantly, while 64k blocks seem to make it through. sym53c8xx has different options for DMA addressing (32-bit, 40/48-bit, 64-bit); I tried them all with no difference, and went back to the default 40/48-bit.

From the above syslog dump, the problem seems to be in the SCSI tape driver 'st', but since this is so widely used it is very hard to imagine a bug in that module. It only occurs with the 4G remapping enabled, so I am getting the feeling that this is going to be a nasty combination to find. It looks like something is fouling the release of a (series of) buffers, but this requires a big dive into a section of the code for which I really don't have enough experience.

Please help, this one is driving me crazy. The machine needs to go online in about a week, and I'd really like to find and fix the source of these problems so that it can be operational with the full 4 GB of RAM available, and all tape drives singing.
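Since 64k blocks seem to survive while 256k blocks crash at once, a sweep over block sizes might narrow down where the failure starts. The following is a rough sketch only: the function names are hypothetical, the intermediate sizes 128k and 192k are assumptions that were not tested in this report, and it relies on each write() call reaching the driver as one block of the given size. Like the dd command above, it overwrites whatever is on the tape.

```python
import os

def write_zero_blocks(path, block_size, count):
    """Write `count` zero-filled blocks of `block_size` bytes, one write() per block."""
    block = bytes(block_size)
    written = 0
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(count):
            written += os.write(fd, block)
    finally:
        os.close(fd)
    return written

def sweep(path, sizes_kib=(64, 128, 192, 256), count=1000):
    """Try each block size in turn, smallest first; return bytes written per size."""
    results = {}
    for size_kib in sizes_kib:
        results[size_kib] = write_zero_blocks(path, size_kib * 1024, count)
        print(f"{size_kib}k blocks: wrote {results[size_kib]} bytes")
    return results
```

On the affected machine this would be run as sweep("/dev/st0"), watching syslog for the first block size at which "Bad page state" appears. Going from small to large should leave some chance of logging the last size that worked before the system locks up.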
Begin forwarded message:

Date: Tue, 23 Aug 2005 12:53:38 -0700
From: bugme-daemon@kernel-bugs.osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap

http://bugzilla.kernel.org/show_bug.cgi?id=5117

Summary: Panic when accessing scsi-tapedrives with 4G-remap
Kernel Version: 2.6.12.5

Could this purely be a highmem problem? Is the zero-copy DMA feature of st.c known to work OK with x86 highmem?
Reply-To: osst@riede.org

On 08/25/2005 08:24:17 PM, Andrew Morton wrote:
>
> Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5117
>
> Summary: Panic when accessing scsi-tapedrives with 4G-remap
> Kernel Version: 2.6.12.5
>
> Could this purely be a highmem problem? Is the zero-copy DMA feature of
> st.c known to work OK with x86 highmem?

I don't know about Kai, but I don't have the hardware to test that...

Sorry, Willem Riede.
Reply-To: Kai.Makisara@kolumbus.fi

On Thu, 25 Aug 2005, Andrew Morton wrote:
>
> Begin forwarded message:
>
> Date: Tue, 23 Aug 2005 12:53:38 -0700
> From: bugme-daemon@kernel-bugs.osdl.org
> To: bugme-new@lists.osdl.org
> Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5117
>
> Summary: Panic when accessing scsi-tapedrives with 4G-remap
> Kernel Version: 2.6.12.5
>
> Could this purely be a highmem problem? Is the zero-copy DMA feature of
> st.c known to work OK with x86 highmem?

It is _not_ known _not_ to work ;-) I.e., I have received neither any success nor any failure reports. I have not tested it because I don't have any machine with enough memory (and I have not hacked a kernel to use highmem with 512 MB of memory). I hope someone seeing this thread and using highmem with tape can comment on this subject.
Reply-To: Kai.Makisara@kolumbus.fi

On Fri, 26 Aug 2005, Kai Makisara wrote:
> On Thu, 25 Aug 2005, Andrew Morton wrote:
>
> >
> > Begin forwarded message:
> >
> > Date: Tue, 23 Aug 2005 12:53:38 -0700
> > From: bugme-daemon@kernel-bugs.osdl.org
> > To: bugme-new@lists.osdl.org
> > Subject: [Bugme-new] [Bug 5117] New: Panic when accessing scsi-tapedrives with 4G-remap
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5117
> >
> > Summary: Panic when accessing scsi-tapedrives with 4G-remap
> > Kernel Version: 2.6.12.5
> >
> > Could this purely be a highmem problem? Is the zero-copy DMA feature of
> > st.c known to work OK with x86 highmem?
>
> It is _not_ known _not_ to work ;-) I.e., I have received neither any
> success nor any failure reports. I have not tested it because I don't have
> any machine with enough memory (and I have not hacked a kernel to use
> highmem with 512 MB of memory). I hope someone seeing this thread and
> using highmem with tape can comment on this subject.
>

OK. I booted my test i386 machine with highmem=384m and did some tests. I also added a counter to st.c to count the highmem pages used for zero-copy DMA. I could not get dd to use highmem but with tar that succeeded. No extra messages were found in syslog during these tests.

BUT, at the same time I remembered that the system in the Bugzilla report was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.
Reply-To: osst@riede.org

On 08/28/2005 06:40:04 AM, Kai Makisara wrote:
> On Fri, 26 Aug 2005, Kai Makisara wrote:
>
> > On Thu, 25 Aug 2005, Andrew Morton wrote:
> >
> > > Could this purely be a highmem problem? Is the zero-copy DMA feature of
> > > st.c known to work OK with x86 highmem?
> >
> > It is _not_ known _not_ to work ;-) I.e., I have received neither any
> > success nor any failure reports. I have not tested it because I don't have
> > any machine with enough memory (and I have not hacked a kernel to use
> > highmem with 512 MB of memory). I hope someone seeing this thread and
> > using highmem with tape can comment on this subject.
> >
> OK. I booted my test i386 machine with highmem=384m and did some tests. I
> also added a counter to st.c to count the highmem pages used for zero-copy
> DMA. I could not get dd to use highmem but with tar that succeeded. No
> extra messages were found in syslog during these tests.
>
> BUT, at the same time I remembered that the system in the Bugzilla report
> was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.

True. I do have a machine running FC4 x86_64, but it doesn't have 4G of memory, so it doesn't do the 4G-remap that the bug report refers to.

Don't know if there is any way to make it do so (other than buy more memory).

Regards, Willem Riede.
Reply-To: arjan@infradead.org

> > OK. I booted my test i386 machine with highmem=384m and did some tests. I
> > also added a counter to st.c to count the highmem pages used for zero-copy
> > DMA. I could not get dd to use highmem but with tar that succeeded. No
> > extra messages were found in syslog during these tests.
> >
> > BUT, at the same time I remembered that the system in the Bugzilla report
> > was Athlon64 running FC3 x86_64. The x86_64 kernel does not have highmem.
>
> True. I do have a machine running FC4 x86_64, but it doesn't have 4G of
> memory, so it doesn't do the 4G-remap that the bug report refers to.
>
> Don't know if there is any way to make it do so (other than buy more memory).

note that some (tyan) amd64 motherboards have a really broken bios wrt remap, and the only way to get those systems stable is to disable the memory remap in that bios. If you don't, all kinds of funky stuff can and does happen.
Please don't get me started on broken BIOSes... I had a lot of fun with 4G on an Asus A8V (Via K8T800): without remap, 3.5G is available and everything works; with remap, all PCI cards and onboard devices stop functioning. Nice. The A8N is much better, at least all onboard devices work without a hitch :-)

But back to the topic at hand: I've just flashed it to the 1008 BIOS, and you can probably guess it did make a difference. Now I can't access anything on the second SCSI controller anymore. cat /proc/scsi/scsi shows all devices, but as soon as I try to do mt -f /dev/st1 status (or /dev/st2), the driver seems to lock up and only a full reboot gets it going again. Without remap, everything still works fine.

Later today I'm going to try some other controllers and initially test only one at a time, to see if I can pinpoint the troublemaker a bit more precisely.
Been a while, but we're up to 2.6.13.2 by now... I didn't see a panic on this kernel, but something is definitely not right. The driver locks up completely way too often: doing a simple 'mt -f /dev/st0 status' right after a warm or cold reboot is enough to send it into non-responsive hell, with no way to break out, no error messages, no nothing. The rest of the system keeps running; I can still log on, access the network, disks, etcetera, there are just no responses on any of the controllers. It happens without the 3G-remap too, so it is probably a slightly different issue, but still a very bad one.

Does anybody have a clue how to start debugging this? This renders the system unusable, which is a very bad thing. If more info is required, PLEASE tell me how to produce it so there is at least a glimpse of resolving this matter.
2.6.13.3, still happening... This is getting critical, I need that machine stable! Anybody got a clue?
As far as I am aware this really isn't a sym2 problem. Reassigning to tape
What is the current state of this problem? Any update, please? Thanks.
Closing the bug, no recent activity. Johan, if you still have the problem with recent kernels, please reopen.