Bug 5880
Summary: | Transcend CF pcmcia card no longer works | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Pavel Machek (pavel) |
Component: | IDE | Assignee: | Bartlomiej Zolnierkiewicz (bzolnier) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | akpm, alan, axboe, linux, protasnb |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13-rc7-2.6.15-git | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Pavel Machek
2006-01-13 04:09:41 UTC
...after removing the card (while mount is running) I got: hdf: irq timeout: status=0xff { Busy } ide: failed opcode was: unknown then machine froze for ~2 seconds. ide2: reset timed-out, status=0xff hde: status timeout: status=0xff { Busy } ide: failed opcode was: unknown hde: drive not ready for command ipw2200: Firmware error detected. Restarting. ipw2200: Sysfs 'error' log captured. ...and freezes keep repeating. I got even nice oops from block layer in the end. System is trashed at this point, with disk accesses blocking. end_request: I/O error, dev hde, sector 0 Buffer I/O error on device hde, logical block 0 end_request: I/O error, dev hdf, sector 0 Buffer I/O error on device hdf, logical block 0 end_request: I/O error, dev hdf, sector 0 Buffer I/O error on device hdf, logical block 0 ldm_validate_partition_table(): Disk read failed. Unable to handle kernel NULL pointer dereference at virtual address 00000844 printing eip: c0262321 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: ipw2200 CPU: 0 EIP: 0060:[<c0262321>] Not tainted VLI EFLAGS: 00010046 (2.6.15-git_dirty) EIP is at as_find_arq_hash+0x21/0xc0 eax: 00000844 ebx: c0263c20 ecx: 00000000 edx: 00000000 esi: f7c5e0ac edi: 00000008 ebp: 00000844 esp: f45c9ae4 ds: 007b es: 007b ss: 0068 Process mount (pid: 15819, threadinfo=f45c8000 task=f4c78a50) Stack: 00000000 00000000 c0263c20 f7c5e0ac 00000008 00000000 c0263c63 f7c06cc8 f44902c8 00000000 00000000 f7e49840 f45c9b84 f44902c0 00000001 c0263c20 f7c5e0ac f7e49840 f45862c0 c0259ff5 f7cc18c0 f45c9b80 f45c9b84 00000000 Call Trace: [<c0263c20>] as_merge+0x0/0x1f0 [<c0263c63>] as_merge+0x43/0x1f0 [<c0263c20>] as_merge+0x0/0x1f0 [<c0259ff5>] elv_merge+0x45/0x120 [<c025e15f>] __make_request+0x13f/0x3b0 [<c025e551>] generic_make_request+0xc1/0x170 [<c011ffb6>] try_to_wake_up+0x76/0x3e0 [<c0122019>] __wake_up_common+0x39/0x70 [<c025e64f>] submit_bio+0x4f/0xf0 [<c05c0150>] schedule+0x3d0/0x7b0 [<c016fc60>] bio_alloc_bioset+0x120/0x1d0 [<c016c240>] end_buffer_async_read+0x0/0x1e0 [<c016f3e9>] submit_bh+0xe9/0x150 [<c016e26e>] block_read_full_page+0x2ee/0x340 [<c0172630>] blkdev_get_block+0x0/0x90 [<c014b1a0>] add_to_page_cache+0x70/0xe0 [<c0153428>] lru_cache_add+0x48/0x80 [<c014cbb3>] read_cache_page+0x93/0x140 [<c0172800>] blkdev_readpage+0x0/0x10 [<c01a3d8a>] read_dev_sector+0x3a/0xc0 [<c01a6a79>] msdos_partition+0x49/0x370 [<c01a4fde>] ldm_validate_partition_table+0x7e/0xc0 [<c01a6586>] ldm_partition+0x36/0x200 [<c01a3541>] disk_name+0xd1/0xf0 [<c01a365f>] check_partition+0x9f/0x110 [<c01a3c5d>] rescan_partitions+0x9d/0x190 [<c03ae530>] task_no_data_intr+0x0/0xa0 [<c017328e>] do_open+0x2de/0x3a0 [<c01733bc>] blkdev_get+0x6c/0x90 [<c0173106>] do_open+0x156/0x3a0 [<c0173405>] blkdev_open+0x25/0x70 [<c0169612>] __dentry_open+0xe2/0x290 [<c01698d5>] nameidata_to_filp+0x35/0x40 [<c01733e0>] blkdev_open+0x0/0x70 [<c01697fb>] filp_open+0x3b/0x50 [<c05c1c0d>] _spin_unlock+0xd/0x30 [<c0169a14>] get_unused_fd+0xb4/0xe0 [<c0169b61>] do_sys_open+0x41/0xd0 [<c0103341>] syscall_call+0x7/0xb Code: 00 00 00 8d bc 27 00 00 00 00 55 57 56 53 83 ec 08 89 14 24 0f ac ca 03 89 4c 24 04 69 d2 01 00 37 9e 8b 40 34 c1 ea 1a 8d 2c d0 <8b> 7d 00 8d b6 00 00 00 00 8d bf 00 00 00 00 39 fd 74 4c 8d 5f <6>note: mount[15819] exited with preempt_count 1 ipw2200: Firmware error detected. Restarting. ipw2200: Sysfs 'error' log already exists. 2.6. kernels use pcmciamtd instead of memory_cs.c and do not require special MTD handling any more. ...I hope system is in good enough shape to submit report... Jens, any ideas? Is there way to add Dominik into Cc? [Or who does PCMCIA these days? MAINTAINERS says "PCMCIA team" :-(] > hde: Transcend, CFA DISK drive
> hdf: probing with STATUS(0x50) instead of ALTSTATUS(0x0a), ATA DISK drive
We handle undecoded slave in ide-probe.c:ide_undecoded_slave() but it
seems like a slightly different version of the problem (slave returns
different identify data?).
I'm adding Alan to cc: since he contributed original code and knows the most
about the issue. I'm also adding Dominik. I hope that they don't mind it . :)
Do you use pcmcia-cs or pcmciautils? Could you grab latest pcmciautils from git://git.kernel.org/pub/scm/linux/pcmcia/pcmciautils.git , please, build and install it and post the output of "lspcmica -vv"? The dup detector looks for drives where the slave isn't decoded. You appear to have random noise as the slave instead. Probably needs some specific logic for the transcend pcmcia ident to be single channel. > Card definitely worked with 2.4, I think it worked with older 2.6's,
Could you verify this before we start adding Transcend specific hacks?
(Adding bugzilla back to cc) Tejun Heo <htejun@gmail.com> wrote: > > Nick Piggin wrote: > > Tejun Heo wrote: > > > >> Andrew Morton wrote: > >>> Code: 00 00 00 8d bc 27 00 00 00 00 55 57 56 53 83 ec 08 89 14 24 0f > >>> ac ca 03 89 > >>> 4c 24 04 69 d2 01 00 37 9e 8b 40 34 c1 ea 1a 8d 2c d0 <8b> 7d 00 8d > >>> b6 00 00 00 > >>> 00 8d bf 00 00 00 00 39 fd 74 4c 8d 5f > >> > >> > >> > >> Hello, > >> > >> Disassemblies of the faulting and releated previous instruction are > >> > >> %eax: address of as_data (the first argument @ad) > >> %edx: ad->hash index calculated from @offset > >> > >> 8b4034 mov (eax,0x34), eax : %eax = ad->hash > >> ... > >> 8d2cd0 lea (%eax,%edx,8), ebp : %ebp = &ad->hash[%edx] > >> 8b7d00 mov (ebp), edi : *%ebp, boom! > >> > >> So, the problem is that ad->hash contains invalid address. It seems > > > > > > Well, ad itself is wrong, as I suspected. > > > >> that the queue got destroyed after ldm validation failure and somehow > >> a request got submitted to the queue after it. > >> > > > > Somehow because the partition checking is moving on test for msdos > > partition, > > which is what caused the oops... The disk is probably not ldm, but even > > if it > > were, all the reads are failing anyway so that point is irrelevant. > > > >> I'll look further into queue deallocation path. > >> > > > > I'm not sure that it has been deallocated, rather something has overwritten > > it, no? > > The ldm validation or failure seems irrelevant. From the original bug > report. > > ===== > > ------- Additional Comment #1 From Pavel Machek 2006-01-13 04:13 ------ > > ...after *removing* the card (while mount is running) I got: > > hdf: irq timeout: status=0xff { Busy } > ide: failed opcode was: unknown > > then machine froze for ~2 seconds. > > ide2: reset timed-out, status=0xff > hde: status timeout: status=0xff { Busy } > ide: failed opcode was: unknown > hde: drive not ready for command > ipw2200: Firmware error detected. Restarting. > ipw2200: Sysfs 'error' log captured. > > ...and freezes keep repeating. > > > ------- Additional Comment #2 From Pavel Machek 2006-01-13 04:17 ------- > > I got even nice oops from block layer in the end. System is trashed at this > point, with disk accesses blocking. > > ===== > > So, it seems we're looking at premature queue deallocation bug during > IDE unplugging. > > -- > tejun Reply-To: nickpiggin@yahoo.com.au Andrew Morton wrote: > (Adding bugzilla back to cc) > > Tejun Heo <htejun@gmail.com> wrote: > >>>I'm not sure that it has been deallocated, rather something has overwritten >>>it, no? >> >>The ldm validation or failure seems irrelevant. From the original bug >>report. >> Looks like you're right. It was just odd that the ldm check worked and the msdos one oopsed; both attempt to read the same block in the same manner. Perhaps it may be that the queue was freed long before then, and only just gets reused for something else and overwritten at this point. slab debugging might help? Strange. I turned on CONFIG_DEBUG_SLAB, and it does not oops any more. Also... I probably found out why it seemed to work for me in past. This is x32, it has PCMCIA slot and CF slot. In CF slot, card behaves as I described. In PCMCIA slot, it works better: pcmcia: registering new device pcmcia0.0 Probing IDE interface ide2... hde: Transcend, CFA DISK drive hdf: probing with STATUS(0x50) instead of ALTSTATUS(0x0a) hdf: probing with STATUS(0x00) instead of ALTSTATUS(0x0a) ...hdf is not created and no problems. It is strange, because controllers for PCMCIA and CF should be identical... 0000:02:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa) 0000:02:00.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa) Reply-To: htejun@gmail.com Nick Piggin wrote: > Andrew Morton wrote: > >> (Adding bugzilla back to cc) >> >> Tejun Heo <htejun@gmail.com> wrote: >> > >>>> I'm not sure that it has been deallocated, rather something has >>>> overwritten >>>> it, no? >>> >>> >>> The ldm validation or failure seems irrelevant. From the original >>> bug report. >>> > > Looks like you're right. > > It was just odd that the ldm check worked and the msdos one > oopsed; both attempt to read the same block in the same manner. > > Perhaps it may be that the queue was freed long before then, and > only just gets reused for something else and overwritten at this > point. > > slab debugging might help? > [CC'ing Bartlomiej and linux-ide] Hello, Bartlomiej. This thread started by Andrew forwarding bugzilla #5880 to as-iosched related people as the machine oopsed in anticipatory scheduler. After diagnosis, it seems to be premature device release problem of IDE driver. The original bug entry is http://bugzilla.kernel.org/show_bug.cgi?id=5880 The original bug is about CF device unuseable. The oops occurs after yanking out the CF device during partition check is in progress (all are failing due to IO error). I've glanced through the IDE code and the problem seems to be in ide_disk_get(). This function doesn't get the IDE generic device (ide_drive_t->gendev) and so if a device is yanked out while it's still open, the gendev doesn't wait for the open fd's to be closed. This results in dangling open block device and thus above oops. I'm not very sure about this analysis and currently don't have access to hot-unpluggable IDE devices. Can you please confirm this? Thanks. On 1/16/06, Tejun Heo <htejun@gmail.com> wrote: > Nick Piggin wrote: > > Andrew Morton wrote: > > > >> (Adding bugzilla back to cc) > >> > >> Tejun Heo <htejun@gmail.com> wrote: > >> > > > >>>> I'm not sure that it has been deallocated, rather something has > >>>> overwritten > >>>> it, no? > >>> > >>> > >>> The ldm validation or failure seems irrelevant. From the original > >>> bug report. > >>> > > > > Looks like you're right. > > > > It was just odd that the ldm check worked and the msdos one > > oopsed; both attempt to read the same block in the same manner. > > > > Perhaps it may be that the queue was freed long before then, and > > only just gets reused for something else and overwritten at this > > point. > > > > slab debugging might help? > > > > [CC'ing Bartlomiej and linux-ide] > > Hello, Bartlomiej. > > This thread started by Andrew forwarding bugzilla #5880 to as-iosched > related people as the machine oopsed in anticipatory scheduler. After > diagnosis, it seems to be premature device release problem of IDE > driver. The original bug entry is > > http://bugzilla.kernel.org/show_bug.cgi?id=5880 > > The original bug is about CF device unuseable. The oops occurs after > yanking out the CF device during partition check is in progress (all are > failing due to IO error). > > I've glanced through the IDE code and the problem seems to be in > ide_disk_get(). This function doesn't get the IDE generic device > (ide_drive_t->gendev) and so if a device is yanked out while it's still > open, the gendev doesn't wait for the open fd's to be closed. This > results in dangling open block device and thus above oops. I'm not very > sure about this analysis and currently don't have access to > hot-unpluggable IDE devices. > > Can you please confirm this? Hi Tejun, Confirmed, we need to get/put generic device in ide_disk_{get,put}() similarly like drivers/scsi/sd.c driver does it (the same also applies to ide-{cd,floppy,tape,scsi} drivers). I would happily apply a patch fixing it (*hint*, *hint*). Now back to the original problem ("ghost" slave device): Pavel, is there any way to recognize this PCMCIA IDE controller in ide-cs driver so we could tell IDE subsystem not to probe slave devices? Thanks, Bartlomiej Reply-To: pavel@ucw.cz > ------- Additional Comments From bzolnier@gmail.com 2006-01-16 01:44 ------- > > Pavel, is there any way to recognize this PCMCIA IDE controller > in ide-cs driver so we could tell IDE subsystem not to probe slave > devices? Well, anything inserted into PCMCIA is *very* unlikely to have slave devices. Plus PCMCIA devices have names; we could use that. Unlikely yes, but they exist and most hardware works properly anyway. In fact you are the first case I've seen since the undecoded slave fix killed the other cases. In actual fact you can get up to 4 disk PCMCIA ATA controllers (2 master/2 slave) ! I'm trying to run hdparm -i /dev/hde, but it hung. I can try again in another pcmcia slot. Would that info be useful? root@amd:/home/pavel# hdparm -i /dev/hde /dev/hde: Model=Transcend, FwRev=2N3-0925, SerialNo=TS256MFLASHCP 00000 Config={ HardSect NotMFM Removeable DTR>10Mbs nonMagnetic } RawCHS=984/16/32, TrkSize=2048, SectSize=512, ECCbytes=4 BuffType=1Sect, BuffSize=0kB, MaxMultSect=1, MultSect=off CurCHS=984/16/32, CurSects=503808, LBA=yes, LBAsects=503808 IORDY=no PIO modes: pio0 pio1 pio2 AdvancedPM=no * signifies the current active mode ...any other info that would be useful? Any updates on this problem? Thanks. Closing for now (please re-open if it still happens with recent kernels). Reply-To: pavel@ucw.cz Hi! > http://bugzilla.kernel.org/show_bug.cgi?id=5880 > > > bzolnier@gmail.com changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |REJECTED > Resolution| |INSUFFICIENT_DATA > > > > > ------- Comment #18 from bzolnier@gmail.com 2008-02-13 09:07 ------- > Closing for now (please re-open if it still happens with recent kernels). It seems to work ok with 2.6.25-rc2. Thanks! Feb 18 14:52:19 amd kernel: PM: Adding info for pcmcia:0.0 Feb 18 14:52:19 amd kernel: Probing IDE interface ide0... Feb 18 14:52:20 amd kernel: hda: Transcend, CFA DISK drive Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:ide0 Feb 18 14:52:20 amd kernel: ide0 at 0x9100-0x9107,0x910e on irq 16 Feb 18 14:52:20 amd kernel: PM: Adding info for ide:0.0 Feb 18 14:52:20 amd kernel: hda: max request size: 128KiB Feb 18 14:52:20 amd kernel: hda: 503808 sectors (257 MB) w/0KiB Cache, CHS=984/16/32 Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:hda Feb 18 14:52:20 amd kernel: hda: hda1 Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:hda1 Feb 18 14:52:20 amd cardmgr[1375]: socket 0: ATA/IDE Fixed Disk Feb 18 14:52:20 amd kernel: ide-cs: hda: Vpp = 0.0 Pavel Thanks Pavel. |