Bug 5880

Summary: Transcend CF pcmcia card no longer works
Product: IO/Storage Reporter: Pavel Machek (pavel)
Component: IDEAssignee: Bartlomiej Zolnierkiewicz (bzolnier)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, alan, axboe, linux, protasnb
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.13-rc7-2.6.15-git Subsystem:
Regression: --- Bisected commit-id:

Description Pavel Machek 2006-01-13 04:09:41 UTC
Most recent kernel where this bug did not occur: It already occured in
2.6.13-rc7. Card definitely worked with 2.4, I think it worked with older 2.6's,
too.
Distribution: debian
Hardware Environment: i386
Software Environment:
Problem Description: Insert CF card. Two cards are detected:
pccard: PCMCIA card inserted into slot 1
cs: memory probe 0xe8000000-0xefffffff: excluding 0xe8000000-0xefffffff
cs: memory probe 0xc0200000-0xcfffffff: excluding 0xc0200000-0xc11fffff
0xc1a00000-0xc61fffff 0xc6a00000-0xc71fffff 0xc7a00000-0xc81fffff
0xc8a00000-0xc91fffff 0xc9a00000-0xca1fffff 0xcaa00000-0xcb1fffff
0xcba00000-0xcc1fffff 0xcca00000-0xcd1fffff 0xcda00000-0xce1fffff
0xcea00000-0xcf1fffff 0xcfa00000-0xd01fffff
pcmcia: registering new device pcmcia1.0
Probing IDE interface ide2...
hde: Transcend, CFA DISK drive
hdf: probing with STATUS(0x50) instead of ALTSTATUS(0x0a)
, ATA DISK drive
ide2 at 0x4100-0x4107,0x410e on irq 7
hde: max request size: 128KiB
hde: 503808 sectors (257 MB) w/0KiB Cache, CHS=984/16/32
 hde: hde1
hdf: max request size: 128KiB
hdf: 131584 sectors (67 MB) w/1028KiB Cache, CHS=2056/8/8
 hdf:<6> hde:<4>hdf: lost interrupt
hdf: lost interrupt
hdf: lost interrupt
hdf: lost interrupt
hdf: lost interrupt
pavel@amd:~$

...but only hde is real. Mounting /dev/hde1 hangs. I think it will oops when I
yank the card. If so, I'll attach it as a comment.
Comment 1 Pavel Machek 2006-01-13 04:13:47 UTC
...after removing the card (while mount is running) I got:

hdf: irq timeout: status=0xff { Busy }
ide: failed opcode was: unknown

then machine froze for ~2 seconds.

ide2: reset timed-out, status=0xff
hde: status timeout: status=0xff { Busy }
ide: failed opcode was: unknown
hde: drive not ready for command
ipw2200: Firmware error detected.  Restarting.
ipw2200: Sysfs 'error' log captured.

...and freezes keep repeating.



Comment 2 Pavel Machek 2006-01-13 04:17:53 UTC
I got even nice oops from block layer in the end. System is trashed at this
point, with disk accesses blocking.

end_request: I/O error, dev hde, sector 0
Buffer I/O error on device hde, logical block 0
end_request: I/O error, dev hdf, sector 0
Buffer I/O error on device hdf, logical block 0
end_request: I/O error, dev hdf, sector 0
Buffer I/O error on device hdf, logical block 0
ldm_validate_partition_table(): Disk read failed.
Unable to handle kernel NULL pointer dereference at virtual address 00000844
 printing eip:
c0262321
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: ipw2200
CPU:    0
EIP:    0060:[<c0262321>]    Not tainted VLI
EFLAGS: 00010046   (2.6.15-git_dirty)
EIP is at as_find_arq_hash+0x21/0xc0
eax: 00000844   ebx: c0263c20   ecx: 00000000   edx: 00000000
esi: f7c5e0ac   edi: 00000008   ebp: 00000844   esp: f45c9ae4
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 15819, threadinfo=f45c8000 task=f4c78a50)
Stack: 00000000 00000000 c0263c20 f7c5e0ac 00000008 00000000 c0263c63 f7c06cc8
       f44902c8 00000000 00000000 f7e49840 f45c9b84 f44902c0 00000001 c0263c20
       f7c5e0ac f7e49840 f45862c0 c0259ff5 f7cc18c0 f45c9b80 f45c9b84 00000000
Call Trace:
 [<c0263c20>] as_merge+0x0/0x1f0
 [<c0263c63>] as_merge+0x43/0x1f0
 [<c0263c20>] as_merge+0x0/0x1f0
 [<c0259ff5>] elv_merge+0x45/0x120
 [<c025e15f>] __make_request+0x13f/0x3b0
 [<c025e551>] generic_make_request+0xc1/0x170
 [<c011ffb6>] try_to_wake_up+0x76/0x3e0
 [<c0122019>] __wake_up_common+0x39/0x70
 [<c025e64f>] submit_bio+0x4f/0xf0
 [<c05c0150>] schedule+0x3d0/0x7b0
 [<c016fc60>] bio_alloc_bioset+0x120/0x1d0
 [<c016c240>] end_buffer_async_read+0x0/0x1e0
 [<c016f3e9>] submit_bh+0xe9/0x150
 [<c016e26e>] block_read_full_page+0x2ee/0x340
 [<c0172630>] blkdev_get_block+0x0/0x90
 [<c014b1a0>] add_to_page_cache+0x70/0xe0
 [<c0153428>] lru_cache_add+0x48/0x80
 [<c014cbb3>] read_cache_page+0x93/0x140
 [<c0172800>] blkdev_readpage+0x0/0x10
 [<c01a3d8a>] read_dev_sector+0x3a/0xc0
 [<c01a6a79>] msdos_partition+0x49/0x370
 [<c01a4fde>] ldm_validate_partition_table+0x7e/0xc0
 [<c01a6586>] ldm_partition+0x36/0x200
 [<c01a3541>] disk_name+0xd1/0xf0
 [<c01a365f>] check_partition+0x9f/0x110
 [<c01a3c5d>] rescan_partitions+0x9d/0x190
 [<c03ae530>] task_no_data_intr+0x0/0xa0
 [<c017328e>] do_open+0x2de/0x3a0
 [<c01733bc>] blkdev_get+0x6c/0x90
 [<c0173106>] do_open+0x156/0x3a0
 [<c0173405>] blkdev_open+0x25/0x70
 [<c0169612>] __dentry_open+0xe2/0x290
 [<c01698d5>] nameidata_to_filp+0x35/0x40
 [<c01733e0>] blkdev_open+0x0/0x70
 [<c01697fb>] filp_open+0x3b/0x50
 [<c05c1c0d>] _spin_unlock+0xd/0x30
 [<c0169a14>] get_unused_fd+0xb4/0xe0
 [<c0169b61>] do_sys_open+0x41/0xd0
 [<c0103341>] syscall_call+0x7/0xb
Code: 00 00 00 8d bc 27 00 00 00 00 55 57 56 53 83 ec 08 89 14 24 0f ac ca 03 89
4c 24 04 69 d2 01 00 37 9e 8b 40 34 c1 ea 1a 8d 2c d0 <8b> 7d 00 8d b6 00 00 00
00 8d bf 00 00 00 00 39 fd 74 4c 8d 5f
 <6>note: mount[15819] exited with preempt_count 1
ipw2200: Firmware error detected.  Restarting.
ipw2200: Sysfs 'error' log already exists.
2.6. kernels use pcmciamtd instead of memory_cs.c and do not require special
MTD handling any more.

...I hope system is in good enough shape to submit report...
Comment 3 Pavel Machek 2006-01-13 04:31:59 UTC
Jens, any ideas? Is there way to add Dominik into Cc? [Or who does PCMCIA these
days? MAINTAINERS says "PCMCIA team" :-(]
Comment 4 Bartlomiej Zolnierkiewicz 2006-01-13 05:01:43 UTC
> hde: Transcend, CFA DISK drive
> hdf: probing with STATUS(0x50) instead of ALTSTATUS(0x0a), ATA DISK drive

We handle undecoded slave in ide-probe.c:ide_undecoded_slave() but it 
seems like a slightly different version of the problem (slave returns 
different identify data?).

I'm adding Alan to cc: since he contributed original code and knows the most 
about the issue.  I'm also adding Dominik.  I hope that they don't mind it . :)
Comment 5 Dominik Brodowski 2006-01-13 05:15:29 UTC
Do you use pcmcia-cs or pcmciautils? Could you grab latest pcmciautils from
git://git.kernel.org/pub/scm/linux/pcmcia/pcmciautils.git , please, build and
install it and post the output of "lspcmica -vv"? 
Comment 6 Alan 2006-01-13 06:04:43 UTC
The dup detector looks for drives where the slave isn't decoded. You appear to
have random noise as the slave instead. Probably needs some specific logic for
the transcend pcmcia ident to be single channel.
Comment 7 Bartlomiej Zolnierkiewicz 2006-01-13 06:10:21 UTC
> Card definitely worked with 2.4, I think it worked with older 2.6's,

Could you verify this before we start adding Transcend specific hacks?
Comment 8 Andrew Morton 2006-01-13 21:41:24 UTC
(Adding bugzilla back to cc)

Tejun Heo <htejun@gmail.com> wrote:
>
> Nick Piggin wrote:
> > Tejun Heo wrote:
> > 
> >> Andrew Morton wrote:
> >>> Code: 00 00 00 8d bc 27 00 00 00 00 55 57 56 53 83 ec 08 89 14 24 0f 
> >>> ac ca 03 89
> >>> 4c 24 04 69 d2 01 00 37 9e 8b 40 34 c1 ea 1a 8d 2c d0 <8b> 7d 00 8d 
> >>> b6 00 00 00
> >>> 00 8d bf 00 00 00 00 39 fd 74 4c 8d 5f
> >>
> >>
> >>
> >> Hello,
> >>
> >> Disassemblies of the faulting and releated previous instruction are
> >>
> >> %eax: address of as_data (the first argument @ad)
> >> %edx: ad->hash index calculated from @offset
> >>
> >> 8b4034    mov (eax,0x34), eax    : %eax = ad->hash
> >> ...
> >> 8d2cd0    lea (%eax,%edx,8), ebp    : %ebp = &ad->hash[%edx]
> >> 8b7d00    mov (ebp), edi        : *%ebp, boom!
> >>
> >> So, the problem is that ad->hash contains invalid address.  It seems 
> > 
> > 
> > Well, ad itself is wrong, as I suspected.
> > 
> >> that the queue got destroyed after ldm validation failure and somehow 
> >> a request got submitted to the queue after it.
> >>
> > 
> > Somehow because the partition checking is moving on test for msdos 
> > partition,
> > which is what caused the oops... The disk is probably not ldm, but even 
> > if it
> > were, all the reads are failing anyway so that point is irrelevant.
> > 
> >> I'll look further into queue deallocation path.
> >>
> > 
> > I'm not sure that it has been deallocated, rather something has overwritten
> > it, no?
> 
> The ldm validation or failure seems irrelevant.  From the original bug 
> report.
> 
> =====
> 
> ------- Additional Comment #1 From Pavel Machek  2006-01-13 04:13 ------
> 
> ...after *removing* the card (while mount is running) I got:
> 
> hdf: irq timeout: status=0xff { Busy }
> ide: failed opcode was: unknown
> 
> then machine froze for ~2 seconds.
> 
> ide2: reset timed-out, status=0xff
> hde: status timeout: status=0xff { Busy }
> ide: failed opcode was: unknown
> hde: drive not ready for command
> ipw2200: Firmware error detected.  Restarting.
> ipw2200: Sysfs 'error' log captured.
> 
> ...and freezes keep repeating.
> 
> 
> ------- Additional Comment #2 From Pavel Machek 2006-01-13 04:17 -------
> 
> I got even nice oops from block layer in the end. System is trashed at this
> point, with disk accesses blocking.
> 
> =====
> 
> So, it seems we're looking at premature queue deallocation bug during 
> IDE unplugging.
> 
> -- 
> tejun

Comment 9 Anonymous Emailer 2006-01-13 22:04:05 UTC
Reply-To: nickpiggin@yahoo.com.au

Andrew Morton wrote:
> (Adding bugzilla back to cc)
> 
> Tejun Heo <htejun@gmail.com> wrote:
> 

>>>I'm not sure that it has been deallocated, rather something has overwritten
>>>it, no?
>>
>>The ldm validation or failure seems irrelevant.  From the original bug 
>>report.
>>

Looks like you're right.

It was just odd that the ldm check worked and the msdos one
oopsed; both attempt to read the same block in the same manner.

Perhaps it may be that the queue was freed long before then, and
only just gets reused for something else and overwritten at this
point.

slab debugging might help?

Comment 10 Pavel Machek 2006-01-14 11:18:19 UTC
Strange. I turned on CONFIG_DEBUG_SLAB, and it does not oops any more.

Also... I probably found out why it seemed to work for me in past. This is x32,
it has PCMCIA slot and CF slot. In CF slot, card behaves as I described. In
PCMCIA slot, it works better:

pcmcia: registering new device pcmcia0.0
Probing IDE interface ide2...
hde: Transcend, CFA DISK drive
hdf: probing with STATUS(0x50) instead of ALTSTATUS(0x0a)
hdf: probing with STATUS(0x00) instead of ALTSTATUS(0x0a)

...hdf is not created and no problems.

It is strange, because controllers for PCMCIA and CF should be identical...

0000:02:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
0000:02:00.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
Comment 11 Anonymous Emailer 2006-01-16 00:34:42 UTC
Reply-To: htejun@gmail.com

Nick Piggin wrote:
> Andrew Morton wrote:
> 
>> (Adding bugzilla back to cc)
>>
>> Tejun Heo <htejun@gmail.com> wrote:
>>
> 
>>>> I'm not sure that it has been deallocated, rather something has 
>>>> overwritten
>>>> it, no?
>>>
>>>
>>> The ldm validation or failure seems irrelevant.  From the original 
>>> bug report.
>>>
> 
> Looks like you're right.
> 
> It was just odd that the ldm check worked and the msdos one
> oopsed; both attempt to read the same block in the same manner.
> 
> Perhaps it may be that the queue was freed long before then, and
> only just gets reused for something else and overwritten at this
> point.
> 
> slab debugging might help?
> 

[CC'ing Bartlomiej and linux-ide]

Hello, Bartlomiej.

This thread started by Andrew forwarding bugzilla #5880 to as-iosched 
related people as the machine oopsed in anticipatory scheduler.  After 
diagnosis, it seems to be premature device release problem of IDE 
driver.  The original bug entry is

http://bugzilla.kernel.org/show_bug.cgi?id=5880

The original bug is about CF device unuseable.  The oops occurs after 
yanking out the CF device during partition check is in progress (all are 
failing due to IO error).

I've glanced through the IDE code and the problem seems to be in 
ide_disk_get().  This function doesn't get the IDE generic device 
(ide_drive_t->gendev) and so if a device is yanked out while it's still 
open, the gendev doesn't wait for the open fd's to be closed.  This 
results in dangling open block device and thus above oops.  I'm not very 
sure about this analysis and currently don't have access to 
hot-unpluggable IDE devices.

Can you please confirm this?

Thanks.

Comment 12 Bartlomiej Zolnierkiewicz 2006-01-16 01:44:23 UTC
On 1/16/06, Tejun Heo <htejun@gmail.com> wrote:
> Nick Piggin wrote:
> > Andrew Morton wrote:
> >
> >> (Adding bugzilla back to cc)
> >>
> >> Tejun Heo <htejun@gmail.com> wrote:
> >>
> >
> >>>> I'm not sure that it has been deallocated, rather something has
> >>>> overwritten
> >>>> it, no?
> >>>
> >>>
> >>> The ldm validation or failure seems irrelevant.  From the original
> >>> bug report.
> >>>
> >
> > Looks like you're right.
> >
> > It was just odd that the ldm check worked and the msdos one
> > oopsed; both attempt to read the same block in the same manner.
> >
> > Perhaps it may be that the queue was freed long before then, and
> > only just gets reused for something else and overwritten at this
> > point.
> >
> > slab debugging might help?
> >
>
> [CC'ing Bartlomiej and linux-ide]
>
> Hello, Bartlomiej.
>
> This thread started by Andrew forwarding bugzilla #5880 to as-iosched
> related people as the machine oopsed in anticipatory scheduler.  After
> diagnosis, it seems to be premature device release problem of IDE
> driver.  The original bug entry is
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5880
>
> The original bug is about CF device unuseable.  The oops occurs after
> yanking out the CF device during partition check is in progress (all are
> failing due to IO error).
>
> I've glanced through the IDE code and the problem seems to be in
> ide_disk_get().  This function doesn't get the IDE generic device
> (ide_drive_t->gendev) and so if a device is yanked out while it's still
> open, the gendev doesn't wait for the open fd's to be closed.  This
> results in dangling open block device and thus above oops.  I'm not very
> sure about this analysis and currently don't have access to
> hot-unpluggable IDE devices.
>
> Can you please confirm this?

Hi Tejun,

Confirmed, we need to get/put generic device in ide_disk_{get,put}()
similarly like drivers/scsi/sd.c driver does it (the same also applies to
ide-{cd,floppy,tape,scsi} drivers).

I would happily apply a patch fixing it (*hint*, *hint*).


Now back to the original problem ("ghost" slave device):

Pavel, is there any way to recognize this PCMCIA IDE controller
in ide-cs driver so we could tell IDE subsystem not to probe slave
devices?

Thanks,
Bartlomiej

Comment 13 Anonymous Emailer 2006-01-18 05:21:21 UTC
Reply-To: pavel@ucw.cz

> ------- Additional Comments From bzolnier@gmail.com  2006-01-16 01:44 -------
> 
> Pavel, is there any way to recognize this PCMCIA IDE controller
> in ide-cs driver so we could tell IDE subsystem not to probe slave
> devices?

Well, anything inserted into PCMCIA is *very* unlikely to have slave
devices. Plus PCMCIA devices have names; we could use that.

Comment 14 Alan 2006-01-18 06:05:43 UTC
Unlikely yes, but they exist and most hardware works properly anyway. In fact
you are the first case I've seen since the undecoded slave fix killed the other
cases.

In actual fact you can get up to 4 disk PCMCIA ATA controllers (2 master/2 slave) !
Comment 15 Pavel Machek 2006-02-04 13:53:27 UTC
I'm trying to run hdparm -i /dev/hde, but it hung. I can try again in another
pcmcia slot. Would that info be useful?
Comment 16 Pavel Machek 2006-02-04 14:24:36 UTC
root@amd:/home/pavel# hdparm -i /dev/hde

/dev/hde:

 Model=Transcend, FwRev=2N3-0925, SerialNo=TS256MFLASHCP 00000
 Config={ HardSect NotMFM Removeable DTR>10Mbs nonMagnetic }
 RawCHS=984/16/32, TrkSize=2048, SectSize=512, ECCbytes=4
 BuffType=1Sect, BuffSize=0kB, MaxMultSect=1, MultSect=off
 CurCHS=984/16/32, CurSects=503808, LBA=yes, LBAsects=503808
 IORDY=no
 PIO modes:  pio0 pio1 pio2
 AdvancedPM=no

 * signifies the current active mode


...any other info that would be useful?
Comment 17 Natalie Protasevich 2007-07-08 12:04:33 UTC
Any updates on this problem?
Thanks.
Comment 18 Bartlomiej Zolnierkiewicz 2008-02-13 09:07:51 UTC
Closing for now (please re-open if it still happens with recent kernels).
Comment 19 Anonymous Emailer 2008-02-18 05:53:14 UTC
Reply-To: pavel@ucw.cz

Hi!

> http://bugzilla.kernel.org/show_bug.cgi?id=5880
> 
> 
> bzolnier@gmail.com changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |REJECTED
>          Resolution|                            |INSUFFICIENT_DATA
> 
> 
> 
> 
> ------- Comment #18 from bzolnier@gmail.com  2008-02-13 09:07 -------
> Closing for now (please re-open if it still happens with recent kernels).

It seems to work ok with 2.6.25-rc2. Thanks!

Feb 18 14:52:19 amd kernel: PM: Adding info for pcmcia:0.0
Feb 18 14:52:19 amd kernel: Probing IDE interface ide0...
Feb 18 14:52:20 amd kernel: hda: Transcend, CFA DISK drive
Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:ide0
Feb 18 14:52:20 amd kernel: ide0 at 0x9100-0x9107,0x910e on irq 16
Feb 18 14:52:20 amd kernel: PM: Adding info for ide:0.0
Feb 18 14:52:20 amd kernel: hda: max request size: 128KiB
Feb 18 14:52:20 amd kernel: hda: 503808 sectors (257 MB) w/0KiB Cache,
CHS=984/16/32
Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:hda
Feb 18 14:52:20 amd kernel:  hda: hda1
Feb 18 14:52:20 amd kernel: PM: Adding info for No Bus:hda1
Feb 18 14:52:20 amd cardmgr[1375]: socket 0: ATA/IDE Fixed Disk
Feb 18 14:52:20 amd kernel: ide-cs: hda: Vpp = 0.0
							Pavel
Comment 20 Bartlomiej Zolnierkiewicz 2008-02-18 08:02:01 UTC
Thanks Pavel.