Bug 13029 - media revalidation bug on USB card reader
Summary: media revalidation bug on USB card reader
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-04-06 23:04 UTC by David Zeuthen
Modified: 2011-05-30 07:59 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.38
Subsystem:
Regression: No
Bisected commit-id:


Attachments
cdrom_open-always-check-events.patch (2.32 KB, patch)
2011-04-06 10:02 UTC, Tejun Heo
Details | Diff
blkdev-get-rescan.patch (2.85 KB, patch)
2011-04-06 11:48 UTC, Tejun Heo
Details | Diff

Description David Zeuthen 2009-04-06 23:04:03 UTC
Hi,

I have a Belkin card reader [1] where media revalidation appears to be broken. The kernel will happily return ENOMEDIUM when opening the /dev/sdb with O_RDONLY but only when running 'touch /dev/sdb' the partition goes away.

I would think that if the kernel knows enough to return ENOMEDIUM for /dev/sdb it should be able to also infer that the partitions from /dev/sdb1 are no more.

It is also very weird because I only see this behavior on the Belkin device, not on other USB card readers.

This is on 2.6.29.1-46.fc11.x86_64 from Fedora Rawhide.

Here is the detailed log. First we kill all polling

[root@x61 1-3]# killall hald-addon-storage

Then we listen on uevents

[root@x61 1-3]# udevadm monitor
monitor will print the received events for:
UDEV the event which udev sends out after rule processing
UEVENT the kernel uevent

# insert CF card and run 'touch /dev/sdb' in another shell

UEVENT[1239057973.774859] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UEVENT[1239057973.774932] add      /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb/sdb1 (block)
UEVENT[1239057973.778183] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UEVENT[1239057973.814998] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0 (scsi)
UDEV  [1239057973.962835] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UDEV  [1239057974.094378] add      /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb/sdb1 (block)
UDEV  [1239057974.265656] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UEVENT[1239057974.269085] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UDEV  [1239057974.269574] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0 (scsi)
UDEV  [1239057974.437891] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)

# Now we remove CF card - note that since user space don't poll for media changes nothing happens

# [...]

# Now we run 'blockdev --rereadpt /dev/sdb'. It fails with
#
# /dev/sdb: No medium found
#
# The only interaction with the device that happens is that it
# does open("/dev/sdb", O_RDONLY) and which fails with ENOMEDIUM
#
# However the following uevents occur

UEVENT[1239058028.675551] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0 (scsi)
UEVENT[1239058028.686676] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UDEV  [1239058028.686744] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0 (scsi)
UDEV  [1239058028.765815] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)

# Note that we didn't get a remove event for /dev/sdb1. It is still around
# both in sysfs and /dev
#

# run 'touch /dev/sdb' to get the remove event.
#

UEVENT[1239058055.909464] remove   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb/sdb1 (block)
UDEV  [1239058055.921046] remove   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb/sdb1 (block)
UEVENT[1239058055.947384] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)
UDEV  [1239058056.029241] change   /devices/pci0000:00/0000:00:1a.7/usb1/1-3/1-3:1.0/host19/target19:0:0/19:0:0:0/block/sdb (block)

# Essentially touch just does
#
# open("/dev/sdb", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
# dup2(3, 0)                              = 0
# close(3)                                = 0
# utimensat(0, NULL, NULL, 0)             = 0
# close(0)                                = 0

[1] :

# pwd
/sys/devices/pci0000:00/0000:00:1a.7/usb1/1-3

# cat idProduct idVendor manufacturer product 
0248
050d
BELKIN
USB 2 Flash Media Device
Comment 1 Alan 2009-04-08 09:33:48 UTC
Touch intentionally avoids blocking or testing things like media state (note its use of O_NONBLOCK)
Comment 2 David Zeuthen 2009-04-08 20:24:42 UTC
Hi Alan,

(In reply to comment #1)
> Touch intentionally avoids blocking or testing things like media state (note
> its use of O_NONBLOCK)

I'm not sure what you are saying here as comment 0 shows that it is the touch call (with O_NONBLOCK) that actually makes the kernel remove sdb1 while earlier calls open() calls (without O_NONBLOCK) does not trigger this (but makes the kernel return ENOMEDIUM).

What baffles me is that in order for the kernel to return ENOMEDIUM it must know there is no medium. Yet it refuses to remove the partitions on the medium until O_NONBLOCK is used. How do you explain that not being buggy behavior?

Of course we can easily work around this in user space by just essentially running 'touch /dev/sdb' when we get ENOMEDIUM but at the end of the day I still think that is a work around.
Comment 3 Alan 2009-04-08 20:37:02 UTC
ah Ok I misunderstood the original report

With O_NDELAY we go direct to the device level status
Without it we go via open_for_data() which checks the drive_status and does other stuff as well

so its definitely not sane unless something else is holding the partition open and this is a timing funny.

Can you rebuild drivers/cdrom/cdrom.c with the ERRLOGMASK define near the top set to log everything as a first step, then see what that shows up.
Comment 4 David Zeuthen 2009-04-08 20:51:50 UTC
(In reply to comment #3)
> Can you rebuild drivers/cdrom/cdrom.c with the ERRLOGMASK define near the top
> set to log everything as a first step, then see what that shows up.

Sure, but the bug only happens with a USB 4in1 card reader driven by usb-storage so maybe I need to tweak another define than in cdrom.c?
Comment 5 Alan 2009-04-08 21:20:00 UTC
The sd case is a bit trickier but if you stick a few printk() calls into sd.c : sd_open/sd_close and sd_revalidate_disk so you can see the events occuring maybe that will produce a clue.
Comment 6 Martin Pitt 2011-04-05 15:48:27 UTC
I just stumbled over this as well. We currently try to remove the polling in udisks to make use of the new in-kernel media polling facility (https://lkml.org/lkml/2010/12/8/299).

My SD card reader (which is actually a Sony PRS e-book reader) behaves exactly like David's. When I eject the card from the reader, no uevents happen until I poke the drive with an open():

$ sudo head -c 1 /dev/sdd
head: cannot open `/dev/sdd' for reading: No medium found

KERNEL[1302016466.795990] change   /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.5/1-1.5.3/1-1.5.3.3/1-1.5.3.3:1.0/host19/target19:0:0/19:0:0:2/block/sdd (block)
UDEV_LOG=3
ACTION=change
DISK_MEDIA_CHANGE=1

Note there is no removal event for /dev/sdd1, and the device is still around:

$ ls /dev/sdd*
/dev/sdd /dev/sdd1

 * The current udisks poller works around this by doing the poke again with O_NONBLOCK; when doing this, sdd1 actually goes away:

$ sudo python -c 'import os; os.open("/dev/sdd", os.O_RDONLY|os.O_NONBLOCK)'

KERNEL[1302017728.645882] remove   /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.5/1-1.5.3/1-1.5.3.3/1-1.5.3.3:1.0/host21/target21:0:0/21:0:0:2/block/sdd/sdd1 (block)
ACTION=remove
DEVTYPE=partition

This behaves exactly the same when enabling the in-kernel polling in /sys/block/sdd/events_poll_msecs; when doing that, the "head" command (which is by and large doing open() without O_NONBLOCK here) does not need to be run, but the last python O_NONBLOCK poke is still required.

Could the in-kernel poller do the equivalent of this when a DISK_MEDIA_CHANGE uevent is generated, i. e. take into account that partitions might have been removed/added? I. e. the equivalent of what opening the block device with O_NONBLOCK does?

If this could be fixed in the in-kernel poller, we could remove the userspace workaround to open(O_NONBLOCK) the device on a DISK_MEDIA_CHANGE event.
Comment 7 Kay Sievers 2011-04-05 16:45:53 UTC
Tejun, the issue with the in-kernel polling and media re-validation seems like an old bug -- would be still nice to fix that, so userspace does not need to work around that weird behavior. Any ideas?
Comment 8 Tejun Heo 2011-04-06 10:02:16 UTC
Created attachment 53632 [details]
cdrom_open-always-check-events.patch

Can you please verify the attached patch fixes the problem?  Thanks.
Comment 9 Tejun Heo 2011-04-06 11:48:53 UTC
Created attachment 53652 [details]
blkdev-get-rescan.patch

For some reason I thought this report was about cdrom.  We actually have two layers of problems here.  For cdrom, the event check itself doesn't happen if medimum is not there.  For other devices, that part happens but partition rescan doesn't.  This patch should fix the latter.  The patch is only compile tested.  Please test whether it fixes the problem.

Thanks.
Comment 10 Kay Sievers 2011-04-06 12:10:09 UTC
On Wed, Apr 6, 2011 at 13:48,  <bugzilla-daemon@bugzilla.kernel.org> wrote:

> --- Comment #9 from Tejun Heo <tj@kernel.org>  2011-04-06 11:48:53 ---
> Created an attachment (id=53652)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=53652)
> blkdev-get-rescan.patch
>
> For some reason I thought this report was about cdrom.  We actually have two
> layers of problems here.  For cdrom, the event check itself doesn't happen if
> medimum is not there.  For other devices, that part happens but partition
> rescan doesn't.  This patch should fix the latter.  The patch is only compile
> tested.  Please test whether it fixes the problem.

That fixes the issue for me, and works without weird added additional
open(O_NONBLOCK) workarounds.

Thanks!
Comment 11 Kay Sievers 2011-04-06 12:17:37 UTC
On Wed, Apr 6, 2011 at 14:10,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> That fixes the issue for me, and works without weird added additional
> open(O_NONBLOCK) workarounds.

Feel free to add:
  Tested-By: Kay Sievers <kay.sievers@vrfy.org>

to all 3 patches. I'm running them all here with the in-development
version of udisks, that switched completely from userspace to kernel
polling.
Comment 12 Florian Mickler 2011-05-19 07:21:56 UTC
A patch referencing this bug report has been merged in v2.6.39:

commit 02e352287a40bd456eb78df705bf888bc3161d3f
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Apr 29 10:15:20 2011 +0200

    block: rescan partitions on invalidated devices on -ENOMEDIA too
Comment 13 Florian Mickler 2011-05-19 07:30:44 UTC
So I close this bug as fixed assuming it fixes this also for David, the original reporter.
Comment 14 Florian Mickler 2011-05-30 07:59:03 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit 1196f8b814f32cd04df334abf47648c2a9fd8324
Author: Tejun Heo <tj@kernel.org>
Date:   Thu Apr 21 20:54:45 2011 +0200

    block: rescan partitions on invalidated devices on -ENOMEDIA too

Note You need to log in before you can comment on or make changes to this bug.