Bug 8588

Summary: The automatic partition scan upon block device registration causes I/O problems for other cluster nodes
Product: IO/Storage
Reporter: Tore Anderson (tore)
Component: Block Layer
Assignee: Jens Axboe (axboe)
Status: CLOSED CODE_FIX
Severity: low
CC: alan, devzero
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.20
Subsystem:
Regression: No
Bisected commit-id:

Description Tore Anderson 2007-06-05 12:00:32 UTC
Most recent kernel where this bug did *NOT* occur: None
Distribution: Debian, Ubuntu, RHEL (it applies to any distro I guess)
Hardware Environment: Most kinds of low or midrange networked storage arrays
Software Environment: Clusters using dm-multipath
Problem Description:
When a cluster is using a storage array with active/passive controllers that
moves volumes when it sees I/O arriving to a passive controller, the automatic
partition table scan that happens when a block device is registered often
causes a volume transfer, which in turn makes I/O to the formerly active
controller fail for some time (probably until the volume is completely
transferred away, after which I/O will just move it back again).  This applies
to many different hardware vendors like Engenio (OEM'ed by Sun/StorageTek, IBM,
probably others) and EMC.  In Engenio's case this "Automatic Volume Transfer"
mode is the only way to use it with Linux and dm-multipath, as its alternative
mode ("RDAC mode"), where volumes are only transferred after a specific SCSI
command is sent that explicitly requests the transfer, is one that
dm-multipath does not yet have a hardware handler for.

This is problematic for clusters where several machines have access to the same
volume.  Consider the case where machine Foo and machine Bar are happily using
a volume on such an array.  They're both using controller 2 for all I/O, due
to dm-multipath's nice path grouping and prioritising features, and all is
well.  At a later time, though, Foo is rebooted, and when it is starting up it
loads the fibre channel HBA driver, which finds, say, eight paths to each of
the two controllers.  It then proceeds to register these in the kernel as
block devices, and when the block layer goes on to scan for a partition table
it generates I/O to controller 1, prompting the array to move the volume there and
making Bar experience I/O errors.  When the array has completed the volume
movement and Bar's dm-multipath has been able to redirect I/O there, you can be
certain that Foo is well on its way to moving it back by registering another
block device that represents a path through controller 2.  This can go on for
quite some time, depending on the complexity of the SAN and the number of
volumes and hosts.  In any case it's bad: until it settles down there has been
very poor service, if any, and if dm-multipath has marked all paths as failed,
the VFS could have seen I/O errors and remounted filesystems read-only and so
on.  Pain.
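
The ping-pong described above can be illustrated with a toy model.  This is
only a sketch: the function name and the order in which Foo's paths are
registered are assumptions made for illustration, and the model ignores how
long a transfer takes and what Bar does in the meantime.

```python
# Toy model of an array in "Automatic Volume Transfer" mode: the volume moves
# to whichever controller receives I/O. All names here are hypothetical.

def count_transfers(owner, io_sequence):
    """Return (number_of_transfers, final_owner).

    owner:       controller that owns the volume before any I/O arrives
    io_sequence: controllers that receive I/O, in order (for example one
                 partition table read per newly registered path)
    """
    transfers = 0
    for controller in io_sequence:
        if controller != owner:
            # The array moves the volume; I/O through the previous owner
            # fails until the transfer has completed.
            owner = controller
            transfers += 1
    return transfers, owner


# Bar is using controller 2. Foo boots and its paths happen to be registered
# alternately through controller 1 and controller 2 (assumed order).
print(count_transfers(owner=2, io_sequence=[1, 2, 1, 2]))  # (4, 2)

# If the kernel does not scan the new paths at all, nothing moves.
print(count_transfers(owner=2, io_sequence=[]))            # (0, 2)
```

Every transfer in the first case is a window in which the other node's I/O
can fail, which is why the number of paths and hosts matters so much.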

Steps to reproduce: Reload the SCSI driver on a cluster node (or boot it)

The solution, it seems to me, is to make it possible to tell the kernel NOT to
scan newly registered block devices for partition tables, and instead delegate
this task to udev (which can then be used to selectively pick which devices to
scan).

There might be other things the block layer sends when a device is being
registered too, I don't know.  But at least the INQUIRY, TEST UNIT READY, and
READ CAPACITY commands do not cause any volume transfers, so these should be
safe.  At least on my EMC and Engenio gear.

Regards,
Tore
Comment 1 Roland Kletzing 2009-08-29 09:41:11 UTC
does this problem still apply with more recent kernels or dm-multipath versions?
maybe we can close this one...!?
Comment 2 Roland Kletzing 2009-09-12 10:12:04 UTC
forget about the last post. 

so, would that mean  

- in fs/block_dev.c, the call to rescan_partitions() should be optional (i.e. skipped if there is some special param in place) ? 
- need to call BLKRRPART ioctl from userspace for partition scan instead? (as that calls rescan_partitions() too)

i'm not a real kernel coder, i'm just trying to understand linux/kernel better....
Comment 3 Tore Anderson 2009-09-12 10:38:25 UTC
Hello,

about your point #1 - that is exactly the case, yes.  Actually Hannes Reinecke posted a patch that did this a while back at http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-08/msg11622.html - don't know why it never got merged.

About point #2 - yes, userspace would have to make sure to call that ioctl (if needed for the device in question, of course) to make the partition devices appear.  An alternative to using the built-in kernel partition handling would be to use the kpartx tool to set up DM maps that correspond to the partitions (in this case you only need to change the userspace tool in order to support new partition table types and so on).
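
As a rough sketch of what the userspace side of point #2 could look like, here is the BLKRRPART call from Python.  The function name is made up for illustration, and the ioctl number is the value used on x86 and most other architectures (a few encode _IO() differently), so treat it as an assumption to check against <linux/fs.h> on the target system.

```python
import fcntl
import os

# BLKRRPART is defined as _IO(0x12, 95) in <linux/fs.h>. Python's fcntl module
# does not export it, so the x86 value is spelled out here (assumption: the
# target architecture encodes _IO() as (type << 8) | nr).
BLKRRPART = (0x12 << 8) | 95


def request_partition_rescan(device_path):
    """Ask the kernel to re-read the partition table of one whole-disk device.

    Hypothetical helper. Needs sufficient privileges (typically root) and
    raises OSError if the device cannot be opened or the kernel refuses the
    request, for example because a partition on the device is in use.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
    finally:
        os.close(fd)
```

A udev rule or initramfs script would then call something like request_partition_rescan("/dev/sdc") only for the devices it has decided are safe to read from; the device name here is just a placeholder.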

Tore
Comment 4 Roland Kletzing 2009-09-12 11:37:21 UTC
good pointer, but if no_partition_scan disables partition scanning globally, how should such a system boot successfully, as it won't recognize partitions on the local raid device anymore? the boot environment would need to be adjusted for that. wouldn't that make things difficult?
we're addressing a very specific problem in multipathing environments by disabling the partition scan, so i'd consider disabling the scan for dedicated blockdevices a better approach.

comments ?
Comment 5 Tore Anderson 2009-09-12 12:01:55 UTC
Yes, userspace would need to take care of triggering the partition scan for the devices where it is necessary.  In most distributions this would have to be done from the initramfs - which by the way is already responsible for loading the block device drivers, so it's likely not a very big change.

I'm not very sure what you mean by "dedicated blockdevices", though?

Tore
Comment 6 Roland Kletzing 2009-09-12 12:22:28 UTC
>I'm not very sure what you mean by "dedicated blockdevices", though?

pardon, i'm not a native english speaker. maybe "certain blockdevices" is the better word?

i mean, being able to either specify "do not scan for partitions on /dev/sdc, /dev/sdd..." or specify "do not scan for partitions on all devices except /dev/sda, /dev/...."
Comment 7 Tore Anderson 2009-09-12 17:31:22 UTC
I see.  The problem with that is that the device names aren't always deterministic - especially in SAN environments.  They've become less so of late too; most distributions use file system labels or UUIDs for mounting, and dm-multipath does something similar.

So even if you add a kernel param like "enablepartscanonlyon=sda,sdb" then you might risk that after a kernel upgrade, sda and sdb no longer represent the same devices as they used to.
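
To illustrate why a list keyed on kernel names is fragile while one keyed on a stable identifier is not, here is a small sketch.  The identifiers and the function name are invented for the example; in practice the mapping could come from something like udev's /dev/disk/by-id symlinks.

```python
def devices_to_scan(id_to_kernel_name, allowed_ids):
    """Translate an allow-list of stable identifiers into current kernel names.

    id_to_kernel_name: dict of stable identifier -> kernel name as currently
                       assigned (hypothetical data, e.g. gathered by udev)
    allowed_ids:       set of stable identifiers that should be scanned
    """
    return sorted(
        name for dev_id, name in id_to_kernel_name.items() if dev_id in allowed_ids
    )


allowed = {"local-raid-0"}  # hypothetical identifier of the boot device

# The same two devices before and after an upgrade that changes probe order.
before_upgrade = {"local-raid-0": "sda", "san-lun-7": "sdb"}
after_upgrade = {"san-lun-7": "sda", "local-raid-0": "sdc"}

print(devices_to_scan(before_upgrade, allowed))  # ['sda']
print(devices_to_scan(after_upgrade, allowed))   # ['sdc']
```

A list that said "sda" would have pointed at the SAN volume after the upgrade, which is exactly the device that must not be scanned.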

IMHO the best approach would be to leave partition scanning to userspace in general, which fits with the current trend of leaving other similar tasks to userspace (udev, initramfs, etc).  But that is a more lengthy undertaking - better to get Hannes Reinecke's patch in, and then get support for userspace-initiated partition scanning in the various distributions' mkinitramfs scripts so it can be enabled when it's necessary.

After that support is ubiquitous it might be possible to make it default enabled...

Tore
Comment 8 Alan 2012-05-21 14:39:33 UTC
As far as I can see you can now build a distribution that does this: compile the kernel with no partition support and use the device mapper layer instead.