Most recent kernel where this bug did *NOT* occur: None
Distribution: Debian, Ubuntu, RHEL (it applies to any distro, I guess)
Hardware Environment: Most kinds of low-end or midrange networked storage arrays
Software Environment: Clusters using dm-multipath

Problem Description:

When a cluster is using a storage array with active/passive controllers that moves volumes when it sees I/O arriving at a passive controller, the automatic partition table scan that happens when a block device is registered often causes a volume transfer, which in turn makes I/O to the formerly active controller fail for some time (probably until the volume is completely transferred away, after which I/O will just move it back again).

This applies to many different hardware vendors like Engenio (OEM'ed by Sun/StorageTek, IBM, probably others) and EMC. In Engenio's case this "Automatic Volume Transfer" mode is the only way to use the array with Linux and dm-multipath, as its alternative mode ("RDAC mode"), where volumes are only transferred after a specific SCSI command explicitly requests the transfer, is one dm-multipath does not yet have a hardware handler for.

This is problematic for clusters where several machines have access to the same volume. Consider the case where machine Foo and machine Bar are happily using a volume on such an array. They're both using controller 2 for all I/O, due to dm-multipath's nice path grouping and prioritising features, and all is well. At a later time, though, Foo is rebooted, and when it is starting up it loads the fibre channel HBA driver, which finds, say, eight paths to each of the two controllers. It then proceeds to register these in the kernel as block devices, and when the block layer goes on to scan for a partition table it generates I/O to controller 1, prompting the array to move the volume there and making Bar experience I/O errors.
When the array has completed the volume movement and Bar's dm-multipath has been able to redirect I/O there, you can be certain that Foo is well on its way to moving it back by registering another block device that represents a path through controller 2. This can go on for quite some time, depending on the complexity of the SAN and the number of volumes and hosts. In any case it's bad - until it settles down there has been very poor service, if any, and if dm-multipath has marked all paths as failed, VFS could have seen I/O errors and remounted filesystems read-only and so on. Pain.

Steps to reproduce: Reload the SCSI driver on a cluster node (or boot it).

The solution, it seems to me, is to make it possible somehow for the kernel NOT to scan newly registered block devices for partition tables, and instead delegate this task to udev (which can then be used to selectively pick which devices to scan). There might be other things the block layer sends when a device is being registered too, I don't know. But at least the INQUIRY, TEST UNIT READY, and READ CAPACITY commands do not cause any volume transfers, so these should be safe. At least on my EMC and Engenio gear.

Regards,
Tore
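To illustrate the ping-pong effect described in the report, here is a toy model in Python. It is only a sketch under simplifying assumptions that are not from the report itself: the array transfers the volume whenever I/O arrives at the non-owning controller, each partition scan is a single I/O, and Bar's dm-multipath always follows the current owner, so Bar never triggers a transfer itself. The function name and the probe orders are hypothetical.

```python
def count_transfers(initial_owner, probe_sequence):
    """Count volume transfers caused by partition scans.

    initial_owner:  controller that owns the volume before the scans start
    probe_sequence: controller reached by each newly registered path, in
                    the order the block devices are registered
    """
    owner = initial_owner
    transfers = 0
    for controller in probe_sequence:
        if controller != owner:
            # I/O hit the passive controller: the array moves the volume.
            owner = controller
            transfers += 1
    return transfers


# Foo registers eight paths per controller; the volume starts on controller 2.
grouped = [1] * 8 + [2] * 8   # all controller-1 paths first, then controller 2
interleaved = [1, 2] * 8      # paths discovered alternately

print(count_transfers(2, grouped))      # 2
print(count_transfers(2, interleaved))  # 16
```

In this model, the number of transfers depends entirely on the order in which paths happen to be discovered, which is one way to see why the disruption can drag on in a larger SAN.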
Does this problem still apply with more recent kernels or dm-multipath versions? Maybe we can close this one?
Forget about the last post. So, would that mean:

- in fs/block_dev.c, the call to rescan_partitions() should be optional (i.e. skipped if there is some special param in place)?
- userspace needs to call the BLKRRPART ioctl for the partition scan instead (as that calls rescan_partitions() too)?

I'm not a real kernel coder, I'm just trying to understand Linux and the kernel better.
Hello,

About your point #1 - that is exactly the case, yes. Actually Hannes Reinecke posted a patch that did this a while back at http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-08/msg11622.html - I don't know why it never got merged.

About point #2 - yes, userspace would have to make sure to call that ioctl (if needed for the device in question, of course) to make the partition devices appear. An alternative to using the built-in kernel partition handling would be to use the kpartx tool to set up DM maps that correspond to the partitions (in this case you only need to change the userspace tool in order to support new partition table types and so on).

Tore
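For reference, a minimal sketch of what "calling that ioctl from userspace" could look like in Python. The helper name request_partition_rescan is made up for this example; the request number 0x125F is BLKRRPART as defined in <linux/fs.h> (_IO(0x12, 95)). It needs sufficient privileges on a real block device and, like any partition scan, reads from the device, so on the arrays discussed here it should only be issued on a path where that is safe.

```python
import fcntl
import os

# BLKRRPART from <linux/fs.h>: _IO(0x12, 95)
BLKRRPART = 0x125F


def request_partition_rescan(device_path):
    """Ask the kernel to re-read the partition table of device_path.

    Returns 0 on success, or the errno value if the open or the ioctl fails.
    (Hypothetical helper for illustration, not part of any existing tool.)
    """
    try:
        fd = os.open(device_path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno
    try:
        fcntl.ioctl(fd, BLKRRPART)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

A caller would typically run something like request_partition_rescan("/dev/sdX") from a udev rule or initramfs script, only for the devices it has decided are safe to scan.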
Good pointer, but if no_partition_scan disables the partition scan globally, how should such a system boot successfully, as it won't recognize partitions on the local RAID device anymore? The boot environment would need to be adjusted for that. Wouldn't that make things difficult? We're addressing a very specific problem by disabling the partition scan because of problems in multipathing environments, so I'd consider disabling the scan for dedicated blockdevices a better approach. Comments?
Yes, userspace would need to take care of triggering the partition scan for the devices where it is necessary. In most distributions this would have to be done from the initramfs - which by the way is already responsible for loading the block device drivers, so it's likely not a very big change.

I'm not quite sure what you mean by "dedicated blockdevices", though?

Tore
> I'm not very sure what you mean by "dedicated blockdevices", though?

Pardon, I'm not a native English speaker. Maybe "certain blockdevices" is the better word? I mean being able to either specify "do not scan for partitions on /dev/sdc, /dev/sdd..." or specify "do not scan for partitions on all devices except /dev/sda, /dev/....".
I see. The problem with that is that the device names aren't always deterministic - especially in SAN environments. They've become less so of late, too, which is why most distributions now use file system labels or UUIDs for mounting, and dm-multipath does something similar. So even if you add a kernel param like "enablepartscanonlyon=sda,sdb", you risk that after a kernel upgrade sda and sdb no longer represent the same devices as they used to.

IMHO the best approach would be to leave partition scanning to userspace in general, which fits with the current trend of leaving other similar tasks to userspace (udev, initramfs, etc). But that is a lengthier undertaking - better to get Hannes Reinecke's patch in, and then get support for userspace-initiated partition scanning into the various distributions' mkinitramfs scripts so it can be enabled when necessary. Once that support is ubiquitous it might be possible to make it enabled by default...

Tore
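As a rough illustration of selecting devices by stable identifier rather than by sdX name, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the function name, the pattern syntax, and the sample identifiers are made up, and a real setup would more likely express this as udev rules matching on device properties.

```python
from fnmatch import fnmatchcase


def devices_to_scan(by_id_links, allowed_patterns):
    """Return the kernel names of devices that may be partition-scanned.

    by_id_links:      mapping of stable identifier -> current kernel name,
                      e.g. as read from the symlinks in /dev/disk/by-id
    allowed_patterns: shell-style patterns for identifiers that are safe
    """
    selected = set()
    for stable_id, kernel_name in by_id_links.items():
        if any(fnmatchcase(stable_id, pattern) for pattern in allowed_patterns):
            selected.add(kernel_name)
    return sorted(selected)


# Hypothetical identifiers; real ones depend on the hardware and udev rules.
links = {
    "scsi-LOCAL_RAID_0001": "sda",
    "scsi-SAN_VOLUME_00a1": "sdc",
    "scsi-SAN_VOLUME_00a2": "sdd",
}

print(devices_to_scan(links, ["scsi-LOCAL_*"]))  # ['sda']
```

Because the decision is keyed on the identifier, it stays the same even if the local disk shows up as sdb after the next boot.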
As far as I can see, you can now build a distribution that does this: compile the kernel with no partition support and use the device mapper layer instead.