Latest working kernel version: 2.6.25.18 Earliest failing kernel version: 2.6.27.7 Distribution: openSUSE Hardware Environment: x86_64 Software Environment: ? Problem Description: The behavior of network block device (nbd) on the client side appears to have changed somewhat - with 2.6.25.18 a network block device that disappeared would eventually time out on the client - that does not appear to be working with 2.6.27.7. Steps to reproduce: Starting with 2.6.25.18 (the default up-to-date kernel for opensuse 11.0), I served a local hard disk via nbd-server. Using the same kernel on another machine, I used nbd-client to make use of that hard disk: nbd-client bs=4096 timeout=30 frank 12345 /dev/nbd0 Then I was using /dev/nbd0 as part of a raid1, but that is only part of the story. If you start a 'dd' process on the client like this: strace -t -TT -f dd if=/dev/nbd0 of=/dev/null and then "reboot -fn" (or pull the plug or whatever, but *don't* reboot or shutdown or whatever in any sort of graceful fashion) the server then the client gets stuck in a read. I let the last one sit there for over 2 hours. Using nbd-client -d /dev/nbd0 does not kill the read(2). This is where it gets yucky - if the nbd device is part of a raid, all further operations on /proc/mdstat and on the raid device itself (/dev/mdNN) fail. What's worse, mdadm get stuck /in the kernel/ and cannot be interrupted: Using magic-sysrq 't' this is what mdadm is doing: Dec 22 15:45:08 turnip kernel: mdadm D 0000000000000000 0 14624 14623 Dec 22 15:45:08 turnip kernel: ffff8800022c5bd8 0000000000000086 ffff8800022c5b18 ffffffff80a59980 Dec 22 15:45:08 turnip kernel: ffffffff80a59980 ffffffff80a567f0 ffffffff80a59980 ffffffff80a59980 Dec 22 15:45:08 turnip kernel: ffffffff80a59980 ffffffff80a59980 ffffffff80a59980 ffffffff80a59980 Dec 22 15:45:08 turnip kernel: Call Trace: Dec 22 15:45:08 turnip kernel: [<ffffffff80408bb6>] md_super_wait+0xc6/0xe0 Dec 22 15:45:08 turnip kernel: [<ffffffff8040d5ac>] write_sb_page+0x1a3/0x1d4 Dec 22 15:45:08 turnip kernel: [<ffffffff8040d8d1>] write_page+0x1c/0x103 Dec 22 15:45:08 turnip kernel: [<ffffffff804096a8>] md_update_sb+0x20b/0x2fa Dec 22 15:45:08 turnip kernel: [<ffffffff8040be10>] md_allow_write+0x9f/0xe4 Dec 22 15:45:08 turnip kernel: [<ffffffff8040be68>] get_bitmap_file+0x13/0xe1 Dec 22 15:45:08 turnip kernel: [<ffffffff8040c42e>] md_ioctl+0x4f8/0x897 Dec 22 15:45:08 turnip kernel: [<ffffffff8034bac1>] blkdev_driver_ioctl+0x5d/0x72 Dec 22 15:45:08 turnip kernel: [<ffffffff8034c32e>] blkdev_ioctl+0x1f5/0x217 Dec 22 15:45:08 turnip kernel: [<ffffffff802e20e7>] block_ioctl+0x1b/0x20 Dec 22 15:45:08 turnip kernel: [<ffffffff802c70b9>] vfs_ioctl+0x21/0x6c Dec 22 15:45:08 turnip kernel: [<ffffffff802c7343>] do_vfs_ioctl+0x23f/0x255 Dec 22 15:45:08 turnip kernel: [<ffffffff802c73aa>] sys_ioctl+0x51/0x73 Dec 22 15:45:08 turnip kernel: [<ffffffff8020c37a>] system_call_fastpath+0x16/0x1b Dec 22 15:45:08 turnip kernel: [<00007f6a54b61b57>] 0x7f6a54b61b57 and the files it has open: turnip:~ # ls -la /proc/14624/fd total 0 dr-x------ 2 root root 0 Dec 22 13:13 . dr-xr-xr-x 7 root root 0 Dec 22 13:13 .. lr-x------ 1 root root 64 Dec 22 13:13 0 -> /dev/null l-wx------ 1 root root 64 Dec 22 13:13 1 -> pipe:[446282] l-wx------ 1 root root 64 Dec 22 13:13 2 -> pipe:[7246] lr-x------ 1 root root 64 Dec 22 15:47 3 -> /dev/md11 turnip:~ # Again, using 2.6.25.18 this did not appear to be a problem.
Basically, any I/O that is taking place prior to the the disconnect operation (nbd-client -d /dev/nbd0) gets hung and does not return an error as in prior versions.
The following is an strace taken from a machine running 2.6.25.18, executing the 'dd if=/dev/nbd0 of=/dev/null' command. The server was rebooted with 'reboot -fn' very shortly before this portion of the strace: 07:57:56.568319 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 <0.000180> 07:57:56.568648 read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 <0.000185> 07:57:56.568986 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 <0.000144> 07:57:56.569268 read(0, ^^^ this is where it hangs. This is expected. 0x8059000, 512) = -1 EIO (Input/output error) <161.315346> ^^ it hung for approx. 161 seconds. I was expecting something more like 30 seconds (I used timeout=30 in the setup command) but 161 seconds is better than "forever" ;-) 08:00:37.884925 open("/usr/share/locale-langpack/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000044> 08:00:37.885143 open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000034> 08:00:37.885292 open("/usr/share/locale-bundle/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000034> 08:00:37.885440 open("/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000032> It should also be noted that I also file a novell bug here: https://bugzilla.novell.com/show_bug.cgi?id=462269
hello jon, shouldn`t we mark this as a regression?
I'm inclined to say "yes".
...can you try to git bisect it?
Do you see "killing hung xmit" message in dmesg?
I do not, but I can try again. I don't have logs that go back that far. I assume that I would see that in /var/log/messages, no?
Yes, it should be in messages. Trying latest -mm driver would be nice.
Pavel, Jon, check novell bugzilla -- I've made some updates. I attached a patch, which I think will fix this.