Latest working kernel version: 2.6.25.6? (2.6.23.16 was fine) Earliest failing kernel version: 2.6.25.10 Distribution: Debian Etch Hardware Environment: x86 Software Environment: Linux 2.6.25.10, NFSv3 Problem Description: This is the same NFS server and clients described in bug #8467, the NFS mount options are still the same as described there. The clients boot from USB flash drives and then pivot_root onto the readonly NFS root FS. Since upgrading the server and clients to 2.6.25.10, we occasionally see the clients suddenly lose NFS connectivity, without a loss of the network connection. I see a message like this on our syslog server: graciano kernel: nfs: server xarello not responding, still trying And then graciano locks up unable to access its root FS, and never recovers. rpcinfo doesn't show anything unusual: $ rpcinfo -p graciano program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100024 1 udp 622 status 100024 1 tcp 625 status 100021 1 tcp 53873 nlockmgr 100021 3 tcp 53873 nlockmgr 100021 4 tcp 53873 nlockmgr Compare another client not experiencing the problem: $ rpcinfo -p jacquere program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100024 1 udp 630 status 100024 1 tcp 633 status 100021 1 tcp 50415 nlockmgr 100021 3 tcp 50415 nlockmgr 100021 4 tcp 50415 nlockmgr And I can still reboot the machine remotely via ssh by invoking the reboot binary on the flash boot drive. I never saw this under 2.6.23.16 and earlier kernel versions. We upgraded from that version to 2.6.25.6 and then to 2.6.25.10. This problem never occurred as far as I know under 2.6.25.6, but I can't say for sure that that version is OK since we only used it for a short time. Steps to reproduce: unknown, it has happened twice so far, seemingly at random.
Again, an alt-sysrq-t trace would be helpful in order to figure out where the processes are hanging. Also, perhaps, the contents of /proc/meminfo It would also be useful if you could check if there is any NFS traffic on the wire when this occurs (use wireshark or tcpdump).
Actually, if 2.6.25.6 is working, then I'd suspect a networking driver issue or something like that. There have been no NFS changes at all in the 2.6.25.x series afaik. Have you checked the list of relevant patches since 2.6.25.6?
It turns out I made a big error with the original report. I had compiled 2.6.25.10 but failed to install it on the server, so at the time of the incidents described above the server was running 2.6.25.6 but all clients were 2.6.25.10. I was about to say I hadn't seen it happen since booting 2.6.25.10 on the server, but then it happened to our web server this morning. The symptoms were the same, it didn't lose network connectivity, but all NFS mounts from xarello hung. The web server (airen) has a local disk instead of root on NFS, so it was easier to investigate. Here is /proc/meminfo from the server: MemTotal: 8079504 kB MemFree: 43696 kB Buffers: 1038476 kB Cached: 3993784 kB SwapCached: 0 kB Active: 896912 kB Inactive: 4253960 kB SwapTotal: 3960012 kB SwapFree: 3959976 kB Dirty: 608 kB Writeback: 0 kB AnonPages: 118612 kB Mapped: 47476 kB Slab: 2848668 kB SReclaimable: 2753400 kB SUnreclaim: 95268 kB PageTables: 5740 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 7999764 kB Committed_AS: 226640 kB VmallocTotal: 34359738367 kB VmallocUsed: 3048 kB VmallocChunk: 34359735279 kB and from the client: MemTotal: 515428 kB MemFree: 34532 kB Buffers: 81180 kB Cached: 152764 kB SwapCached: 0 kB Active: 210624 kB Inactive: 162188 kB SwapTotal: 979924 kB SwapFree: 979924 kB Dirty: 172 kB Writeback: 0 kB AnonPages: 138888 kB Mapped: 13596 kB Slab: 103196 kB SReclaimable: 94876 kB SUnreclaim: 8320 kB PageTables: 2180 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 1237636 kB Committed_AS: 330024 kB VmallocTotal: 516048 kB VmallocUsed: 692 kB VmallocChunk: 515176 kB I ran tcpdump on the server, and there was no NFS/RPC traffic at all, even when I tried to ls the mounts on airen. Then I tried unmounting and remounting one mount and got the following traffic: 09:54:48.939506 IP (tos 0x0, ttl 64, id 20658, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: S, cksum 0x9b14 (correct), 2385953479:2385953479(0) win 5840 <mss 1460,sackOK,timestamp 368087669 0,nop,wscale 5> 09:54:48.939528 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.58967: S, cksum 0x536a (correct), 2436276658:2436276658(0) ack 2385953480 win 5792 <mss 1460,sackOK,timestamp 291505022 368087669,nop,wscale 7> 09:54:48.939655 IP (tos 0x0, ttl 64, id 20659, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x981f (correct), ack 1 win 183 <nop,nop,timestamp 368087669 291505022> 09:54:48.952535 IP (tos 0x0, ttl 64, id 20660, offset 0, flags [DF], proto: TCP (6), length: 112) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: P 1:61(60) ack 1 win 183 <nop,nop,timestamp 368087672 291505022> 09:54:48.952554 IP (tos 0x0, ttl 64, id 41677, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.58967: ., cksum 0x9866 (correct), ack 61 win 46 <nop,nop,timestamp 291505025 368087672> 09:54:48.952731 IP (tos 0x0, ttl 64, id 41678, offset 0, flags [DF], proto: TCP (6), length: 84) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.58967: P 1:33(32) ack 61 win 46 <nop,nop,timestamp 291505025 368087672> 09:54:48.952841 IP (tos 0x0, ttl 64, id 20661, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x97bc (correct), ack 33 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.952851 IP (tos 0x0, ttl 64, id 20662, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: F, cksum 0x97bb (correct), 61:61(0) ack 33 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.952875 IP (tos 0x0, ttl 64, id 3809, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: S, cksum 0xa2f2 (correct), 2381552792:2381552792(0) win 5840 <mss 1460,sackOK,timestamp 368087673 0,nop,wscale 5> 09:54:48.952882 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.665: S, cksum 0xa984 (correct), 2438288212:2438288212(0) ack 2381552793 win 5792 <mss 1460,sackOK,timestamp 291505025 368087673,nop,wscale 7> 09:54:48.952886 IP (tos 0x0, ttl 64, id 41679, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.58967: F, cksum 0x9843 (correct), 33:33(0) ack 62 win 46 <nop,nop,timestamp 291505025 368087673> 09:54:48.952937 IP (tos 0x0, ttl 64, id 3810, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: ., cksum 0xee39 (correct), ack 1 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.952965 IP (tos 0x0, ttl 64, id 20663, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.58967 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x97ba (correct), ack 34 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.952985 IP (tos 0x0, ttl 64, id 3811, offset 0, flags [DF], proto: TCP (6), length: 96) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: P 1:45(44) ack 1 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.952988 IP (tos 0x0, ttl 64, id 20692, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.665: ., cksum 0xee96 (correct), ack 45 win 46 <nop,nop,timestamp 291505025 368087673> 09:54:48.953157 IP (tos 0x0, ttl 64, id 20693, offset 0, flags [DF], proto: TCP (6), length: 80) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.665: P, cksum 0xb635 (incorrect (-> 0x946c), 1:29(28) ack 45 win 46 <nop,nop,timestamp 291505025 368087673> 09:54:48.953242 IP (tos 0x0, ttl 64, id 3812, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: ., cksum 0xedf1 (correct), ack 29 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.953248 IP (tos 0x0, ttl 64, id 3813, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: F, cksum 0xedf0 (correct), 45:45(0) ack 29 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.953257 IP (tos 0x0, ttl 64, id 61994, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: S, cksum 0xdf2e (correct), 2382061651:2382061651(0) win 5840 <mss 1460,sackOK,timestamp 368087673 0,nop,wscale 5> 09:54:48.953268 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.666: S, cksum 0x58ff (correct), 2442387415:2442387415(0) ack 2382061652 win 5792 <mss 1460,sackOK,timestamp 291505025 368087673,nop,wscale 7> 09:54:48.953295 IP (tos 0x0, ttl 64, id 20694, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.665: F, cksum 0xee78 (correct), 29:29(0) ack 46 win 46 <nop,nop,timestamp 291505025 368087673> 09:54:48.953337 IP (tos 0x0, ttl 64, id 61995, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: ., cksum 0x9db4 (correct), ack 1 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.953367 IP (tos 0x0, ttl 64, id 61996, offset 0, flags [DF], proto: TCP (6), length: 144) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: P 1:93(92) ack 1 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.953373 IP (tos 0x0, ttl 64, id 56409, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.666: ., cksum 0x9de1 (correct), ack 93 win 46 <nop,nop,timestamp 291505025 368087673> 09:54:48.953380 IP (tos 0x0, ttl 64, id 3814, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.665 > xarello.bcm.umontreal.ca.702: ., cksum 0xedef (correct), ack 30 win 183 <nop,nop,timestamp 368087673 291505025> 09:54:48.964356 IP (tos 0x0, ttl 64, id 56410, offset 0, flags [DF], proto: TCP (6), length: 80) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.666: P, cksum 0xb635 (incorrect (-> 0x3610), 1:29(28) ack 93 win 46 <nop,nop,timestamp 291505028 368087673> 09:54:48.964468 IP (tos 0x0, ttl 64, id 61997, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: ., cksum 0x9d37 (correct), ack 29 win 183 <nop,nop,timestamp 368087675 291505028> 09:54:48.964490 IP (tos 0x0, ttl 64, id 61998, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: F, cksum 0x9d36 (correct), 93:93(0) ack 29 win 183 <nop,nop,timestamp 368087675 291505028> 09:54:48.964542 IP (tos 0x0, ttl 64, id 56411, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.666: F, cksum 0x9dbe (correct), 29:29(0) ack 94 win 46 <nop,nop,timestamp 291505028 368087675> 09:54:48.964621 IP (tos 0x0, ttl 64, id 61999, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.666 > xarello.bcm.umontreal.ca.702: ., cksum 0x9d35 (correct), ack 30 win 183 <nop,nop,timestamp 368087675 291505028> 09:54:58.400325 IP (tos 0x0, ttl 64, id 58035, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: S, cksum 0x1869 (correct), 2536405249:2536405249(0) win 5840 <mss 1460,sackOK,timestamp 368090034 0,nop,wscale 5> 09:54:58.400342 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.41108: S, cksum 0x35ee (correct), 2579770040:2579770040(0) ack 2536405250 win 5792 <mss 1460,sackOK,timestamp 291507387 368090034,nop,wscale 7> 09:54:58.400426 IP (tos 0x0, ttl 64, id 58036, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x7aa3 (correct), ack 1 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400454 IP (tos 0x0, ttl 64, id 58037, offset 0, flags [DF], proto: TCP (6), length: 112) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: P 1:61(60) ack 1 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400459 IP (tos 0x0, ttl 64, id 9730, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.41108: ., cksum 0x7af0 (correct), ack 61 win 46 <nop,nop,timestamp 291507387 368090034> 09:54:58.400633 IP (tos 0x0, ttl 64, id 9731, offset 0, flags [DF], proto: TCP (6), length: 84) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.41108: P 1:33(32) ack 61 win 46 <nop,nop,timestamp 291507387 368090034> 09:54:58.400775 IP (tos 0x0, ttl 64, id 58038, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x7a47 (correct), ack 33 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400785 IP (tos 0x0, ttl 64, id 58039, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: F, cksum 0x7a46 (correct), 61:61(0) ack 33 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400799 IP (tos 0x0, ttl 64, id 1828, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.0 > xarello.bcm.umontreal.ca.nfs: 0 null 09:54:58.400806 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.nfs > airen.bcm.umontreal.ca.0: reply ok 0 null 09:54:58.400810 IP (tos 0x0, ttl 64, id 9732, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.41108: F, cksum 0x7ace (correct), 33:33(0) ack 62 win 46 <nop,nop,timestamp 291507387 368090034> 09:54:58.400925 IP (tos 0x0, ttl 64, id 1829, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.667 > xarello.bcm.umontreal.ca.nfs: ., cksum 0x1a92 (correct), ack 2594307158 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400951 IP (tos 0x0, ttl 64, id 58040, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.41108 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x7a45 (correct), ack 34 win 183 <nop,nop,timestamp 368090034 291507387> 09:54:58.400965 IP (tos 0x0, ttl 64, id 1830, offset 0, flags [DF], proto: TCP (6), length: 96) airen.bcm.umontreal.ca.2050133211 > xarello.bcm.umontreal.ca.nfs: 44 null 09:54:58.400970 IP (tos 0x0, ttl 64, id 61677, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.nfs > airen.bcm.umontreal.ca.667: ., cksum 0x1aef (correct), ack 44 win 46 <nop,nop,timestamp 291507387 368090034> 09:54:58.401062 IP (tos 0x0, ttl 64, id 61678, offset 0, flags [DF], proto: TCP (6), length: 80) xarello.bcm.umontreal.ca.nfs > airen.bcm.umontreal.ca.2050133211: reply ok 28 null 09:54:58.401174 IP (tos 0x0, ttl 64, id 1831, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.667 > xarello.bcm.umontreal.ca.nfs: ., cksum 0x1a49 (correct), ack 29 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401181 IP (tos 0x0, ttl 64, id 1832, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.667 > xarello.bcm.umontreal.ca.nfs: F, cksum 0x1a48 (correct), 44:44(0) ack 29 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401190 IP (tos 0x0, ttl 64, id 18562, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: S, cksum 0xbec6 (correct), 2541343537:2541343537(0) win 5840 <mss 1460,sackOK,timestamp 368090035 0,nop,wscale 5> 09:54:58.401200 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.40890: S, cksum 0x4d3f (correct), 2580593080:2580593080(0) ack 2541343538 win 5792 <mss 1460,sackOK,timestamp 291507387 368090035,nop,wscale 7> 09:54:58.401218 IP (tos 0x0, ttl 64, id 61679, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.nfs > airen.bcm.umontreal.ca.667: F, cksum 0x1ad0 (correct), 29:29(0) ack 45 win 46 <nop,nop,timestamp 291507387 368090035> 09:54:58.401324 IP (tos 0x0, ttl 64, id 18563, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x91f4 (correct), ack 1 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401341 IP (tos 0x0, ttl 64, id 1833, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.667 > xarello.bcm.umontreal.ca.nfs: ., cksum 0x1a47 (correct), ack 30 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401351 IP (tos 0x0, ttl 64, id 18564, offset 0, flags [DF], proto: TCP (6), length: 112) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: P 1:61(60) ack 1 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401357 IP (tos 0x0, ttl 64, id 46338, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.40890: ., cksum 0x9241 (correct), ack 61 win 46 <nop,nop,timestamp 291507387 368090035> 09:54:58.401447 IP (tos 0x0, ttl 64, id 46339, offset 0, flags [DF], proto: TCP (6), length: 84) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.40890: P 1:33(32) ack 61 win 46 <nop,nop,timestamp 291507387 368090035> 09:54:58.401573 IP (tos 0x0, ttl 64, id 18565, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x9198 (correct), ack 33 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401580 IP (tos 0x0, ttl 64, id 18566, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: F, cksum 0x9197 (correct), 61:61(0) ack 33 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401588 IP (tos 0x0, ttl 64, id 18882, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: S, cksum 0x5a5a (correct), 2541081201:2541081201(0) win 5840 <mss 1460,sackOK,timestamp 368090035 0,nop,wscale 5> 09:54:58.401594 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.668: S, cksum 0xda52 (correct), 2584070147:2584070147(0) ack 2541081202 win 5792 <mss 1460,sackOK,timestamp 291507387 368090035,nop,wscale 7> 09:54:58.401599 IP (tos 0x0, ttl 64, id 46340, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.sunrpc > airen.bcm.umontreal.ca.40890: F, cksum 0x921f (correct), 33:33(0) ack 62 win 46 <nop,nop,timestamp 291507387 368090035> 09:54:58.401723 IP (tos 0x0, ttl 64, id 18883, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: ., cksum 0x1f08 (correct), ack 1 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401736 IP (tos 0x0, ttl 64, id 18567, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.40890 > xarello.bcm.umontreal.ca.sunrpc: ., cksum 0x9196 (correct), ack 34 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401744 IP (tos 0x0, ttl 64, id 18884, offset 0, flags [DF], proto: TCP (6), length: 96) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: P 1:45(44) ack 1 win 183 <nop,nop,timestamp 368090035 291507387> 09:54:58.401748 IP (tos 0x0, ttl 64, id 50360, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.668: ., cksum 0x1f64 (correct), ack 45 win 46 <nop,nop,timestamp 291507388 368090035> 09:54:58.401882 IP (tos 0x0, ttl 64, id 50361, offset 0, flags [DF], proto: TCP (6), length: 80) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.668: P, cksum 0xb635 (incorrect (-> 0xcc4a), 1:29(28) ack 45 win 46 <nop,nop,timestamp 291507388 368090035> 09:54:58.401973 IP (tos 0x0, ttl 64, id 18885, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: ., cksum 0x1ebf (correct), ack 29 win 183 <nop,nop,timestamp 368090035 291507388> 09:54:58.401981 IP (tos 0x0, ttl 64, id 18886, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: F, cksum 0x1ebe (correct), 45:45(0) ack 29 win 183 <nop,nop,timestamp 368090035 291507388> 09:54:58.401990 IP (tos 0x0, ttl 64, id 58833, offset 0, flags [DF], proto: TCP (6), length: 60) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: S, cksum 0xae8d (correct), 2538634850:2538634850(0) win 5840 <mss 1460,sackOK,timestamp 368090035 0,nop,wscale 5> 09:54:58.401998 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.669: S, cksum 0x10a7 (correct), 2582570488:2582570488(0) ack 2538634851 win 5792 <mss 1460,sackOK,timestamp 291507388 368090035,nop,wscale 7> 09:54:58.402039 IP (tos 0x0, ttl 64, id 50362, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.668: F, cksum 0x1f46 (correct), 29:29(0) ack 46 win 46 <nop,nop,timestamp 291507388 368090035> 09:54:58.402123 IP (tos 0x0, ttl 64, id 58834, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: ., cksum 0x555c (correct), ack 1 win 183 <nop,nop,timestamp 368090035 291507388> 09:54:58.402140 IP (tos 0x0, ttl 64, id 18887, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.668 > xarello.bcm.umontreal.ca.702: ., cksum 0x1ebd (correct), ack 30 win 183 <nop,nop,timestamp 368090035 291507388> 09:54:58.402149 IP (tos 0x0, ttl 64, id 58835, offset 0, flags [DF], proto: TCP (6), length: 144) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: P 1:93(92) ack 1 win 183 <nop,nop,timestamp 368090035 291507388> 09:54:58.402155 IP (tos 0x0, ttl 64, id 59717, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.669: ., cksum 0x5589 (correct), ack 93 win 46 <nop,nop,timestamp 291507388 368090035> 09:54:58.406217 IP (tos 0x0, ttl 64, id 59718, offset 0, flags [DF], proto: TCP (6), length: 124) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.669: P 1:73(72) ack 93 win 46 <nop,nop,timestamp 291507389 368090035> 09:54:58.406317 IP (tos 0x0, ttl 64, id 58836, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: ., cksum 0x54b6 (correct), ack 73 win 183 <nop,nop,timestamp 368090036 291507389> 09:54:58.406329 IP (tos 0x0, ttl 64, id 58837, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: F, cksum 0x54b5 (correct), 93:93(0) ack 73 win 183 <nop,nop,timestamp 368090036 291507389> 09:54:58.406351 IP (tos 0x0, ttl 64, id 59719, offset 0, flags [DF], proto: TCP (6), length: 52) xarello.bcm.umontreal.ca.702 > airen.bcm.umontreal.ca.669: F, cksum 0x553d (correct), 73:73(0) ack 94 win 46 <nop,nop,timestamp 291507389 368090036> 09:54:58.406424 IP (tos 0x0, ttl 64, id 58838, offset 0, flags [DF], proto: TCP (6), length: 52) airen.bcm.umontreal.ca.669 > xarello.bcm.umontreal.ca.702: ., cksum 0x54b4 (correct), ack 74 win 183 <nop,nop,timestamp 368090036 291507389> The mount did not succeed, however, it just hung. I got sysrq-t traces from both client and server, I'll attach those. Doing sysrq-t caused problems on both machines, on the client I got Clocksource tsc unstable (delta = 253128831675 ns) Hangcheck: hangcheck value past margin! That seems harmless, but on the server it killed networking: eth2: Detected Tx Unit Hang: TDH <ba> TDT <ba> next_to_use <ba> next_to_clean <6d> buffer_info[next_to_clean]: time_stamp <11163f235> next_to_watch <6d> jiffies <111644ddf> next_to_watch.status <1> bonding: bond0: link status definitely up for interface eth0. Re-initializing the interfaces didn't work, so I had to reboot the server. Interestingly the client - not rebooted - can now see its NFS mounts since the server came back up.
Created attachment 16964 [details] sysrq-t of airen (client) during the problem
Created attachment 16965 [details] sysrq-t of xarello (server) during the problem
Created attachment 16982 [details] Ryan's trace from comment #3 I reformated Ryan's trace from comment #3 for easier readability (1 line per packet, and removed tos 0x0, ttl 64, offset 0, flags [DF], proto: TCP (6), and .bcm.umontreal.ca domain information which was duplicated in every packet). I'm not able to interpret the network trace myself, however I was surprised to see that 3 separate packets from the NFS server xarello, port 702 had an identical but incorrect checksum of 0xb635. No idea what that means, though.
I see something very similar. I have a Dell laptop with NFS root which will regularly hang (from which it has only ever recovered once, after several hours). I have a similar HP laptop mount ing the same NFS root, which never seems to see the problem. The server is fine, as evidenced by the good client continuing to work. I can ping the bad laptop. The main differences are the Dell has 512MB and an Broadcom BCM4401-B0 100Base-TX (rev 02) ethernet chip (b44 driver) vs the HP with 1GB and a Broadcom BCM5705M_2 Gigabit Ethernet (rev 03) (tg3 driver).
My earlier comment re the HP laptop is a red herring. Both laptops experience the NFS hang. This problem is still there in version 2.6.27.7. It never occurs with 2.6.24.4 (which is later than the 2.6.23.16 listed in the original report. I think there is a high probability that this is the same bug as 11154, because they appeared about the same time. Bug 11154 focuses on behaviour which occurs after resume, but "not always". This problem is pretty painful as it makes kernels after 2.6.24.4 useless to me. Is there anything I can do to help? Because all file systems are on NFS, it is pretty hard to do much to investigate after the hang. I do remote logging so I have a record of the "nfs server not responding" and "nfs server OK" messages. Generally it can go for many hours before the "OK" if any. Maybe it would be possible to enable (or add) additional logging?
Ian, did you try the patches in bug 11154 comments 18-19 ? I'm currently running 2.6.27.7 with both patches applied and my problem is gone. In 2.6.28-rc6 the first patch is already applied but not the second one - and the problem is gone there too but only after applying the second patch. Not sure if these patches will help though since I don't see how they explain hangs longer than 300 seconds. There is a chance you may have something else going on. Maybe you can try running tcpdump on the server to see what packets are exchanged with the laptop ?
I hadn't but I have now! Unfortunately these patches do not solve my problem. I tcpdump (actually wireshark) and I can't see anything interesting. There is plenty of NFS traffic even after the server supposedly became not ready. The only interesting features are that the operations are all V3 GETATTR or V3 ACCESS (no read or write) and the same operations get tried on the same filehandles many times. The replies are all NFS3_OK. This is entirely consistent with the client failing to respond to the servers replies.
Further investigation shows that 2.6.24.7 (the last 2.6.24.x) works fine, and 2.5.25 doesn't.
I'm having trouble following the discussion, since it's never clear whether the version numbers given apply to the server, the client, or both. So in #11, for example, are you leaving the client kernel versions fixed, and varying only the server versions?
These are client version numbers. I note the original report had upgraded both client and server, but I have never changed the server, which is 2.6.26.3-14, since before this issue arose.
We see this same problem. client: Jan 9 07:38:04 volt kernel: nfs: server ifs not responding, timed out server: Linux 2.6.18-113.el5 #1 SMP Fri Sep 12 11:02:39 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux clients: Linux 2.6.27.9-159.fc10.i686.PAE #1 SMP Tue Dec 16 14:59:37 EST 2008 i686 i686 i386 GNU/Linux Linux 2.6.25.9-76.fc9.x86_64 #1 SMP Fri Jun 27 15:58:30 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux client: rpcinfo -p server looks good
I have nearly completed a bisection to discover exactly when this bug was introduced. It takes a while because it is unpredictable how long it takes to show up. I should know within 24 hours...
Ok. The bisection reveals: c81468a1a766921f11ae44e8a99816ac8dc7b015 is first bad commit commit c81468a1a766921f11ae44e8a99816ac8dc7b015 Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Fri Dec 14 14:56:05 2007 -0500 NFS: Clean up the nfs_find_client function. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> :040000 040000 38d3ae161b4ea03cc801b391449ac0273397a889 4e2e506ba4ab43beebb5fa365595986916cccfd5 M fs Looking at that commit, it looks like just a re-organisation (moving stuff from a helper function etc) but one thing I notice is missing. Pre this commit, nfs_find_client() had: - if (clp != NULL && clp->cl_cons_state != NFS_CS_READY) { - nfs_put_client(clp); - clp = NULL; } The new nfs_find_client omits the call to nfs_put_client in this case. I've no idea if this is significant.
I think the original code may have been somewhat convoluted for a reason! The new code, as I said (#16) omits the call to nfs_put_client for the case where cl_cons_state is not NFS_CS_READY. __nfs_find_client used to increase the reference count for this case, and then nfs_put_client would decrease the reference count again. The new way seems much cleaner /except/, that is not all nfs_put_client does. It also unlinks the client if the reference count reaches zero. I don't think one can list_del it while in the middle of a list_for_each_entry loop, or bad things would happen. So this would explain the apparently convoluted original code. Maybe this is a case which never arises, but the only functional difference I can spot between the original working code and the new non-working code is this (case of !NFS_CS_READY) and the the change from comparing ip address and port number to memcmp'ing the whole sockaddr structure. This (using memcmp) on a socket address is also potentially problematic, because there are unused padding bytes in a socket address. Unless one can be sure that the unused bytes have always been initialized the compare could fail when it shouldn't.
We do not see the same problem when the kernel versions are the same (2.6.27, see below). We have 2 clusters, one with 12 nodes and one with 24 nodes that have lots of nfs mounts from the master to the nodes. These are all running 2.6.27, and we do not see the nfs server timeouts on these systems. [root@carbon ecrp12]# uname -a Linux carbon.corp.com 2.6.27.7-53.fc9.x86_64 #1 SMP Thu Nov 27 02:05:02 EST 2008 x86_64 x86_64 x86_64 GNU/Linux [root@carbon ecrp12]# rsh node46 uname -a Linux node46.cl.corp.com 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux The problems continue on the systems listed in the earlier post. We do have verbose out from a failed nfs/rpc call. Should I post this?
Created attachment 19784 [details] Don't use memcmp on sockaddr This patch works for me so far with 2.27.7. memcmp() is not suitable to compare network addresses. A sockaddr struct contains an opaque bunch of bytes which as (almost) never all used when it is aliased to a specific family address structure. Two sockaddr structs can have the same family, same ipaddr and same port number and yet not compare equal with memcmp because they differ in the unused bytes used to pad out the structure. Often those padding bytes would happen to be zero, but evidently, not always!
I see... I think that the reason we're not seeing this here is that we're probably all using the new text based mount interface. The latter allocates the sockaddr using a kzalloc(), and the helpers nfs_parse_ipv4_address() subsequently initialise the sockaddr_in->sin_addr.s_addr field directly. IOW: all the padding etc remain set to zero. I'm therefore assuming that you are using either an old version of the utils-linux mount code, or one of the early versions of nfs-utils with the binary-only mount.nfs utility. Might that be the case?
I guess so, nfs-utils-1.1.1-2.fc8, which doesn't seem *that* old. So maybe if I had upgraded to fc9 or 10 I would never have seen this problem. Instead I wanted to check the kernel first as I have been caught before upgrading to a new distribution to discover it doesn't work due to a kernel bug (usually in the driver for some particular hardware), and having no easy downgrade path. This time I have been caught out the other way! Regardless, it still seems like a bug so long as the old kernel mount interface is still advertised.
nfs-utils-1.1.1 does in fact have the text-based mount capability, but you have to explicitly enable it using the '-i' option. OK. That explanation makes sense, and is all I wanted to clarify. I agree with you that this needs to be fixed so that we can continue to support the legacy binary mount interface.
Could you please add a changelog entry, and a Signed-off-by line, check that the patch applies against 2.6.29-rc1, and then submit it for inclusion via the usual NFS mailing list (linux-nfs@vger.kernel.org)? Thanks...
Am trying to do this, (my original patch doesn't apply cleanly against 2.6.29-rc[12]) but I have a bigger nfs problem with these kernels. They won't boot. I get: Mounting root filesystem nfsmount: error mounting server:/foo/blah on /sysroot as nfs: Connection refused. But the server logs don't show any problem and in fact "showmount -a" shows the export as mounted by this client! This is all without my changes applied. Before I file a new bug report, is there anything I should be aware of? Could this be the text-based mount thing (comment 20)? For these diskless clients, the nfs mount is in fact being done by nash, not nfs-utils.