Most recent kernel where this bug did *NOT* occur: none Distribution: Gentoo Hardware Environment: x86_64 Core 2 Duo, D-Link System Inc DGE-528T Gigabit Ethernet Adapter Problem Description: http://lkml.org/lkml/2007/1/30/110 Steps to reproduce: 1. I wait for the NAS to sleep. 2. I access the NAS with the Amarok application to read a mp3 file. This action triggers the awakening of the NAS (~3 secs). 3. Before the NAS awakes, I access the NAS from another application : a file manager (Konqueror). I click on the directory that contains the mp3 (which is the same directory that is accessed by amarok - this directory contains the mounted cifs volume of the NAS). 4. The system freezes.
Well, this could be a net driver bug or it could be a CIFS bug. One way to determine that would be to try the same operation from a machine whcih has a different type of NIC. Or swap out the NIC in that machine. Is that possible?
I have tried 6-7 times to trigger the bug from my laptop since yesterday (IBM T42 with an Intel Corporation 82540EP Gigabit Ethernet Controller) with both the 2.6.19.3 and 2.6.20.1 kernel, with no success. It seems to be a net driver related bug.
You have two options: 1. (shot in the dark) try http://bugzilla.kernel.org/attachment.cgi?id=10646 against 2.6.21-rc3. Don't use netpoll. 2. start a 'tcpdump -s 0 -w /tmp/zoinx.pcap -i eth0' on your laptop and issue the same sequence that would otherwise kill the host with the r8169. #2 is higher priority. I wonder what the NAS can push in the pipe when it wakes up. -- Ueimor
I just have a network switch and not a hub, so I can't monitor the connection between the NAS and the crashing system, from my laptop. I will try to borrow a hub and do a tcpdump as soon as possible.
Created attachment 10854 [details] sequence that kill the host with the r8169 during the NAS awakening I have borrowed a Hub and so I have done a tcpdump. I join the dump file, which contains the sequence that kill the host with the r8169 during the NAS awakening. Note that the NAS has the 192.168.1.60 address and the host that freezes has the 192.168.1.10 address. (the laptop is 192.168.1.8).
Thanks for your follow-up. > 1. I wait for the NAS to sleep. > > 2. I access the NAS with the Amarok application to read a mp3 file. This > action triggers the awakening of the NAS (~3 secs). I am not a CIFS guru but it is probably: 17 00:51:55.366975 192.168.1.10 -> 192.168.1.60 SMB Trans2 Request, QUERY_PATH_INFO, Query File Unix Basic, Path: \Pop Rock Alternatif [...] 30 00:51:55.390461 192.168.1.10 -> 192.168.1.60 SMB Trans2 Request, QUERY_PATH_INFO, Query File Unix Basic, Path: \Pop Rock Alternatif\Radiohead\b-sides\The Bends\permanent daylight.mp3 31 00:51:55.390961 192.168.1.60 -> 192.168.1.10 SMB Trans2 Response, QUERY_PATH_INFO > 3. Before the NAS awakes, I access the NAS from another application : a file > manager (Konqueror). > I click on the directory that contains the mp3 (which is the same directory > that is accessed by amarok - this directory contains the mounted cifs volume > of the NAS). 36 00:51:58.182445 192.168.1.10 -> 192.168.1.60 SMB Trans2 Request, QUERY_PATH_INFO, Query File Unix Basic, Path: \.directory [lots of stuff related to 30 and 31 but not to 36 ?] 58 00:52:03.099267 192.168.1.60 -> 192.168.1.10 SMB Read AndX Response, FID: 0x2be2, 16384 bytes 59 00:52:03.099518 192.168.1.10 -> 192.168.1.60 TCP 35478 > microsoft-ds [ACK] Seq=1419 Ack=11296 Win=5065 Len=0 [...] 64 00:52:03.099772 192.168.1.10 -> 192.168.1.60 TCP 35478 > microsoft-ds [ACK] Seq=1419 Ack=17583 Win=5065 Len=0 TSV=31819739 TSER=53387879 65 00:52:03.100017 192.168.1.10 -> 192.168.1.60 TCP 35478 > microsoft-ds [FIN,ACK] Seq=1419 Ack=17583 Win=5065 Len=0 TSV=31819739 TSER=53387879 66 00:52:03.100019 192.168.1.10 -> 192.168.1.60 TCP 35478 > microsoft-ds [RST,ACK] Seq=1420 Ack=17583 Win=5065 Len=0 TSV=31819739 TSER=53387879 Connexion closed by the client. 67 00:52:03.100020 192.168.1.10 -> 192.168.1.60 TCP 44317 > microsoft-ds [SYN] Seq=0 Len=0 MSS=1460 TSV=31819739 TSER=0 WS=5 Connexion started again. > 4. The system freezes. 68 00:52:03.100267 192.168.1.60 -> 192.168.1.10 TCP microsoft-ds > 44317 [SYN,ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=53387880 TSER=31819739 WS=1 69 00:52:07.098344 192.168.1.60 -> 192.168.1.10 TCP microsoft-ds > 44317 [SYN,ACK] Seq=0 Ack=1 Win=11584 Len=0 MSS=1460 TSV=53388280 TSER=31819739 WS=1 70 00:52:13.098335 192.168.1.60 -> 192.168.1.10 TCP microsoft-ds > 44317 [SYN,ACK] Seq=0 Ack=1 Win=11584 Len=0 MSS=1460 TSV=53388880 TSER=31819739 WS=1 Client dead. Can you send the dump of the same sequence with the e1000 ? You could try to add 16 bytes to RX_BUF_SIZE in the r8169 driver but I do not expect it to make a difference. Feel free to ignore this suggestion for now if you do not have much time. -- Ueimor
I have monitor from the 192.168.1.10 machine, the communication between the NAS (192.168.1.60) and my laptop (192.168.1.8) with the e1000. I have carried out the same sequence. Also, the file is just bigger than 1Mo (while I used the same tcpdump command ??!), so I can't upload it on bugzilla.kernel.org. Thus, I send you this dump by email.
I have compared the two sequences with 1 - tethereal -l -V -t a -n -r nas-ko.pcap 'ip.addr == 192.168.1.10 or ip.addr == 192.168.1.60' 2 - tethereal -l -V -t a -n -r nas-ok.pcap 'ip.addr == 192.168.1.8 or ip.addr == 192.168.1.60' (nas-ok.pcap received by e-mail as bugzilla barfed on it) Before the host dies in the capture 1, the sequence looks like: [...] 56 00:52:03.097519 192.168.1.60 -> 192.168.1.10 TCP [TCP segment of a reassembled PDU] 57 00:52:03.098768 192.168.1.60 -> 192.168.1.10 TCP [TCP segment of a reassembled PDU] 58 00:52:03.099267 192.168.1.60 -> 192.168.1.10 SMB Read AndX Response, FID: 0x2be2, 16384 bytes (this is tethereal output without the verbose -V flag) Capture 2 seems close: [...] 57 01:48:31.185867 192.168.1.60 -> 192.168.1.8 TCP [TCP segment of a reassembled PDU] 58 01:48:31.187098 192.168.1.60 -> 192.168.1.8 TCP [TCP segment of a reassembled PDU] 59 01:48:31.188388 192.168.1.60 -> 192.168.1.8 TCP [TCP segment of a reassembled PDU] 60 01:48:31.189072 192.168.1.60 -> 192.168.1.8 SMB Read AndX Response, FID: 0x1cc7, 16384 bytes However, the detailled output of frame #58 in 1 contains: [...] Frame 58 (561 bytes on wire, 561 bytes captured) Arrival Time: Mar 20, 2007 00:52:03.099267000 [Time delta from previous packet: 0.000499000 seconds] [...] Internet Protocol, Src: 192.168.1.60 (192.168.1.60), Dst: 192.168.1.10 (192.168.1.10) Version: 4 Header length: 20 bytes Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00) 0000 00.. = Differentiated Services Codepoint: Default (0x00) .... ..0. = ECN-Capable Transport (ECT): 0 .... ...0 = ECN-CE: 0 Total Length: 547 [...] Transmission Control Protocol, Src Port: 445 (445), Dst Port: 35478 (35478), Seq: 17088, Ack: 1419, Len: 495 Source port: 445 (445) Destination port: 35478 (35478) [...] NetBIOS Session Service Message Type: Session message Length: 16443 SMB (Server Message Block Protocol) SMB Header Server Component: SMB [Response to: 32] [Time from request: 7.707806000 seconds] SMB Command: Read AndX (0x2e) NT Status: STATUS_SUCCESS (0x00000000) [...] Byte Count (BCC): 16384 File Data: 4944330300000000046954414C42000000100000004D7920... NetBIOS Session Service Message Type: Session message Length: 35 SMB (Server Message Block Protocol) SMB Header Server Component: SMB [Response to: 36] [Time from request: 4.916822000 seconds] SMB Command: Trans2 (0x32) NT Status: STATUS_OBJECT_NAME_NOT_FOUND (0xc0000034) [...] Two SMB back to back ? Assuming I understand the output of tethereal correctly, the same SMB is never piggybacked in stream 2. In stream 2, it appears in a packet of its own as: [...] 65 01:48:31.189732 192.168.1.60 -> 192.168.1.8 SMB Trans2 Response, QUERY_PATH_INFO, Error: STATUS_OBJECT_NAME_NOT_FOUND Steve, could you give a look at the attachment 10854 [details] above and tell if it is completely safe for the CIFS stack to receive this thing -- Ueimor
I've just tried to trigger the bug with the 2.6.21. It is still present :/
Any update on this bug? Is it still present in the latest kernel? (2.6.22+) Thanks.
It seems that I can't trigger it anymore on a 2.6.22-rc7-g190045d5 ;) Natalie Protasevich, do you know the "commit" that seems to have resolved the bug ?
Can you try git bisect ? There is no important change in the r8169 driver between 2.6.21.5 and current mainline. I'd really like to know if this bug is fixed or if it is simply hidden by some timing change (or such). Btw expect some changes in the r8169 driver during 2.6.23-rc1.
It is probably most likely that it's become hidden. I can only see some TCP patches, but nothing significant. (But there is a bit of hope someone fixed something that was causing the bug - directly or indirectly. git bisect would be the best) I would check if any of bits and pieces have changed in your test environment, maybe something in user space that now is not triggering the bug.
yep, I'll do a git bisect as soon as possible and I'll recheck a 2.6.21 to validate my test environment again.
I didn't finish yet and I just begin to git-bisect but it takes lot of time and I have not a lot for now. However, I checked a 2.6.21.5 and I triggered the bug. I also checked a 2.6.22-rc1, and cannot be able to trigger it. But maybe it is hidden as you said. I will give you more results next time.
I have some partial results with git-bisect. I post them and hope that it would sufficiently restrict the set of commits (git-bisect say that there are 21 revisions left to test after the new one I don't test yet). Note that for the commit 5cd471551 there is something strange: after I compiled the kernel and booted on it I see with 'uname -a': Linux 2.6.21-rc7 ??! (<- look at the 2 first tests) Anyway via git bisect-visualize it seems there is a lot of commits from CIFS (and some from KVM but... ;) However I don't take time to well understand the figure shown by gitk, so I surely missed something. ---------- good -> I set "good" when I could trigger the bug. bad -> I set "bad" when I couldn't git-bisect start # bad: [39403865d2e4590802553370a56c9ab93131e4ee] Linux 2.6.22-rc1 git-bisect bad 39403865d2e4590802553370a56c9ab93131e4ee # good: [de46c33745f5e2ad594c72f2cf5f490861b16ce1] Linux 2.6.21 git-bisect good de46c33745f5e2ad594c72f2cf5f490861b16ce1 # good: [ea62ccd00fd0b6720b033adfc9984f31130ce195] Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 git-bisect good ea62ccd00fd0b6720b033adfc9984f31130ce195 # bad: [215d06780d13fd7de629b02b61b7b7bf88ce5039] Fix sunrpc warning noise git-bisect bad 215d06780d13fd7de629b02b61b7b7bf88ce5039 # bad: [e7d709c096487078652a1384d7a2d0e4459e18b6] is_power_of_2 in fat git-bisect bad e7d709c096487078652a1384d7a2d0e4459e18b6 # bad: [aeecf3142d82414d511135cc85f86caddfb58338] Convert h8/300 to generic timekeeping git-bisect bad aeecf3142d82414d511135cc85f86caddfb58338 # bad: [6de410c2b0cc055ae9ee640c84331f6a70878d9b] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm git-bisect bad 6de410c2b0cc055ae9ee640c84331f6a70878d9b # good: [5cd47155155a32e5b944ac9fc3f3dc578e429aa0] Merge branch 'ixp4xx' into devel git-bisect good 5cd47155155a32e5b944ac9fc3f3dc578e429aa0 # good: [5008fdf5b6a31240da060c0867d8f16f08ce2384] KVM: Use kernel-standard types git-bisect good 5008fdf5b6a31240da060c0867d8f16f08ce2384
tuxiko@free.fr 2007-07-17 16:55 : [...] > I have some partial results with git-bisect. I post them and hope that it > would > sufficiently restrict the set of commits (git-bisect say that there are 21 > revisions left to test after the new one I don't test yet). It is very frustrating. You should really git reset --hard 2ff81f70b56dc1cdd3bf2f08414608069db6ef1a and try it. It will tell if the trigger is in KVM or in CIFS. > Note that for the commit 5cd471551 there is something strange: after I > compiled > the kernel and booted on it I see with 'uname -a': Linux 2.6.21-rc7 ??! (<- > look at the 2 first tests) Sorry to be dense but why do you think it is strange ? 5cd471551 produces a 2.6.21-rc7 Makefile here.
Sorry I forget to precise in my previous message that I would give you more precise results when I'll have time to go on. I expected that my first results would give you some ideas about the problem waiting for the new tests. Now I come with more precise results (it remains 4 not tested commits of which one is clearly not relevant ;) Well, it's in KVM (I forgot that I have enabled it since the beginning so I thought I missed a test). (About the commit 5cd471551, what makes me puzzled is that 2.6.21-rc7 is before my first good-mark 2.6.21. But I suppose that it stems from the merging of a different branch? anyway I will appreciate an insight about that). ------------ git-bisect start # bad: [39403865d2e4590802553370a56c9ab93131e4ee] Linux 2.6.22-rc1 git-bisect bad 39403865d2e4590802553370a56c9ab93131e4ee # good: [de46c33745f5e2ad594c72f2cf5f490861b16ce1] Linux 2.6.21 git-bisect good de46c33745f5e2ad594c72f2cf5f490861b16ce1 # good: [ea62ccd00fd0b6720b033adfc9984f31130ce195] Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 git-bisect good ea62ccd00fd0b6720b033adfc9984f31130ce195 # bad: [215d06780d13fd7de629b02b61b7b7bf88ce5039] Fix sunrpc warning noise git-bisect bad 215d06780d13fd7de629b02b61b7b7bf88ce5039 # bad: [e7d709c096487078652a1384d7a2d0e4459e18b6] is_power_of_2 in fat git-bisect bad e7d709c096487078652a1384d7a2d0e4459e18b6 # bad: [aeecf3142d82414d511135cc85f86caddfb58338] Convert h8/300 to generic timekeeping git-bisect bad aeecf3142d82414d511135cc85f86caddfb58338 # bad: [6de410c2b0cc055ae9ee640c84331f6a70878d9b] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm git-bisect bad 6de410c2b0cc055ae9ee640c84331f6a70878d9b # good: [5cd47155155a32e5b944ac9fc3f3dc578e429aa0] Merge branch 'ixp4xx' into devel git-bisect good 5cd47155155a32e5b944ac9fc3f3dc578e429aa0 # good: [5008fdf5b6a31240da060c0867d8f16f08ce2384] KVM: Use kernel-standard types git-bisect good 5008fdf5b6a31240da060c0867d8f16f08ce2384 # bad: [94dfbdb3894eda2f673b70e20da2743c4a8d3968] KVM: SVM: Only save/restore MSRs when needed git-bisect bad 94dfbdb3894eda2f673b70e20da2743c4a8d3968 # good: [4d56c8a787aefb2e3fc4ac4be966db96c14d1ad8] KVM: VMX: Only save/restore MSR_K6_STAR if necessary git-bisect good 4d56c8a787aefb2e3fc4ac4be966db96c14d1ad8 # good: [7807fa6ca5af2e5660a0eb3cd90276ca0c5bdfc8] KVM: Lazy FPU support for SVM git-bisect good 7807fa6ca5af2e5660a0eb3cd90276ca0c5bdfc8 ------------- I plan to finish the last tests tomorrow but if you don't need more say it.
I forgot to post the 4 remaining commits: 2807696c3791d6dd1dcf20f022eaa2dc7615bc5d 2ab455ccceb07945368709ba852e49f4c3119331 25c4c2762e31a75403eca0dd59f2cab85e3a2532 (e0e5127d06957e76da3906b7a58d5d2665e81f59)
After finishing my test, I realized I made at least an error on a test, cause the git-bisect targeted commit was this one: # bad: [94dfbdb3894eda2f673b70e20da2743c4a8d3968] KVM: SVM: Only save/restore MSRs when needed but it is really unrelated to the problem. So I rechecked this one and I was able to trigger the bug. So now it remains some commits to test but there are in a bunch of CIFS commits, what seems more suitable. However I can only guarantee that I didn't make an error for the commits I tagged good, cause I was able to trigger the bug. (- It seems the bug easily triggers itself when I do the right things, but maybe I act too quickly on the commit I was mistaken; - Remember that to made a sole test I need to wait 15min for the NAS to sleep and then I proceed to trigger the bug so I usually did a single test per commit.) I proceed now with git bisect on those CIFS commits (hoping I only do a single error). I will give you the results as soon as possible. the updated "git bisect log": ----------- git-bisect start # bad: [39403865d2e4590802553370a56c9ab93131e4ee] Linux 2.6.22-rc1 git-bisect bad 39403865d2e4590802553370a56c9ab93131e4ee # good: [de46c33745f5e2ad594c72f2cf5f490861b16ce1] Linux 2.6.21 git-bisect good de46c33745f5e2ad594c72f2cf5f490861b16ce1 # good: [ea62ccd00fd0b6720b033adfc9984f31130ce195] Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 git-bisect good ea62ccd00fd0b6720b033adfc9984f31130ce195 # bad: [215d06780d13fd7de629b02b61b7b7bf88ce5039] Fix sunrpc warning noise git-bisect bad 215d06780d13fd7de629b02b61b7b7bf88ce5039 # bad: [e7d709c096487078652a1384d7a2d0e4459e18b6] is_power_of_2 in fat git-bisect bad e7d709c096487078652a1384d7a2d0e4459e18b6 # bad: [aeecf3142d82414d511135cc85f86caddfb58338] Convert h8/300 to generic timekeeping git-bisect bad aeecf3142d82414d511135cc85f86caddfb58338 # bad: [6de410c2b0cc055ae9ee640c84331f6a70878d9b] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm git-bisect bad 6de410c2b0cc055ae9ee640c84331f6a70878d9b # good: [5cd47155155a32e5b944ac9fc3f3dc578e429aa0] Merge branch 'ixp4xx' into devel git-bisect good 5cd47155155a32e5b944ac9fc3f3dc578e429aa0 # good: [5008fdf5b6a31240da060c0867d8f16f08ce2384] KVM: Use kernel-standard types git-bisect good 5008fdf5b6a31240da060c0867d8f16f08ce2384 # good: [94dfbdb3894eda2f673b70e20da2743c4a8d3968] KVM: SVM: Only save/restore MSRs when needed git-bisect good 94dfbdb3894eda2f673b70e20da2743c4a8d3968 -------
I think we have it! ;) It is this commit that make the bug vanish: ------------- aaf737adb6937339494d5a7111f0433cd9676db8 is first bad commit commit aaf737adb6937339494d5a7111f0433cd9676db8 Author: Igor Mammedov <qwerty0987654321@mail.ru> Date: Tue Apr 3 19:16:43 2007 +0000 [CIFS] Switch cifsd to kthread_run from kernel_thread cifsd was the only cifs thread that had not been switched to the newer kthread interface Signed-off-by: Igor Mammedov <niallain at gmail.com> Signed-off-by: Wilhelm Meier <wilhelm.meier@fh-kl.de> Signed-off-by: Steve French <sfrench@us.ibm.com> :040000 040000 b367eb72d44b719420a26661da434ed51f900297 3e64cb1424d3db57a640054e7bad21fe3a817726 M fs -------------- And here the complete git-bisect log: Right after the commit 94dfbdb3894, git-bisect give me a commit inside a bunch of CIFS commits that "oopsed" during the sequence that triggers the bug but didn't crash the system. Therefore I "git-reset --hard" to aaf737adb6 that seems a good target to test to me. I did 3 tests and was not able to trigger the bug. Then, I tested the last commit c33f8d32746 and was able to trigger the bug at the first test. Hence the outcome. --------- git-bisect start # bad: [39403865d2e4590802553370a56c9ab93131e4ee] Linux 2.6.22-rc1 git-bisect bad 39403865d2e4590802553370a56c9ab93131e4ee # good: [de46c33745f5e2ad594c72f2cf5f490861b16ce1] Linux 2.6.21 git-bisect good de46c33745f5e2ad594c72f2cf5f490861b16ce1 # good: [ea62ccd00fd0b6720b033adfc9984f31130ce195] Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 git-bisect good ea62ccd00fd0b6720b033adfc9984f31130ce195 # bad: [215d06780d13fd7de629b02b61b7b7bf88ce5039] Fix sunrpc warning noise git-bisect bad 215d06780d13fd7de629b02b61b7b7bf88ce5039 # bad: [e7d709c096487078652a1384d7a2d0e4459e18b6] is_power_of_2 in fat git-bisect bad e7d709c096487078652a1384d7a2d0e4459e18b6 # bad: [aeecf3142d82414d511135cc85f86caddfb58338] Convert h8/300 to generic timekeeping git-bisect bad aeecf3142d82414d511135cc85f86caddfb58338 # bad: [6de410c2b0cc055ae9ee640c84331f6a70878d9b] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm git-bisect bad 6de410c2b0cc055ae9ee640c84331f6a70878d9b # good: [5cd47155155a32e5b944ac9fc3f3dc578e429aa0] Merge branch 'ixp4xx' into devel git-bisect good 5cd47155155a32e5b944ac9fc3f3dc578e429aa0 # good: [5008fdf5b6a31240da060c0867d8f16f08ce2384] KVM: Use kernel-standard types git-bisect good 5008fdf5b6a31240da060c0867d8f16f08ce2384 # good: [94dfbdb3894eda2f673b70e20da2743c4a8d3968] KVM: SVM: Only save/restore MSRs when needed git-bisect good 94dfbdb3894eda2f673b70e20da2743c4a8d3968 # bad: [aaf737adb6937339494d5a7111f0433cd9676db8] [CIFS] Switch cifsd to kthread_run from kernel_thread git-bisect bad aaf737adb6937339494d5a7111f0433cd9676db8 # good: [c33f8d32746db12ba353b0a05b25f7893a0ac344] [CIFS] Remove unnecessary checks git-bisect good c33f8d32746db12ba353b0a05b25f7893a0ac344 ----------------
How interesting, speaking of indirect fixes... :) Great job bisecting Eric.
The bug reappeared with the 2.6.25.4 (maybe before). I don't have time for the moment to do some new tests (+ I don't have access for the moment to the NAS). Nonetheless, I hope that all the information I already provided (git bisect, tcpdump, etc.) will be sufficient to pinpoint the problem. Best regards.
It seems that the bug disappeared... It didn't trigger with new kernel releases I've installed since then.