The bug appears after resuming from S3, when the network connection is unavailable. Traces from pstore are in the attachments.
Created attachment 106889 [details] Oops extracted from pstore #1
Created attachment 106890 [details] Oops #2
Is this the line you are assuming has oopsed?

0x23a4c is in cifs_call_async (fs/cifs/transport.c:528).
523                     cifs_in_send_dec(server);
524                     cifs_save_when_sent(mid);
525
526                     if (rc < 0)
527                             server->sequence_number -= 2;
528                     mutex_unlock(&server->srv_mutex);
529
530                     if (rc == 0)
531                             return 0;
532
See the instructions on oopses at: http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting
<4> [<ffffffffa06f8b6c>] cifs_call_async+0xec/0x1b0 [cifs]

Probably here:

000000000001ca80 <cifs_call_async>:
.....
+ 0xec = 1cb6c:

        cifs_in_send_inc(server);
        rc = smb_send_rqst(server, rqst);
    1cb61:      4c 89 ee                mov    %r13,%rsi
    1cb64:      48 89 df                mov    %rbx,%rdi
    1cb67:      e8 74 f6 ff ff          callq  1c1e0 <smb_send_rqst>
    1cb6c:      41 89 c5                mov    %eax,%r13d
(gdb) list *(cifs_call_async+0xec)
0x1cb6c is in cifs_call_async (fs/cifs/transport.c:522).
517             list_add_tail(&mid->qhead, &server->pending_mid_q);
518             spin_unlock(&GlobalMid_Lock);
519
520
521             cifs_in_send_inc(server);
522             rc = smb_send_rqst(server, rqst);
523             cifs_in_send_dec(server);
524             cifs_save_when_sent(mid);
525
526             if (rc < 0)
Is this easily reproducible?
Not really. It looks like I need SMB 2.x, an S3/disconnect cycle, and some luck. It reproduces about 3-4 times a week.
Created attachment 106953 [details] Oops #3
Hmm...the problem with your analysis is that the top of the oops message got chopped off, so you have the bottom part of the stack there. cifs_call_async is just part of that call chain; it's not where it oopsed, so it's not terribly interesting...

------------[snip]------------------
Oops#2 Part1
<4> [<ffffffffa06f8b6c>] cifs_call_async+0xec/0x1b0 [cifs]
<4> [<ffffffffa07025e0>] ? free_rsp_buf+0x40/0x40 [cifs]
<4> [<ffffffffa070406e>] SMB2_echo+0x8e/0xb0 [cifs]
<4> [<ffffffffa06e6789>] cifs_echo_request+0x79/0xa0 [cifs]
<4> [<ffffffff810b45b3>] process_one_work+0x173/0x4a0
<4> [<ffffffff810b52a1>] worker_thread+0x121/0x3a0
<4> [<ffffffff810b5180>] ? manage_workers.isra.27+0x2b0/0x2b0
<4> [<ffffffff810bae00>] kthread+0xc0/0xd0
<4> [<ffffffff810bad40>] ? kthread_create_on_node+0x120/0x120
<4> [<ffffffff815b18dc>] ret_from_fork+0x7c/0xb0
<4> [<ffffffff810bad40>] ? kthread_create_on_node+0x120/0x120
<4>Code: 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 90 03 00 00 55 48 89 e5 5d <48> 8b 40 d8 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
<1>RIP [<ffffffff810bb451>] kthread_data+0x11/0x20
<4> RSP <ffff880052d2d950>
<4>CR2: ffffffffffffffd8
<4>---[ end trace a3ce9d2d2a1d56c8 ]---
------------[snip]------------------

The messages in those files are somewhat disjointed, so it's also tough to tell what happened first. I've never used pstore, but it looks like it's not capturing complete oopses here, and that's going to hamper the effort to track this down.

Any chance of hooking up a serial console and collecting an oops that way? If you clear the panic_on_oops flag, does the box keep running even after it oopses? If you're able to log in after the oops, then you might be able to get the whole oops message with dmesg.
grep PANIC_ON_OOPS .config
CONFIG_PANIC_ON_OOPS_VALUE=0

I have no serial port here, so tracking this down is somewhat difficult. Looks like I concatenated that stuff in the wrong order.
Created attachment 106954 [details] Oops #3 - right order
Comment on attachment 106954 [details]
Oops #3 - right order

Ok, that looks better. So now you'll want to do a listing within gdb on sock_sendmsg+0x93, since that's where it oopsed. I suspect that you'll find that the server->ssocket pointer is being set to NULL here. If so, then the problem is likely the sloppy locking around the socket reconnect handling. It looks like it can get zeroed out by another thread while this one is still sending on the socket. Sadly, the lower layers of the cifs code are rife with these sorts of races...
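To illustrate the kind of race I mean, here is a minimal userspace sketch (fake_server, fake_socket and sink are made-up names, not actual cifs code): one thread keeps dereferencing a shared socket pointer the way the send path does, while another frees it and NULLs it the way an unlocked reconnect can.

/* Minimal userspace sketch of the suspected race -- not actual cifs code.
 * All names here are hypothetical. Build with: cc -O2 -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_socket {
	int fd;
};

struct fake_server {
	/* volatile only so the compiler doesn't optimize the demo loads away */
	struct fake_socket *volatile ssocket;
	pthread_mutex_t srv_mutex;	/* present, but never taken here: the bug */
};

static volatile int sink;

/* Like the send path: uses the ssocket pointer with no lock, so it has no
 * guarantee the pointer stays valid between the check and the dereference. */
static void *send_thread(void *arg)
{
	struct fake_server *srv = arg;

	for (int i = 0; i < 1000000; i++) {
		if (srv->ssocket)
			sink = srv->ssocket->fd;  /* may hit NULL or freed memory */
	}
	return NULL;
}

/* Like an unlocked reconnect: tears the socket down and replaces it while
 * the sender may still be using it. */
static void *reconnect_thread(void *arg)
{
	struct fake_server *srv = arg;

	for (int i = 0; i < 1000000; i++) {
		struct fake_socket *old = srv->ssocket;

		srv->ssocket = NULL;
		free(old);
		srv->ssocket = calloc(1, sizeof(struct fake_socket));
	}
	return NULL;
}

int main(void)
{
	struct fake_server srv = { .ssocket = calloc(1, sizeof(struct fake_socket)) };
	pthread_t t1, t2;

	pthread_mutex_init(&srv.srv_mutex, NULL);
	pthread_create(&t1, NULL, send_thread, &srv);
	pthread_create(&t2, NULL, reconnect_thread, &srv);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("finished without crashing -- the race just didn't trigger this run");
	return 0;
}

Whether it actually crashes on any given run is timing-dependent, which is also why the real bug only shows up every few days.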
list *(sock_sendmsg+0x93)
0xffffffff814dcc13 is in sock_sendmsg (net/socket.c:625).
620             si->sock = sock;
621             si->scm = NULL;
622             si->msg = msg;
623             si->size = size;
624
625             return sock->ops->sendmsg(iocb, sock, msg, size);
626     }
627
So yeah, that looks like the problem. Thanks for chasing it down. The fix is not exactly trivial though. Someone needs to go through and track down all of the places where server->ssocket can get set to NULL and ensure those assignments are done under a lock of some sort. The srv_mutex seems like the obvious choice. The problem though is that it's already held in some places and not in others, so the code will need thorough auditing to ensure that deadlocks are not introduced.
Created attachment 107285 [details]
patch -- ensure that srv_mutex is held when manipulating ssocket pointer

Possible patch. I went over the code, and I think this is the only place where the ssocket pointer is used/updated without holding the srv_mutex. Would you be able to test this patch and let me know if it helps?
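For reference (this is not the attached diff, just the shape of the change): in terms of the userspace sketch above, the pattern the attachment title describes amounts to making both sides take srv_mutex around any use or update of the ssocket pointer. These are drop-in replacements for the two thread functions in that sketch, reusing the same hypothetical types:

/* Sketch only, same fake_server/fake_socket/sink as above -- not the
 * attached diff. The point is the pattern: every use and every update of
 * the ssocket pointer happens under srv_mutex. */
static void *send_thread_locked(void *arg)
{
	struct fake_server *srv = arg;

	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&srv->srv_mutex);
		if (srv->ssocket)
			sink = srv->ssocket->fd;  /* teardown can't run concurrently */
		pthread_mutex_unlock(&srv->srv_mutex);
	}
	return NULL;
}

static void *reconnect_thread_locked(void *arg)
{
	struct fake_server *srv = arg;

	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&srv->srv_mutex);
		free(srv->ssocket);                     /* release the old socket */
		srv->ssocket = calloc(1, sizeof(struct fake_socket));
		pthread_mutex_unlock(&srv->srv_mutex);
	}
	return NULL;
}

The hard part in the real code is exactly what was said above: some call chains already hold srv_mutex when they reach the teardown, so adding the lock blindly without auditing those paths would introduce deadlocks.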
Will try to reproduce, thanks
Oleksii, any luck reproducing this?
So far I haven't been able to reproduce this bug with the old use case.
Ok, thanks. I went ahead and proposed the patch upstream since I think it's the correct thing to do regardless. Hopefully it'll get merged in the coming merge window and pulled back into stable kernels soon afterward.
Today I got the following issue _before_ going to S3; I don't know whether it's connected with the patch or not:

[68024.902378] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[68024.902414] zsh             D ffff88021e312d00  5360 19843   2416 0x00000006
[68024.902419]  ffff8801728e1bd8 0000000000000046 ffff8801f38e8000 ffff8801728e1fd8
[68024.902422]  ffff8801728e1fd8 ffff8801728e1fd8 ffff8801f38e8000 ffff880212fb21b8
[68024.902425]  ffff880212fb21bc 00000000ffffffff ffff880212fb21c0 ffff8801f38e8000
[68024.902428] Call Trace:
[68024.902436]  [<ffffffff815be983>] schedule_preempt_disabled+0x33/0x80
[68024.902439]  [<ffffffff815bca1c>] __mutex_lock_slowpath+0x15c/0x210
[68024.902445]  [<ffffffff811684e2>] ? truncate_inode_pages_range+0x202/0x630
[68024.902449]  [<ffffffff815bbb97>] mutex_lock+0x17/0x30
[68024.902463]  [<ffffffffa0888fd4>] SendReceive2+0xb4/0x380 [cifs]
[68024.902470]  [<ffffffffa088705a>] ? cifs_small_buf_get+0x1a/0x30 [cifs]
[68024.902477]  [<ffffffffa08892dc>] SendReceiveNoRsp+0x3c/0x40 [cifs]
[68024.902484]  [<ffffffffa0893eee>] SMB2_logoff+0x5e/0x90 [cifs]
[68024.902489]  [<ffffffffa0876f28>] cifs_put_smb_ses+0xb8/0xd0 [cifs]
[68024.902494]  [<ffffffffa0876fe6>] cifs_put_tcon+0xa6/0xb0 [cifs]
[68024.902540]  [<ffffffffa08790f0>] cifs_put_tlink+0x40/0x60 [cifs]
[68024.902549]  [<ffffffffa0879f7d>] cifs_umount+0x5d/0xb0 [cifs]
[68024.902560]  [<ffffffffa086c04e>] cifs_kill_sb+0x1e/0x30 [cifs]
[68024.902565]  [<ffffffff811af94d>] deactivate_locked_super+0x3d/0x60
[68024.902568]  [<ffffffff811af9b6>] deactivate_super+0x46/0x60
[68024.902573]  [<ffffffff811cb26f>] mntput_no_expire+0xef/0x150
[68024.902576]  [<ffffffff811cb2f6>] mntput+0x26/0x40
[68024.902579]  [<ffffffff811b6b3e>] path_put+0x1e/0x30
[68024.902584]  [<ffffffff811dbe8f>] free_fs_struct+0x1f/0x40
[68024.902587]  [<ffffffff811dbf28>] exit_fs+0x78/0x90
[68024.902591]  [<ffffffff810985a9>] do_exit+0x2a9/0xaf0
[68024.902594]  [<ffffffff81099c3f>] do_group_exit+0x3f/0xa0
[68024.902596]  [<ffffffff81099cb4>] SyS_exit_group+0x14/0x20
[68024.902600]  [<ffffffff815c7846>] system_call_fastpath+0x1a/0x1f

At that point none of the CIFS shares were mounted, but the cifs module was marked as in use.
Maybe. The process above is hung while trying to lock the srv_mutex, which may be held due to a reconnect. Suspending with an active cifs mount is currently pretty iffy, unfortunately. I have some ideas about how to fix that, but they'll mean some rather significant changes. The important thing for this bug is that it didn't oops. Probably not much we can do currently about it occasionally failing to suspend.
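To make the failure mode concrete: zsh is in D state inside mutex_lock() in the trace above, and a task sleeping uninterruptibly on a mutex never reaches a point where it can be frozen, so after the timeout the suspend attempt is abandoned. A rough userspace analogy of the situation (hypothetical names, not cifs code, and the 60s sleep just stands in for retrying an unreachable server):

/* Userspace analogy of the suspend failure -- not cifs code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t srv_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *reconnect(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&srv_mutex);
	sleep(60);			/* stands in for retrying a dead server */
	pthread_mutex_unlock(&srv_mutex);
	return NULL;
}

static void *logoff(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&srv_mutex);	/* blocks here, like SendReceive2 above */
	pthread_mutex_unlock(&srv_mutex);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, reconnect, NULL);
	sleep(1);			/* let reconnect grab the mutex first */
	pthread_create(&b, NULL, logoff, NULL);

	sleep(20);			/* the "freezer" timeout */
	puts("freezing failed: logoff is still stuck waiting for srv_mutex");
	return 0;
}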
Hello! I suppose this bug is still present in newer kernels? On a machine I'm running, I can reproduce this error consistently. It's a combination of an NFS-mounted share that is shared out by Samba. When the share is accessed (by a W2k8 client), the kernel crashes.

If the bug is still present, I suppose upgrading the kernel is not of much use, but because I can reproduce the bug very easily, I might be able to provide better dumps?
(In reply to J.M. Stoorvogel from comment #23)
> Hello! I suppose this bug is still present in newer kernels? On a machine
> I'm running, I can reproduce this error consistently. It's a combination of
> an NFS-mounted share that is shared out by Samba. When the share is accessed
> (by a W2k8 client), the kernel crashes.
>
> If the bug is still present, I suppose upgrading the kernel is not of much
> use, but because I can reproduce the bug very easily, I might be able to
> provide better dumps?

Can you post a log as well?
Please try to reproduce this bug with the latest kernel image.