Bug 3328

Summary: Killing process using ncpfs makes mount unusable
Product: File System
Reporter: Peter Astrand (astrand)
Component: Other
Assignee: Pierre Ossman (pierre-bugzilla)
Status: CLOSED CODE_FIX    
Severity: normal
CC: jayc
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.8.1
Subsystem:
Regression: ---
Bisected commit-id:

Description Peter Astrand 2004-09-01 13:27:50 UTC
Distribution: Suse 9.1 (actually all as far as I know)
Hardware Environment: i386

Problem Description:
When KILLing a process in the middle of an ncpfs operation, the mount becomes
unusable. Subsequent file operations return EIO.

Steps to reproduce:
One easy way to reproduce this is to run KDE 3.2 with the DCOP server authority
file on an ncpfs mount. The authority file can be set with the environment
variable DCOPAUTHORITY. When shutting down KDE, a process called
dcopserver_shutdown will busy-loop on read(), which returns EIO. After killing
dcopserver_shutdown with SIGKILL, the NCP mount is in a non-working state.
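
For anyone without a KDE setup, the state of the mount can be checked with a small probe. The following is only a minimal sketch in Python, not part of the original reproduction: the function name probe_reads and the path /mnt/ncp/testfile are hypothetical. It repeatedly calls read() on one file and counts how many calls succeed and how many fail with EIO.

```python
import errno
import os


def probe_reads(path, attempts=100, chunk=4096):
    """Call read() on `path` repeatedly and count the outcomes.

    Returns a dict with the number of successful reads ("ok") and the
    number that failed with EIO ("eio"). Any other error is re-raised,
    as is a failure of the initial open().
    """
    counts = {"ok": 0, "eio": 0}
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(attempts):
            try:
                # Rewind so every iteration issues a real read request.
                os.lseek(fd, 0, os.SEEK_SET)
                os.read(fd, chunk)
                counts["ok"] += 1
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise
                counts["eio"] += 1
    finally:
        os.close(fd)
    return counts


# Hypothetical usage; the path is an assumption, adjust to the local mount:
# print(probe_reads("/mnt/ncp/testfile"))
```

On a healthy mount the "eio" count should stay at zero. If the mount is in the state described above, reads are expected to fail with EIO, although open() itself may already fail, in which case the probe raises OSError instead of returning counts.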

The ncpfs maintainer, Petr Vandrovec, has confirmed that this problem exists:

>> * When shutting down KDE, the file system starts reporting EIO 
>> (Input/output error). It's also very un-responsive; if one process runs
>> read() (returning EIO) in a tight loop, other processes hang forever (well,
>> at least > 12 hours). 
>
>Someone aborted process in the middle of ncpfs operation with SIGKILL.  
>As ncp request/responses are handled in context of calling
>process, if you SIGKILL process after it sent request but before it
>received response, connection is invalidated.  But it should not hang 
>other processes which have nothing to do with ncpfs.  If you run in a 
>loop process which does some (failing) ncpfs operation, other processes 
>which want to use that mountpoint of course block: Linux semaphores 
>are not fair.  If you do not want SIGKILL to abort connection, replace
>sigmask(SIGKILL) with sigmask(0) in fs/ncpfs/sock.c.  But then you may
>get unkillable process in the case TCP connected server dies, or if you'll
>set infinite timeout for UDP/IPX servers

He has also indicated that this problem might be solvable:

>Actually with 2.6.x fs/ncpfs/sock.c changing code to not call
>ncp_abort_connection() when something goes wrong should not be that
>hard. Only problem is that ncp_request_reply structure is allocated
>on caller stack, and so you must kill receiver (ncpdgram_rcv_proc/
>ncptcp_rcv_proc) before you release this structure.
Comment 1 Peter Astrand 2005-11-10 06:39:10 UTC
This is still a problem for us. Any progress?
Comment 2 Pierre Ossman 2007-04-04 00:18:12 UTC
Cendio has provided a patch which has been merged upstream (will be in 2.6.21).
Closing bug.