Bug 37782

Summary: Sometimes a partition hangs up: any process freezes if it touches a file/directory on this partition
Product: File System Reporter: Kroz (kroz.nn)
Component: ReiserFS Assignee: ReiserFS developers team (reiserfs-devel)
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, s9gf4ult
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.39.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: Archive with all logs and system information
archive with output of all common parameters
lspci output, just forgot about
nautilus and xterm strace
syslog with traces from SysRq

Description Kroz 2011-06-17 23:36:42 UTC
Created attachment 62652 [details]
Archive with all logs and system information

Here are two partitions on my hard disk:
/dev/sda1 on / type reiserfs 
/dev/sda7 on /home type reiserfs

Full output of 'mount' attached

Periodically (at random) the sda7 partition hangs: any process that tries to open a file or list a directory on it hangs as well. Since sda7 is mounted on /home, the user is unable to do anything.

Nevertheless, the other partition, sda1, works normally. Thus it is possible to switch to a text console (Ctrl+Alt+F1) and log in as root (since root's home directory is on sda1). Root can do anything except touch something on sda7; if he does, the process hangs and all you can do is log in on another console as root.

'ps ax' shows the frozen processes in state 'D' (uninterruptible sleep, usually I/O); the output of 'ps ax' is attached.
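For anyone who wants to collect the same information without 'ps', here is a minimal Python sketch that lists processes in state 'D' by reading /proc/<pid>/stat. It is not part of the attachments; the function names parse_stat and blocked_processes are made up for this illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the LAST ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or its stat file is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling blocked_processes() from a root console on the working partition and printing the result should give roughly the same list as filtering 'ps ax' for 'D'.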

The attachment contains strace output for the commands 'ls /home', 'find /home' and 'sync'. I have also written a simple program (named 'touch2') that just opens a file on the /home partition and closes it; the program and its strace log are attached as well.
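The real 'touch2' source is in the attachment. As a rough illustration of the same idea, here is a Python sketch that opens and closes a file in a child process, so that the caller does not get stuck if the open never returns. The name probe_open and the timeout value are assumptions made for this example.

```python
import subprocess
import sys

# The child does the same thing as described for 'touch2':
# open a file read-only, then close it.
CHILD_CODE = "import sys; f = open(sys.argv[1], 'rb'); f.close()"


def probe_open(path, timeout=5.0):
    """Return 'ok', 'error' or 'hung' for an open/close attempt on path."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, path],
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # A process in uninterruptible sleep cannot be killed, so the
        # child is deliberately left alone instead of being waited for.
        return "hung"
    return "ok" if rc == 0 else "error"
```

For example, probe_open("/home/kroz/.bashrc") would be expected to return "hung" while the partition is frozen. Note that a "hung" result only means the child did not finish within the timeout.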

Brief information:
Last two lines of 'ls /home':
fcntl64(3, F_GETFD)                     = 0x1 (flags FD_CLOEXEC)
getdents64(3,

Last two lines of 'find /home':
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
getdents64(4,

Last two lines of 'sync':
close(3)                                = 0
sync(

Last two lines of 'touch2':
brk(0x9c60000)                          = 0x9c60000
open("/home/kroz/.bashrc", O_RDONLY

When the problem occurs, I press the power button; the computer starts the shutdown process and gets stuck while stopping the torrent client (it keeps its files on the /home partition) or syslog-ng (if no torrent daemon is running). After that I press Magic+SysRq+S several times, then Magic+SysRq+O to power off the computer; that works. On the next boot reiserfs replays the transaction log. '/var/log/messages' is attached; it contains one session that ended in such a freeze and the following normal boot.

In /var/log/messages you can see the message "Unwanted recursive reiserfs lock". Please note that 1) it is always triggered by the Opera process; 2) it usually does not cause the freeze; 3) another person who has the same problem has no "Unwanted recursive reiserfs lock" message in his /var/log/messages.

The problem has been observed on several computers with different hardware and different kernel versions. What these computers have in common: Gentoo Linux and reiserfs; I believe there are no other common factors. The 'lspci' output for my hardware configuration is attached.

Since the problem can be observed on different hardware, and since only part of the hard disk (a single partition) hangs, I believe this is not a hardware problem.

Here are some observations on when the problem may happen.

On my computer I have found several factors that increase the probability of the problem (but do not guarantee it): 1) the Opera browser is running; 2) the Flash plugin is running; 3) the torrent client is on; 4) the torrent client is actively downloading a file. With all 4 factors present, in 99% (but not 100%) of cases I only need to wait 2-5 minutes for the problem to appear. However, sometimes the Opera browser alone or the torrent client alone can trigger it. There may be something else; it is hard to say.

Another person I have been in contact with gets this problem when he compiles software packages (the usual way of installing software in Gentoo). In his case the root partition (reiserfs as well) hangs. I myself have never had this problem while compiling software. Again, it does not happen to him every time.
Another computer that has this problem runs only the Opera browser and the Flash plugin. The problem appears very rarely on that computer.

It seems that the problem appears DURING or AFTER active random access to the disk. However, I cannot be sure, since it is very hard to reproduce the problem in other ways. The problem really is random in nature.

My kernel config is attached as well.

Please help analyze and solve the problem.

Thank you in advance!
Comment 1 Alex 2011-06-18 10:21:43 UTC
Created attachment 62672 [details]
archive with output of all common parameters
Comment 2 Alex 2011-06-18 11:22:45 UTC
Created attachment 62682 [details]
lspci output, just forgot about

I am the "other person" mentioned in the post above.

This problem appears very rarely for me. All the same logs, except the traces, are attached above.

As stated above, this problem appears when I compile packages for Gentoo. I am using parallel building (a feature of Portage), and I believe that the problem arises while one package is compiling and another is being installed at the same time. I believe this problem is associated with I/O load and CPU load occurring together.

Once this problem appeared while "tracker" (the GNOME file-indexing service) was working hard rebuilding its indexes without touching my root partition. I cannot explain why it happened, because in my case it is the root partition that blocks when the problem arises. When the root partition hangs I cannot log in on another console and cannot even return to the X console; when I try, I just see a black screen on all consoles and cannot do anything except a SysRq reboot.

I did not collect strace output because this problem is very rare and I cannot reproduce it predictably enough to be prepared and do everything necessary (for example, launch enough xterms and log in as root on another console).

When the problem appears there is enough free memory to operate, so I do not believe that this is bug 12309. All programs that try to start or (if already running) try to touch the root partition hang; after some time (from several seconds to 5 minutes) they all resume at the same moment. All the xterms (which I tried to start) and the other programs then launch and continue working.

Another note: I can run some programs as a regular user in an already-open terminal. For example, I can run htop in a running xterm while the partition is hung; I believe this is because htop is launched often enough to stay in the file cache. Starting a new xterm fails, however (I think because ZSH reads many files on startup). I can also do 'ls' on my home partition, which is EXT4 (since 'ls' is an embedded command of ZSH), but I never tried 'ls' on my root partition. I conclude that it is my root partition that hangs only from the facts described above (and from the same symptoms described by the person above), so this may not be true.

Yet another note: my home partition used to be XFS, and I believed that was the cause of this problem. I even filed a bug in the Gentoo bug tracker here:
http://bugs.gentoo.org/show_bug.cgi?id=365677
I thought this was an XFS bug because the problem appeared when I moved my build directory (in which Portage builds packages) to my /home on XFS. I did not think to check what was really going on in my /home directory. I just could not run any program and assumed that this was not a temporary hang and that I could not do anything without a reboot.

Now I join the call for help in determining the cause of this problem.
Thanks a lot.
Comment 3 Alex 2011-06-24 04:11:52 UTC
Yesterday my system hung again. I tried
ls ~
ls /
and both worked! But applications still would not start. I tried
touch ~/1
and all applications started immediately. I do not know whether `touch` caused the "unhanging", but the two events were very close in time. I think the experiment needs to be repeated several times to confirm this dependence.

The next time my system hangs I will post here the output of
strace xterm
because xterm cannot start when this problem occurs.

I am posting this here in the hope that the information will help.
Comment 4 Alex 2011-07-10 20:00:33 UTC
Created attachment 65192 [details]
nautilus and xterm strace

I ran strace on nautilus and xterm when they could not start. When it happened, there was enough memory and enough swap to operate. The HDD was idle.

Waiting for instructions to diagnose ...
Comment 5 Alex 2011-07-16 13:55:41 UTC
Created attachment 65802 [details]
syslog with traces from SysRq

Here is the part of the syslog from when processes were blocked again; I used SysRq twice to show the blocked processes with their traces. Interestingly, the blocking function is queue_log_writer, called after reiserfs_sync_fs, reiserfs_dirty_inode or reiserfs_create. I hope this helps.
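To make such traces easier to compare between hangs, here is a small Python sketch that tallies the function names appearing in stack-trace lines of a syslog. The frame layout matched by the regular expression (an address in "[<...>]", an optional "?", then "symbol+0xOFF/0xLEN") is assumed from typical kernel traces and may need adjusting for the attached log; count_frames is a name invented for this example.

```python
import re
from collections import Counter

# Assumed frame layout: "[<c0123456>] ? symbol_name+0x1a/0x40".
# The "?" marks frames the kernel considers unreliable; it is optional here.
FRAME_RE = re.compile(
    r"\[<[0-9a-fA-F]+>\]\s+(?:\?\s+)?"
    r"([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)


def count_frames(lines):
    """Count how often each function name appears in stack-trace lines."""
    counts = Counter()
    for line in lines:
        for match in FRAME_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the attached syslog and printing counts.most_common(10) should show whether queue_log_writer and the reiserfs_* functions dominate the blocked tasks. This only counts names; it does not reconstruct which function called which.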
Comment 6 Alex 2011-07-16 15:11:30 UTC
Here are some links about the same problem:
https://bugzilla.kernel.org/show_bug.cgi?id=4850
https://lkml.org/lkml/2010/11/18/346