Kernel Bug Tracker – Bug 9017
Strange file locking problems
Last modified: 2009-03-23 11:36:02 UTC
Most recent kernel where this bug did not occur: 2.6.19
Distribution: Debian Etch
Hardware Environment: Opteron, Adaptec 2130S RAID controller
Software Environment: SAMBA
Problem Description: This is very similar to bug 8978; however, this one happens even with the RAID in Optimal status. So far the problem happens only with the 188.8.131.52 kernel, not with 184.108.40.206, whereas bug 8978 happens with both kernels.
We run a file server on the machine. Twice a day a backup script archives all documents locally on the server using tar.gz.
Under moderate load (load average around 2.2), the shared documents stop working for Windows clients. They cannot open them, save them, etc. It seems to be connected to tar accessing those documents. Only a restart restored normal operation.
The filesystem is ext3, of course. We have about 200 Windows XP clients on the LAN, most of which open documents from the file server. There is also an application that some 100 clients run from the file server.
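For context, a minimal sketch of the kind of twice-daily backup job described above (the paths, schedule, and archive name are my assumptions, not taken from this report):

```shell
#!/bin/sh
# Hypothetical sketch of the backup job: tar reads every file on the
# share, which is what appears to collide with the clients' file locks.
SRC=/srv/share/documents      # assumed Samba share path
DEST=/var/backups             # assumed local backup directory
STAMP=$(date +%Y%m%d-%H%M)

tar -czf "$DEST/documents-$STAMP.tar.gz" -C "$SRC" .
```

Run from cron twice a day, this walks the whole document tree while Windows clients still hold open files on it.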
Can we see the kernel logs please?
Created attachment 12825 [details]
The Sep 7 entries are from the last reboot before the failure. I don't have any logs after that; the next records available are from Sep 10. The problems started on Sep 10 around noon (when the backup runs), but nothing is visible here. There are no more entries before the next restart, which restored normal operation.
Created attachment 12826 [details]
Syslog. As you can see, the backup job started at 11:30, and the problems soon after. At 13:15, I rebooted the machine since other attempts failed.
I filtered out CUPS and sensord messages, since I don't think they are relevant.
Peter, is the problem still there? Have you tested with newer kernels since then?
Well, since this is a production system, we had to work around it. You know, this problem is not easily reproducible, but it is clearly related to file locking.
Since Samba passes all locking to the kernel, and we haven't touched Samba but have changed the kernel a few times, the kernel is suspect #1. After the problems, I downgraded the kernel to version 220.127.116.11, and the problems largely disappeared. However, we have met them again, although it feels less frequent than with 18.104.22.168.
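This sketch does not reproduce the kernel bug, but it illustrates the locking path under suspicion: Samba maps client locks onto kernel-enforced advisory locks, and a collision looks roughly like this flock(1) demonstration (filenames are made up):

```shell
#!/bin/sh
# One process holds an exclusive lock on a file (as a client editing a
# document effectively does); a second, non-blocking attempt fails,
# which is roughly what the Windows clients experience during backup.
f=/tmp/doc.lock

( flock -x 9; sleep 5 ) 9>"$f" &   # holder: exclusive lock for 5 s
sleep 1

# -n = non-blocking: give up immediately if the lock is already held
if flock -n -x "$f" -c true; then
    echo "lock acquired"
else
    echo "lock busy"
fi
wait
```

In a healthy kernel the lock is released when the holder exits; the bug report suggests that after the 2.6.19 series the locks effectively wedge until reboot.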
So I downgraded again to 2.6.19, and it seems the problem is completely gone. So you can assume that the bug appeared AFTER 2.6.19.
I can send you the kernel configurations; they differ slightly between the kernel versions (I need a new feature, so I compile it with the newer kernel version). Unfortunately, I'm afraid it is not completely equivalent to apply the config from 188.8.131.52 to the new version; many parameters have been added to the kernel in newer versions.
I cannot promise how soon I could test a new kernel again. I'll try. But will it make a difference? Is anybody actually working on the problem?
You know, I'm already tired of these "test with a newer kernel" responses years after the report: I test again, then nothing happens for half a year, and then again "still a problem with a newer kernel?". For a production system, such tests are quite a luxury when there is no real chance of progress.
So it appears that this was a regression from 2.6.19 that worsened from then on. The best approach would be a git bisect search, but that is quite difficult in your production environment.
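For reference, the bisect would look roughly like this. The repository URL is the historical linux-2.6 tree, and the v2.6.20 "bad" endpoint is my assumption from the window discussed above; substitute the exact versions you tested:

```shell
# Sketch of a kernel bisect over an assumed v2.6.19..v2.6.20 window
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
cd linux-2.6
git bisect start
git bisect bad  v2.6.20   # a kernel where the locking problem appears (assumed)
git bisect good v2.6.19   # last known good kernel per this report
# build and boot the commit git checks out, run the backup under load, then:
git bisect good           # or: git bisect bad
# repeat until git names the first bad commit, then: git bisect reset
```

Each step halves the remaining commit range, so even a large window needs only a dozen or so test boots; the hard part here is that each test must run on the production workload.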
The reason for testing newer releases is to test patches once they have been developed, and, for unsolved bugs, to verify whether the bug is still triggered with all the surrounding changes, or whether it mutates and creates different problems.
It's understandably difficult when you are dealing with a production system.
Sometimes it is possible to set up a parallel debug system that runs your application with a modeled workload, but that is definitely a luxury too; it depends on whether you have the resources for it.
Well, at the end of December I booted the 184.108.40.206 kernel, and since then the application has frozen at least 5 times due to exclusively locked files. I booted back into the old kernel on 16.1.2008, and we'll see what happens.
However, it seems that 220.127.116.11 really IS making the problem worse.
Peter, since you are having repeated freezes, you can collect some information on which processes are wedging your system without risk to your production environment. If you run a script that periodically collects output from vmstat, /proc/meminfo, and "echo t > /proc/sysrq-trigger", then you'll hopefully have a set of data from close to the time of the freeze.
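A minimal sketch of such a collector, assuming a one-minute interval and a local log file (both are my choices, not requirements). Note the sysrq write needs root and sysrq enabled, and the task dump lands in the kernel log, not in this file:

```shell
#!/bin/sh
# Periodic collector for debugging the freezes; run as root.
# Assumed log path and interval -- adjust to taste.
LOG=/var/log/freeze-debug.log

while true; do
    {
        date
        vmstat 1 2            # memory/CPU/IO snapshot (2nd line is current)
        cat /proc/meminfo
        # sysrq 't': dump all task states to the kernel log (dmesg/syslog);
        # requires: echo 1 > /proc/sys/kernel/sysrq
        echo t > /proc/sysrq-trigger
    } >> "$LOG" 2>&1
    sleep 60                  # assumed one-minute interval
done
```

When the next freeze hits, the last few iterations of $LOG plus the matching sysrq task dumps in syslog should show which processes were stuck and in what kernel state.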