Bug 9017

Summary: Strange file locking problems
Product: IO/Storage
Reporter: Peter (tuharsky)
Component: SCSI
Assignee: io_scsi
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: akpm, protasnb, yyyeer.bo
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.22.6
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Kern.log, syslog (gzipped)

Description Peter 2007-09-14 01:02:09 UTC
Most recent kernel where this bug did not occur: 2.6.21.7
Distribution: Debian Etch
Hardware Environment: Opteron, Adaptec 2130S RAID controller
Software Environment: SAMBA
Problem Description: This is very similar to bug 8978, but it happens even when the RAID is in status Optimal. So far the problem has occurred only with the 2.6.22.6 kernel, not with 2.6.21.7, whereas bug 8978 occurs with both kernels.

We run a file server on this machine. Twice a day a backup script archives all documents locally on the server into a tar.gz archive.

Under moderate load (about 2.2) the shared documents stop working for Windows clients. They cannot open them, save them, etc. It seems to be connected to tar accessing those documents. Only a restart restored normal operation.
Comment 1 Peter 2007-09-14 01:14:24 UTC
The filesystem is ext3, of course. We have about 200 Windows XP clients on the LAN, and most of them open documents from the file server. There is also an application that some 100 clients run from the file server.
Comment 2 Andrew Morton 2007-09-14 01:39:55 UTC
Can we see the kernel logs please?
Comment 3 Peter 2007-09-14 02:28:16 UTC
Created attachment 12825
Kern.log

The Sep 7 entries are from the last reboot before the failure. There are no log entries after that until Sep 10. The problems started on Sep 10 around noon (when the backup runs), but nothing is visible here. There are no further entries before the next restart, which restored normal operation.
Comment 4 Peter 2007-09-14 02:35:36 UTC
Created attachment 12826
syslog (gzipped)

Syslog. As you can see, the backup job started at 11:30, and the problems began soon after. At 13:15 I rebooted the machine because other attempts had failed.
Comment 5 Peter 2007-09-14 02:36:35 UTC
I filtered out the CUPS and sensord messages, since I don't think they are relevant.
Comment 6 Natalie Protasevich 2007-11-28 17:57:14 UTC
Peter, is the problem still there? Have you tested with newer kernels since then?
Thanks.
Comment 7 Peter 2007-12-18 01:31:34 UTC
Well, since this is a production system, we had to work around it. The problem is not easily reproducible, but it is clearly related to file locking.

Since Samba passes all locking to the kernel, and we have not touched Samba but have changed the kernel a few times, the kernel is suspect #1. After the problems, I downgraded the kernel to version 2.6.21.7, and the problems mostly disappeared. However, we have run into them again, although we feel it has been less frequent than with 2.6.22.6.

So I downgraded again to 2.6.19.2, and it seems that the problem is completely gone. So you can assume that the bug appeared AFTER 2.6.19.2.

I can send you the kernel configurations. They differ slightly between the kernel versions (when I need a new feature, I enable it when compiling the newer kernel version). Unfortunately, I'm afraid that applying the config from 2.6.19.2 to a newer version does not give an equivalent build, because many parameters have been added to the kernel in newer versions.


I cannot promise how soon I will be able to test a new kernel again. I'll try. But will it make a difference? Is anybody actually working on the problem?

You know, I'm tired of those "test with a newer kernel" responses that arrive years after the report. Then I test again, then nothing happens for half a year, and then again "is the problem still there with a newer kernel?". For a production system, doing such tests with no real chance of progress is quite a luxury.
Comment 8 Natalie Protasevich 2007-12-18 02:25:34 UTC
So it appears that this was a regression after 2.6.19 that has worsened since then. The best approach would be a git bisect search, but that is quite difficult for you in a production environment.

The reason for testing newer releases is to test patches once they have been developed and, for unsolved bugs, to verify whether the bug is still triggered with all the surrounding changes, or whether it has mutated and creates different problems.
That is understandably difficult to do when you are dealing with a production system.

Sometimes it is possible to set up a parallel debug system that runs your application and models the workload, but this is definitely a luxury too and depends on whether you have the resources for it.
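
For readers unfamiliar with the bisect approach mentioned in comment 8: bisection is a binary search between a known-good and a known-bad point in the history. The Python sketch below only illustrates that idea and is not how git implements it. The function name `first_bad`, the candidate list, and the `is_bad` predicate are all hypothetical. It also assumes that the bug stays present once introduced and that each test gives a reliable answer, which is hard to guarantee for an intermittent problem like this one.

```python
def first_bad(candidates, is_bad):
    """Return the earliest candidate for which is_bad() is true.

    Assumptions (hypothetical, for illustration only):
    - candidates are ordered oldest to newest,
    - candidates[0] is known good and candidates[-1] is known bad,
    - once a candidate is bad, every later candidate is bad too.
    """
    good, bad = 0, len(candidates) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return candidates[bad]


# Made-up example: 16 candidate builds, regression introduced at build 11.
print(first_bad(list(range(16)), lambda build: build >= 11))  # prints 11
```

Each step halves the remaining range, so 16 candidates need at most 4 test boots. On a production machine each of those boots still has to run long enough to trigger the problem.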
Comment 9 Peter 2008-01-18 02:25:04 UTC
Well, at the end of December I booted the 2.6.23.11 kernel, and since then the application has frozen at least 5 times because of exclusively locked files. I booted back into the old kernel on 16.1.2008 and we'll see what happens.

However, it seems that 2.6.23.11 really IS making the problem worse.
Comment 10 Natalie Protasevich 2008-02-12 08:26:32 UTC
Peter, since you are having freezes repeatedly, you can collect some information on which processes are wedging your system without risk to your production environment. If you run a script that periodically collects output from vmstat, /proc/meminfo, and "echo t > /proc/sysrq-trigger", then you'll hopefully have a set of data from close to or at the time of the freeze.
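
A periodic collector of the kind suggested in comment 10 could look roughly like the Python sketch below. It is a minimal sketch under stated assumptions, not a script supplied in this report: the function names, the output directory, the `vmstat 1 2` invocation, and the 60-second interval are all assumptions. Writing `t` to `/proc/sysrq-trigger` requires root, and the resulting task dump goes to the kernel log (dmesg/syslog), not into the snapshot file.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def collect_snapshot(out_dir, meminfo_path="/proc/meminfo",
                     sysrq_path="/proc/sysrq-trigger",
                     vmstat_cmd=("vmstat", "1", "2")):
    """Write one timestamped snapshot file and return its path.

    All defaults are assumptions; pass sysrq_path=None to skip the
    task dump (for example when not running as root).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    snapshot = out_dir / f"snapshot-{stamp}.txt"
    sections = []

    # 1. vmstat output (two samples, one second apart by default).
    try:
        result = subprocess.run(list(vmstat_cmd), capture_output=True,
                                text=True, timeout=30)
        sections.append("== vmstat ==\n" + result.stdout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        sections.append(f"== vmstat failed: {exc} ==\n")

    # 2. Memory statistics.
    try:
        sections.append("== meminfo ==\n" + Path(meminfo_path).read_text())
    except OSError as exc:
        sections.append(f"== meminfo failed: {exc} ==\n")

    # 3. Ask the kernel to dump all task states to the kernel log.
    if sysrq_path is not None:
        try:
            Path(sysrq_path).write_text("t")
            sections.append("== sysrq-t requested; see kernel log ==\n")
        except OSError as exc:
            sections.append(f"== sysrq-t failed: {exc} ==\n")

    snapshot.write_text("\n".join(sections))
    return snapshot


def run_collector(out_dir, interval_s=60):
    """Collect snapshots until interrupted (interval is an assumption)."""
    while True:
        collect_snapshot(out_dir)
        time.sleep(interval_s)
```

Calling `run_collector("/var/log/lock-debug")` as root (the directory is a hypothetical choice) would leave one file per interval, so the snapshots closest to a freeze can be matched against the task dumps in the kernel log by timestamp. A full task dump on a busy server is large, so a longer interval may be preferable.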