Bug 8978

Summary: Strange locking problems when hw RAID in Degraded status
Product: File System
Reporter: Peter (tuharsky)
Component: ext3
Assignee: Andrew Morton (akpm)
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: alan, htejun, neilb, protasnb
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.21.7
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: syslog

Description Peter 2007-09-03 22:53:42 UTC
Most recent kernel where this bug did not occur:
Distribution: Debian Etch
Hardware Environment: PC i486, Dual Opteron
Software Environment: Samba
Problem Description: We have a dual Opteron server with an Adaptec 2130S (aacraid) controller, on which our 4-disk hardware RAID-5 array resides. We use only the open drivers from the standard kernel tree; we custom-compiled kernel 2.6.21.5 on top of standard Debian Etch i486. The server is used as a Samba file server.

After two years of flawless operation, the controller started to beep because one of the disks began reporting uncorrectable errors. The RAID-5 array dropped to the "Degraded" state, although it continued working.

However, strange file locking errors showed up soon afterwards. We use a Visual FoxPro application that some 200 Windows machines run from that Samba share. The app uses dozens of form files. In some situations, the app reports errors saying "access to the library file denied". The file is usually one of the form files on the Samba share.

Moreover, I am even unable to copy or tar the application share from the server console. The process repeatedly stalls at some form file, and after several minutes (probably when someone on the network stops using the file) it "unlocks" and continues.
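
One way to pin down which files stall a copy is to time how long each file takes to open and read from the server console. The following is only a minimal sketch under stated assumptions: the function name `time_file_reads`, the 5-second threshold, and the idea of walking the share directory are illustrative and not part of the original report.

```python
import os
import time


def time_file_reads(root, slow_threshold=5.0, chunk_size=1 << 20):
    """Read every regular file under `root` and return (path, seconds)
    pairs for files whose open+read took at least `slow_threshold` seconds.

    Hypothetical helper: name and defaults are assumptions for illustration.
    """
    slow = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            start = time.monotonic()
            try:
                # open() itself may block if another process holds a lease
                # on the file, so the timer starts before the open call.
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):
                        pass
            except OSError as exc:
                print(f"error {path}: {exc}")
                continue
            elapsed = time.monotonic() - start
            if elapsed >= slow_threshold:
                slow.append((path, elapsed))
                print(f"slow  {path}: {elapsed:.1f}s")
    return slow
```

Calling `time_file_reads` on the directory that backs the Samba share, once while the array is "Degraded" and once while it is healthy, would show whether the same form files stall in both states.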

Because of that, I assume that the RAID status has some (bad) impact on file locking. I know it shouldn't happen, but it does. When I temporarily repaired the RAID so that it returned to "Optimal" status, the locking problems stopped. Once the RAID fell back to "Degraded" status, the problems returned.

Since this is a production server, I have just resolved the RAID problems. However, I can offer any help I am able to give in tracking down this odd kernel bug.


Steps to reproduce:
Comment 1 Andrew Morton 2007-09-03 23:36:04 UTC
Is there nothing of interest in the logs?
Comment 2 Peter 2007-09-03 23:42:15 UTC
Well, the RAID problems are back, so debugging is possible. I'll try the latest kernel.

The buggy one is 2.6.21.7
Comment 3 Peter 2007-09-03 23:56:04 UTC
The kernel doesn't log anything to kern.log when the controller starts beeping. The startup log (dmesg) is here: see attachment 12690 [details]
Comment 4 Peter 2007-09-03 23:56:44 UTC
Created attachment 12691 [details]
syslog

Well, these samba oplock breaks are suspicious.
Comment 5 Peter 2007-09-04 00:02:46 UTC
However, I cannot guarantee that these oplock errors are the actual cause. I looked at old logs and some oplock problems were there before, although they looked a bit different. That was Debian Sarge with an older Samba release, so the error codes and syntax could have changed.
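
To compare how often oplock messages appear in old and new logs, the lines could be counted per hour. This is a rough sketch only. It assumes the traditional syslog timestamp prefix (such as `Sep  3 22:53:42`) and simply matches the word "oplock" case-insensitively, since the exact Samba message format is not known from this report. The function name `oplock_lines_per_hour` is hypothetical.

```python
import re
from collections import Counter

# Assumption: any line containing "oplock" (in any letter case) is relevant.
OPLOCK_RE = re.compile(r"oplock", re.IGNORECASE)


def oplock_lines_per_hour(lines):
    """Count lines mentioning 'oplock', grouped by month, day and hour.

    Assumes each line starts with a traditional syslog timestamp
    ("Mon DD HH:MM:SS"), whose first 9 characters give "Mon DD HH".
    """
    counts = Counter()
    for line in lines:
        if OPLOCK_RE.search(line):
            counts[line[:9]] += 1
    return counts
```

Passing an open log file to the function (for example `oplock_lines_per_hour(open(path))`, with `path` pointing at the syslog in question) gives per-hour counts that can be set against the times the array was in the "Degraded" state.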
Comment 6 Andrew Morton 2007-09-14 01:39:13 UTC
I don't understand what I'm seeing in your logs.  How come there's
a pile of ata errors coming out when you say the problem is with
the aacraid controller?
Comment 7 Peter 2007-09-14 02:15:52 UTC
Those are the subject of a separate bug, bug 8979, which was resolved as a problem with an old smartd version.
Comment 8 Peter 2007-09-14 02:17:10 UTC
The kernel doesn't report anything interesting when the RAID enters the "Degraded" state.
Comment 9 Natalie Protasevich 2008-02-11 01:30:31 UTC
Peter, any updates? Have you tried other kernel versions, either newer ones or falling back to the one that used to work for you? I wouldn't be surprised if the controller itself were going bad.
Comment 10 Peter 2008-02-11 21:37:17 UTC
Well, as long as bug 9017 persists, it's nearly impossible to debug this problem, because the symptoms are much the same (file locking problems). Once bug 9017 is resolved, I could try removing a hard drive from the RAID and see what happens with a recent kernel, but it doesn't make sense to do so any sooner.