Bug 5403 - Zoran driver sleeps in invalid context
Summary: Zoran driver sleeps in invalid context
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other) (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-09 08:14 UTC by Christian Casteyde
Modified: 2005-12-26 14:12 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.14-rc3
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg output after running lavrec. (12.42 KB, text/plain)
2005-10-09 08:16 UTC, Christian Casteyde
Details
possible fix (alternative) (710 bytes, patch)
2005-10-29 08:26 UTC, Ronald Bultje
Details | Diff

Description Christian Casteyde 2005-10-09 08:14:45 UTC
Most recent kernel where this bug did not occur: Unknown 
Distribution: Slackware 10.1 
Hardware Environment: Athlon 2000+, 512Mo, Miro DC30 card 
Software Environment: mjpegtools-1.6.3-rc3 
Problem Description: 
lavrec doesn't capture any frame, since the zr36067 driver stops due to 
sleeping function calls while interrupts are disabled. 
 
Steps to reproduce: 
Issue : 
lavrec -f a -i s -d 1 -U -n 256 -b 1024 -c 2 -g 768x576 -S test.avi 
 
on a PC with a Miro DC30 card. 
Kernel preemption is enabled on my system, I'll check if no premption can help. 
Appended: dmesg output after call trace dump.
Comment 1 Christian Casteyde 2005-10-09 08:16:01 UTC
Created attachment 6263 [details]
dmesg output after running lavrec.
Comment 2 Ronald Bultje 2005-10-10 21:21:06 UTC
Hi,

this is a known issue. It was introduced by some changes in 2.4 -> 2.6. I don't
think it causes any actual issues, so it's not very high priority for me... I do
acknowledge the issue and I'm willing to fix at some point, though.
Comment 3 Christian Casteyde 2005-10-25 12:00:59 UTC
OK, I've looked a little at this one. 
 
As might_sleep says, there is a call to "down" on a semaphore while interrupts 
are disabled. I tracked down the interrupt masking call, and I've found it was 
in "zoran_jpg_queue_frame", which calls "spin_lock_irqsave" (zoran_driver.c, 
line 999) and then calls "zr36057_enable_jpg". This call ends up trying to get 
a semaphore in i2c code, which of course can sleep. 
 
Of course, it appears to not occur, otherwise the CPU (and the OS, which in my 
case is not SMP) would be locked on the semaphore with no chance to get the 
owner to "up" the semaphore (no scheduling since IRQs are off). But it could be 
dangerous (I didn't analyzed further and do not have knowledge of the kernel to 
do that). 
 
Anyway, either the semaphore is useless (I doubt it, since it is deep inside 
i2c code, which is generic code), or the IRQ should be released before calling 
zr36057_enable_jpg, or IRQs should not be disabled (why doing that on a simple 
ioctl?) or - maybe - this function (that is, zr36057_enable_jpg) should not be 
called at all there. 
Indeed, I noticed the dumps occurs each time I stop lavrec, which implies it is 
when the device is closed or when the ioctl says "no more capture", or 
something like that. That is, the jpg code is already here for ages, and I do 
not understand why trying to load it in this context (if the comment is 
correct, this is what is intended). 
 
Anyway, if this is necessary, I suggest to move the loading of jpg up, just 
before the call to spin_lock_irqsave. After all, we could load where this is 
not necessary, but I do not think this will occur often (it would be only in 
error case I think anyway). 
 
Sorry, I cannot test until this WE, as usual. I was just doing static code 
reading. 
 
CC 
Comment 4 Christian Casteyde 2005-10-29 02:35:11 UTC
Well, I didn't read the code with enough attention when I wrote my last comment 
and was not clear enough. 
 
First, contrary to what I said before, the stack dump doesn't occur on device 
release (ie when lavrec ends), but when lavrec starts, which is far more 
coherent. The problem is really on loading the jpg codec therefore. 
 
Second, it seems zoran_jpg_queue_frame tries to load on the fly the jpg codec 
if it is not loaded when a new frame is queued. To do this, it must ensures 
nobody else is using the codec (I think this is what is intended) and therefore 
tries to take the spinlock. 
 
This is really a big action : all this is occurring from a userspace call, not 
from an interrupt or something else. I therefore think it is not a spinlock 
that must be used there (except if the jpg codec can access the driver's 
structure while loading and from an interrupt, in which case there should be a 
start method on this thing), but simply a semaphore / critical section. 
Moreover, I wonder why bother loading on the fly this thing, and not loading it 
when the driver is opened / the whole module loaded ? This doesn't seem to be a 
heavy thing and I hope it is not used if no grabbing is in place. It would only 
make another dependency on the zr36067 module (I mean, a permanent dependency 
instead of a temporary one only while recording). 
 
Anyway, the check looks at the codec_mode and results of previous check (which 
can be -EBUSY). However, the codec_mode is already accessed from outside the 
spinlock just before. Moreover, the -EBUSY is set only if the second imbricated 
"if" fails in the previous test (not if the first one fails), and therefore 
doesn't seem to be a reliable information to know if anybody is already using 
the driver. An atomic client count would certainly be more efficient (no 
spinlock/semaphore) and less error prone. 
 
Well, I said I would test my hypotheses, and here are my results: I confirm the 
stack dump is due to this piece of code. I did the following modifs, and I did 
not get the warning anymore (and I do not think this I removed any usefull 
check or made the system more unstable if several clients use the driver, for 
the reasons I said previously): 
 
- suppress the following line: 
                        res = -EBUSY; 
                } 
        } 
 
        if (!res && zr->codec_mode == BUZ_MODE_IDLE) { 
                /* Ok load up the jpeg codec */ 
-              zr36057_enable_jpg(zr, mode); 
        } 
 
        if (!res) { 
                switch (zr->jpg_buffers.buffer[num].state) { 
 
and move it to the previous check, just before the spin_lock_irqsave: 
       /* what is the codec mode right now? */ 
        if (zr->codec_mode == BUZ_MODE_IDLE) { 
                zr->jpg_settings = fh->jpg_settings; 
+                zr36057_enable_jpg(zr, mode); 
        } else if (zr->codec_mode != mode) { 
                /* wrong codec mode active - invalid */ 
                dprintk(1, 
                        KERN_ERR 
                        "%s: jpg_queue_frame() - codec in wrong mode\n", 
                        ZR_DEVNAME(zr)); 
                return -EINVAL; 
        } 
 
 
Hope my reflexions will help, and you could check if this spin_lock is really 
usefull for checking if the jpg codec is already loaded or not. 
 
Bye, 
CC 
 
Comment 5 Ronald Bultje 2005-10-29 08:26:10 UTC
Created attachment 6419 [details]
possible fix (alternative)

I don't think this is correct; the problem is that we need to check for the
buffer to be valid (free) before we want to load the JPG codec, because we do
that on the fly (that's just how it's implemented; it's not really required but
that's just the way we did it). So rather than moving the JPG loading above the
spinlock, we should move the spinlock to below the JPG loading. See attached
patch, can you test that?
Comment 6 Christian Casteyde 2005-10-29 10:51:07 UTC
Tested, it works, which is normal, since we are not calling anymore 
zoran_jpg_queue_frame from inside the spinlock. 
 
Comment 7 Ronald Bultje 2005-10-29 16:50:00 UTC
Hi Christian,

the testing is just to assure I didn't do something completely stupid. Thanks
for testing, I'll try to submit this upstream sometime very soon.
Comment 8 Christian Casteyde 2005-12-01 11:26:26 UTC
Seen the zoran_driver was updated in 2.6.15-rc4  
(not in the way we tried before, but the spin lock  
is not held while calling zr36057_enable_jpg). 
 
Will try this week end and, if it works I'll close 
this bug. 
 
Regards, 
CC 
Comment 9 Christian Casteyde 2005-12-26 14:12:32 UTC
Works fine with 2.6.15-rc7.  

Note You need to log in before you can comment on or make changes to this bug.