Most recent kernel where this bug did not occur: Unknown Distribution: Slackware 10.1 Hardware Environment: Athlon 2000+, 512Mo, Miro DC30 card Software Environment: mjpegtools-1.6.3-rc3 Problem Description: lavrec doesn't capture any frame, since the zr36067 driver stops due to sleeping function calls while interrupts are disabled. Steps to reproduce: Issue : lavrec -f a -i s -d 1 -U -n 256 -b 1024 -c 2 -g 768x576 -S test.avi on a PC with a Miro DC30 card. Kernel preemption is enabled on my system, I'll check if no premption can help. Appended: dmesg output after call trace dump.
Created attachment 6263 [details] dmesg output after running lavrec.
Hi, this is a known issue. It was introduced by some changes in 2.4 -> 2.6. I don't think it causes any actual issues, so it's not very high priority for me... I do acknowledge the issue and I'm willing to fix at some point, though.
OK, I've looked a little at this one. As might_sleep says, there is a call to "down" on a semaphore while interrupts are disabled. I tracked down the interrupt masking call, and I've found it was in "zoran_jpg_queue_frame", which calls "spin_lock_irqsave" (zoran_driver.c, line 999) and then calls "zr36057_enable_jpg". This call ends up trying to get a semaphore in i2c code, which of course can sleep. Of course, it appears to not occur, otherwise the CPU (and the OS, which in my case is not SMP) would be locked on the semaphore with no chance to get the owner to "up" the semaphore (no scheduling since IRQs are off). But it could be dangerous (I didn't analyzed further and do not have knowledge of the kernel to do that). Anyway, either the semaphore is useless (I doubt it, since it is deep inside i2c code, which is generic code), or the IRQ should be released before calling zr36057_enable_jpg, or IRQs should not be disabled (why doing that on a simple ioctl?) or - maybe - this function (that is, zr36057_enable_jpg) should not be called at all there. Indeed, I noticed the dumps occurs each time I stop lavrec, which implies it is when the device is closed or when the ioctl says "no more capture", or something like that. That is, the jpg code is already here for ages, and I do not understand why trying to load it in this context (if the comment is correct, this is what is intended). Anyway, if this is necessary, I suggest to move the loading of jpg up, just before the call to spin_lock_irqsave. After all, we could load where this is not necessary, but I do not think this will occur often (it would be only in error case I think anyway). Sorry, I cannot test until this WE, as usual. I was just doing static code reading. CC
Well, I didn't read the code with enough attention when I wrote my last comment and was not clear enough. First, contrary to what I said before, the stack dump doesn't occur on device release (ie when lavrec ends), but when lavrec starts, which is far more coherent. The problem is really on loading the jpg codec therefore. Second, it seems zoran_jpg_queue_frame tries to load on the fly the jpg codec if it is not loaded when a new frame is queued. To do this, it must ensures nobody else is using the codec (I think this is what is intended) and therefore tries to take the spinlock. This is really a big action : all this is occurring from a userspace call, not from an interrupt or something else. I therefore think it is not a spinlock that must be used there (except if the jpg codec can access the driver's structure while loading and from an interrupt, in which case there should be a start method on this thing), but simply a semaphore / critical section. Moreover, I wonder why bother loading on the fly this thing, and not loading it when the driver is opened / the whole module loaded ? This doesn't seem to be a heavy thing and I hope it is not used if no grabbing is in place. It would only make another dependency on the zr36067 module (I mean, a permanent dependency instead of a temporary one only while recording). Anyway, the check looks at the codec_mode and results of previous check (which can be -EBUSY). However, the codec_mode is already accessed from outside the spinlock just before. Moreover, the -EBUSY is set only if the second imbricated "if" fails in the previous test (not if the first one fails), and therefore doesn't seem to be a reliable information to know if anybody is already using the driver. An atomic client count would certainly be more efficient (no spinlock/semaphore) and less error prone. Well, I said I would test my hypotheses, and here are my results: I confirm the stack dump is due to this piece of code. I did the following modifs, and I did not get the warning anymore (and I do not think this I removed any usefull check or made the system more unstable if several clients use the driver, for the reasons I said previously): - suppress the following line: res = -EBUSY; } } if (!res && zr->codec_mode == BUZ_MODE_IDLE) { /* Ok load up the jpeg codec */ - zr36057_enable_jpg(zr, mode); } if (!res) { switch (zr->jpg_buffers.buffer[num].state) { and move it to the previous check, just before the spin_lock_irqsave: /* what is the codec mode right now? */ if (zr->codec_mode == BUZ_MODE_IDLE) { zr->jpg_settings = fh->jpg_settings; + zr36057_enable_jpg(zr, mode); } else if (zr->codec_mode != mode) { /* wrong codec mode active - invalid */ dprintk(1, KERN_ERR "%s: jpg_queue_frame() - codec in wrong mode\n", ZR_DEVNAME(zr)); return -EINVAL; } Hope my reflexions will help, and you could check if this spin_lock is really usefull for checking if the jpg codec is already loaded or not. Bye, CC
Created attachment 6419 [details] possible fix (alternative) I don't think this is correct; the problem is that we need to check for the buffer to be valid (free) before we want to load the JPG codec, because we do that on the fly (that's just how it's implemented; it's not really required but that's just the way we did it). So rather than moving the JPG loading above the spinlock, we should move the spinlock to below the JPG loading. See attached patch, can you test that?
Tested, it works, which is normal, since we are not calling anymore zoran_jpg_queue_frame from inside the spinlock.
Hi Christian, the testing is just to assure I didn't do something completely stupid. Thanks for testing, I'll try to submit this upstream sometime very soon.
Seen the zoran_driver was updated in 2.6.15-rc4 (not in the way we tried before, but the spin lock is not held while calling zr36057_enable_jpg). Will try this week end and, if it works I'll close this bug. Regards, CC
Works fine with 2.6.15-rc7.