On a DM&P ebox2300sx, 300Mhz Vortex86 cpu, I have a vanilla 2.6.31.5 kernel with a pac207 webcam. I run capture-example from the v4l-dvb sample applications and it crashes 1 out of 5 times. Let me know if I need to collect more data or try anything. [root@X-Linux]:~ # capture-example ......................................................................BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1069 in_atomic(): 0, irqs_disabled(): 1, pid: 1178, name: capture-example 4 locks held by capture-example/1178: #0: (&gspca_dev->queue_lock){+.+.+.}, at: [<c8872eda>] vidioc_streamoff+0x3b/0xb4 [gspca_main] #1: (&gspca_dev->usb_lock){+.+.+.}, at: [<c8872eed>] vidioc_streamoff+0x4e/0xb4 [gspca_main] #2: (&ohci->lock){-.-...}, at: [<c11c8093>] ohci_endpoint_disable+0x31/0x192 #3: (&mm->mmap_sem){++++++}, at: [<c100c168>] do_page_fault+0xc1/0x1fe irq event stamp: 11656 hardirqs last enabled at (11655): [<c12e41a0>] _spin_unlock_irq+0x22/0x26 hardirqs last disabled at (11656): [<c12e41da>] _spin_lock_irqsave+0x10/0x5a softirqs last enabled at (11610): [<c101a87f>] __do_softirq+0x145/0x14d softirqs last disabled at (11605): [<c101a8b1>] do_softirq+0x2a/0x42 Pid: 1178, comm: capture-example Not tainted 2.6.31.5 #2 Call Trace: [<c101222d>] __might_sleep+0xcb/0xd0 [<c100c1ad>] do_page_fault+0x106/0x1fe [<c100c0a7>] ? do_page_fault+0x0/0x1fe [<c12e43c3>] error_code+0x63/0x70 [<c100c0a7>] ? do_page_fault+0x0/0x1fe [<c11c5cef>] ? td_free+0x23/0x75 [<c11c8175>] ohci_endpoint_disable+0x113/0x192 [<c11b4428>] usb_hcd_disable_endpoint+0x2e/0x32 [<c11b5b3f>] usb_disable_endpoint+0x6d/0x72 [<c11b5cae>] usb_disable_interface+0x30/0x3f [<c11b70ac>] usb_set_interface+0x11b/0x1a0 [<c8872e1d>] gspca_set_alt0+0x23/0x46 [gspca_main] [<c8872e75>] gspca_stream_off+0x35/0x5f [gspca_main] [<c8872ef8>] vidioc_streamoff+0x59/0xb4 [gspca_main] [<c8817244>] __video_do_ioctl+0x17af/0x3920 [videodev] [<c1032fa1>] ? __lock_acquire+0x6ef/0x755 [<c102f436>] ? lock_release_holdtime+0x81/0x86 [<c103315c>] ? lock_release_non_nested+0xab/0x1cf [<c105382f>] ? might_fault+0x3d/0x79 [<c105382f>] ? might_fault+0x3d/0x79 [<c11123d4>] ? copy_from_user+0x31/0x54 [<c88196b8>] video_ioctl2+0x303/0x3ea [videodev] [<c102f436>] ? lock_release_holdtime+0x81/0x86 [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b [<c103089e>] ? trace_hardirqs_on+0xb/0xd [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev] [<c88156d8>] v4l2_unlocked_ioctl+0x2e/0x32 [videodev] [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev] [<c106dd91>] vfs_ioctl+0x19/0x50 [<c106e36b>] do_vfs_ioctl+0x458/0x4a3 [<c1155a42>] ? tty_ldisc_deref+0x8/0xa [<c1150c1c>] ? tty_write+0x1b1/0x1c2 [<c1152d69>] ? n_tty_write+0x0/0x2e6 [<c1150a6b>] ? tty_write+0x0/0x1c2 [<c106431d>] ? vfs_write+0xe3/0xfa [<c1002858>] ? restore_all_notrace+0x0/0x18 [<c106e3e2>] sys_ioctl+0x2c/0x45 [<c1002825>] syscall_call+0x7/0xb BUG: unable to handle kernel paging request at a7a7a7c3 IP: [<c11c5cef>] td_free+0x23/0x75 *pde = 00000000 Oops: 0000 [#1] DEBUG_PAGEALLOC last sysfs file: Modules linked in: gspca_pac207 gspca_main videodev v4l1_compat Pid: 1178, comm: capture-example Not tainted (2.6.31.5 #2) EIP: 0060:[<c11c5cef>] EFLAGS: 00000083 CPU: 0 EIP is at td_free+0x23/0x75 EAX: a7a7a7a7 EBX: c6b35bf0 ECX: c6b35ce4 EDX: a7a7a7c3 ESI: c6b7d800 EDI: c6b35cd4 EBP: c6785cc4 ESP: c6785cb8 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 Process capture-example (pid: 1178, ti=c6784000 task=c678d338 task.ti=c6784000) Stack: c6b35bf0 000003e8 c6b35cd4 c6785cf0 c11c8175 c6ba2ea0 c6b35bf0 00000000 <0> c6b35bf0 00000292 c6b7c040 c6b35bf0 c6ba2ea0 c6b99ed8 c6785d00 c11b4428 <0> c6b32bf0 c6ba2ea0 c6785d14 c11b5b3f 01ba3df0 000000dc 00000005 c6785d30 Call Trace: [<c11c8175>] ? ohci_endpoint_disable+0x113/0x192 [<c11b4428>] ? usb_hcd_disable_endpoint+0x2e/0x32 [<c11b5b3f>] ? usb_disable_endpoint+0x6d/0x72 [<c11b5cae>] ? usb_disable_interface+0x30/0x3f [<c11b70ac>] ? usb_set_interface+0x11b/0x1a0 [<c8872e1d>] ? gspca_set_alt0+0x23/0x46 [gspca_main] [<c8872e75>] ? gspca_stream_off+0x35/0x5f [gspca_main] [<c8872ef8>] ? vidioc_streamoff+0x59/0xb4 [gspca_main] [<c8817244>] ? __video_do_ioctl+0x17af/0x3920 [videodev] [<c1032fa1>] ? __lock_acquire+0x6ef/0x755 [<c102f436>] ? lock_release_holdtime+0x81/0x86 [<c103315c>] ? lock_release_non_nested+0xab/0x1cf [<c105382f>] ? might_fault+0x3d/0x79 [<c105382f>] ? might_fault+0x3d/0x79 [<c11123d4>] ? copy_from_user+0x31/0x54 [<c88196b8>] ? video_ioctl2+0x303/0x3ea [videodev] [<c102f436>] ? lock_release_holdtime+0x81/0x86 [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b [<c103089e>] ? trace_hardirqs_on+0xb/0xd [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev] [<c88156d8>] ? v4l2_unlocked_ioctl+0x2e/0x32 [videodev] [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev] [<c106dd91>] ? vfs_ioctl+0x19/0x50 [<c106e36b>] ? do_vfs_ioctl+0x458/0x4a3 [<c1155a42>] ? tty_ldisc_deref+0x8/0xa [<c1150c1c>] ? tty_write+0x1b1/0x1c2 [<c1152d69>] ? n_tty_write+0x0/0x2e6 [<c1150a6b>] ? tty_write+0x0/0x1c2 [<c106431d>] ? vfs_write+0xe3/0xfa [<c1002858>] ? restore_all_notrace+0x0/0x18 [<c106e3e2>] ? sys_ioctl+0x2c/0x45 [<c1002825>] ? syscall_call+0x7/0xb Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 c1 ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 0b 39 EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 CR2: 00000000a7a7a7c3 ---[ end trace 6303a49aef2d5a94 ]---
This bug is related to bug 14565.
Created attachment 23709 [details] kernel config
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). (lotsa cc's added) On Mon, 9 Nov 2009 08:59:05 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > > Summary: capture-example sleeping function called from invalid > context at arch/x86/mm/fault.c > Product: Memory Management > Version: 2.5 > Kernel Version: 2.6.31.5 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: high > Priority: P1 > Component: Slab Allocator > AssignedTo: akpm@linux-foundation.org > ReportedBy: knife@toaster.net > Regression: No > Thhis is odd. > On a DM&P ebox2300sx, 300Mhz Vortex86 cpu, I have a vanilla 2.6.31.5 kernel > with a pac207 webcam. I run capture-example from the v4l-dvb sample > applications and it crashes 1 out of 5 times. Let me know if I need to > collect > more data or try anything. > > [root@X-Linux]:~ # capture-example > ......................................................................BUG: > sleeping function called from invalid context at arch/x86/mm/fault.c:1069 > in_atomic(): 0, irqs_disabled(): 1, pid: 1178, name: capture-example > 4 locks held by capture-example/1178: > #0: (&gspca_dev->queue_lock){+.+.+.}, at: [<c8872eda>] > vidioc_streamoff+0x3b/0xb4 [gspca_main] > #1: (&gspca_dev->usb_lock){+.+.+.}, at: [<c8872eed>] > vidioc_streamoff+0x4e/0xb4 [gspca_main] > #2: (&ohci->lock){-.-...}, at: [<c11c8093>] > ohci_endpoint_disable+0x31/0x192 > #3: (&mm->mmap_sem){++++++}, at: [<c100c168>] do_page_fault+0xc1/0x1fe > irq event stamp: 11656 > hardirqs last enabled at (11655): [<c12e41a0>] _spin_unlock_irq+0x22/0x26 > hardirqs last disabled at (11656): [<c12e41da>] _spin_lock_irqsave+0x10/0x5a > softirqs last enabled at (11610): [<c101a87f>] __do_softirq+0x145/0x14d > softirqs last disabled at (11605): [<c101a8b1>] do_softirq+0x2a/0x42 > Pid: 1178, comm: capture-example Not tainted 2.6.31.5 #2 > Call Trace: > [<c101222d>] __might_sleep+0xcb/0xd0 > [<c100c1ad>] do_page_fault+0x106/0x1fe > [<c100c0a7>] ? do_page_fault+0x0/0x1fe > [<c12e43c3>] error_code+0x63/0x70 > [<c100c0a7>] ? do_page_fault+0x0/0x1fe > [<c11c5cef>] ? td_free+0x23/0x75 > [<c11c8175>] ohci_endpoint_disable+0x113/0x192 > [<c11b4428>] usb_hcd_disable_endpoint+0x2e/0x32 > [<c11b5b3f>] usb_disable_endpoint+0x6d/0x72 > [<c11b5cae>] usb_disable_interface+0x30/0x3f > [<c11b70ac>] usb_set_interface+0x11b/0x1a0 > [<c8872e1d>] gspca_set_alt0+0x23/0x46 [gspca_main] > [<c8872e75>] gspca_stream_off+0x35/0x5f [gspca_main] > [<c8872ef8>] vidioc_streamoff+0x59/0xb4 [gspca_main] > [<c8817244>] __video_do_ioctl+0x17af/0x3920 [videodev] > [<c1032fa1>] ? __lock_acquire+0x6ef/0x755 > [<c102f436>] ? lock_release_holdtime+0x81/0x86 > [<c103315c>] ? lock_release_non_nested+0xab/0x1cf > [<c105382f>] ? might_fault+0x3d/0x79 > [<c105382f>] ? might_fault+0x3d/0x79 > [<c11123d4>] ? copy_from_user+0x31/0x54 > [<c88196b8>] video_ioctl2+0x303/0x3ea [videodev] > [<c102f436>] ? lock_release_holdtime+0x81/0x86 > [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c > [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b > [<c103089e>] ? trace_hardirqs_on+0xb/0xd > [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev] > [<c88156d8>] v4l2_unlocked_ioctl+0x2e/0x32 [videodev] > [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev] > [<c106dd91>] vfs_ioctl+0x19/0x50 > [<c106e36b>] do_vfs_ioctl+0x458/0x4a3 > [<c1155a42>] ? tty_ldisc_deref+0x8/0xa > [<c1150c1c>] ? tty_write+0x1b1/0x1c2 > [<c1152d69>] ? n_tty_write+0x0/0x2e6 > [<c1150a6b>] ? tty_write+0x0/0x1c2 > [<c106431d>] ? vfs_write+0xe3/0xfa > [<c1002858>] ? restore_all_notrace+0x0/0x18 > [<c106e3e2>] sys_ioctl+0x2c/0x45 > [<c1002825>] syscall_call+0x7/0xb We oopsed in td_free() (see below). But as part of that oops processing the kernel entered do_page_fault() and emitted a might_sleep() warning because we took a pagefault with local interrupts disabled. This is undesirable behaviour from the low-level x86 fault code and I don't think it normally happens. Did we break something in x86 land, or is this oops sufficiently weird and whacky to bypass existing checks for this false positive? > BUG: unable to handle kernel paging request at a7a7a7c3 > IP: [<c11c5cef>] td_free+0x23/0x75 > *pde = 00000000 > Oops: 0000 [#1] DEBUG_PAGEALLOC > last sysfs file: > Modules linked in: gspca_pac207 gspca_main videodev v4l1_compat > > Pid: 1178, comm: capture-example Not tainted (2.6.31.5 #2) > EIP: 0060:[<c11c5cef>] EFLAGS: 00000083 CPU: 0 > EIP is at td_free+0x23/0x75 > EAX: a7a7a7a7 EBX: c6b35bf0 ECX: c6b35ce4 EDX: a7a7a7c3 > ESI: c6b7d800 EDI: c6b35cd4 EBP: c6785cc4 ESP: c6785cb8 > DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 > Process capture-example (pid: 1178, ti=c6784000 task=c678d338 > task.ti=c6784000) > Stack: > c6b35bf0 000003e8 c6b35cd4 c6785cf0 c11c8175 c6ba2ea0 c6b35bf0 00000000 > <0> c6b35bf0 00000292 c6b7c040 c6b35bf0 c6ba2ea0 c6b99ed8 c6785d00 c11b4428 > <0> c6b32bf0 c6ba2ea0 c6785d14 c11b5b3f 01ba3df0 000000dc 00000005 c6785d30 > Call Trace: > [<c11c8175>] ? ohci_endpoint_disable+0x113/0x192 > [<c11b4428>] ? usb_hcd_disable_endpoint+0x2e/0x32 > [<c11b5b3f>] ? usb_disable_endpoint+0x6d/0x72 > [<c11b5cae>] ? usb_disable_interface+0x30/0x3f > [<c11b70ac>] ? usb_set_interface+0x11b/0x1a0 > [<c8872e1d>] ? gspca_set_alt0+0x23/0x46 [gspca_main] > [<c8872e75>] ? gspca_stream_off+0x35/0x5f [gspca_main] > [<c8872ef8>] ? vidioc_streamoff+0x59/0xb4 [gspca_main] > [<c8817244>] ? __video_do_ioctl+0x17af/0x3920 [videodev] > [<c1032fa1>] ? __lock_acquire+0x6ef/0x755 > [<c102f436>] ? lock_release_holdtime+0x81/0x86 > [<c103315c>] ? lock_release_non_nested+0xab/0x1cf > [<c105382f>] ? might_fault+0x3d/0x79 > [<c105382f>] ? might_fault+0x3d/0x79 > [<c11123d4>] ? copy_from_user+0x31/0x54 > [<c88196b8>] ? video_ioctl2+0x303/0x3ea [videodev] > [<c102f436>] ? lock_release_holdtime+0x81/0x86 > [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c > [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b > [<c103089e>] ? trace_hardirqs_on+0xb/0xd > [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev] > [<c88156d8>] ? v4l2_unlocked_ioctl+0x2e/0x32 [videodev] > [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev] > [<c106dd91>] ? vfs_ioctl+0x19/0x50 > [<c106e36b>] ? do_vfs_ioctl+0x458/0x4a3 > [<c1155a42>] ? tty_ldisc_deref+0x8/0xa > [<c1150c1c>] ? tty_write+0x1b1/0x1c2 > [<c1152d69>] ? n_tty_write+0x0/0x2e6 > [<c1150a6b>] ? tty_write+0x0/0x1c2 > [<c106431d>] ? vfs_write+0xe3/0xfa > [<c1002858>] ? restore_all_notrace+0x0/0x18 > [<c106e3e2>] ? sys_ioctl+0x2c/0x45 > [<c1002825>] ? syscall_call+0x7/0xb > Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 c1 > ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 0b > 39 > EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 > CR2: 00000000a7a7a7c3 And here's the real oops. drivers/usb/host/ohci-mem.c:td_free() dereferenced a7a7a7c3. Which looks like /********** drivers/base/dmapool.c **********/ #define POOL_POISON_FREED 0xa7 /* !inuse */ #define POOL_POISON_ALLOCATED 0xa9 /* !initted */
On Wed, 11 Nov 2009, Andrew Morton wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > > > > Summary: capture-example sleeping function called from invalid > > context at arch/x86/mm/fault.c > We oopsed in td_free() (see below). But as part of that oops > processing the kernel entered do_page_fault() and emitted a > might_sleep() warning because we took a pagefault with local interrupts > disabled. > > This is undesirable behaviour from the low-level x86 fault code and I > don't think it normally happens. > > Did we break something in x86 land, or is this oops sufficiently weird > and whacky to bypass existing checks for this false positive? No, what happened was a structure containing a linked-list entry got freed while it was still on the list. Then when the driver walked through the list, it attempted to dereference a list pointer that had been poisoned. More or less by coincidence, the poison value represented a paged-out address rather than an invalid address, so a page fault occurred. That's what caused the oops. > > BUG: unable to handle kernel paging request at a7a7a7c3 > > IP: [<c11c5cef>] td_free+0x23/0x75 > > [<c1155a42>] ? tty_ldisc_deref+0x8/0xa > > [<c1150c1c>] ? tty_write+0x1b1/0x1c2 > > [<c1152d69>] ? n_tty_write+0x0/0x2e6 > > [<c1150a6b>] ? tty_write+0x0/0x1c2 > > [<c106431d>] ? vfs_write+0xe3/0xfa > > [<c1002858>] ? restore_all_notrace+0x0/0x18 > > [<c106e3e2>] ? sys_ioctl+0x2c/0x45 > > [<c1002825>] ? syscall_call+0x7/0xb > > Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 > c1 > > ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 > 0b 39 > > EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 > > CR2: 00000000a7a7a7c3 > > And here's the real oops. drivers/usb/host/ohci-mem.c:td_free() > dereferenced a7a7a7c3. Which looks like > > /********** drivers/base/dmapool.c **********/ > #define POOL_POISON_FREED 0xa7 /* !inuse */ > #define POOL_POISON_ALLOCATED 0xa9 /* !initted */ If I'm reading this correctly, the bad dereference occurred in the second source line: prev = &(*prev)->td_hash; if (*prev) The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is 0x1c, causing the offending address to be 0xa7a7a7c3. I have no idea why a struct td would have been freed while it was still in use. Alan Stern
Is there anything I can do to help? This is a show stopping bug for me. Thanks, Sean Lazar Alan Stern wrote: > On Wed, 11 Nov 2009, Andrew Morton wrote: > > >>> http://bugzilla.kernel.org/show_bug.cgi?id=14564 >>> >>> Summary: capture-example sleeping function called from invalid >>> context at arch/x86/mm/fault.c >>> > > >> We oopsed in td_free() (see below). But as part of that oops >> processing the kernel entered do_page_fault() and emitted a >> might_sleep() warning because we took a pagefault with local interrupts >> disabled. >> >> This is undesirable behaviour from the low-level x86 fault code and I >> don't think it normally happens. >> >> Did we break something in x86 land, or is this oops sufficiently weird >> and whacky to bypass existing checks for this false positive? >> > > No, what happened was a structure containing a linked-list entry got > freed while it was still on the list. Then when the driver walked > through the list, it attempted to dereference a list pointer that had > been poisoned. More or less by coincidence, the poison value > represented a paged-out address rather than an invalid address, so a > page fault occurred. That's what caused the oops. > > >>> BUG: unable to handle kernel paging request at a7a7a7c3 >>> IP: [<c11c5cef>] td_free+0x23/0x75 >>> > > >>> [<c1155a42>] ? tty_ldisc_deref+0x8/0xa >>> [<c1150c1c>] ? tty_write+0x1b1/0x1c2 >>> [<c1152d69>] ? n_tty_write+0x0/0x2e6 >>> [<c1150a6b>] ? tty_write+0x0/0x1c2 >>> [<c106431d>] ? vfs_write+0xe3/0xfa >>> [<c1002858>] ? restore_all_notrace+0x0/0x18 >>> [<c106e3e2>] ? sys_ioctl+0x2c/0x45 >>> [<c1002825>] ? syscall_call+0x7/0xb >>> Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 >>> c1 >>> ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 >>> 0b 39 >>> EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8 >>> CR2: 00000000a7a7a7c3 >>> >> And here's the real oops. drivers/usb/host/ohci-mem.c:td_free() >> dereferenced a7a7a7c3. Which looks like >> >> /********** drivers/base/dmapool.c **********/ >> #define POOL_POISON_FREED 0xa7 /* !inuse */ >> #define POOL_POISON_ALLOCATED 0xa9 /* !initted */ >> > > If I'm reading this correctly, the bad dereference occurred in the > second source line: > > prev = &(*prev)->td_hash; > if (*prev) > > The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is > 0x1c, causing the offending address to be 0xa7a7a7c3. > > I have no idea why a struct td would have been freed while it was still > in use. > > Alan Stern > > -- > To unsubscribe from this list: send the line "unsubscribe linux-media" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > >
On Wed, 2 Dec 2009, Sean wrote: > Is there anything I can do to help? This is a show stopping bug for me. Here's a patch you can try. It will add a _lot_ of debugging information to the system log. Maybe it will help pin down the source of the problem. Alan Stern Index: 2.6.31/drivers/usb/host/ohci-hcd.c =================================================================== --- 2.6.31.orig/drivers/usb/host/ohci-hcd.c +++ 2.6.31/drivers/usb/host/ohci-hcd.c @@ -197,7 +197,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); Index: 2.6.31/drivers/usb/host/ohci-mem.c =================================================================== --- 2.6.31.orig/drivers/usb/host/ohci-mem.c +++ 2.6.31/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + ohci_info(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td { struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - while (*prev && *prev != td) + ohci_info(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_info(hc, "poisoned hash at %p\n", prev); + return; + } prev = &(*prev)->td_hash; + } if (*prev) *prev = td->td_hash; else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) Index: 2.6.31/drivers/usb/host/ohci-q.c =================================================================== --- 2.6.31.orig/drivers/usb/host/ohci-q.c +++ 2.6.31/drivers/usb/host/ohci-q.c @@ -403,7 +403,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed);
Thanks, that patch definitely traps the bug. Unfortunately there are so many debug messages that the capture-example.c times out trying to connect to the webcam. The debug messages slow down the process enough to make that happen. But if I modify your patch and take out the extra debug messages, it works well. The modified patch is below. Reproducing the bug in four separate instances I got: ohci_hcd 0000:00:0a.0: poisoned hash at c67a285c ohci_hcd 0000:00:0a.0: poisoned hash at c67b875c ohci_hcd 0000:00:0a.0: poisoned hash at c67a179c ohci_hcd 0000:00:0a.0: poisoned hash at c679c79c Sean Lazar --- ohci-mem.c.orig 2009-12-16 22:57:49.000000000 +0000 +++ ohci-mem.c 2009-12-16 22:49:37.000000000 +0000 @@ -103,8 +103,13 @@ { struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - while (*prev && *prev != td) + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_info(hc, "poisoned hash at %p\n", prev); + return; + } prev = &(*prev)->td_hash; + } if (*prev) *prev = td->td_hash; else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) Alan Stern wrote: > On Wed, 2 Dec 2009, Sean wrote: > > >> Is there anything I can do to help? This is a show stopping bug for me. >> > > Here's a patch you can try. It will add a _lot_ of debugging > information to the system log. Maybe it will help pin down the source > of the problem. > > Alan Stern > > > > Index: 2.6.31/drivers/usb/host/ohci-hcd.c > =================================================================== > --- 2.6.31.orig/drivers/usb/host/ohci-hcd.c > +++ 2.6.31/drivers/usb/host/ohci-hcd.c > @@ -197,7 +197,7 @@ static int ohci_urb_enqueue ( > > /* allocate the TDs (deferring hash chain updates) */ > for (i = 0; i < size; i++) { > - urb_priv->td [i] = td_alloc (ohci, mem_flags); > + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, > urb->ep); > if (!urb_priv->td [i]) { > urb_priv->length = i; > urb_free_priv (ohci, urb_priv); > Index: 2.6.31/drivers/usb/host/ohci-mem.c > =================================================================== > --- 2.6.31.orig/drivers/usb/host/ohci-mem.c > +++ 2.6.31/drivers/usb/host/ohci-mem.c > @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr > > /* TDs ... */ > static struct td * > -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) > +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, > + struct usb_host_endpoint *ep) > { > dma_addr_t dma; > struct td *td; > @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem > td->hwNextTD = cpu_to_hc32 (hc, dma); > td->td_dma = dma; > /* hashed in td_fill */ > + ohci_info(hc, "td alloc for %s ep%x: %p\n", > + udev->devpath, ep->desc.bEndpointAddress, td); > } > return td; > } > @@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td > { > struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; > > - while (*prev && *prev != td) > + ohci_info(hc, "td free %p\n", td); > + while (*prev && *prev != td) { > + if ((unsigned long) *prev == 0xa7a7a7a7) { > + ohci_info(hc, "poisoned hash at %p\n", prev); > + return; > + } > prev = &(*prev)->td_hash; > + } > if (*prev) > *prev = td->td_hash; > else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) > Index: 2.6.31/drivers/usb/host/ohci-q.c > =================================================================== > --- 2.6.31.orig/drivers/usb/host/ohci-q.c > +++ 2.6.31/drivers/usb/host/ohci-q.c > @@ -403,7 +403,7 @@ static struct ed *ed_get ( > } > > /* dummy td; end of td list for ed */ > - td = td_alloc (ohci, GFP_ATOMIC); > + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); > if (!td) { > /* out of memory */ > ed_free (ohci, ed); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-media" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > >
On Wed, 16 Dec 2009, Sean wrote: > Thanks, that patch definitely traps the bug. Unfortunately there are so > many debug messages that the capture-example.c times out trying to > connect to the webcam. The debug messages slow down the process enough > to make that happen. But if I modify your patch and take out the extra > debug messages, it works well. The modified patch is below. The patch doesn't fix anything. The point was to gather enough information to figure out what's going wrong. Without the debug messages, there's no information. Perhaps things will slow down less if you change the new ohci_info() calls in the patch to ohci_dbg(). Or perhaps you can increase the timeout values in capture-example.c. You should also apply this patch (be sure to enable CONFIG_USB_DEBUG): http://marc.info/?l=linux-usb&m=126056642931083&w=2 It probably won't make any difference, but including it anyway is worthwhile. Alan Stern
Alan Stern wrote: > The patch doesn't fix anything. The point was to gather enough > information to figure out what's going wrong. Without the debug > messages, there's no information. > > Perhaps things will slow down less if you change the new ohci_info() > calls in the patch to ohci_dbg(). Or perhaps you can increase the > timeout values in capture-example.c. > > You should also apply this patch (be sure to enable CONFIG_USB_DEBUG): > > http://marc.info/?l=linux-usb&m=126056642931083&w=2 > > It probably won't make any difference, but including it anyway is > worthwhile. > > Alan Stern > The early return in td_free that is in the patch will trap the error. I changed the debug statements to ohci_dbg and I was able to capture the full output with klogd. It is attached. Sean
Created attachment 24341 [details] kernel.log with the debug patch
> Alan Stern wrote: >> The patch doesn't fix anything. The point was to gather enough >> information to figure out what's going wrong. Without the debug >> messages, there's no information. >> >> Perhaps things will slow down less if you change the new ohci_info() >> calls in the patch to ohci_dbg(). Or perhaps you can increase the >> timeout values in capture-example.c. >> >> You should also apply this patch (be sure to enable CONFIG_USB_DEBUG): >> >> http://marc.info/?l=linux-usb&m=126056642931083&w=2 >> >> It probably won't make any difference, but including it anyway is >> worthwhile. >> >> Alan Stern >> > The early return in td_free that is in the patch will trap the error. > > I changed the debug statements to ohci_dbg and I was able to capture > the full output with klogd. It is attached. > > Sean Sean wrote: In looking at the log file it seems that there is a mismatch of td_alloc calls and td_free calls. Sean
On Tue, 29 Dec 2009, Sean wrote: > > I changed the debug statements to ohci_dbg and I was able to capture > > the full output with klogd. It is attached. You did not get the full output. That's why I tell people to use dmesg instead of klogd or the kernel log files. If necessary you can always increase the size of the dmesg log buffer by changing CONFIG_LOG_BUF_SHIFT. I tend to set it to 18; you might want to go even higher. > In looking at the log file it seems that there is a mismatch of td_alloc > calls and td_free calls. That's because you didn't get the full output. Here's an enhanced version of the patch. It will provide more information and perhaps a smoking gun. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-hcd.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-hcd.c +++ usb-2.6/drivers/usb/host/ohci-hcd.c @@ -201,7 +201,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); Index: usb-2.6/drivers/usb/host/ohci-mem.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + ohci_dbg(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -101,14 +104,27 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - - while (*prev && *prev != td) + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; + int n = 0; + + ohci_dbg(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev, + hash, n, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + ++n; + } + if (*prev) { *prev = td->td_hash; - else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) - ohci_dbg (hc, "no hash for td %p\n", td); + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); + } else { + ohci_dbg(hc, "no hash for td %p: %d %p\n", td, + hash, hc->td_hash[hash]); + } dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -406,7 +406,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); @@ -560,6 +560,7 @@ td_fill (struct ohci_hcd *ohci, u32 info hash = TD_HASH_FUNC (td->td_dma); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; + ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash); /* HC might read the TD (or cachelines) right away ... */ wmb ();
Alan Stern wrote: > You did not get the full output. That's why I tell people to use dmesg > instead of klogd or the kernel log files. If necessary you can always > increase the size of the dmesg log buffer by changing > CONFIG_LOG_BUF_SHIFT. I tend to set it to 18; you might want to go > even higher. > > > That's because you didn't get the full output. Here's an enhanced > version of the patch. It will provide more information and perhaps a > smoking gun. > > Alan Stern > Alan, Thanks for the debug patch. I'll send you the dmesg.log output in another email. It is 2MB. Sean
Created attachment 24371 [details] dmesg log after enhanced debug patch from Alan Stern
Created attachment 24372 [details] dmesg log after enhanced debug patch from Alan Stern
On Tue, 29 Dec 2009, Sean wrote: > Alan, > > Thanks for the debug patch. I'll send you the dmesg.log output in > another email. It is 2MB. Got it. It's not good enough; the initial error occurred before the start of your extract. Here's yet another version of the patch; this one will stop printing the debug messages when an error is first detected so maybe it won't overrun your log buffer. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-hcd.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-hcd.c +++ usb-2.6/drivers/usb/host/ohci-hcd.c @@ -201,7 +201,7 @@ static int ohci_urb_enqueue ( /* allocate the TDs (deferring hash chain updates) */ for (i = 0; i < size; i++) { - urb_priv->td [i] = td_alloc (ohci, mem_flags); + urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep); if (!urb_priv->td [i]) { urb_priv->length = i; urb_free_priv (ohci, urb_priv); @@ -580,6 +580,7 @@ static int ohci_run (struct ohci_hcd *oh struct usb_hcd *hcd = ohci_to_hcd(ohci); disable (ohci); + ohci->testing = 1; /* boot firmware should have set this up (5.1.1.3.1) */ if (first) { Index: usb-2.6/drivers/usb/host/ohci-mem.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr /* TDs ... */ static struct td * -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags) +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev, + struct usb_host_endpoint *ep) { dma_addr_t dma; struct td *td; @@ -94,6 +95,9 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem td->hwNextTD = cpu_to_hc32 (hc, dma); td->td_dma = dma; /* hashed in td_fill */ + if (hc->testing) + ohci_dbg(hc, "td alloc for %s ep%x: %p\n", + udev->devpath, ep->desc.bEndpointAddress, td); } return td; } @@ -101,14 +105,32 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; - - while (*prev && *prev != td) + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; + int n = 0; + + if (hc->testing) + ohci_dbg(hc, "td free %p\n", td); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev, + hash, n, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + ++n; + } + if (*prev) { *prev = td->td_hash; - else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) - ohci_dbg (hc, "no hash for td %p\n", td); + if (hc->testing) { + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); + if (td->td_hash == td) + hc->testing = 0; + } + } else { + ohci_dbg(hc, "no hash for td %p: %d %p\n", td, + hash, hc->td_hash[hash]); + } dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -406,7 +406,7 @@ static struct ed *ed_get ( } /* dummy td; end of td list for ed */ - td = td_alloc (ohci, GFP_ATOMIC); + td = td_alloc (ohci, GFP_ATOMIC, udev, ep); if (!td) { /* out of memory */ ed_free (ohci, ed); @@ -560,6 +560,11 @@ td_fill (struct ohci_hcd *ohci, u32 info hash = TD_HASH_FUNC (td->td_dma); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; + if (ohci->testing) { + ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash); + if (td->td_hash == td) + ohci->testing = 0; + } /* HC might read the TD (or cachelines) right away ... */ wmb (); Index: usb-2.6/drivers/usb/host/ohci.h =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci.h +++ usb-2.6/drivers/usb/host/ohci.h @@ -346,6 +346,7 @@ typedef struct urb_priv { struct ohci_hcd { spinlock_t lock; + int testing; /* * I/O memory used to communicate with the HC (dma-consistent)
Alan Stern wrote: > Got it. It's not good enough; the initial error occurred before the > start of your extract. Here's yet another version of the patch; this > one will stop printing the debug messages when an error is first > detected so maybe it won't overrun your log buffer. > > Alan Stern > Alan, Thanks again. I was able to get the full dmesg output this time. I ran capture-example three times and each time removing the webcam before capture-example finished. On the third time I got the poisoned hash message and I captured the output to a file. Attached is the dmesg output. Sean
Created attachment 24402 [details] Second dmesg log. Complete dmesg output from boot to bug.
On Sat, 2 Jan 2010, Sean wrote: > Alan, > > Thanks again. I was able to get the full dmesg output this time. I ran > capture-example three times and each time removing the webcam before > capture-example finished. On the third time I got the poisoned hash > message and I captured the output to a file. Attached is the dmesg output. Okay. Take a look at the following output: $ egrep -n '[2e]e(80|9c)' dmesg2.log 680:pci 0000:00:0c.0: reg 14 io port: [0xee80-0xee83] 727:kobject: 'ieee80211' (c79d5e1c): kobject_add_internal: parent: 'class', set: 'class' 728:kobject: 'ieee80211' (c79d5e1c): kobject_uevent_env 729:kobject: 'ieee80211' (c79d5e1c): fill_kobj_path: path = '/class/ieee80211' 4994:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80 5027:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> (null) 5185:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80 5218:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80 5276:ohci_hcd 0000:00:0b.0: td free c6662e80 5277:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null) 5296:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80 5329:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80 5538:ohci_hcd 0000:00:0b.0: td free c676ee80 5539:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null) 5558:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80 5591:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80 5644:ohci_hcd 0000:00:0b.0: td free c6662e80 5645:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null) 5720:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80 5753:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80 5900:ohci_hcd 0000:00:0b.0: td free c676ee80 5901:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null) 5978:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80 6011:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80 6072:ohci_hcd 0000:00:0b.0: td free c6662e80 6073:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null) 6088:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80 6121:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80 6324:ohci_hcd 0000:00:0b.0: td free c676ee80 6325:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null) 6343:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80 6376:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80 6416:ohci_hcd 0000:00:0b.0: td free c6662e80 6417:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> c676ee80 6492:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80 6525:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80 6686:ohci_hcd 0000:00:0b.0: td free c676ee80 6687:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> c676ee80 Ignore the first few lines as being irrelevant. Starting with line 5185 you can see that this shows two TDs being allocated, hashed, freed, and unlinked over and over again, at addresses c6662e80 and c676ee80. When each one is hashed into the list, its td_hash member is made to point to the other. When each is removed from the hash list, the other's td_hash member is set to NULL. It's all very regular and repetitious until line 6417. At that line, the td_hash member of c676ee80 (which is at offset 1c from the start of the structure, hence at c676ee9c) is made to point to its own structure! Thus later at line 6687, when c676ee80 is freed, the td_hash member of c6662e80 is set to point at the freed structure. This is what leads to poisoned pointer values later on. So what went wrong at line 6417? There's no way to know exactly. My guess is that the write at line 6325, where c6662e9c was supposed to be set to NULL, never got recorded properly in the computer's memory. This would mean that c6662e9c still contained the c676ee80 value assigned at line 6121, and hence the incorrect value was copied at line 6417. In other words, I'm guessing that you're suffering from hardware memory errors. A possible way to test this is to modify the patch. In td_free() where it adds the line: + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev); instead add this code: + barrier(); + ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n, + prev, *prev, td->td_hash); If we find that the value of *prev differs from the value of td->td_hash then we'll know for certain. (Or maybe the presence of the barrier() will cause the object code to change in a way that prevents the error from occurring.) Alan Stern
Alan Stern wrote: > In other words, I'm guessing that you're suffering from hardware memory > errors. A possible way to test this is to modify the patch. In > td_free() where it adds the line: > > + ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, > *prev); > > instead add this code: > > + barrier(); > + ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n, > + prev, *prev, td->td_hash); > > If we find that the value of *prev differs from the value of > td->td_hash then we'll know for certain. (Or maybe the presence of the > barrier() will cause the object code to change in a way that prevents > the error from occurring.) > > Alan Stern > Hmm, I applied the changes and I did not see a place where *prev differs from td->td_hash. I have run memtest86+ on this box and it has passed 16 times, so I do not suspect a hardware memory error. What do you think? Attached is the latest dmesg output. Sean
Created attachment 24416 [details] Third dmesg log
Hello Sean, Sorry to be a bit late. Looking at the dmesg, I found that the gspca version was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check if this problem still occurs? Jef
bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > > > Jean-Francois Moine <moinejf@free.fr> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |moinejf@free.fr > > > > > --- Comment #22 from Jean-Francois Moine <moinejf@free.fr> 2010-01-03 > 07:02:45 --- > Hello Sean, > > Sorry to be a bit late. Looking at the dmesg, I found that the gspca version > was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check > if > this problem still occurs? > > Jef > > Jef, Should I use the whole v4l-dvb set from http://linuxtv.org/hg/v4l-dvb or should I just get the gspca stuff from someone's personal repository? Sean
Sean, Yes, you must get the whole stable version http://linuxtv.org/hg/v4l-dvb (there is no bug fix for your webcam in my development repository). Jef
On Sat, 2 Jan 2010, Sean wrote: > Hmm, I applied the changes and I did not see a place where *prev differs > from td->td_hash. I have run memtest86+ on this box and it has passed 16 > times, so I do not suspect a hardware memory error. What do you think? > Attached is the latest dmesg output. I don't know. The same pattern as before appears here: $ egrep -n '[1b]e(40|5c)' dmesg3.log 167:kobject: 'fs' (c7801e40): kobject_add_internal: parent: '<NULL>', set: '<NULL>' 4990:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40 5023:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> (null) 5181:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676be40 5214:ohci_hcd 0000:00:0b.0: hash c676be40 to 57 -> c6691e40 5271:ohci_hcd 0000:00:0b.0: td free c6691e40 5272:ohci_hcd 0000:00:0b.0: (57 1) c676be5c -> (null) [(null)] 5294:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40 5327:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> c676be40 5533:ohci_hcd 0000:00:0b.0: td free c676be40 5534:ohci_hcd 0000:00:0b.0: (57 1) c6691e5c -> (null) [(null)] 5556:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676be40 5589:ohci_hcd 0000:00:0b.0: hash c676be40 to 57 -> c6691e40 5640:ohci_hcd 0000:00:0b.0: td free c6691e40 5641:ohci_hcd 0000:00:0b.0: (57 1) c676be5c -> c676be40 [c676be40] 5713:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40 5746:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> c676be40 5899:ohci_hcd 0000:00:0b.0: td free c676be40 5900:ohci_hcd 0000:00:0b.0: (57 1) c6691e5c -> c676be40 [c676be40] At line 5641 we see that the td_hash pointer in c676be40 gets corrupted. It is copied from the pointer in c6691e40, which was set to NULL in line 5534, but now it points to c676be40. The question is whether this corruption was caused by a hardware fault or a software bug. We have added debugging printouts to the only two places where the driver assigns anything to td->td_hash, and they don't show anything wrong. This leads me to suspect the hardware, but of course this is still just a guess. Here is a completely new patch. This one is more directed, to catch the sort of errors we now know to look for. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-mem.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-mem.c +++ usb-2.6/drivers/usb/host/ohci-mem.c @@ -98,17 +98,56 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem return td; } +static void td_check(struct ohci_hcd *hc, int hash, char *msg) +{ + struct td *td, *first; + + first = hc->td_hash[hash]; + for (td = first; td; td = td->td_hash) { + if (td->td_hash == first || td->td_hash == td) { + ohci_err(hc, "Circular pointer %s: %d %p %p %p\n", + msg, hash, first, td, td->td_hash); + td->td_hash = NULL; + return; + } + } +} + +static void td_check_all(struct ohci_hcd *hc, char *msg) +{ + int hash; + + for (hash = 0; hash < TD_HASH_SIZE; ++hash) + td_check(hc, hash, msg); +} + static void td_free (struct ohci_hcd *hc, struct td *td) { - struct td **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)]; + int hash = TD_HASH_FUNC(td->td_dma); + struct td **prev = &hc->td_hash[hash]; - while (*prev && *prev != td) + td_check(hc, hash, "#1a"); + while (*prev && *prev != td) { + if ((unsigned long) *prev == 0xa7a7a7a7) { + ohci_err(hc, "poisoned hash at %p (%d) %p %p\n", prev, + hash, td, hc->td_hash[hash]); + return; + } prev = &(*prev)->td_hash; - if (*prev) + } + if (*prev) { *prev = td->td_hash; + if (*prev == td) { + ohci_err(hc, "invalid hash at %p (%d) %p %p\n", prev, + hash, td, hc->td_hash[hash]); + *prev = NULL; + } + } else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0) ohci_dbg (hc, "no hash for td %p\n", td); + mb(); + td_check(hc, hash, "#1b"); dma_pool_free (hc->td_cache, td, td->td_dma); } Index: usb-2.6/drivers/usb/host/ohci-q.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -558,12 +558,14 @@ td_fill (struct ohci_hcd *ohci, u32 info /* hash it for later reverse mapping */ hash = TD_HASH_FUNC (td->td_dma); + td_check(ohci, hash, "#2a"); td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; /* HC might read the TD (or cachelines) right away ... */ wmb (); td->ed->hwTailP = td->hwNextTD; + td_check(ohci, hash, "#2b"); } /*-------------------------------------------------------------------------*/ @@ -1127,9 +1129,11 @@ dl_done_list (struct ohci_hcd *ohci) { struct td *td = dl_reverse_done_list (ohci); + td_check_all(ohci, "#3a"); while (td) { struct td *td_next = td->next_dl_td; takeback_td(ohci, td); td = td_next; } + td_check_all(ohci, "#3b"); }
Alan Stern wrote: > Here is a completely new patch. This one is more directed, to catch > the sort of errors we now know to look for. > > Alan Stern > Alan, I applied the patches and ran capture-example twice. On the second run of capture-example a circular pointer popped up. I did not need to remove the camera. Attached are the serial console capture as well as the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the poison message? Sean
On Sun, 3 Jan 2010, Sean wrote: > Alan, > > I applied the patches and ran capture-example twice. On the second run > of capture-example a circular pointer popped up. I did not need to > remove the camera. Attached are the serial console capture as well as > the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the > poison message? No. Among the things that patch did was to fix up the errors that caused the invalid pointers. Hence there should not have been any "poisoned hash" messages -- and indeed there weren't. The interesting part of the log is the error messages: ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800 There are two different hash chains here (32 and 1), but in each case see the message is #2b, never #2a. This means the problem occurs between the places where the #2a and #2b messages were inserted, i.e., in td_fill(). The hash chain contained a single TD and was fine to begin with; then another TD was added at the start of the chain and the pointer in the earlier TD (now at the second position in the chain) got messed up. For example, the error message in the first line above implies that originally the 32nd hash chain contained only the TD at c6542800 with its td_hash member set to NULL. But then c6774800 was added to the start of the chain, after which c6542800's td_hash pointed to c6774800. Try inserting a line saying: td_check(ohci, hash, "#2c"); two lines above the #2b line, i.e., just after the wmb(). That'll help narrow down the search for the bug. And by the way, you don't need to post your entire dmesg log. Just the portion containing the new debugging messages will be enough. Alan Stern
Alan Stern wrote: > Try inserting a line saying: > > td_check(ohci, hash, "#2c"); > > two lines above the #2b line, i.e., just after the wmb(). That'll help > narrow down the search for the bug. Alan, I put the extra line in and ran capture-example twice. This is what I got: ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 Sean
On Mon, 4 Jan 2010, Sean wrote: > Alan Stern wrote: > > Try inserting a line saying: > > > > td_check(ohci, hash, "#2c"); > > > > two lines above the #2b line, i.e., just after the wmb(). That'll help > > narrow down the search for the bug. > Alan, > > I put the extra line in and ran capture-example twice. This is what I got: > > ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800 > ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040 ... All right. Let's try this patch in place of all the others, then. Alan Stern Index: usb-2.6/drivers/usb/host/ohci-q.c =================================================================== --- usb-2.6.orig/drivers/usb/host/ohci-q.c +++ usb-2.6/drivers/usb/host/ohci-q.c @@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info struct urb_priv *urb_priv = urb->hcpriv; int is_iso = info & TD_ISO; int hash; + volatile struct td * volatile td1, * volatile td2; // ASSERT (index < urb_priv->length); @@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info /* hash it for later reverse mapping */ hash = TD_HASH_FUNC (td->td_dma); + + td1 = ohci->td_hash[hash]; + td2 = NULL; + if (td1) { + td2 = td1->td_hash; + if (td2 == td1 || td2 == td) { + ohci_err(ohci, "Circular hash: %d %p %p %p\n", + hash, td1, td2, td); + td2 = td1->td_hash = NULL; + } + } + td->td_hash = ohci->td_hash [hash]; ohci->td_hash [hash] = td; /* HC might read the TD (or cachelines) right away ... */ wmb (); + + if (td1 && td1->td_hash != td2) { + ohci_err(ohci, "Hash value changed: %d %p %p %p\n", + hash, td1, td2, td); + td1->td_hash = (struct td *) td2; + } + td->ed->hwTailP = td->hwNextTD; }
Alan Stern wrote: > ... > > All right. Let's try this patch in place of all the others, then. > > Alan Stern > > > Index: usb-2.6/drivers/usb/host/ohci-q.c > =================================================================== > --- usb-2.6.orig/drivers/usb/host/ohci-q.c > +++ usb-2.6/drivers/usb/host/ohci-q.c > @@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info > struct urb_priv *urb_priv = urb->hcpriv; > int is_iso = info & TD_ISO; > int hash; > + volatile struct td * volatile td1, * volatile td2; > > // ASSERT (index < urb_priv->length); > > @@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info > > /* hash it for later reverse mapping */ > hash = TD_HASH_FUNC (td->td_dma); > + > + td1 = ohci->td_hash[hash]; > + td2 = NULL; > + if (td1) { > + td2 = td1->td_hash; > + if (td2 == td1 || td2 == td) { > + ohci_err(ohci, "Circular hash: %d %p %p %p\n", > + hash, td1, td2, td); > + td2 = td1->td_hash = NULL; > + } > + } > + > td->td_hash = ohci->td_hash [hash]; > ohci->td_hash [hash] = td; > > /* HC might read the TD (or cachelines) right away ... */ > wmb (); > + > + if (td1 && td1->td_hash != td2) { > + ohci_err(ohci, "Hash value changed: %d %p %p %p\n", > + hash, td1, td2, td); > + td1->td_hash = (struct td *) td2; > + } > + > td->ed->hwTailP = td->hwNextTD; > } > > > Alan, This last patch seems to do the job. Thanks so much for your help! Where do I donate/send beer? Sean
bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14564 > > > Jean-Francois Moine <moinejf@free.fr> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |moinejf@free.fr > > > > > --- Comment #22 from Jean-Francois Moine <moinejf@free.fr> 2010-01-03 > 07:02:45 --- > Hello Sean, > > Sorry to be a bit late. Looking at the dmesg, I found that the gspca version > was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check > if > this problem still occurs? > > Jef > Jef, I upgraded to the latest v4l-dvb from http://linuxtv.org/hg/v4l-dvb, made the kernel modules, made the v4l libraries, and recompiled capture-example.c. Gspca now shows 2.8.0. The error still persists. Alan Stern's latest patch to ohci-q.c traps the error. I think it is an issue with the cpu or usb controller on the Vortex86SX SoC. Sean
On Mon, 4 Jan 2010, Sean wrote: > Alan, > This last patch seems to do the job. Thanks so much for your help! Where > do I donate/send beer? Um, when you say it does the job, what do you mean? The job it was _intended_ to do was to prove that your problems are caused by hardware errors rather than software bugs. If the patch causes the problems to stop, without printing any error messages in the log, then it does indeed prove this. After all, the only places the patch changes any persistent values are after it prints an error message. (Admittedly, I didn't expect the problem to stop; I expected to get a bunch of messages from the second ohci_err(). Just out of curiosity, does it make any difference if you remove all those "volatile"s in the declaration line for td1 and td2?) I noticed that your CPU is a Cyrix. Perhaps it is the culprit. Have you tried running the program on a different computer? Alan Stern
On Mon, 4 Jan 2010, Sean wrote: > Jef, > > I upgraded to the latest v4l-dvb from http://linuxtv.org/hg/v4l-dvb, > made the kernel modules, made the v4l libraries, and recompiled > capture-example.c. Gspca now shows 2.8.0. The error still persists. Alan > Stern's latest patch to ohci-q.c traps the error. I think it is an issue > with the cpu or usb controller on the Vortex86SX SoC. The CPU is more likely than the USB controller. The erroneous pointer values we see are virtual addresses, whereas the USB controller would probably write a physical (or DMA) address if it was malfunctioning. Or it could be some bizarre timing problem with the memory bus, or something else equally obscure. You didn't mention before that this was a SoC rather than an ordinary PC. Alan Stern
Alan Stern wrote: > Um, when you say it does the job, what do you mean? It traps the error and prevents the kernel from crashing. > The job it was _intended_ to do was to prove that your problems are > caused by hardware errors rather than software bugs. If the patch > causes the problems to stop, without printing any error messages in the > log, then it does indeed prove this. After all, the only places the > patch changes any persistent values are after it prints an error > message. > It did print out error messages: usb 4-2: new full speed USB device using ohci_hcd and address 2 usb 4-2: New USB device found, idVendor=093a, idProduct=2460 usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 usb 4-2: Product: CIF Single Chip usb 4-2: Manufacturer: Pixart Imaging Inc. usb 4-2: configuration #1 chosen from 1 choice [root@X-Linux]:~ # modprobe gspca_pac207 Linux video capture interface: v2.00 gspca: main v2.8.0 registered gspca: probing 093a:2460 pac207: Pixart Sensor ID 0x27 Chips ID 0x09 pac207: Pixart PAC207BCA Image Processor and Control Chip detected (vid/pid 0x093A:0x2460) gspca: video0 created usbcore: registered new interface driver pac207 pac207: registered [root@X-Linux]:~ # capture-example ...................................................................... capture-example used greatest stack depth: 5848 bytes left [root@X-Linux]:~ # capture-example .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 .ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ....ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 .ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 .ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 c677b800 ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 ..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 c677b900 .. [root@X-Linux]:~ # > (Admittedly, I didn't expect the problem to stop; I expected to get a > bunch of messages from the second ohci_err(). Just out of curiosity, > does it make any difference if you remove all those "volatile"s in the > declaration line for td1 and td2?) > It doesn't seem to make much difference. > I noticed that your CPU is a Cyrix. Perhaps it is the culprit. Have > you tried running the program on a different computer? > Yes, on other computers I don't get this error. Same os image. Though I haven't found a computer with an ohci controller yet. > Alan Stern > > -- > To unsubscribe from this list: send the line "unsubscribe linux-media" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > >
On Mon, 4 Jan 2010, Sean wrote: > Alan Stern wrote: > > Um, when you say it does the job, what do you mean? > It traps the error and prevents the kernel from crashing. As did some of the earlier patches, right? > > The job it was _intended_ to do was to prove that your problems are > > caused by hardware errors rather than software bugs. If the patch > > causes the problems to stop, without printing any error messages in the > > log, then it does indeed prove this. After all, the only places the > > patch changes any persistent values are after it prints an error > > message. > > > It did print out error messages: > .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 > c677b900 > ...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 > c677b900 > .ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 > c677b800 Ooh, that's odd. The "Circular hash" message occurs in the same spot as the "Circular pointer #2a" message in the previous patch -- and that message never got printed! > > I noticed that your CPU is a Cyrix. Perhaps it is the culprit. Have > > you tried running the program on a different computer? > > > Yes, on other computers I don't get this error. Same os image. Though I > haven't found a computer with an ohci controller yet. So that's not a real test, unfortunately. Still, at this point I'm not sure it's worthwhile to pursue this any farther. I'm convinced it's a hardware problem. Do you want to continue, or are you happy to switch computers and forget about it? Alan Stern
Alan Stern wrote: > On Mon, 4 Jan 2010, Sean wrote: > > >> Alan Stern wrote: >> >>> Um, when you say it does the job, what do you mean? >>> >> It traps the error and prevents the kernel from crashing. >> > > As did some of the earlier patches, right? > > Yes, but this one is good because it works without a debug kernel. > Still, at this point I'm not sure it's worthwhile to pursue this any > farther. I'm convinced it's a hardware problem. Do you want to > continue, or are you happy to switch computers and forget about it? > Thanks so much for your help Alan. I also think this is definitely a bug in the hardware. Let's leave it at that, I'm happy. Sean
On Tue, 5 Jan 2010, Sean wrote: > Thanks so much for your help Alan. I also think this is definitely a bug > in the hardware. Let's leave it at that, I'm happy. Okay. You should mark the Bugzilla report as REJECTED and close it out. Alan Stern