Bug 14564 - capture-example sleeping function called from invalid context at arch/x86/mm/fault.c
Summary: capture-example sleeping function called from invalid context at arch/x86/mm/...
Status: RESOLVED WILL_NOT_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Slab Allocator (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-09 08:59 UTC by Sean Lazar
Modified: 2010-01-05 21:42 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.31.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel config (54.15 KB, application/octet-stream)
2009-11-09 09:24 UTC, Sean Lazar
Details
kernel.log with the debug patch (239.84 KB, text/plain)
2009-12-29 09:36 UTC, Sean Lazar
Details
dmesg log after enhanced debug patch from Alan Stern (76.23 KB, text/plain)
2009-12-30 00:45 UTC, Sean Lazar
Details
dmesg log after enhanced debug patch from Alan Stern (76.23 KB, application/x-gzip)
2009-12-30 00:57 UTC, Sean Lazar
Details
Second dmesg log. Complete dmesg output from boot to bug. (37.59 KB, application/x-gzip)
2010-01-02 09:07 UTC, Sean Lazar
Details
Third dmesg log (35.83 KB, application/x-gzip)
2010-01-03 02:03 UTC, Sean Lazar
Details

Description Sean Lazar 2009-11-09 08:59:01 UTC
On a DM&P ebox2300sx, 300Mhz Vortex86 cpu, I have a vanilla 2.6.31.5 kernel with a pac207 webcam. I run capture-example from the v4l-dvb sample applications and it crashes 1 out of 5 times. Let me know if I need to collect more data or try anything.

[root@X-Linux]:~ # capture-example                                                                                                                             
......................................................................BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1069           
in_atomic(): 0, irqs_disabled(): 1, pid: 1178, name: capture-example                                                                                           
4 locks held by capture-example/1178:                                                                                                                          
 #0:  (&gspca_dev->queue_lock){+.+.+.}, at: [<c8872eda>] vidioc_streamoff+0x3b/0xb4 [gspca_main]                                                               
 #1:  (&gspca_dev->usb_lock){+.+.+.}, at: [<c8872eed>] vidioc_streamoff+0x4e/0xb4 [gspca_main]                                                                 
 #2:  (&ohci->lock){-.-...}, at: [<c11c8093>] ohci_endpoint_disable+0x31/0x192                                                                                 
 #3:  (&mm->mmap_sem){++++++}, at: [<c100c168>] do_page_fault+0xc1/0x1fe                                                                                       
irq event stamp: 11656                                                                                                                                         
hardirqs last  enabled at (11655): [<c12e41a0>] _spin_unlock_irq+0x22/0x26                                                                                     
hardirqs last disabled at (11656): [<c12e41da>] _spin_lock_irqsave+0x10/0x5a                                                                                   
softirqs last  enabled at (11610): [<c101a87f>] __do_softirq+0x145/0x14d                                                                                       
softirqs last disabled at (11605): [<c101a8b1>] do_softirq+0x2a/0x42                                                                                           
Pid: 1178, comm: capture-example Not tainted 2.6.31.5 #2                                                                                                       
Call Trace:                                                                                                                                                    
 [<c101222d>] __might_sleep+0xcb/0xd0                                                                                                                          
 [<c100c1ad>] do_page_fault+0x106/0x1fe                                                                                                                        
 [<c100c0a7>] ? do_page_fault+0x0/0x1fe                                                                                                                        
 [<c12e43c3>] error_code+0x63/0x70                                                                                                                             
 [<c100c0a7>] ? do_page_fault+0x0/0x1fe                                                                                                                        
 [<c11c5cef>] ? td_free+0x23/0x75                                                                                                                              
 [<c11c8175>] ohci_endpoint_disable+0x113/0x192                                                                                                                
 [<c11b4428>] usb_hcd_disable_endpoint+0x2e/0x32                                                                                                               
 [<c11b5b3f>] usb_disable_endpoint+0x6d/0x72                                                                                                                   
 [<c11b5cae>] usb_disable_interface+0x30/0x3f                                                                                                                  
 [<c11b70ac>] usb_set_interface+0x11b/0x1a0                                                                                                                    
 [<c8872e1d>] gspca_set_alt0+0x23/0x46 [gspca_main]                                                                                                            
 [<c8872e75>] gspca_stream_off+0x35/0x5f [gspca_main]                                                                                                          
 [<c8872ef8>] vidioc_streamoff+0x59/0xb4 [gspca_main]                                                                                                          
 [<c8817244>] __video_do_ioctl+0x17af/0x3920 [videodev]                                                                                                        
 [<c1032fa1>] ? __lock_acquire+0x6ef/0x755                                                                                                                     
 [<c102f436>] ? lock_release_holdtime+0x81/0x86                                                                                                                
 [<c103315c>] ? lock_release_non_nested+0xab/0x1cf                                                                                                             
 [<c105382f>] ? might_fault+0x3d/0x79                                                                                                                          
 [<c105382f>] ? might_fault+0x3d/0x79                                                                                                                          
 [<c11123d4>] ? copy_from_user+0x31/0x54                                                                                                                       
 [<c88196b8>] video_ioctl2+0x303/0x3ea [videodev]                                                                                                              
 [<c102f436>] ? lock_release_holdtime+0x81/0x86                                                                                                                
 [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c                                                                                                              
 [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b                                                                                                           
 [<c103089e>] ? trace_hardirqs_on+0xb/0xd                                                                                                                      
 [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev]                                                                                                              
 [<c88156d8>] v4l2_unlocked_ioctl+0x2e/0x32 [videodev]                                                                                                         
 [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev]                                                                                                        
 [<c106dd91>] vfs_ioctl+0x19/0x50                                                                                                                              
 [<c106e36b>] do_vfs_ioctl+0x458/0x4a3                                                                                                                         
 [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                                                                                                        
 [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                                                                                                          
 [<c1152d69>] ? n_tty_write+0x0/0x2e6                                                                                                                          
 [<c1150a6b>] ? tty_write+0x0/0x1c2                                                                                                                            
 [<c106431d>] ? vfs_write+0xe3/0xfa                                                                                                                            
 [<c1002858>] ? restore_all_notrace+0x0/0x18                                                                                                                   
 [<c106e3e2>] sys_ioctl+0x2c/0x45                                                                                                                              
 [<c1002825>] syscall_call+0x7/0xb                                                                                                                             
BUG: unable to handle kernel paging request at a7a7a7c3                                                                                                        
IP: [<c11c5cef>] td_free+0x23/0x75                                                                                                                             
*pde = 00000000                                                                                                                                                
Oops: 0000 [#1] DEBUG_PAGEALLOC                                                                                                                                
last sysfs file:                                                                                                                                               
Modules linked in: gspca_pac207 gspca_main videodev v4l1_compat                                                                                                
                                                                                                                                                               
Pid: 1178, comm: capture-example Not tainted (2.6.31.5 #2)                                                                                                     
EIP: 0060:[<c11c5cef>] EFLAGS: 00000083 CPU: 0                                                                                                                 
EIP is at td_free+0x23/0x75                                                                                                                                    
EAX: a7a7a7a7 EBX: c6b35bf0 ECX: c6b35ce4 EDX: a7a7a7c3                                                                                                        
ESI: c6b7d800 EDI: c6b35cd4 EBP: c6785cc4 ESP: c6785cb8                                                                                                        
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068                                                                                                                  
Process capture-example (pid: 1178, ti=c6784000 task=c678d338 task.ti=c6784000)                                                                                
Stack:                                                                                                                                                         
 c6b35bf0 000003e8 c6b35cd4 c6785cf0 c11c8175 c6ba2ea0 c6b35bf0 00000000                                                                                       
<0> c6b35bf0 00000292 c6b7c040 c6b35bf0 c6ba2ea0 c6b99ed8 c6785d00 c11b4428                                                                                    
<0> c6b32bf0 c6ba2ea0 c6785d14 c11b5b3f 01ba3df0 000000dc 00000005 c6785d30                                                                                    
Call Trace:                                                                                                                                                    
 [<c11c8175>] ? ohci_endpoint_disable+0x113/0x192                                                                                                              
 [<c11b4428>] ? usb_hcd_disable_endpoint+0x2e/0x32                                                                                                             
 [<c11b5b3f>] ? usb_disable_endpoint+0x6d/0x72                                                                                                                 
 [<c11b5cae>] ? usb_disable_interface+0x30/0x3f                                                                                                                
 [<c11b70ac>] ? usb_set_interface+0x11b/0x1a0                                                                                                                  
 [<c8872e1d>] ? gspca_set_alt0+0x23/0x46 [gspca_main]                                                                                                          
 [<c8872e75>] ? gspca_stream_off+0x35/0x5f [gspca_main]                                                                                                        
 [<c8872ef8>] ? vidioc_streamoff+0x59/0xb4 [gspca_main]                                                                                                        
 [<c8817244>] ? __video_do_ioctl+0x17af/0x3920 [videodev]                                                                                                      
 [<c1032fa1>] ? __lock_acquire+0x6ef/0x755                                                                                                                     
 [<c102f436>] ? lock_release_holdtime+0x81/0x86                                                                                                                
 [<c103315c>] ? lock_release_non_nested+0xab/0x1cf                                                                                                             
 [<c105382f>] ? might_fault+0x3d/0x79                                                                                                                          
 [<c105382f>] ? might_fault+0x3d/0x79                                                                                                                          
 [<c11123d4>] ? copy_from_user+0x31/0x54                                                                                                                       
 [<c88196b8>] ? video_ioctl2+0x303/0x3ea [videodev]                                                                                                            
 [<c102f436>] ? lock_release_holdtime+0x81/0x86                                                                                                                
 [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c                                                                                                              
 [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b                                                                                                           
 [<c103089e>] ? trace_hardirqs_on+0xb/0xd                                                                                                                      
 [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev]                                                                                                              
 [<c88156d8>] ? v4l2_unlocked_ioctl+0x2e/0x32 [videodev]                                                                                                       
 [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev]                                                                                                        
 [<c106dd91>] ? vfs_ioctl+0x19/0x50                                                                                                                            
 [<c106e36b>] ? do_vfs_ioctl+0x458/0x4a3                                                                                                                       
 [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                                                                                                        
 [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                                                                                                          
 [<c1152d69>] ? n_tty_write+0x0/0x2e6                                                                                                                          
 [<c1150a6b>] ? tty_write+0x0/0x1c2                                                                                                                            
 [<c106431d>] ? vfs_write+0xe3/0xfa                                                                                                                            
 [<c1002858>] ? restore_all_notrace+0x0/0x18                                                                                                                   
 [<c106e3e2>] ? sys_ioctl+0x2c/0x45                                                                                                                            
 [<c1002825>] ? syscall_call+0x7/0xb                                                                                                                           
Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 c1 ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 0b 39  
EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8                                                                                                       
CR2: 00000000a7a7a7c3                                                                                                                                          
---[ end trace 6303a49aef2d5a94 ]---
Comment 1 Sean Lazar 2009-11-09 09:15:56 UTC
This bug is related to bug 14565.
Comment 2 Sean Lazar 2009-11-09 09:24:45 UTC
Created attachment 23709 [details]
kernel config
Comment 3 Andrew Morton 2009-11-11 23:21:30 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

(lotsa cc's added)

On Mon, 9 Nov 2009 08:59:05 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14564
> 
>            Summary: capture-example sleeping function called from invalid
>                     context at arch/x86/mm/fault.c
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 2.6.31.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Slab Allocator
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: knife@toaster.net
>         Regression: No
> 

Thhis is odd.

> On a DM&P ebox2300sx, 300Mhz Vortex86 cpu, I have a vanilla 2.6.31.5 kernel
> with a pac207 webcam. I run capture-example from the v4l-dvb sample
> applications and it crashes 1 out of 5 times. Let me know if I need to
> collect
> more data or try anything.
> 
> [root@X-Linux]:~ # capture-example                                            
> ......................................................................BUG:
> sleeping function called from invalid context at arch/x86/mm/fault.c:1069     
> in_atomic(): 0, irqs_disabled(): 1, pid: 1178, name: capture-example          
> 4 locks held by capture-example/1178:                                         
>  #0:  (&gspca_dev->queue_lock){+.+.+.}, at: [<c8872eda>]
> vidioc_streamoff+0x3b/0xb4 [gspca_main]                                       
>  #1:  (&gspca_dev->usb_lock){+.+.+.}, at: [<c8872eed>]
> vidioc_streamoff+0x4e/0xb4 [gspca_main]                                       
>  #2:  (&ohci->lock){-.-...}, at: [<c11c8093>]
>  ohci_endpoint_disable+0x31/0x192  
>  #3:  (&mm->mmap_sem){++++++}, at: [<c100c168>] do_page_fault+0xc1/0x1fe      
> irq event stamp: 11656                                                        
> hardirqs last  enabled at (11655): [<c12e41a0>] _spin_unlock_irq+0x22/0x26    
> hardirqs last disabled at (11656): [<c12e41da>] _spin_lock_irqsave+0x10/0x5a  
> softirqs last  enabled at (11610): [<c101a87f>] __do_softirq+0x145/0x14d      
> softirqs last disabled at (11605): [<c101a8b1>] do_softirq+0x2a/0x42          
> Pid: 1178, comm: capture-example Not tainted 2.6.31.5 #2                      
> Call Trace:                                                                   
>  [<c101222d>] __might_sleep+0xcb/0xd0                                         
>  [<c100c1ad>] do_page_fault+0x106/0x1fe                                       
>  [<c100c0a7>] ? do_page_fault+0x0/0x1fe                                       
>  [<c12e43c3>] error_code+0x63/0x70                                            
>  [<c100c0a7>] ? do_page_fault+0x0/0x1fe                                       
>  [<c11c5cef>] ? td_free+0x23/0x75                                             
>  [<c11c8175>] ohci_endpoint_disable+0x113/0x192                               
>  [<c11b4428>] usb_hcd_disable_endpoint+0x2e/0x32                              
>  [<c11b5b3f>] usb_disable_endpoint+0x6d/0x72                                  
>  [<c11b5cae>] usb_disable_interface+0x30/0x3f                                 
>  [<c11b70ac>] usb_set_interface+0x11b/0x1a0                                   
>  [<c8872e1d>] gspca_set_alt0+0x23/0x46 [gspca_main]                           
>  [<c8872e75>] gspca_stream_off+0x35/0x5f [gspca_main]                         
>  [<c8872ef8>] vidioc_streamoff+0x59/0xb4 [gspca_main]                         
>  [<c8817244>] __video_do_ioctl+0x17af/0x3920 [videodev]                       
>  [<c1032fa1>] ? __lock_acquire+0x6ef/0x755                                    
>  [<c102f436>] ? lock_release_holdtime+0x81/0x86                               
>  [<c103315c>] ? lock_release_non_nested+0xab/0x1cf                            
>  [<c105382f>] ? might_fault+0x3d/0x79                                         
>  [<c105382f>] ? might_fault+0x3d/0x79                                         
>  [<c11123d4>] ? copy_from_user+0x31/0x54                                      
>  [<c88196b8>] video_ioctl2+0x303/0x3ea [videodev]                             
>  [<c102f436>] ? lock_release_holdtime+0x81/0x86                               
>  [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c                             
>  [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b                          
>  [<c103089e>] ? trace_hardirqs_on+0xb/0xd                                     
>  [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev]                             
>  [<c88156d8>] v4l2_unlocked_ioctl+0x2e/0x32 [videodev]                        
>  [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev]                       
>  [<c106dd91>] vfs_ioctl+0x19/0x50                                             
>  [<c106e36b>] do_vfs_ioctl+0x458/0x4a3                                        
>  [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                       
>  [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                         
>  [<c1152d69>] ? n_tty_write+0x0/0x2e6                                         
>  [<c1150a6b>] ? tty_write+0x0/0x1c2                                           
>  [<c106431d>] ? vfs_write+0xe3/0xfa                                           
>  [<c1002858>] ? restore_all_notrace+0x0/0x18                                  
>  [<c106e3e2>] sys_ioctl+0x2c/0x45                                             
>  [<c1002825>] syscall_call+0x7/0xb                                            

We oopsed in td_free() (see below).  But as part of that oops
processing the kernel entered do_page_fault() and emitted a
might_sleep() warning because we took a pagefault with local interrupts
disabled.

This is undesirable behaviour from the low-level x86 fault code and I
don't think it normally happens.

Did we break something in x86 land, or is this oops sufficiently weird
and whacky to bypass existing checks for this false positive?

> BUG: unable to handle kernel paging request at a7a7a7c3                       
> IP: [<c11c5cef>] td_free+0x23/0x75                                            
> *pde = 00000000                                                               
> Oops: 0000 [#1] DEBUG_PAGEALLOC                                               
> last sysfs file:                                                              
> Modules linked in: gspca_pac207 gspca_main videodev v4l1_compat               
> 
> Pid: 1178, comm: capture-example Not tainted (2.6.31.5 #2)                    
> EIP: 0060:[<c11c5cef>] EFLAGS: 00000083 CPU: 0                                
> EIP is at td_free+0x23/0x75                                                   
> EAX: a7a7a7a7 EBX: c6b35bf0 ECX: c6b35ce4 EDX: a7a7a7c3                       
> ESI: c6b7d800 EDI: c6b35cd4 EBP: c6785cc4 ESP: c6785cb8                       
>  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068                                 
> Process capture-example (pid: 1178, ti=c6784000 task=c678d338
> task.ti=c6784000) 
> Stack:                                                                        
>  c6b35bf0 000003e8 c6b35cd4 c6785cf0 c11c8175 c6ba2ea0 c6b35bf0 00000000      
> <0> c6b35bf0 00000292 c6b7c040 c6b35bf0 c6ba2ea0 c6b99ed8 c6785d00 c11b4428   
> <0> c6b32bf0 c6ba2ea0 c6785d14 c11b5b3f 01ba3df0 000000dc 00000005 c6785d30   
> Call Trace:                                                                   
>  [<c11c8175>] ? ohci_endpoint_disable+0x113/0x192                             
>  [<c11b4428>] ? usb_hcd_disable_endpoint+0x2e/0x32                            
>  [<c11b5b3f>] ? usb_disable_endpoint+0x6d/0x72                                
>  [<c11b5cae>] ? usb_disable_interface+0x30/0x3f                               
>  [<c11b70ac>] ? usb_set_interface+0x11b/0x1a0                                 
>  [<c8872e1d>] ? gspca_set_alt0+0x23/0x46 [gspca_main]                         
>  [<c8872e75>] ? gspca_stream_off+0x35/0x5f [gspca_main]                       
>  [<c8872ef8>] ? vidioc_streamoff+0x59/0xb4 [gspca_main]                       
>  [<c8817244>] ? __video_do_ioctl+0x17af/0x3920 [videodev]                     
>  [<c1032fa1>] ? __lock_acquire+0x6ef/0x755                                    
>  [<c102f436>] ? lock_release_holdtime+0x81/0x86                               
>  [<c103315c>] ? lock_release_non_nested+0xab/0x1cf                            
>  [<c105382f>] ? might_fault+0x3d/0x79                                         
>  [<c105382f>] ? might_fault+0x3d/0x79                                         
>  [<c11123d4>] ? copy_from_user+0x31/0x54                                      
>  [<c88196b8>] ? video_ioctl2+0x303/0x3ea [videodev]                           
>  [<c102f436>] ? lock_release_holdtime+0x81/0x86                               
>  [<c12e430e>] ? _spin_unlock_irqrestore+0x36/0x3c                             
>  [<c103086c>] ? trace_hardirqs_on_caller+0x104/0x12b                          
>  [<c103089e>] ? trace_hardirqs_on+0xb/0xd                                     
>  [<c88193b5>] ? video_ioctl2+0x0/0x3ea [videodev]                             
>  [<c88156d8>] ? v4l2_unlocked_ioctl+0x2e/0x32 [videodev]                      
>  [<c88156aa>] ? v4l2_unlocked_ioctl+0x0/0x32 [videodev]                       
>  [<c106dd91>] ? vfs_ioctl+0x19/0x50                                           
>  [<c106e36b>] ? do_vfs_ioctl+0x458/0x4a3                                      
>  [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                       
>  [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                         
>  [<c1152d69>] ? n_tty_write+0x0/0x2e6                                         
>  [<c1150a6b>] ? tty_write+0x0/0x1c2                                           
>  [<c106431d>] ? vfs_write+0xe3/0xfa                                           
>  [<c1002858>] ? restore_all_notrace+0x0/0x18                                  
>  [<c106e3e2>] ? sys_ioctl+0x2c/0x45                                           
>  [<c1002825>] ? syscall_call+0x7/0xb                                          
> Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2 c1
> ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74 0b
> 39 
> EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8                      
> CR2: 00000000a7a7a7c3                                                         

And here's the real oops.  drivers/usb/host/ohci-mem.c:td_free()
dereferenced a7a7a7c3.  Which looks like

/********** drivers/base/dmapool.c **********/
#define	POOL_POISON_FREED	0xa7	/* !inuse */
#define	POOL_POISON_ALLOCATED	0xa9	/* !initted */
Comment 4 Alan Stern 2009-11-12 16:20:33 UTC
On Wed, 11 Nov 2009, Andrew Morton wrote:

> > http://bugzilla.kernel.org/show_bug.cgi?id=14564
> > 
> >            Summary: capture-example sleeping function called from invalid
> >                     context at arch/x86/mm/fault.c

> We oopsed in td_free() (see below).  But as part of that oops
> processing the kernel entered do_page_fault() and emitted a
> might_sleep() warning because we took a pagefault with local interrupts
> disabled.
> 
> This is undesirable behaviour from the low-level x86 fault code and I
> don't think it normally happens.
> 
> Did we break something in x86 land, or is this oops sufficiently weird
> and whacky to bypass existing checks for this false positive?

No, what happened was a structure containing a linked-list entry got
freed while it was still on the list.  Then when the driver walked
through the list, it attempted to dereference a list pointer that had
been poisoned.  More or less by coincidence, the poison value
represented a paged-out address rather than an invalid address, so a
page fault occurred.  That's what caused the oops.

> > BUG: unable to handle kernel paging request at a7a7a7c3                     
> > IP: [<c11c5cef>] td_free+0x23/0x75                                          

> >  [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                     
> >  [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                       
> >  [<c1152d69>] ? n_tty_write+0x0/0x2e6                                       
> >  [<c1150a6b>] ? tty_write+0x0/0x1c2                                         
> >  [<c106431d>] ? vfs_write+0xe3/0xfa                                         
> >  [<c1002858>] ? restore_all_notrace+0x0/0x18                                
> >  [<c106e3e2>] ? sys_ioctl+0x2c/0x45                                         
> >  [<c1002825>] ? syscall_call+0x7/0xb                                        
> > Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2
> c1
> > ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74
> 0b 39 
> > EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8                    
> > CR2: 00000000a7a7a7c3                                                       
> 
> And here's the real oops.  drivers/usb/host/ohci-mem.c:td_free()
> dereferenced a7a7a7c3.  Which looks like
> 
> /********** drivers/base/dmapool.c **********/
> #define       POOL_POISON_FREED       0xa7    /* !inuse */
> #define       POOL_POISON_ALLOCATED   0xa9    /* !initted */

If I'm reading this correctly, the bad dereference occurred in the
second source line:

		prev = &(*prev)->td_hash;
	if (*prev)

The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is
0x1c, causing the offending address to be 0xa7a7a7c3.

I have no idea why a struct td would have been freed while it was still 
in use.

Alan Stern
Comment 5 Sean Lazar 2009-12-03 05:53:28 UTC
Is there anything I can do to help? This is a show stopping bug for me.

Thanks,
Sean Lazar

Alan Stern wrote:
> On Wed, 11 Nov 2009, Andrew Morton wrote:
>
>   
>>> http://bugzilla.kernel.org/show_bug.cgi?id=14564
>>>
>>>            Summary: capture-example sleeping function called from invalid
>>>                     context at arch/x86/mm/fault.c
>>>       
>
>   
>> We oopsed in td_free() (see below).  But as part of that oops
>> processing the kernel entered do_page_fault() and emitted a
>> might_sleep() warning because we took a pagefault with local interrupts
>> disabled.
>>
>> This is undesirable behaviour from the low-level x86 fault code and I
>> don't think it normally happens.
>>
>> Did we break something in x86 land, or is this oops sufficiently weird
>> and whacky to bypass existing checks for this false positive?
>>     
>
> No, what happened was a structure containing a linked-list entry got
> freed while it was still on the list.  Then when the driver walked
> through the list, it attempted to dereference a list pointer that had
> been poisoned.  More or less by coincidence, the poison value
> represented a paged-out address rather than an invalid address, so a
> page fault occurred.  That's what caused the oops.
>
>   
>>> BUG: unable to handle kernel paging request at a7a7a7c3                     
>>> IP: [<c11c5cef>] td_free+0x23/0x75                                          
>>>       
>
>   
>>>  [<c1155a42>] ? tty_ldisc_deref+0x8/0xa                                     
>>>  [<c1150c1c>] ? tty_write+0x1b1/0x1c2                                       
>>>  [<c1152d69>] ? n_tty_write+0x0/0x2e6                                       
>>>  [<c1150a6b>] ? tty_write+0x0/0x1c2                                         
>>>  [<c106431d>] ? vfs_write+0xe3/0xfa                                         
>>>  [<c1002858>] ? restore_all_notrace+0x0/0x18                                
>>>  [<c106e3e2>] ? sys_ioctl+0x2c/0x45                                         
>>>  [<c1002825>] ? syscall_call+0x7/0xb                                        
>>> Code: e5 e8 bf 7b e9 ff 5d c3 55 89 e5 57 89 c7 56 89 d6 53 8b 42 28 89 c2
>>> c1
>>> ea 06 31 d0 83 e0 3f 8d 94 87 cc 00 00 00 eb 03 8d 50 1c <8b> 02 85 c0 74
>>> 0b 39 
>>> EIP: [<c11c5cef>] td_free+0x23/0x75 SS:ESP 0068:c6785cb8                    
>>> CR2: 00000000a7a7a7c3                                                       
>>>       
>> And here's the real oops.  drivers/usb/host/ohci-mem.c:td_free()
>> dereferenced a7a7a7c3.  Which looks like
>>
>> /********** drivers/base/dmapool.c **********/
>> #define      POOL_POISON_FREED       0xa7    /* !inuse */
>> #define      POOL_POISON_ALLOCATED   0xa9    /* !initted */
>>     
>
> If I'm reading this correctly, the bad dereference occurred in the
> second source line:
>
>               prev = &(*prev)->td_hash;
>       if (*prev)
>
> The original value in *prev was 0xa7a7a7a7 and the offset of td_hash is
> 0x1c, causing the offending address to be 0xa7a7a7c3.
>
> I have no idea why a struct td would have been freed while it was still 
> in use.
>
> Alan Stern
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
Comment 6 Alan Stern 2009-12-03 21:03:50 UTC
On Wed, 2 Dec 2009, Sean wrote:

> Is there anything I can do to help? This is a show stopping bug for me.

Here's a patch you can try.  It will add a _lot_ of debugging
information to the system log.  Maybe it will help pin down the source
of the problem.

Alan Stern



Index: 2.6.31/drivers/usb/host/ohci-hcd.c
===================================================================
--- 2.6.31.orig/drivers/usb/host/ohci-hcd.c
+++ 2.6.31/drivers/usb/host/ohci-hcd.c
@@ -197,7 +197,7 @@ static int ohci_urb_enqueue (
 
 	/* allocate the TDs (deferring hash chain updates) */
 	for (i = 0; i < size; i++) {
-		urb_priv->td [i] = td_alloc (ohci, mem_flags);
+		urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep);
 		if (!urb_priv->td [i]) {
 			urb_priv->length = i;
 			urb_free_priv (ohci, urb_priv);
Index: 2.6.31/drivers/usb/host/ohci-mem.c
===================================================================
--- 2.6.31.orig/drivers/usb/host/ohci-mem.c
+++ 2.6.31/drivers/usb/host/ohci-mem.c
@@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr
 
 /* TDs ... */
 static struct td *
-td_alloc (struct ohci_hcd *hc, gfp_t mem_flags)
+td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev,
+	struct usb_host_endpoint *ep)
 {
 	dma_addr_t	dma;
 	struct td	*td;
@@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 		td->hwNextTD = cpu_to_hc32 (hc, dma);
 		td->td_dma = dma;
 		/* hashed in td_fill */
+		ohci_info(hc, "td alloc for %s ep%x: %p\n",
+			udev->devpath, ep->desc.bEndpointAddress, td);
 	}
 	return td;
 }
@@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td 
 {
 	struct td	**prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
 
-	while (*prev && *prev != td)
+	ohci_info(hc, "td free %p\n", td);
+	while (*prev && *prev != td) {
+		if ((unsigned long) *prev == 0xa7a7a7a7) {
+			ohci_info(hc, "poisoned hash at %p\n", prev);
+			return;
+		}
 		prev = &(*prev)->td_hash;
+	}
 	if (*prev)
 		*prev = td->td_hash;
 	else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
Index: 2.6.31/drivers/usb/host/ohci-q.c
===================================================================
--- 2.6.31.orig/drivers/usb/host/ohci-q.c
+++ 2.6.31/drivers/usb/host/ohci-q.c
@@ -403,7 +403,7 @@ static struct ed *ed_get (
 		}
 
 		/* dummy td; end of td list for ed */
-		td = td_alloc (ohci, GFP_ATOMIC);
+		td = td_alloc (ohci, GFP_ATOMIC, udev, ep);
 		if (!td) {
 			/* out of memory */
 			ed_free (ohci, ed);
Comment 7 Sean Lazar 2009-12-18 04:41:48 UTC
Thanks, that patch definitely traps the bug. Unfortunately there are so 
many debug messages that the capture-example.c times out trying to 
connect to the webcam. The debug messages slow down the process enough 
to make that happen. But if I modify your patch and take out the extra 
debug messages, it works well. The modified patch is below.

Reproducing the bug in four separate instances I got:
ohci_hcd 0000:00:0a.0: poisoned hash at c67a285c 
ohci_hcd 0000:00:0a.0: poisoned hash at c67b875c
ohci_hcd 0000:00:0a.0: poisoned hash at c67a179c
ohci_hcd 0000:00:0a.0: poisoned hash at c679c79c

Sean Lazar

--- ohci-mem.c.orig    2009-12-16 22:57:49.000000000 +0000
+++ ohci-mem.c    2009-12-16 22:49:37.000000000 +0000
@@ -103,8 +103,13 @@
 {
     struct td    **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
 
-    while (*prev && *prev != td)
+    while (*prev && *prev != td) {
+        if ((unsigned long) *prev == 0xa7a7a7a7) {
+            ohci_info(hc, "poisoned hash at %p\n", prev);
+            return;
+        }
         prev = &(*prev)->td_hash;
+    }
     if (*prev)
         *prev = td->td_hash;
     else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
 

Alan Stern wrote:
> On Wed, 2 Dec 2009, Sean wrote:
>
>   
>> Is there anything I can do to help? This is a show stopping bug for me.
>>     
>
> Here's a patch you can try.  It will add a _lot_ of debugging
> information to the system log.  Maybe it will help pin down the source
> of the problem.
>
> Alan Stern
>
>
>
> Index: 2.6.31/drivers/usb/host/ohci-hcd.c
> ===================================================================
> --- 2.6.31.orig/drivers/usb/host/ohci-hcd.c
> +++ 2.6.31/drivers/usb/host/ohci-hcd.c
> @@ -197,7 +197,7 @@ static int ohci_urb_enqueue (
>  
>       /* allocate the TDs (deferring hash chain updates) */
>       for (i = 0; i < size; i++) {
> -             urb_priv->td [i] = td_alloc (ohci, mem_flags);
> +             urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev,
> urb->ep);
>               if (!urb_priv->td [i]) {
>                       urb_priv->length = i;
>                       urb_free_priv (ohci, urb_priv);
> Index: 2.6.31/drivers/usb/host/ohci-mem.c
> ===================================================================
> --- 2.6.31.orig/drivers/usb/host/ohci-mem.c
> +++ 2.6.31/drivers/usb/host/ohci-mem.c
> @@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr
>  
>  /* TDs ... */
>  static struct td *
> -td_alloc (struct ohci_hcd *hc, gfp_t mem_flags)
> +td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev,
> +     struct usb_host_endpoint *ep)
>  {
>       dma_addr_t      dma;
>       struct td       *td;
> @@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
>               td->hwNextTD = cpu_to_hc32 (hc, dma);
>               td->td_dma = dma;
>               /* hashed in td_fill */
> +             ohci_info(hc, "td alloc for %s ep%x: %p\n",
> +                     udev->devpath, ep->desc.bEndpointAddress, td);
>       }
>       return td;
>  }
> @@ -103,8 +106,14 @@ td_free (struct ohci_hcd *hc, struct td 
>  {
>       struct td       **prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
>  
> -     while (*prev && *prev != td)
> +     ohci_info(hc, "td free %p\n", td);
> +     while (*prev && *prev != td) {
> +             if ((unsigned long) *prev == 0xa7a7a7a7) {
> +                     ohci_info(hc, "poisoned hash at %p\n", prev);
> +                     return;
> +             }
>               prev = &(*prev)->td_hash;
> +     }
>       if (*prev)
>               *prev = td->td_hash;
>       else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
> Index: 2.6.31/drivers/usb/host/ohci-q.c
> ===================================================================
> --- 2.6.31.orig/drivers/usb/host/ohci-q.c
> +++ 2.6.31/drivers/usb/host/ohci-q.c
> @@ -403,7 +403,7 @@ static struct ed *ed_get (
>               }
>  
>               /* dummy td; end of td list for ed */
> -             td = td_alloc (ohci, GFP_ATOMIC);
> +             td = td_alloc (ohci, GFP_ATOMIC, udev, ep);
>               if (!td) {
>                       /* out of memory */
>                       ed_free (ohci, ed);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
Comment 8 Alan Stern 2009-12-18 14:47:30 UTC
On Wed, 16 Dec 2009, Sean wrote:

> Thanks, that patch definitely traps the bug. Unfortunately there are so 
> many debug messages that the capture-example.c times out trying to 
> connect to the webcam. The debug messages slow down the process enough 
> to make that happen. But if I modify your patch and take out the extra 
> debug messages, it works well. The modified patch is below.

The patch doesn't fix anything.  The point was to gather enough 
information to figure out what's going wrong.  Without the debug 
messages, there's no information.

Perhaps things will slow down less if you change the new ohci_info() 
calls in the patch to ohci_dbg().  Or perhaps you can increase the 
timeout values in capture-example.c.

You should also apply this patch (be sure to enable CONFIG_USB_DEBUG):

	http://marc.info/?l=linux-usb&m=126056642931083&w=2

It probably won't make any difference, but including it anyway is
worthwhile.

Alan Stern
Comment 9 Sean Lazar 2009-12-29 09:22:44 UTC
Alan Stern wrote:
> The patch doesn't fix anything.  The point was to gather enough 
> information to figure out what's going wrong.  Without the debug 
> messages, there's no information.
>
> Perhaps things will slow down less if you change the new ohci_info() 
> calls in the patch to ohci_dbg().  Or perhaps you can increase the 
> timeout values in capture-example.c.
>
> You should also apply this patch (be sure to enable CONFIG_USB_DEBUG):
>
>       http://marc.info/?l=linux-usb&m=126056642931083&w=2
>
> It probably won't make any difference, but including it anyway is
> worthwhile.
>
> Alan Stern
>   
The early return in td_free that is in the patch will trap the error.

I changed the debug statements to ohci_dbg and I was able to capture the 
full output with klogd. It is attached.

Sean
Comment 10 Sean Lazar 2009-12-29 09:36:28 UTC
Created attachment 24341 [details]
kernel.log with the debug patch
Comment 11 Sean Lazar 2009-12-29 19:50:55 UTC
> Alan Stern wrote:
>> The patch doesn't fix anything.  The point was to gather enough 
>> information to figure out what's going wrong.  Without the debug 
>> messages, there's no information.
>>
>> Perhaps things will slow down less if you change the new ohci_info() 
>> calls in the patch to ohci_dbg().  Or perhaps you can increase the 
>> timeout values in capture-example.c.
>>
>> You should also apply this patch (be sure to enable CONFIG_USB_DEBUG):
>>
>>     http://marc.info/?l=linux-usb&m=126056642931083&w=2
>>
>> It probably won't make any difference, but including it anyway is
>> worthwhile.
>>
>> Alan Stern
>>   
> The early return in td_free that is in the patch will trap the error.
>
> I changed the debug statements to ohci_dbg and I was able to capture 
> the full output with klogd. It is attached.
>
> Sean
Sean wrote:

In looking at the log file it seems that there is a mismatch of td_alloc 
calls and td_free calls.

Sean
Comment 12 Alan Stern 2009-12-29 21:23:50 UTC
On Tue, 29 Dec 2009, Sean wrote:

> > I changed the debug statements to ohci_dbg and I was able to capture 
> > the full output with klogd. It is attached.

You did not get the full output.  That's why I tell people to use dmesg
instead of klogd or the kernel log files.  If necessary you can always
increase the size of the dmesg log buffer by changing
CONFIG_LOG_BUF_SHIFT.  I tend to set it to 18; you might want to go 
even higher.

> In looking at the log file it seems that there is a mismatch of td_alloc 
> calls and td_free calls.

That's because you didn't get the full output.  Here's an enhanced 
version of the patch.  It will provide more information and perhaps a 
smoking gun.

Alan Stern


Index: usb-2.6/drivers/usb/host/ohci-hcd.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hcd.c
+++ usb-2.6/drivers/usb/host/ohci-hcd.c
@@ -201,7 +201,7 @@ static int ohci_urb_enqueue (
 
 	/* allocate the TDs (deferring hash chain updates) */
 	for (i = 0; i < size; i++) {
-		urb_priv->td [i] = td_alloc (ohci, mem_flags);
+		urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep);
 		if (!urb_priv->td [i]) {
 			urb_priv->length = i;
 			urb_free_priv (ohci, urb_priv);
Index: usb-2.6/drivers/usb/host/ohci-mem.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-mem.c
+++ usb-2.6/drivers/usb/host/ohci-mem.c
@@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr
 
 /* TDs ... */
 static struct td *
-td_alloc (struct ohci_hcd *hc, gfp_t mem_flags)
+td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev,
+	struct usb_host_endpoint *ep)
 {
 	dma_addr_t	dma;
 	struct td	*td;
@@ -94,6 +95,8 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 		td->hwNextTD = cpu_to_hc32 (hc, dma);
 		td->td_dma = dma;
 		/* hashed in td_fill */
+		ohci_dbg(hc, "td alloc for %s ep%x: %p\n",
+			udev->devpath, ep->desc.bEndpointAddress, td);
 	}
 	return td;
 }
@@ -101,14 +104,27 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 static void
 td_free (struct ohci_hcd *hc, struct td *td)
 {
-	struct td	**prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
-
-	while (*prev && *prev != td)
+	int		hash = TD_HASH_FUNC(td->td_dma);
+	struct td	**prev = &hc->td_hash[hash];
+	int		n = 0;
+
+	ohci_dbg(hc, "td free %p\n", td);
+	while (*prev && *prev != td) {
+		if ((unsigned long) *prev == 0xa7a7a7a7) {
+			ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev,
+					hash, n, hc->td_hash[hash]);
+			return;
+		}
 		prev = &(*prev)->td_hash;
-	if (*prev)
+		++n;
+	}
+	if (*prev) {
 		*prev = td->td_hash;
-	else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
-		ohci_dbg (hc, "no hash for td %p\n", td);
+		ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev);
+	} else {
+		ohci_dbg(hc, "no hash for td %p: %d %p\n", td,
+				hash, hc->td_hash[hash]);
+	}
 	dma_pool_free (hc->td_cache, td, td->td_dma);
 }
 
Index: usb-2.6/drivers/usb/host/ohci-q.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-q.c
+++ usb-2.6/drivers/usb/host/ohci-q.c
@@ -406,7 +406,7 @@ static struct ed *ed_get (
 		}
 
 		/* dummy td; end of td list for ed */
-		td = td_alloc (ohci, GFP_ATOMIC);
+		td = td_alloc (ohci, GFP_ATOMIC, udev, ep);
 		if (!td) {
 			/* out of memory */
 			ed_free (ohci, ed);
@@ -560,6 +560,7 @@ td_fill (struct ohci_hcd *ohci, u32 info
 	hash = TD_HASH_FUNC (td->td_dma);
 	td->td_hash = ohci->td_hash [hash];
 	ohci->td_hash [hash] = td;
+	ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash);
 
 	/* HC might read the TD (or cachelines) right away ... */
 	wmb ();
Comment 13 Sean Lazar 2009-12-30 00:39:15 UTC
Alan Stern wrote:
> You did not get the full output.  That's why I tell people to use dmesg
> instead of klogd or the kernel log files.  If necessary you can always
> increase the size of the dmesg log buffer by changing
> CONFIG_LOG_BUF_SHIFT.  I tend to set it to 18; you might want to go 
> even higher.
>
>
> That's because you didn't get the full output.  Here's an enhanced 
> version of the patch.  It will provide more information and perhaps a 
> smoking gun.
>
> Alan Stern
>   
Alan,

Thanks for the debug patch. I'll send you the dmesg.log output in 
another email. It is 2MB.

Sean
Comment 14 Sean Lazar 2009-12-30 00:45:39 UTC
Created attachment 24371 [details]
dmesg log after enhanced debug patch from Alan Stern
Comment 15 Sean Lazar 2009-12-30 00:57:33 UTC
Created attachment 24372 [details]
dmesg log after enhanced debug patch from Alan Stern
Comment 16 Alan Stern 2009-12-30 03:22:29 UTC
On Tue, 29 Dec 2009, Sean wrote:

> Alan,
> 
> Thanks for the debug patch. I'll send you the dmesg.log output in 
> another email. It is 2MB.

Got it.  It's not good enough; the initial error occurred before the 
start of your extract.  Here's yet another version of the patch; this 
one will stop printing the debug messages when an error is first 
detected so maybe it won't overrun your log buffer.

Alan Stern


Index: usb-2.6/drivers/usb/host/ohci-hcd.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hcd.c
+++ usb-2.6/drivers/usb/host/ohci-hcd.c
@@ -201,7 +201,7 @@ static int ohci_urb_enqueue (
 
 	/* allocate the TDs (deferring hash chain updates) */
 	for (i = 0; i < size; i++) {
-		urb_priv->td [i] = td_alloc (ohci, mem_flags);
+		urb_priv->td [i] = td_alloc (ohci, mem_flags, urb->dev, urb->ep);
 		if (!urb_priv->td [i]) {
 			urb_priv->length = i;
 			urb_free_priv (ohci, urb_priv);
@@ -580,6 +580,7 @@ static int ohci_run (struct ohci_hcd *oh
 	struct usb_hcd		*hcd = ohci_to_hcd(ohci);
 
 	disable (ohci);
+	ohci->testing = 1;
 
 	/* boot firmware should have set this up (5.1.1.3.1) */
 	if (first) {
Index: usb-2.6/drivers/usb/host/ohci-mem.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-mem.c
+++ usb-2.6/drivers/usb/host/ohci-mem.c
@@ -82,7 +82,8 @@ dma_to_td (struct ohci_hcd *hc, dma_addr
 
 /* TDs ... */
 static struct td *
-td_alloc (struct ohci_hcd *hc, gfp_t mem_flags)
+td_alloc (struct ohci_hcd *hc, gfp_t mem_flags, struct usb_device *udev,
+	struct usb_host_endpoint *ep)
 {
 	dma_addr_t	dma;
 	struct td	*td;
@@ -94,6 +95,9 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 		td->hwNextTD = cpu_to_hc32 (hc, dma);
 		td->td_dma = dma;
 		/* hashed in td_fill */
+		if (hc->testing)
+			ohci_dbg(hc, "td alloc for %s ep%x: %p\n",
+				udev->devpath, ep->desc.bEndpointAddress, td);
 	}
 	return td;
 }
@@ -101,14 +105,32 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 static void
 td_free (struct ohci_hcd *hc, struct td *td)
 {
-	struct td	**prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
-
-	while (*prev && *prev != td)
+	int		hash = TD_HASH_FUNC(td->td_dma);
+	struct td	**prev = &hc->td_hash[hash];
+	int		n = 0;
+
+	if (hc->testing)
+		ohci_dbg(hc, "td free %p\n", td);
+	while (*prev && *prev != td) {
+		if ((unsigned long) *prev == 0xa7a7a7a7) {
+			ohci_dbg(hc, "poisoned hash at %p (%d %d) %p\n", prev,
+					hash, n, hc->td_hash[hash]);
+			return;
+		}
 		prev = &(*prev)->td_hash;
-	if (*prev)
+		++n;
+	}
+	if (*prev) {
 		*prev = td->td_hash;
-	else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
-		ohci_dbg (hc, "no hash for td %p\n", td);
+		if (hc->testing) {
+			ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev);
+			if (td->td_hash == td)
+				hc->testing = 0;
+		}
+	} else {
+		ohci_dbg(hc, "no hash for td %p: %d %p\n", td,
+				hash, hc->td_hash[hash]);
+	}
 	dma_pool_free (hc->td_cache, td, td->td_dma);
 }
 
Index: usb-2.6/drivers/usb/host/ohci-q.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-q.c
+++ usb-2.6/drivers/usb/host/ohci-q.c
@@ -406,7 +406,7 @@ static struct ed *ed_get (
 		}
 
 		/* dummy td; end of td list for ed */
-		td = td_alloc (ohci, GFP_ATOMIC);
+		td = td_alloc (ohci, GFP_ATOMIC, udev, ep);
 		if (!td) {
 			/* out of memory */
 			ed_free (ohci, ed);
@@ -560,6 +560,11 @@ td_fill (struct ohci_hcd *ohci, u32 info
 	hash = TD_HASH_FUNC (td->td_dma);
 	td->td_hash = ohci->td_hash [hash];
 	ohci->td_hash [hash] = td;
+	if (ohci->testing) {
+		ohci_dbg(ohci, "hash %p to %d -> %p\n", td, hash, td->td_hash);
+		if (td->td_hash == td)
+			ohci->testing = 0;
+	}
 
 	/* HC might read the TD (or cachelines) right away ... */
 	wmb ();
Index: usb-2.6/drivers/usb/host/ohci.h
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci.h
+++ usb-2.6/drivers/usb/host/ohci.h
@@ -346,6 +346,7 @@ typedef struct urb_priv {
 
 struct ohci_hcd {
 	spinlock_t		lock;
+	int			testing;
 
 	/*
 	 * I/O memory used to communicate with the HC (dma-consistent)
Comment 17 Sean Lazar 2010-01-02 09:02:37 UTC
Alan Stern wrote:
> Got it.  It's not good enough; the initial error occurred before the 
> start of your extract.  Here's yet another version of the patch; this 
> one will stop printing the debug messages when an error is first 
> detected so maybe it won't overrun your log buffer.
>
> Alan Stern
>   
Alan,

Thanks again. I was able to get the full dmesg output this time. I ran 
capture-example three times and each time removing the webcam before 
capture-example finished. On the third time I got the poisoned hash 
message and I captured the output to a file. Attached is the dmesg output.

Sean
Comment 18 Sean Lazar 2010-01-02 09:07:42 UTC
Created attachment 24402 [details]
Second dmesg log. Complete dmesg output from boot to bug.
Comment 19 Alan Stern 2010-01-02 20:43:30 UTC
On Sat, 2 Jan 2010, Sean wrote:

> Alan,
> 
> Thanks again. I was able to get the full dmesg output this time. I ran 
> capture-example three times and each time removing the webcam before 
> capture-example finished. On the third time I got the poisoned hash 
> message and I captured the output to a file. Attached is the dmesg output.

Okay.  Take a look at the following output:

$ egrep -n '[2e]e(80|9c)' dmesg2.log
680:pci 0000:00:0c.0: reg 14 io port: [0xee80-0xee83]
727:kobject: 'ieee80211' (c79d5e1c): kobject_add_internal: parent: 
'class', set: 'class'
728:kobject: 'ieee80211' (c79d5e1c): kobject_uevent_env
729:kobject: 'ieee80211' (c79d5e1c): fill_kobj_path: path = 
'/class/ieee80211'
4994:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80
5027:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> (null)
5185:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80
5218:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80
5276:ohci_hcd 0000:00:0b.0: td free c6662e80
5277:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null)
5296:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80
5329:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80
5538:ohci_hcd 0000:00:0b.0: td free c676ee80
5539:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null)
5558:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80
5591:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80
5644:ohci_hcd 0000:00:0b.0: td free c6662e80
5645:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null)
5720:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80
5753:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80
5900:ohci_hcd 0000:00:0b.0: td free c676ee80
5901:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null)
5978:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80
6011:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80
6072:ohci_hcd 0000:00:0b.0: td free c6662e80
6073:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> (null)
6088:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80
6121:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80
6324:ohci_hcd 0000:00:0b.0: td free c676ee80
6325:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> (null)
6343:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676ee80
6376:ohci_hcd 0000:00:0b.0: hash c676ee80 to 58 -> c6662e80
6416:ohci_hcd 0000:00:0b.0: td free c6662e80
6417:ohci_hcd 0000:00:0b.0: (58 1) c676ee9c -> c676ee80
6492:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6662e80
6525:ohci_hcd 0000:00:0b.0: hash c6662e80 to 58 -> c676ee80
6686:ohci_hcd 0000:00:0b.0: td free c676ee80
6687:ohci_hcd 0000:00:0b.0: (58 1) c6662e9c -> c676ee80

Ignore the first few lines as being irrelevant.  Starting with line
5185 you can see that this shows two TDs being allocated, hashed,
freed, and unlinked over and over again, at addresses c6662e80 and
c676ee80.  When each one is hashed into the list, its td_hash member is
made to point to the other.  When each is removed from the hash list,
the other's td_hash member is set to NULL.

It's all very regular and repetitious until line 6417.  At that line,
the td_hash member of c676ee80 (which is at offset 1c from the start of
the structure, hence at c676ee9c) is made to point to its own
structure!  Thus later at line 6687, when c676ee80 is freed, the 
td_hash member of c6662e80 is set to point at the freed structure.  
This is what leads to poisoned pointer values later on.

So what went wrong at line 6417?  There's no way to know exactly.  My
guess is that the write at line 6325, where c6662e9c was supposed to be
set to NULL, never got recorded properly in the computer's memory.  
This would mean that c6662e9c still contained the c676ee80 value
assigned at line 6121, and hence the incorrect value was copied at line
6417.

In other words, I'm guessing that you're suffering from hardware memory
errors.  A possible way to test this is to modify the patch.  In
td_free() where it adds the line:

+			ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev, *prev);

instead add this code:

+			barrier();
+			ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n,
+					prev, *prev, td->td_hash);

If we find that the value of *prev differs from the value of
td->td_hash then we'll know for certain.  (Or maybe the presence of the 
barrier() will cause the object code to change in a way that prevents 
the error from occurring.)

Alan Stern
Comment 20 Sean Lazar 2010-01-03 01:59:03 UTC
Alan Stern wrote:
> In other words, I'm guessing that you're suffering from hardware memory
> errors.  A possible way to test this is to modify the patch.  In
> td_free() where it adds the line:
>
> +                     ohci_dbg(hc, "(%d %d) %p -> %p\n", hash, n, prev,
> *prev);
>
> instead add this code:
>
> +                     barrier();
> +                     ohci_dbg(hc, "(%d %d) %p -> %p [%p]\n", hash, n,
> +                                     prev, *prev, td->td_hash);
>
> If we find that the value of *prev differs from the value of
> td->td_hash then we'll know for certain.  (Or maybe the presence of the 
> barrier() will cause the object code to change in a way that prevents 
> the error from occurring.)
>
> Alan Stern
>   
Hmm, I applied the changes and I did not see a place where *prev differs 
from td->td_hash. I have run memtest86+ on this box and it has passed 16 
times, so I do not suspect a hardware memory error. What do you think? 
Attached is the latest dmesg output.

Sean
Comment 21 Sean Lazar 2010-01-03 02:03:27 UTC
Created attachment 24416 [details]
Third dmesg log
Comment 22 Jean-Francois Moine 2010-01-03 07:02:45 UTC
Hello Sean,

Sorry to be a bit late. Looking at the dmesg, I found that the gspca version was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check if this problem still occurs?

Jef
Comment 23 Sean Lazar 2010-01-03 07:12:08 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14564
>
>
> Jean-Francois Moine <moinejf@free.fr> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |moinejf@free.fr
>
>
>
>
> --- Comment #22 from Jean-Francois Moine <moinejf@free.fr>  2010-01-03
> 07:02:45 ---
> Hello Sean,
>
> Sorry to be a bit late. Looking at the dmesg, I found that the gspca version
> was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check
> if
> this problem still occurs?
>
> Jef
>
>   
Jef,
Should I use the whole v4l-dvb set from http://linuxtv.org/hg/v4l-dvb or 
should I just get the gspca stuff from someone's personal repository?

Sean
Comment 24 Jean-Francois Moine 2010-01-03 08:15:46 UTC
Sean,
Yes, you must get the whole stable version http://linuxtv.org/hg/v4l-dvb (there is no bug fix for your webcam in my development repository).

Jef
Comment 25 Alan Stern 2010-01-03 17:36:00 UTC
On Sat, 2 Jan 2010, Sean wrote:

> Hmm, I applied the changes and I did not see a place where *prev differs 
> from td->td_hash. I have run memtest86+ on this box and it has passed 16 
> times, so I do not suspect a hardware memory error. What do you think? 
> Attached is the latest dmesg output.

I don't know.  The same pattern as before appears here:

$ egrep -n '[1b]e(40|5c)' dmesg3.log
167:kobject: 'fs' (c7801e40): kobject_add_internal: parent: '<NULL>', set: '<NULL>'
4990:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40
5023:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> (null)
5181:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676be40
5214:ohci_hcd 0000:00:0b.0: hash c676be40 to 57 -> c6691e40
5271:ohci_hcd 0000:00:0b.0: td free c6691e40
5272:ohci_hcd 0000:00:0b.0: (57 1) c676be5c -> (null) [(null)]
5294:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40
5327:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> c676be40
5533:ohci_hcd 0000:00:0b.0: td free c676be40
5534:ohci_hcd 0000:00:0b.0: (57 1) c6691e5c -> (null) [(null)]
5556:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c676be40
5589:ohci_hcd 0000:00:0b.0: hash c676be40 to 57 -> c6691e40
5640:ohci_hcd 0000:00:0b.0: td free c6691e40
5641:ohci_hcd 0000:00:0b.0: (57 1) c676be5c -> c676be40 [c676be40]
5713:ohci_hcd 0000:00:0b.0: td alloc for 2 ep85: c6691e40
5746:ohci_hcd 0000:00:0b.0: hash c6691e40 to 57 -> c676be40
5899:ohci_hcd 0000:00:0b.0: td free c676be40
5900:ohci_hcd 0000:00:0b.0: (57 1) c6691e5c -> c676be40 [c676be40]

At line 5641 we see that the td_hash pointer in c676be40 gets
corrupted.  It is copied from the pointer in c6691e40, which was set to
NULL in line 5534, but now it points to c676be40.

The question is whether this corruption was caused by a hardware fault 
or a software bug.  We have added debugging printouts to the only two 
places where the driver assigns anything to td->td_hash, and they don't 
show anything wrong.  This leads me to suspect the hardware, but of 
course this is still just a guess.

Here is a completely new patch.  This one is more directed, to catch 
the sort of errors we now know to look for.

Alan Stern



Index: usb-2.6/drivers/usb/host/ohci-mem.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-mem.c
+++ usb-2.6/drivers/usb/host/ohci-mem.c
@@ -98,17 +98,56 @@ td_alloc (struct ohci_hcd *hc, gfp_t mem
 	return td;
 }
 
+static void td_check(struct ohci_hcd *hc, int hash, char *msg)
+{
+	struct td	*td, *first;
+
+	first = hc->td_hash[hash];
+	for (td = first; td; td = td->td_hash) {
+		if (td->td_hash == first || td->td_hash == td) {
+			ohci_err(hc, "Circular pointer %s: %d %p %p %p\n",
+					msg, hash, first, td, td->td_hash);
+			td->td_hash = NULL;
+			return;
+		}
+	}
+}
+
+static void td_check_all(struct ohci_hcd *hc, char *msg)
+{
+	int	hash;
+
+	for (hash = 0; hash < TD_HASH_SIZE; ++hash)
+		td_check(hc, hash, msg);
+}
+
 static void
 td_free (struct ohci_hcd *hc, struct td *td)
 {
-	struct td	**prev = &hc->td_hash [TD_HASH_FUNC (td->td_dma)];
+	int 		hash = TD_HASH_FUNC(td->td_dma);
+	struct td	**prev = &hc->td_hash[hash];
 
-	while (*prev && *prev != td)
+	td_check(hc, hash, "#1a");
+	while (*prev && *prev != td) {
+		if ((unsigned long) *prev == 0xa7a7a7a7) {
+			ohci_err(hc, "poisoned hash at %p (%d) %p %p\n", prev,
+				hash, td, hc->td_hash[hash]);
+			return;
+		}
 		prev = &(*prev)->td_hash;
-	if (*prev)
+	}
+	if (*prev) {
 		*prev = td->td_hash;
+		if (*prev == td) {
+			ohci_err(hc, "invalid hash at %p (%d) %p %p\n", prev,
+				hash, td, hc->td_hash[hash]);
+			*prev = NULL;
+		}
+	}
 	else if ((td->hwINFO & cpu_to_hc32(hc, TD_DONE)) != 0)
 		ohci_dbg (hc, "no hash for td %p\n", td);
+	mb();
+	td_check(hc, hash, "#1b");
 	dma_pool_free (hc->td_cache, td, td->td_dma);
 }
 
Index: usb-2.6/drivers/usb/host/ohci-q.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-q.c
+++ usb-2.6/drivers/usb/host/ohci-q.c
@@ -558,12 +558,14 @@ td_fill (struct ohci_hcd *ohci, u32 info
 
 	/* hash it for later reverse mapping */
 	hash = TD_HASH_FUNC (td->td_dma);
+	td_check(ohci, hash, "#2a");
 	td->td_hash = ohci->td_hash [hash];
 	ohci->td_hash [hash] = td;
 
 	/* HC might read the TD (or cachelines) right away ... */
 	wmb ();
 	td->ed->hwTailP = td->hwNextTD;
+	td_check(ohci, hash, "#2b");
 }
 
 /*-------------------------------------------------------------------------*/
@@ -1127,9 +1129,11 @@ dl_done_list (struct ohci_hcd *ohci)
 {
 	struct td	*td = dl_reverse_done_list (ohci);
 
+	td_check_all(ohci, "#3a");
 	while (td) {
 		struct td	*td_next = td->next_dl_td;
 		takeback_td(ohci, td);
 		td = td_next;
 	}
+	td_check_all(ohci, "#3b");
 }
Comment 26 Sean Lazar 2010-01-03 23:50:09 UTC
Alan Stern wrote:
> Here is a completely new patch.  This one is more directed, to catch 
> the sort of errors we now know to look for.
>
> Alan Stern
>   
Alan,

I applied the patches and ran capture-example twice. On the second run 
of capture-example a circular pointer popped up. I did not need to 
remove the camera. Attached are the serial console capture as well as 
the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the 
poison message?

Sean
Comment 27 Alan Stern 2010-01-04 16:07:04 UTC
On Sun, 3 Jan 2010, Sean wrote:

> Alan,
> 
> I applied the patches and ran capture-example twice. On the second run 
> of capture-example a circular pointer popped up. I did not need to 
> remove the camera. Attached are the serial console capture as well as 
> the dmesg log in debug4.tar.gz. Did you want me to try to reproduce the 
> poison message?

No.  Among the things that patch did was to fix up the errors that 
caused the invalid pointers.  Hence there should not have been any 
"poisoned hash" messages -- and indeed there weren't.

The interesting part of the log is the error messages:

ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 1 c6774040 c6542040 c6774040
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800
ohci_hcd 0000:00:0b.0: Circular pointer #2b: 32 c6774800 c6542800 c6774800

There are two different hash chains here (32 and 1), but in each case
see the message is #2b, never #2a.  This means the problem occurs
between the places where the #2a and #2b messages were inserted, i.e.,
in td_fill().  The hash chain contained a single TD and was fine to
begin with; then another TD was added at the start of the chain and the
pointer in the earlier TD (now at the second position in the chain) got
messed up.

For example, the error message in the first line above implies that
originally the 32nd hash chain contained only the TD at c6542800 with
its td_hash member set to NULL.  But then c6774800 was added to the
start of the chain, after which c6542800's td_hash pointed to c6774800.

Try inserting a line saying:

	td_check(ohci, hash, "#2c");

two lines above the #2b line, i.e., just after the wmb().  That'll help 
narrow down the search for the bug.

And by the way, you don't need to post your entire dmesg log.  Just the 
portion containing the new debugging messages will be enough.

Alan Stern
Comment 28 Sean Lazar 2010-01-04 20:04:34 UTC
Alan Stern wrote:
> Try inserting a line saying:
>
>       td_check(ohci, hash, "#2c");
>
> two lines above the #2b line, i.e., just after the wmb().  That'll help 
> narrow down the search for the bug.
Alan,

I put the extra line in and ran capture-example twice. This is what I got:
 
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800

Sean
Comment 29 Alan Stern 2010-01-04 20:48:52 UTC
On Mon, 4 Jan 2010, Sean wrote:

> Alan Stern wrote:
> > Try inserting a line saying:
> >
> >     td_check(ohci, hash, "#2c");
> >
> > two lines above the #2b line, i.e., just after the wmb().  That'll help 
> > narrow down the search for the bug.
> Alan,
> 
> I put the extra line in and ran capture-example twice. This is what I got:
>  
> ohci_hcd 0000:00:0b.0: Circular pointer #2c: 32 c6782800 c66a4800 c6782800
> ohci_hcd 0000:00:0b.0: Circular pointer #2c: 1 c6782040 c66a4040 c6782040
...

All right.  Let's try this patch in place of all the others, then.

Alan Stern


Index: usb-2.6/drivers/usb/host/ohci-q.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-q.c
+++ usb-2.6/drivers/usb/host/ohci-q.c
@@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info
 	struct urb_priv		*urb_priv = urb->hcpriv;
 	int			is_iso = info & TD_ISO;
 	int			hash;
+	volatile struct td	* volatile td1, * volatile td2;
 
 	// ASSERT (index < urb_priv->length);
 
@@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info
 
 	/* hash it for later reverse mapping */
 	hash = TD_HASH_FUNC (td->td_dma);
+
+	td1 = ohci->td_hash[hash];
+	td2 = NULL;
+	if (td1) {
+		td2 = td1->td_hash;
+		if (td2 == td1 || td2 == td) {
+			ohci_err(ohci, "Circular hash: %d %p %p %p\n",
+					hash, td1, td2, td);
+			td2 = td1->td_hash = NULL;
+		}
+	}
+
 	td->td_hash = ohci->td_hash [hash];
 	ohci->td_hash [hash] = td;
 
 	/* HC might read the TD (or cachelines) right away ... */
 	wmb ();
+
+	if (td1 && td1->td_hash != td2) {
+		ohci_err(ohci, "Hash value changed: %d %p %p %p\n",
+					hash, td1, td2, td);
+		td1->td_hash = (struct td *) td2;
+	}
+
 	td->ed->hwTailP = td->hwNextTD;
 }
Comment 30 Sean Lazar 2010-01-04 22:26:10 UTC
Alan Stern wrote:
> ...
>
> All right.  Let's try this patch in place of all the others, then.
>
> Alan Stern
>
>
> Index: usb-2.6/drivers/usb/host/ohci-q.c
> ===================================================================
> --- usb-2.6.orig/drivers/usb/host/ohci-q.c
> +++ usb-2.6/drivers/usb/host/ohci-q.c
> @@ -505,6 +505,7 @@ td_fill (struct ohci_hcd *ohci, u32 info
>       struct urb_priv         *urb_priv = urb->hcpriv;
>       int                     is_iso = info & TD_ISO;
>       int                     hash;
> +     volatile struct td      * volatile td1, * volatile td2;
>  
>       // ASSERT (index < urb_priv->length);
>  
> @@ -558,11 +559,30 @@ td_fill (struct ohci_hcd *ohci, u32 info
>  
>       /* hash it for later reverse mapping */
>       hash = TD_HASH_FUNC (td->td_dma);
> +
> +     td1 = ohci->td_hash[hash];
> +     td2 = NULL;
> +     if (td1) {
> +             td2 = td1->td_hash;
> +             if (td2 == td1 || td2 == td) {
> +                     ohci_err(ohci, "Circular hash: %d %p %p %p\n",
> +                                     hash, td1, td2, td);
> +                     td2 = td1->td_hash = NULL;
> +             }
> +     }
> +
>       td->td_hash = ohci->td_hash [hash];
>       ohci->td_hash [hash] = td;
>  
>       /* HC might read the TD (or cachelines) right away ... */
>       wmb ();
> +
> +     if (td1 && td1->td_hash != td2) {
> +             ohci_err(ohci, "Hash value changed: %d %p %p %p\n",
> +                                     hash, td1, td2, td);
> +             td1->td_hash = (struct td *) td2;
> +     }
> +
>       td->ed->hwTailP = td->hwNextTD;
>  }
>  
>
>   
Alan,
This last patch seems to do the job. Thanks so much for your help! Where 
do I donate/send beer?

Sean
Comment 31 Sean Lazar 2010-01-05 02:32:17 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14564
>
>
> Jean-Francois Moine <moinejf@free.fr> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |moinejf@free.fr
>
>
>
>
> --- Comment #22 from Jean-Francois Moine <moinejf@free.fr>  2010-01-03
> 07:02:45 ---
> Hello Sean,
>
> Sorry to be a bit late. Looking at the dmesg, I found that the gspca version
> was 2.7.0. May you upgrade your linux media stuff from LinuxTv.org and check
> if
> this problem still occurs?
>
> Jef
>   
Jef,

I upgraded to the latest v4l-dvb from http://linuxtv.org/hg/v4l-dvb, 
made the kernel modules, made the v4l libraries, and recompiled 
capture-example.c. Gspca now shows 2.8.0. The error still persists. Alan 
Stern's latest patch to ohci-q.c traps the error. I think it is an issue 
with the cpu or usb controller on the Vortex86SX SoC.

Sean
Comment 32 Alan Stern 2010-01-05 02:40:53 UTC
On Mon, 4 Jan 2010, Sean wrote:

> Alan,
> This last patch seems to do the job. Thanks so much for your help! Where 
> do I donate/send beer?

Um, when you say it does the job, what do you mean?

The job it was _intended_ to do was to prove that your problems are
caused by hardware errors rather than software bugs.  If the patch
causes the problems to stop, without printing any error messages in the
log, then it does indeed prove this.  After all, the only places the
patch changes any persistent values are after it prints an error 
message.

(Admittedly, I didn't expect the problem to stop; I expected to get a
bunch of messages from the second ohci_err().  Just out of curiosity, 
does it make any difference if you remove all those "volatile"s in the 
declaration line for td1 and td2?)

I noticed that your CPU is a Cyrix.  Perhaps it is the culprit.  Have
you tried running the program on a different computer?

Alan Stern
Comment 33 Alan Stern 2010-01-05 02:45:47 UTC
On Mon, 4 Jan 2010, Sean wrote:

> Jef,
> 
> I upgraded to the latest v4l-dvb from http://linuxtv.org/hg/v4l-dvb, 
> made the kernel modules, made the v4l libraries, and recompiled 
> capture-example.c. Gspca now shows 2.8.0. The error still persists. Alan 
> Stern's latest patch to ohci-q.c traps the error. I think it is an issue 
> with the cpu or usb controller on the Vortex86SX SoC.

The CPU is more likely than the USB controller.  The erroneous pointer 
values we see are virtual addresses, whereas the USB controller would 
probably write a physical (or DMA) address if it was malfunctioning.

Or it could be some bizarre timing problem with the memory bus, or 
something else equally obscure.  You didn't mention before that this 
was a SoC rather than an ordinary PC.

Alan Stern
Comment 34 Sean Lazar 2010-01-05 03:34:22 UTC
Alan Stern wrote:
> Um, when you say it does the job, what do you mean?
It traps the error and prevents the kernel from crashing.
> The job it was _intended_ to do was to prove that your problems are
> caused by hardware errors rather than software bugs.  If the patch
> causes the problems to stop, without printing any error messages in the
> log, then it does indeed prove this.  After all, the only places the
> patch changes any persistent values are after it prints an error 
> message.
>   
It did print out error messages:
usb 4-2: new full speed USB device using ohci_hcd and address 
2                                   
usb 4-2: New USB device found, idVendor=093a, 
idProduct=2460                                                         
usb 4-2: New USB device strings: Mfr=1, Product=2, 
SerialNumber=0                                                    
usb 4-2: Product: CIF Single 
Chip                                                                                     

usb 4-2: Manufacturer: Pixart Imaging 
Inc.                                                                            

usb 4-2: configuration #1 chosen from 1 
choice                                                                        

[root@X-Linux]:~ # modprobe 
gspca_pac207                                                                                                 

Linux video capture interface: 
v2.00                                                                                  

gspca: main v2.8.0 
registered                                                                                         

gspca: probing 
093a:2460                                                                                              

pac207: Pixart Sensor ID 0x27 Chips ID 
0x09                                                                           

pac207: Pixart PAC207BCA Image Processor and Control Chip detected 
(vid/pid 0x093A:0x2460)                           
gspca: video0 
created                                                                                                 

usbcore: registered new interface driver 
pac207                                                                       

pac207: 
registered                                                                                                    

[root@X-Linux]:~ # 
capture-example                                                                                    

......................................................................                                                

capture-example used greatest stack depth: 5848 bytes 
left                                                           
[root@X-Linux]:~ # 
capture-example                                                                                    

.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                               
.ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                 
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
....ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                              
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
.ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                 
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                               
.ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                 
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
.ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                 
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
c677b800                                                
ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                  
..ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
c677b900                                                
..                                                                                                                    

[root@X-Linux]:~ #
> (Admittedly, I didn't expect the problem to stop; I expected to get a
> bunch of messages from the second ohci_err().  Just out of curiosity, 
> does it make any difference if you remove all those "volatile"s in the 
> declaration line for td1 and td2?)
>   
It doesn't seem to make much difference.
> I noticed that your CPU is a Cyrix.  Perhaps it is the culprit.  Have
> you tried running the program on a different computer?
>   
Yes, on other computers I don't get this error. Same os image. Though I 
haven't found a computer with an ohci controller yet.
> Alan Stern
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
Comment 35 Alan Stern 2010-01-05 15:11:13 UTC
On Mon, 4 Jan 2010, Sean wrote:

> Alan Stern wrote:
> > Um, when you say it does the job, what do you mean?
> It traps the error and prevents the kernel from crashing.

As did some of the earlier patches, right?

> > The job it was _intended_ to do was to prove that your problems are
> > caused by hardware errors rather than software bugs.  If the patch
> > causes the problems to stop, without printing any error messages in the
> > log, then it does indeed prove this.  After all, the only places the
> > patch changes any persistent values are after it prints an error 
> > message.
> >   
> It did print out error messages:

> .ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
> c677b900                                                 
> ...ohci_hcd 0000:00:0b.0: Circular hash: 36 c669f900 c677b900 
> c677b900                                               
> .ohci_hcd 0000:00:0b.0: Circular hash: 32 c669f800 c677b800 
> c677b800                                                 

Ooh, that's odd.  The "Circular hash" message occurs in the same spot 
as the "Circular pointer #2a" message in the previous patch -- and that 
message never got printed!

> > I noticed that your CPU is a Cyrix.  Perhaps it is the culprit.  Have
> > you tried running the program on a different computer?
> >   
> Yes, on other computers I don't get this error. Same os image. Though I 
> haven't found a computer with an ohci controller yet.

So that's not a real test, unfortunately.

Still, at this point I'm not sure it's worthwhile to pursue this any
farther.  I'm convinced it's a hardware problem.  Do you want to 
continue, or are you happy to switch computers and forget about it?

Alan Stern
Comment 36 Sean Lazar 2010-01-05 20:08:24 UTC
Alan Stern wrote:
> On Mon, 4 Jan 2010, Sean wrote:
>
>   
>> Alan Stern wrote:
>>     
>>> Um, when you say it does the job, what do you mean?
>>>       
>> It traps the error and prevents the kernel from crashing.
>>     
>
> As did some of the earlier patches, right?
>
>   
Yes, but this one is good because it works without a debug kernel.
> Still, at this point I'm not sure it's worthwhile to pursue this any
> farther.  I'm convinced it's a hardware problem.  Do you want to 
> continue, or are you happy to switch computers and forget about it?
>   
Thanks so much for your help Alan. I also think this is definitely a bug 
in the hardware. Let's leave it at that, I'm happy.

Sean
Comment 37 Alan Stern 2010-01-05 21:06:10 UTC
On Tue, 5 Jan 2010, Sean wrote:

> Thanks so much for your help Alan. I also think this is definitely a bug 
> in the hardware. Let's leave it at that, I'm happy.

Okay.  You should mark the Bugzilla report as REJECTED and close it 
out.

Alan Stern

Note You need to log in before you can comment on or make changes to this bug.