Bug 18252 - spinlock lockup in __make_request <- submit_bio <- ondemand_readahead
spinlock lockup in __make_request <- submit_bio <- ondemand_readahead
Status: CLOSED CODE_FIX
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
All Linux
: P1 normal
Assigned To: Jens Axboe
:
Depends on:
Blocks: 16444
  Show dependency treegraph
 
Reported: 2010-09-11 09:36 UTC by Stefan Richter
Modified: 2010-09-14 18:10 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.36-rc3
Tree: Mainline
Regression: Yes


Attachments
BUG screenshot (237.09 KB, image/jpeg)
2010-09-11 09:36 UTC, Stefan Richter
Details

Description Stefan Richter 2010-09-11 09:36:15 UTC
Created attachment 29562 [details]
BUG screenshot

After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that), I was greeted by a black screen of death today in the morning:

(see screenshot in attachment; partial transcript:)

sending NMI to all CPUs:
BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802...
PID: 4313, comm: ktorrent Tainted: G  M D W   2.6.36-rc3 #3
Call Trace:
 [...] do_raw_spin_lock+0x118/0x147
 [...] _raw_spin_lock_irq+0x44/0x49
 [...] ? __make_request+0x5c/0x400
 [...] __make_request+0x5c/0x400
 [...] generic_make_request+0x23a/0x2a9
 [...] submit_bio+0xad/b6
 [...] mpage_bio_submit...
 [...] do_mpage_readpage...
 [...] ? get_parent_ip...
 [...] ? sub_preempt_count...
 [...] ? __lru_cache_add...
 [...] mpage_readpages...
 [...] ? ext4_get_block...
 [...] ? __alloc_pages_nodemask...
 [...] ? ext4_get_block...
 [...] ext4_readpages...
 [...] __do_page_cache_readahead...
 [...] ? __do_page_cache_readahead...
 [...] ra_submit...
 [...] ondemand_readahead...

This is a system with Phenom II x4 and Radeon graphics.  Since kernel mode setting is fairly new for radeon, it is possible that the lockup happened with earlier kernels too but simply ended in a lockup without trace dump to the screen.  IOW, it is not clear to me whether this is a regression or not.

The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem from which ktorrent was reading.  Of course this kind of commonplace workload happened without problem two or three times before during the week in which I ran 2.6.36-rc3.
Comment 1 Stefan Richter 2010-09-11 09:51:11 UTC
Full quote for lkml:

bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=18252
> 
>            Summary: spinlock lockup in __make_request <- submit_bio <-
>                     ondemand_readahead
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.36-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Block Layer
>         AssignedTo: axboe@kernel.dk
>         ReportedBy: stefanr@s5r6.in-berlin.de
>         Regression: No
> 
> 
> Created an attachment (id=29562)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=29562)
> BUG screenshot
> 
> After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that),

Almost two weeks uptime actually.

> I was greeted by a black screen of death today in the morning:
> 
> (see screenshot in attachment; partial transcript:)
> 
> sending NMI to all CPUs:
> BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802...
> PID: 4313, comm: ktorrent Tainted: G  M D W   2.6.36-rc3 #3
> Call Trace:
>  [...] do_raw_spin_lock+0x118/0x147
>  [...] _raw_spin_lock_irq+0x44/0x49
>  [...] ? __make_request+0x5c/0x400
>  [...] __make_request+0x5c/0x400
>  [...] generic_make_request+0x23a/0x2a9
>  [...] submit_bio+0xad/b6
>  [...] mpage_bio_submit...
>  [...] do_mpage_readpage...
>  [...] ? get_parent_ip...
>  [...] ? sub_preempt_count...
>  [...] ? __lru_cache_add...
>  [...] mpage_readpages...
>  [...] ? ext4_get_block...
>  [...] ? __alloc_pages_nodemask...
>  [...] ? ext4_get_block...
>  [...] ext4_readpages...
>  [...] __do_page_cache_readahead...
>  [...] ? __do_page_cache_readahead...
>  [...] ra_submit...
>  [...] ondemand_readahead...
> 
> This is a system with Phenom II x4 and Radeon graphics.  Since kernel mode
> setting is fairly new for radeon, it is possible that the lockup happened with
> earlier kernels too but simply ended in a lockup without trace dump to the
> screen.  IOW, it is not clear to me whether this is a regression or not.
> 
> The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem from
> which ktorrent was reading.  Of course this kind of commonplace workload
> happened without problem two or three times before during the week in which I
> ran 2.6.36-rc3.
> 

(The screenshot is a bit large, hence I reported in bugzilla instead of the list.)

The kernel taint was due to prior apparently unrelated lockdep report, bug
17752 "2.6.36-rc3: inconsistent lock state (iprune_sem, shrink_icache_memory".
 And there were three machine check events ten days ago due to corrected ECC
memory errors.
Comment 2 Andrew Morton 2010-09-13 21:42:02 UTC
On Sat, 11 Sep 2010 11:50:41 +0200
Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> Full quote for lkml:
> 
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=18252
> > 
> >            Summary: spinlock lockup in __make_request <- submit_bio <-
> >                     ondemand_readahead
> >            Product: IO/Storage
> >            Version: 2.5
> >     Kernel Version: 2.6.36-rc3
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Block Layer
> >         AssignedTo: axboe@kernel.dk
> >         ReportedBy: stefanr@s5r6.in-berlin.de
> >         Regression: No
> > 
> > 
> > Created an attachment (id=29562)
> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=29562)
> > BUG screenshot
> > 
> > After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that),
> 
> Almost two weeks uptime actually.
> 
> > I was greeted by a black screen of death today in the morning:
> > 
> > (see screenshot in attachment; partial transcript:)
> > 
> > sending NMI to all CPUs:
> > BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802...
> > PID: 4313, comm: ktorrent Tainted: G  M D W   2.6.36-rc3 #3
> > Call Trace:
> >  [...] do_raw_spin_lock+0x118/0x147
> >  [...] _raw_spin_lock_irq+0x44/0x49
> >  [...] ? __make_request+0x5c/0x400
> >  [...] __make_request+0x5c/0x400
> >  [...] generic_make_request+0x23a/0x2a9
> >  [...] submit_bio+0xad/b6
> >  [...] mpage_bio_submit...
> >  [...] do_mpage_readpage...
> >  [...] ? get_parent_ip...
> >  [...] ? sub_preempt_count...
> >  [...] ? __lru_cache_add...
> >  [...] mpage_readpages...
> >  [...] ? ext4_get_block...
> >  [...] ? __alloc_pages_nodemask...
> >  [...] ? ext4_get_block...
> >  [...] ext4_readpages...
> >  [...] __do_page_cache_readahead...
> >  [...] ? __do_page_cache_readahead...
> >  [...] ra_submit...
> >  [...] ondemand_readahead...
> > 
> > This is a system with Phenom II x4 and Radeon graphics.  Since kernel mode
> > setting is fairly new for radeon, it is possible that the lockup happened with
> > earlier kernels too but simply ended in a lockup without trace dump to the
> > screen.  IOW, it is not clear to me whether this is a regression or not.
> > 
> > The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem from
> > which ktorrent was reading.  Of course this kind of commonplace workload
> > happened without problem two or three times before during the week in which I
> > ran 2.6.36-rc3.
> > 
> 
> (The screenshot is a bit large, hence I reported in bugzilla instead of the list.)
> 

What you've quoted above appears to be just the aftermath. 
https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the
kernel earlier crashed in scsi code, perhaps under
scsi_setup_fs_cmnd().

The question is: was that actually the first crash, or did an even
earlier one scroll off?
Comment 3 Florian Mickler 2010-09-14 06:22:09 UTC
There was an scsi-related use-after-free OOPS fixed recently and pulled 3 days ago.  

On Sat, 11 Sep 2010 19:07:44 +0000
James Bottomley <James.Bottomley@suse.de> wrote:

> This includes the oops from use after free, a set of qla2xxx fixes, some
> misc warning cleanups from the recently introduced printk issue, an hpsa
> lockup fix and a medium removal bug in sd introduced by the BKL
> pushdown.
> 
> The patch is available here:
> 
> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git


Maybe you are seeing that?
Comment 4 Florian Mickler 2010-09-14 06:23:21 UTC
(reacting to the general-protection-fault preceded by scsi_init in the attachment jpg)
Comment 5 Stefan Richter 2010-09-14 06:57:14 UTC
Andrew Morton wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=18252
...
> What you've quoted above appears to be just the aftermath. 
> https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the
> kernel earlier crashed in scsi code, perhaps under
> scsi_setup_fs_cmnd().
> 
> The question is: was that actually the first crash, or did an even
> earlier one scroll off?

It happened overnight.  The screenshot
https://bugzilla.kernel.org/attachment.cgi?id=29562 shows that there was a lot
more logged before it.  When I saw it in the morning I assumed that the tail
was a repetition of the leading bug trace, but it seems I am mistaken.

Florian Mickler wrote:
> There was an scsi-related use-after-free OOPS fixed recently and pulled 3 days
> ago.  
> 
> On Sat, 11 Sep 2010 19:07:44 +0000
> James Bottomley <James.Bottomley@suse.de> wrote:
> 
>> This includes the oops from use after free, a set of qla2xxx fixes, some
>> misc warning cleanups from the recently introduced printk issue, an hpsa
>> lockup fix and a medium removal bug in sd introduced by the BKL
>> pushdown.
>> 
>> The patch is available here:
>> 
>> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
> 
> Maybe you are seeing that?
> 
> (reacting to the general-protection-fault preceded by scsi_init in the
> attachment jpg)

Now that you point it out --- perhaps.  Though I haven't looked into the
mechanics of the now fixed scsi_ini_io use after free.

I am going to update to 2.6.36-rc4 today (I had reverted to 2.6.35 since the
report), and if the issue does not return after two weeks or so I will close
it as fixed, I suggest.
Comment 6 Jens Axboe 2010-09-14 07:38:04 UTC
On 2010-09-14 08:56, Stefan Richter wrote:
> Andrew Morton wrote:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=18252
> ...
>> What you've quoted above appears to be just the aftermath. 
>> https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the
>> kernel earlier crashed in scsi code, perhaps under
>> scsi_setup_fs_cmnd().
>>
>> The question is: was that actually the first crash, or did an even
>> earlier one scroll off?
> 
> It happened overnight.  The screenshot
> https://bugzilla.kernel.org/attachment.cgi?id=29562 shows that there was a lot
> more logged before it.  When I saw it in the morning I assumed that the tail
> was a repetition of the leading bug trace, but it seems I am mistaken.
> 
> Florian Mickler wrote:
>> There was an scsi-related use-after-free OOPS fixed recently and pulled 3 days
>> ago.  
>>
>> On Sat, 11 Sep 2010 19:07:44 +0000
>> James Bottomley <James.Bottomley@suse.de> wrote:
>>
>>> This includes the oops from use after free, a set of qla2xxx fixes, some
>>> misc warning cleanups from the recently introduced printk issue, an hpsa
>>> lockup fix and a medium removal bug in sd introduced by the BKL
>>> pushdown.
>>>
>>> The patch is available here:
>>>
>>> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>>
>> Maybe you are seeing that?
>>
>> (reacting to the general-protection-fault preceded by scsi_init in the
>> attachment jpg)
> 
> Now that you point it out --- perhaps.  Though I haven't looked into the
> mechanics of the now fixed scsi_ini_io use after free.

It seems the very likely explanation, since I can't see any other way that
you would deadlock on the queue lock from that call trace if you haven't
had someone else crash with the lock held already.
Comment 7 Stefan Richter 2010-09-14 11:18:32 UTC
Jens Axboe wrote:
> On 2010-09-14 08:56, Stefan Richter wrote:
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=18252
>> Florian Mickler wrote:
>>> There was an scsi-related use-after-free OOPS fixed recently and pulled 3 days
>>> ago.  
...
>>> Maybe you are seeing that?
>>>
>>> (reacting to the general-protection-fault preceded by scsi_init in the
>>> attachment jpg)
>> Now that you point it out --- perhaps.  Though I haven't looked into the
>> mechanics of the now fixed scsi_ini_io use after free.
> 
> It seems the very likely explanation, since I can't see any other way that
> you would deadlock on the queue lock from that call trace if you haven't
> had someone else crash with the lock held already.

Good, I close the bugzilla item right away with the reasonable assuption that
it is fixed by commit 3a5c19c23db65a554f2e4f5df5f307c668277056.

Note You need to log in before you can comment on or make changes to this bug.