Created attachment 29562 [details] BUG screenshot After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that), I was greeted by a black screen of death today in the morning: (see screenshot in attachment; partial transcript:) sending NMI to all CPUs: BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802... PID: 4313, comm: ktorrent Tainted: G M D W 2.6.36-rc3 #3 Call Trace: [...] do_raw_spin_lock+0x118/0x147 [...] _raw_spin_lock_irq+0x44/0x49 [...] ? __make_request+0x5c/0x400 [...] __make_request+0x5c/0x400 [...] generic_make_request+0x23a/0x2a9 [...] submit_bio+0xad/b6 [...] mpage_bio_submit... [...] do_mpage_readpage... [...] ? get_parent_ip... [...] ? sub_preempt_count... [...] ? __lru_cache_add... [...] mpage_readpages... [...] ? ext4_get_block... [...] ? __alloc_pages_nodemask... [...] ? ext4_get_block... [...] ext4_readpages... [...] __do_page_cache_readahead... [...] ? __do_page_cache_readahead... [...] ra_submit... [...] ondemand_readahead... This is a system with Phenom II x4 and Radeon graphics. Since kernel mode setting is fairly new for radeon, it is possible that the lockup happened with earlier kernels too but simply ended in a lockup without trace dump to the screen. IOW, it is not clear to me whether this is a regression or not. The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem from which ktorrent was reading. Of course this kind of commonplace workload happened without problem two or three times before during the week in which I ran 2.6.36-rc3.
Full quote for lkml: bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=18252 > > Summary: spinlock lockup in __make_request <- submit_bio <- > ondemand_readahead > Product: IO/Storage > Version: 2.5 > Kernel Version: 2.6.36-rc3 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Block Layer > AssignedTo: axboe@kernel.dk > ReportedBy: stefanr@s5r6.in-berlin.de > Regression: No > > > Created an attachment (id=29562) > --> (https://bugzilla.kernel.org/attachment.cgi?id=29562) > BUG screenshot > > After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that), Almost two weeks uptime actually. > I was greeted by a black screen of death today in the morning: > > (see screenshot in attachment; partial transcript:) > > sending NMI to all CPUs: > BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802... > PID: 4313, comm: ktorrent Tainted: G M D W 2.6.36-rc3 #3 > Call Trace: > [...] do_raw_spin_lock+0x118/0x147 > [...] _raw_spin_lock_irq+0x44/0x49 > [...] ? __make_request+0x5c/0x400 > [...] __make_request+0x5c/0x400 > [...] generic_make_request+0x23a/0x2a9 > [...] submit_bio+0xad/b6 > [...] mpage_bio_submit... > [...] do_mpage_readpage... > [...] ? get_parent_ip... > [...] ? sub_preempt_count... > [...] ? __lru_cache_add... > [...] mpage_readpages... > [...] ? ext4_get_block... > [...] ? __alloc_pages_nodemask... > [...] ? ext4_get_block... > [...] ext4_readpages... > [...] __do_page_cache_readahead... > [...] ? __do_page_cache_readahead... > [...] ra_submit... > [...] ondemand_readahead... > > This is a system with Phenom II x4 and Radeon graphics. Since kernel mode > setting is fairly new for radeon, it is possible that the lockup happened > with > earlier kernels too but simply ended in a lockup without trace dump to the > screen. IOW, it is not clear to me whether this is a regression or not. > > The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem > from > which ktorrent was reading. Of course this kind of commonplace workload > happened without problem two or three times before during the week in which I > ran 2.6.36-rc3. > (The screenshot is a bit large, hence I reported in bugzilla instead of the list.) The kernel taint was due to prior apparently unrelated lockdep report, bug 17752 "2.6.36-rc3: inconsistent lock state (iprune_sem, shrink_icache_memory". And there were three machine check events ten days ago due to corrected ECC memory errors.
On Sat, 11 Sep 2010 11:50:41 +0200 Stefan Richter <stefanr@s5r6.in-berlin.de> wrote: > Full quote for lkml: > > bugzilla-daemon@bugzilla.kernel.org wrote: > > https://bugzilla.kernel.org/show_bug.cgi?id=18252 > > > > Summary: spinlock lockup in __make_request <- submit_bio <- > > ondemand_readahead > > Product: IO/Storage > > Version: 2.5 > > Kernel Version: 2.6.36-rc3 > > Platform: All > > OS/Version: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: Block Layer > > AssignedTo: axboe@kernel.dk > > ReportedBy: stefanr@s5r6.in-berlin.de > > Regression: No > > > > > > Created an attachment (id=29562) > > --> (https://bugzilla.kernel.org/attachment.cgi?id=29562) > > BUG screenshot > > > > After a week uptime of 2.6.36-rc3 (I ran 2.6.35 before that), > > Almost two weeks uptime actually. > > > I was greeted by a black screen of death today in the morning: > > > > (see screenshot in attachment; partial transcript:) > > > > sending NMI to all CPUs: > > BUG: soinlock lockup on CPU#0, ktorrent/4313, ffff8802... > > PID: 4313, comm: ktorrent Tainted: G M D W 2.6.36-rc3 #3 > > Call Trace: > > [...] do_raw_spin_lock+0x118/0x147 > > [...] _raw_spin_lock_irq+0x44/0x49 > > [...] ? __make_request+0x5c/0x400 > > [...] __make_request+0x5c/0x400 > > [...] generic_make_request+0x23a/0x2a9 > > [...] submit_bio+0xad/b6 > > [...] mpage_bio_submit... > > [...] do_mpage_readpage... > > [...] ? get_parent_ip... > > [...] ? sub_preempt_count... > > [...] ? __lru_cache_add... > > [...] mpage_readpages... > > [...] ? ext4_get_block... > > [...] ? __alloc_pages_nodemask... > > [...] ? ext4_get_block... > > [...] ext4_readpages... > > [...] __do_page_cache_readahead... > > [...] ? __do_page_cache_readahead... > > [...] ra_submit... > > [...] ondemand_readahead... > > > > This is a system with Phenom II x4 and Radeon graphics. Since kernel mode > > setting is fairly new for radeon, it is possible that the lockup happened > with > > earlier kernels too but simply ended in a lockup without trace dump to the > > screen. IOW, it is not clear to me whether this is a regression or not. > > > > The bug happened while kaffeine wrote an MPEG 2 TS to the same filesystem > from > > which ktorrent was reading. Of course this kind of commonplace workload > > happened without problem two or three times before during the week in which > I > > ran 2.6.36-rc3. > > > > (The screenshot is a bit large, hence I reported in bugzilla instead of the > list.) > What you've quoted above appears to be just the aftermath. https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the kernel earlier crashed in scsi code, perhaps under scsi_setup_fs_cmnd(). The question is: was that actually the first crash, or did an even earlier one scroll off?
There was an scsi-related use-after-free OOPS fixed recently and pulled 3 days ago. On Sat, 11 Sep 2010 19:07:44 +0000 James Bottomley <James.Bottomley@suse.de> wrote: > This includes the oops from use after free, a set of qla2xxx fixes, some > misc warning cleanups from the recently introduced printk issue, an hpsa > lockup fix and a medium removal bug in sd introduced by the BKL > pushdown. > > The patch is available here: > > master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git Maybe you are seeing that?
(reacting to the general-protection-fault preceded by scsi_init in the attachment jpg)
Andrew Morton wrote: >>> https://bugzilla.kernel.org/show_bug.cgi?id=18252 ... > What you've quoted above appears to be just the aftermath. > https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the > kernel earlier crashed in scsi code, perhaps under > scsi_setup_fs_cmnd(). > > The question is: was that actually the first crash, or did an even > earlier one scroll off? It happened overnight. The screenshot https://bugzilla.kernel.org/attachment.cgi?id=29562 shows that there was a lot more logged before it. When I saw it in the morning I assumed that the tail was a repetition of the leading bug trace, but it seems I am mistaken. Florian Mickler wrote: > There was an scsi-related use-after-free OOPS fixed recently and pulled 3 > days > ago. > > On Sat, 11 Sep 2010 19:07:44 +0000 > James Bottomley <James.Bottomley@suse.de> wrote: > >> This includes the oops from use after free, a set of qla2xxx fixes, some >> misc warning cleanups from the recently introduced printk issue, an hpsa >> lockup fix and a medium removal bug in sd introduced by the BKL >> pushdown. >> >> The patch is available here: >> >> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git > > Maybe you are seeing that? > > (reacting to the general-protection-fault preceded by scsi_init in the > attachment jpg) Now that you point it out --- perhaps. Though I haven't looked into the mechanics of the now fixed scsi_ini_io use after free. I am going to update to 2.6.36-rc4 today (I had reverted to 2.6.35 since the report), and if the issue does not return after two weeks or so I will close it as fixed, I suggest.
On 2010-09-14 08:56, Stefan Richter wrote: > Andrew Morton wrote: >>>> https://bugzilla.kernel.org/show_bug.cgi?id=18252 > ... >> What you've quoted above appears to be just the aftermath. >> https://bugzilla.kernel.org/attachment.cgi?id=29562 indicates that the >> kernel earlier crashed in scsi code, perhaps under >> scsi_setup_fs_cmnd(). >> >> The question is: was that actually the first crash, or did an even >> earlier one scroll off? > > It happened overnight. The screenshot > https://bugzilla.kernel.org/attachment.cgi?id=29562 shows that there was a > lot > more logged before it. When I saw it in the morning I assumed that the tail > was a repetition of the leading bug trace, but it seems I am mistaken. > > Florian Mickler wrote: >> There was an scsi-related use-after-free OOPS fixed recently and pulled 3 >> days >> ago. >> >> On Sat, 11 Sep 2010 19:07:44 +0000 >> James Bottomley <James.Bottomley@suse.de> wrote: >> >>> This includes the oops from use after free, a set of qla2xxx fixes, some >>> misc warning cleanups from the recently introduced printk issue, an hpsa >>> lockup fix and a medium removal bug in sd introduced by the BKL >>> pushdown. >>> >>> The patch is available here: >>> >>> master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git >> >> Maybe you are seeing that? >> >> (reacting to the general-protection-fault preceded by scsi_init in the >> attachment jpg) > > Now that you point it out --- perhaps. Though I haven't looked into the > mechanics of the now fixed scsi_ini_io use after free. It seems the very likely explanation, since I can't see any other way that you would deadlock on the queue lock from that call trace if you haven't had someone else crash with the lock held already.
Jens Axboe wrote: > On 2010-09-14 08:56, Stefan Richter wrote: >>>>> https://bugzilla.kernel.org/show_bug.cgi?id=18252 >> Florian Mickler wrote: >>> There was an scsi-related use-after-free OOPS fixed recently and pulled 3 >>> days >>> ago. ... >>> Maybe you are seeing that? >>> >>> (reacting to the general-protection-fault preceded by scsi_init in the >>> attachment jpg) >> Now that you point it out --- perhaps. Though I haven't looked into the >> mechanics of the now fixed scsi_ini_io use after free. > > It seems the very likely explanation, since I can't see any other way that > you would deadlock on the queue lock from that call trace if you haven't > had someone else crash with the lock held already. Good, I close the bugzilla item right away with the reasonable assuption that it is fixed by commit 3a5c19c23db65a554f2e4f5df5f307c668277056.