Bug 14436

Summary: Computer becomes unusable without any apparent reason
Product: Memory Management
Reporter: Pitxyoki (Pitxyoki)
Component: Other
Assignee: Andrew Morton (akpm)
Status: CLOSED CODE_FIX
Severity: high
CC: rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32-rc4
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 14230
Attachments:
System logs for both occurrences of the bug.
Fourth occurrence
Fifth occurrence

Description Pitxyoki 2009-10-18 18:32:04 UTC
Hi,
This happened to me two times today.
The first time, I wasn't even in front of the computer: I heard a system beep and when I looked at it, the computer was totally unresponsive. I couldn't input anything, the Num, Caps and Scroll Lock keys had no effect on the keyboard lights, and the cursor wouldn't move. After a cold reboot I ran memtest86+ and fsck on all drives, but no errors appeared.

The second time, I had just clicked on a URL in icedove (= Mozilla Thunderbird) pointing to a (trusted) PDF file sent by a friend. While the file was opening, the system beep started sounding continuously and no input could be sent to the computer. After a cold reboot, still no errors were found.

I'm attaching the syslog for both occurrences.

Regards,
Luís Picciochi
Comment 1 Pitxyoki 2009-10-18 18:34:11 UTC
Created attachment 23461 [details]
System logs for both occurrences of the bug.
Comment 2 Pitxyoki 2009-10-18 21:40:47 UTC
This happened for the third time just now. /var/log/syslog has absolutely nothing about it this time.
Comment 3 Pitxyoki 2009-10-28 23:45:19 UTC
Once more. This time with 2.6.32-rc5. Attaching corresponding syslog.
Comment 4 Pitxyoki 2009-10-28 23:47:19 UTC
Created attachment 23572 [details]
Fourth occurrence
Comment 5 Pitxyoki 2009-10-29 20:45:46 UTC
This happened once more yesterday. This time I tried shutting it down over ssh. I could log in and send the command, but the shutdown sequence didn't finish. I had to press the power button to turn it off.
I'm starting to fear for my filesystem's consistency. For the first time ever I have two files in an ext3 filesystem's lost+found. fsck reported more errors than I would consider acceptable if these hangs weren't happening.

Aside from that, do you want me to continue submitting syslogs or is this enough? I'm considering going back to an older kernel.
Comment 6 Pitxyoki 2009-10-29 20:47:11 UTC
Created attachment 23590 [details]
Fifth occurrence
Comment 7 Andrew Morton 2009-11-03 07:26:39 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 18 Oct 2009 18:32:05 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14436
> 
>            Summary: Computer becomes unusable without any apparent reason
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 2.6.32-rc4
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: Pitxyoki@gmail.com
>         Regression: No
> 
> 
> Hi,
> This happened to me two times today.
> The first time, I wasn't even in front of the computer: I heard a system beep
> and when I looked at it, the computer was totally unresponsive. I couldn't
> input anything, the Num, Caps and Scroll Lock keys had no effect on the
> keyboard lights, and the cursor wouldn't move. After a cold reboot I ran
> memtest86+ and fsck on all drives, but no errors appeared.
> 
> The second time, I had just clicked on a URL in icedove (= Mozilla
> Thunderbird) pointing to a (trusted) PDF file sent by a friend. While the
> file was opening, the system beep started sounding continuously and no input
> could be sent to the computer. After a cold reboot, still no errors were found.
> 
> I'm attaching the syslog for both occurrences.
> 

Reproducible oops in tty_devnum():
http://bugzilla.kernel.org/attachment.cgi?id=23572

I think it would be safe to assume that this is a regression. 
Pitxyoki, was 2.6.31 OK?

Thanks.
Comment 8 Alan 2009-11-03 10:15:14 UTC
> Reproducible oops in tty_devnum():
> http://bugzilla.kernel.org/attachment.cgi?id=23572
> 
> I think it would be safe to assume that this is a regression. 

Looks to me like a memory scribble or something being freed while the
kernel is still using it. The oopses come from the fact that the task struct
now contains ASCII.

Turn on slab poisoning and all the memory debugging and try to repeat it. Grab
the oops, and after that, if you are using 4K stacks, switch to 8K stacks and
repeat the attempt.
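
A quick way to test for the pattern described here, a kernel field overwritten with text, is to decode a suspicious value from the oops as raw bytes and check whether they are all printable ASCII. The following is only a minimal sketch in Python, assuming a 32-bit little-endian x86 machine; the function name and the sample value are hypothetical and are not taken from the attached logs.

```python
def looks_like_ascii(value, width=4):
    """Return the decoded text if every byte of `value` is printable ASCII.

    `value` is an integer copied from an oops (for example a faulting
    address or a register). Returns None when any byte is non-printable.
    Assumes little-endian byte order, as on x86.
    """
    raw = value.to_bytes(width, byteorder="little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


# Hypothetical sample value, chosen only to illustrate the check.
print(looks_like_ascii(0x6C616E72))
```

A value that decodes to readable text is a hint, not proof, that a string was written over the structure; ordinary kernel pointers on 32-bit x86 usually contain bytes outside the printable range.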
Comment 9 Pitxyoki 2009-11-03 11:01:14 UTC
On Tue, Nov 3, 2009 at 10:16 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > Reproducible oops in tty_devnum():
> > http://bugzilla.kernel.org/attachment.cgi?id=23572
> >
> > I think it would be safe to assume that this is a regression.
>
>
Thank you for paying attention to this.
I can't tell for certain whether 2.6.31 was OK, but I never had these
problems with it.
Even with 2.6.32 everything sometimes seems to be OK: it can run fine for
some days, and other times it crashes multiple times a day.


> Looks to me like a memory scribble or something being freed while the
> kernel is still using it. The oopses come from the fact that the task struct
> now contains ASCII.
>
> Turn on slab poisoning and all the memory debugging and try to repeat it. Grab
> the oops, and after that, if you are using 4K stacks, switch to 8K stacks and
> repeat the attempt.
>
>

I'm sorry, but I'm not sure I know how to do this. Are these options on the
.config file? If not, can you please instruct me more clearly on how to do
this?

Regards,
Luís Picciochi
Comment 10 Alan 2009-11-03 11:15:37 UTC
> I'm sorry, but I'm not sure I know how to do this. Are these options on the
> .config file? If not, can you please instruct me more clearly on how to do
> this?

They are .config options

I would enable

DEBUG_KERNEL
DEBUG_PAGEALLOC
PAGE_POISONING
DEBUG_STACKOVERFLOW
DEBUG_STACK_USAGE
DEBUG_OBJECTS
DEBUG_OBJECTS_FREE
DEBUG_SLAB or SLUB_DEBUG or SLQB_DEBUG

and the stack size is configured with

4KSTACKS
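
To confirm that a rebuilt kernel really has these options on, the generated .config can be checked mechanically. The sketch below is one possible way to do it in Python, not an official tool: it assumes the usual .config syntax (`CONFIG_NAME=y`, or `# CONFIG_NAME is not set` for disabled options), and the function names are hypothetical. The allocator-specific debug option and the stack size setting are left out of the list because they depend on the chosen allocator and architecture.

```python
import re

# Option names from the list above, without the CONFIG_ prefix
# that the kernel .config file adds.
WANTED = [
    "DEBUG_KERNEL",
    "DEBUG_PAGEALLOC",
    "PAGE_POISONING",
    "DEBUG_STACKOVERFLOW",
    "DEBUG_STACK_USAGE",
    "DEBUG_OBJECTS",
    "DEBUG_OBJECTS_FREE",
]


def enabled_options(config_text):
    """Return the set of option names set to y or m in .config text."""
    pattern = re.compile(r"^CONFIG_(\w+)=(?:y|m)$", re.MULTILINE)
    return {match.group(1) for match in pattern.finditer(config_text)}


def missing_options(config_text, wanted=WANTED):
    """Return the wanted options that are not enabled, in list order."""
    enabled = enabled_options(config_text)
    return [name for name in wanted if name not in enabled]


if __name__ == "__main__":
    # Assumes the script is run from the top of a configured kernel tree.
    with open(".config") as handle:
        for name in missing_options(handle.read()):
            print("not enabled:", name)
```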
Comment 11 Pitxyoki 2009-11-21 20:29:43 UTC
I'm suspecting more and more that this bug might be related to bug #12794.
Please see my last attachment on that bug.

After I recompiled the kernel with the options you asked for, I haven't seen
any messages in syslog that seemed related to the ones I reported here.
When my computer hangs now, I see messages like the ones I
reported on bug #12794, related to the rndis_wlan driver.

In the logs I attached to this bug you can see other programs associated
with the oops, and not rndis_wlan... But you can also see "BUG: unable to
handle kernel paging request at xxx" in the rndis_wlan-related logs.
I can't be sure whether these are the same bug, related bugs, or
completely separate issues, but maybe you would know better than I do.

Thanks and regards,
Luís Picciochi

On Tue, Nov 3, 2009 at 11:17 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > I'm sorry, but I'm not sure I know how to do this. Are these options on the
> > .config file? If not, can you please instruct me more clearly on how to do
> > this?
>
> They are .config options
>
> I would enable
>
> DEBUG_KERNEL
> DEBUG_PAGEALLOC
> PAGE_POISONING
> DEBUG_STACKOVERFLOW
> DEBUG_STACK_USAGE
> DEBUG_OBJECTS
> DEBUG_OBJECTS_FREE
> DEBUG_SLAB or SLUB_DEBUG or SLQB_DEBUG
>
> and the stack size is configured with
>
> 4KSTACKS
>
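
When comparing the logs from this bug with those from bug #12794, it can help to pull out every "unable to handle kernel paging request" line together with its address, so the two sets can be put side by side. A minimal Python sketch follows, assuming the message text quoted in this comment; the function name is hypothetical and the log format around the message is not assumed.

```python
import re

# Message text as quoted in the comment above; the address follows "at".
MARKER = re.compile(r"BUG: unable to handle kernel paging request at (\S+)")


def paging_faults(log_lines):
    """Yield (line_number, address) for each paging-request BUG line."""
    for number, line in enumerate(log_lines, start=1):
        match = MARKER.search(line)
        if match:
            yield number, match.group(1)


if __name__ == "__main__":
    # Assumes the log lives at the path mentioned earlier in this report.
    with open("/var/log/syslog", errors="replace") as handle:
        for number, address in paging_faults(handle):
            print(number, address)
```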
Comment 12 Pitxyoki 2009-12-07 16:16:20 UTC
As reported on bug #12794, I consider that bug to be resolved. Since I applied the last patch I have not had any more crashes like the ones reported here. These two bugs really seemed the same to me.

You may close this if you consider that you don't need any more info about this.

Regards,
Pitxyoki
Comment 13 Rafael J. Wysocki 2009-12-29 21:58:22 UTC
On Tuesday 29 December 2009, Luís Picciochi Oliveira wrote:
> Hi,
> The bug is present on 2.6.32 and subsequent versions (2.6.32.1, 2.6.32.2).
> It has been resolved as of 2.6.33-rc1.
> 
> Regards,
> Luís Picciochi
> 
> On Tue, Dec 29, 2009 at 3:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=14436
> > Subject         : Computer becomes unusable without any apparent reason
> > Submitter       : Pitxyoki <Pitxyoki@gmail.com>
> > Date            : 2009-10-18 18:32 (73 days old)
Comment 14 Rafael J. Wysocki 2009-12-29 22:35:42 UTC
On Tuesday 29 December 2009, Luís Picciochi Oliveira wrote:
> On Tue, Dec 29, 2009 at 10:04 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Tuesday 29 December 2009, Luís Picciochi Oliveira wrote:
> >> Hi,
> >> The bug is present on 2.6.32 and subsequent versions (2.6.32.1, 2.6.32.2).
> >> It has been resolved as of 2.6.33-rc1.
> >
> > Thanks for the update.
> >
> > Is it known how it was fixed in 2.6.33-rc1 or do you just see that the bug is
> > not present in there any more?
> 
> Hi,
> Like I reported at [1], I strongly believe this was the same bug as
> #12794, which was resolved by a patch resulting from my feedback and
> Jussi Kivilinna's work. After I enabled memory debugging as suggested on
> bug #14436, everything pointed in the direction of that bug.
> The issue was solved after applying Jussi's patch. That patch has been
> committed to the mainline kernel and I can confirm that since then the
> bug hasn't occurred again.