Bug 26612

Summary: BUG in fs/inode.c:429
Product: File System
Reporter: Maciej Rutecki (maciej.rutecki)
Component: Other
Assignee: fs_other
Status: CLOSED CODE_FIX
Severity: normal
CC: cebbert, florian, florian, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.37-rc7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 21782
Attachments: Picture of the bug
Picture of the bug
Picture of the bug
Picture of the bug
Next crash
dmesg
relevant dmesg output
dmesg output
The first time lil' bugger came out of hiding
And the second time
Trace

Description Maciej Rutecki 2011-01-12 19:17:54 UTC
Subject    : BUG in fs/inode.c:429
Submitter  : Florian Kriener <florian@kriener.org>
Date       : 2011-01-06 16:35
Message-ID : 201101061735.40060.florian@kriener.org
References : http://marc.info/?l=linux-kernel&m=129433235223735&w=2

This entry is being used for tracking a regression from 2.6.36. Please don't
close it until the problem is fixed in the mainline.
Comment 1 Florian Kriener 2011-01-28 08:17:53 UTC
I think I have found the problem, and it's not in the kernel tree: it's in tp_smapi, which I have been using. Since I disabled it, the kernel has been as stable as it should be.

I suspected tp_smapi when I saw that all the traces had a power-supply-related sysfs file in "last sysfs file". I then disabled it, and since then I haven't seen any more crashes.
Comment 2 Florian Mickler 2011-01-29 20:29:26 UTC
Ok, you might want to file a bug with tp_smapi though. 

Thanks,
Flo
Comment 3 Florian Kriener 2011-02-02 21:07:51 UTC
I'm sorry, but could someone with appropriate permissions reopen this bug? I have been hit by the bug again. More than once, in fact, and it always looks different. :-(

Here is the problem: I sometimes hit an invalid opcode bug. It mostly happens when resuming from hibernate, but it is not really reproducible and the backtrace always shows something different. To top it off, there is no trace of it on the hard drive, so I am limited to taking pictures of the bug. I'll add some to this bug report as attachments.

There is however one similarity: The "last sysfs file" always is /sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full if I can see that line on the screen.
Comment 4 Florian Kriener 2011-02-02 21:16:42 UTC
Created attachment 46102
Picture of the bug
Comment 5 Florian Kriener 2011-02-02 21:19:15 UTC
Created attachment 46112
Picture of the bug
Comment 6 Florian Kriener 2011-02-02 21:19:39 UTC
Created attachment 46122
Picture of the bug
Comment 7 Florian Kriener 2011-02-02 21:20:08 UTC
Created attachment 46132
Picture of the bug
Comment 8 Florian Kriener 2011-02-02 21:23:39 UTC
I wasn't paying enough attention. It's not always 

/sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full

In attachment 46102 it's

/sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/ACPI<...>/power_supply/AC/online
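
Once several oops texts are available as plain text (the pictures would have to be transcribed first), the "last sysfs file" comparison can be automated. The following is only a minimal sketch: it assumes the line is printed as "last sysfs file: <path>", the function name is made up for illustration, and the sample input is hypothetical and shortened, not taken from the attachments.

```python
import re
from collections import Counter

# Assumed line format: "last sysfs file: /sys/..." somewhere in the oops text.
LAST_SYSFS_RE = re.compile(r"last sysfs file:\s*(\S+)")


def last_sysfs_files(oops_texts):
    """Count how often each 'last sysfs file' path occurs across oops texts."""
    counts = Counter()
    for text in oops_texts:
        match = LAST_SYSFS_RE.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical, shortened sample input (not copied from the attachments).
sample = [
    "invalid opcode: 0000 [#1] SMP\n"
    "last sysfs file: /sys/class/power_supply/BAT0/energy_full\n",
    "invalid opcode: 0000 [#1] SMP\n"
    "last sysfs file: /sys/class/power_supply/AC/online\n",
    "invalid opcode: 0000 [#1] SMP\n"
    "last sysfs file: /sys/class/power_supply/BAT0/energy_full\n",
]

for path, n in last_sysfs_files(sample).most_common():
    print(n, path)
```

If one path clearly dominates, that supports the suspicion that a particular sysfs reader is involved. An even spread would suggest the line is incidental.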
Comment 9 Florian Kriener 2011-02-04 21:17:46 UTC
Created attachment 46252
Next crash

This crash happened when I tried to hibernate my system.
Comment 10 Florian Kriener 2011-02-04 21:28:45 UTC
Here are some observations regarding the images above.

Functions in the backtrace that appear often:
1. shrink_dcache_memory
2. d_find_alias

Last unloaded module is always: scsi_wait_scan
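
A tally like the one above could be produced mechanically once the backtraces exist as text. This is a rough sketch, assuming call-trace frames look like "[<address>] ? function+0xoff/0xsize"; the function name is made up for illustration, and the two sample traces use invented addresses and offsets. Each function is counted once per trace, so the result says in how many traces it appears.

```python
import re
from collections import Counter

# Assumed frame format: " [<ffffffff81234567>] ? some_function+0x12/0x34"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def functions_by_trace_count(traces):
    """Return a Counter mapping function name -> number of traces containing it."""
    counts = Counter()
    for trace in traces:
        # A set ensures repeated frames within one trace are counted only once.
        counts.update(set(FRAME_RE.findall(trace)))
    return counts


# Hypothetical traces; addresses and offsets are invented for illustration.
trace_a = (
    " [<ffffffff810f0001>] ? shrink_dcache_memory+0x10/0x180\n"
    " [<ffffffff810c0002>] shrink_slab+0x60/0x150\n"
)
trace_b = (
    " [<ffffffff810f0003>] d_find_alias+0x20/0x40\n"
    " [<ffffffff810f0001>] shrink_dcache_memory+0x10/0x180\n"
    " [<ffffffff810f0001>] ? shrink_dcache_memory+0x10/0x180\n"
)

for name, n in functions_by_trace_count([trace_a, trace_b]).most_common():
    print(n, name)
```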
Comment 11 Florian Kriener 2011-02-22 08:16:52 UTC
I hit this bug with 2.6.37 as well as with 2.6.37-rc7 and am currently very cautious not to hibernate my system. I will try again when I install a new kernel.
Comment 12 Jan Steffens 2011-02-25 00:38:40 UTC
Created attachment 48952
dmesg

I believe I hit this bug as well, using a 2.6.37.1 ZEN kernel. The system ran fine for a while after waking up from hibernation (a couple hours, if I recall correctly), then bugged.
Comment 13 Chuck Ebbert 2011-02-25 15:02:46 UTC
This appears to be another problem caused by hibernation, and not a real bug?
Comment 14 Florian Kriener 2011-03-01 08:48:11 UTC
Created attachment 49722
relevant dmesg output

Here is another bug report. I am not sure whether this is the same bug or not, but I assume that they are at least connected, as this one surfaced after resuming from hibernation as well. However, there is one major difference from the ones I reported above: my system did not freeze and I could preserve the trace.

There are two traces actually. The first one, as you clearly see, appeared directly after resuming and the second when I tried rebooting (which worked btw.).
Comment 15 Florian Kriener 2011-03-01 08:51:02 UTC
(In reply to comment #13)
> This appears to be another problem caused by hibernation, and not a real bug?

If you define "not a real bug" as heavily interfering with work, combined with potential data loss, then this is not a real bug, right.
Comment 16 Florian Mickler 2011-03-29 21:04:06 UTC
Is this still happening as of 2.6.38.y / 2.6.39-rc* ?
Comment 17 Florian Kriener 2011-03-30 08:32:38 UTC
Only if this is related:

http://article.gmane.org/gmane.linux.kernel/1116962
Comment 18 Florian Mickler 2011-03-30 19:54:51 UTC
Ok. Are you only and consistently getting the backtrace from comment #17 with 2.6.38.y? If so, then we can assume that this is a separate issue. If you get varying backtraces with 2.6.38.y, then chances are high that it is the same underlying cause.

Anyway, here are some thoughts on how to troubleshoot this:
 
Module unload is probably done by some suspend script of your distribution? Can you test with plain 'echo ... > /sys/power/state'?

I don't know if scsi_wait_scan is vital for your operations? If not, you could try blacklisting it (i.e. never load it). 

Also please don't use tp_smapi while debugging this. We want to get as many potential failure sources out of the way in order to circle in on the cause. 

Although we don't know if these two things are related to the crashes, it might be good to cross them off from our list of suspects. 

Another thing you could try is to check who is accessing sysfs and prevent them from doing so (maybe it's the hibernate/suspend scripts you are using, or some power-management daemon). It's probably a red herring, but since the suspicion has been raised, we also want to check whether that is the cause.
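
One way to look for sysfs users is to scan the open file descriptors of all processes. The sketch below is an assumption-laden illustration, not a tested diagnostic for this bug: the function name is made up, and it only reports files that are open at the moment of the scan. Reads of sysfs attributes are usually very short, so it would have to be run repeatedly, and as root to see other users' processes.

```python
import os


def sysfs_openers(proc_root="/proc", prefix="/sys/"):
    """Map pid -> sorted list of open file targets that start with `prefix`."""
    found = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs available at this location
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        targets = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith(prefix):
                targets.add(target)
        if targets:
            found[int(entry)] = sorted(targets)
    return found


if __name__ == "__main__":
    for pid, targets in sorted(sysfs_openers().items()):
        print(pid, targets)
```

Any pid reported here can then be matched against the process list to see whether it belongs to a suspend script or a power-management daemon.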
Comment 19 Florian Kriener 2011-04-01 09:21:50 UTC
(In reply to comment #18)
> Ok. Are you only and consistently getting the backtrace from comment #17 with
> 2.6.38.y? If so, then we can assume that this is a separate issue. If you get
> varying backtraces with 2.6.38.y, then chances are high that it is the same
> underlying cause.

I only got that backtrace once; that's why I thought that it could be related in the first place.

Currently, that is, with 2.6.38-2-amd64 from Debian (i.e. 2.6.38.2 with a few patches), hibernation has ceased to work completely. After resuming from hibernation my system is unusable (even with all external devices unplugged, from init 1, and with the framebuffer disabled). It freezes with the message:

PM: Loading and decompressing image data...
PM: Read ... kbytes in ... seconds
Suspending console(s) (use no_console_suspend to debug)


> Anyway, here are some thoughts on how to troubleshoot this:

I haven't had time to test all of your suggestions but I will get to them later. 

> Also please don't use tp_smapi while debugging this. We want to get as many
> potential failure sources out of the way in order to circle in on the cause. 

I left tp_smapi uninstalled since I suspected it to be the culprit in comment #1, so I am sure it's not that. I won't reinstall it though.

Thanks for your kind help so far, I will try to investigate this further and keep this bug report updated.
Comment 20 Florian Kriener 2011-04-01 17:35:07 UTC
(In reply to comment #18)
> Module unload is probably done by some suspend script of your distribution?
> Can you test with plain 'echo ... > /sys/power/state'?

The same as described above with kernel 2.6.38.2.

I am currently hoping that those two bugs are the same so that I can reproduce them.
Comment 21 Florian Kriener 2011-04-01 17:57:30 UTC
Or rather, booting with the kernel parameter no_console_suspend makes it stop at a different point. This time the output looks similar to this:

PM: Loading and decompressing image data...
PM: Read ... kbytes in ... seconds
[...]
Extended CMOS year: 2000

All in all I found several bug reports that show the same problem:

https://bugs.launchpad.net/linux/+bug/282220
https://bugzilla.novell.com/show_bug.cgi?id=450256

The bug reports there are marked as closed but they forgot to post the solution.
Comment 22 Florian Kriener 2011-04-01 18:29:52 UTC
Created attachment 52982
dmesg output

The same happens when booting with init=/bin/bash and using /sys/power/state to hibernate.

# lsmod
Module                  Size  Used by
ext4                  285166  1 
mbcache                12930  1 ext4
jbd2                   65157  1 ext4
crc16                  12343  1 ext4
sha256_generic         16797  2 
cryptd                 14463  0 
aes_x86_64             16796  8 
aes_generic            37122  1 aes_x86_64
cbc                    12747  4 
dm_crypt               22256  1 
dm_mod                 62467  12 dm_crypt
btrfs                 428766  0 
zlib_deflate           25466  1 btrfs
crc32c                 12656  1 
libcrc32c              12426  1 btrfs
sg                     25769  0 
sd_mod                 35501  2 
sr_mod                 21824  0 
cdrom                  35134  1 sr_mod
crc_t10dif             12348  1 sd_mod
uhci_hcd               26290  0 
i915                  315266  1 
drm_kms_helper         26893  1 i915
drm                   165567  2 i915,drm_kms_helper
e1000e                123965  0 
ahci                   25089  1 
libahci                22568  1 ahci
i2c_algo_bit           12834  1 i915
i2c_core               23725  4 i915,drm_kms_helper,drm,i2c_algo_bit
libata                147240  2 ahci,libahci
sdhci_pci              13184  0 
sdhci                  21727  1 sdhci_pci
ehci_hcd               39529  0 
scsi_mod              161457  4 sg,sr_mod,sd_mod,libata
usbcore               122908  3 uhci_hcd,ehci_hcd
mmc_core               58589  1 sdhci
video                  17553  1 i915
nls_base               12753  1 usbcore
thermal                17330  0 
thermal_sys            17939  2 video,thermal
button                 12994  1 i915
Comment 23 Florian Mickler 2011-04-01 23:42:26 UTC
For the 2.6.38.2 hibernate problem, you might wanna try the patch from https://patchwork.kernel.org/patch/679502/

.. or wait for the next stable release.
Comment 24 Florian Kriener 2011-04-02 10:30:08 UTC
Okay thanks. I will wait a little and then test what you suggested again. I hope I can find a way to reproduce the problem. Sorry for spamming this report with unrelated stuff.
Comment 25 Florian Kriener 2011-04-07 17:11:19 UTC
Okay, I'm back with a new kernel. The bug from comment #19 is gone now, but the original problem remains, as expected. The new kernel is 2.6.38-2-amd64, version 2.6.38-3, from Debian unstable, btw.

I tried hibernating a few times and was "lucky" to catch the bug. The first two times were almost the same: I booted as usual and then instead of logging in via kdm switched to tty1, logged in as root and issued pm-hibernate. After the system had been hibernated I restarted the computer and was, after a few seconds of waiting, in front of the console again. Then I switched to tty7, where kdm resides, and logged in. Immediately after that the system crashed. The first time it was unusable afterwards, the second time I could use it just fine. Both times I could save the traces and they will be attached after this comment.

Then I wondered what would happen if I used sysfs for hibernating and was astonished to find it working perfectly (again, twice). Then I used pm-hibernate with the environment variable PM_DEBUG set to see if I could figure out what the problem was. However, everything worked fine after that and the bug did not reappear. I tried a few more times and ... nothing. All as it should be. The lil' bugger's really gone into hiding. Damn.
Comment 26 Florian Kriener 2011-04-07 17:12:58 UTC
Created attachment 53782
The first time lil' bugger came out of hiding
Comment 27 Florian Kriener 2011-04-07 17:13:18 UTC
Created attachment 53792
And the second time
Comment 28 Florian Kriener 2011-04-23 10:02:42 UTC
Created attachment 55102
Trace

Here is another trace. I saw it pop up in KDE's Konsole after resuming from hibernation. I immediately tried to reboot, but the system froze.

Kernel version is 2.6.38-3 from Debian unstable.
Comment 29 Florian Kriener 2011-08-10 17:24:59 UTC
I haven't had any further crashes with Linux 3.0, so I guess [1] fixed it.

[1] https://lkml.org/lkml/2011/7/17/103
Comment 30 Florian Mickler 2011-08-11 07:56:15 UTC
That wasn't applied but Al Viro provided this in the same thread: 

commit d6e9bd256c88ce5f4b668249e363a74f51393daa
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri May 27 07:03:15 2011 -0400

    Lift the check for automount points into do_lookup()
Comment 31 Florian Mickler 2011-08-11 07:57:49 UTC
Merged in v3.0-rc1 btw.