Bug 26612
Summary: | BUG in fs/inode.c:429 | ||
---|---|---|---|
Product: | File System | Reporter: | Maciej Rutecki (maciej.rutecki) |
Component: | Other | Assignee: | fs_other |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | cebbert, florian, florian, maciej.rutecki, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.37-rc7 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 21782 | ||
Attachments: | Picture of the bug; Picture of the bug; Picture of the bug; Picture of the bug; Next crash; dmesg; relevant dmesg output; dmesg output; The first time lil' bugger came out of hiding; And the second time; Trace |
Description
Maciej Rutecki
2011-01-12 19:17:54 UTC
I think I have found the problem and it's not in the kernel tree, it's in tp_smapi, which I have been using. Since I disabled it the kernel is as stable as it should be. I suspected tp_smapi when I saw that all the traces had a power-supply-related sysfs file in "last sysfs file". I then disabled it and since then I haven't seen any more crashes.
Ok, you might want to file a bug with tp_smapi though. Thanks, Flo
I'm sorry, but could someone with appropriate permissions reopen this bug? I have been hit by the bug again. Not only once though, and it always looks different. :-(
Here is the problem: I sometimes hit an invalid opcode bug. It mostly happens when resuming from hibernate, but it is not really reproducible and the backtrace always shows something different. To top it off, there is no trace of it on the hard drive, so I am limited to taking pictures of the bug. I'll add some to this bug report as attachments. There is however one similarity: the "last sysfs file" is always /sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full if I can see that line on the screen.
Created attachment 46102 [details]
Picture of the bug
Created attachment 46112 [details]
Picture of the bug
Created attachment 46122 [details]
Picture of the bug
Created attachment 46132 [details]
Picture of the bug
I wasn't paying enough attention. It's not always
/sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
In attachment 46102 [details] it's
/sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/PNP0C09:00/PNP0C0A:00/ACPI<...>/power_supply/AC/online
Created attachment 46252 [details]
Next crash
This crash happened when I tried to hibernate my system.
Here are some observations regarding the images above.
Functions in the backtrace that appear often:
1. shrink_dcache_memory
2. d_find_alias
Last unloaded module is always: scsi_wait_scan
I hit this bug with 2.6.37 as well as with 2.6.37-rc7 and am currently very cautious not to hibernate my system. I will try again when I install a new kernel.
Created attachment 48952 [details]
dmesg
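The manual comparison of backtraces described above (which functions recur, and which "last sysfs file" is reported) could be automated once traces are available as text. The following is a minimal sketch, assuming the usual 2.6.x oops layout with `[<address>] function+0xoffset/0xsize` frames and a `last sysfs file:` line. The function name `summarize_traces` and the sample traces are invented for illustration and are not taken from the attachments of this report.

```python
import re
from collections import Counter

# One stack frame in a 2.6.x-era oops, for example:
#   [<ffffffff81100000>] ? shrink_dcache_memory+0x10/0x20
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
LAST_SYSFS_RE = re.compile(r"last sysfs file:\s*(\S+)")


def summarize_traces(traces):
    """Count in how many traces each function and each sysfs path appears."""
    functions = Counter()
    sysfs_files = Counter()
    for text in traces:
        # set() so that a function repeated within one trace is counted once.
        functions.update(set(FRAME_RE.findall(text)))
        sysfs_files.update(set(LAST_SYSFS_RE.findall(text)))
    return functions, sysfs_files


# Invented sample input (NOT copied from the attachments of this report).
SAMPLE_TRACES = [
    """last sysfs file: /sys/class/power_supply/BAT0/energy_full
Call Trace:
 [<ffffffff81100000>] ? shrink_dcache_memory+0x10/0x20
 [<ffffffff81100100>] d_find_alias+0x30/0x40
""",
    """last sysfs file: /sys/class/power_supply/AC/online
Call Trace:
 [<ffffffff81100000>] shrink_dcache_memory+0x10/0x20
 [<ffffffff81100200>] ? hypothetical_helper+0x5/0x50
""",
]

if __name__ == "__main__":
    functions, sysfs_files = summarize_traces(SAMPLE_TRACES)
    for name, count in functions.most_common():
        print(f"{count} trace(s): {name}")
    for path, count in sysfs_files.most_common():
        print(f"{count} trace(s): last sysfs file {path}")
```

Counting each function once per trace keeps a deep recursive stack from dominating the tally, so the result reflects how many separate crashes share a function.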
I believe I hit this bug as well, using a 2.6.37.1 ZEN kernel. The system ran fine for a while after waking up from hibernation (a couple hours, if I recall correctly), then bugged.
This appears to be another problem caused by hibernation, and not a real bug?
Created attachment 49722 [details]
relevant dmesg output
Here is another bug report. I am not sure whether this is the same bug or not, but I assume that they are at least connected, as this one surfaced after resuming from hibernation as well. However, there is one major difference from the ones I reported above: my system did not freeze and I could preserve the trace.
There are actually two traces. The first one, as you can clearly see, appeared directly after resuming and the second when I tried rebooting (which worked btw.).
(In reply to comment #13)
> This appears to be another problem caused by hibernation, and not a real bug?
If you define "not a real bug" as heavily interfering with work, combined with potential data loss, then this is not a real bug, right.
Is this still happening as of 2.6.38.y / 2.6.39-rc* ?
Only if this is related: http://article.gmane.org/gmane.linux.kernel/1116962
Ok. Are you only and consistently getting the backtrace from comment #17 with 2.6.38.y? If so, then we can assume that this is a separate issue. If you get varying backtraces with 2.6.38.y, then chances are high that it is the same underlying cause.
Anyway, here are some thoughts on how to troubleshoot this:
Module unload is probably done by some suspend script of your distribution? Can you test with plain 'echo ... > sys/power/state'?
I don't know if scsi_wait_scan is vital for your operations? If not, you could try blacklisting it (i.e. never load it).
Also please don't use tp_smapi while debugging this. We want to get as many potential failure sources out of the way as possible in order to home in on the cause.
Although we don't know if these two things are related to the crashes, it might be good to cross them off our list of suspects.
Another thing you could try is to check who is accessing sysfs and prevent them from doing so (maybe it's also the hibernate/suspend scripts you are using, or it is some power-management daemon). Probably a red herring, but since that possibility has been raised we also want to check whether it is the cause.
(In reply to comment #18)
> Ok. Are you only and consistently getting the backtrace from comment #17 with
> 2.6.38.y? If so, then we can assume that this is a separate issue. If you get
> varying backtraces with 2.6.38.y then chances are high, that it is the same
> underlying cause.
I only got that backtrace once, that's why I thought that it could be related in the first place. Currently, that is with 2.6.38-2-amd64 from Debian (i.e. 2.6.38.2 with a few patches), hibernation has ceased to work completely. When resuming from hibernation my system is unusable (even with all external devices unplugged, from init 1, with the framebuffer disabled). It freezes with the message:
PM: Loading and decompressing image data...
PM: Read ... kbytes in ... seconds
Suspending console(s) (use no_console_suspend to debug)
> Anyway, here are some thoughts on how to troubleshoot this:
I haven't had time to test all of your suggestions but I will get to them later.
> Also please don't use tp_smapi while debugging this. We want to get as many
> potential failure sources out of the way in order to circle in on the cause.
I have left tp_smapi uninstalled since I suspected it to be the culprit in comment #1, so I am sure it's not that. I won't reinstall it though.
Thanks for your kind help so far, I will try to investigate this further and keep this bug report updated.
(In reply to comment #18)
> Module unload is probably done by some suspend-script of your distribution?
> Can you test with plain 'echo ... > sys/power/state'?
The same as described above with kernel 2.6.38.2. I am currently hoping that those two bugs are the same so that I can reproduce them.
By the way, with the kernel parameter no_console_suspend it breaks at a different point. This time the output is similar to this:
PM: Loading and decompressing image data...
PM: Read ... kbytes in ... seconds
[...]
Extended CMLS year 2000
All in all I found several bug reports that show the same problem:
https://bugs.launchpad.net/linux/+bug/282220
https://bugzilla.novell.com/show_bug.cgi?id=450256
The bug reports there are marked as closed but they forgot to post the solution.
Created attachment 52982 [details]
dmesg output
The same happens when booting with init=/bin/bash and using sys/power/state to hibernate
# lsmod
Module Size Used by
ext4 285166 1
mbcache 12930 1 ext4
jbd2 65157 1 ext4
crc16 12343 1 ext4
sha256_generic 16797 2
cryptd 14463 0
aes_x86_64 16796 8
aes_generic 37122 1 aes_x86_64
cbc 12747 4
dm_crypt 22256 1
dm_mod 62467 12 dm_crypt
btrfs 428766 0
zlib_deflate 25466 1 btrfs
crc32c 12656 1
libcrc32c 12426 1 btrfs
sg 25769 0
sd_mod 35501 2
sr_mod 21824 0
cdrom 35134 1 sr_mod
crc_t10dif 12348 1 sd_mod
uhci_hcd 26290 0
i915 315266 1
drm_kms_helper 26893 1 i915
drm 165567 2 i915,drm_kms_helper
e1000e 123965 0
ahci 25089 1
libahci 22568 1 ahci
i2c_algo_bit 12834 1 i915
i2c_core 23725 4 i915,drm_kms_helper,drm,i2c_algo_bit
libata 147240 2 ahci,libahci
sdhci_pci 13184 0
sdhci 21727 1 sdhci_pci
ehci_hcd 39529 0
scsi_mod 161457 4 sg,sr_mod,sd_mod,libata
usbcore 122908 3 uhci_hcd,ehci_hcd
mmc_core 58589 1 sdhci
video 17553 1 i915
nls_base 12753 1 usbcore
thermal 17330 0
thermal_sys 17939 2 video,thermal
button 12994 1 i915
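For reference, the sysfs-based hibernation test discussed in this report (writing to the power state file directly instead of going through a distribution script such as pm-hibernate) could be sketched as follows. This is only an illustration: it assumes the elided 'echo ...' command is the usual write of `disk` to /sys/power/state, and the function name `hibernate_via_sysfs` and its `state_path` parameter are hypothetical, added so the logic can be exercised against an ordinary file.

```python
def hibernate_via_sysfs(state_path="/sys/power/state"):
    """Request hibernation by writing "disk" to the power state file."""
    # The state file lists the sleep states the kernel offers, e.g. "mem disk".
    with open(state_path) as state_file:
        supported = state_file.read().split()
    if "disk" not in supported:
        raise RuntimeError(
            f"hibernation not offered by {state_path}: {supported}"
        )
    # On a real system this write only returns after the machine has resumed.
    with open(state_path, "w") as state_file:
        state_file.write("disk")


if __name__ == "__main__":
    # Needs root and hibernates the machine immediately.
    hibernate_via_sysfs()
```

Going through the state file directly takes any module unloading or other steps done by suspend scripts out of the picture, which is the point of the test suggested in comment #18.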
For the 2.6.38.2 hibernate problem, you might want to try the patch from https://patchwork.kernel.org/patch/679502/ or wait for the next stable release.
Okay, thanks. I will wait a little and then test what you suggested again. I hope I can find a way to reproduce the problem. Sorry for spamming this report with unrelated stuff.
Okay, I'm back with a new kernel. The bug from comment #19 is gone now, but the problem remains, as expected. The new kernel is 2.6.38-2-amd64 version 2.6.38-3 from Debian unstable btw.
I tried hibernating a few times and was "lucky" to catch the bug. The first two times were almost the same: I booted as usual and then, instead of logging in via kdm, switched to tty1, logged in as root and issued pm-hibernate. After the system had been hibernated I restarted the computer and, after a few seconds of waiting, was in front of the console again. Then I switched to tty7, where kdm resides, and logged in. Immediately after that the system crashed. The first time it was unusable afterwards, the second time I could use it just fine. Both times I could save the traces and they will be attached after this comment.
Then I wondered what would happen if I used sysfs for hibernating and was astonished to find it working perfectly (twice again). Then I used pm-hibernate with the environment variable PM_DEBUG set to see if I could figure out what the problem was. However, all worked fine after that and the bug did not reappear. I tried another few times and ... nothing. All as it should be. The lil' bugger's really into hiding. Damn.
Created attachment 53782 [details]
The first time lil' bugger came out of hiding
Created attachment 53792 [details]
And the second time
Created attachment 55102 [details]
Trace
Here is another trace. I saw this trace popping up in KDE's konsole after resuming from hibernation. I immediately tried to reboot but the system froze.
Kernel version is 2.6.38-3 from Debian unstable.
I haven't had any further crashes with Linux 3.0, so I guess [1] fixed it.
[1] https://lkml.org/lkml/2011/7/17/103
That wasn't applied but Al Viro provided this in the same thread:
commit d6e9bd256c88ce5f4b668249e363a74f51393daa
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri May 27 07:03:15 2011 -0400
Lift the check for automount points into do_lookup()
Merged in v3.0-rc1 btw.