Bug 13536 - crash while copying atime with touch
Summary: crash while copying atime with touch
Status: CLOSED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ReiserFS
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: ReiserFS developers team
Reported: 2009-06-14 11:33 UTC by Elmar Stellnberger
Modified: 2012-06-08 11:57 UTC

Regression: No

Attachments
files for which time could be changed + last file on which crash occurred (723.18 KB, text/plain)
2009-06-14 11:33 UTC, Elmar Stellnberger
/var/log/messages (130.36 KB, application/x-bzip2)
2009-06-14 11:36 UTC, Elmar Stellnberger
lsmod (5.37 KB, application/x-kdeuser2)
2009-06-15 16:33 UTC, Elmar Stellnberger
lsmod-list for last crash (4.11 KB, application/x-kdeuser2)
2009-06-18 13:54 UTC, Elmar Stellnberger
lsmods for last working full run of the timestamping script (1.89 KB, application/x-kdeuser2)
2009-06-18 13:56 UTC, Elmar Stellnberger

Description Elmar Stellnberger 2009-06-14 11:33:28 UTC
Created attachment 21910
files for which time could be changed + last file on which crash occurred

The system suddenly powers off while executing this little scriptlet:
mount -t reiserfs /dev/sdb8 /mnt/ptgsuse
mount -t reiserfs /dev/temp/ptgsuse /mnt/ptgsuse-new/
cd /mnt/ptgsuse
find . -print0| while read -d $'\000' file; do
  echo $file >>/mnt/timestamped.ok
  touch -r "$file" -a "/mnt/ptgsuse-new/$file"
done
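
For anyone trying to reproduce this, a more defensive variant of the loop above might look like the sketch below. It is not the script the crash was reported with. The function name copy_atimes is made up for illustration, all three paths are assumed to be absolute (the function changes directory), and the -c flag is an added design choice so that touch never creates files that are missing on the target side.

```shell
# copy_atimes SRC DST LOG (hypothetical helper, not part of the original report):
# for every non-symlink under SRC, copy its access time onto the same
# relative path under DST. SRC, DST and LOG are assumed to be absolute paths.
copy_atimes() {
  src=$1 dst=$2 log=$3
  (
    cd "$src" || exit 1
    # find hands the names over as arguments, so spaces or newlines in
    # file names do not break the loop.
    find . -xdev ! -type l -exec sh -c '
      dst=$1 log=$2; shift 2
      for f in "$@"; do
        # Log before touching: after a crash the last line names the file in flight.
        printf "%s\n" "$f" >>"$log"
        # -a: access time only, -r: take it from the source file,
        # -c: do not create the target if it does not exist.
        touch -c -a -r "$f" "$dst/$f"
      done
    ' sh "$dst" "$log" {} +
  )
}
```

With the mount points from the description this would be called as `copy_atimes /mnt/ptgsuse /mnt/ptgsuse-new /mnt/timestamped.ok`.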
Comment 1 Elmar Stellnberger 2009-06-14 11:36:20 UTC
Created attachment 21911
/var/log/messages
Comment 2 Elmar Stellnberger 2009-06-14 11:37:42 UTC
Tested with kernel-vanilla-2.6.30-master_20090610141226_1b1ebc88 from http://ftp.suse.com/pub/projects/kernel/kotd/master/.
Comment 3 Elmar Stellnberger 2009-06-14 11:54:22 UTC
Also present with kernel-2.6.27.23-0.1.
Comment 4 Jeff Mahoney 2009-06-14 22:45:26 UTC
Are you able to reproduce this without virtualbox?
Comment 5 Elmar Stellnberger 2009-06-15 16:29:57 UTC
I did not actually use virtualbox.
Nonetheless it is a good idea to unload all unnecessary kernel modules first (vmware, virtualbox) and retry in runlevel 1 or 2.
Comment 6 Elmar Stellnberger 2009-06-15 16:33:53 UTC
Created attachment 21926
lsmod

Perhaps there are some other modules I could unload before running the next test (please tell me).
Comment 7 Elmar Stellnberger 2009-06-15 17:36:17 UTC
Why not test it right away with kernel-debug? Just tell me where to find the core file afterwards.
Comment 8 Elmar Stellnberger 2009-06-18 13:53:02 UTC
  Recently I tried to narrow down the problem by unloading all kernel modules that could possibly be unloaded (up to 9), and suddenly it worked without any problem. Unexpectedly, the vmware & virtualbox modules did no harm. However finding the kernel module that causes the crash by binary search is almost impossible since there is no crash as soon as all requests can be satisfied from the cache memory. Nonetheless I can provide a minimal module list the script has been working with and a maximal module list that is sure to cause a crash.
  I wonder whether there is any way to obtain a backtrace/core file. On FreeBSD there is always a core dump lying around after a crash.
Comment 9 Elmar Stellnberger 2009-06-18 13:54:19 UTC
Created attachment 21988
lsmod-list for last crash
Comment 10 Elmar Stellnberger 2009-06-18 13:56:14 UTC
Created attachment 21990
lsmods for last working full run of the timestamping script
Comment 11 Roland Kletzing 2009-08-11 14:33:00 UTC
>However finding the kernel module that causes the crash by binary search is
>almost impossible since there is no crash as soon as all requests can be
>satisfied from the cache

so, for each iteration you will need a reboot first, right?

but what's the problem with that? ;)

i'd go this way:

1. make a list of all modules you want to unload.
2. name that file however you want
3. do "cat file.list |while read module;do modprobe -r $module;done"
4. test
5. after the next crash, cut that file into 2 parts and repeat 3+4 for each part
6. in theory, one of the files should contain the "bad" module. name the files accordingly (.good / .bad)
7. take the .bad file and repeat the steps until bisection ends with a file containing a single module
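
The splitting and unloading steps above could be scripted roughly as follows. This is only a sketch: the function names split_list and unload_modules and the DRY_RUN variable are invented here, and the list file is assumed to hold one module name per line.

```shell
# split_list FILE (hypothetical helper): write the first half of FILE to
# FILE.a and the remainder to FILE.b, one module name per line.
split_list() {
  total=$(wc -l <"$1")
  half=$(( (total + 1) / 2 ))
  head -n "$half" "$1" >"$1.a"
  tail -n +"$((half + 1))" "$1" >"$1.b"
}

# unload_modules FILE (hypothetical helper): try to unload every module
# listed in FILE. With DRY_RUN=1 the modprobe commands are only printed.
unload_modules() {
  while read -r module; do
    [ -n "$module" ] || continue        # skip empty lines
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "modprobe -r $module"
    else
      modprobe -r "$module" || echo "could not unload $module" >&2
    fi
  done <"$1"
}
```

After a reboot one would run `split_list file.list`, then `unload_modules file.list.a`, rerun the touch loop, and rename the half according to the outcome before splitting it again.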
Comment 12 Elmar Stellnberger 2009-08-12 09:09:11 UTC
  This is exactly what I have been doing. It is very cumbersome and tedious, since not only does booting take a while but, much worse, the looped touch consumes a lot of time and hdd activity, so my hard disk has to suffer. Logarithmic search is not nearly as fast here as I would wish it to be.
  Besides this, I really wonder whether there is no way to debug the Linux kernel. It is so easy to analyze an auto-created backtrace under FreeBSD. Furthermore, I have no idea how you will find the bug just by knowing the offending kernel module. It could hide anywhere. Is there really no possibility to get a Linux kernel core dump?
Comment 13 Roland Kletzing 2009-08-15 16:19:26 UTC
i don't know how to do a kernel dump core, but besides all that, i think you have another major problem with your system.

i'd probably call that a root-cause (found it in the logs you sent):

Jun 14 10:50:07 scaleo smartd[3947]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 111 to 123
Jun 14 10:50:07 scaleo smartd[3947]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 14 10:50:07 scaleo smartd[3947]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 123
Jun 14 11:20:07 scaleo smartd[4022]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 123 to 112


123 degrees celsius ?  

DOH! 

i never saw a harddisk getting that hot, and for sure that's an operating temperature above the specs.

i think you could fry eggs on your harddisks!!!

if the harddisks really get that hot (please check that smartd doesn't lie here) i'd first consider getting the cooling for your system fixed.
Comment 14 Roland Kletzing 2009-08-15 18:50:55 UTC
>i don't know how to do a kernel dump core
here is some information for you:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/kdump/kdump.txt
Comment 15 Christian Kujau 2009-08-18 07:32:56 UTC
> Jun 14 11:20:07 scaleo smartd[4022]: Device: /dev/sda [SAT], SMART Usage
> Attribute: 194 Temperature_Celsius changed from 123 to 112

This should be the raw value of the SMART temperature attribute, not the real value. Elmar, you should nevertheless check the temperature of your drives and for hardware errors in general (cabling, bad memory, hot components, etc) to rule out hardware issues. It may well be that the CPU is getting warmer than usual when executing the script and the box just powers off for safety reasons.

 > find . -print0| while read -d $'\000' file; do
 >  echo $file >>/mnt/timestamped.ok
 >  touch -r "$file" -a "/mnt/ptgsuse-new/$file"
 > done

So, you basically create new files in /mnt/ptgsuse-new, with a reference timestamp of the file found in /mnt/ptgsuse? Or do the files in /mnt/ptgsuse-new exist already and you're just changing atimes?

  > cd $MNT1
  > find . -xdev ! -type l -exec touch -r '{}' -a $MNT2/'{}' \; \
                           -exec echo '{}' >> $HOME/timestamped.ok \;

I've run this one a few times with already existing files in the other filesystem (MNT2), no errors or powerdowns so far (vanilla, latest git).
Comment 16 Elmar Stellnberger 2009-08-18 07:56:54 UTC
  My new FS-notebook does not get hot at all (in contrast to others). I recently had a comprehensive hardware check done (memory, hdd). The only thing I cannot fully exclude at this time is a malware infection, though basic tests came back negative.
  The posted program does not copy files but only atime records (see Comment #1). No crash will occur if you just copy a few files. Moreover, the problem may be related to reiserfs. I will read Roland Kletzing's howto on creating kernel core dumps to provide useful information, soon. Nonetheless it would be very helpful if someone else could reproduce it. Use the root-fs of a fully installed system, two reiserfs partitions and two different sata hard disks for them (here: SATA_WDC_WD6400AAKS).
Comment 17 Roland Kletzing 2009-08-18 09:07:37 UTC
>This should be the raw value of the SMART temperature attribute, not the 
>real value.

how can we determine the real value then ?

on my system it looks like this:

neoware:~ # smartctl -d ata -a /dev/sda |egrep "RAW_VALUE|Temperature_Celsius"
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0022   076   046   000    Old_age   Always       -       52 (Lifetime Min/Max 9/67)

mind the column "RAW_VALUE". 
so, if smartd isn't telling nonsense, my disk runs a little bit cooler.... (whereas i think 52 degrees is still too hot)
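
To pull just that RAW_VALUE figure out of the smartctl table, something like the following could be used. It is a sketch under the assumption that the attribute table has the column layout shown above, with RAW_VALUE starting in the 10th whitespace-separated field; the function name raw_temperature is made up here.

```shell
# raw_temperature (hypothetical helper): read a smartctl attribute table on
# stdin and print the first RAW_VALUE field of attribute 194
# (Temperature_Celsius). Assumes the column layout shown in this comment.
raw_temperature() {
  awk '$1 == 194 { print $10 }'
}
```

For example `smartctl -d ata -a /dev/sda | raw_temperature` prints 52 for the output quoted above. Whether that number really is degrees Celsius depends on the drive.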


>I will read Roland Kletzing's howto on creating kernel core dumps to
>provide useful information, soon.
mind that this is just a pointer. i cannot even tell how well this works or if it works at all, as i did not try it.
Comment 18 Christian Kujau 2009-08-18 09:21:00 UTC
On 8/18/09 2:07 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
>> This should be the raw value of the SMART temperature attribute, not the
>> real value.
>
> how can we determine the real value then ?

My wording was wrong: the 123 degrees were not the "raw" but the 
"normalized" value:

http://smartmontools.sourceforge.net/faq.html#disk-temperature

> on my system it looks like this:
[...]

Yeah, on mine too. smartd however has to be taught to print the raw
("real") value as well, as the FAQ above explains.

C.
Comment 19 Roland Kletzing 2009-08-20 10:27:25 UTC
ok, thanks. seems smartmon/smartd has some quirks or some disks just report nonsense. let's skip that temperature issue for now....

Elmar, can we have your mount options for your reiser filesystems?

furthermore, can you try mounting with "noatime" to see if the problem also happens when atime is disabled for the filesystem in question?
Comment 20 Roland Kletzing 2009-08-22 08:53:16 UTC
Elmar, could you confirm if mount option "data=journal" is triggering your problem and if this patch: http://bugzilla.kernel.org/attachment.cgi?id=22800  is fixing it ?  you can otherwise try .31-rc7 kernel, which already has this patch included.
Comment 21 Elmar Stellnberger 2009-08-23 20:44:13 UTC
Good news; the bug seems to have already been resolved with kernel 2.6.30-master_20090610141226_1b1ebc88-default. I could not provoke a crash either with the mount options used before (defaults,noauto,users) or with noatime,data=journal or data=ordered added.
I can still provoke the crash when booting into kernel 2.6.27.23-0.1.
Comment 22 Elmar Stellnberger 2009-08-23 20:48:57 UTC
... but this is the kernel I had initially posted the bug for?!
... wait a moment, I will have a malware test ...
Comment 23 Elmar Stellnberger 2009-08-23 21:00:45 UTC
No, it is still reproducible for both kernels. The only peculiarity is that if you mount once with all the additional options described in Comment #20 and then mount normally, no crash will occur.
Comment 24 Elmar Stellnberger 2009-08-23 21:00:59 UTC
Will kernel-default-2.6.31-master_20090821145917_6697a63e contain the desired patch?
Comment 25 Roland Kletzing 2009-08-23 22:29:51 UTC
no, the fix went in later.

i was speaking of the kernel downloadable at http://www.kernel.org/

i.e. download
http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.30.tar.bz2
and apply this patch:
http://www.kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.31-rc7.bz2

if you want to use suse kotd, you may need to wait some more until some kernel dated 23.08.2009(+) shows up there
Comment 26 Elmar Stellnberger 2009-08-25 20:07:20 UTC
No, the bug is still there with kernel 2.6.31-rc7-master_20090824154012_a3944f56.
Comment 27 Roland Kletzing 2009-08-25 21:01:32 UTC
weird...

but as this is bugzilla for vanilla kernel and not suse kernel (which contains suse patches) i would have taken kernel-vanilla... package from kotd.

mind trying that, too ?

regarding comment #23 - i'm a little bit unsure. now, what is the offending mount option? atime, data=journal or both?

if you're unsure too, please test again and post results of

noatime,data=journal 
noatime,data=ordered
atime,data=journal
atime,data=ordered
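
To keep the four runs apart, the mount commands could be generated rather than typed by hand. A minimal sketch follows; the function name mount_matrix is invented here and the function only prints the commands, it does not mount anything.

```shell
# mount_matrix DEV DIR (hypothetical helper): print one reiserfs mount
# command for each of the four atime / journal-mode combinations listed above.
mount_matrix() {
  for atime in noatime atime; do
    for data in journal ordered; do
      echo "mount -t reiserfs -o $atime,data=$data $1 $2"
    done
  done
}
```

`mount_matrix /dev/sdb8 /mnt/ptgsuse` prints the four commands for the source filesystem from the description; each run would need a reboot in between so the cache is cold.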


furthermore, unlike the vanilla kernel, the suse kernel has barrier=flush in place as the default. so i would also compare barrier=flush vs. barrier=none

+config REISERFS_DEFAULTS_TO_BARRIERS_ENABLED
+       bool "Default to 'barrier=flush' in reiserfs"
+       depends on REISERFS_FS
+        help
+         Modern disk drives support write caches that can speed up writeback.
+         Some devices, in order to improve their performance statistics,
+         report that the write has been completed even when it has only
+         been committed to volatile cache memory. This can result in
+         severe corruption in the event of power loss.
+
+         The -o barrier option enables the file system to direct the block
+         layer to issue a barrier, which ensures that the cache has been
+         flushed before proceeding. This can produce some slowdown in
+         certain environments, but allows higher end storage arrays with
+         battery-backed caches to report completes writes sooner than
+         would be otherwise possible.
+
+         Without this option, disk write caches should be disabled if
+         you value data integrity over writeback performance.
+
+         If unsure, say N.
+
Comment 28 Elmar Stellnberger 2009-08-28 12:19:10 UTC
  Unfortunately my latest tests with noatime have all provoked crashes:
noatime,data=journal and noatime,data=ordered, both with kernel 2.6.31rc7 as well as kernel 2.6.30 (as mentioned above).
  I had forgotten to mention that the target volume is an LVM LV. With 2.6.31rc7-vanilla I could not access the LVM LV any more (it must have worked with vanilla kernels before).
  Very strange; I can't tell why it did not crash with any setting that one time.
Comment 29 Elmar Stellnberger 2009-10-21 11:00:46 UTC
  Atime setting seems to have been totally rewritten with vmlinuz-2.6.31.1-3-debug. It is now much faster and no longer crashes while setting atimes.
  However, after having executed the atime-set script multiple times and after waiting a while, my computer crashed again. I had no success trying to get a backtrace. It was possible to load the crash kernel with crashkernel=64M@32M but not with the recommended setting (64M@16M was rejected). However, on doing an 'echo c >/proc/sysrq-trigger' I got a plain crash rather than entering the crash handler kernel, even though kexec had succeeded in loading the crash kernel before.
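
Before triggering a test crash it may help to confirm that the reservation and the crash kernel are actually in place. The following is only a sketch with invented function names. It assumes the running kernel exposes /sys/kernel/kexec_crash_loaded, which may not hold for every kernel or configuration.

```shell
# crashkernel_arg CMDLINE (hypothetical helper): print the crashkernel=
# parameter found in a kernel command line string; fail if there is none.
crashkernel_arg() (
  set -f                       # no glob expansion while splitting the string
  for word in $1; do
    case $word in
      crashkernel=*) echo "$word"; exit 0 ;;
    esac
  done
  exit 1
)

# crash_kernel_loaded (hypothetical helper): succeed if the kernel reports a
# loaded crash kernel. Assumes /sys/kernel/kexec_crash_loaded is available.
crash_kernel_loaded() {
  [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = 1 ]
}
```

`crashkernel_arg "$(cat /proc/cmdline)"` should print the reservation that was booted with, and `crash_kernel_loaded` should succeed after kexec has loaded the crash kernel. If either check fails, the sysrq test crash cannot be expected to reach the crash handler.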
