Bug 7732 - System freeze every 10 days
Summary: System freeze every 10 days
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-12-22 01:09 UTC by Fabrice BOYRIE
Modified: 2008-09-22 09:20 UTC

See Also:
Kernel Version: 2.6.19.*
Subsystem:
Regression: ---
Bisected commit-id:



Description Fabrice BOYRIE 2006-12-22 01:09:35 UTC
Most recent kernel where this bug did *NOT* occur: unknown

Distribution: Mandriva Linux 2007

Hardware Environment:
Tyan Tank GT25 model B5381 (http://www.tyan.com/products/html/gt25b5381.html)
_ 2*Xeon 5160
_ LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS
_ Ethernet controller: Intel Corporation PRO/1000 EB Network Connection with I/O Acceleration
_ 8 GB FB-DIMM
I can attach the output of lspci -vvv and dmidecode if you want.

Software Environment:
/usr/src/linux-2.6.19/scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
 
Linux speedy-node0 2.6.19.1 #1 SMP Mon Dec 18 09:21:38 CET 2006 x86_64 Intel(R)
Xeon(R) CPU            5160  @ 3.00GHz GNU/Linux
 
Gnu C                  4.1.1
Gnu make               3.81
binutils               2.16.91.0.7
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.39
xfsprogs               2.8.11
Linux C Library        > libc.2.4
Dynamic linker (ldd)   2.4
Procps                 3.2.6
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
udev                   098
Modules Loaded         xt_tcpudp xt_state xt_pkttype iptable_raw xt_CLASSIFY
xt_CONNMARK xt_MARK xt_length xt_connmark xt_physdev bridge llc xt_policy
xt_multiport xt_conntrack ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS
ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE
ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah
ipt_addrtype ip_nat_irc ip_nat_tftp ip_nat_ftp ip_conntrack_irc
ip_conntrack_tftp ip_conntrack_ftp iptable_nat ip_nat ip_conntrack nfnetlink
iptable_mangle iptable_filter ipv6 ip_tables x_tables e1000 af_packet ide_cd
binfmt_misc loop nfsd exportfs lockd nfs_acl sunrpc dm_mod video thermal sbs
i2c_ec i2c_core fan container button battery asus_acpi ac usbhid ff_memless
cpufreq_ondemand cpufreq_conservative usbkbd cpufreq_powersave freq_table
processor cpuid ehci_hcd uhci_hcd usbcore

Problem Description:
About every 10 days the computer freezes completely (the screen stays blank and
SysRq does nothing). There is no oops in the log, but just before the last
freeze I got the following message (four times, because l502.exe uses 4 threads):

"speedy-node0 kernel: printk: 53 messages suppressed.
 Dec 22 08:27:59 speedy-node0 kernel: l502.exe: page allocation failure. order:3, mode:0x20
 Dec 22 08:27:59 speedy-node0 kernel: 
 Dec 22 08:27:59 speedy-node0 kernel: Call Trace:
 Dec 22 08:27:59 speedy-node0 kernel:  <IRQ>  [<ffffffff8010ea3b>] __alloc_pages+0x290/0x2a9
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80157f59>] cache_alloc_refill+0x270/0x4ac
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80332134>] ip_local_deliver_finish+0x0/0x1e7
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff801adbcb>] __kmalloc+0x5b/0x62
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff8012ba47>] __alloc_skb+0x5c/0x123
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80318540>] __netdev_alloc_skb+0x12/0x2d
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff8814cacf>] :e1000:e1000_alloc_rx_buffers+0xcd/0x2e8
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff8814d1d9>] :e1000:e1000_clean_rx_irq+0x4ef/0x50f
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80159616>] invalidate_interrupt2+0x66/0x70
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff8814c065>] :e1000:e1000_clean+0x90/0x166
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80159616>] invalidate_interrupt2+0x66/0x70
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff8010bcdc>] net_rx_action+0xa4/0x1af
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80111307>] __do_softirq+0x55/0xc4
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80159e7c>] call_softirq+0x1c/0x28
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80163070>] do_softirq+0x2c/0x7d
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff801631f1>] do_IRQ+0x130/0x151
 Dec 22 08:27:59 speedy-node0 kernel:  [<ffffffff80159271>] ret_from_intr+0x0/0xa
 Dec 22 08:27:59 speedy-node0 kernel:  <EOI> 
 Dec 22 08:27:59 speedy-node0 kernel: Mem-info:
 Dec 22 08:27:59 speedy-node0 kernel: DMA per-cpu:
 Dec 22 08:27:59 speedy-node0 kernel: CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: DMA32 per-cpu:
 Dec 22 08:27:59 speedy-node0 kernel: CPU    0: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:   4
 Dec 22 08:27:59 speedy-node0 kernel: CPU    1: Hot: hi:  186, btch:  31 usd: 112   Cold: hi:   62, btch:  15 usd:   3
 Dec 22 08:27:59 speedy-node0 kernel: CPU    2: Hot: hi:  186, btch:  31 usd: 156   Cold: hi:   62, btch:  15 usd:  57
 Dec 22 08:27:59 speedy-node0 kernel: CPU    3: Hot: hi:  186, btch:  31 usd:   1   Cold: hi:   62, btch:  15 usd:  48
 Dec 22 08:27:59 speedy-node0 kernel: Normal per-cpu:
 Dec 22 08:27:59 speedy-node0 kernel: CPU    0: Hot: hi:  186, btch:  31 usd: 147   Cold: hi:   62, btch:  15 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: CPU    1: Hot: hi:  186, btch:  31 usd:  51   Cold: hi:   62, btch:  15 usd:   0
 Dec 22 08:27:59 speedy-node0 kernel: CPU    2: Hot: hi:  186, btch:  31 usd: 144   Cold: hi:   62, btch:  15 usd:  56
 Dec 22 08:27:59 speedy-node0 kernel: CPU    3: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:   4
 Dec 22 08:27:59 speedy-node0 kernel: Active:1933823 inactive:47032 dirty:10900 writeback:0 unstable:0 free:9315 slab:41702 mapped:1741 pagetables:4941
 Dec 22 08:27:59 speedy-node0 kernel: DMA free:12152kB min:16kB low:20kB high:24kB active:0kB inactive:0kB present:11704kB pages_scanned:0 all_unreclaimable? yes
 Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 3255 8052
 Dec 22 08:27:59 speedy-node0 kernel: DMA32 free:20928kB min:4636kB low:5792kB high:6952kB active:2956164kB inactive:112640kB present:3333344kB pages_scanned:658 all_unreclaimable? no
 Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 0 4797
 Dec 22 08:27:59 speedy-node0 kernel: Normal free:4180kB min:6836kB low:8544kB high:10252kB active:4779128kB inactive:75488kB present:4912640kB pages_scanned:32 all_unreclaimable? no
 Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 0 0
 Dec 22 08:27:59 speedy-node0 kernel: DMA: 4*4kB 3*8kB 3*16kB 5*32kB 4*64kB 3*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12152kB
 Dec 22 08:27:59 speedy-node0 kernel: DMA32: 36*4kB 30*8kB 30*16kB 1*32kB 1*64kB 0*128kB 0*256kB 9*512kB 3*1024kB 0*2048kB 3*4096kB = 20928kB
 Dec 22 08:27:59 speedy-node0 kernel: Normal: 29*4kB 12*8kB 228*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4180kB
 Dec 22 08:27:59 speedy-node0 kernel: Swap cache: add 27624357, delete 27623144, find 11253258/15176314, race 4+240
 Dec 22 08:27:59 speedy-node0 kernel: Free swap  = 7557008kB
 Dec 22 08:27:59 speedy-node0 kernel: Total swap = 8193140kB
 Dec 22 08:27:59 speedy-node0 kernel: Free swap:       7557008kB
 Dec 22 08:27:59 speedy-node0 kernel: 2293760 pages of RAM
 Dec 22 08:27:59 speedy-node0 kernel: 248943 reserved pages
 Dec 22 08:27:59 speedy-node0 kernel: 128551 pages shared
 Dec 22 08:27:59 speedy-node0 kernel: 1273 pages swap cached
"

NB: The network MTU is 9000, which may explain this message, but the freeze
also occurred with MTU=1500.
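
A note on reading the dump above: order:3 means the allocator was asked for 2^3 = 8 physically contiguous pages, i.e. 32 kB assuming the usual 4 kB x86_64 page size. The per-zone buddy lists at the end of the Mem-info output show how many free blocks of each size existed at that moment. The following is a minimal Python sketch, not a diagnostic tool: the helper names are hypothetical, the page size is an assumption, and the three strings are copied from the log. It re-adds the buddy lists and counts the free blocks large enough for an order-3 request.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64

# Buddy-list lines copied from the Mem-info dump in the report.
BUDDY_LINES = {
    "DMA": "4*4kB 3*8kB 3*16kB 5*32kB 4*64kB 3*128kB 0*256kB "
           "0*512kB 1*1024kB 1*2048kB 2*4096kB",
    "DMA32": "36*4kB 30*8kB 30*16kB 1*32kB 1*64kB 0*128kB 0*256kB "
             "9*512kB 3*1024kB 0*2048kB 3*4096kB",
    "Normal": "29*4kB 12*8kB 228*16kB 0*32kB 1*64kB 0*128kB 1*256kB "
              "0*512kB 0*1024kB 0*2048kB 0*4096kB",
}


def parse_buddy(line):
    """Return a list of (block_count, block_size_kb) pairs."""
    return [(int(n), int(kb)) for n, kb in re.findall(r"(\d+)\*(\d+)kB", line)]


def total_free_kb(line):
    """Sum of all free blocks in the zone, in kB."""
    return sum(n * kb for n, kb in parse_buddy(line))


def blocks_at_least(line, order):
    """Number of free blocks that could satisfy a request of this order."""
    need_kb = PAGE_KB << order  # order 3 -> 32 kB
    return sum(n for n, kb in parse_buddy(line) if kb >= need_kb)


for zone, line in BUDDY_LINES.items():
    print(zone, total_free_kb(line), "kB free,",
          blocks_at_least(line, 3), "blocks >= 32 kB")
```

With the values from the log, the totals come out to 12152, 20928 and 4180 kB, matching the per-zone "free" figures, and the Normal zone has only two free blocks of 32 kB or larger. This only shows how to read the dump; it does not by itself explain the freeze.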

All data are on XFS partitions. 

Steps to reproduce:
Launch an OpenMP version of Gaussian using 7.3G of system RAM.
Also use the computer as an NFS server for another machine.
Also run a visualisation program on it (e.g. p4vasp).
SOMETIMES it freezes.


NB: The computer is used in production, so I cannot run many tests.
Comment 1 Martin J. Bligh 2006-12-22 08:14:55 UTC
First - check the simplest thing ... does alt+sysrq work OK when the system is
NOT locked up?

If so, even under memory alloc failure, it should still function. The main
reason it wouldn't is if interrupts are off. You're best off enabling nmi
watchdog to catch that sort of thing, and it should give us a backtrace.

Plus ... I really wouldn't run a production machine on 2.6.19.1 if I were you, I
think the corruption issues are still unresolved. If you want something stable,
I generally recommend staying on the latest rev of the *previous* release, ie
2.6.18.6 (or whatever we're up to now) ... until 2.6.20 is released, at which
point you'd move to 2.6.19.X
Comment 2 Fabrice BOYRIE 2006-12-22 08:53:43 UTC
Alt+SysRq+s works as expected when the system is not locked up.
How do I enable the NMI watchdog? Is it the softdog module?
I know 2.6.19 is a bit recent, but the goal was to prepare for the purchase of
ten identical computers in a few months (when 2.6.20 is out).
Comment 3 Martin J. Bligh 2006-12-22 10:16:58 UTC
Documentation/nmi_watchdog.txt
Comment 4 Fabrice BOYRIE 2006-12-22 10:54:09 UTC
Thanks

Now I just have to wait.
Comment 5 Fabrice BOYRIE 2006-12-26 23:35:07 UTC
The computer was booted with nmi_watchdog=1.
In the log I have:
speedy-node0 kernel: testing NMI watchdog ... OK.

And I have an NMI line in /proc/interrupts.

But last night the computer locked up. The screen stayed black, there was
nothing in the logs, and SysRq didn't work.
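
One way to double-check that the watchdog is actually ticking is to sample the NMI line of /proc/interrupts twice and confirm that the counters grow. Below is a minimal Python sketch, assuming the usual layout of that file (a label followed by one counter per CPU). The function names and the sample text are invented for illustration and are not output from this machine.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts content."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at the first non-numeric field (some kernels
                # append a text description after the counters).
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


def watchdog_ticking(before, after):
    """True if every CPU's NMI counter increased between two samples."""
    return (bool(before) and len(before) == len(after)
            and all(b < a for b, a in zip(before, after)))


# Invented sample data (hypothetical, NOT from the reporter's machine).
SAMPLE_T0 = """\
           CPU0       CPU1       CPU2       CPU3
NMI:       1200       1180       1175       1190
"""
SAMPLE_T1 = """\
           CPU0       CPU1       CPU2       CPU3
NMI:       1260       1241       1236       1251
"""

print(watchdog_ticking(nmi_counts(SAMPLE_T0), nmi_counts(SAMPLE_T1)))
```

On a live system the two samples would be read from /proc/interrupts a few seconds apart, for example with open("/proc/interrupts").read(). Depending on the watchdog mode, the counters may advance slowly on an idle CPU, so a longer interval may be needed.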
Comment 6 Martin J. Bligh 2006-12-27 09:09:16 UTC
Presumably the machine is not pingable either? Or reachable through serial console?

If so, that's ... very odd ;-)
Comment 7 Fabrice BOYRIE 2006-12-27 14:53:16 UTC
I will try at the next lock up. 

Thanks
Comment 8 Fabrice BOYRIE 2007-02-21 23:26:03 UTC
Lock-ups are very rare now. I don't know why; maybe the load has changed. But
last night there was another one. No ping, no SysRq, and the serial console was
also frozen (with no messages on that console).
  Maybe I should try 2.6.20?
Comment 9 Fabrice BOYRIE 2007-04-16 02:52:15 UTC
I have the same problem with a 2.6.20.2 kernel.
But this weekend there was an error message readable on the screen, and two
keyboard LEDs were blinking.

  I had to copy the screen by hand, so there are certainly some errors.

[<ffffffff80175bc5>] do_nmi+0x2B/0x48
[<ffffffff80167dcf>] nmi+0x7F/0x98
[<ffffffff80177740>] flat_send_IPI_nosk
[<ffffffff80164deb>] __write_lock_failed+0x9/0x10
<<EOE>> [<ffffffff8016794F>] _write_lock_irq+0xf/0x10
[<ffffffff8011604f>] do_exit +0x4bf/0x880
[<ffffffff801676d6>] _spin_unlock_irq_restore +0x8/0x10
[<ffffffff8010ad72>] do_page_fault +0x772/0x870
[<ffffffff8018531c>] find_bissect_group +0x2bc/0x760
[<ffffffff880b5050>] surrpc: do_cache_clean +0x0/0x50
[<ffffffff80167b2d>] error_exit +0x0/0x84
[<ffffffff880b5050>] sunrpc: do_cache_clean 0x0/0x50
[<ffffffff880b475d>] sunrpc: cache_clean 0xfd/0x230
[<ffffffff880b5059>] sunrpc: do_cache_clean 0x9/0x50
[<ffffffff80150138>] run_workqueue +0xa8/0x160
[<ffffffff8014c6f0>] worker_thread+0x0/0x190
[<ffffffff8014c83b>] worker_thread+0x14b/0x190
[<ffffffff80186910>] default_wake_function +x0/0x10
[<ffffffff8014c6f0>] worker_thread +0x0/0x190
[<ffffffff80134499>] kthread 0xd9/0x120
[<ffffffff80163468>] child_rip 0xd/0x12
[<ffffffff801343c0>] kthread +0x0/0x120
[<ffffffff8016345e>] child_rip +0x0/0x12
Comment 10 Natalie Protasevich 2007-08-26 11:34:50 UTC
Fabrice,
How is it working for you with 2.6.22+?
The last trace shows that the rpc process is trying to get a write lock; in this code path IRQs do get disabled. The top trace is actually from the IRQ stack, while processing the softirq for e1000 buffer allocation. So under tight memory conditions there appears to be lock contention or a deadlock.
Moving item to networking for now.
Comment 11 Fabrice BOYRIE 2007-08-27 05:27:45 UTC
No crash since I installed 2.6.22 (33 days ago). But for more than 20 days the computer has been busy with a program that needs only 4.4G.
Comment 12 Natalie Protasevich 2008-04-16 00:13:24 UTC
Any updates on this problem? Has the system been running better with the new kernel?
Comment 13 Fabrice BOYRIE 2008-04-16 00:59:02 UTC
I've put 2.6.24.2 on the computer. I still have problems whenever programs that use too much memory run.

  With this kernel I have seen an error about reading swap three times (after which the computer froze). I recreated the swap and ran badblocks -n on the swap device without any error.

smartctl gives
Device: SEAGATE  ST373455SS       Version: 0001
Serial number: 3LQ04LJV00009713SQ1P
Device type: disk
Transport protocol: SAS
Local Time is: Wed Apr 16 09:57:52 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     25 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 250549901
  Blocks received from initiator = 59939445
  Blocks read from cache and sent to initiator = 621789276
  Number of read and write commands whose size <= segment size = 80380618
  Number of read and write commands whose size > segment size = 160
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 11595.78
  number of minutes until next internal SMART test = 5

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   71867968       26         0  71867994   71867994       6705.456           0
write:         0        0         0         0          0    1729755.619           0

Non-medium error count:      476

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   11493                 - [-   -    -]

Maybe it is a hardware error...
Comment 14 jasin colegrove 2008-05-30 19:54:18 UTC
In my experience, random computer problems are related to power issues 99.9% of the time, and the other 0.1% to specific pieces of hardware. Hardware that has failed outright will simply not work, period, while power issues always show up as randomness, since any sort of activity can spike the power draw to the point where the system is underpowered.
