Most recent kernel where this bug did *NOT* occur: unknown
Distribution: Mandriva Linux 2007
Hardware Environment: Tyan Tank GT25 model B5381 (http://www.tyan.com/products/html/gt25b5381.html)
_ 2*Xeon 5160
_ LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS
_ Ethernet controller: Intel Corporation PRO/1000 EB Network Connection with I/O Acceleration
_ 8 G FBDIMM
I can attach the output of lspci -vvv and dmidecode if you want.

Software Environment:
/usr/src/linux-2.6.19/scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux speedy-node0 2.6.19.1 #1 SMP Mon Dec 18 09:21:38 CET 2006 x86_64 Intel(R) Xeon(R) CPU 5160 @ 3.00GHz GNU/Linux

Gnu C                  4.1.1
Gnu make               3.81
binutils               2.16.91.0.7
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.39
xfsprogs               2.8.11
Linux C Library        > libc.2.4
Dynamic linker (ldd)   2.4
Procps                 3.2.6
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
udev                   098
Modules Loaded         xt_tcpudp xt_state xt_pkttype iptable_raw xt_CLASSIFY xt_CONNMARK xt_MARK xt_length xt_connmark xt_physdev bridge llc xt_policy xt_multiport xt_conntrack ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype ip_nat_irc ip_nat_tftp ip_nat_ftp ip_conntrack_irc ip_conntrack_tftp ip_conntrack_ftp iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle iptable_filter ipv6 ip_tables x_tables e1000 af_packet ide_cd binfmt_misc loop nfsd exportfs lockd nfs_acl sunrpc dm_mod video thermal sbs i2c_ec i2c_core fan container button battery asus_acpi ac usbhid ff_memless cpufreq_ondemand cpufreq_conservative usbkbd cpufreq_powersave freq_table processor cpuid ehci_hcd uhci_hcd usbcore

Problem Description:
Every 10 days the computer freezes completely (the screen stays blank and SysRq doesn't do anything). There is no oops in the log. But just before the last freeze I had the following message (4 times, because l502.exe uses 4 threads):

"speedy-node0 kernel: printk: 53 messages suppressed.
Dec 22 08:27:59 speedy-node0 kernel: l502.exe: page allocation failure. order:3, mode:0x20
Dec 22 08:27:59 speedy-node0 kernel:
Dec 22 08:27:59 speedy-node0 kernel: Call Trace:
Dec 22 08:27:59 speedy-node0 kernel: <IRQ> [<ffffffff8010ea3b>] __alloc_pages+0x290/0x2a9
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80157f59>] cache_alloc_refill+0x270/0x4ac
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80332134>] ip_local_deliver_finish+0x0/0x1e7
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff801adbcb>] __kmalloc+0x5b/0x62
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff8012ba47>] __alloc_skb+0x5c/0x123
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80318540>] __netdev_alloc_skb+0x12/0x2d
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff8814cacf>] :e1000:e1000_alloc_rx_buffers+0xcd/0x2e8
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff8814d1d9>] :e1000:e1000_clean_rx_irq+0x4ef/0x50f
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80159616>] invalidate_interrupt2+0x66/0x70
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff8814c065>] :e1000:e1000_clean+0x90/0x166
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80159616>] invalidate_interrupt2+0x66/0x70
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff8010bcdc>] net_rx_action+0xa4/0x1af
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80111307>] __do_softirq+0x55/0xc4
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80159e7c>] call_softirq+0x1c/0x28
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80163070>] do_softirq+0x2c/0x7d
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff801631f1>] do_IRQ+0x130/0x151
Dec 22 08:27:59 speedy-node0 kernel: [<ffffffff80159271>] ret_from_intr+0x0/0xa
Dec 22 08:27:59 speedy-node0 kernel: <EOI>
Dec 22 08:27:59 speedy-node0 kernel: Mem-info:
Dec 22 08:27:59 speedy-node0 kernel: DMA per-cpu:
Dec 22 08:27:59 speedy-node0 kernel: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: DMA32 per-cpu:
Dec 22 08:27:59 speedy-node0 kernel: CPU 0: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 4
Dec 22 08:27:59 speedy-node0 kernel: CPU 1: Hot: hi: 186, btch: 31 usd: 112 Cold: hi: 62, btch: 15 usd: 3
Dec 22 08:27:59 speedy-node0 kernel: CPU 2: Hot: hi: 186, btch: 31 usd: 156 Cold: hi: 62, btch: 15 usd: 57
Dec 22 08:27:59 speedy-node0 kernel: CPU 3: Hot: hi: 186, btch: 31 usd: 1 Cold: hi: 62, btch: 15 usd: 48
Dec 22 08:27:59 speedy-node0 kernel: Normal per-cpu:
Dec 22 08:27:59 speedy-node0 kernel: CPU 0: Hot: hi: 186, btch: 31 usd: 147 Cold: hi: 62, btch: 15 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: CPU 1: Hot: hi: 186, btch: 31 usd: 51 Cold: hi: 62, btch: 15 usd: 0
Dec 22 08:27:59 speedy-node0 kernel: CPU 2: Hot: hi: 186, btch: 31 usd: 144 Cold: hi: 62, btch: 15 usd: 56
Dec 22 08:27:59 speedy-node0 kernel: CPU 3: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 4
Dec 22 08:27:59 speedy-node0 kernel: Active:1933823 inactive:47032 dirty:10900 writeback:0 unstable:0 free:9315 slab:41702 mapped:1741 pagetables:4941
Dec 22 08:27:59 speedy-node0 kernel: DMA free:12152kB min:16kB low:20kB high:24kB active:0kB inactive:0kB present:11704kB pages_scanned:0 all_unreclaimable? yes
Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 3255 8052
Dec 22 08:27:59 speedy-node0 kernel: DMA32 free:20928kB min:4636kB low:5792kB high:6952kB active:2956164kB inactive:112640kB present:3333344kB pages_scanned:658 all_unreclaimable? no
Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 0 4797
Dec 22 08:27:59 speedy-node0 kernel: Normal free:4180kB min:6836kB low:8544kB high:10252kB active:4779128kB inactive:75488kB present:4912640kB pages_scanned:32 all_unreclaimable? no
Dec 22 08:27:59 speedy-node0 kernel: lowmem_reserve[]: 0 0 0
Dec 22 08:27:59 speedy-node0 kernel: DMA: 4*4kB 3*8kB 3*16kB 5*32kB 4*64kB 3*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12152kB
Dec 22 08:27:59 speedy-node0 kernel: DMA32: 36*4kB 30*8kB 30*16kB 1*32kB 1*64kB 0*128kB 0*256kB 9*512kB 3*1024kB 0*2048kB 3*4096kB = 20928kB
Dec 22 08:27:59 speedy-node0 kernel: Normal: 29*4kB 12*8kB 228*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4180kB
Dec 22 08:27:59 speedy-node0 kernel: Swap cache: add 27624357, delete 27623144, find 11253258/15176314, race 4+240
Dec 22 08:27:59 speedy-node0 kernel: Free swap = 7557008kB
Dec 22 08:27:59 speedy-node0 kernel: Total swap = 8193140kB
Dec 22 08:27:59 speedy-node0 kernel: Free swap: 7557008kB
Dec 22 08:27:59 speedy-node0 kernel: 2293760 pages of RAM
Dec 22 08:27:59 speedy-node0 kernel: 248943 reserved pages
Dec 22 08:27:59 speedy-node0 kernel: 128551 pages shared
Dec 22 08:27:59 speedy-node0 kernel: 1273 pages swap cached
"

NB: The network MTU is 9000, which may explain this message, but I also had the freeze with MTU=1500.
All data are on XFS partitions.

Steps to reproduce:
Launch an OpenMP version of Gaussian using 7.3G of system RAM.
Also use the computer as an NFS server for another one.
Also use the computer with some visualisation program (e.g. p4vasp).
And SOMETIMES it freezes.

NB: The computer is used in production, so I cannot run many tests.
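For readers who want to check the Mem-info figures in the report, here is a minimal Python sketch that parses one of the per-zone free-list lines and does the arithmetic. It assumes 4 kB pages (so order:3 means 2^3 = 8 contiguous pages, i.e. 32 kB); the helper names are made up for illustration. As far as I can tell, mode:0x20 corresponds to an atomic (non-sleeping) allocation in kernels of this era, which fits the <IRQ> context of the trace, but treat that as an assumption.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as is usual on x86_64


def parse_buddy_line(line):
    """Return {block_size_kB: count} from a free-list line such as
    'Normal: 29*4kB 12*8kB ... = 4180kB' (the trailing total is ignored)."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def total_free_kb(blocks):
    """Sum of all free blocks, in kB."""
    return sum(size * count for size, count in blocks.items())


def blocks_at_least(blocks, order):
    """Number of free blocks big enough for a 2**order page allocation."""
    min_kb = PAGE_KB << order
    return sum(count for size, count in blocks.items() if size >= min_kb)


# The Normal zone line, copied from the log in the report.
normal = parse_buddy_line(
    "Normal: 29*4kB 12*8kB 228*16kB 0*32kB 1*64kB 0*128kB 1*256kB "
    "0*512kB 0*1024kB 0*2048kB 0*4096kB = 4180kB")

print(total_free_kb(normal))       # 4180, matching the total in the log
print(blocks_at_least(normal, 3))  # 2 blocks of 32 kB or larger
```

So the Normal zone had only two free blocks large enough for an order-3 request, and its 4180 kB of free memory was already below the zone's min watermark of 6836 kB shown in the same dump.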
First - check the simplest thing ... does alt+sysrq work OK when the system is NOT locked up? If so, even under memory alloc failure, it should still function. The main reason it wouldn't is if interrupts are off. You're best off enabling nmi watchdog to catch that sort of thing, and it should give us a backtrace. Plus ... I really wouldn't run a production machine on 2.6.19.1 if I were you, I think the corruption issues are still unresolved. If you want something stable, I generally recommend staying on the latest rev of the *previous* release, ie 2.6.18.6 (or whatever we're up to now) ... until 2.6.20 is released, at which point you'd move to 2.6.19.X
Alt+SysRq+s works as expected when the system is not locked up. How do I enable the NMI watchdog? Is it the softdog module? I know 2.6.19 is a bit recent, but the goal was to prepare the purchase of ten identical computers in a few months (when 2.6.20 is out).
Documentation/nmi_watchdog.txt
Thanks. Now I just have to wait.
The computer was booted with nmi_watchdog=1. In the log I have
speedy-node0 kernel: testing NMI watchdog ... OK.
and I have an NMI line in /proc/interrupts. But last night the computer locked up. The screen stayed black, there was nothing in the logs, and SysRq didn't work.
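One way to confirm that the watchdog is actually ticking, beyond the boot message, is to compare the NMI line of /proc/interrupts at two points in time. The sketch below is only an illustration: the function names and the sample snapshots are invented, and it assumes that with nmi_watchdog=1 the per-CPU NMI counters should keep growing while the machine is healthy.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in the text of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only the numeric columns; some kernels append a label.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def stalled_cpus(before, after):
    """CPUs whose NMI count did not increase between two snapshots."""
    return [cpu for cpu, (a, b) in enumerate(zip(before, after)) if b <= a]


# Made-up snapshots for a 4-CPU machine, taken a few seconds apart.
snap1 = """           CPU0       CPU1       CPU2       CPU3
  0:     123456          0          0          0   IO-APIC-edge  timer
NMI:       1200       1180       1175       1190
LOC:     123400     123390     123380     123395
"""
snap2 = snap1.replace("1200       1180       1175       1190",
                      "1260       1240       1175       1250")

print(stalled_cpus(nmi_counts(snap1), nmi_counts(snap2)))  # [2]
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a short sleep in between; an empty result means every CPU is still receiving NMIs.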
Presumably the machine is not pingable either? Or reachable through serial console? If so, that's ... very odd ;-)
I will try at the next lock up. Thanks
Lock-ups are very rare now. I don't know why; maybe the load has changed. But last night there was another one. No ping, no SysRq, and the serial console was also frozen (and I got no messages on that console). Maybe I should try 2.6.20?
I have the same problems with a 2.6.20.2 kernel. But this weekend there was an error message readable on the screen and two keyboard LEDs were blinking. I had to copy the screen by hand, so there are certainly some errors.

[<ffffffff80175bc5>] do_nmi+0x2B/0x48
[<ffffffff80167dcf>] nmi+0x7F/0x98
[<ffffffff80177740>] flat_send_IPI_nosk
[<ffffffff80164deb>] __write_lock_failed+0x9/0x10
<<EOE>>
[<ffffffff8016794F>] _write_lock_irq+0xf/0x10
[<ffffffff8011604f>] do_exit +0x4bf/0x880
[<ffffffff801676d6>] _spin_unlock_irq_restore +0x8/0x10
[<ffffffff8010ad72>] do_page_fault +0x772/0x870
[<ffffffff8018531c>] find_bissect_group +0x2bc/0x760
[<ffffffff880b5050>] sunrpc: do_cache_clean +0x0/0x50
[<ffffffff80167b2d>] error_exit +0x0/0x84
[<ffffffff880b5050>] sunrpc: do_cache_clean 0x0/0x50
[<ffffffff880b475d>] sunrpc: cache_clean 0xfd/0x230
[<ffffffff880b5059>] sunrpc: do_cache_clean 0x9/0x50
[<ffffffff80150138>] run_workqueue +0xa8/0x160
[<ffffffff8014c6f0>] worker_thread+0x0/0x190
[<ffffffff8014c83b>] worker_thread+0x14b/0x190
[<ffffffff80186910>] default_wake_function +x0/0x10
[<ffffffff8014c6f0>] worker_thread +0x0/0x190
[<ffffffff80134499>] kthread 0xd9/0x120
[<ffffffff80163468>] child_rip 0xd/0x12
[<ffffffff801343c0>] kthread +0x0/0x120
[<ffffffff8016345e>] child_rip +0x0/0x12
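Because the trace above was copied by hand, its offsets are unreliable, but the addresses and symbol names are still enough to read the call chain. The following Python sketch pulls out just those two fields and tolerates a missing "+" or a missing offset. The regular expression and the function name are illustrative assumptions, and the pattern targets this hand-copied layout only, not the ":module:symbol" form used in the earlier syslog excerpt.

```python
import re

# "[<address>]", an optional "module:" prefix, then the symbol name.
ENTRY = re.compile(r"\[<([0-9a-fA-F]+)>\]\s*(?:(\w+):\s*)?(\w+)")


def parse_trace(text):
    """Return a list of (address, module_or_None, symbol) tuples."""
    return [(int(addr, 16), module or None, symbol)
            for addr, module, symbol in ENTRY.findall(text)]


# A few entries taken from the hand-copied trace.
sample = (
    "[<ffffffff80164deb>] __write_lock_failed+0x9/0x10 <<EOE>> "
    "[<ffffffff8016794F>] _write_lock_irq+0xf/0x10 "
    "[<ffffffff8011604f>] do_exit +0x4bf/0x880 "
    "[<ffffffff880b475d>] sunrpc: cache_clean 0xfd/0x230 "
    "[<ffffffff80134499>] kthread 0xd9/0x120"
)

for addr, module, symbol in parse_trace(sample):
    print(f"{addr:#018x} {module or '-':8} {symbol}")
```

Read from the bottom up, the sample shows a kernel thread running the sunrpc cache cleaner, which ends up in do_exit and then spins in __write_lock_failed.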
Fabrice, how is it working for you with 2.6.22+? The last trace shows that the rpc process is trying to get a write lock; in this code path irqs do get disabled. The top trace is actually for the IRQ stack while processing the soft irq for e1000 buffer allocation. So under tight memory conditions this looks like lock contention or a deadlock. Moving item to networking for now.
No crash since I installed 2.6.22 (33 days ago). But for more than 20 days the computer has been busy with a program that needs only 4.4G.
Any updates on this problem? Has the system been running better with the new kernel?
I've put 2.6.24.2 on the computer. I still have problems whenever programs that use too much memory run. With this kernel I have seen an error about reading swap three times (the computer froze afterwards). I've recreated the swap and run badblocks -n on the swap device without any error. smartctl gives:

Device: SEAGATE ST373455SS Version: 0001
Serial number: 3LQ04LJV00009713SQ1P
Device type: disk
Transport protocol: SAS
Local Time is: Wed Apr 16 09:57:52 2008 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 25 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 250549901
  Blocks received from initiator = 59939445
  Blocks read from cache and sent to initiator = 621789276
  Number of read and write commands whose size <= segment size = 80380618
  Number of read and write commands whose size > segment size = 160
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 11595.78
  number of minutes until next internal SMART test = 5

Error counter log:
          Errors Corrected by         Total      Correction    Gigabytes      Total
              ECC         rereads/    errors     algorithm     processed      uncorrected
          fast | delayed  rewrites    corrected  invocations   [10^9 bytes]   errors
read:   71867968      26         0    71867994   71867994         6705.456    0
write:         0       0         0           0          0      1729755.619    0

Non-medium error count: 476

SMART Self-test log
Num  Test              Status      segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                   number   (hours)
# 1  Background long   Completed   -        11493     -             [-  -   -]

Maybe it is a hardware error...
In my experience, random computer problems are related to power issues 99.9% of the time, and the other 0.1% to specific pieces of hardware. A normal hardware failure will simply not work, period, while power issues will always show randomness, since any sort of activity can make the power draw peak to the point where the machine is underpowered.