Bug 7458
Summary: | Severe crashes using MD-RAID5 over LVM2 over IDE | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Xu (development--bugzilla.kernel.org) |
Component: | IDE | Assignee: | Bartlomiej Zolnierkiewicz (bzolnier) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | blocking | CC: | agk, bunk, neilb, okir |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.19-rc4 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Xu
2006-11-04 05:52:43 UTC
This looks like random memory corruption in various slabs (dm_tio, dm_io, biovec-1). The slabs themselves have been trampled on. Does this problem persist with newer kernels?

BTW, I'm seeing CIFS messages in the mix. Can you reproduce this with no CIFS mounts?

Sorry, I'm currently overseas (and will remain overseas for some months), so I cannot afford to try to crash the machine in question, as I for myself cannot restart it, as long as I'm overseas.

Hello, I wished this bug has vanished (and I actually thought that), but it hit me hard on newer kernels (2.6.21.3) :-( So yes, this problem persists with newer kernels.

Now, sometimes, the machine reboots suddenly. Sometimes, the machine prints this stack trace and reboots some minutes later:

Jun 11 00:10:25 router kernel: [ 6656.688000] slab: Internal list corruption detected in cache 'biovec-1'(145), slabp e4141000(86). Hexdump:
Jun 11 00:10:25 router kernel: [ 6656.688000]
Jun 11 00:10:25 router kernel: [ 6656.688000] 000: 00 10 3e c2 dc 77 7e e7 60 02 00 00 60 12 14 e4
Jun 11 00:10:25 router kernel: [ 6656.688000] 010: 56 00 00 00 7b 00 00 00 00 00 c8 9a 22 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 030: fe ff ff ff fe ff ff ff 2d 00 00 00 2c 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 050: fe ff ff ff fe ff ff ff fe ff ff ff 25 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 060: fe ff ff ff 7d 00 00 00 ff ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 070: 48 00 00 00 fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 080: fe ff ff ff fe ff ff ff 20 00 00 00 1b 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 090: 44 00 00 00 fe ff ff ff fe ff ff ff 5b 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 0a0: 3e 00 00 00 4f 00 00 00 fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 0b0: 59 00 00 00 fe ff ff ff 51 00 00 00 07 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 0c0: fe ff ff ff fe ff ff ff fe ff ff ff 58 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 0d0: 27 00 00 00 fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 0e0: 87 00 00 00 fe ff ff ff 21 00 00 00 50 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 0f0: fe ff ff ff 15 00 00 00 fe ff ff ff 78 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 100: fe ff ff ff fe ff ff ff 1d 00 00 00 6d 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 110: fe ff ff ff 31 00 00 00 fe ff ff ff 1c 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 120: 4a 00 00 00 fe ff ff ff fe ff ff ff 81 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 130: fe ff ff ff 5d 00 00 00 3b 00 00 00 67 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 140: 5a 00 00 00 47 00 00 00 fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 150: fe ff ff ff fe ff ff ff 13 00 00 00 86 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 160: 55 00 00 00 fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 170: 00 00 00 00 12 00 00 00 fe ff ff ff 62 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 180: 46 00 00 00 00 00 00 00 7c 00 00 00 fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 190: 63 00 00 00 fe ff ff ff 49 00 00 00 fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 1a0: fe ff ff ff 5f 00 00 00 41 00 00 00 73 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 1b0: fe ff ff ff fe ff ff ff 64 00 00 00 fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 1c0: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 1d0: 80 00 00 00 fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 1e0: fe ff ff ff fe ff ff ff 28 00 00 00 fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 1f0: fe ff ff ff fe ff ff ff fe ff ff ff 36 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 200: fe ff ff ff fe ff ff ff 38 00 00 00 34 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 210: 33 00 00 00 fe ff ff ff fe ff ff ff 10 00 00 00
Jun 11 00:10:25 router kernel: [ 6656.688000] 220: 08 00 00 00 fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 230: fe ff ff ff 56 00 00 00 3c 00 00 00 fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 240: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] 250: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
Jun 11 00:10:25 router kernel: [ 6656.688000] ------------[ cut here ]------------
Jun 11 00:10:25 router kernel: [ 6656.688000] kernel BUG at mm/slab.c:2936!
Jun 11 00:10:25 router kernel: [ 6656.688000] invalid opcode: 0000 [#1]
Jun 11 00:10:25 router kernel: [ 6656.688000] PREEMPT
Jun 11 00:10:25 router kernel: [ 6656.688000] Modules linked in: nls_utf8 cifs nls_cp850 nls_iso8859_1 smbfs act_police sch_ingress cls_u32 sch_sfq sch_htb rfcomm hidp l2cap bluetooth cls_fw sch_prio sch_tbf xt_mark xt_multiport xt_MARK ipt_MASQUERADE xt_TCPMSS ipt_TOS xt_length iptable_mangle nf_nat_ftp nf_conntrack_ftp ipt_REJECT iptable_filter xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nfnetlink ip_tables x_tables hisax isdn mISDN_dsp hfcpci mISDN_capi l3udss1 mISDN_l2 mISDN_l1 mISDN_core capi capifs kernelcapi eeprom lp capability commoncap softdog nls_iso8859_15 isofs zlib_inflate loop psmouse pcips2 8250_pnp 8250 usblp serial_core i2c_viapro via686a i2c_isa pcspkr i2c_core cyblafb via_agp parport_pc agpgart parport evdev dm_mirror pppoe pppox ppp_generic slhc ohci_hcd uhci_hcd usbmouse usbkbd usbhid usbcore ipv6 af_packet netconsole 8139too mii bitrev crc32 unix
Jun 11 00:10:25 router kernel: [ 6656.688000] CPU: 0
Jun 11 00:10:25 router kernel: [ 6656.688000] EIP: 0060:[<c0171ff0>] Not tainted VLI
Jun 11 00:10:25 router kernel: [ 6656.688000] EFLAGS: 00010086 (2.6.21.3lowLatency #2)
Jun 11 00:10:25 router kernel: [ 6656.688000] EIP is at check_slabp+0xf0/0x110
Jun 11 00:10:25 router kernel: [ 6656.688000] eax: 00000001 ebx: e414125f ecx: c7444000 edx: 00000001
Jun 11 00:10:25 router kernel: [ 6656.688000] esi: e4141000 edi: 00000260 ebp: c74459ac esp: c7445988
Jun 11 00:10:25 router kernel: [ 6656.688000] ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Jun 11 00:10:25 router kernel: [ 6656.688000] Process md2_resync (pid: 8045, ti=c7444000 task=d835e490 task.ti=c7444000)
Jun 11 00:10:25 router kernel: [ 6656.688000] Stack: c04405c1 000000ff 00000091 e4141000 00000056 e77c32a0 00000000 00000246
Jun 11 00:10:25 router kernel: [ 6656.688000] e4141000 c7445a18 c0173381 c0172602 00000000 00000044 c74459f0 00000000
Jun 11 00:10:25 router kernel: [ 6656.688000] 00011200 00011200 e77c32a0 e77d6dd0 00000010 e77e77dc e77db918 c74459f0
Jun 11 00:10:25 router kernel: [ 6656.688000] Call Trace:
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c010528a>] show_trace_log_lvl+0x1a/0x30
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0105351>] show_stack_log_lvl+0xb1/0xe0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c010557f>] show_registers+0x1ff/0x380
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0105823>] die+0x123/0x260
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c03791f2>] do_trap+0x82/0xb0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0105f07>] do_invalid_op+0x97/0xb0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0378fbc>] error_code+0x74/0x7c
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0173381>] cache_alloc_refill+0xd1/0x6b0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0173cf3>] kmem_cache_alloc+0xb3/0xc0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c015780e>] mempool_alloc_slab+0xe/0x10
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0157941>] mempool_alloc+0x31/0x140
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c019d823>] bio_alloc_bioset+0x73/0x140
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02f2407>] clone_bio+0x37/0x80
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02f2b8e>] __split_bio+0x17e/0x470
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02f39fe>] dm_request+0xce/0x140
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c020f90b>] generic_make_request+0x1bb/0x360
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02da183>] handle_stripe5+0xb53/0x17b0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02dc8d2>] handle_stripe+0x382/0x1a10
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02dec1d>] sync_request+0x21d/0xcc0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02ec8c7>] md_do_sync+0x7e7/0xd20
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c02eb901>] md_thread+0x31/0x110
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0134633>] kthread+0xa3/0xd0
Jun 11 00:10:25 router kernel: [ 6656.688000] [<c0104e77>] kernel_thread_helper+0x7/0x10
Jun 11 00:10:25 router kernel: [ 6656.688000] =======================
Jun 11 00:10:26 router kernel: [ 6656.688000] Code: ff 8b 55 f0 8b 42 20 8d 04 85 1c 00 00 00 39 f8 76 0d 83 c3 01 f7 c7 0f 00 00 00 75 ce eb b9 c7 04 24 c1 05 44 c0 e8 90 ea fa ff <0f> 0b eb fe 83 c4 18 5b 5e 5f 5d c3 8b 56 10 e9 67 ff ff ff 8d
Jun 11 00:10:26 router kernel: [ 6656.688000] EIP: [<c0171ff0>] check_slabp+0xf0/0x110 SS:ESP 0068:c7445988
Jun 11 00:10:26 router kernel: [ 6656.688000] note: md2_resync[8045] exited with preempt_count 1

One additional observation: In one incidence, the machine rebooted about 1..3 seconds after "smartd" has checked the SMART status of each of the IDE hard disks. Also, the monitoring of the file "/sys/block/md2/md/sync_completed" showed that the value of "/sys/block/md2/md/sync_completed" (while normally changing constantly during RAID rebuilding) did not change for about 1.5 seconds before, and additionally did change slower than usual before.
This leads to a hypothesis that "smartd" may trigger these reboots, maybe by inducing longer delays in disk access, maybe leading to sudden error states or maybe leading to timeouts kicking in (which do not kick in normally). Maybe the sudden-reboot problem is unrelated to the slab corruption problem, maybe not.

I don't recognise the precise problem, but there have been fixes in related parts of the code, so do please keep retrying with newer kernels to see if it got fixed.

My guess is that this is a problem with the driver for the VIA ide controller. I don't suppose you have a spare IDE card from a different manufacturer that you could try putting in?? Should we assign it to the IDE people to see if they can help (I think you would need to do that Alasdair).

> did change slower than usual before. This leads to a hypothesis that "smartd"
> may trigger these reboots, maybe by inducing longer delays in disk access,
> maybe leading to sudden error states or maybe leading to timeouts kicking in

Yes, SMART check may induce delays in disk access but this shouldn't cause other problems (at least for IDE).

> My guess is that this is a problem with the driver for the VIA ide
> controller.

This is possible but there are no open/known issues with VIA host driver currently so more info is needed (dmesg output).

> I don't suppose you have a spare IDE card from a different manufacturer
> that you could try putting in??

That would be useful, also does the issue still happen with 2.6.23?

PS disabling "smartd" completely and seeing if it helps is also worth a try.

I have been doing extensive resyncs under linux 2.6.22.7 with the slab allocator as memory allocator on the same machine with the same setup, and I cannot reproduce the bug anymore, regardless whether smartd is switched on or off. Thus, I assume that this bug has been fixed (for some not exactly known reason) between linux 2.6.21.3 and linux 2.6.22.7. :-) Thank you very much for your support. :-) Thus, I'm closing this bug for the time being.

Great, thanks for reporting it.
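
For anyone who wants to repeat the "sync_completed" monitoring mentioned in the report, the following is a minimal Python sketch of one way to do it. It is not the reporter's actual tooling, which the thread does not show. It assumes the sysfs file reads "<done> / <total>" during a resync and "none" when idle (the format described in the kernel's md documentation; older kernels may differ). The function names, the polling interval, and the stall threshold are all hypothetical choices made for illustration.

```python
import time
from typing import List, Optional, Sequence, Tuple

# Path quoted in the report; adjust for a different md device.
SYNC_FILE = "/sys/block/md2/md/sync_completed"

Sample = Tuple[float, Optional[int]]


def parse_sync_completed(text: str) -> Optional[int]:
    """Return the completed-sector count, or None when no sync is running.

    Assumes the file content is "<done> / <total>" or "none".
    """
    fields = text.split()
    if not fields or fields[0] == "none":
        return None
    return int(fields[0])


def find_stalls(samples: Sequence[Sample],
                threshold: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start, end) spans where the counter stayed unchanged
    for at least `threshold` seconds."""
    stalls: List[Tuple[float, float]] = []
    run_value: Optional[int] = None
    run_start = run_end = 0.0
    for ts, value in samples:
        if value is not None and value == run_value:
            run_end = ts  # same value as before: extend the current run
            continue
        if run_value is not None and run_end - run_start >= threshold:
            stalls.append((run_start, run_end))
        run_value, run_start, run_end = value, ts, ts
    # The last run has no following change, so check it separately.
    if run_value is not None and run_end - run_start >= threshold:
        stalls.append((run_start, run_end))
    return stalls


def poll(path: str = SYNC_FILE, interval: float = 0.25,
         duration: float = 60.0) -> List[Sample]:
    """Sample the sysfs file every `interval` seconds for `duration` seconds."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        with open(path) as f:
            samples.append((time.monotonic(), parse_sync_completed(f.read())))
        time.sleep(interval)
    return samples
```

On a machine with a resync in progress, `find_stalls(poll(), threshold=1.5)` would list the periods in which the counter did not advance for 1.5 seconds or more; these could then be compared against the times at which smartd logged its checks.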