Most recent kernel where this bug did not occur: needs more testing, maybe 2.6.14 but could not Oldest kernel where this bug occurs: Three different machines (maybe more): One i386 with CONFIG_IEEE1394_VERBOSEDEBUG: 2.6.15 and all of 2.6.16-rc[1234] Two x86_64 without CONFIG_IEEE1394_VERBOSEDEBUG off: all of 2.6.16-rc[1234] Distribution: SuSE 10.1-beta4 Hardware Environment: i386 and x86_64 Software Environment: gcc version 4.1.0 20060210 (prerelease) (SUSE Linux) GNU assembler version 2.16.91.0.5 Problem Description: partition scan from sbp2 with slab debugging leads to double free message example dmesg ouptput (excerpt) - differs strongly: ACPI: PCI Interrupt 0000:06:09.1[B] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10 ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[10] MMIO=[b8003000-b80037ff] Max Packet=[2048] IR/IT contexts=[4/4] ... SCSI subsystem initialized ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1) ieee1394: sbp2: Try serialize_io=0 for better performance spurious 8259A interrupt: IRQ7. ieee1394: Node added: ID:BUS[0-00:1023] GUID[0010100500000000] ieee1394: Node changed: 0-00:1023 -> 0-01:1023 scsi0 : SBP-2 IEEE-1394 ieee1394: sbp2: Node 0-00:1023: Using 36byte inquiry workaround ieee1394: sbp2: Logged into SBP-2 device ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048] Vendor: Initio Model: 6L200P0 Rev: 2.35 Type: Direct-Access ANSI SCSI revision: 00 SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB) sda: Write Protect is off sda: Mode Sense: 86 0b 00 02 sda: got wrong page sda: assuming drive cache: write through slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or memory outside object was overwritten [<c014c1a6>] cache_free_debugcheck+0xbf/0x192 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod] [<c014c623>] kfree+0x3c/0x6c [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod] [<c01aaeb6>] idr_get_new_above_int+0x10c/0x1cf [<e10d3b87>] sd_probe+0x283/0x31e [sd_mod] [<c01832dc>] sysfs_new_dirent+0x19/0x58 [<c020aed6>] __device_attach+0x0/0x5 [<c020ae8d>] driver_probe_device+0x42/0x8b [<c020a8b5>] bus_for_each_drv+0x39/0x60 [<c020af30>] device_attach+0x55/0x65 [<c020aed6>] __device_attach+0x0/0x5 [<c020a7d3>] bus_add_device+0x27/0xd0 [<c0209d8e>] device_add+0xd0/0x12e [<e112fb99>] scsi_sysfs_add_sdev+0x2a/0x1e0 [scsi_mod] [<e112e5e1>] scsi_probe_and_add_lun+0x62c/0x806 [scsi_mod] [<e112e6da>] scsi_probe_and_add_lun+0x725/0x806 [scsi_mod] [<c0209de5>] device_add+0x127/0x12e [<e112f251>] __scsi_add_device+0x54/0x81 [scsi_mod] [<e10cc0d3>] sbp2_max_speed_and_size+0xaf/0xb7 [sbp2] [<e112f289>] scsi_add_device+0xb/0x1c [scsi_mod] [<e10cd3d6>] sbp2_probe+0x3da/0x430 [sbp2] [<c020aed6>] __device_attach+0x0/0x5 [<c020ae8d>] driver_probe_device+0x42/0x8b [<c020a8b5>] bus_for_each_drv+0x39/0x60 [<c020af30>] device_attach+0x55/0x65 [<c020aed6>] __device_attach+0x0/0x5 [<c020a7d3>] bus_add_device+0x27/0xd0 [<c0209d8e>] device_add+0xd0/0x12e [<e0e52f27>] nodemgr_register_device+0x78/0x12b [ieee1394] [<c013d853>] kzalloc+0x11/0x3a [<e0e533af>] nodemgr_process_unit_directory+0x32b/0x340 [ieee1394] [<e0e536df>] nodemgr_probe_ne+0x168/0x3e0 [ieee1394] [<e0e5410b>] nodemgr_host_thread+0x7b4/0x906 [ieee1394] [<e0e53957>] nodemgr_host_thread+0x0/0x906 [ieee1394] [<c0101005>] kernel_thread_helper+0x5/0xb c009ec80: redzone 1: 0x170fc2a5, redzone 2: 0x0. SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB) sda: Write Protect is off sda: Mode Sense: 86 0b 00 02 sda: got wrong page sda: assuming drive cache: write through slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or memory outside object was overwritten [<c014c1a6>] cache_free_debugcheck+0xbf/0x192 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod] [<c014c623>] kfree+0x3c/0x6c [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod] [<c0164197>] ifind+0x5b/0x8b [<c017eac5>] rescan_partitions+0x69/0x195 .... backtraces vary sometimes a bit, mat not even be caused by spb2 but SCSI, ... Steps to reproduce: 1) Compile and install kernel with CONFIG_DEBUG_SLAB=y 2) have a Initio SBP2 disk conneced with having loaded/loading sbp2 and sd_mod, or plug in SBP2 disk while these drivers are loaded, both work relabily
Created attachment 7427 [details] dmesg from 2.6.14 full dmesg from 2.4.14-rc5 (2.4.14.r4 looks the same) attached. This is a cut and paste from the lines around the first cache_free_debugcheck(): ohci1394: fw-host0: IntEvent: 00000010 ohci1394: fw-host0: Got RQPkt interrupt status=0x00008451 ohci1394: fw-host0: Single packet rcv'd ohci1394: fw-host0: Packet received from node 0 ack=0x11 spd=2 tcode=0x1 length=28 ctx=0 tlabel=44 ieee1394: received packet: ffc1b110 ffc0fffe 00000000 00080000 slab error in cache_free_debugcheck(): cache `sgpool-8': double free, or memory out side object was overwritten [<c0142d9d>] cache_free_debugcheck+0x1ad/0x250 [<c0143789>] kmem_cache_free+0x39/0x90 [<c0289c3d>] scsi_io_completion+0x24d/0x540 [<c0289c3d>] scsi_io_completion+0x24d/0x540 [<c028a1c7>] scsi_generic_done+0x37/0x50 [<c0283eff>] scsi_softirq+0x16f/0x1a0 [<c011d9c2>] __do_softirq+0x42/0xa0 [<c011da46>] do_softirq+0x26/0x30 [<c01049be>] do_IRQ+0x1e/0x30 [<c01032c2>] common_interrupt+0x1a/0x20 [<c014007b>] test_clear_page_writeback+0x9b/0xe0 [<c0141822>] check_poison_obj+0x72/0x1a0 [<c014320f>] cache_alloc_debugcheck_after+0x8f/0x180 [<c0143711>] __kmalloc+0x91/0xd0 [<c01a4825>] get_mem_for_virtual_node+0x65/0xe0 [<c01a4825>] get_mem_for_virtual_node+0x65/0xe0 [<c01a4cba>] fix_nodes+0xda/0x3e0 [<c01b2a21>] reiserfs_paste_into_item+0x141/0x220 [<c019f2e2>] reiserfs_allocate_blocks_for_region+0x10a2/0x1660 [<c01985bf>] make_cpu_key+0x4f/0x60 [<c01aef2f>] pathrelse+0x2f/0x50 [<c01a0e8c>] reiserfs_file_write+0x53c/0x790 [<c013e066>] __alloc_pages+0x1c6/0x4a0 [<c016533c>] pipe_poll+0x3c/0xd0 [<c01415fe>] poison_obj+0x2e/0x60 [<c0158aa1>] vfs_write+0xb1/0x130 [<c0158beb>] sys_write+0x4b/0x80 [<c01030f5>] syscall_call+0x7/0xb c14a5af0: redzone 1: 0x170fc2a5, redzone 2: 0xc013cb9a. didn't trigger with a 2.3.13.1 so far, but I continue testing older versions
verified that 2.6.13/2.6.14-rc1 is the window where it triggers. Tests so far: Doesn't trigger so far with 2.6.13 Triggers reliably with 2.6.14-rc{1,2,3,4,5} 2.6.1{4,5}, 2.6.16-rc{1,2,3,4} With the 2.6.16-rc's I can also trigger it on AMD64 running x86_64. Triggered on <= 2.6.15 only with Samsung X20 (Pentium M) and Initio 6L200P0 Rev: 2.35 (200 GB). Didn't trigger it with the other Firewire disk. I'll try to find out which changes between 2.6.13 and 2.6.14-rc1 make the message appear on the X20.
Fixed by patch "sbp2: update 36byte inquiry workaround (fix compatibility regression)", available in the 1394 development git tree: http://www.kernel.org/git/?p=linux/kernel/git/scjody/ieee1394.git;a=commit;h=99496037c6744fd938ffb8ccfc8fc91762322ff8 or here: http://me.in-berlin.de/~s5r6/linux1394/updates/ and also in 2.6.16-rc4-mm1: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/broken-out/git-ieee1394.patch Please verify.
Created attachment 7432 [details] dmesg of sbp2 and modprobe sd_mod while the problematic TYPE_RBC disk was alreay connected to the bus I get the attached dmesg, with the patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/broken-out/git-ieee1394.patch attached. I on the log are also some debug printk("force_inquiry_hack = %d", force_inqiry_hack); to have evidence of the value of the variable, which I forced to 1 to be sure it's active. The patch in http://sourceforge.net/mailarchive/message.php?msg_id=14879016 from Al Viro at in the Thread "Re: TYPE_RBC cache fixes (sbp2.c affected)" fixed it for me and for Stefan Richer, as well, see his reply: http://sourceforge.net/mailarchive/message.php?msg_id=14879017 Next attachment will be the diff between this debug log and the one where it works with Al Viro's patch (linked above)
Created attachment 7433 [details] that's diff between this debug log and the one where it works with Al Viro's patch notable in the diff is: +sda: missing header in MODE_SENSE response -ieee1394: sbp2: SCSI transfer size = 17d9 +ieee1394: sbp2: SCSI transfer size = d -sda: got wrong page -sda: assuming drive cache: write through -slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or memory outside object was overwri tten - [<c0151126>] cache_free_debugcheck+0x186/0x210 - [<e099095b>] sd_revalidate_disk+0x5bb/0xe10 [sd_mod] ... - [<c0102ebd>] syscall_call+0x7/0xb -c009ec80: redzone 1: 0x170fc2a5, redzone 2: 0x3e881a4. +SCSI device sda: drive cache: write back (and repeated a second time) Looks good here so far, hdparm -t says 25MB/s, dd_rescue from /dev/zero to disk ~14MB/s, both with default serialize_io=1, with serialize_io=0, hdparm -t says 27 MB/s. this fixes the bug for me, great!
Just for reference, this is exacly the disk which triggered the slab message: http://www.ciao.de/Avalon_Mobile_Festplatte_200_GB__2138203 It has been sold in the supermarkets of the German 'Soft'-Discounter plus.de a few weeks ago. It can be also found uniqely using the search words Plus Avalon 200GB using some search engines.
Bug category changed to "SCSI Drivers". I split Al Viro's patch into the general critical part and the special optional part and submitted it for inclusion into 2.6.16 and 2.6.15.x. http://marc.theaimsgroup.com/?t=114065708500001 http://marc.theaimsgroup.com/?t=114065708500002
Patch was merged by Linus. http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=489708007785389941a89fa06aedc5ec53303c96
Tested a kernel with the fix (based on 2.6.16-rc4-git10) and could not reproduce anymore, the kernel messages with my disk (I didn't pass any special options to modprobe or any other modifications) were: scsi0 : SBP-2 IEEE-1394 ieee1394: sbp2: Node 0-00:1023: Using 36byte inquiry workaround ieee1394: sbp2: Logged into SBP-2 device ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048] Vendor: Initio Model: 6L200P0 Rev: 2.35 Type: Direct-Access ANSI SCSI revision: 00 SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB) sda: Write Protect is off sda: Mode Sense: 86 0b 00 02 sda: assuming drive cache: write through SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB) sda: Write Protect is off sda: Mode Sense: 86 0b 00 02 sda: assuming drive cache: write through sda: sda1 sda2 sda3 sd 0:0:0:0: Attached scsi disk sda Performance using the simple tests which I did previously was practically the same. I can confirm that the fixes in mainline work, many thanks!