Bug 6114

Summary: Initio sbp2 causes: "slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or memory outside" object was overwritten
Product: SCSI Drivers Reporter: Bernhard Kaindl (bk)
Component: OtherAssignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: high    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.16-rc4 Subsystem:
Regression: --- Bisected commit-id:
Attachments: dmesg from 2.6.14
dmesg of sbp2 and modprobe sd_mod while the problematic TYPE_RBC disk was alreay connected to the bus
that's diff between this debug log and the one where it works with Al Viro's patch

Description Bernhard Kaindl 2006-02-21 03:58:26 UTC
Most recent kernel where this bug did not occur: needs more testing, maybe
2.6.14 but could not 

Oldest kernel where this bug occurs:

Three different machines (maybe more):
One i386 with CONFIG_IEEE1394_VERBOSEDEBUG: 2.6.15 and all of 2.6.16-rc[1234]
Two x86_64 without CONFIG_IEEE1394_VERBOSEDEBUG off: all of 2.6.16-rc[1234]

Distribution: SuSE 10.1-beta4

Hardware Environment:
i386 and x86_64

Software Environment:
gcc version 4.1.0 20060210 (prerelease) (SUSE Linux)
GNU assembler version 2.16.91.0.5

Problem Description:

partition scan from sbp2 with slab debugging leads to double free message

example dmesg ouptput (excerpt) - differs strongly:

ACPI: PCI Interrupt 0000:06:09.1[B] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[10]  MMIO=[b8003000-b80037ff]  Max
Packet=[2048]  IR/IT contexts=[4/4]
...
SCSI subsystem initialized
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
spurious 8259A interrupt: IRQ7.
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[0010100500000000]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
scsi0 : SBP-2 IEEE-1394
ieee1394: sbp2: Node 0-00:1023: Using 36byte inquiry workaround
ieee1394: sbp2: Logged into SBP-2 device
ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048]
  Vendor: Initio    Model: 6L200P0           Rev: 2.35
  Type:   Direct-Access                      ANSI SCSI revision: 00
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
sda: Write Protect is off
sda: Mode Sense: 86 0b 00 02
sda: got wrong page
sda: assuming drive cache: write through
slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or
memory outside object was overwritten
 [<c014c1a6>] cache_free_debugcheck+0xbf/0x192
 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod]
 [<c014c623>] kfree+0x3c/0x6c
 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod]
 [<c01aaeb6>] idr_get_new_above_int+0x10c/0x1cf
 [<e10d3b87>] sd_probe+0x283/0x31e [sd_mod]
 [<c01832dc>] sysfs_new_dirent+0x19/0x58
 [<c020aed6>] __device_attach+0x0/0x5
 [<c020ae8d>] driver_probe_device+0x42/0x8b
 [<c020a8b5>] bus_for_each_drv+0x39/0x60
 [<c020af30>] device_attach+0x55/0x65
 [<c020aed6>] __device_attach+0x0/0x5
 [<c020a7d3>] bus_add_device+0x27/0xd0
 [<c0209d8e>] device_add+0xd0/0x12e
 [<e112fb99>] scsi_sysfs_add_sdev+0x2a/0x1e0 [scsi_mod]
 [<e112e5e1>] scsi_probe_and_add_lun+0x62c/0x806 [scsi_mod]
 [<e112e6da>] scsi_probe_and_add_lun+0x725/0x806 [scsi_mod]
 [<c0209de5>] device_add+0x127/0x12e
 [<e112f251>] __scsi_add_device+0x54/0x81 [scsi_mod]
 [<e10cc0d3>] sbp2_max_speed_and_size+0xaf/0xb7 [sbp2]
 [<e112f289>] scsi_add_device+0xb/0x1c [scsi_mod]
 [<e10cd3d6>] sbp2_probe+0x3da/0x430 [sbp2]
 [<c020aed6>] __device_attach+0x0/0x5
 [<c020ae8d>] driver_probe_device+0x42/0x8b
 [<c020a8b5>] bus_for_each_drv+0x39/0x60
 [<c020af30>] device_attach+0x55/0x65
 [<c020aed6>] __device_attach+0x0/0x5
 [<c020a7d3>] bus_add_device+0x27/0xd0
 [<c0209d8e>] device_add+0xd0/0x12e
 [<e0e52f27>] nodemgr_register_device+0x78/0x12b [ieee1394]
 [<c013d853>] kzalloc+0x11/0x3a
 [<e0e533af>] nodemgr_process_unit_directory+0x32b/0x340 [ieee1394]
 [<e0e536df>] nodemgr_probe_ne+0x168/0x3e0 [ieee1394]
 [<e0e5410b>] nodemgr_host_thread+0x7b4/0x906 [ieee1394]
 [<e0e53957>] nodemgr_host_thread+0x0/0x906 [ieee1394]
 [<c0101005>] kernel_thread_helper+0x5/0xb
c009ec80: redzone 1: 0x170fc2a5, redzone 2: 0x0.
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
sda: Write Protect is off
sda: Mode Sense: 86 0b 00 02
sda: got wrong page
sda: assuming drive cache: write through
slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or
memory outside
 object was overwritten
 [<c014c1a6>] cache_free_debugcheck+0xbf/0x192
 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod]
 [<c014c623>] kfree+0x3c/0x6c
 [<e10d3818>] sd_revalidate_disk+0xbab/0xc97 [sd_mod]
 [<c0164197>] ifind+0x5b/0x8b
 [<c017eac5>] rescan_partitions+0x69/0x195
....
backtraces vary sometimes a bit, mat not even be caused by spb2 but SCSI, ...

Steps to reproduce:

1) Compile and install kernel with CONFIG_DEBUG_SLAB=y

2) have a Initio SBP2 disk conneced with having loaded/loading sbp2 and sd_mod,
or plug in SBP2 disk while these drivers are loaded, both work relabily
Comment 1 Bernhard Kaindl 2006-02-21 06:54:14 UTC
Created attachment 7427 [details]
dmesg from 2.6.14

full dmesg from 2.4.14-rc5 (2.4.14.r4 looks the same) attached.

This is a cut and paste from the lines around the first
cache_free_debugcheck():

ohci1394: fw-host0: IntEvent: 00000010
ohci1394: fw-host0: Got RQPkt interrupt status=0x00008451
ohci1394: fw-host0: Single packet rcv'd
ohci1394: fw-host0: Packet received from node 0 ack=0x11 spd=2 tcode=0x1
length=28
ctx=0 tlabel=44
ieee1394: received packet: ffc1b110 ffc0fffe 00000000 00080000
slab error in cache_free_debugcheck(): cache `sgpool-8': double free, or memory
out
side object was overwritten
 [<c0142d9d>] cache_free_debugcheck+0x1ad/0x250
 [<c0143789>] kmem_cache_free+0x39/0x90
 [<c0289c3d>] scsi_io_completion+0x24d/0x540
 [<c0289c3d>] scsi_io_completion+0x24d/0x540
 [<c028a1c7>] scsi_generic_done+0x37/0x50
 [<c0283eff>] scsi_softirq+0x16f/0x1a0
 [<c011d9c2>] __do_softirq+0x42/0xa0
 [<c011da46>] do_softirq+0x26/0x30
 [<c01049be>] do_IRQ+0x1e/0x30
 [<c01032c2>] common_interrupt+0x1a/0x20
 [<c014007b>] test_clear_page_writeback+0x9b/0xe0
 [<c0141822>] check_poison_obj+0x72/0x1a0
 [<c014320f>] cache_alloc_debugcheck_after+0x8f/0x180
 [<c0143711>] __kmalloc+0x91/0xd0
 [<c01a4825>] get_mem_for_virtual_node+0x65/0xe0
 [<c01a4825>] get_mem_for_virtual_node+0x65/0xe0
 [<c01a4cba>] fix_nodes+0xda/0x3e0
 [<c01b2a21>] reiserfs_paste_into_item+0x141/0x220
 [<c019f2e2>] reiserfs_allocate_blocks_for_region+0x10a2/0x1660
 [<c01985bf>] make_cpu_key+0x4f/0x60
 [<c01aef2f>] pathrelse+0x2f/0x50
 [<c01a0e8c>] reiserfs_file_write+0x53c/0x790
 [<c013e066>] __alloc_pages+0x1c6/0x4a0
 [<c016533c>] pipe_poll+0x3c/0xd0
 [<c01415fe>] poison_obj+0x2e/0x60
 [<c0158aa1>] vfs_write+0xb1/0x130
 [<c0158beb>] sys_write+0x4b/0x80
 [<c01030f5>] syscall_call+0x7/0xb
c14a5af0: redzone 1: 0x170fc2a5, redzone 2: 0xc013cb9a.

didn't trigger with a 2.3.13.1 so far, but I continue testing older versions
Comment 2 Bernhard Kaindl 2006-02-21 08:17:42 UTC
verified that 2.6.13/2.6.14-rc1 is the window where it triggers.
Tests so far:

Doesn't trigger so far with 2.6.13
Triggers reliably with      2.6.14-rc{1,2,3,4,5} 2.6.1{4,5}, 2.6.16-rc{1,2,3,4}

With the 2.6.16-rc's I can also trigger it on AMD64 running x86_64.

Triggered on <= 2.6.15 only with Samsung X20 (Pentium M) and
Initio 6L200P0 Rev: 2.35 (200 GB).

Didn't trigger it with the other Firewire disk.

I'll try to find out which changes between 2.6.13 and 2.6.14-rc1 make the
message appear on the X20.
Comment 3 Stefan Richter 2006-02-21 11:16:38 UTC
Fixed by patch "sbp2: update 36byte inquiry workaround (fix compatibility
regression)", available in the 1394 development git tree:
http://www.kernel.org/git/?p=linux/kernel/git/scjody/ieee1394.git;a=commit;h=99496037c6744fd938ffb8ccfc8fc91762322ff8

or here:
http://me.in-berlin.de/~s5r6/linux1394/updates/

and also in 2.6.16-rc4-mm1:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/broken-out/git-ieee1394.patch

Please verify.
Comment 4 Bernhard Kaindl 2006-02-21 15:21:39 UTC
Created attachment 7432 [details]
dmesg of sbp2 and modprobe sd_mod while the problematic TYPE_RBC disk was alreay connected to the bus

I get the attached dmesg, with the patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/broken-out/git-ieee1394.patch

attached.

I on the log are also some debug printk("force_inquiry_hack = %d",
force_inqiry_hack); to have evidence of the value of the variable,
which I forced to 1 to be sure it's active.

The patch in
http://sourceforge.net/mailarchive/message.php?msg_id=14879016
from Al Viro at in the Thread
"Re: TYPE_RBC cache fixes (sbp2.c affected)" fixed it for me
and for Stefan Richer, as well, see his reply:
http://sourceforge.net/mailarchive/message.php?msg_id=14879017

Next attachment will be the diff between this debug log and the one
where it works with Al Viro's patch (linked above)
Comment 5 Bernhard Kaindl 2006-02-21 15:40:03 UTC
Created attachment 7433 [details]
that's diff between this debug log and the one where it works with Al Viro's patch

notable in the diff is:

+sda: missing header in MODE_SENSE response

-ieee1394: sbp2: SCSI transfer size = 17d9
+ieee1394: sbp2: SCSI transfer size = d

-sda: got wrong page
-sda: assuming drive cache: write through
-slab error in cache_free_debugcheck(): cache `size-512(DMA)': double free, or
memory outside object was overwri
tten
- [<c0151126>] cache_free_debugcheck+0x186/0x210
- [<e099095b>] sd_revalidate_disk+0x5bb/0xe10 [sd_mod]
...
- [<c0102ebd>] syscall_call+0x7/0xb
-c009ec80: redzone 1: 0x170fc2a5, redzone 2: 0x3e881a4.
+SCSI device sda: drive cache: write back

(and repeated a second time)
Looks good here so far, hdparm -t says 25MB/s,
dd_rescue from /dev/zero to disk ~14MB/s, both with default serialize_io=1,
with serialize_io=0, hdparm -t says 27 MB/s.

this fixes the bug for me, great!
Comment 6 Bernhard Kaindl 2006-02-21 16:09:37 UTC
Just for reference, this is exacly the disk which triggered the slab message:

http://www.ciao.de/Avalon_Mobile_Festplatte_200_GB__2138203

It has been sold in the supermarkets of the German 'Soft'-Discounter plus.de
a few weeks ago.

It can be also found uniqely using the search words Plus Avalon 200GB
using some search engines.
Comment 7 Stefan Richter 2006-02-22 17:24:33 UTC
Bug category changed to "SCSI Drivers".

I split Al Viro's patch into the general critical part and the special optional
part and submitted it for inclusion into 2.6.16 and 2.6.15.x.
http://marc.theaimsgroup.com/?t=114065708500001
http://marc.theaimsgroup.com/?t=114065708500002
Comment 9 Bernhard Kaindl 2006-02-27 07:46:59 UTC
Tested a kernel with the fix (based on 2.6.16-rc4-git10) and could not reproduce
anymore, the kernel messages with my disk (I didn't pass any special options to
modprobe or any other modifications) were:

scsi0 : SBP-2 IEEE-1394
ieee1394: sbp2: Node 0-00:1023: Using 36byte inquiry workaround
ieee1394: sbp2: Logged into SBP-2 device
ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048]
  Vendor: Initio    Model: 6L200P0           Rev: 2.35
  Type:   Direct-Access                      ANSI SCSI revision: 00
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
sda: Write Protect is off
sda: Mode Sense: 86 0b 00 02
sda: assuming drive cache: write through
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
sda: Write Protect is off
sda: Mode Sense: 86 0b 00 02
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda

Performance using the simple tests which I did previously was practically
the same.

I can confirm that the fixes in mainline work, many thanks!