Bug 10890

Summary: Heavy iSCSI traffic triggers memory allocation complaints and system lockup
Product: SCSI Drivers Reporter: Bart Van Assche (bvanassche)
Component: OtherAssignee: scsi_drivers-other
Status: RESOLVED OBSOLETE    
Severity: high CC: akpm, alan, jejb, kernel, roland, webmaster
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.1.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: Kernel config (.config)
/var/log/kern.log
Contents of /proc/meminfo, sampled once every 10 seconds.

Description Bart Van Assche 2008-06-10 02:56:46 UTC
Latest working kernel version: (not known)
Earliest failing kernel version: fails at least with 2.6.24.3-ubuntu and 2.6.25.6
Distribution: Ubuntu Linux 8.04 server
Hardware Environment: x86_64
Software Environment:
Problem Description:

System starts behaving strange after about an hour of heavy iSCSI traffic: the console locks up (keys typed in are not echoed) and error messages are logged about problems with memory allocation.

Steps to reproduce:

This is how my setup looks like:
- One server is running OpenSolaris build 90 and there have been 64 iSCSI targets configured on this system, where each target is 10 GB in size.
- On the second server I have installed Ubuntu 8.04 server 64-bit, and I replaced the Ubuntu kernel by the 2.6.25.6 kernel.
- Both systems are connected via a 4x SDR InfiniBand network (10 Gbit/s).
- Open-iscsi was configured on the Linux server such that it connects to all 64 iSCSI targets and such that I/O to all 64 targets was active simultaneously. I used the following command (see below for the scripts):
  ./iscsi-logout-and-login && ./iscsi-stress-test &

# cat iscsi-logout
#!/bin/bash
target_ip="${1:-192.168.67.10}"
iscsiadm -m session \
  | while read proto num ip iqn
    do
      iscsiadm -m node -p ${ip} -T $iqn --logout
    done

# cat iscsi-logout-and-login
#!/bin/bash
target_ip=192.168.67.10
./iscsi-logout "${target_ip}"
i=0
iscsiadm -m discovery -t st -p ${target_ip} \
  | grep '\.67\.' \
  | cut -f2 -d' ' \
  | sort -k 1.60 \
  | while read iqn
    do
      if [ "$1" != "" -a "$i" -ge "${1:-0}" ]; then
        break
      fi
      iscsiadm -m node -p ${target_ip} -T ${iqn} --login
      iscsiadm --mode node -p ${target_ip} -T ${iqn} \
        --op update -n "node.session.timeo.replacement_timeout" -v 10
      iscsiadm --mode node -p ${target_ip} -T ${iqn} \
        --op update -n "node.conn[0].timeo.noop_out_interval" -v 0
      iscsiadm --mode node -p ${target_ip} -T ${iqn} \
        --op update -n "node.conn[0].timeo.noop_out_timeout" -v 0
      i=$((i+1))
    done

# cat iscsi-stress-test
#!/bin/bash
while true
do
  date
  for d in /dev/disk/by-path/ip-*:3260-iscsi-iqn.1986-03.com.sun*
  do
    dd if=/dev/zero of=$d bs=100K &
  done
  wait
done
Comment 1 Bart Van Assche 2008-06-10 03:00:36 UTC
Created attachment 16445 [details]
Kernel config (.config)
Comment 2 Bart Van Assche 2008-06-10 03:03:31 UTC
Created attachment 16446 [details]
/var/log/kern.log
Comment 3 Bart Van Assche 2008-06-10 05:23:42 UTC
Same test runs fine over Ethernet, so this is probably an issue in either the ib_ipoib kernel module or the ib_mthca kernel module.
Comment 4 Eli Cohen 2008-06-10 08:35:20 UTC
Can you specify also which HCA HW you are using?
Comment 5 Roland Dreier 2008-06-10 11:44:45 UTC
Another question: is this IPoIB datagram mode or connected mode?

There is one GFP_ATOMIC allocation in the connected mode RX refill path; other than that I don't see anything in the IB side that is likely to hit.
Comment 6 Andrew Morton 2008-06-10 14:23:24 UTC
Does the system actually recover?  You say "the
console locks up".  Does it stay locked up, or is this a transient thing?
Comment 7 Bart Van Assche 2008-06-10 23:20:58 UTC
(In reply to comment #4)
> Can you specify also which HCA HW you are using?

This is what lspci told me about the HCA hardware:
$ lspci|grep Inf
06:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
Comment 8 Bart Van Assche 2008-06-10 23:22:08 UTC
(In reply to comment #5)
> Another question: is this IPoIB datagram mode or connected mode?

iSCSI was set up such that it uses TCP/IPv4 -- does this answer your question ?
Comment 9 Bart Van Assche 2008-06-10 23:42:42 UTC
(In reply to comment #6)
> Does the system actually recover?  You say "the
> console locks up".  Does it stay locked up, or is this a transient thing?

What I see is the following:
- Any keys typed in at the console are no longer echoed.
- If the console screensaver is active, pressing a key at the console does no longer unblank the console.
- Anything I type in in an open ssh session is no longer echoed.
- The system does no longer respond to ICMP requests sent to the InfiniBand interface.
- The system still responds to most but not all ICMP requests sent to the Ethernet interface:

# ping 192.168.64.8 -c 10
PING 192.168.64.8 (192.168.64.8) 56(84) bytes of data.
64 bytes from 192.168.64.8: icmp_seq=1 ttl=64 time=0.137 ms
64 bytes from 192.168.64.8: icmp_seq=2 ttl=64 time=0.094 ms
64 bytes from 192.168.64.8: icmp_seq=5 ttl=64 time=0.089 ms
64 bytes from 192.168.64.8: icmp_seq=6 ttl=64 time=0.150 ms
64 bytes from 192.168.64.8: icmp_seq=8 ttl=64 time=0.139 ms
64 bytes from 192.168.64.8: icmp_seq=9 ttl=64 time=0.068 ms
64 bytes from 192.168.64.8: icmp_seq=10 ttl=64 time=0.118 ms

--- 192.168.64.8 ping statistics ---
10 packets transmitted, 7 received, 30% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.068/0.113/0.150/0.030 ms
Comment 10 Roland Dreier 2008-06-11 08:05:03 UTC
re comment #8:

no, the question is what /sys/class/net/ib0/mode shows.
Comment 11 Bart Van Assche 2008-06-11 23:34:06 UTC
(In reply to comment #10)
> re comment #8:
> 
> no, the question is what /sys/class/net/ib0/mode shows.

$ cat /sys/class/net/ib0/mode
datagram
Comment 12 Roland Dreier 2008-06-18 11:29:38 UTC
Has anyone done anything further about this?

One thing to do would be to take snapshots of /proc/slabinfo and /proc/meminfo while the test is running to try and see where memory is going.  And if I recall correctly, there is something that can be done with slub_debug to track alloc callers to see who did allocations, in case one slab is tying up lots of memory.
Comment 13 Bart Van Assche 2008-06-19 05:51:52 UTC
Created attachment 16546 [details]
Contents of /proc/meminfo, sampled once every 10 seconds.

Appended the output of the following script, running on a vanilla 2.6.25.7 kernel:

#!/bin/bash

# Header
echo -n "Time "
cat /proc/meminfo | \
while read tag value unit
do
  echo -n "${tag%:} "
done
echo ""

# Actual log
while true
do
  echo -n "$(date +%s) "
  cat /proc/meminfo | \
  while read tag value unit
  do
    echo -n "${value} "
  done
  echo ""
  sleep 10
done
Comment 14 Bart Van Assche 2008-06-19 05:55:03 UTC
(In reply to comment #12)
> One thing to do would be to take snapshots of /proc/slabinfo and
> /proc/meminfo
> while the test is running to try and see where memory is going.

If I interpret the contents of /proc/meminfo correctly, there is no SLAB/SLUB leak: the value of SUnreclaim remains constant during the test run.
Comment 15 Denis Roy 2009-02-11 12:42:26 UTC
FWIW, I _was_ seeing this too, but on plain GbE using two iscsi devices in an md raid-1 on a plain PC x86_64 initiator. The system did not lock up, however.

It appears the problem was with my tweaked the TCP window/buffer size settings in /etc/sysctl.conf. After removing my tweaks (leaving the defaults) I no longer get this problem.
Comment 16 pfudd 2012-02-21 22:02:43 UTC
Hi, I'm experiencing memory problems as well.  Situation:
  I've set up a machine as a FreeNAS 8.0.3 server with iSCSI.
  I've booted an imac with CloneZilla 1.2.10 (Linux kernel 3.1.0)

Iscsi is set up according to:
  http://ninjix.blogspot.com/2010/05/use-clonezilla-for-physical-disk-to.html

I'm copying a partition using 
  partclone.hfsp -o /mnt/sdb/shalala.sda2.partclone -c -s /dev/sda2

The throughput is fantastic!  It's around 40-60MB/s, on a direct computer-to-computer 1GB Ethernet link, which has a maximum of 100MB/s (6GB/min).

I'm getting kernel messages like the following, on the console.  No obvious effects on the system, but I don't know if the data will be damaged.

[ 2688.898855] kworker/u:2: page allocation failure: order:1, mode:0x20
[ 2688.900009] Pid: 2058, comm: kworker/u:2 Tainted: G         C   3.1.0-1-486 #1
[ 2688.900009] Call Trace:
[ 2688.900009]  [<c1077923>] ? warn_alloc_failed+0xbb/0xc9
[ 2688.900009]  [<c10796a4>] ? __alloc_pages_nodemask+0x573/0x5da
[ 2688.900009]  [<c1099543>] ? ____cache_alloc+0x20b/0x39d
[ 2688.900009]  [<c10997df>] ? __kmalloc+0x63/0xab
[ 2688.900009]  [<c11d80a6>] ? pskb_expand_head+0xbc/0x1fa
[ 2688.900009]  [<c11d80a6>] ? pskb_expand_head+0xbc/0x1fa
[ 2688.900009]  [<c11d88b5>] ? __pskb_pull_tail+0x3d/0x1fe
[ 2688.900009]  [<c11dfb19>] ? dev_hard_start_xmit+0x312/0x46e
[ 2688.900009]  [<c11f0353>] ? sch_direct_xmit+0x32/0xc6
[ 2688.900009]  [<c11dfe07>] ? dev_queue_xmit+0x192/0x299
[ 2688.900009]  [<c11fef3e>] ? dst_output+0x9/0xa
[ 2688.900009]  [<c120f988>] ? tcp_transmit_skb+0x628/0x650
[ 2688.900009]  [<c1210350>] ? tcp_write_xmit+0x628/0x712
[ 2688.900009]  [<c1210475>] ? __tcp_push_pending_frames+0x15/0x42
[ 2688.900009]  [<c1206eee>] ? tcp_sendpage+0x352/0x37d
[ 2688.900009]  [<c121de10>] ? inet_sendpage+0x56/0x7d
[ 2688.900009]  [<c1206b9c>] ? tcp_sendmsg+0x62a/0x62a
[ 2688.900009]  [<f8bd33be>] ? iscsi_sw_tcp_pdu_xmit+0x8d/0x1f2 [iscsi_tcp]
[ 2688.900009]  [<f8cb0226>] ? iscsi_tcp_task_xmit+0x1b/0x1a2 [libiscsi_tcp]
[ 2688.900009]  [<f8bbf980>] ? iscsi_xmit_task+0x2a/0x51 [libiscsi]
[ 2688.900009]  [<f8bc11d1>] ? iscsi_xmitworker+0x192/0x1e7 [libiscsi]
[ 2688.900009]  [<c12752f0>] ? common_interrupt+0x30/0x40
[ 2688.900009]  [<c1032e97>] ? process_one_work+0xfb/0x1dc
[ 2688.900009]  [<f8bc103f>] ? iscsi_prep_scsi_cmd_pdu+0x4dd/0x4dd [libiscsi]
[ 2688.900009]  [<c10337a1>] ? worker_thread+0xa3/0x11a
[ 2688.900009]  [<c10336fe>] ? manage_workers.isra.33+0x18a/0x18a
[ 2688.900009]  [<c1035a4a>] ? kthread+0x62/0x67
[ 2688.900009]  [<c10359e8>] ? flush_kthread_worker+0x73/0x73
[ 2688.900009]  [<c1275306>] ? kernel_thread_helper+0x6/0xd
[ 2688.900009] Mem-Info:
[ 2688.900009] DMA per-cpu:
[ 2688.900009] CPU    0: hi:    0, btch:   1 usd:   0
[ 2688.900009] Normal per-cpu:
[ 2688.900009] CPU    0: hi:  186, btch:  31 usd:  58
[ 2688.900009] HighMem per-cpu:
[ 2688.900009] CPU    0: hi:  186, btch:  31 usd:  29
[ 2688.900009] active_anon:4209 inactive_anon:21976 isolated_anon:0
[ 2688.900009]  active_file:37896 inactive_file:429908 isolated_file:0
[ 2688.900009]  unevictable:809 dirty:2745 writeback:16769 unstable:0
[ 2688.900009]  free:3416 slab_reclaimable:9354 slab_unreclaimable:3288
[ 2688.900009]  mapped:1132 shmem:22758 pagetables:146 bounce:0
[ 2688.900009] DMA free:3492kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:20kB inactive_file:10780kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:1184kB slab_unreclaimable:420kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2688.900009] lowmem_reserve[]: 0 865 2000 2000
[ 2688.900009] Normal free:4660kB min:3728kB low:4660kB high:5592kB active_anon:0kB inactive_anon:0kB active_file:28408kB inactive_file:785980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:36232kB slab_unreclaimable:12732kB kernel_stack:688kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2688.900009] lowmem_reserve[]: 0 0 9086 9086
[ 2688.900009] HighMem free:5512kB min:512kB low:1732kB high:2956kB active_anon:16836kB inactive_anon:87904kB active_file:123156kB inactive_file:922872kB unevictable:3236kB isolated(anon):0kB isolated(file):0kB present:1163044kB mlocked:3236kB dirty:10980kB writeback:67076kB mapped:4524kB shmem:91032kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:584kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2688.900009] lowmem_reserve[]: 0 0 0 0
[ 2688.900009] DMA: 13*4kB 4*8kB 5*16kB 8*32kB 8*64kB 4*128kB 4*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB = 3492kB
[ 2688.900009] Normal: 1115*4kB 1*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4660kB
[ 2688.900009] HighMem: 2*4kB 2*8kB 3*16kB 2*32kB 2*64kB 3*128kB 1*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 5512kB
[ 2688.900009] 491060 total pagecache pages
[ 2688.900009] 0 pages in swap cache
[ 2688.900009] Swap cache stats: add 0, delete 0, find 0/0
[ 2688.900009] Free swap  = 0kB
[ 2688.900009] Total swap = 0kB
[ 2688.900009] 520361 pages RAM
[ 2688.900009] 293051 pages HighMem
[ 2688.900009] 5480 pages reserved
[ 2688.900009] 463661 pages shared
[ 2688.900009] 53096 pages non-shared
Comment 17 Alan 2015-02-19 15:14:02 UTC
This bug relates to a very old kernel. Closing as obsolete.

If you see it in a recent kernel please open a new bug