I have an issue that refuses to be solved no matter what I do. My ASRock board comes with an onboard SAS controller (LSI 2308), and ever since I received it, it has done one thing consistently: it drops all HDDs connected to it. This happens only under heavy IO, after a few minutes. I can reproduce it easily by running dd, md5deep, or even a btrfs scrub. The kernel locks up, I can't even shut it down from the console, and a quick "ls /dev/disk/by-id" shows that all the HDDs connected to the SAS controller have disappeared. It happens with the stable kernels (3.9 and 3.10.3) and with mainline (3.11-rc2) as of today. It's not a hardware issue: I installed Windows Server 2012 on the same machine with a few spare HDDs I had lying around, beat the controller into the ground, and it never hung. So I know it's a Linux-specific issue. Dmesg logs from before and after the incident are attached. Thank you.
Created attachment 107032 [details] dmesg logs
The kernel lock-ups are rather "soft": the machine keeps functioning, but the HDD activity LED stays on and the kernel doesn't respond to a reboot or shutdown command from the console. It has to be hard-reset using the power button.
Hi, can you please provide the driver logs after setting the driver logging level to 0x3f8? Here are the ways to set the mpt2sas driver logging level:

a. While loading the driver:
   modprobe mpt2sas logging_level=0x3f8

b. If the driver is in the ramdisk, then on RHEL5/SLES/OEL5 add the following line to /etc/modprobe.conf and reboot the system:
   options mpt2sas logging_level=0x3f8
   Or add the following word at the end of the kernel parameters line in /boot/grub/menu.lst or /boot/grub/grub.conf and reboot the system:
   mpt2sas.logging_level=0x3f8

c. At driver run time:
   echo 0x3f8 > /sys/module/mpt2sas/parameters/logging_level

Also, please tell us the IO rate at which you are facing this problem.

Regards,
Sreekanth
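For the grub route in step (b), the edit can be sketched like this. It runs against a throwaway copy of a grub-style config so it is safe to try; the real file would be /boot/grub/menu.lst or /boot/grub/grub.conf as noted above, and the sample kernel line is only an illustration.

```shell
# Demonstrate appending the mpt2sas logging parameter to a GRUB kernel line.
# Done on a temp copy here; on a real system edit /boot/grub/menu.lst or
# /boot/grub/grub.conf and reboot afterwards.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
kernel /vmlinuz-3.10.3 root=/dev/sda1 ro quiet
EOF

# '&' in the sed replacement re-inserts the whole matched line,
# so the parameter lands at the end of it.
sed -i 's|^kernel .*|& mpt2sas.logging_level=0x3f8|' "$tmp"

cat "$tmp"
rm -f "$tmp"
```

When the driver is already loaded, the run-time route (echo 0x3f8 > /sys/module/mpt2sas/parameters/logging_level) avoids the reboot entirely.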
Hi. I'll attach it; however, dmesg only shows the last 16000 events, so I hope that is enough. Sorry for being a noob reporting my first bug, but can you tell me how to find the exact IO rate? It doesn't happen under the daily workload, though (an rsync cronjob writing a gzipped root backup to the RAID). Thank you.
Created attachment 107033 [details] dmesg logs 2
Hi again. I monitored the IO of the RAID array with the iostat tool; I'll attach the output. One thing I noticed is that monitoring the array made it survive a LOT longer than before. I simply used dd to dump 300G of zeros onto the array while at the same time running md5deep on the entire mountpoint. It stopped after writing around 280G this time; I was surprised, because it never exceeded 77G before. Tell me if you need me to do anything else. Thank you very much.
Created attachment 107038 [details] iostat log
Hi, can you please provide the /var/log/messages file? The dmesg logs are not enough to analyze this issue. Thanks, Sreekanth
Hi. OK, the journal for this entire day will be attached; it starts at 12:00 AM. To save you time: the mpt2sas errors start at the 03:19:26 mark. Thank you.
Created attachment 107041 [details] Journal 1
Hi, thanks for providing the logs. What I observe in them is that the controller is going into a non-operational state, hence the messages "mpt2sas0: _base_fault_reset_work : SAS host is non-operational !!!!". Once the controller stays in this state, the driver removes that controller's host entry from the SCSI mid layer (i.e. the HBA's host is removed from /sys/class/scsi_host/hostX), which is why you observe all the drives attached to this controller being dropped. But I am still not sure why the controller enters the non-operational state, so I would like to reproduce this locally. Can you please help me reproduce the issue, i.e. tell me the steps, utilities and commands you used? Regards, Sreekanth
Hi. I can easily reproduce this issue within seconds by running: btrfs scrub start /MOUNTPOINT The btrfs filesystem is a RAID1 consisting of 5 drives. It also happens on an MD-RAID0 of 3 drives when running somewhat harsh commands like: dd if=/dev/zero of=/MOUNTPOINT/dd.img bs=1G count=300 and/or: md5deep -r /MOUNTPOINT My CPU is an Ivy Bridge i5, with 32GB of RAM. (Watching htop, the CPU never exceeds 30% load.) Thank you.
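For anyone else trying to hit this, the three trigger workloads above can be collected into one small script. This is only a sketch: /MOUNTPOINT is a placeholder, and the workloads hammer whatever filesystem is mounted there, so do not point it at data you care about.

```shell
# Collect the reporter's three trigger workloads into one script.
# The mountpoint is a placeholder argument; run each workload
# separately if preferred.
cat > reproduce.sh <<'EOF'
#!/bin/sh
MOUNTPOINT=${1:-/MOUNTPOINT}

# Heavy read: btrfs scrub (reported to trigger the drop in under 2 minutes)
btrfs scrub start "$MOUNTPOINT"

# Heavy write: 300G sequential dd
dd if=/dev/zero of="$MOUNTPOINT/dd.img" bs=1G count=300

# Heavy read: recursive checksumming of the whole mountpoint
md5deep -r "$MOUNTPOINT"
EOF
chmod +x reproduce.sh
```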
One more thing: the MD-RAID0 currently has XFS on it, but that doesn't matter, because it used to have EXT4 with the same results.
Hi. Any updates regarding this bug?
(In reply to liveaxle from comment #14) > Hi . > Any updates regarding this bug ? I tried to reproduce this issue locally, but it did not reproduce for me. Here are the steps I followed: 1. Created a RAID0 volume on two 500 GB SAS drives. 2. Created an EXT4 file system. 3. Mounted this FS on /mnt. 4. Ran IO using the command 'dd if=/dev/zero of=/mnt/dd.img bs=1G count=300'. Result: the IO ran successfully without any issue. Please let me know whether I have missed any steps while reproducing this issue.
Hello. The thing is that I'm using SATA drives, not SAS drives; the motherboard exposes the LSI controller as 8 SATA ports. This wasn't an issue under Windows 2012, so I think hardware issues can pretty much be ruled out here. Sorry if I'm demanding too much, but could you try creating a BTRFS RAID1, filling it with data, and then running: btrfs scrub start /MOUNTPOINT It always reproduces the issue in less than 2 minutes. Thank you.
Hi. Today I ran some tests on all 8 drives connected to the LSI 2308, and the results are rather surprising. The controller goes non-operational under heavy READ workloads, while WRITE workloads always complete just fine. I'll run more tests, but at this point I can safely say that heavy READ operations (md5 checks, btrfs scrub, torrent file checking, etc.) are the problem, while heavy WRITE workloads (dd, copy, rsync) always complete successfully. I hope that helps in nailing this bug. Thank you.
Looks to be the same as https://bugzilla.kernel.org/show_bug.cgi?id=59301 . I'm seeing the same thing with an LSI 9211-8i card (firmware 16) on kernels 3.2, 3.5 and 3.8: my 5 SATA drives get dropped when resyncing a SW RAID-6 set.
It seems LSI isn't interested in fixing this. I also purchased a 9211-8i card recently, and it has the same issue. I might consider buying an Adaptec HBA to replace these LSI controllers. Thank you.
Created attachment 107333 [details] mpt2sas-disable-watchdog.patch mpt2sas: add module option to disable watchdog.
Try with this patch. With a bit of luck it's just the firmware becoming sluggish under high load, so disabling the watchdog should circumvent this; any real error would still be handled by SCSI EH. Keep fingers crossed.
Hi. Thank you very much for the patch, Hannes. I compiled it into a 3.11-rc7 kernel and put "mpt2sas.disable_watchdog=1" in the boot parameters. It helped the driver survive longer (around an hour longer than before), but then it failed. This time it hard-locked the machine (ssh sessions were closed, new sessions timed out, iSCSI targets were dropped) and I had to hard-reset the server. Thank you.
So the firmware does indeed wedge under high load. Given the issues I've had so far with the LSI SATL, I'm not surprised. Does the same thing happen when running on a single disk, i.e. without MD? There have been issues with MD dropping queue limitations (i.e. the 4k physical / 512 logical block sizes you're using), so MD might end up emitting non-aligned requests, which in turn might trigger issues in the firmware's translation. Using the devices directly, without MD, would eliminate this variable.
Hello. I used to have the same issues with MD, yes. I'm now using BTRFS; I don't know whether the BTRFS RAID code was ported from MD, but the issue is the same. I haven't tried anything other than BTRFS and MD. Maybe I should give ZFS a try, although it is still slower on Linux. I'll report back as soon as possible. Thank you.
Have you already tried giving the controller less work, i.e. setting /sys/block/sdX/device/queue_depth to 1? If you can't set it, use the mpt2sas option max_queue_depth=1. Low values for max_sgl_entries and max_sectors might also help. Lowering all of these is bad for performance, but might increase stability.
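The sysfs part of this suggestion amounts to a loop like the one below. The block uses a mock directory tree instead of the real /sys so the loop can be shown without touching live disks; on a real system the glob would be /sys/block/sd*/device/queue_depth, and sdc/sdd here are placeholder disk names.

```shell
# Lower queue_depth to 1 for every disk behind the controller
# (simulated on a mock sysfs tree; on the real system, point the
# glob at /sys/block/sd*/device/queue_depth instead).
sysfs=$(mktemp -d)
for d in sdc sdd; do                # placeholder disk names
    mkdir -p "$sysfs/block/$d/device"
    echo 31 > "$sysfs/block/$d/device/queue_depth"   # a typical default
done

for f in "$sysfs"/block/sd*/device/queue_depth; do
    echo 1 > "$f"                   # one outstanding command per disk
done

cat "$sysfs"/block/sd*/device/queue_depth   # prints 1 for each disk
rm -rf "$sysfs"
```

The module-parameter route (mpt2sas.max_queue_depth=1 on the kernel command line) achieves the same at driver load time.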
Hi. Thanks for your kind help, Bernd. Setting /sys/block/sd[c-j]/device/queue_depth to 1 unfortunately didn't solve the issue. I then put the following at the end of the boot line: mpt2sas.disable_watchdog=1 mpt2sas.max_queue_depth=1 mpt2sas.max_sgl_entries=64 mpt2sas.max_sectors=64 However, with these the OS doesn't see the controllers at all anymore, so I had to remove them from the boot line. Thank you.
Hi again. I set up two disks as two separate BTRFS volumes (no RAID) and ran some tests. One of the disks failed to complete the task given to it, but it wasn't dropped, and checking the mountpoint shows it is still mounted; the error it gives is "Stale file handle". The other drive completed all tasks successfully. It seems that RAID is indeed a problem, if not THE problem. Thank you.
PS: To summarize what I have in my server: 1 - BTRFS RAID1 (5 disks): FAILS 2 - BTRFS single data profile (2 disks): FAILS 3 - BTRFS single-disk FS (no RAID): FAILS (but recovers without rebooting; doesn't drop or unmount) 4 - BTRFS single-disk FS (no RAID): WORKS. I'll run one more test, this time using a leafsize of 64k; I suspect that might help. Thank you.
Hi. I created a BTRFS RAID0 using a leafsize of 64k. Copying some files to the RAID produces strange output in dmesg: the same "Device not ready" block repeats for dozens of consecutive WRITE(10) commands, with essentially only the start LBA in the CDB advancing. A representative excerpt:

[107991.484337] sd 8:0:10:0: [sdm] Device not ready
[107991.484752] sd 8:0:10:0: [sdm] Result: hostbyte=0x00 driverbyte=0x08
[107991.486441] sd 8:0:10:0: [sdm] Sense Key : 0x2 [current]
[107991.486856] sd 8:0:10:0: [sdm] ASC=0x4 ASCQ=0x0
[107991.487260] sd 8:0:10:0: [sdm] CDB: cdb[0]=0x2a: 2a 00 00 3a 0f 80 00 04 00 00
[107991.488100] sd 8:0:10:0: [sdm] Device not ready
[107991.488510] sd 8:0:10:0: [sdm] Result: hostbyte=0x00 driverbyte=0x08
[107991.489321] sd 8:0:10:0: [sdm] Sense Key : 0x2 [current]
[107991.490108] sd 8:0:10:0: [sdm] ASC=0x4 ASCQ=0x0
[107991.490495] sd 8:0:10:0: [sdm] CDB: cdb[0]=0x2a: 2a 00 00 3a 13 80 00 04 00 00

(... the same five-line pattern repeats, the CDB start LBA mostly incrementing by 0x400 sectors, with occasional jumps, through 2a 00 00 21 5c 00 00 04 00 00 ...)

The copying process does not stop, the drives do not drop, and the mountpoint stays intact; however, trying to read the copied files results in an Input/Output Error. Thank you.
Hi Liveaxle, would you share the brand and model of your HDDs, and print here the exact partition table you are using? Thank you
Hi Kurk. As for the HDDs, here is the list: Hitachi_HDS5C4040ALE630 (4TB) - 4 disks TOSHIBA_DT01ACA300 (3TB) - 2 disks WDC_WD10EARS-00Y5B1_WD-WCAV5N165986 (1TB) - 1 disk WDC_WD3200AAJS-00L7A0_WD-WMAV20125236 (320GB) - 1 disk WDC_WD3200AAKX-001CA0_WD-WCAYUH130479 (320GB) - 1 disk All of them use BTRFS, which has its own way of partitioning: when a new BTRFS volume is created, it clears the old partition table, and fdisk -l doesn't show any partitions. Thank you.
Hi. This bug is still present in 3.12-rc1 as of today's tests. Thank you.
You might want to try the latest P17 firmware as well; it has been out for a couple of weeks now. There's not much in the changelogs, but it seems to fix at least some SGPIO-related issues.
Hi. In fact I installed P17 on both controllers (the 2308, flashed as a 9207, and the M1015, flashed as a 9211), but nothing changed at all. P16 worked just fine under Windows Server 2012; the problem lies in mpt2sas as far as I can see. Thank you.
Has anyone checked the operating temperature of the SAS chip on the HBAs? The max operating temp is 55C. I'm seeing this issue on a box with three 9207-8i cards running zfs, and the operating temp climbs to 64-67C before a drop-out occurs. Temps can be checked with the lsiutil utility.
Hello all, I'm currently hitting this problem consistently with kernel 3.10.25 during an MD RAID6 resync. One thing I found that stops it from happening is disabling SERR and PERR in the BIOS; I don't know if that helps.
I can confirm Tommy's observation that disabling PERR and SERR solves the issue. The motherboard I am using (Supermicro X8DTH-iF) does not have those exact BIOS settings, but in my case the following BIOS change and kernel command line arguments eliminated the issue: MB BIOS: BIOS->Advanced->Advanced Chipset Configuration->North Bridge Configuration->ASPM=Disabled Linux boot command line options: pcie_aspm=off disable_msi=1 With these changes a very intense fio run has gone four days without a single error or issue on a Linux ZFS filesystem. Before these changes I could not go four hours without multiple HBAs disappearing (my config has three HBAs); sometimes the HBAs would disappear within 15-20 minutes of benchmark runtime.
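On a GRUB2-based distro, the boot options Jeff lists would typically go into GRUB_CMDLINE_LINUX. The sketch below does the edit on a throwaway file, since the real path (usually /etc/default/grub) and the follow-up regeneration step (update-grub or grub2-mkconfig) vary by distro.

```shell
# Append Jeff's boot options to a copy of a GRUB2 defaults file.
# Real path is typically /etc/default/grub, followed by update-grub
# (or grub2-mkconfig) -- both distro-dependent.
tmp=$(mktemp)
printf 'GRUB_CMDLINE_LINUX="quiet"\n' > "$tmp"

# Capture the existing options and append the new ones inside the quotes.
sed -i 's|^GRUB_CMDLINE_LINUX="\(.*\)"|GRUB_CMDLINE_LINUX="\1 pcie_aspm=off disable_msi=1"|' "$tmp"

cat "$tmp"
# -> GRUB_CMDLINE_LINUX="quiet pcie_aspm=off disable_msi=1"
rm -f "$tmp"
```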
Apologies, I forgot to add an important piece of information: I am not running a 3.x kernel but a 2.6 kernel, so this problem would appear to be a hardware (LSI) issue rather than a driver or kernel issue. I am running: Linux zfs-0-0.local 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux LSI driver: filename: /lib/modules/2.6.32-279.14.1.el6.x86_64/extra/mpt2sas.ko version: 18.00.01.00 license: GPL description: LSI MPT Fusion SAS 2.0 Device Driver author: LSI Corporation <DL-MPTFusionLinux@lsi.com> srcversion: ED5DA691FBB263E9F3A55B1 --Jeff
Hello all, I've been playing with this a bit more and found that disabling PERR and SERR might only hide the problem for a while: sustained load over longer periods (4-5 hours) still ends in a crash. I have re-enabled PERR and SERR and I'm running with ASPM enabled as well; the change that made it all go away was switching the PCIe "Above 4G" decoding option on. Note that I had also changed from a PCIe 2.0 SAS HBA to a PCIe 3.0 SAS HBA, and the problem manifested itself only with the latter. Another thing to note is that I have mixed PCIe2 and PCIe3 devices with a PCIe3-capable CPU. /Tommy
Created attachment 122561 [details] dmesg showing mpt2sas errors with ibm m1015 in it mode (fw 19)
Created attachment 122581 [details] "zpool status" output zpool status showing the disks configured as a raidz3 vdev.
Hello all, enabling "Above 4G decoding" in the BIOS did not help in my case. I enabled PERR and SERR as well; PCIe ASPM is forced on by the BIOS and the kernel. When I scrub my zpool, the system locks up, this time at 7.13% progress. After a reset the scrubbing continues, and sometimes it locks up a second time. So in general I get 1-2 lockups during a scrub, but it always finishes the scrub without errors (of course, when the disks drop out, the ZFS scrub reports errors). Hardware: Case: Inter-Tech 4HU-4324L Board: Supermicro X9SCM-F CPU: Intel Xeon E3-1230 V2 RAM: 2x8GB ECC ( Samsung M391B1G73BH0-CH9 ) HBA: IBM ServeRAID M1015 ( IT mode, FW version 17 ) Disks: 10x 3TB WD Green ( WD30EZRX ) and 1x 3TB Hitachi ( HDS5C303 ) Software: - Gentoo hardened, kernel 3.12.6-hardened-r4 (other kernel versions fail, too) - All the disks luks encrypted - A pool "rpool" for the system on a ssd - A pool "tank" for the data on a raidz3 I have attached "zpool status" and dmesg logs (see posts above).
modinfo mpt2sas filename: /lib/modules/3.12.6-hardened-r4/kernel/drivers/scsi/mpt2sas/mpt2sas.ko version: 16.100.00.00 license: GPL description: LSI MPT Fusion SAS 2.0 Device Driver author: LSI Corporation <DL-MPTFusionLinux@lsi.com> srcversion: 17F8D55839A477BC4077B0B
I ran more detailed tests this weekend. ASPM & MSI disabled = stable machine under zfs load ASPM disabled / MSI enabled = stable machine under zfs load ASPM enabled / MSI disabled = unstable, lost an HBA under zfs load Hardware: Supermicro X8DTH-iF, BIOS 2.1b (current) 2x Xeon X5670, 48GB DDR3 1333Mhz Reg/ECC 3x LSI 9207-8i, phase 18 firmware 36x Seagate ST32000444SS It appears to be ASPM, and vulnerability to the issue may vary by chipset; I know other motherboards mentioned in this thread have a different chipset. I have other systems in the field with similar components but a C206-chipset motherboard, and these issues are not occurring there.
Addendum/corrections to the last comment: the LSI 9207-8i cards are on phase 17 firmware, and motherboards with the Intel C602 chipset appear to be functioning without issue. Apologies, not enough coffee consumed this morning.
Thanks for the hint about ASPM! After disabling it in the BIOS I was able to scrub my zpool without a single issue. # zpool status .... scan: scrub repaired 0 in 6h32m with 0 errors on Tue Jan 21 03:15:12 2014 .... Problem solved. Proper support for PCIe ASPM would be great, though!
Awesome! I'm glad it is working for others.
(In reply to Jeff Johnson from comment #48) > Awesome! I'm glad it is working for others. Disabling ASPM did the trick for me! It always hung within 20 minutes under heavy IO before disabling ASPM. On a Supermicro X10SL7 the setting is under Advanced -> Chipset Configuration -> System Agent Configuration -> PCIe Configuration.
Created attachment 248831 [details] Possible fix This patch disables ASPM powersave for the controller's PCI link. I can't reproduce the issue with ASPM enabled and this patch applied (4.9 kernel, LSI SAS 9217-8i HBA), but more testing won't hurt.
(As a side note, does anyone else have an issue [0] with "rmmod mpt3sas"? It's reproducible with an LSI SAS 9217-8i HBA, and I would like to know whether other HBAs are affected.) [0] https://www.spinics.net/lists/linux-scsi/msg100687.html
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ffdadd68af5a397b8a52289ab39d62e1acb39e63 The patch is merged; this bug can be closed.