Bug 5173
Summary: | USB subsystems hangs when stressing the bulk-only storage device | ||
---|---|---|---|
Product: | Drivers | Reporter: | Feng-sung Yang (fsyang_tw) |
Component: | USB | Assignee: | Alan Stern (stern) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | greg |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 5089 | ||
Attachments: | 1 fail, 3 okay, 3 pending, 1 not found |
Description
Feng-sung Yang
2005-09-02 02:21:21 UTC
The patches in these two messages will help prevent usb-storage from hanging when one of these errors occurs. Note that both patches must be applied, in this order: http://marc.theaimsgroup.com/?l=linux-usb-devel&m=112258712200336&w=2 http://marc.theaimsgroup.com/?l=linux-usb-devel&m=112258718111885&w=2 Unforunately they didn't get submitted in time to appear in 2.6.13. You may also find that this patch http://marc.theaimsgroup.com/?l=linux-usb-devel&m=112551468126219&w=2 helps the port resets to work better. I don't know what else we can do about improving the error recovery. Perhaps adding a time delay before doing the port reset would help? You can try editing the source code to do this. In drivers/usb/storage/transport.c, immediately after the statement label Handle_Errors: in the usb_stor_invoke_transport() routine, insert a line saying msleep(2000); Make that however many milliseconds you think is needed. Created attachment 5896 [details]
1 fail, 3 okay, 3 pending, 1 not found
Thanks for the patches. The situation gets better. But after long stress testings, the USB subsystem still hangs. I tested 8 pairs of flash card/card readers at the same time. After the tests are all done, these 8 pairs are removed and then reconnected simultaneously. Initially the tests are okay. (By "okay", it means the tests can finish but some may take much longer time to finish than others. It should be a retry issue.) After several tests, sometimes one pair won't be detected or it will be detected much later than others. The port number of this card reader is not specific. Later I connected another USB disk (with 8 partitions) to the bus while testing to read some other files. At the last test, the first card reader (/dev/sda) failed to be opened for write after some tests. The following 3 are okay although lots of re-opening /dev/sdX is performed to retry the testing. The mentioned 4 card readers are connected to the same hub. Another four readers connected to another hub are detected inside /proc/bus/usb/005/. However, the block devices /dev/sdX never appear. Doing a "lsusb" hangs. [root@fsyang fsyang]# ls -l /proc/bus/usb/005 total 0 -rw-rw---- 1 fsyang usb 43 Sep 5 10:55 001 -rw-rw-r-- 1 root usb 43 Sep 5 11:22 038 -rw-rw-r-- 1 root usb 43 Sep 5 11:23 040 -rw-rw-r-- 1 root usb 51 Sep 5 13:43 046 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 062 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 063 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 064 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 065 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 066 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 067 -rw-rw-r-- 1 root usb 50 Sep 5 14:03 068 [root@fsyang kernel]# ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:30,comm | grep usb 25033 25033 TS - 0 24 0 0.0 S - usb-storage 28030 28030 TS - 0 24 0 0.0 S - usb-storage 28046 28046 TS - 0 24 0 0.0 S - usb-storage 28082 28082 TS - 0 24 0 0.0 S - usb-storage 28108 28108 TS - 0 24 0 0.0 S - usb-storage 28134 28134 TS - 0 24 0 0.0 S - usb-storage 28160 28160 TS - 0 24 0 0.0 S - usb-storage 28186 28186 TS - 0 24 0 0.0 S - usb-storage 28960 28960 TS - 0 23 0 0.0 D+ - lsusb [root@fsyang kernel]# ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:30,comm | grep scsi 3445 3445 TS - 0 24 0 0.0 S - scsi_eh_5 3471 3471 TS - 0 19 0 0.0 S - scsi_eh_6 3497 3497 TS - 0 24 0 0.0 S - scsi_eh_7 3523 3523 TS - 0 24 0 0.0 S - scsi_eh_8 14934 14934 TS - 0 24 0 0.0 S - scsi_eh_73 16122 16122 TS - 0 19 0 0.0 S - scsi_eh_77 16137 16137 TS - 0 19 0 0.0 S - scsi_eh_78 16174 16174 TS - 0 19 0 0.0 S - scsi_eh_79 16200 16200 TS - 0 19 0 0.0 S - scsi_eh_80 16226 16226 TS - 0 19 0 0.0 S - scsi_eh_81 16252 16252 TS - 0 19 0 0.0 S - scsi_eh_82 16278 16278 TS - 0 19 0 0.0 S - scsi_eh_83 17901 17901 TS - 0 24 0 0.0 S - scsi_eh_91 17953 17953 TS - 0 19 0 0.0 S - scsi_eh_93 17979 17979 TS - 0 19 0 0.0 S - scsi_eh_94 18005 18005 TS - 0 19 0 0.0 S - scsi_eh_95 18033 18033 TS - 0 19 0 0.0 S - scsi_eh_96 18062 18062 TS - 0 24 0 0.0 S - scsi_eh_97 18774 18774 TS - 0 19 0 0.0 S - scsi_eh_98 18801 18801 TS - 0 19 0 0.0 S - scsi_eh_99 18827 18827 TS - 0 19 0 0.0 S - scsi_eh_100 18853 18853 TS - 0 19 0 0.0 S - scsi_eh_101 18879 18879 TS - 0 19 0 0.0 S - scsi_eh_102 18905 18905 TS - 0 19 0 0.0 S - scsi_eh_103 18931 18931 TS - 0 19 0 0.0 S - scsi_eh_104 19650 19650 TS - 0 24 0 0.0 S - scsi_eh_105 19676 19676 TS - 0 24 0 0.0 S - scsi_eh_106 19702 19702 TS - 0 24 0 0.0 S - scsi_eh_107 19728 19728 TS - 0 24 0 0.0 S - scsi_eh_108 19754 19754 TS - 0 19 0 0.0 S - scsi_eh_109 19780 19780 TS - 0 24 0 0.0 S - scsi_eh_110 19806 19806 TS - 0 19 0 0.0 S - scsi_eh_111 20275 20275 TS - 0 19 0 0.0 S - scsi_eh_112 25031 25031 TS - 0 19 0 0.0 S - scsi_eh_129 25786 25786 TS - 0 19 0 0.0 S - scsi_eh_130 25801 25801 TS - 0 19 0 0.0 S - scsi_eh_131 25838 25838 TS - 0 19 0 0.0 S - scsi_eh_132 25864 25864 TS - 0 19 0 0.0 S - scsi_eh_133 25890 25890 TS - 0 19 0 0.0 S - scsi_eh_134 25916 25916 TS - 0 19 0 0.0 S - scsi_eh_135 25942 25942 TS - 0 19 0 0.0 S - scsi_eh_136 28029 28029 TS - 0 24 0 0.0 S - scsi_eh_145 28045 28045 TS - 0 19 0 0.0 S - scsi_eh_146 28081 28081 TS - 0 19 0 0.0 S - scsi_eh_147 28107 28107 TS - 0 19 0 0.0 S - scsi_eh_148 28133 28133 TS - 0 19 0 0.0 S - scsi_eh_149 28159 28159 TS - 0 19 0 0.0 S - scsi_eh_150 28185 28185 TS - 0 19 0 0.0 S - scsi_eh_151 Since the data amount is too large. I didn't enable the USB storage verbose debug message option. Only the normal messages are logged (/var/log/kernel/info). They are attached here (bz2 file). I hope it will be helpful. About my previous comment, the scsi_en_XX drivers seem not got released after my test, so I do some more tests now. I find if I disconnect the card reader when the program is testing it (I am sure write operations will result in this phenomemon. But I don't know if read will...). The usb_storage driver will be gone, however, the scsi_en_XX still present. In my previous comment, I did remove the card readers several times during the test to test if the program/Linux is robust enough. Now.. I am sure the reading operations will also prevent the scsi_en_XX driver from released if you disconnect the device at the same time. You're trying to do too much at once. If you are interested in tracking down and solving these hangups, then you should test only one storage device at a time. Then when something goes wrong, you'll be able to find out where and why. Also you definitely should turn on CONFIG_USB_STORAGE_DEBUG. It's the only way to tell what the usb-storage driver is really doing. With only one device being tested, the output won't be too large to handle -- especially if it's one of the devices that starts getting errors almost immediately. There have been some patches posted for the SCSI core that might help solve some of the problems you see. They aren't included in 2.6.13, but they might be in the current -mm tree. I have tried the -mm1 patch (plus the http://marc.theaimsgroup.com/?l=linux-usb-devel&m=112551468126219&w=2 reset patch). This time, only one card reader is used. I remove the card reader while it's being read (dd if=/dev/sda of=/dev/null). Then the usb-storage driver is gone with scsi_eh_4 left. I am not sure if I should open another bug report for this bug since it may have already been discovered by others although I haven't find the corresponding bug report. The following is the last trace messages: Sep 6 17:00:34 fsyang kernel: usb-storage: Command READ_10 (10 bytes) Sep 6 17:00:34 fsyang kernel: usb-storage: 28 00 00 01 30 40 00 00 40 00 Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk Command S 0x43425355 T 0x4cf L 32768 F 128 Trg 0 LUN 0 CL 10 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 31/31 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk command transfer result=0 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 32768 bytes, 1 entries Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 32768/32768 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk data transfer result 0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: Attempting to get CSW... Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 13/13 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk status result = 0 Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk Status S 0x53425355 T 0x4cf R 0 Stat 0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: scsi cmd done, result=0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: *** thread sleeping. Sep 6 17:00:34 fsyang kernel: usb-storage: queuecommand called Sep 6 17:00:34 fsyang kernel: usb-storage: *** thread awakened. Sep 6 17:00:34 fsyang kernel: usb-storage: Command READ_10 (10 bytes) Sep 6 17:00:34 fsyang kernel: usb-storage: 28 00 00 01 30 80 00 00 40 00 Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk Command S 0x43425355 T 0x4d0 L 32768 F 128 Trg 0 LUN 0 CL 10 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 31/31 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk command transfer result=0 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 32768 bytes, 1 entries Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 32768/32768 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk data transfer result 0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: Attempting to get CSW... Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 13/13 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk status result = 0 Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk Status S 0x53425355 T 0x4d0 R 0 Stat 0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: scsi cmd done, result=0x0 Sep 6 17:00:34 fsyang kernel: usb-storage: *** thread sleeping. Sep 6 17:00:34 fsyang kernel: usb-storage: queuecommand called Sep 6 17:00:34 fsyang kernel: usb-storage: *** thread awakened. Sep 6 17:00:34 fsyang kernel: usb-storage: Command READ_10 (10 bytes) Sep 6 17:00:34 fsyang kernel: usb-storage: 28 00 00 01 30 c0 00 00 40 00 Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk Command S 0x43425355 T 0x4d1 L 32768 F 128 Trg 0 LUN 0 CL 10 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Sep 6 17:00:34 fsyang kernel: usb-storage: Status code 0; transferred 31/31 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transfer complete Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk command transfer result=0 Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 32768 bytes, 1 entries Sep 6 17:00:34 fsyang kernel: usb-storage: Status code -71; transferred 26112/32768 Sep 6 17:00:34 fsyang kernel: usb-storage: -- unknown error Sep 6 17:00:34 fsyang kernel: usb-storage: Bulk data transfer result 0x4 Sep 6 17:00:34 fsyang kernel: usb-storage: -- transport indicates error, resetting Sep 6 17:00:34 fsyang kernel: usb 5-1: USB disconnect, address 17 Sep 6 17:00:34 fsyang kernel: usb-storage: storage_disconnect() called Sep 6 17:00:34 fsyang kernel: usb-storage: usb_stor_stop_transport called Sep 6 17:00:36 fsyang kernel: usb-storage: No reset during disconnect Sep 6 17:00:36 fsyang kernel: usb-storage: usb_stor_Bulk_reset called Sep 6 17:00:36 fsyang kernel: usb-storage: No reset during disconnect Sep 6 17:00:36 fsyang kernel: usb-storage: scsi cmd done, result=0x70000 Sep 6 17:00:36 fsyang kernel: usb-storage: *** thread sleeping. Sep 6 17:00:36 fsyang kernel: usb-storage: queuecommand called Sep 6 17:00:36 fsyang kernel: usb-storage: Fail command during disconnect Sep 6 17:00:36 fsyang kernel: SCSI error : <4 0 0 0> return code = 0x10000 Sep 6 17:00:36 fsyang kernel: usb-storage: queuecommand called Sep 6 17:00:36 fsyang kernel: usb-storage: Fail command during disconnect Sep 6 17:00:36 fsyang kernel: SCSI error : <4 0 0 0> return code = 0x10000 Sep 6 17:00:36 fsyang kernel: usb-storage: -- usb_stor_release_resources Sep 6 17:00:36 fsyang kernel: usb-storage: -- sending exit command to thread Sep 6 17:00:36 fsyang kernel: usb-storage: -- dissociate_dev Sep 6 17:00:36 fsyang kernel: usb-storage: *** thread awakened. Sep 6 17:00:36 fsyang kernel: usb-storage: -- exiting As for the hang problem, since I haven't find it happen for only one card reader. I can only use several card readers simultaneously to generate this error. I have tried four today and no error happened. I'll try more to log the verbose debug message. Hi, I have open another bug report for the orphan scsi_eh_XX problem as http://bugzilla.kernel.org/show_bug.cgi?id=5195 since it's more serious than I thought before. Hey, I just noticed this. I have nothing to contribute except for some applause at the notion of this kind of stress test! Linux needs this kind of testing, as well as the one-drive-at-a-time sort of testing. It can be a real pain to _find_ the bugs reported by this sort of test, but it's important to know how well these work. I think these problems have been fixed in 2.6.14-rc2. Try using that kernel and see what happens. Hi, I have tried linux-2.6.14-rc4 dozens of times and no hang problem happens again. Although the device (/dev/sdX) usually failed to be opened or written when 8 pairs were tested simultaneously, I think it's another problem. Thanks a lot. Okay, I'll mark this bug CLOSED. |