Bug 2289
Summary: | random hang/null pointer after backup and disconnect of treo 300 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Joe Nardelli (jnardelli) |
Component: | USB | Assignee: | Greg Kroah-Hartman (greg) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | clark_torgerson, r40 |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.4 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
OOPS
Some debugging code. some debugging output usbsnoop data capture proposed patch cleaner patch Real patch that will be added to the kernel tree. |
Description
Joe Nardelli
2004-03-12 06:51:51 UTC
Upgraded treo to 3.5.2H6.3 firmware rev 1.21 prl rev 10019, and got same results. Should be fixed in 2.6.6-rc3. If not, please reopen the bug and post the new oops message (if any). Created attachment 2749 [details]
OOPS
Of four attempts, three resulted in lockups with nothing being logged, and the
fourth produced a nice OOPS.
Linux iscjoe.infosciences.com 2.6.6-rc3 #3 SMP Thu Apr 29 17:15:45 EDT 2004
i686 unknown unknown GNU/Linux
As an insanely awful workaround, commenting out all of the usb_unlink_urb, usb_free_urb, and kfree calls in port_release in usb-serial.c appears to prevent my machine from crashing, even after 7 backups. I assume that this results in a very nasty memory leak :-) I'm just about positive I've identified what's causing this problem, and it looks like it affects all treos/visors, but I only have a treo 300 to test against, so I'm not sure. Here's the problem: 1) On connect, treo_attach() in visor.c is called, and the following are reused across both ttyUSB0, and ttyUSB1: read_urb, bulk_in_endpointAddress, bulk_in_buffer, write_urb, bulk_out_size, bulk_out_endpointAddress, bulk_out_buffer, interrupt_in_urb, interrupt_in_endpointAddress, and interrupt_in_buffer. There's a funny comment just above calling this 'pretty ugly', and I would have to agree :-) 2) On disconnect, port_release() in usb-serial is called, which tries to do some cleanup, but results in calling usb_unlink_urb(), usb_free_urb(), and kfree() multiple times on the same memory address, which I assume is a very bad thing to do. This is getting way beyond my knowledge of USB treo/visor communication, but I would assume the thing to do, would either be to add a treo/visor specific cleanup, or to make sure that the areas being reused in treo_attach() do not have problems on cleanup. I've attached a code snippet showing some debug work I did in port_release(), and some of it's output. In this example, it looks like read_urb (c14e4348), interrupt_in_urb (c0c8dc94), bulk_in_buffer (c7439980), and interrupt_in_buffer (c5a56ea8) are being cleaned up multiple times. I'll post this at bugzilla as well, I just thought fellow treo/visor users might be interested. Created attachment 2778 [details]
Some debugging code.
Some debugging code.
Created attachment 2779 [details]
some debugging output
some debugging output
Yes, this looks like it is the problem, good debugging. So, care to make a patch to fix this? :) I'm an expert on neither USB nor the kernel, but I'll see what I can do. Be prepared for some noob questions :-) Having the same problem. I don't have much coding ability, but I would be more than willing to test and confirm once a patch becomes available. I have the same problem with my Treo 90. This is the only thing stopping me from migrating from 2.4 to 2.6. I've currently got 2.6.6 waiting in the wings for more testing / bugfixes so any way I can be of help... Fran :):):) I've currently identified 3 problem areas in treo related code (and possibly others). I'm not sure if all of them NEED to be fixed, but at the very least, the third one does: 1) The connection initialization in both palm_os_3_probe and palm_os_4_probe does not appear to correspond to the actual communication that is going on. There is some sanity checking done there to handle bad communication, but I'm not sure that it is adequate. 2) The number of bulk in, bulk out, and interrupt in endpoints should be 1, 2, and 1 respectively, but are instead listed as 2, 2, and 2 in the code. It seems a bit odd to have these hard-coded, as these can be determined from the device. 3) treo_attach mangles memory between the first and second port and causes REALLY bad things to happen. I'm currently desperately looking for documentation describing the protocol that the treos are using, as I suspect that would make things alot easier. Looking at the raw data going over the wire using usbsnoop under windows is a bit time consuming, but at least provides a known working example. The protocol is the same as documented by the pilot-link program. But why do you need/care about the protocol? Once we set up the proper endpoint bindings, everything should work just fine. The driver doesn't care about the data flowing through it to/from the device at all (with the exception of the startup information, which is documented quite well in visor.h.) I should have been more specific about the protocol info I was looking for - I'm interested in connection startup related info. As you can see from http://bugme.osdl.org/attachment.cgi?id=2749&action=view, the driver thinks that there are an insane number of ports (23130) during connection startup, but the only outcome of this is to fallback to using a presumably safe number of ports (2). I suspect that the connect info may have changed for treos, but even so, I don't think that changing anything in palm_os_3_probe would have a real affect - I'd just feel a little better if I knew for sure that we were reading good connect info. Also, given the insane number of ports the driver thinks there is, I'd like to make sure that we're not reading from unitialized data. I agree with you about the driver not caring about the protocol after the endpoints are setup (although endpoint setup is the real problem here). Created attachment 2924 [details]
usbsnoop data capture
Based upon a usb data capture under windows, it would appear that both
VISOR_GET_CONNECTION_INFORMATION and PALM_GET_EXT_CONNECTION_INFORMATION
requests are being sent, but neither one is generating anything useful on
return.
Created attachment 2930 [details]
proposed patch
Here's a proposed patch to (hopefully) solve this problem.
Note that it does not do anything to set the correct number of bulk in, bulk
out, or interrupt in urbs, but it looks like their specification in this file
is obsolete. They can be found from the device, and manually setting them to a
large value (200) does not seem to have any effect.
This patch could use ALOT more testing by users of other devices.
Created attachment 2937 [details]
cleaner patch
Here's a cleaned up patch based upon kernel/usb mailing list feedback.
Created attachment 2972 [details]
Real patch that will be added to the kernel tree.
Here's the patch that I've applied to my trees and will send to Linus in a day
or so.
That patch should now fix this. Just compiled (all usb is in kernel not modules) and syncs ok. Woo hoo! So far so good :) |