|Summary:||random hang/null pointer after backup and disconnect of treo 300|
|Product:||Drivers||Reporter:||Joe Nardelli (jnardelli)|
|Component:||USB||Assignee:||Greg Kroah-Hartman (greg)|
Some debugging code.
some debugging output
usbsnoop data capture
Real patch that will be added to the kernel tree.
Description Joe Nardelli 2004-03-12 06:51:51 UTC
Comment 1 Joe Nardelli 2004-03-24 11:51:52 UTC
Upgraded treo to 3.5.2H6.3 firmware rev 1.21 prl rev 10019, and got same results.
Comment 2 Greg Kroah-Hartman 2004-04-29 12:26:41 UTC
Should be fixed in 2.6.6-rc3. If not, please reopen the bug and post the new oops message (if any).
Comment 3 Joe Nardelli 2004-04-29 14:28:35 UTC
Created attachment 2749 [details] OOPS Of four attempts, three resulted in lockups with nothing being logged, and the fourth produced a nice OOPS. Linux iscjoe.infosciences.com 2.6.6-rc3 #3 SMP Thu Apr 29 17:15:45 EDT 2004 i686 unknown unknown GNU/Linux
Comment 4 Joe Nardelli 2004-04-30 14:38:45 UTC
As an insanely awful workaround, commenting out all of the usb_unlink_urb, usb_free_urb, and kfree calls in port_release in usb-serial.c appears to prevent my machine from crashing, even after 7 backups. I assume that this results in a very nasty memory leak :-)
Comment 5 Joe Nardelli 2004-05-03 16:20:16 UTC
I'm just about positive I've identified what's causing this problem, and it looks like it affects all treos/visors, but I only have a treo 300 to test against, so I'm not sure. Here's the problem: 1) On connect, treo_attach() in visor.c is called, and the following are reused across both ttyUSB0, and ttyUSB1: read_urb, bulk_in_endpointAddress, bulk_in_buffer, write_urb, bulk_out_size, bulk_out_endpointAddress, bulk_out_buffer, interrupt_in_urb, interrupt_in_endpointAddress, and interrupt_in_buffer. There's a funny comment just above calling this 'pretty ugly', and I would have to agree :-) 2) On disconnect, port_release() in usb-serial is called, which tries to do some cleanup, but results in calling usb_unlink_urb(), usb_free_urb(), and kfree() multiple times on the same memory address, which I assume is a very bad thing to do. This is getting way beyond my knowledge of USB treo/visor communication, but I would assume the thing to do, would either be to add a treo/visor specific cleanup, or to make sure that the areas being reused in treo_attach() do not have problems on cleanup. I've attached a code snippet showing some debug work I did in port_release(), and some of it's output. In this example, it looks like read_urb (c14e4348), interrupt_in_urb (c0c8dc94), bulk_in_buffer (c7439980), and interrupt_in_buffer (c5a56ea8) are being cleaned up multiple times. I'll post this at bugzilla as well, I just thought fellow treo/visor users might be interested.
Comment 6 Joe Nardelli 2004-05-03 16:22:11 UTC
Created attachment 2778 [details] Some debugging code. Some debugging code.
Comment 7 Joe Nardelli 2004-05-03 16:22:58 UTC
Created attachment 2779 [details] some debugging output some debugging output
Comment 8 Greg Kroah-Hartman 2004-05-04 13:45:18 UTC
Yes, this looks like it is the problem, good debugging. So, care to make a patch to fix this? :)
Comment 9 Joe Nardelli 2004-05-04 14:07:11 UTC
I'm an expert on neither USB nor the kernel, but I'll see what I can do. Be prepared for some noob questions :-)
Comment 10 Clark Torgerson 2004-05-07 01:12:18 UTC
Having the same problem. I don't have much coding ability, but I would be more than willing to test and confirm once a patch becomes available.
Comment 11 Fran 2004-05-15 15:42:18 UTC
I have the same problem with my Treo 90. This is the only thing stopping me from migrating from 2.4 to 2.6. I've currently got 2.6.6 waiting in the wings for more testing / bugfixes so any way I can be of help... Fran :):):)
Comment 12 Joe Nardelli 2004-05-17 08:01:31 UTC
I've currently identified 3 problem areas in treo related code (and possibly others). I'm not sure if all of them NEED to be fixed, but at the very least, the third one does: 1) The connection initialization in both palm_os_3_probe and palm_os_4_probe does not appear to correspond to the actual communication that is going on. There is some sanity checking done there to handle bad communication, but I'm not sure that it is adequate. 2) The number of bulk in, bulk out, and interrupt in endpoints should be 1, 2, and 1 respectively, but are instead listed as 2, 2, and 2 in the code. It seems a bit odd to have these hard-coded, as these can be determined from the device. 3) treo_attach mangles memory between the first and second port and causes REALLY bad things to happen. I'm currently desperately looking for documentation describing the protocol that the treos are using, as I suspect that would make things alot easier. Looking at the raw data going over the wire using usbsnoop under windows is a bit time consuming, but at least provides a known working example.
Comment 13 Greg Kroah-Hartman 2004-05-17 15:29:15 UTC
The protocol is the same as documented by the pilot-link program. But why do you need/care about the protocol? Once we set up the proper endpoint bindings, everything should work just fine. The driver doesn't care about the data flowing through it to/from the device at all (with the exception of the startup information, which is documented quite well in visor.h.)
Comment 14 Joe Nardelli 2004-05-18 07:08:39 UTC
I should have been more specific about the protocol info I was looking for - I'm interested in connection startup related info. As you can see from http://bugme.osdl.org/attachment.cgi?id=2749&action=view, the driver thinks that there are an insane number of ports (23130) during connection startup, but the only outcome of this is to fallback to using a presumably safe number of ports (2). I suspect that the connect info may have changed for treos, but even so, I don't think that changing anything in palm_os_3_probe would have a real affect - I'd just feel a little better if I knew for sure that we were reading good connect info. Also, given the insane number of ports the driver thinks there is, I'd like to make sure that we're not reading from unitialized data. I agree with you about the driver not caring about the protocol after the endpoints are setup (although endpoint setup is the real problem here).
Comment 15 Joe Nardelli 2004-05-20 07:13:34 UTC
Created attachment 2924 [details] usbsnoop data capture Based upon a usb data capture under windows, it would appear that both VISOR_GET_CONNECTION_INFORMATION and PALM_GET_EXT_CONNECTION_INFORMATION requests are being sent, but neither one is generating anything useful on return.
Comment 16 Joe Nardelli 2004-05-20 15:59:40 UTC
Created attachment 2930 [details] proposed patch Here's a proposed patch to (hopefully) solve this problem. Note that it does not do anything to set the correct number of bulk in, bulk out, or interrupt in urbs, but it looks like their specification in this file is obsolete. They can be found from the device, and manually setting them to a large value (200) does not seem to have any effect. This patch could use ALOT more testing by users of other devices.
Comment 17 Joe Nardelli 2004-05-21 12:57:58 UTC
Created attachment 2937 [details] cleaner patch Here's a cleaned up patch based upon kernel/usb mailing list feedback.
Comment 18 Greg Kroah-Hartman 2004-05-25 11:33:55 UTC
Created attachment 2972 [details] Real patch that will be added to the kernel tree. Here's the patch that I've applied to my trees and will send to Linus in a day or so.
Comment 19 Greg Kroah-Hartman 2004-05-25 11:34:49 UTC
That patch should now fix this.
Comment 20 Fran 2004-05-29 14:45:15 UTC
Just compiled (all usb is in kernel not modules) and syncs ok. Woo hoo! So far so good :)