Bug 8223

Summary: RACE: Lock is expected before calling ata_dev_select, but in some call chains not held!
Product: IO/Storage Reporter: Ding (yuanding02)
Component: Serial ATAAssignee: Tejun Heo (htejun)
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, bugzilla.kernel.org, bunk, htejun, jgarzik, tj
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.20.1 Subsystem:
Regression: No Bisected commit-id:

Description Ding 2007-03-16 21:59:19 UTC
ata_dev_select:

The comment of this function: " LOCKING: caller".
But in some call chain lock is not acquired:
ata_dev_read_id <- ata_qc_issue_prot <- ata_bus_probe <- ata_dev_revalidate <-
ata_eh_revalidate_and_attach <- ata_device_add


A lock is not acquired.
But in other call chains, there is lock acquired.
Comment 1 Tejun Heo 2007-03-17 21:56:51 UTC
Hmmm...  Am I missing something?

How does ata_dev_revalidate() end up calling ata_bus_probe()?  The call path is

ata_dev_revalidate() -> ata_dev_read_id() -> ata_exec_internal[_sg](), this one
acquires lock -> .qc_issue().
Comment 2 Ding 2007-03-17 22:39:25 UTC
Thanks for your prompt reply! The comments before "ata_dev_select" says:
"LOCKING: caller."
The call chain: 
ata_dev_revalidate() -> ata_dev_read_id() -> ata_dev_select(), and then it calls
ata_exec_internal[_sg], so shall I assume the lock is acquired AFTER calling
ata_dev_select(), which violates the comment?

Sorry in previous I messed up earlier version of kernel (2.6.6)
Comment 3 Tejun Heo 2007-03-17 23:21:08 UTC
No, what code are you looking at?  ata_dev_read_id() calls ata_exec_internal()
which calls ata_exec_internal_sg() which grabs the host lock and then calls
->qc_issue which in turn calls ata_dev_select().  Nothing wrong there.  I'm
closing the bug as INVALID.
Comment 4 Ding 2007-03-17 23:33:45 UTC
I think I am looking at version 2.6.20.1, and following is the link of it on
cross-reference.
http://lxr.linux.no/source/drivers/ata/libata-core.c#L1448
If I am right, I find on line 1462 ata_dev_read_id calls ata_dev_select(), and
on line 1483, it further calls ata_exec_internal(). So when calling
ata_dev_select() seems to me no lock is acquired.
Am I making mistake?
Thanks.
Comment 5 Tejun Heo 2007-03-17 23:42:26 UTC
No you aren't.  I was mistaken.  My apologies.  Anyways, yeah, there is a race
there.  The direct call to ata_dev_select() in ata_dev_read_id() is there for
mostly historical reason as ->qc_issue() is responsible for that anyway.  It
also can be dangerous on a PATA channel.  Thanks for reporting this.  Not sure
whether that should be wrapped in host lock or just be killed.  I'll post a
message to linux-ide mailing list.
Comment 6 Ding 2007-03-17 23:55:45 UTC
Thanks for your all your prompt replies!
I am also suspecting the following functions may be potential race case:
ata_pio_devchk
ata_devchk
ata_mmio_devchk
From the comment we see they all require the caller to hold the lock, but in the
call chain:
ata_bus_reset -> ata_pio_devchk -> ata_mmio_devchk|ata_pio_devchk, seems there
is no lock.
Am I making mistakes?
Thanks.
Comment 7 Tejun Heo 2007-03-18 00:08:31 UTC
I might be too prompt today (writing faster than thinking).  Anyways, yeah, it
seems like a class of issues.  We should be holding host lock whether we're
calling from EH or not.  Entering EH does take care of exclusion in the port but
nothing is guaranteed host-wide and register accesses should be guaranteed
host-wide exclusion.

In practice, as EH actions don't happen frequently && modern controllers
implement each port very independently, this doesn't cause real problem, but
there still is a slim chance that EH operation going on one channel interferes
operation going on another channel.

For the record, I don't think fixing this problem is of very high priority as it
has been working well until now.  I'll post a patch for this in a few days. 
Thanks a lot.
Comment 8 Ding 2007-03-18 00:19:46 UTC
Thanks!
Comment 9 Alan 2007-06-05 08:30:13 UTC
Status ? Has this been fixed upstream ?
Comment 10 Alan 2007-06-18 07:41:18 UTC
Hello ping ??
Comment 11 Tejun Heo 2007-06-18 09:38:42 UTC
The problem is still there.  It definitely needs fixing but priority is still rather low.
Comment 12 Alan 2008-09-23 11:00:07 UTC
Ping ?
Comment 13 Tejun Heo 2008-09-28 18:49:15 UTC
Pong.  Planning on working on it when moving EH to block layer before the end of this year, hopefully. :-)
Comment 14 Alan 2009-03-18 09:32:30 UTC
Its now march so ping again
Comment 15 Alan 2009-06-29 12:20:25 UTC
Ping again - Tejun did you look at this yet - its almost a year now ?
Comment 16 Tejun Heo 2009-07-15 09:52:12 UTC
This being so tedious, I was trying to think of some clever way to do it without changing every exception processing functions.  Ehhh... I think I should just bite and do it.  Will work on it.  Thanks for reminding this.
Comment 17 Marc Bejarano 2010-09-18 14:55:02 UTC
FYI: Tejun posted a patch and Jeff applied it:
http://thread.gmane.org/gmane.linux.ide/47481