Bug 15658

Summary: [PATCH] x86 constant_test_bit() prone to misoptimization with gcc-4.4
Product: Platform Specific/Hardware Reporter: Michael Shigorin (mike)
Component: x86-64    Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX    
Severity: normal CC: greg, hpa, mingo, vsu
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27 Subsystem:
Regression: No Bisected commit-id:
Attachments: bluntly drop overcasting instead of fiddling with gcc's sense of humour

Description Michael Shigorin 2010-03-31 18:25:28 UTC
Created attachment 25776 [details]
bluntly drop overcasting instead of fiddling with gcc's sense of humour

While debugging a bit_spin_lock() hang, the problem was tracked down to gcc-4.4 misoptimizing constant_test_bit(): the 'const volatile unsigned long *addr' argument is cast to 'unsigned long *', and the generated inner loop ends with an unconditional jump back to the pause instruction (not to the test), leading to a hang.

Compiling with gcc-4.3, or disabling CONFIG_OPTIMIZE_INLINING, yields an inlined constant_test_bit() and a correct jump.

Architectures other than x86 (the affected file is include/asm-x86/bitops.h) may implement this slightly differently; 2.6.29 mitigates the misoptimization by changing the function prototype (commit c4295fbb6048d85f0b41c5ced5cbf63f6811c46c), but fixing the underlying issue would be better.

Here's my translation of the original analysis that led to the above conclusion (the Russian text lives in an internal bugzilla; I'm no kernel hacker, so take it with a grain of salt and blame me, not them):

---
Gory details
============

Slightly simplified bit_spin_lock() main part:
-------------------include/linux/bit_spinlock.h-----------------------

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
        <...>
        while (test_and_set_bit_lock(bitnum, addr)) {
                while (test_bit(bitnum, addr)) {
                        <...>
                        cpu_relax();
                        <...>
                }
        }
        <...>
}
----------------------------------------------------------------------

The outer loop attempts to set the needed bit atomically.
If the bit was already set, the inner loop waits for it to be cleared, and so forth.

Assembly code generated by gcc-4.4 for the inner loop:
----------------------------------------------------------------------
$ rpm2cpio kernel-image-tmc-srv-2.6.27-tmc24.x86_64.rpm | cpio -idmv *jbd.ko
<...>
$ cd ./lib/modules/2.6.27-tmc-srv-tmc24/kernel/fs/jbd/
$ objdump -d --no-show-raw-insn jbd.ko | grep -A92 "<journal_get_undo_access>:" | tail -7
    240a:       mov    %rbx,%rsi
    240d:       mov    $0x16,%edi
    2412:       callq  30 <constant_test_bit>
    2417:       test   %eax,%eax
    2419:       je     2328 <journal_get_undo_access+0x48>
    241f:       pause
    2421:       jmp    241f <journal_get_undo_access+0x13f>
----------------------------------------------------------------------

Offsets:
  240a, 240d - argument setup (bit number 0x16 in edi, bitmap address in rsi)
  2412 - call to constant_test_bit()
  2417 - test of the returned value
  2419 - leave the inner loop if zero
  241f - otherwise pause, and then
  2421 - an unconditional jump, *but* to the pause, not to the test

Thus, an endless loop.  In C terms, the generated code is equivalent to:
----------------------------------------------------------------------
static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
        <...>
        while (test_and_set_bit_lock(bitnum, addr)) {
                if (test_bit(bitnum, addr))
                        while (1)
                                cpu_relax();
        }
        <...>
}
----------------------------------------------------------------------

With gcc-4.3, or with the CONFIG_OPTIMIZE_INLINING kernel configuration
parameter disabled, constant_test_bit() gets inlined and the unconditional
jump in the inner loop targets the condition test, as it should.

It is worth mentioning that bit_spin_lock() is used in several places, and
not all of them result in wrong code being generated.  For example,
end_buffer_async_read() has no problem with its inner loop:
----------------------------------------------------------------------
$ rpm2cpio kernel-modules-oprofile-tmc-srv-2.6.27-tmc24.x86_64.rpm | cpio -idmv *vmlinux
<...>
$ cd ./lib/modules/2.6.27-tmc-srv-tmc24/
$ objdump -d --no-show-raw-insn vmlinux | grep -A106 "<end_buffer_async_read>:" | tail -7
ffffffff802f55e2:       mov    %r13,%rsi
ffffffff802f55e5:       mov    $0x4,%edi
ffffffff802f55ea:       callq  ffffffff802f4360 <constant_test_bit>
ffffffff802f55ef:       test   %eax,%eax
ffffffff802f55f1:       je     ffffffff802f54f9 <end_buffer_async_read+0x49>
ffffffff802f55f7:       pause
ffffffff802f55f9:       jmp    ffffffff802f55e2 <end_buffer_async_read+0x132>
----------------------------------------------------------------------


Hang scenario
=============
Below is my take on how the above problem leads to the observed "hung"
system state.

1. At some point, a process modifying the filesystem tries to lock a
   struct buffer_head via bit_spin_lock(), fails on the first attempt,
   and enters the inner loop to wait for the buffer to be freed.

2. Due to the problem described above, the wait turns out to be a bit
   long (in fact, endless).

3. The kjournald kernel thread wakes up (because the current transaction
   times out or grows too large) to commit the transaction to the on-disk
   journal.  But upon moving it to the T_LOCKED state, kjournald notices
   that not all atomic operations on the transaction are complete.  It
   therefore cannot continue and goes to sleep (in
   journal_commit_transaction()), waiting, of course, on the process
   looping indefinitely as described in (1).

4. Subsequent filesystem change requests result in a new active (T_RUNNING)
   transaction.  As free space in that transaction runs out, all processes
   trying to modify the filesystem go to sleep until kjournald commits
   that transaction.  But kjournald is not done with the previous one yet,
   as described in (3).

And so it all "hangs".


Solutions
=========
IMHO it looks very much like a bug in gcc-4.4.  Adding a 'volatile'
qualifier to the type of the data addressed by the second parameter of
bit_spin_lock() doesn't help.  Maybe there's an upstream fix; maybe it
gets fixed by some flags.  Disabling CONFIG_OPTIMIZE_INLINING does help.
gcc-4.3 compiles things correctly, at least for JBD.  Maybe there are other
possibilities... not for me to judge.
---

I wasn't able to find anything similar with a /quick/ Google search, so I'm filing a new bug.

These folks from Massive Computing worked on this one:
* Volodymyr G. Lukiianyk (volodymyrgl/gmail.com) -- analysis
* Alexander Chumachenko (ledest/gmail.com) -- pin-up and fix
Comment 1 Michael Shigorin 2010-07-06 11:38:07 UTC
ping
Comment 2 H. Peter Anvin 2010-07-06 23:01:48 UTC
I have seen in other contexts that gcc 4.4 seems to mishandle "const volatile".
Comment 3 Michael Shigorin 2010-07-07 06:31:02 UTC
This one led to filesystem hangs, hence CCing Greg for 2.6.27.

What else can we do to
1) get the patch merged into 2.6.27.y;
2) draw attention to this class of problem and get it fixed rather than merely mitigated?

Thanks!
Comment 4 Michael Shigorin 2010-09-30 13:52:08 UTC
Took it to LKML, and after several mails, here it is:
http://git.kernel.org/tip/c9e2fbd909c20b165b2b9ffb59f8b674cf0a55b0

Thanks hpa@, gregkh@, akpm@ and linus@.