Bug 16312

Summary: WARNING: at fs/fs-writeback.c:1127 __mark_inode_dirty
Product: File System Reporter: Maciej Rutecki (maciej.rutecki)
Component: ext3Assignee: fs_ext3 (fs_ext3)
Status: CLOSED CODE_FIX    
Severity: normal CC: axboe, bvanassche, florian, htejun, jack, Larry.Finger, maciej.rutecki, rjw, tj, zdenek.kabelac
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 16055    
Attachments: First patch to fix the issue
Second patch to fix the issue
Debugging patch to print more BDI information
Initialize noop bdi properly
sd-delay-uevent.patch
Fix race in add_disk() causing warning in __mark_inode_dirty().

Description Maciej Rutecki 2010-06-28 17:47:43 UTC
Subject    : WARNING: at fs/fs-writeback.c:1127 __mark_inode_dirty
Submitter  : Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date       : 2010-06-28 9:40
Message-ID : AANLkTin24fr5O4_q5Xbo9Y_NKkEmtcp6Hgmr9_4qXaFz@mail.gmail.com
References : http://marc.info/?l=linux-kernel&m=127771804806465&w=2

This entry is being used for tracking a regression from 2.6.34.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Larry Finger 2010-07-09 02:23:40 UTC
This is a duplicate of Bug #16122.
Comment 2 Rafael J. Wysocki 2010-07-09 21:32:53 UTC
*** Bug 16122 has been marked as a duplicate of this bug. ***
Comment 3 Jan Kara 2010-07-23 15:30:50 UTC
I was looking into this. A trivial reproducer is:
mknod devzero c 1 5; touch devzero

Essentially, the problem is that any device inode can get dirty because of time
stamp update or similar reason but some BDIs (in this case the one backing
/dev/zero) do not expect dirty inodes. Hmm, I now have to think how to fix
this... It would be more natural if device inodes were queued for metadata updates in the queues of "parent" filesystem. But this brings problem that device inode would need to be in two queues - one for metadata update to the parent filesystem and one for data writeout to the device (this is not exactly the case of /dev/zero but if you have a device node for /dev/sdb on /dev/sda, then this is exactly what we would need).
Comment 4 Rafael J. Wysocki 2010-07-23 20:11:24 UTC
On Friday, July 23, 2010, Larry Finger wrote:
> On 07/23/2010 06:42 AM, Rafael J. Wysocki wrote:
> > This message contains a list of some regressions from 2.6.34,
> > for which there are no fixes in the mainline known to the tracking team.
> > If any of them have been fixed already, please let us know.
> 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=16312
> > Subject             : WARNING: at fs/fs-writeback.c:1127 __mark_inode_dirty
> > Submitter   : Zdenek Kabelac<zdenek.kabelac@gmail.com>
> > Date                : 2010-06-28 9:40 (26 days old)
> 
> I still have this in 2.6.35-rc5.
Comment 5 Jan Kara 2010-07-26 13:01:36 UTC
Created attachment 27255 [details]
First patch to fix the issue
Comment 6 Jan Kara 2010-07-26 13:02:19 UTC
Created attachment 27256 [details]
Second patch to fix the issue

The above two patches fix the issue for me.
Comment 7 Jan Kara 2010-07-26 13:05:09 UTC
Guys, it would be nice if you could test 2.6.35-rc5 (or basically 2.6.35-whatever) with the above to patches and check that they fix the issue for you. Thanks.
Comment 8 Jan Kara 2010-07-26 13:59:55 UTC
Hmm, looking at the original report referenced, my patches need not fix this particular warning. This particular warning rather looks like a race when udisks-part-id manages to read the device (and touch it's atime) before sd_probe_async runs and registers the bdi (via add_disk()). That seems to be possible as device_add() is run before we even schedule to run sd_async_probe() in a separate thread.

Jens, any idea how this should be fixed? I believe we should avoid this race because we should better have the device's queue fully set up before allowing IO to it... But I'm far from fully understanding this area...
Comment 9 Larry Finger 2010-07-26 17:53:04 UTC
These patches break my system (the one reported in Bug #16122). It obviously is different.

After installing the patches. I get a NULL pointer dereference for address 0x30 as __mark_inode_dirty + 0xf0/0x1a0 (x86_64 system). The code offset 0xf30. I have a 2.6.35-rc6 kernel.

The assembler code around that address is

/home/finger/linux-temp/fs/fs-writeback.c:960
     f28:       4d 8b ac 24 b0 01 00    mov    0x1b0(%r12),%r13
     f2f:       00 
/home/finger/linux-temp/fs/fs-writeback.c:962
     f30:       41 f6 45 30 02          testb  $0x2,0x30(%r13)
     f35:       75 31                   jne    f68 <__mark_inode_dirty+0x128>
     f37:       41 f6 45 28 10          testb  $0x10,0x28(%r13)
     f3c:       75 2a                   jne    f68 <__mark_inode_dirty+0x128>

The code at line 960 is:

        struct backing_dev_info *bdi = wb->bdi;

        if (bdi_cap_writeback_dirty(bdi) &&
            !test_bit(BDI_registered, &bdi->state)) {
                WARN_ON(1);
                printk(KERN_ERR "bdi-%s not registered\n",
                       bdi->name);
        }

and wb->bdi is NULL.
Comment 10 Larry Finger 2010-07-26 19:56:06 UTC
The rest of the traceback is:

touch_atime+0x12d
vfs_readdir+0xbe
sys_detdents64+0x80
system_call_fastpath+0x16
Comment 11 Jan Kara 2010-07-27 11:54:54 UTC
Doh, thanks for testing. It looks as if we somehow get to bdi which isn't properly initialized (wb->bdi should always point back to bdi containing the wb structure). Can you try once more with the debugging patch below applied on top of my two patches and report what it writes to the log? Thanks.
Comment 12 Jan Kara 2010-07-27 11:55:41 UTC
Created attachment 27267 [details]
Debugging patch to print more BDI information
Comment 13 Larry Finger 2010-07-27 14:20:44 UTC
The log will have to be hand written. The system would not boot, neither a netconsole or serial console is possible, and the pictures I took of the screen are really blurry.

The beginning of the dump is:

Inode devices at sysfs has uninit bdi
Mapping bdi sysfs, caps 131
inode_to_bdi returns bdi noop, caps 0
bdi=noop not registered
BUG: unable to handle NULL pointer at offset 0x8
__mark_inode_dirty+0x159/0x240

The above NULL pointer is in the expansion of

      list_move(&inode->i_list, &wb->b_dirty);

at the statement next->prev = new

Beyond this point, it generates lots of other errors and eventually hangs.

Good luck.
Comment 14 Jan Kara 2010-07-27 15:03:26 UTC
Thanks for the prompt test. Indeed noop_backing_dev_info isn't properly initialized. Attached patch should fix this so this patch plus patches from comments 5 and 6 should at least boot (at least they do for me ;).
Comment 15 Jan Kara 2010-07-27 15:04:10 UTC
Created attachment 27272 [details]
Initialize noop bdi properly
Comment 16 Larry Finger 2010-07-27 15:23:51 UTC
With the patches from comments 5, 6 and 15, my system boots cleanly. Good job. As there probably will not be another rc for 2.6.35, this fix is just in time.

Thanks.
Comment 17 Larry Finger 2010-07-27 15:36:23 UTC
I forgot one part. If you want, add a "Tested-by: Larry Finger <Larry.Finger@lwfinger.net> to the patches.
Comment 18 Rafael J. Wysocki 2010-07-27 18:05:55 UTC
Handled-By : Jan Kara <jack@suse.cz>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=27272
Comment 19 Larry Finger 2010-07-29 19:29:50 UTC
This one needs to be reopened. The WARNING that I get does not happen every reboot, but now I'm getting 2 when it happens. This kernel includes the 3 patches posted above.

[   12.183797] ------------[ cut here ]------------
[   12.183808] WARNING: at fs/fs-writeback.c:964 __mark_inode_dirty+0x10f/0x1a0()
[   12.183811] Hardware name: HP Pavilion dv2700 Notebook PC
[   12.183814] Modules linked in: loop(+) dm_mod ide_cd_mod cdrom ide_pci_generic arc4 ecb b43 rng_core mac80211 snd_hda_codec_conexant cfg80211 rfkill led_class amd74xx ssb mmc_core ide_core snd_hda_intel snd_hda_codec snd_pcm snd_timer k8temp pcmcia pcmcia_core forcedeth snd i2c_nforce2 joydev hwmon sg snd_page_alloc serio_raw i2c_core ac battery button ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd sd_mod usbcore fan processor ahci libahci libata scsi_mod thermal
[   12.183872] Pid: 2444, comm: udisks-part-id Not tainted 2.6.35-rc6-realtek+ #11
[   12.183875] Call Trace:
[   12.183882]  [<ffffffff8104608a>] warn_slowpath_common+0x7a/0xb0
[   12.183887]  [<ffffffff810460d5>] warn_slowpath_null+0x15/0x20
[   12.183891]  [<ffffffff81129d5f>] __mark_inode_dirty+0x10f/0x1a0
[   12.183895]  [<ffffffff8111e07d>] touch_atime+0x12d/0x170
[   12.183901]  [<ffffffff810cab91>] generic_file_aio_read+0x5c1/0x720
[   12.183906]  [<ffffffff81107ca2>] do_sync_read+0xd2/0x110
[   12.183912]  [<ffffffff81077e7d>] ? trace_hardirqs_on+0xd/0x10
[   12.183916]  [<ffffffff811083c3>] vfs_read+0xb3/0x170
[   12.183921]  [<ffffffff81002d1c>] ? sysret_check+0x27/0x62
[   12.183925]  [<ffffffff811084cc>] sys_read+0x4c/0x80
[   12.183929]  [<ffffffff81002ceb>] system_call_fastpath+0x16/0x1b
[   12.183932] ---[ end trace fdce3ec00c26fa3d ]---
[   12.183935] bdi-block not registered
[   12.205056] ------------[ cut here ]------------
[   12.205067] WARNING: at fs/fs-writeback.c:964 __mark_inode_dirty+0x10f/0x1a0()
[   12.205070] Hardware name: HP Pavilion dv2700 Notebook PC
[   12.205072] Modules linked in: loop(+) dm_mod ide_cd_mod cdrom ide_pci_generic arc4 ecb b43 rng_core mac80211 snd_hda_codec_conexant cfg80211 rfkill led_class amd74xx ssb mmc_core ide_core snd_hda_intel snd_hda_codec snd_pcm snd_timer k8temp pcmcia pcmcia_core forcedeth snd i2c_nforce2 joydev hwmon sg snd_page_alloc serio_raw i2c_core ac battery button ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd sd_mod usbcore fan processor ahci libahci libata scsi_mod thermal
[   12.205131] Pid: 2456, comm: udisks-part-id Tainted: G        W   2.6.35-rc6-realtek+ #11
[   12.205134] Call Trace:
[   12.205141]  [<ffffffff8104608a>] warn_slowpath_common+0x7a/0xb0
[   12.205146]  [<ffffffff810460d5>] warn_slowpath_null+0x15/0x20
[   12.205150]  [<ffffffff81129d5f>] __mark_inode_dirty+0x10f/0x1a0
[   12.205155]  [<ffffffff8111e07d>] touch_atime+0x12d/0x170
[   12.205160]  [<ffffffff810cab91>] generic_file_aio_read+0x5c1/0x720
[   12.205165]  [<ffffffff81107ca2>] do_sync_read+0xd2/0x110
[   12.205171]  [<ffffffff81077e7d>] ? trace_hardirqs_on+0xd/0x10
[   12.205175]  [<ffffffff811083c3>] vfs_read+0xb3/0x170
[   12.205180]  [<ffffffff81002d1c>] ? sysret_check+0x27/0x62
[   12.205183]  [<ffffffff811084cc>] sys_read+0x4c/0x80
[   12.205188]  [<ffffffff81002ceb>] system_call_fastpath+0x16/0x1b
[   12.205191] ---[ end trace fdce3ec00c26fa3e ]---
[   12.205193] bdi-block not registered

I will put the diagnostic patch from comment 12 to get more info.
Comment 20 Larry Finger 2010-08-01 19:40:00 UTC
The diagnostics patch from #12 did not change anything.

I am beginning to think that Bug 16312 is not the same as Bug 16122. Even with the patches from 16312, I still get warnings as below:

[   11.728776] ------------[ cut here ]------------
[   11.728787] WARNING: at fs/fs-writeback.c:964 __mark_inode_dirty+0x10f/0x1a0()
[   11.728790] Hardware name: HP Pavilion dv2700 Notebook PC
[   11.728792] Modules linked in: loop(+) dm_mod ide_cd_mod cdrom snd_hda_codec_conexant ide_pci_generic arc4 ecb b43 rng_core mac80211 snd_hda_intel r8712u(C) cfg80211 snd_hda_codec amd74xx snd_pcm sg ide_core rfkill led_class snd_timer ssb mmc_core pcmcia snd joydev k8temp hwmon i2c_nforce2 pcmcia_core forcedeth serio_raw snd_page_alloc i2c_core battery ac button ext4 mbcache jbd2 crc16 ohci_hcd sd_mod ehci_hcd usbcore fan processor ahci libahci libata scsi_mod thermal
[   11.728854] Pid: 2449, comm: udisks-part-id Tainted: G         C  2.6.35-rc6-realtek+ #15
[   11.728857] Call Trace:
[   11.728865]  [<ffffffff8104608a>] warn_slowpath_common+0x7a/0xb0
[   11.728869]  [<ffffffff810460d5>] warn_slowpath_null+0x15/0x20
[   11.728874]  [<ffffffff81129d5f>] __mark_inode_dirty+0x10f/0x1a0
[   11.728879]  [<ffffffff8111e07d>] touch_atime+0x12d/0x170
[   11.728885]  [<ffffffff810cab91>] generic_file_aio_read+0x5c1/0x720
[   11.728890]  [<ffffffff81107ca2>] do_sync_read+0xd2/0x110
[   11.728896]  [<ffffffff81077e7d>] ? trace_hardirqs_on+0xd/0x10
[   11.728900]  [<ffffffff811083c3>] vfs_read+0xb3/0x170
[   11.728906]  [<ffffffff81002d1c>] ? sysret_check+0x27/0x62
[   11.728909]  [<ffffffff811084cc>] sys_read+0x4c/0x80
[   11.728914]  [<ffffffff81002ceb>] system_call_fastpath+0x16/0x1b
[   11.728917] ---[ end trace 32e16cacad33229f ]---
[   11.728919] bdi-block not registered

The warnings do not occur with every boot and appear to be some kind of race condition.
Comment 21 Jan Kara 2010-08-02 13:21:05 UTC
Yes, I think these are two different bugs just with a similar manifestation. See comment 8 in this bug. I'm just not sure how to fix the second bug in sd.c. I think I'll send a direct email to a few people to catch some attention.
Comment 22 Tejun Heo 2010-08-05 12:11:59 UTC
Created attachment 27356 [details]
sd-delay-uevent.patch

This would make sd announce the sd device after block device is set up, which probably is the right thing to do but it still is wrong to trigger WARN_ON() when a user program accesses a device node which isn't ready yet.  Can't that warning be removed by using a default bdi w/ BDI_CAP_NO_ACCT_AND_WRITEBACK set for block device nodes?

Thanks.
Comment 23 Jan Kara 2010-08-05 18:07:50 UTC
Thanks for the patch Tejun! Larry, could you please try it?

Well, my naive opinion rather is that userspace shouldn't be able to access a device node which isn't fully set up. That could have unexpected pitfalls for the process and the kernel as well, couldn't it? Technically, it's also hard to switch BDI under a living device inode - the BDI pointer could be cached in various places...
Comment 24 Larry Finger 2010-08-05 18:31:07 UTC
I will as soon as I can.

The hard drive in the box with that particular problem failed and it is in the shop being repaired. That will take longer than it should as it is in warranty and the work has to be done at a central repair facility. The 20 minute job of replacing the hard drive in that notebook will take 3-4 weeks.
Comment 25 Tejun Heo 2010-08-06 06:32:03 UTC
(In reply to comment #23)
> Well, my naive opinion rather is that userspace shouldn't be able to access a
> device node which isn't fully set up. That could have unexpected pitfalls for
> the process and the kernel as well, couldn't it?

But device nodes can be created for any device number whether the device exists or not.  The patch makes sd announce it later so at least udev doesn't try to access it earlier but unless I'm misunderstanding something userspace still can trigger these warnings easily.

Thansk.
Comment 26 Larry Finger 2010-08-20 22:23:12 UTC
I finally got my computer back. Tejun's patch seems to fix the problem.

The warning has been intermittent and I will report back if the warning returns later.
Comment 27 Florian Mickler 2010-09-08 07:44:20 UTC
The patches here seem not to be mainline yet. 
I can see some discussion about the patch set of Jan Kara on lkml[1].
But I can't see Tejun's patch either in linus tree nor on lkml. 

Are these still necessary? 

[1]
References: http://lkml.indiana.edu/hypermail/linux/kernel/1007.3/00884.html
Comment 28 Tejun Heo 2010-09-08 08:23:54 UTC
Ah, okay, now I remember this one.  Jan, what do you think about comment#25?  The patch would prevent udev from triggering the problem automatically but it's still broken.

Thanks.
Comment 29 Rafael J. Wysocki 2010-09-13 17:56:20 UTC
Ignore-Patch : https://bugzilla.kernel.org/attachment.cgi?id=27272
Comment 30 Jan Kara 2010-09-16 18:13:21 UTC
OK, Tejun, I finally read the code. So you are right that the patch doesn't solve the race. The race seems to be in block/genhd.c:add_disk(). Once we call register_disk(), device can be successfully opened (get_gendisk() from blkdev_open() will succeed as soon as disk's kobject exists). But until we call bdi_register_dev(), backing dev isn't properly initialized and thus can see the warning. This also explains why Larry is able to see the warning even with your patch.
Comment 31 Jan Kara 2010-09-16 18:15:04 UTC
Created attachment 30282 [details]
Fix race in add_disk() causing warning in __mark_inode_dirty().
Comment 32 Jan Kara 2010-09-16 18:16:19 UTC
This patch should fix the problem. The kernel boots with it so it shouldn't be really wrong. I'll send it to Jens for review and inclusion...
Comment 33 Rafael J. Wysocki 2010-09-16 18:39:55 UTC
Handled-By : Jan Kara <jack@suse.cz>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=30282
Comment 34 Larry Finger 2010-09-17 00:18:00 UTC
I will test that patch when I get back to my machine, but that will be several weeks.

Larry
Comment 35 Jan Kara 2010-09-20 10:20:33 UTC
*** Bug 17461 has been marked as a duplicate of this bug. ***
Comment 36 Bart Van Assche 2010-10-03 10:05:02 UTC
What is the current status of this issue, and which of the attached patches fixes this issue ? This still occurs with 2.6.35.7 when loading the kernel module ib_srp:

------------[ cut here ]------------
WARNING: at fs/fs-writeback.c:951 __mark_inode_dirty+0x104/0x180()
Hardware name: P5Q DELUXE
Modules linked in: ib_srp scsi_transport_srp scsi_tgt netconsole configfs snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf mlx4_ib ib_mad ib_core fuse loop dm_mod snd_hda_codec_atihdmi snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd mlx4_core sr_mod button i2c_i801 intel_agp soundcore snd_page_alloc serio_raw pcspkr i2c_core cdrom sg usbhid hid uhci_hcd rtc_cmos rtc_core ehci_hcd rtc_lib sd_mod crc_t10dif usbcore edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix pata_marvell ahci libahci libata scsi_mod thermal processor thermal_sys hwmon [last unloaded: ib_srp]
Pid: 3900, comm: blkid Not tainted 2.6.35.7 #1
Call Trace:
 [<ffffffff8104a81f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8104a87a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8112f384>] __mark_inode_dirty+0x104/0x180
 [<ffffffff811232f5>] touch_atime+0x135/0x180
 [<ffffffff810c7f07>] generic_file_aio_read+0x2c7/0x6c0
 [<ffffffff8110ccfa>] do_sync_read+0xda/0x120
 [<ffffffff8103049a>] ? do_page_fault+0x18a/0x3d0
 [<ffffffff8111cb6d>] ? vfs_ioctl+0x3d/0xd0
 [<ffffffff8111cd28>] ? do_vfs_ioctl+0x88/0x540
 [<ffffffff8110d925>] vfs_read+0xb5/0x180
 [<ffffffff8110dae1>] sys_read+0x51/0x80
 [<ffffffff81002fdb>] system_call_fastpath+0x16/0x1b
---[ end trace d756b562ec21031e ]---
bdi-block not registered
Comment 37 Florian Mickler 2010-10-03 10:25:47 UTC
It should be the one mentioned in comment #33 . 

But we have no confirmation about it yet. Can you test it?
Comment 38 Rafael J. Wysocki 2010-10-03 20:13:52 UTC
The patch works (I can reproduce the problem without it), but it's not going to be included in 2.6.36 AFAICT.  It's going to be merged into 2.6.37 and than put into -stable.
Comment 39 Bart Van Assche 2010-10-04 21:10:08 UTC
(In reply to comment #37)
> It should be the one mentioned in comment #33 . 
> 
> But we have no confirmation about it yet. Can you test it?

That patch works great on my setup.
Comment 40 Florian Mickler 2011-01-23 16:09:24 UTC
fixed in .37-rc1 by: 
commit 01ea50638bc04ca5259f5711fcdedefcdde1cf43
Author: Signed-off-by: Jan Kara <jack@suse.cz>
Date:   Thu Sep 16 20:36:36 2010 +0200

    block: Fix race during disk initialization