Bug 5897 - oops when reading battery info
Summary: oops when reading battery info
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: VFS (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Suzuki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-15 07:14 UTC by Vladimir Kondratiev
Modified: 2006-02-15 21:51 UTC (History)
0 users

See Also:
Kernel Version: 2.6.15
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description Vladimir Kondratiev 2006-01-15 07:14:38 UTC
Most recent kernel where this bug did not occur: 2.6.13
Distribution: Fedora Core 4
Hardware Environment: IBM T42P
Software Environment: Vanilla 2.6.15
Problem Description:
I have 'gkrellm' application permanently running. One of its tasks is to monitor
battery status. This is achieved by reading ACPI /proc files.
After some number of suspend/resume cycles, grkellm stalls;
System continue to run; eventually it ceases to suspend.
 /var/log/messages gives:
Jan 15 15:42:20 vkondra-mobl kernel: Unable to handle kernel paging request at
virtual address 6b6b6b8b
Jan 15 15:42:20 vkondra-mobl kernel:  printing eip:
Jan 15 15:42:20 vkondra-mobl kernel: c01340f3
Jan 15 15:42:20 vkondra-mobl kernel: *pde = 00000000
Jan 15 15:42:20 vkondra-mobl kernel: Oops: 0000 [#1]
Jan 15 15:42:20 vkondra-mobl kernel: PREEMPT
Jan 15 15:42:20 vkondra-mobl kernel: Modules linked in: nls_cp437 nls_utf8 smbfs
nls_base bluetooth cpufreq_stats cpufreq_conserva
tive cpufreq_performance cpufreq_powersave cpufreq_ondemand speedstep_centrino
freq_table microcode evdev ipv6 lp autofs4 sunrpc d
m_mirror dm_mod video thermal processor ibm_acpi fan container button battery ac
parport_pc parport snd_intel8x0m snd_seq_dummy sn
d_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss floppy pcmc
ia ipw2200 snd_pcm usbhid crc32 ieee80211 ieee80211_crypt snd_timer snd
soundcore yenta_socket i2c_i801 e1000 firmware_class snd_p
age_alloc rsrc_nonstatic pcmcia_core ehci_hcd uhci_hcd i2c_core intel_agp
agpgart hw_random usbcore ext3 jbd
Jan 15 15:42:20 vkondra-mobl kernel: CPU:    0
Jan 15 15:42:20 vkondra-mobl kernel: EIP:    0060:[<c01340f3>]    Not tainted VLI
Jan 15 15:42:20 vkondra-mobl kernel: EFLAGS: 00010286   (2.6.15)
Jan 15 15:42:20 vkondra-mobl kernel: EIP is at audit_inode+0x81/0xb2
Jan 15 15:42:20 vkondra-mobl kernel: eax: 00000101   ebx: ee671778   ecx:
00000001   edx: ee671778
Jan 15 15:42:20 vkondra-mobl kernel: esi: 6b6b6b6b   edi: d97fb000   ebp:
ee279eb8   esp: ee279ea8
Jan 15 15:42:20 vkondra-mobl kernel: ds: 007b   es: 007b   ss: 0068
Jan 15 15:42:20 vkondra-mobl kernel: Process gkrellm (pid: 2971,
threadinfo=ee278000 task=eee82a90)
Jan 15 15:42:20 vkondra-mobl kernel: Stack: 00000101 ee278000 ee279f44 d97fb000
ee279ed4 c015cd8b 00000101 fffffffe
Jan 15 15:42:20 vkondra-mobl kernel:        ee279f44 00000101 ffffffe9 ee279eec
c015cdd4 d97fb000 d97fb000 ee279f44
Jan 15 15:42:20 vkondra-mobl kernel:        00000008 ee279efc c015ce10 00008001
00000000 ee279f38 c015d464 00008001
Jan 15 15:42:20 vkondra-mobl kernel: Call Trace:
Jan 15 15:42:20 vkondra-mobl kernel:  [<c0103bcc>] show_stack+0x7c/0x85
Jan 15 15:42:20 vkondra-mobl kernel:  [<c0103d11>] show_registers+0x123/0x197
Jan 15 15:42:20 vkondra-mobl kernel:  [<c0103ecb>] die+0xd2/0x145
Jan 15 15:42:20 vkondra-mobl kernel:  [<c0292f45>] do_page_fault+0x373/0x521
Jan 15 15:42:20 vkondra-mobl kernel:  [<c01038cb>] error_code+0x4f/0x54
Jan 15 15:42:20 vkondra-mobl kernel:  [<c015cd8b>] path_lookup+0x179/0x183
Jan 15 15:42:20 vkondra-mobl kernel:  [<c015cdd4>]
__path_lookup_intent_open+0x3f/0x6e
Jan 15 15:42:20 vkondra-mobl kernel:  [<c015ce10>] path_lookup_open+0xd/0xf
Jan 15 15:42:20 vkondra-mobl kernel:  [<c015d464>] open_namei+0x63/0x588
Jan 15 15:42:20 vkondra-mobl kernel:  [<c014f156>] filp_open+0x1c/0x31
Jan 15 15:42:20 vkondra-mobl kernel:  [<c014f3ee>] do_sys_open+0x37/0xa8
Jan 15 15:42:20 vkondra-mobl kernel:  [<c014f470>] sys_open+0x11/0x13
Jan 15 15:42:20 vkondra-mobl kernel:  [<c0102d99>] syscall_call+0x7/0xb
Jan 15 15:42:20 vkondra-mobl kernel: Code: 74 04 39 f8 74 1a 83 f9 0c 7f 4b 8d
41 01 89 ca 89 43 38 89 c8 c1 e0 05 c7 44 18 3c 00
00 00 00 8b 45 f0 c1 e2 05 01 da 89 42 58 <8b> 46 20 89 42 40 8b 86 c4 00 00 00
8b 40 08 89 42 44 8b 46 28

Steps to reproduce:
run 2.6.15; run X11 with 'gkrellm' and battery monitor enabled. Perform several
suspend/resume cycles.
Monitor /var/log/messages.
Comment 1 Vladimir Kondratiev 2006-01-27 13:26:55 UTC
Failure caused by garbage (6b6b6b6b in my case) in 'inode' parameter, and 1-st
access to 'inode' member leads to page fault. Following is audit_inode function,
 line where page fault occurs, marked.
void audit_inode(const char *name, const struct inode *inode, unsigned flags)
{
	int idx;
	struct audit_context *context = current->audit_context;

	if (!context->in_syscall)
		return;
	if (context->name_count
	    && context->names[context->name_count-1].name
	    && context->names[context->name_count-1].name == name)
		idx = context->name_count - 1;
	else if (context->name_count > 1
		 && context->names[context->name_count-2].name
		 && context->names[context->name_count-2].name == name)
		idx = context->name_count - 2;
	else {
		/* FIXME: how much do we care about inodes that have no
		 * associated name? */
		if (context->name_count >= AUDIT_NAMES - AUDIT_NAMES_RESERVED)
			return;
		idx = context->name_count++;
		context->names[idx].name = NULL;
#if AUDIT_DEBUG
		++context->ino_count;
#endif
	}
	context->names[idx].flags = flags;
===fault on the next line===>
	context->names[idx].ino   = inode->i_ino;
	context->names[idx].dev	  = inode->i_sb->s_dev;
	context->names[idx].mode  = inode->i_mode;
	context->names[idx].uid   = inode->i_uid;
	context->names[idx].gid   = inode->i_gid;
	context->names[idx].rdev  = inode->i_rdev;
}
Comment 2 Alexey Starikovskiy 2006-01-30 05:59:33 UTC
How does it relate to ACPI battery? What inode do you evaluate and where do you
get it from?
Comment 3 Vladimir Kondratiev 2006-01-30 23:57:57 UTC
Point is, I am not 100% sure it is battery who is guilty; but usually after
this, `cat /proc/acpi/battery/BAT0/state` gives some form of error: `cat`
process stall, or got reading error, or reads empty file. But it may be that
this is some general bug in procfs and mentioned above file is simply one that
happens to suffer most in my environment.

I will add printk's around audit_inode to see more info.
Comment 4 Vladimir Kondratiev 2006-01-31 10:50:21 UTC
Another variant of ACPI problem: after resume, the following appears in syslog:
Jan 31 20:34:38 vkondra-mobl kernel:     ACPI-0292: *** Error: Looking up [SERN]
 in namespace, AE_ALREADY_EXISTS
Jan 31 20:34:38 vkondra-mobl kernel:     ACPI-0508: *** Error: Method execution
failed [\_SB_.PCI0.LPC_.EC__.GBIF] (Node f7f02354), AE_ALREADY_EXISTS
Jan 31 20:34:41 vkondra-mobl kernel:     ACPI-0213: *** Error: Method reached ma
ximum reentrancy limit (255)
Jan 31 20:34:41 vkondra-mobl kernel:     ACPI-0508: *** Error: Method execution
failed [\_SB_.PCI0.LPC_.EC__.BAT0._BIF] (Node f7f0219c), AE_AML_METHOD_LIMIT
Jan 31 20:34:41 vkondra-mobl kernel:  done
Jan 31 20:34:44 vkondra-mobl kernel:     ACPI-0213: *** Error: Method reached ma
ximum reentrancy limit (255)

.........etc. etc.
now, /proc/acpi/battery/BAT0/state is OK but /proc/acpi/battery/BAT0/state is not:
[vkondra-l@vkondra-mobl ~]$ cat /proc/acpi/battery/BAT0/state
present:                 yes
capacity state:          ok
charging state:          charging
present rate:            16330 mW
remaining capacity:      61970 mWh
present voltage:         12325 mV
[vkondra-l@vkondra-mobl ~]$ cat /proc/acpi/battery/BAT0/info
present:                 yes
ERROR: Unable to read battery information
Comment 5 Suzuki 2006-02-02 22:36:31 UTC
I am also looking into a similar issue. I encountered it while running fsracer
tests on SLES10 B1, which is 2.6.15-git12-6
Comment 6 Suzuki 2006-02-02 22:45:55 UTC
I think the bad Inode gets in from do_lookup(). I suspect  __d_lookup and
real_lookup.
Comment 7 Suzuki 2006-02-05 23:44:18 UTC
Hi Vladimir,

Could you check out the following patch ? Its against 2.6.15-rc7. The
path_lookup() doesn't check whether the link_path_walk() successful or not. 

--- linux-2.6.15-rc7-I/fs/namei.c       2006-01-03 10:25:57.000000000 -0800
+++ linux-2.6.15-rc7-I/fs/namei.c.mod   2006-02-06 05:08:36.000000000 -0800
@@ -1090,7 +1090,7 @@ int fastcall path_lookup(const char *nam
        current->total_link_count = 0;
        retval = link_path_walk(name, nd);
 out:
-       if (unlikely(current->audit_context
+       if (unlikely(!retval && current->audit_context
                     && nd && nd->dentry && nd->dentry->d_inode))
                audit_inode(name, nd->dentry->d_inode, flags);
        return retval;


Thanks.
Comment 8 Vladimir Kondratiev 2006-02-06 14:11:36 UTC
Applied. I use 2.6.16-rc2-git2 (latest for today). Patch as is fails, but idea
is clear so it is easy to adjust.
I run 
watch -n 0 cat /proc/acpi/battery/BAT0/[is]*
and all applets that I know to monitor battery state; I did couple of
suspend/resume cycles.

So far, so good. But, to be sure I need to run it for some more time. Let me
test it for, say, one week from now. I will report immediately in case it oops
again.
Comment 9 Suzuki 2006-02-06 22:41:11 UTC
FYI..
 The following patch is now in mm tree.

diff -puN
fs/namei.c~fix-do_path_lookup-to-add-the-check-for-error-in-link_path_walk
fs/namei.c
---
devel/fs/namei.c~fix-do_path_lookup-to-add-the-check-for-error-in-link_path_walk	2006-02-06
22:25:12.000000000 -0800
+++ devel-akpm/fs/namei.c	2006-02-06 22:27:45.000000000 -0800
@@ -1119,9 +1119,11 @@ static int fastcall do_path_lookup(int d
 	current->total_link_count = 0;
 	retval = link_path_walk(name, nd);
 out:
-	if (unlikely(current->audit_context
-		     && nd && nd->dentry && nd->dentry->d_inode))
+	if (likely(retval == 0)) {
+		if (unlikely(current->audit_context && nd && nd->dentry &&
+				nd->dentry->d_inode))
 		audit_inode(name, nd->dentry->d_inode, flags);
+	}
 	return retval;
 
 fput_unlock_fail:
Comment 10 Vladimir Kondratiev 2006-02-12 14:11:11 UTC
I would consider it fixed.
During week of intensive testing I have no single failure.
Comment 11 Suzuki 2006-02-15 21:51:49 UTC
Closing as per Comment #10.

Note You need to log in before you can comment on or make changes to this bug.