Bug 12353 - Stable reproducible exception in PPP (dial-up-modem) module causes SMP systems to crash.
Status: CLOSED DOCUMENTED
Alias: None
Product: Other
Classification: Unclassified
Component: Modules
Hardware: All Linux
Importance: P1 high
Assignee: other_modules
Reported: 2009-01-03 18:45 UTC by Markus Wolfgang Daniel
Modified: 2012-05-24 14:09 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.28 (mainline, vanilla)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
.config file (50.11 KB, text/plain)
2009-01-04 10:37 UTC, Markus Wolfgang Daniel
boot log file from /var/log/ (32.23 KB, text/plain)
2009-01-04 10:39 UTC, Markus Wolfgang Daniel
proposed fix (510 bytes, patch)
2009-01-04 12:40 UTC, Marcin Slusarz

Description Markus Wolfgang Daniel 2009-01-03 18:45:10 UTC
Latest working kernel version: 
unknown (I have just a dial-up connection and cannot test this out easily.)

Earliest failing kernel version:
at least since 2.6.27 (vanilla)

Distribution: 
SuSE 10.2 (32bit)

Hardware Environment:
AMD X2 5000+
NVIDIA nForce 630a  (no X11/GL modules loaded)
External serial 56k dial-up modem
4GB Ram

Software Environment:
- gcc 4.1.2 (shipped)
- vanilla kernel with just
  - IPv4 / PPP / VESA / nForce-PATA / USB / ACPI
- IDE boot (not SATA) via the nForce drivers included in the kernel
- boot into runlevel 3 (no X11)

Problem Description:
A fatal exception in the interrupt handler of the PPP module reliably brings down SMP systems.

Steps to reproduce:
boot in runlevel 3
login as root
> ifup modem0
> wget www.anysite.com
_quickly_ press Control+Alt+F10 and you can see the following:

kernel BUG at net/core/skbuff.c:147!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/resource
Modules linked in:

Pid: 7, comm: ksoftirqd/1 Not tainted (2.6.28 #2) To Be Filled By O.E.M.
EIP: 0060:[<c03a026c>] EFLAGS: 00010287 CPU: 1
EIP is at skb_under_panic+0x5c/0x60
EAX: 00000077 EBX: f66f1000 ECX: c012775e EDX: 0227b000
ESI: 00000000 EDI: f6be9600 EBP: f7075ef0 ESP: f7075ec4
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process ksoftirqd/1 (pid: 7, ti=f7074000 task=f7073130 task.ti=f7074000)
Stack:
 c050e4c4 c02dd812 000005e0 00000002 f66f1000 f66f0fff f66f1fdf f66f1600
 c04dd210 f6bb7140 f6bb7140 f7075efc c03a179a 00000000 f7075f28 c02dd812
 f6bb7140 f6be9600 f7075f28 c040d34f 00000000 00000001 f6be9000 f6bb7140
Call Trace:
 [<c02dd812>] ? ppp_receive_frame+0x212/0x630
 [<c03a179a>] ? skb_push+0x2a/0x40
 [<c02dd812>] ? ppp_receive_frame+0x212/0x630
 [<c040d34f>] ? _spin_lock_bh+0x3f/0x50
 [<c02dee39>] ? ppp_input+0x99/0x110
 [<c02ddcc0>] ? ppp_input_error+0x90/0xb0
 [<c02dfbdf>] ? ppp_async_process+0x1f/0x70
 [<c012bff6>] ? tasklet_action+0x56/0xc0
 [<c012b957>] ? __do_softirq+0x87/0x150
 [<c012bbe0>] ? ksoftirq+0x0/0x0d
 [<c012ba5b>] ? do_softirq+0x3b/0x50
 [<c012bc37>] ? ksoftirqd+0x57/0xd0
 [<c0139cf2>] ? kthread+0x42/0x70
 [<c0139cb0>] ? kthread+0x0/0x70
 [<c0103eab>] ? kernel_thread_helper+0x7/0x1c
Code: 00 00 89 5c 24 14 8b 98 90 00 00 00 89 54 24 0c 89 5c 24 10 b8 40 40 50 89 4c 24 c7 04 24 c4 e4 50 c0 89 44 24 08 e8 04 7b
 d8 ff <0f> 0b eb fe 55 89 e5 56 53 bb 10 d2 4d c0 83 ec 24 8b 70 14 85
EIP: [<c03a026c>] skb_under_panic+0x5c/0x60 SS:ESP 0068:f7075ec4
Kernel panic - not syncing: Fatal exception in interrupt

As I had to copy this dialog manually, it may contain transcription errors.
If you have further questions, just write an email.
And yes, I can also apply your patches to test things out.
Comment 1 Markus Wolfgang Daniel 2009-01-03 19:05:59 UTC
This should also be copied to linux-ppp@vger.kernel.org.
Comment 2 Marcin Slusarz 2009-01-04 05:06:50 UTC
There should be a "skb_under_panic: ..." line before "kernel BUG at net/core/skbuff.c:147!". Can you paste it here?

Please attach your .config.
Comment 3 Markus Wolfgang Daniel 2009-01-04 10:37:13 UTC
Created attachment 19646
.config file
Comment 4 Markus Wolfgang Daniel 2009-01-04 10:39:57 UTC
Created attachment 19647
boot log file from /var/log/
Comment 5 Markus Wolfgang Daniel 2009-01-04 10:40:51 UTC
The following two lines appear:

Jan  4 14:25:53 linux-md kernel: CE: hpet increasing_delta_ns to 15000 nsec
skb_under_panic: text c02dd812 len:1504 put:2 head:f66fa000 data:f66f9fff tail:0xf66fa037 end:0xf66fa600 dev:<NULL>

The numbers for head, data, tail and end vary each time. However, their relative offsets are stable.
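
The relative offsets mentioned above can be read directly off the skb_under_panic line. The following is a minimal Python sketch, not part of the original report: the helper names parse_panic_line and offsets_from_head are made up for illustration, and the input is the line exactly as transcribed in this comment, so any transcription error carries over.

```python
import re

# Line as transcribed in this comment. Absolute addresses vary per crash;
# only the offsets relative to head are reported to be stable.
PANIC_LINE = (
    "skb_under_panic: text c02dd812 len:1504 put:2 head:f66fa000 "
    "data:f66f9fff tail:0xf66fa037 end:0xf66fa600 dev:<NULL>"
)


def parse_panic_line(line):
    """Return the numeric fields of an skb_under_panic line as a dict."""
    fields = {}
    # len and put are printed in decimal.
    for key in ("len", "put"):
        fields[key] = int(re.search(rf"{key}:(\d+)", line).group(1))
    # The four buffer pointers are printed in hex, with or without "0x".
    for key in ("head", "data", "tail", "end"):
        match = re.search(rf"{key}:(?:0x)?([0-9a-fA-F]+)", line)
        fields[key] = int(match.group(1), 16)
    return fields


def offsets_from_head(fields):
    """Express data, tail and end as byte offsets relative to head."""
    return {key: fields[key] - fields["head"] for key in ("data", "tail", "end")}


fields = parse_panic_line(PANIC_LINE)
offsets = offsets_from_head(fields)
for name, value in offsets.items():
    print(f"{name} - head = {value}")
```

With this input, data lies one byte below head, which is the condition that triggers the panic.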

Because the crash happens in an IRQ handler, it not only brings down the entire system but also prevents me from syncing the disks to get the "blue screen" into /var/log/messages. Please keep this in mind when asking for bulky IRQ messages. :)

===

Please find the .config and boot.msg attached. I've switched on various debug options (in "kernel hacking") to get a good dump.

What I have also noticed is 
 - if I use ping instead of wget (or ftp etc.) the system works fine.
 - if I disable ACPI the system starts to crash sporadically rather than reliably.
Comment 6 Marcin Slusarz 2009-01-04 11:38:52 UTC
Can you disable CONFIG_PPP_FILTER and check whether the panic still occurs?
Comment 7 Marcin Slusarz 2009-01-04 12:40:47 UTC
Created attachment 19650
proposed fix

OK, I think I understand what is wrong with this code.
Please check whether the attached patch fixes the problem (with CONFIG_PPP_FILTER enabled).
Comment 8 Markus Wolfgang Daniel 2009-01-04 12:56:55 UTC
Will test your fix ASAP...

By the way: with CONFIG_PPP_FILTER disabled, I could no longer get the system to crash.
Comment 9 Markus Wolfgang Daniel 2009-01-04 13:57:13 UTC
Now wget already fetches 5% before the system crashes, which is clearly better than the 0% without the patch.

The crash happens at the same position, with head=xxxx7000 data=xxxx6fff tail=xxxx75df end=xxxx7600

The call stack now looks like this:

ppp_receive_frame+0x1fc/0x5f0
skb_push+0x27/0x30
ppp_receive_frame+0x1fc/0x5f0
__alloc_skb+0x53/0x110
ppp_input+0x9b/0x120
ppp_async_process+0x19/0x70
tasklet_action+0x4e/0xb0
<further frames as above>

(This kernel, in contrast to my last one, has no debugging support.)
Comment 10 Herbert Xu 2009-01-04 14:59:48 UTC
We should never run out of headroom at that skb_push because our caller should
guarantee that the headroom exists.  I had a quick look around and can't find
anything in the backtrace path that can give us a packet with insufficient
headroom.

Indeed, the under panic message shows that we're exactly one byte past the
head, which is very unusual since we usually do not put data at an odd offset.

Also, the under panic message indicates that the skb may be non-linear, as tail
- data is much less than skb->len.  This should also be impossible for
ppp_async.  So I think something deeper is going on here that requires further
investigation.

Thanks!
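
The headroom argument can be illustrated with a toy model. The sketch below is Python, not kernel code: the SkbModel class and its method names are hypothetical, and push only imitates the behaviour described here (move data down, grow len, complain if data ends up below head). The starting state is reconstructed from the values in comment 5, under the assumption that the printed values were taken after the failing push of 2 bytes.

```python
class SkbModel:
    """Toy model of an skb's buffer pointers as byte offsets (illustration only)."""

    def __init__(self, head, data, tail, end, length):
        self.head = head      # start of the allocated buffer
        self.data = data      # start of the packet data
        self.tail = tail      # end of the linear packet data
        self.end = end        # end of the allocated buffer
        self.length = length  # total packet length (skb->len in the kernel)

    def headroom(self):
        """Bytes available in front of the packet data."""
        return self.data - self.head

    def linear_len(self):
        """Bytes actually stored in the linear buffer."""
        return self.tail - self.data

    def is_nonlinear(self):
        """True if the packet claims more bytes than the linear area holds."""
        return self.length > self.linear_len()

    def push(self, n):
        """Prepend n bytes: move data down, grow length, check headroom."""
        self.data -= n
        self.length += n
        if self.data < self.head:
            raise BufferError(
                f"under panic: data is {self.head - self.data} byte(s) below head"
            )
        return self.data


# Assumed pre-push state, derived from comment 5 with head taken as offset 0:
# data was 1 byte past head, and len was 1504 - 2 = 1502.
skb = SkbModel(head=0, data=1, tail=0x37, end=0x600, length=1502)
print("headroom before push:", skb.headroom())
try:
    skb.push(2)
except BufferError as err:
    print(err)
print("linear bytes:", skb.linear_len(), "of len", skb.length)
print("non-linear:", skb.is_nonlinear())
```

In this model the packet starts at an odd offset with only one byte of headroom, so a 2-byte push lands one byte below head, and the linear area is far smaller than len, matching the two anomalies pointed out above.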
Comment 11 Markus Wolfgang Daniel 2009-01-04 15:18:09 UTC
I can test patches daily from 12:00 to 14:00 (bugzilla time).

The strange thing is that I never saw any problems with a single CPU. However, a few days ago I upgraded my system and therefore enabled SMP in the kernel. (OK, I also switched from the SiS to the nVIDIA PATA drivers.) But that was all, actually.
Comment 12 Markus Wolfgang Daniel 2009-01-06 15:55:52 UTC
I tried disabling various default BIOS options, and was successful:

My ASRock motherboard "ALiveNF7G-FullHD R3.0" comes with an AMIBIOS.

If I disable the option "CPU Configuration" -> "Enhanced Halt State", no further crashes happen. Unfortunately, the manual provides no details about this option.

Finally, I'd like to say thank you to everyone who tried to help me!
