Bug 219061

Summary: Memory leaks on vmalloc crash every 32 bit kernel after a commit in 6.6.24 branch
Product: Memory Management
Reporter: makemehappy
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: high
CC: harshit.m.mogalapalli, makemehappy, regressions
Priority: P3
Hardware: i386
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: the log of a filed machine
the config of the failed kernel and machine
This is a various boot of some working and some not working kernels
Log of a Real working 6.6.3 session
attachment-15542-0.html
attachment-17344-0.html
attachment-24185-0.html

Description makemehappy 2024-07-19 10:51:43 UTC
Created attachment 306584 [details]
the log of a filed machine

Tested only on virtual guests under both VMware and VirtualBox platforms.

It affects only 32-bit kernels, from 6.6.24 onward, including every later 6.6.x, 6.7.x, 6.8.x, 6.9.x and 6.10.x release.

I think it was introduced by this commit in the 6.6.24 branch:

commit 9a98ab01e3acba830cb0917296a13192fd23f305
Author: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Date:   Mon Nov 13 12:07:39 2023 -0800

    platform/x86: hp-bioscfg: Fix error handling in hp_add_other_attributes()
    
    commit f40f939917b2b4cbf18450096c0ce1c58ed59fae upstream.
    
    'attr_name_kobj' is allocated using kzalloc, but on all the error paths
    it is not freed, hence we have a memory leak.
    
    Fix the error path before kobject_init_and_add() by adding kfree().
    
    kobject_put() must be always called after passing the object to
    kobject_init_and_add(). Only the error path which is immediately next
    to kobject_init_and_add() calls kobject_put() and not any other error
    path after it.
    
    Fix the error handling after kobject_init_and_add() by moving the
    kobject_put() into the goto label err_other_attr_init that is already
    used by all the error paths after kobject_init_and_add().
    
    Fixes: a34fc329b189 ("platform/x86: hp-bioscfg: bioscfg")
    Cc: stable@vger.kernel.org # 6.6.x: c5dbf0416000: platform/x86: hp-bioscfg: Simplify return check in hp_add_other_attributes()
    Cc: stable@vger.kernel.org # 6.6.x: 5736aa9537c9: platform/x86: hp-bioscfg: move mutex_lock() down in hp_add_other_attributes()
    Reported-by: kernel test robot <lkp@intel.com>
    Reported-by: Dan Carpenter <error27@gmail.com>
    Closes: https://lore.kernel.org/r/202309201412.on0VXJGo-lkp@intel.com/
    Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
    [ij: Added the stable dep tags]
    Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
    Link: https://lore.kernel.org/r/20231113200742.3593548-3-harshit.m.mogalapalli@oracle.com
    Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

It happens only with more than 1024 MB of RAM.

I can reproduce it EVERY TIME, very easily, on every 32-bit machine. (I only have virtual 32-bit machines, which I run on both VMware and VirtualBox, so I think the problem should be present on real machines too. But only on HP ones?)

How to reproduce it: start a 32-bit Linux machine configured with 4 GB of RAM and a standard kernel configuration from kernel.org. I also tested my own config (included as an attachment) and the default configurations of the Gentoo and Void distros. The system boots fine. Open a few terminals and a few applications; when the system crosses the first 1024 MB of RAM, it starts complaining in the log about the need to increase vmalloc. You can do that and reboot the machine, but the problem will still be present. From that moment the machine starts to die: applications that are already open produce errors, there is insufficient memory to run a command in a terminal, web browser tabs start crashing, audio playback stops, and so on. The session is totally compromised: sometimes you can't even reboot (insufficient resources error messages), sometimes you get a kernel oops, and sometimes the machine freezes.

All this happens within a few minutes of starting the machine; there is no need to wait long to get the error and watch the machine slowly die.

Reproducible 100% of the time on virtual platforms. The fix for now is reverting to the working 6.6.23 kernel or an earlier one.

I also include the log from one of the affected kernels (all the branches from 6.6.24 to 6.10, that is, everything after the reported commit).
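For anyone trying to reproduce this, a small script can make the exhaustion visible before the session dies. The sketch below is only an illustration and not part of the original report: it assumes the kernel exposes the VmallocTotal and VmallocUsed fields in /proc/meminfo with meaningful values, and the function names are hypothetical.

```python
import re


def parse_vmalloc_kib(meminfo_text):
    """Return (total_kib, used_kib) parsed from /proc/meminfo-style text.

    A field that is missing from the text is returned as None.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(Vmalloc\w+):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields.get("VmallocTotal"), fields.get("VmallocUsed")


def read_vmalloc_kib(path="/proc/meminfo"):
    """Read the current vmalloc figures from the running system."""
    with open(path) as handle:
        return parse_vmalloc_kib(handle.read())
```

Calling read_vmalloc_kib() every few seconds on a 6.6.23 boot and on a 6.6.24 boot, and comparing how VmallocUsed grows, would show whether the space is really being consumed without being released.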
Comment 1 makemehappy 2024-07-19 10:52:30 UTC
Created attachment 306585 [details]
the config of the failed kernel and machine
Comment 2 makemehappy 2024-07-19 11:05:26 UTC
Created attachment 306587 [details]
This is a various boot of some working and some not working kernels

This is the log of a working session on the same machine with a working kernel (6.6.23); everything is fine, there are no vmalloc errors, and the memory manager works perfectly.
Comment 3 makemehappy 2024-07-19 11:06:21 UTC
Comment on attachment 306585 [details]
the config of the failed kernel and machine

You can see the repeated vmalloc errors, which in the end slowly kill the session.
Comment 4 makemehappy 2024-07-19 11:07:36 UTC
Comment on attachment 306584 [details]
the log of a filed machine

This is the .config I normally apply to my 32-bit machines. It is the same one I used, and it works perfectly well, on kernels before 6.6.24. The last working kernel is 6.6.23.
Comment 5 Artem S. Tashkinov 2024-07-19 11:13:20 UTC
Harshit Mogalapalli,

Please take a look.
Comment 6 Artem S. Tashkinov 2024-07-19 11:15:25 UTC
Harshit Mogalapalli is not on this bug tracker, please send your finding to LKML and CC harshit.m.mogalapalli@oracle.com
Comment 7 makemehappy 2024-07-19 11:22:00 UTC
Sorry, how can I send mail to LKML? I already sent an email to Harshit and everyone involved in the buggy commit, with no luck.
Comment 8 makemehappy 2024-07-19 11:48:57 UTC
(In reply to Artem S. Tashkinov from comment #6)
> Harshit Mogalapalli is not on this bug tracker, please send your finding to
> LKML and CC harshit.m.mogalapalli@oracle.com

Ok sent.
Comment 9 makemehappy 2024-07-19 20:08:44 UTC
Created attachment 306589 [details]
Log of a Real working 6.6.3 session

The other "working" log is in fact a log of multiple boots with both working and non-working kernels. Before I found out that the buggy kernel was 6.6.4 and everything after it, I built and tested various kernels, so that earlier log contains many sessions, some working and some not. This one is just a working 6.6.3 session.
Comment 10 makemehappy 2024-07-19 20:12:47 UTC
Comment on attachment 306587 [details]
This is a various boot of some working and some not working kernels

To discover in which kernel version the bug was introduced, I tested everything from 6.6.21 (my starting point) onward to find the last working kernel, and found that it was 6.6.23. This log contains boots and session logs of kernels with the bug (everything after 6.6.23) and kernels without it (6.6.23 and earlier).
Comment 11 Harshit 2024-07-21 04:53:57 UTC
Hi,

Note that commit 9a98ab01e3ac ("platform/x86: hp-bioscfg: Fix error handling in hp_add_other_attributes()") is not in the range 6.6.23..6.6.24; it has been present since 6.6.4, so it cannot be the cause of the regression.
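
The range argument can be checked mechanically. This is a minimal sketch, assuming version strings of the form used in this thread; the helper names are hypothetical. Versions are compared as integer tuples because a plain string comparison would wrongly sort 6.6.4 after 6.6.23.

```python
def parse_version(tag):
    """Turn a tag such as 'v6.6.23' or '6.6.23' into a tuple of integers."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def introduced_in_range(first_release, last_good, first_bad):
    """Return True if a change that first shipped in first_release falls
    after last_good and no later than first_bad."""
    return (
        parse_version(last_good)
        < parse_version(first_release)
        <= parse_version(first_bad)
    )
```

With the versions from this report, introduced_in_range("6.6.4", "6.6.23", "6.6.24") is False, which is the reason the commit is ruled out.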
Comment 12 makemehappy 2024-07-21 14:40:13 UTC
Someone on LKML asked me (I was notified via e-mail) whether I had tried 6.6.40 and 6.10.0 from kernel.org.

Yes, I can confirm this bug is present in EVERY kernel after 6.6.24.

As for Harshit's commit, I can rule it out as the problem because, as he said, it has been there since version 6.6.4.

I'm not a programmer, but to me this bug looks related, as someone pointed out to me, to a vmalloc() call without a matching vfree().

Obviously I confirm I'm interested in testing every possible future patch against this bug.
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-07-24 09:11:41 UTC
(In reply to makemehappy from comment #12)
>
> As for Harshit's commit, I can rule it out as the problem because, as he
> said, it has been there since version 6.6.4.

So 6.6.3 works fine? If it does: could you try reverting the culprit on 6.10 or the latest 6.6.y release and see if this fixes things?
Comment 14 makemehappy 2024-07-24 15:18:24 UTC
Hi Thorsten. I can confirm kernel 6.6.4 is fine too, and so is everything after it up to 6.6.23. The bug arrives between 6.6.23 (fine) and 6.6.24. Any kernel after that has the problem.

Reverting Harshit's patch with patch -p1 -R did not solve the problem.

Reading many commits in various kernels pointed me to that patch from Harshit, because I think this bug is related, as I wrote in previous comments, to a vmalloc() call without a vfree(), and that pretty much kills the machine very fast in my many environments. JUST A GUESS. I'm not a kernel developer! So it could be something else.

I hope this clarifies everything.

I have just enough time to push "Save Changes", and I probably won't even be able to reboot the machine, because on this Debian 32-bit VMware virtual guest I am starting to get browser tab crashes and terminal failures after getting tons of these:

Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 24576 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:42 debian1232vm kernel: alloc_vmap_area: 104 callbacks suppressed
Jul 24 17:04:42 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:42 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:42 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
Jul 24 17:04:42 debian1232vm kernel: vmap allocation for size 20480 failed: use vmalloc=<size> to increase size
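
To compare sessions, the failure messages above can be tallied by requested size. This is a minimal sketch, assuming the log lines have the same format as the excerpt; the function name is hypothetical.

```python
import re
from collections import Counter

FAILURE = re.compile(r"vmap allocation for size (\d+) failed")
SUPPRESSED = re.compile(r"alloc_vmap_area: (\d+) callbacks suppressed")


def summarize_vmap_failures(log_text):
    """Count vmap allocation failures per requested size.

    Returns (sizes, suppressed): a Counter mapping size in bytes to the
    number of logged failures, and the total number of messages the
    kernel reported as suppressed by rate limiting.
    """
    sizes = Counter()
    suppressed = 0
    for line in log_text.splitlines():
        failure = FAILURE.search(line)
        if failure:
            sizes[int(failure.group(1))] += 1
            continue
        limited = SUPPRESSED.search(line)
        if limited:
            suppressed += int(limited.group(1))
    return sizes, suppressed
```

Because the kernel rate-limits these messages, the suppressed count matters: the logged lines alone understate how many allocations failed.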
Comment 15 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-07-24 15:26:38 UTC
(In reply to makemehappy from comment #14)
> The bug arrives between 6.6.23 (fine) and 6.6.24. Any kernel after that has
> the problem.

So 6.6.24 is also fine, but 6.6.25 is not? If that's the case: could you bisect?
Comment 16 makemehappy 2024-07-24 15:33:13 UTC
No, sorry!!! 6.6.23 IS FINE, 6.6.24 IS NOT. So if a bisect has to be done, it has to be done between 6.6.23 FINE and 6.6.24 NOT FINE.

Thanks.

To recap:

kernel 6.6.23: the LAST working kernel (no bug)
kernel 6.6.24: the FIRST with the bug (not working, YES BUG)

Everything I tried after 6.6.24 HAS the bug, INCLUDING 6.6.41 and 6.10.1.

6.10.1 is the last one tested, just a moment ago while writing this comment. After a short while (very soon, with 10 Firefox tabs open and two terminals, one of them following the log) it started to produce vmalloc errors and the machine crashed within minutes: browser tabs crashed, and trying to open a new terminal produced output similar to "Failed to Open PTY, impossible to allocate memory", and so on.

I had to reset the machine (and, by the way, rebooted into kernel 6.6.23 to have it working).
Comment 17 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-07-24 15:45:22 UTC
(In reply to makemehappy from comment #16)
> if a bisect has to be done, it has to be done between 6.6.23 FINE and 6.6.24 

Would be great if you could take care of that, as I doubt anyone will look into this otherwise, as it could be caused by changes in various subsystems.
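
As an illustration of what the requested bisect involves: git bisect performs, roughly, a binary search over the commits between a known-good and a known-bad version, so each build-and-boot test halves the remaining range. With the versions from this report it would typically be started with "git bisect start", "git bisect bad v6.6.24" and "git bisect good v6.6.23" in a linux-stable tree. The Python sketch below only models the search itself; the commit list and the is_bad callback are hypothetical stand-ins for building and booting each kernel.

```python
def first_bad_index(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    commits[0] must be known good and commits[-1] known bad.
    is_bad(commit) stands in for "build, boot and check for vmap errors".
    Returns (index_of_first_bad_commit, number_of_tests_run).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        tests += 1
        if is_bad(commits[middle]):
            bad = middle   # the culprit is at or before the midpoint
        else:
            good = middle  # the culprit is after the midpoint
    return bad, tests
```

If the range held about 400 commits (an assumed figure, not the real count), this search would need at most 9 tests rather than hundreds.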
Comment 18 makemehappy 2024-07-24 16:12:17 UTC
Created attachment 306615 [details]
attachment-15542-0.html

The running kernel is a brand new 32-bit 6.10.1, downloaded from kernel.org and compiled.

Comment 19 makemehappy 2024-07-24 16:17:54 UTC
Created attachment 306616 [details]
attachment-17344-0.html

Hello,
the point is simple: I have never done a bisect. I read the documentation about it and it does not seem that clear to me; I'm not sure I can handle it, I don't have spare time and I'm not a developer.
I reported this and spent time documenting it, and I'm also aware there are not many 32-bit systems around any more; I myself don't have real machines running a 32-bit OS.
Still, this is a nasty bug, not something cosmetic: it crashes machines, and without a fix the kernel is broken. It would be better, in this case, to say that 32-bit no longer interests us and so we drop support.


Regards
MS


Comment 20 makemehappy 2024-07-24 16:35:29 UTC
Created attachment 306617 [details]
attachment-24185-0.html

PS: I don't know whether my messages on Bugzilla are also forwarded to this list, or whether I have to reply both there and here.

