Latest working kernel version: 2.6.18
Earliest failing kernel version: 2.6.26
Distribution: Debian
Hardware Environment: P4 (with HT), IDE disk (PATA), 512MB ram
Software Environment: http://vanheusden.com/pyk/
Problem Description: filesystem corrupted:

[85202.195563] rtc0: alarms up to one month, y3k, hpet irqs
[85204.035802] journal_bmap: journal block not found at offset 2060 on dm-0
[85204.035818] Aborting journal on device dm-0.
[85311.242093] ext3_abort called.
[85311.242120] EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
[85311.242154] Remounting filesystem read-only
[86331.285847] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #24568: rec_len % 4 != 0 - offset=0, inode=4098364138, rec_len=59542, name_len=76
[89931.353863] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #24568: rec_len % 4 != 0 - offset=0, inode=4098364138, rec_len=59542, name_len=76
[92499.263276] attempt to access beyond end of device
[92499.263296] dm-2: rw=17, want=4177066240, limit=6004736
[92499.263311] Buffer I/O error on device dm-2, logical block 522133279
[92499.263333] lost page write due to I/O error on dm-2
[92499.263341] Aborting journal on device dm-2.
[92499.263504] ext3_abort called.
[92499.263515] EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
[92499.263543] Remounting filesystem read-only
[93531.419902] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #24568: rec_len % 4 != 0 - offset=0, inode=4098364138, rec_len=59542, name_len=76

Steps to reproduce: run the pyk script and run something like a git clone of the mainline kernel git tree, rm -rf the tree, touch /forcefsck, reboot
please provide the output of the script and/or post a list of your modules. do you see any warnings/oopses in dmesg while running it ?
What does e2fsck report when you try running e2fsck on the filesystem? Most of the errors indicate filesystem corruption which e2fsck should have complained vociferously about, and which it should have been able to fix if run manually.

I can't tell how big the filesystem is (I'd need the output of dumpe2fs /dev/hdXXX) to detect that, but this:

[92499.263311] Buffer I/O error on device dm-2, logical block 522133279

indicates either a hardware error, or a corrupted journal inode. In the latter case, e2fsck would have detected the problem, and offered to fix it. In the former case, this isn't a kernel bug, but rather a hardware problem....
(In reply to comment #1)
> please provide the output of the script and/or post a list of your modules.
> do you see any warnings/oopses in dmesg while running it ?

No oopses, no warnings. Only odd messages I see are:

[ 487.377387] bio too big device hda5 (8 > 0)
[ 487.377860] bio too big device hda5 (8 > 0)

will reboot tomorrow to see what filesystem errors there are
(In reply to comment #1) > please provide the output of the script and/or post a list of your modules. The system starts with the following modules: ac 3264 0 battery 6272 0 ipv6 234724 12 loop 12812 0 snd_intel8x0 26332 0 snd_ac97_codec 89220 1 snd_intel8x0 i2c_i801 8336 0 ac97_bus 1728 1 snd_ac97_codec snd_pcm 63108 2 snd_intel8x0,snd_ac97_codec i2c_core 20692 1 i2c_i801 snd_timer 18056 1 snd_pcm snd 45828 4 snd_intel8x0,snd_ac97_codec,snd_pcm,snd_timer soundcore 6528 1 snd snd_page_alloc 7400 2 snd_intel8x0,snd_pcm floppy 47812 0 pcspkr 2432 0 iTCO_wdt 9668 0 rng_core 4004 0 parport_pc 22660 0 parport 31180 1 parport_pc shpchp 25204 0 pci_hotplug 23680 1 shpchp container 3488 0 button 6096 0 intel_agp 22844 1 agpgart 29800 1 intel_agp evdev 8416 0 joydev 8608 0 ext3 106024 6 jbd 40820 1 ext3 mbcache 7268 1 ext3 dm_mirror 15264 0 dm_log 8516 1 dm_mirror dm_snapshot 15140 0 dm_mod 46696 16 dm_mirror,dm_log,dm_snapshot ide_cd_mod 27172 0 ide_disk 10592 3 cdrom 30016 1 ide_cd_mod piix 5864 2 ide_core 84468 3 ide_cd_mod,ide_disk,piix usbhid 36000 0 hid 33792 1 usbhid ff_memless 4456 1 usbhid ata_generic 4676 0 libata 144480 1 ata_generic scsi_mod 130412 1 libata dock 8368 1 libata e1000 104708 0 ehci_hcd 29132 0 uhci_hcd 18864 0 usbcore 120176 4 usbhid,ehci_hcd,uhci_hcd thermal 15388 0 processor 33516 1 thermal fan 4356 0 thermal_sys 10760 3 thermal,processor,fan > do you see any warnings/oopses in dmesg while running it ? No. 
After a short while I see the following output: [ 209.038312] attempt to access beyond end of device [ 209.038377] dm-0: rw=0, want=1279882228, limit=565248 [ 209.038458] Buffer I/O error on device dm-0, logical block 639941113 [ 209.038525] attempt to access beyond end of device [ 209.038585] dm-0: rw=0, want=5069976612, limit=565248 [ 209.038643] Buffer I/O error on device dm-0, logical block 2534988305 [ 209.038710] attempt to access beyond end of device [ 209.038763] dm-0: rw=0, want=2559708832, limit=565248 [ 209.038816] Buffer I/O error on device dm-0, logical block 1279854415 [ 209.038873] attempt to access beyond end of device [ 209.038936] dm-0: rw=0, want=877454918, limit=565248 [ 209.038989] Buffer I/O error on device dm-0, logical block 438727458 [ 209.039049] attempt to access beyond end of device [ 209.039102] dm-0: rw=0, want=616859760, limit=565248 [ 209.039188] attempt to access beyond end of device [ 209.039242] dm-0: rw=0, want=1279882228, limit=565248 [ 209.039294] Buffer I/O error on device dm-0, logical block 639941113 [ 209.039350] attempt to access beyond end of device [ 209.039403] dm-0: rw=0, want=5069976612, limit=565248 [ 209.039465] Buffer I/O error on device dm-0, logical block 2534988305 [ 209.039522] attempt to access beyond end of device [ 209.039575] dm-0: rw=0, want=2559708832, limit=565248 [ 209.039629] Buffer I/O error on device dm-0, logical block 1279854415 [ 209.039694] attempt to access beyond end of device [ 209.039747] dm-0: rw=0, want=877454918, limit=565248 [ 209.039799] Buffer I/O error on device dm-0, logical block 438727458 [ 209.041515] processor: Unknown symbol thermal_cooling_device_register [ 209.043820] processor: Unknown symbol thermal_cooling_device_unregister [ 209.110502] attempt to access beyond end of device [ 209.110571] dm-0: rw=0, want=1279882228, limit=565248 [ 209.110656] Buffer I/O error on device dm-0, logical block 639941113 [ 209.110723] attempt to access beyond end of device [ 209.110785] dm-0: 
rw=0, want=5069976612, limit=565248 [ 209.110840] Buffer I/O error on device dm-0, logical block 2534988305 [ 209.110908] attempt to access beyond end of device [ 209.110963] dm-0: rw=0, want=2559708832, limit=565248 [ 209.111017] attempt to access beyond end of device [ 209.111073] dm-0: rw=0, want=877454918, limit=565248 the modules then loaded are: floppy 47812 0 i2c_i801 8336 0 snd_intel8x0 26332 0 button 6096 0 dm_snapshot 15140 0 usbhid 36000 0 shpchp 25204 0 pci_hotplug 23680 1 shpchp netconsole 7360 0 configfs 21944 2 netconsole ipv6 234724 12 snd_ac97_codec 89220 1 snd_intel8x0 ac97_bus 1728 1 snd_ac97_codec snd_pcm 63108 2 snd_intel8x0,snd_ac97_codec i2c_core 20692 1 i2c_i801 snd_timer 18056 1 snd_pcm snd 45828 4 snd_intel8x0,snd_ac97_codec,snd_pcm,snd_timer soundcore 6528 1 snd snd_page_alloc 7400 2 snd_intel8x0,snd_pcm iTCO_wdt 9668 0 parport_pc 22660 0 parport 31180 1 parport_pc intel_agp 22844 1 agpgart 29800 1 intel_agp ext3 106024 6 jbd 40820 1 ext3 mbcache 7268 1 ext3 dm_mirror 15264 0 dm_log 8516 1 dm_mirror dm_mod 46696 16 dm_snapshot,dm_mirror,dm_log ide_disk 10592 3 piix 5864 4294967295 ide_core 84468 2 ide_disk,piix hid 33792 1 usbhid ff_memless 4456 1 usbhid libata 144480 0 scsi_mod 130412 1 libata dock 8368 1 libata e1000 104708 0 ehci_hcd 29132 0 uhci_hcd 18864 0 usbcore 120176 4 usbhid,ehci_hcd,uhci_hcd
After that I tried creating a file in each filesystem. When I hit /var I got the following dmesg error:

[ 335.006864] journal_bmap: journal block not found at offset 524 on dm-0
[ 335.006927] Aborting journal on device dm-0.
[ 353.624981] ext3_abort called.
[ 353.625058] EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
[ 353.625211] Remounting filesystem read-only

After that no file was accessible. E.g.:

debian:/home/folkert# umount /var
bash: /bin/umount: cannot execute binary file
the perlscript does modprobe, rmmod in a loop. typically it's insmod/rmmod or modprobe/modprobe -r. i don't know if it matters, but i would try modprobe -r instead of rmmod, just to see if it makes a difference.

furthermore, any chance to dig out if there are one or more "offending" modules, i.e. can you try to find out if this still happens with the right modules excluded? if the perlscript + shellscript was done by yourself, i think you have some programming skills and can work out some strategy to find out which module is causing this issue. (i'd bisect the lsmod output appropriately and let that run against pyk-perl.mod)
Tried my script with only these modules: ide_cd_mod ide_disk cdrom piix ide_core ata_generic libata scsi_mod dock tried it with all modules but the ones listed above tried it with only usb and without usb modules Only when doing the script with all modules the problem arises.
so it does NOT happen with some modules excluded and it also does NOT happen when just trying the excluded modules ? that's weird.
This problem is also reproducible with 2.6.28.
Theodore,

> I can't tell how big the filesystem is (I'd need the output of dumpe2fs
> /dev/hdXXX) to detect that, but this:
> [92499.263311] Buffer I/O error on device dm-2, logical block 522133279

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          31      248976   83  Linux
/dev/hda2              32        4865    38829105    5  Extended
/dev/hda5              32        4865    38829073+  8e  Linux LVM

38829073 < 522133279, so it tries to reach beyond the physical boundaries of the disk.

> Indicates either a hardware error, or a corrupted journal inode.

Possible, as one of the errors that pop up is aborted journals, journals taken off-line and what not.

> In the latter case, e2fsck would have detected the problem, and offered to
> fix it.

e2fsck finds millions of issues. I always give fixing a try, but after minutes of pressing enter to fsck questions I give up and re-install debian (takes 15 minutes).
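The size comparison above mixes units, so a small worked check may help. The sketch below pulls the `want=`/`limit=` pairs out of the dmesg lines quoted in this report and infers the filesystem block size from the matching "Buffer I/O error" line. It assumes that `want` and `limit` are counted in 512-byte sectors and that `want` is the end sector of the reported logical block; the helper names `out_of_range_requests` and `sectors_per_block` are made up for this illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors

# Matches lines such as "dm-2: rw=17, want=4177066240, limit=6004736"
WANT_RE = re.compile(r"(\S+): rw=(\d+), want=(\d+), limit=(\d+)")


def out_of_range_requests(log_text):
    """Return one dict per 'want=/limit=' line found in the log text."""
    results = []
    for device, _rw, want, limit in WANT_RE.findall(log_text):
        want, limit = int(want), int(limit)
        results.append({
            "device": device,
            "want": want,
            "limit": limit,
            "want_bytes": want * SECTOR_BYTES,
            "limit_bytes": limit * SECTOR_BYTES,
        })
    return results


def sectors_per_block(want, logical_block):
    """Infer sectors per filesystem block.

    Assumes 'want' is the end sector of 'logical_block', i.e.
    want == (logical_block + 1) * sectors_per_block.
    Returns None when the numbers do not fit that assumption.
    """
    sectors, remainder = divmod(want, logical_block + 1)
    return sectors if remainder == 0 else None
```

Under those assumptions the dm-2 line from the original report works out to 8 sectors (4 KiB) per block, and the later dm-0 lines to 2 sectors (1 KiB) per block. Either way the requested offset is far larger than the device limit, which fits a garbage block number rather than an off-by-a-little access.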
Cannot reproduce the problem anymore with 2.6.26 but very easy for 2.6.27 and 2.6.28.
It's still not clear to me what you do to trigger the corruption. What modules, specifically, are you removing and inserting? Can you narrow it down to a single module?

The messages

[ 487.377387] bio too big device hda5 (8 > 0)
[ 487.377860] bio too big device hda5 (8 > 0)

... indicate that the block queue data structure has gotten corrupted (since queue->max_hw_sectors should never be zero).

It sounds like *some* module is causing random memory corruption, leading to the kernel malfunctioning. The bottom line is figuring out which kernel module or modules are involved.
i also think it's memory corruption which leads to filesystem issues. please concentrate on developing a strategy to find the offending module(s).
besides the filesystem corruption, here is another sign of the memory corruption:

the modules then loaded are:
--snipp--
ide_disk 10592 3
piix 5864 4294967295 <-- !!!
ide_core 84468 2 ide_disk,piix
--snipp--
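A use count of 4294967295 equals 2**32 - 1, which is what -1 looks like when printed as an unsigned 32-bit value. A check like the following could flag such entries automatically after each test step. It is only a sketch: the function name `suspicious_refcounts` and the threshold are assumptions chosen for illustration, and it expects lines in the `name size used_by [users]` layout shown in this report.

```python
UINT32_MAX = 2**32 - 1  # 4294967295, i.e. -1 read as an unsigned 32-bit value


def suspicious_refcounts(lsmod_lines, threshold=1_000_000):
    """Return (module, use_count) pairs whose use count is implausibly large.

    The threshold is an arbitrary illustrative cut-off; real use counts
    on a small system are normally tiny.
    """
    flagged = []
    for line in lsmod_lines:
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        name, count = fields[0], int(fields[2])
        if count >= threshold:
            flagged.append((name, count))
    return flagged
```

Run against the three lines quoted above, only `piix` is reported, and its count compares equal to `UINT32_MAX`.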
Yes well do you guys have a suggestion? As each test-cycle takes at least half an hour as I need to reinstall debian each time.
Well, the first thing I would do is optimize the test cycle. I would partition the disk so that it holds a stable debian system (I think you said you were stable with 2.6.26?), a fixed image of your test system (using 2.6.28 or 2.6.29-rc1), using the smallest possible system that still reproduces the problem, and a scratch partition. Then copy, using dd, the fixed image of the test system to the scratch partition, and then rig up grub (where the menu.conf file is on your stable system) so you can boot the scratch partition. Hopefully that way you can cut your test cycle down to 5-10 minutes.
Tried writing the modules that got rmmodded and insmodded to a file. Now since ext3 fails to /root or any other filesystem fails massively. Inserted a memory stick with after each write a sync. This killed 2 memory sticks. So I created a vfat filesystem since fat is really trustworthy for this kind of tricks. And now I got a list of modules! unfortunally i forgot to write to the file if it got insmodded or rmmodded. Luckilly it'll be another 1,5 hour before $girlfriend will be here so I'll try again. ... here's the list of commands performed: modprobe Module rmmod shpchp rmmod piix rmmod loop rmmod dm_log rmmod ide_generic modprobe snd_ac97_codec modprobe scsi_mod rmmod dm_snapshot rmmod ide_gd_mod rmmod ide_core rmmod dm_snapshot modprobe agpgart rmmod i2c_i801 rmmod pci_hotplug rmmod fan rmmod i2c_core modprobe Module modprobe parport modprobe dm_mirror modprobe thermal_sys modprobe snd_timer modprobe snd_ac97_codec modprobe nls_base modprobe scsi_mod rmmod snd_intel8x0 rmmod snd_intel8x0 modprobe i2c_core rmmod snd_timer modprobe sg modprobe jbd modprobe ide_core modprobe uhci_hcd rmmod usbcore modprobe pci_hotplug rmmod loop modprobe snd_ac97_codec rmmod processor modprobe vfat modprobe parport_pc modprobe snd_intel8x0 rmmod dm_region_hash rmmod thermal_sys modprobe loop modprobe sd_mod modprobe ext3 rmmod ext3 rmmod hid rmmod loop rmmod ehci_hcd modprobe evdev rmmod iTCO_wdt rmmod nls_cp437 modprobe dm_region_hash modprobe shpchp modprobe ac97_bus modprobe snd_pcsp modprobe ehci_hcd modprobe rng_core modprobe evdev rmmod dm_log modprobe ext3 rmmod Module rmmod rng_core modprobe thermal_sys modprobe shpchp modprobe snd_pcsp rmmod fan modprobe ac97_bus modprobe ac97_bus modprobe thermal_sys modprobe fat modprobe dm_mod rmmod snd_pcsp rmmod fan modprobe piix modprobe snd_timer modprobe ide_generic modprobe usbcore modprobe thermal modprobe ac97_bus modprobe mbcache rmmod fan rmmod usb_storage rmmod dm_mirror modprobe button rmmod sd_mod rmmod dm_log rmmod 
i2c_core rmmod dm_mod rmmod snd_timer modprobe vfat modprobe ac97_bus modprobe vfat modprobe vfat rmmod sd_mod rmmod nls_cp437 modprobe nls_cp437 rmmod soundcore modprobe jbd rmmod ide_gd_mod modprobe ac97_bus rmmod parport_pc modprobe i2c_i801 modprobe dm_log rmmod sr_mod rmmod intel_agp rmmod sd_mod modprobe rng_core modprobe piix modprobe crc_t10dif rmmod ext3 modprobe soundcore rmmod ide_generic modprobe i2c_i801 rmmod ac modprobe dm_mod modprobe crc_t10dif rmmod snd_page_alloc rmmod sd_mod modprobe ide_cd_mod rmmod snd_page_alloc modprobe ac rmmod snd_page_alloc modprobe evdev modprobe ide_gd_mod modprobe ide_generic modprobe container modprobe i2c_core modprobe agpgart modprobe sr_mod rmmod ide_cd_mod rmmod loop modprobe jbd rmmod i2c_i801 rmmod shpchp rmmod shpchp modprobe ide_gd_mod rmmod usb_storage rmmod libata rmmod evdev modprobe snd_timer rmmod nls_base modprobe snd_pcm rmmod ac modprobe thermal rmmod snd modprobe snd_ac97_codec rmmod uhci_hcd rmmod dm_mod modprobe ac rmmod thermal_sys modprobe agpgart rmmod ata_generic modprobe jbd modprobe hid rmmod i2c_i801 modprobe ide_core modprobe evdev modprobe ext3 modprobe battery rmmod agpgart modprobe snd_pcsp rmmod i2c_core rmmod ata_generic modprobe usbhid modprobe evdev rmmod ac rmmod hid modprobe container rmmod vfat modprobe ide_generic rmmod sd_mod rmmod piix rmmod ipv6 modprobe snd_timer modprobe iTCO_wdt rmmod processor modprobe ide_cd_mod modprobe ehci_hcd rmmod sr_mod rmmod shpchp rmmod snd_pcm modprobe container modprobe i2c_i801 modprobe ata_generic modprobe snd_pcm modprobe snd_timer modprobe ide_gd_mod modprobe ext3 modprobe sg modprobe nls_cp437 rmmod nls_cp437 rmmod jbd rmmod i2c_i801 rmmod piix rmmod thermal_sys rmmod mbcache modprobe nls_utf8 rmmod thermal modprobe hid rmmod snd_page_alloc modprobe nls_base modprobe dm_log modprobe joydev rmmod dm_mirror modprobe ide_gd_mod modprobe shpchp modprobe scsi_mod modprobe loop modprobe ac modprobe iTCO_wdt rmmod container rmmod crc_t10dif 
modprobe ata_generic rmmod hid rmmod nls_cp437 rmmod rng_core rmmod soundcore rmmod dm_log modprobe piix modprobe loop modprobe fan modprobe mbcache rmmod usbhid modprobe crc_t10dif modprobe soundcore modprobe sd_mod rmmod processor rmmod parport modprobe snd_intel8x0 rmmod jbd rmmod fat rmmod thermal_sys modprobe usbhid rmmod evdev modprobe scsi_mod rmmod Module modprobe dm_snapshot rmmod sr_mod modprobe battery rmmod parport_pc modprobe agpgart modprobe dm_mod modprobe i2c_i801 modprobe ac97_bus rmmod uhci_hcd modprobe snd_ac97_codec rmmod evdev modprobe parport modprobe snd_timer modprobe scsi_mod modprobe evdev modprobe nls_base modprobe hid modprobe nls_base modprobe sr_mod rmmod ehci_hcd modprobe snd_pcm modprobe parport_pc rmmod battery modprobe container rmmod i2c_i801 rmmod fan modprobe sd_mod modprobe parport_pc rmmod i2c_core rmmod parport_pc rmmod fan modprobe intel_agp modprobe vfat rmmod piix rmmod fan modprobe button rmmod nls_cp437 rmmod hid modprobe i2c_core modprobe usb_storage rmmod loop modprobe cdrom modprobe iTCO_wdt rmmod thermal rmmod dm_log modprobe libata modprobe i2c_i801 modprobe ide_gd_mod modprobe rng_core modprobe nls_base modprobe Module modprobe dm_mirror modprobe intel_agp modprobe fat rmmod iTCO_wdt rmmod rng_core rmmod soundcore rmmod ide_core modprobe usb_storage rmmod fat rmmod fat rmmod usb_storage modprobe iTCO_wdt rmmod nls_utf8 modprobe rng_core rmmod jbd rmmod usb_storage rmmod ipv6 rmmod nls_utf8 modprobe i2c_i801 rmmod nls_base modprobe pci_hotplug rmmod evdev rmmod piix rmmod usbcore modprobe nls_base rmmod snd rmmod loop modprobe ehci_hcd rmmod snd_pcm rmmod dm_mod modprobe snd_timer rmmod i2c_core modprobe thermal_sys rmmod cdrom rmmod snd_pcsp rmmod intel_agp rmmod Module rmmod ide_core modprobe shpchp rmmod libata rmmod dm_mirror modprobe ide_cd_mod rmmod snd_timer rmmod rng_core rmmod dm_log modprobe uhci_hcd modprobe agpgart modprobe Module modprobe parport modprobe ide_gd_mod modprobe snd modprobe i2c_i801 
modprobe ipv6 rmmod mbcache modprobe snd_page_alloc modprobe ide_core rmmod snd modprobe loop rmmod i2c_core modprobe intel_agp rmmod nls_utf8 modprobe joydev modprobe dm_log rmmod nls_cp437 rmmod iTCO_wdt rmmod rng_core rmmod nls_cp437 rmmod usb_storage modprobe piix modprobe ata_generic rmmod usb_storage rmmod parport rmmod fat modprobe hid rmmod crc_t10dif modprobe hid modprobe joydev rmmod crc_t10dif rmmod battery rmmod nls_base rmmod intel_agp rmmod loop rmmod hid rmmod battery rmmod dm_snapshot rmmod dm_log modprobe shpchp rmmod snd_pcsp rmmod mbcache modprobe sg modprobe rng_core modprobe ide_gd_mod rmmod ide_generic modprobe dm_region_hash modprobe pci_hotplug modprobe crc_t10dif rmmod snd_page_alloc rmmod ac modprobe shpchp modprobe ata_generic rmmod parport_pc rmmod loop rmmod ac rmmod nls_cp437 rmmod button modprobe thermal modprobe usbcore rmmod container rmmod ext3 rmmod parport_pc modprobe ac modprobe snd_page_alloc modprobe loop rmmod Module modprobe snd_intel8x0 modprobe Module rmmod sr_mod rmmod ipv6 rmmod rng_core rmmod nls_utf8 rmmod i2c_core modprobe vfat rmmod evdev modprobe snd_page_alloc modprobe thermal modprobe evdev modprobe iTCO_wdt rmmod snd_pcsp rmmod sg modprobe snd_pcm rmmod ide_cd_mod rmmod rng_core modprobe sg rmmod sr_mod rmmod soundcore modprobe thermal modprobe fan modprobe dm_mod rmmod nls_utf8 rmmod libata modprobe loop rmmod nls_cp437 rmmod i2c_i801 modprobe soundcore rmmod libata modprobe snd rmmod i2c_core rmmod sr_mod modprobe thermal modprobe button rmmod ide_cd_mod rmmod hid modprobe agpgart modprobe snd_pcsp modprobe evdev modprobe dm_log modprobe snd_intel8x0 modprobe snd_timer rmmod Module rmmod piix modprobe ide_generic modprobe libata modprobe snd_timer Hopefully this is of any help.
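Since the list above was recorded as one flat stream of words, replaying it in the same order first requires splitting it back into (command, module) pairs. The sketch below does that and also writes each step to a log and syncs it before running the command, so the last line in the log is the command that was executing when the system fell over. The function names, the log path and the injectable `run` callback are hypothetical; nothing here is taken from the pyk script itself.

```python
import os
import subprocess


def parse_commands(flat_text):
    """Split 'modprobe a rmmod b ...' into [('modprobe', 'a'), ('rmmod', 'b'), ...]."""
    tokens = flat_text.split()
    commands = []
    i = 0
    while i + 1 < len(tokens):
        if tokens[i] in ("modprobe", "rmmod"):
            commands.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            i += 1  # skip stray words that are not a known command
    return commands


def replay(commands, log_path, run=subprocess.call):
    """Run each command in order, logging it durably *before* executing it."""
    with open(log_path, "a") as log:
        for index, (action, module) in enumerate(commands):
            log.write(f"{index} {action} {module}\n")
            log.flush()
            os.fsync(log.fileno())  # force the line to disk before the risky step
            run([action, module])
```

Note that the recorded list contains entries such as `modprobe Module`, which look like the lsmod header line being picked up as a module name; the parser keeps them so the replay matches the original run exactly. The log should live on a filesystem that is not part of the test, as was done with the vfat stick.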
can you always reproduce the problem, if you let this run a second time, i.e. you load/unload the modules in the same order as listed ?
Um, why are you loading and unloading so many modules? Note that it is not necessarily guaranteed to be safe to be unloading modules. In particular with network drivers, there are often race conditions that can crash your machine if you unload a module. Part of the problem is that some kernel maintainers don't believe that it is a valid/good thing to rmmod a kernel module, and in practice, it is often impossible to make module removal race-free. Some maintainers therefore don't take even basic precautions to avoid the most obvious race problems.

So if you have something which is automatically unloading modules --- don't. It's not supported. If you can narrow it down to a single module which is racy on unload, and you can reproduce it, and politely request help from the module maintainer to fix it, they might feel magnanimous and fix it for you --- but be warned there are some maintainers (davem comes to mind) who believe so strenuously that module unloading is evil and shouldn't be supported that even if you give them a patch to fix some module unload race condition, they may not accept it.
P.S. There are a few modules that I manually unload, such as ehci_hcd and uhci_hcd for power management reasons, but that's basically because the USB folks haven't given us better ways of turning off USB or to better manage USB's power consumption when running on batteries. But it's better to consider that you have a safe list of modules which can be unloaded safely, and to do so by hand, rather than by some automatic program.

P.P.S. In any case, it's pretty clear this isn't a filesystem bug.
(In reply to comment #18)
> can you always reproduce the problem, if you let this run a second time, i.e.
> you load/unload the modules in the same order as listed ?

Yes, most definitely.
(In reply to comment #20)
[ .. not guaranteed that removing a module works in all cases ]
> P.P.S. In any case, it's pretty clear this isn't a filesystem bug.

Ah ok. I thought it was supposed to work in all cases and that I just found a bug that should not be there. Glad it is not a filesystem bug as I became a little afraid to upgrade to something more recent than 2.6.26.

Thanks
> Note that it is not necessarily guaranteed to be safe to be unloading modules.

if it's not safe, those unsafe modules should be marked appropriately so that they cannot be unloaded. at least they should spit out a warning that unloading should be avoided. there are already lots of modules which cannot be unloaded at all, so it's just a matter of good will whether the others get marked appropriately.

if a module is unloadable and this is unsafe and if that module doesn't tell that to the user, then i'd call that a bug. if it DOES tell that and the kernel crashes, then it's user error.

why do i think this way? i already did lots of module testing like this user does, so i did lots of automated module load/unload and found one or another bug with this, but it's the first time that someone is telling me that this should NOT be done because there are developers who do not support this. i think it's good that we have people like folkert reporting such issues, because it enhances kernel quality.
(In reply to comment #21)
> (In reply to comment #18)
> > can you always reproduce the problem, if you let this run a second time, i.e.
> > you load/unload the modules in the same order as listed ?
>
> Yes, most definitely.

Then I'd try to narrow down the simplest subset of that list that reproduces the problem...
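One way to narrow the list mechanically is to keep removing chunks of commands, halving the chunk size each round, and keep any shortened list that still triggers the corruption. The sketch below assumes a hypothetical `reproduces(commands)` callback that restores a clean test system, replays the given commands in order, and returns True when the corruption shows up; that callback is the expensive part and is not provided here. It is a simplified relative of delta debugging, not a guaranteed-minimal search.

```python
def minimize(commands, reproduces):
    """Shrink an ordered command list while 'reproduces' keeps returning True.

    'reproduces' is a caller-supplied (hypothetical) test function; every
    call to it stands for one full restore-and-replay test cycle.
    """
    commands = list(commands)
    if not reproduces(commands):
        raise ValueError("the full list does not reproduce the problem")
    chunk = len(commands) // 2
    while chunk >= 1:
        i = 0
        while i < len(commands):
            # Try the list with commands[i:i+chunk] removed, order preserved.
            candidate = commands[:i] + commands[i + chunk:]
            if candidate and reproduces(candidate):
                commands = candidate  # chunk was irrelevant, drop it for good
            else:
                i += chunk  # chunk is needed, move on to the next one
        chunk //= 2
    return commands
```

Because order is preserved, this also covers the case where the problem only appears when one module is unloaded after another was loaded. With a half-hour test cycle the number of calls matters, so it is worth shortening the cycle first.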
I agree this would be a desirable thing to do, but it takes time for this to happen, and the people who are most interested in determining which modules are running into problems when unloaded frequently are the ones who need to do this testing.

I'll note there are also those modules which *can* be safely unloaded, as long as you ifconfig down the interface, make sure there are no active programs accessing them, wait a few seconds, (save your files just in case), and then unload it, while crossing your fingers. It's sufficiently useful to unload this driver that even though you have to be ***very*** careful to unload it, I'd prefer that it not be made completely impossible to remove.

Note that there already is a config option to prevent rmmod from working at all, and rmmod requires root privs, and we do expect root to have at least some background skills....
ted, shouldn't modules which cannot be safely unloaded either be non-unloadable or at least be marked appropriately (print a warning on unload), so the users will know that they are doing that at their own risk ?

if module load works, so should unload, and if unload isn't safe to use, then it's a bug and the world (i.e. the end user) should know about that. so, such problematic modules should print an appropriate message on unload, imho.
bullshit - double post. i`m repeating myself with something i already told months ago. i´m getting old.... :D can someone tell me how to delete my own post in bugzilla?
Folkert, we'd still be interested in which module is killing your system. so, if you can provide more input here we should try finding the offending module. if you don't want to do that due to lack of time (which is understandable, since not everybody likes bug-hunting) we should just close this bugreport, as it would be just another unresolved bug and you would get asked about the status every couple of months or so.

one more: does this ticket relate to this post ? http://marc.info/?l=linux-kernel&m=122841015111252&w=2

you are telling there that the problem doesn't exist with .28rc kernels, but in this bugtracker you say you could easily trigger it with the .28 kernel. so, can you still trigger the problem with recent .30 or .31rc kernels ?

one more recommendation: as you told that you were constantly killing your system and needed to reinstall, i'd relocate testing into a virtual machine. so, you could keep a snapshot of the working system and always revert back to a working state very quickly.
No response, closing old stale bug