Bug 13365 - 2.6.28+ cannot read a hard-drive properly through an "82801DB (ICH4) IDE Controller"
Status: CLOSED OBSOLETE
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: other_other
Reported: 2009-05-22 23:16 UTC by Jacob
Modified: 2012-06-07 11:04 UTC
Regression: Yes

Attachments
lspci -vvvnn (7.94 KB, application/octet-stream)
2009-05-22 23:16 UTC, Jacob
2.6.29.4 .config (58.81 KB, text/plain)
2009-05-22 23:32 UTC, Jacob
dmesg for 2.6.27 (45.99 KB, text/plain)
2009-05-24 12:22 UTC, Jacob
good dmesg for 2.6.29 (42.91 KB, text/plain)
2009-05-24 12:38 UTC, Jacob

Description Jacob 2009-05-22 23:16:14 UTC
Created attachment 21496 [details]
lspci -vvvnn

This used to work in 2.6.27. 

It seems 2.6.28 or greater (tested with 2.6.28-gentoo-r5 and the latest stable on kernel.org, 2.6.29.4) cannot use the Intel 82801DB (ICH4) IDE Controller properly. The kernel boots fine and the init script is launched, but as soon as fsck checks whether the partition is /really/ fine before the init services mount it, everything stops. fsck complains that the superblock count differs (by a few thousand) from what the partition table claims.

If I boot back into 2.6.27 after booting into either of these kernels, fsck will complain in a different way that there are errors in the file-system, but it cleans it up and boot continues. If I reboot again into 2.6.27, then fsck doesn't complain at all. However, as soon as I attempt .28 or .29, things go awry again.

Attached is lspci -vvvnn output. A plain 'lspci' dump is provided below (for Google purposes):
=== begin ===
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 01)
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
01:07.0 Multimedia video controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder (rev 05)
01:08.0 Network controller: Broadcom Corporation BCM4306 802.11b/g Wireless LAN Controller (rev 03)
=== end ===

I think the important part is the IDE interface, but in case there's a conflict in hardware I included it all.

Besides the Gentoo kernel and the vanilla kernel, I also have a unionfs patch. However, I have left it out of the mix for 2.6.28 and .29 to make sure that wasn't the cause.

This is my first _kernel_ bug report (*shiver* ;), so let me know if I'm missing any info you need.
Comment 1 Jacob 2009-05-22 23:32:30 UTC
Created attachment 21498 [details]
2.6.29.4 .config

It's called .config.bak because it's the .config that applies to this bug. I'm currently trying to turn off "Serial ATA" support and use plain "ATA" instead to see if that helps. If it does, I'll post a 'diff -u' at that point in time.
Comment 2 Jacob 2009-05-22 23:34:25 UTC
Comment on attachment 21498 [details]
2.6.29.4 .config

Adjusted mime-type for convenience.
Comment 3 Alan 2009-05-22 23:39:10 UTC
Can you attach a dmesg of the boot (good or bad)?
Comment 4 Felix Miata 2009-05-23 00:07:27 UTC
I have several ICH4 systems. On one with an Intel motherboard running Mandriva Cooker, 2.6.29.1 works fine with libata drivers. A Dell GX260, running openSUSE Factory with a 2.6.29-6 kernel, works fine with legacy IDE drivers. I'm using only ext2 & ext3 partitions.
Comment 5 Felix Miata 2009-05-23 02:58:32 UTC
It just occurred to me that boards with 845 chipsets were mostly made during the plague of defective capacitors. My 845 Intel board developed bad caps, which I replaced when bad things started happening, and the errors stopped.
Comment 6 Christopher Hogan 2009-05-23 05:00:28 UTC
I'm also having this problem with an Acer desktop system. I ran into the problem when I updated from kernel-2.6.18-gentoo-r5 to kernel-2.6.29-gentoo-r3. At the same time, I switched from using the BLK_DEV_PIIX driver to the ATA_PIIX driver.

Booting the new kernel produced similar errors concerning the file system being too large for the partition. The system could not boot the root file system. Booting back into 2.6.18, fsck corrected some corruption and the system booted normally. Every time I booted into the 2.6.29 kernel, I received the same errors.

I recompiled the 2.6.29 kernel to use the BLK_DEV_PIIX driver and the errors went away.

I'd attach the dmesg of the boot with the error. However, the root file system wouldn't mount, so I'm not sure how I'd capture it. If I get a chance tomorrow, I'll try booting the system from a Knoppix image and see if it has problems mounting the hard drive.

lspci output:
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 03)
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 82)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 02)
01:0c.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)

I also have a Dell GX260 that uses this chipset, but doesn't have this problem. I'm using ReiserFS from the kernel.
Comment 7 Alan 2009-05-23 08:11:06 UTC
Christopher - please open a separate bug for what is almost certainly an unrelated issue.

This actually looks like something is causing memory corruption. It's probably not the ATA or IDE drivers but something stomping on the cache, which is why I asked for the dmesg, to see what else is going on. A dud board is certainly possible, and running memtest86 overnight on it wouldn't be a bad idea, but if going back to the old kernel fixes it, that does suggest it's probably software triggered.
Comment 8 Bartlomiej Zolnierkiewicz 2009-05-23 11:54:13 UTC
On Saturday 23 May 2009 10:11:07 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #7 from Alan <alan@lxorguk.ukuu.org.uk>  2009-05-23 08:11:06 ---
> Christopher - please open a separate bug for what is almost certainly an
> unrelated issue.

Well, the kernel config posted by Jacob has:

	# CONFIG_IDE is not set

so it would be useful to verify his 2.6.27 (last good) kernel config first.

> This actually looks like something is causing memory corruption. It's
> probably not the ATA or IDE drivers but something stomping on the cache
> which is why I

Certainly not IDE drivers.  Please move this bug over to libata.
Comment 9 Jacob 2009-05-24 12:22:38 UTC
Created attachment 21513 [details]
dmesg for 2.6.27

Here's a good boot's dmesg.

Well, guys, things have gotten very interesting since I last wrote this bug. I'm not so sure this is a kernel bug anymore. My hard-drive made sounds almost like it had turned off and was spinning down, but then turned back on again. Several hours later, there was a hard-drive error, and Linux did an emergency read-only remount.

I ran fsck, which presumably fixed the errors, and everything is working again.

Looking at my dmesg file in depth for the first time, I see some "Buffer" errors that make me quiver.

Nevertheless, I'm going to continue fighting for 2.6.29, because there's still a possibility that there's a bug keeping 2.6.29 from booting altogether. Unless 2.6.29 is being smart. ;)
Comment 10 Jacob 2009-05-24 12:38:38 UTC
Created attachment 21514 [details]
good dmesg for 2.6.29

Christopher (at least, I'm pretty sure he's the one who recommended this workaround on the Gentoo forums) was right: the Serial ATA drivers do not work, but the ATA ones do. I'm now using 2.6.29.4, with /dev/hda instead of /dev/sda.  I'm also without X at the moment, since the intel driver is failing to paint pixels correctly on my screen. Another bug for another day, perhaps. ;)

I'll let you guys decide whether it's a hard-drive issue or a kernel issue, but I think it has a good chance of being a kernel issue at this point.

If you need any more information, let me know.
Comment 11 Alan 2009-05-24 12:52:43 UTC
If you've got the video driver failing, random crashes and the like, it's almost undebuggable whatever the real problem is.
Comment 12 Felix Miata 2009-05-24 13:03:51 UTC
Bad caps are something you discover by just opening it up to look. If you have them, anything else you do is a waste of everyone's time. http://www.overclockers.com/index.php?option=com_content&view=article&id=3649&catid=53:editorials&Itemid=4259
Comment 13 Bartlomiej Zolnierkiewicz 2009-05-24 13:11:45 UTC
Seems to be a legitimate kernel issue.  I'm re-assigning it to the right component...
Comment 14 Alan 2009-05-24 13:14:14 UTC
Seems not to be at this point.
Comment 15 Bartlomiej Zolnierkiewicz 2009-05-24 13:32:42 UTC
You have two independent reports for the same ATA hardware regarding similar issues and you don't even care to ask for the 'bad' 2.6.29 dmesg (to do a diff between the 'good' and 'bad' dmesgs) before dismissing it using the usual 'broken hardware' excuse...?

BTW, looking at the errors, I think that the issue might be related to HPA handling and not the ata_piix host driver itself.
Comment 16 Alan 2009-05-24 13:38:43 UTC
Bart - please take your silly little personal vendettas somewhere else.
Comment 17 Bartlomiej Zolnierkiewicz 2009-05-24 14:41:09 UTC
Technical facts are not "silly little personal vendettas":

- issue *does*not* happen with CONFIG_IDE

- issue *does* happen with CONFIG_ATA

Thus reassigning the bug from ide to libata made perfect sense.

Moreover, since it is most likely a subsystem-related problem and not a host-driver-specific one, I reassigned it to the libata maintainer (Jeff Garzik), not _you_.

Jeff can decide whether it is really a kernel problem and take further action accordingly (like requesting more data, reassigning the bug further, or even closing it if he so chooses).

Please reassign the bug back to libata where it should belong for now.
Comment 18 Alan 2009-05-24 16:07:13 UTC
Bart, I've already asked you once, and I'll ask you again - please take your personal vendetta elsewhere.

This is a single system, with a set of random behaviours that appear to be memory corruption and look like they could be hardware. In that situation what works is highly dependent upon memory layout. I've been doing this for over fifteen years, I've seen this pattern many times before and it is usually hardware but may be another unrelated driver scribbling on something. That setup is *known* to work for other users.

It is almost certainly nothing to do with libata, and trying to assign it there is not helpful, nor is your wittering about HPA problems - _Christopher's_ problem probably is a misconfiguration of HPA settings which is why I asked him to file a separate bug report. 

I am not reassigning the bug back to libata. It's assigned as other/other because at the moment we have no evidence at all as to what actually is going on. I wouldn't be surprised if Felix's suspicions about hardware are shown to be correct. It could be anything so this is the best place for it.
Comment 19 Bartlomiej Zolnierkiewicz 2009-05-24 16:31:48 UTC
On Sunday 24 May 2009 18:07:13 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #18 from Alan <alan@lxorguk.ukuu.org.uk>  2009-05-24 16:07:13 ---
> Bart, I've already asked you once, and I'll ask you again - please take your
> personal vendetta elsewhere
> 
> This is a single system, with a set of random behaviours that appear to be
> memory corruption and look like they could be hardware. In that situation what
> works is highly dependent upon memory layout. I've been doing this for over
> fifteen years, I've seen this pattern many times before and it is usually
> hardware but may be another unrelated driver scribbling on something. That
> setup is *known* to work for other users.
> 
> It is almost certainly nothing to do with libata, and trying to assign it
> there is not helpful, nor is your wittering about HPA problems -
> _Christopher's_ problem probably is a misconfiguration of HPA settings which
> is why I asked him to file a separate bug report.

How about finally taking a peek at _Jacob's_ (who is this bug's submitter)
dmesgs from 2.6.27 (libata) and 2.6.29 (ide)?  Then you can talk to me about
"personal vendetta" and "wittering about HPA problems" all you want:

2.6.27 (libata, doesn't work):
[    0.771152] ata1.00: HPA detected: current 156250000, native 156301488
[    0.771252] ata1.00: ATA-5: WDC WD800BB-75CAA0, 16.06V16, max UDMA/100
[    0.771347] ata1.00: 156250000 sectors, multi 8: LBA 

2.6.29 (ide, does work):
[    2.690956] hda: Host Protected Area detected.
[    2.690958] 	current capacity is 156250000 sectors (80000 MB)
[    2.690961] 	native  capacity is 156301488 sectors (80026 MB)
[    2.693441] hda: Host Protected Area disabled.
[    2.693532] hda: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=65535/16/63

So it is very much libata related and the actual problem is that people
migrating from IDE to libata expect the same behavior w.r.t. HPA.

Since the default behavior in IDE has been to disable HPA (probably not
the best choice but it was so for historical reasons), the good practice
upon discovery of HPA in libata would be to print a warning about possible
compatibility issue and hint user about "libata.ignore_hpa" option.

Jacob, please try booting 2.6.27 (or even better 2.6.29 w/libata) using
"libata.ignore_hpa=1" parameter.

Thanks.
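
For reference, the size of the hidden area follows directly from the two capacities in the dmesg lines above. A minimal Python sketch, assuming 512-byte sectors and the libata message format quoted in this comment (the helper name `hpa_gap` is made up for illustration):

```python
import re

# Matches the libata line quoted above, e.g.
# "ata1.00: HPA detected: current 156250000, native 156301488"
HPA_RE = re.compile(r"HPA detected: current (\d+), native (\d+)")

SECTOR_BYTES = 512  # assumption: classic 512-byte logical sectors


def hpa_gap(dmesg_line):
    """Return (hidden_sectors, hidden_bytes), or None if the line has no HPA report."""
    m = HPA_RE.search(dmesg_line)
    if m is None:
        return None
    current, native = int(m.group(1)), int(m.group(2))
    hidden = native - current
    return hidden, hidden * SECTOR_BYTES


line = "[    0.771152] ata1.00: HPA detected: current 156250000, native 156301488"
print(hpa_gap(line))
```

For the drive in this report that works out to 51488 sectors, roughly 26 MB, which matches the 80000 MB versus 80026 MB capacities printed by the IDE driver.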
Comment 20 Robert Hancock 2009-05-24 19:41:48 UTC
Yes, Jacob, it looks like the HD was formatted/partitioned under the IDE drivers, which disabled the host protected area that exists on that drive for some reason, resulting in partitions being created that use the area protected by the HPA. libata by default respects the HPA, so the sda4 partition now spans past the end of what we consider the device capacity to be:

[    0.966457] sda: p4 exceeds device capacity

resulting in reads into that part of the partition failing.

So yes, the solution for now would be to set the libata.ignore_hpa=1 kernel parameter. The better long-term solution would be to remove the host protected area from the drive so this isn't an issue in the future. I believe the hdparm -N command can be used to do this. (Whatever the HPA was protecting, maybe a factory OS recovery image or something, has presumably been blown away by now.)

It's unfortunate that the IDE drivers defaulted to ignoring the HPA, as it resulted in situations like this where partitions were created inside it. Ignoring the HPA should only be done in special cases as they are usually present for a reason. We might be able to clarify the libata message to indicate what it means a bit better to the uninitiated, but we don't want to encourage people to use ignore_hpa without some thought.
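
The "p4 exceeds device capacity" condition amounts to comparing each partition's end sector against the capacity the driver reports. A rough Python sketch of that comparison follows; the partition layout is invented for illustration (the real partition table is not in this report), and only the two capacities come from the dmesg output:

```python
CURRENT_CAPACITY = 156250000  # sectors visible with the HPA respected (libata default)
NATIVE_CAPACITY = 156301488   # sectors visible with the HPA disabled (old IDE default)


def partitions_past_end(partitions, capacity):
    """Return names of partitions whose end lies beyond `capacity`.

    `partitions` maps a name to (start_sector, sector_count).
    """
    return [
        name
        for name, (start, count) in partitions.items()
        if start + count > capacity
    ]


# Hypothetical layout: the last partition was created while the whole
# native capacity was visible, so it runs to the very end of the drive.
layout = {
    "p1": (63, 200000),
    "p4": (200063, NATIVE_CAPACITY - 200063),
}

print(partitions_past_end(layout, CURRENT_CAPACITY))  # ['p4']
print(partitions_past_end(layout, NATIVE_CAPACITY))   # []
```

The same table is valid or invalid depending only on which capacity the driver exposes, which is why the disk looks fine under the IDE drivers and broken under libata.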
Comment 21 Bartlomiej Zolnierkiewicz 2009-05-24 20:29:12 UTC
On Sunday 24 May 2009 21:41:49 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #20 from Robert Hancock <hancockrwd@gmail.com>  2009-05-24 19:41:48 ---
> Yes, Jacob, it looks like that the HD was formatted/partitioned under the IDE
> drivers which disabled the host protected area on that drive which exists for
> some reason, this resulting in the partitions being created to use the area
> protected by the HPA. libata by default respects the HPA and so now the sda4
> partition now spans past the end of what we consider the device capacity to
> be:
> 
> [    0.966457] sda: p4 exceeds device capacity
> 
> and resulting in reads into that part of the partition failing.
> 
> So yes, the solution for now would be to set the libata.ignore_hpa=1 kernel
> parameter. The better long-term solution would be to remove the host
> protected area from the drive so this isn't an issue in the future. I believe
> the hdparm -N command can be used to do this. (Whatever the HPA was
> protecting, maybe a factory OS recovery image or something, has presumably
> been blown away by now.)
> 
> It's unfortunate that the IDE drivers defaulted to ignoring the HPA, as it
> resulted in situations like this where partitions were created inside it.

Sure it is unfortunate but it was the existing behavior since some 2.4.x days
(sorry I don't know who committed it back then, it predates my reign)...

> Ignoring the HPA should only be done in special cases as they are usually
> present for a reason. We might be able to clarify the libata message to
> indicate what it means a bit better to the uninitiated, but we don't want to
> encourage people to use ignore_hpa without some thought.

The real-world severity of the issue is the following:

- the existing setups using IDE (w/ disabled HPA) work just *fine*

- by encouraging libata (w/ enabled HPA) migration such setups are pushed at
  (quite high) risk of filesystem *corruption*

The *carefully* prepared warning message is the absolute minimum that should
be done.  No "ifs" or "buts", please somebody just do it *finally* (this is a
known problem for some long time now).

[ Oh, wait I could have used it for my "personal vendetta" and made Linux
  media full of "new PATA drivers causing corruption, developers dismissing
  the problem as configuration issue" headlines!  Sigh... ]
Comment 22 Robert Hancock 2009-05-24 20:45:21 UTC
I would question that the existing IDE setups (and future ones that may still be created) work "just fine", given that it allows blowing away potentially important/desired data like OS recovery partitions without user knowledge. (This data would be at the end of the disk and wouldn't show up in the partition table, so the user may not know there was anything there.) In fact this is quite likely what happened here.

And I tend to doubt that filesystem corruption is really that likely. If a file system corrupts itself due to part of the FS becoming inaccessible, that seems like a FS bug. The symptoms may appear similar to FS corruption, however, as it is likely to remount read-only, etc.

Ideally this problem needs to be dealt with at a higher level, i.e. in distribution installers, etc. which could detect this situation and give the user an option of what to do. Adding a kernel message is fine but to have any effect requires that the user a) see it and b) connect it to the symptoms they're experiencing.
Comment 23 Bartlomiej Zolnierkiewicz 2009-05-24 21:18:58 UTC
On Sunday 24 May 2009 22:45:21 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #22 from Robert Hancock <hancockrwd@gmail.com>  2009-05-24 20:45:21 ---
> I would question that the existing IDE setups (and future ones that may still
> be created) work "just fine", given that it allows blowing away potentially
> important/desired data like OS recovery partitions without user knowledge.

Well, especially Linux-only users are certainly missing their proprietary
OS recovery partitions... :)

[ AFAIK no Linux based system makes such use of HPA area. ]

> (This data would be at the end of the disk and wouldn't show up in the
> partition table, so the user may not know there was anything there.) In fact
> this is quite likely what happened here.

No, nothing like that happened here.

Moreover this is missing the point completely.

[ Look, one of the setups discussed here was upgraded from 2.6.18... ]

> And I tend to doubt that filesystem corruption is really that likely. If a
> file system corrupts itself due to part of the FS becoming inaccessible, that
> seems like a FS bug. The symptoms may appear similar to FS corruption,
> however, as it is likely to remount read-only, etc.

Mix in a few real-world usages by actual users (like attempts to fix errors
that they are seeing or filesystem resize using gparted) and the probability
of "FS bug" rises significantly.

Even if the filesystem manages to remount in read-only mode it would still
mean *unexpected* loss of some current data combined with inability to access
some existing data.

> Ideally this problem needs to be dealt with at a higher level, i.e. in
> distribution installers, etc. which could detect this situation and give the

Ideally, yes.  Probably in 2020 we will see it, maybe 2015.

> user an option of what to do. Adding a kernel message is fine but to have any
> effect requires that the user a) see it and b) connect it to the symptoms
> they're experiencing.

Whatever, please move this bug to libata finally.
Comment 24 Robert Hancock 2009-05-24 21:27:24 UTC
(In reply to comment #23)
> > I would question that the existing IDE setups (and future ones that may still
> > be created) work "just fine", given that it allows blowing away potentially
> > important/desired data like OS recovery partitions without user knowledge.
> 
> Well, especially Linux-only users are certainly missing their proprietary
> OS recovery partitions... :)
> 
> [ AFAIK no Linux based system makes such use of HPA area. ]

The user may be using a dual-boot system and want to have that recovery data remain available. In any case, the kernel has no business unmasking that area and allowing it to get trashed without consent.

> 
> > (This data would be at the end of the disk and wouldn't show up in the
> > partition table, so the user may not know there was anything there.) In fact
> > this is quite likely what happened here.
> 
> No, nothing like that happened here.

It probably did when the distribution was originally installed using the IDE drivers.

> > user an option of what to do. Adding a kernel message is fine but to have any
> > effect requires that the user a) see it and b) connect it to the symptoms
> > they're experiencing.
> 
> Whatever, please move this bug to libata finally.

I wouldn't say that this really constitutes a "bug" in libata. If anything is a bug it was the decision to unmask HPA by default in IDE, and libata is just tripping over it in this case. There is a potential usability enhancement here, but anyway, such discussions really belong on the linux-ide list and not in Bugzilla.
Comment 25 Bartlomiej Zolnierkiewicz 2009-05-24 21:54:37 UTC
On Sunday 24 May 2009 23:27:25 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #24 from Robert Hancock <hancockrwd@gmail.com>  2009-05-24 21:27:24 ---
> (In reply to comment #23)
> > > I would question that the existing IDE setups (and future ones that may still
> > > be created) work "just fine", given that it allows blowing away potentially
> > > important/desired data like OS recovery partitions without user knowledge.
> > 
> > Well, especially Linux-only users are certainly missing their proprietary
> > OS recovery partitions... :)
> > 
> > [ AFAIK no Linux based system makes such use of HPA area. ]
> 
> The user may be using a dual-boot system and want to have that recovery data
> remain available. In any case, the kernel has no business unmasking that area
> and allowing it to get trashed without consent.
> 
> > 
> > > (This data would be at the end of the disk and wouldn't show up in the
> > > partition table, so the user may not know there was anything there.) In fact
> > > this is quite likely what happened here.
> > 
> > No, nothing like that happened here.
> 
> It probably did when the distribution was originally installed using the IDE
> drivers.

TBH it is really the distribution's fault for blowing up the HPA.  Obtaining
information about the HPA is not black magic and it was an official decision to
give the kernel access to the _whole_ drive by default; also, this is missing
the whole point...

So what if distribution did it *N-years* ago?  The current setup works *fine*.

> > > user an option of what to do. Adding a kernel message is fine but to have any
> > > effect requires that the user a) see it and b) connect it to the symptoms
> > > they're experiencing.
> > 
> > Whatever, please move this bug to libata finally.
> 
> I wouldn't say that this really constitutes a "bug" in libata. If anything is
> a bug it was the decision to unmask HPA by default in IDE, and libata is just

We can be throwing the blame over IDE's and libata's fences all day long
and it won't move things forward a tiny bit.

> tripping over it in this case. There is a potential usability enhancement
> here, but anyway, such discussions really belong on the linux-ide list and
> not in Bugzilla.

From the user's perspective this is a kernel bug, and a *regression* on top of it.

The Big Picture thinking is clearly missing here...
Comment 26 Alan 2009-05-24 21:55:24 UTC
I'm not moving the bug anywhere yet

- If you have a wrong HPA you get file system errors logged, not random crashes (or a panic if it's swap)

So if he's getting random crashes (including some in X) then something else is up and it needs finding, whether it's a bug in the block layer failing to handle an overrun, an fs corrupting without logging an error, or whatever.

The original IDE HPA behavior was from Andre (I think) and came about because at the time 95% of the use of the HPA was with the jumper or software to make the drive look small enough for the BIOS to not explode on boot with big drives. So it made sense at the time.  The other 5% (which are the cases still using it today) generally hide stuff like reinstall images, firmware support and the like there which in a few cases you really really don't want to install.

Unfortunately the PC partition table doesn't give the needed info directly to autodetect which form is in use.
Comment 27 Alan 2009-05-24 21:58:39 UTC
From the perspective of the majority of modern users, ignoring the HPA breaks vendor RAID formats, GPT partitions, risks corrupting firmware and the like. There are *good* reasons for the change and it's been changed for a while for almost all users and distributions, so you'd get a serious regression problem if you changed it back anyway.

See the patches to expose both sizes to the kernel and the needed patches queued for dmraid for the direction on this. Libata can do rescanning of devices happily so once the dmraid and sysfs patches are done a runtime userspace triggered policy decision on hpa rescanning comes out in the wash.
Comment 28 Bartlomiej Zolnierkiewicz 2009-05-24 22:20:13 UTC
Hmm, wait.  Could it be that you're taking this bug as some other problem?

Where was there info about random crashes?  There was only info about the X driver failing repeatably.  OTOH there were file system errors logged (perfectly matching our theory):

[   31.156343] Buffer I/O error on device sda4, logical block 16884288
[   31.156921] attempt to access beyond end of device
[   31.156930] sda: rw=0, want=156296377, limit=156250000
[   31.156936] Buffer I/O error on device sda4, logical block 16884313
[   31.157272] attempt to access beyond end of device
[   31.157279] sda: rw=0, want=156296385, limit=156250000
[   31.157284] Buffer I/O error on device sda4, logical block 16884314
[   31.157594] attempt to access beyond end of device
[   31.157602] sda: rw=0, want=156296385, limit=156250000
[   31.157607] Buffer I/O error on device sda4, logical block 16884314
[   31.157902] attempt to access beyond end of device

[ That is why my reaction to dismissing the issue was quite hard... ]
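
A quick way to check that these errors line up with the HPA theory is to confirm that every rejected access ends between the current and native capacities. A small Python sketch, using only the log lines above and the native capacity from the earlier dmesg (the function name is illustrative, not an existing tool):

```python
import re

NATIVE_CAPACITY = 156301488  # from the "HPA detected" line in the earlier dmesg

ACCESS_RE = re.compile(r"want=(\d+), limit=(\d+)")


def accesses_in_hpa(log_text, native=NATIVE_CAPACITY):
    """Return (want, limit, inside_hpa) for each 'want=..., limit=...' entry.

    inside_hpa is True when the request ends beyond the visible limit
    but still within the drive's native capacity.
    """
    results = []
    for m in ACCESS_RE.finditer(log_text):
        want, limit = int(m.group(1)), int(m.group(2))
        results.append((want, limit, limit < want <= native))
    return results


log = """\
[   31.156930] sda: rw=0, want=156296377, limit=156250000
[   31.157279] sda: rw=0, want=156296385, limit=156250000
[   31.157602] sda: rw=0, want=156296385, limit=156250000
"""

for want, limit, inside in accesses_in_hpa(log):
    print(want - limit, inside)
```

All three requests end 46377 to 46385 sectors past the libata limit, which is inside the 51488-sector window hidden by the HPA.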

also:

Please stop trying to convince me that the change was good.  There is really no need for that.

What I'm complaining about is the downplaying of compatibility issues, i.e. the lack of any "IDE -> libata transition checklist" (preferably at http://linux-ata.org/) and the lack of kernel warnings about possible problems (so that somebody who hits them and looks at the kernel messages may get a hint about the possible cause).

IOW thinking about kernel as a whole, instead of "mine" vs "yours"
Comment 29 Christopher Hogan 2009-05-30 20:03:18 UTC
The suggestion to use "libata.ignore_hpa=1" fixed it for me. Thanks for the suggestion.

I did see the warning "hda: Host Protected Area disabled" when using BLK_DEV_PIIX. I assumed the drive had this capability and it was disabled. The differences between current and native capacity should have clued me in that HPA was enabled and ignored, not disabled.

The system is a single drive system. The hard drive is not original to the system. I'd rather use the space for files than for whatever was there. I didn't get a chance to see all the messages from libata as the root file system didn't mount and the kernel panicked.

Alan, I didn't file a separate bug as I've been away from my computer. When I came back, the above suggestion worked.

It looks like discussion of this bug has moved to other places and there is talk of patches, possibly for 2.6.31. I just wanted to post that my problem is solved and to thank everyone for their help.
