Bug 12556 - pgoff_t type not wide enough (32-bit with LFS and/or LBD)
Summary: pgoff_t type not wide enough (32-bit with LFS and/or LBD)
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All Linux
Importance: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-27 16:50 UTC by Marc Aurele La France
Modified: 2013-12-10 16:25 UTC (History)
3 users

See Also:
Kernel Version: >= 2.6.20
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Marc Aurele La France 2009-01-27 16:50:19 UTC
Latest working kernel version:  None
Earliest failing kernel version:  2.6.20
Distribution:  Not relevant
Hardware Environment:  32-bit
Software Environment:  CONFIG_LBD and/or CONFIG_LSF
Problem Description:

On 32-bit archs, CONFIG_LBD allows for block devices larger than 2TB (with a 512-byte sector size), while CONFIG_LSF allows for files larger than 16TB (with a 4K block size).  However, reads (through the cache) of such things beyond the first 16TB result in bogus data.  Writes, oddly enough, are OK.  This is because the pgoff_t type is not wide enough.  See read_dev_sector()'s call to read_mapping_page() in fs/partitions/check.c for an example.

A Q&D fix for this is to have that snippet in include/linux/types.h read ...

/*
 * The type of an index into the pagecache.  Use a #define so asm/types.h
 * can override it.
 */
#ifndef pgoff_t
#if defined(CONFIG_LBD) || defined(CONFIG_LSF)
#define pgoff_t u64
#else
#define pgoff_t unsigned long
#endif
#endif

This problem occurs in all kernels since 2.6.20 (I'm currently running 2.6.25).

Steps to reproduce:

(Simple case)

Use a 32-bit system to define GPT partitions with GNU parted on a block device larger than 16TB and watch the kernel log for complaints from fs/partitions/efi.c, specifically "Alternate GPT is invalid, using primary GPT."
Comment 1 Marc Aurele La France 2009-02-04 13:18:04 UTC
Is there anybody in there?

I ask because I will soon be losing my ability to test an alternate solution.
Comment 2 Anonymous Emailer 2009-03-13 13:00:58 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 13 Mar 2009 07:15:38 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12556

> On 32-bit archs, CONFIG_LBD allows for block devices larger than 2TB (with a
> 512-byte sector size), while CONFIG_LFS allows for files larger than 16TB
> (with
> a 4K block size).  However, reads (through the cache) of such things beyond
> the
> first 16TB result in bogus data.  Writes, oddly enough, are OK.  This is
> because the pgoff_t type is not wide enough.  See read_dev_sector()'s call to
> read_mapping_page() in fs/partitions/check.c for an example.
> 
> A Q&D fix for this is to have that snippet in include/linux/types.h read ...
> 
> /*
>  * The type of an index into the pagecache.  Use a #define so asm/types.h
>  * can override it.
>  */
> #ifndef pgoff_t
> #if defined(CONFIG_LBD) || defined(CONFIG_LSF)
> #define pgoff_t u64
> #else
> #define pgoff_t unsigned long
> #endif
> #endif
> 

ouch.

We never had any serious intention of implementing 64-bit pagecache
indexes on 32-bit architectures.  I added pgoff_t mainly for code
clarity reasons (it was getting nutty in there), with a vague
expectation that we would need to use a 64-bit type one day.

And, yes, the need to be able to manipulate block devices via the
pagecache does mean that this day is upon us.

A full implementation is quite problematic.  Such a change affects each
filesystem, many of which are old and crufty and which nobody
maintains.  The cost of bugs in there (and there will be bugs) is
corrupted data in rare cases for few people, which is bad.

Perhaps what we should do is to add a per-filesystem flag which says
"this fs is OK with 64-bit page indexes", and turn that on within each
filesystem as we convert and test them.  Add checks to the VFS to
prevent people from extending files to more than 16TB on unverified
filesystems.  Hopefully all of this infrastructure is already in place
via super_block.s_maxbytes, and we can just forget about >16TB _files_.

And fix up the core VFS if there are any problems, and get pagecache IO
reviewed, tested and working for the blockdev address_spaces.

I expect it's all pretty simple, actually.  Mainly a matter of doing a
few hours code review to clean up those places where we accidentally
copy a pgoff_t to or from a long type.

The fact that the kernel apparently already works correctly when one simply
makes pgoff_t a u64 is surprising and encouraging and unexpected.  I
bet it doesn't work 100% properly!
Comment 3 Marc Aurele La France 2009-03-14 10:09:09 UTC
On Fri, 13 Mar 2009, Andrew Morton wrote:

> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).

OK.  Thanks for responding.

> We never had any serious intention of implementing 64-bit pagecache
> indexes on 32-bit architectures.  I added pgoff_t mainly for code
> clarity reasons (it was getting nutty in there), with a vague
> expectation that we would need to use a 64-bit type one day.

> And, yes, the need to be able to manipulate block devices via the
> pagecache does mean that this day is upon us.

I'm somewhat surprised this didn't come up back in 2.6.20, when LBD and LSF 
were first introduced.

> A full implementation is quite problematic.  Such a change affects each
> filesystem, many of which are old and crufty and which nobody
> maintains.  The cost of bugs in there (and there will be bugs) is
> corrupted data in rare cases for few people, which is bad.

> Perhaps what we should do is to add a per-filesystem flag which says
> "this fs is OK with 64-bit page indexes", and turn that on within each
> filesystem as we convert and test them.  Add checks to the VFS to
> prevent people from extending files to more than 16TB on unverified
> filesystems.  Hopefully all of this infrastructure is already in place
> via super_block.s_maxbytes, and we can just forget about >16TB _files_.

> And fix up the core VFS if there are any problems, and get pagecache IO
> reviewed, tested and working for the blockdev address_spaces.

> I expect it's all pretty simple, actually.  Mainly a matter of doing a
> few hours code review to clean up those places where we accidentally
> copy a pgoff_t to or from a long type.

> The fact that the kernel apparently already works correctly when one simply
> makes pgoff_t a u64 is surprising and encouraging and unexpected.  I
> bet it doesn't work 100% properly!

True enough.  The kernel is rife with nooks and crannies.  So I can't 
vouch for all of them.  But, this has already undergone some stress.  For a 
period of three weeks, I had this in production on a cluster's NFS server 
that also does backups and GFS2.  Even before then, periods of load averages 
of 50 or more and heavy paging were not unusual.  It was, in part, to address 
that load that this system has since been replaced with more capable and, 
unfortunately for this bug report, 64-bit hardware.

I also don't share your "doom and gloom" assessment.  First, unless a 
filesystem is stupid enough to store a pgoff_t on disc (a definite bug 
candidate), it doesn't really matter what the kernel's internal 
representation of one is, as long as it is wide enough.  Second, this is an 
unsigned quantity, so, barring compiler bugs, sign-extension issues cannot 
occur.

Third, the vast majority of filesystems in the wild are less than 16TB in 
size.  This is true whether or not the filesystem type used can handle more.  Here, 
all pgoff_t values fit in 32 bits and are therefore immune to any, even 
random, zero-extensions to 64 bits and truncations back to 32 bits that might 
internally occur.  A similar argument can be made for the bulk of block 
devices out there that are also no larger than 16TB.

This leaves us with the rare >16TB situations.  But, wait.  Isn't that the 
bug we're talking about?  Of these, I can tell you that a 23TB GFS2 
filesystem is much more stable with this change than it is without.  And, on 
a 32-bit system, a swap device that large can't be fully used anyway.

There's also the fact that, as things stand now, a pgoff_t's size doesn't 
affect interoperability among 32-bit and 64-bit systems.

All in all, I think you're selling yourself short WRT the correctness of your 
introduction of pgoff_t.

Thanks again.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi@ualberta.ca         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+
XFree86 developer and VP.  ATI driver and X server internals.
Comment 4 Nick Piggin 2009-03-14 17:58:40 UTC
On Sunday 15 March 2009 04:08:37 Marc Aurele La France wrote:
> On Fri, 13 Mar 2009, Andrew Morton wrote:
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
>
> OK.  Thanks for responding.
>
> > We never had any serious intention of implementing 64-bit pagecache
> > indexes on 32-bit architectures.  I added pgoff_t mainly for code
> > clarity reasons (it was getting nutty in there), with a vague
> > expectation that we would need to use a 64-bit type one day.
> >
> > And, yes, the need to be able to manipulate block devices via the
> > pagecache does mean that this day is upon us.
>
> I'm somewhat surprised this didn't come up back in 2.6.20, when LBD and LSF
> were first introduced.

They increased the width of the type that stores sectors, which was needed
for > 2TB devices.  The pagecache index was enough for several more
doublings after that.


> > A full implementation is quite problematic.  Such a change affects each
> > filesystem, many of which are old and crufty and which nobody
> > maintains.  The cost of bugs in there (and there will be bugs) is
> > corrupted data in rare cases for few people, which is bad.
> >
> > Perhaps what we should do is to add a per-filesystem flag which says
> > "this fs is OK with 64-bit page indexes", and turn that on within each
> > filesystem as we convert and test them.  Add checks to the VFS to
> > prevent people from extending files to more than 16TB on unverified
> > filesystems.  Hopefully all of this infrastructure is already in place
> > via super_block.s_maxbytes, and we can just forget about >16TB _files_.
> >
> > And fix up the core VFS if there are any problems, and get pagecache IO
> > reviewed, tested and working for the blockdev address_spaces.
> >
> > I expect it's all pretty simple, actually.  Mainly a matter of doing a
> > few hours code review to clean up those places where we accidentally
> > copy a pgoff_t to or from a long type.
> >
> > The fact that the kernel apparently already works correctly when one
> > simply makes pgoff_t a u64 is surprising and encouraging and unexpected. 
> > I bet it doesn't work 100% properly!
>
> True enough.  The kernel is rife with nooks and cranies.  So I can't
> vouch for all of them.  But, this has already undergone some stress.  For a
> period of three weeks, I had this in production on a cluster's NFS server
> that also does backups and GFS2.  Even before then, periods of load
> averages of 50 or more and heavy paging were not unusual.  It was, in part,
> to address that load that this system has since been replaced with more
> capable and, unfortunately for this bug report, 64-bit hardware.
>
> I also don't share your "doom and gloom" assessment.  First, unless a
> filesystem is stupid enough to store a pgoff_t on disc (a definite bug
> candidate), it doesn't really matter what the kernel's internal
> representation of one is, as long as it is wide enough.  Secondly, this is
> an unsigned quantity, so, barring compiler bugs, sign-extension issues
> cannot occur.
>
> Third, the vast majority of filesystems in the wild are less than 16TB in
> size.  This whether or not the filesystem type used can handle more.  Here,
> all pgoff_t values fit in 32 bits and are therefore immune to any, even
> random, zero-extensions to 64 bits and truncations back to 32 bits that
> might internally occur.  A similar argument can be made for the bulk of
> block devices out there that are also no larger than 16TB.
>
> This leaves us with the rare >16TB situations.  But, wait.  Isn't that the
> bug we're talking about?  Of these, I can tell you that a 23TB GFS2
> filesystem is much more stable with this change than it is without.  And,
> on a 32-bit system, a swap device that large can't be fully used anyway.
>
> There's also the fact that, as things stand now, a pgoff_t's size doesn't
> affect interoperability among 32-bit and 64-bit systems.

I don't know how this could possibly be working correctly considering
that the radix tree that is the basic store of pagecache pages uses
keys typed unsigned long. And that was just the first place I looked
at.


> All in all, I think you're selling yourself short WRT the correctness of
> your introduction of pgoff_t.

The assessment is more like realism than doom and gloom. pgoff_t is
the right way to fix this, but right now the 32-bit kernel is obviously
not going to work with > 32-bit index, and yes I agree with Andrew it
does need some kind of audit of each filesystem before we can tell
users it is supported. Maybe sparse could help with that...?
