Bug 47151 - provide a file system block size of 8KB for certain SSDs.
Status: RESOLVED WILL_NOT_FIX
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 enhancement
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2012-09-06 20:34 UTC by Elmar Stellnberger
Modified: 2012-10-11 20:58 UTC

Kernel Version: 3.4.6-2.10-default
Regression: No

Description Elmar Stellnberger 2012-09-06 20:34:16 UTC
SanDisk support says that its 480GB SSDs have a block size of 8KB instead of the usual 4KB. Thus, in order to use that SSD, we would have to format it with a block size of 8KB. As SSDs get bigger and bigger, it is likely that other vendors will also sell SSDs with a block size of 8KB. Please consider supporting an ext4 block size of 8KB!

# mkfs.ext4 -b 8192 test.disk 
Warning: blocksize 8192 not usable on most systems.
mke2fs 1.42.4 (12-June-2012)
test.disk is not a block special device.
Proceed anyway? (y,n) y
mkfs.ext4: 8192-byte blocks too big for system (max 4096)
Proceed anyway? (y,n) y
Warning: 8192-byte blocks too big for system (max 4096), forced to continue
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=8192 (log=3)
Fragment size=8192 (log=3)
Stride=0 blocks, Stripe width=0 blocks
1280 inodes, 1280 blocks
64 blocks (5.00%) reserved for the super user
First data block=0
1 block group
65528 blocks per group, 65528 fragments per group
1280 inodes per group

Allocating group tables: done                            
Writing inode tables: done                            

Filesystem too small for a journal
Writing superblocks and filesystem accounting information: done

# mount -o loop test.disk /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so
# dmesg | tail
[ 2020.657698] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:44:d8:84:61:8c:33:86:dd SRC=fe80:0000:0000:0000:46d8:84ff:fe61:8c33 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=227 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=187 
[ 2020.657874] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:44:d8:84:61:8c:33:86:dd SRC=fe80:0000:0000:0000:46d8:84ff:fe61:8c33 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=227 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=187 
[ 2020.696054] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:44:d8:84:61:8c:33:86:dd SRC=fe80:0000:0000:0000:46d8:84ff:fe61:8c33 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=227 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=187 
[ 2020.992554] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:44:d8:84:61:8c:33:86:dd SRC=fe80:0000:0000:0000:46d8:84ff:fe61:8c33 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=431 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=391 
[ 2045.224133] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:44:d8:84:61:8c:33:86:dd SRC=fe80:0000:0000:0000:46d8:84ff:fe61:8c33 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=142 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=102 
[ 2188.427869] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:00:1b:63:2f:5e:93:86:dd SRC=fe80:0000:0000:0000:021b:63ff:fe2f:5e93 DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=111 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=71 
[ 2188.429149] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=33:33:00:00:00:fb:00:80:77:d8:ee:7d:86:dd SRC=fe80:0000:0000:0000:0280:77ff:fed8:ee7d DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=485 TC=0 HOPLIMIT=1 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=445 
[ 2192.171020] EXT4-fs (loop0): bad block size 8192
[ 2193.769124] SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC= SRC=fe80:0000:0000:0000:02a0:ccff:fed9:b3da DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=84 TC=0 HOPLIMIT=255 FLOWLBL=0 PROTO=UDP SPT=5353 DPT=5353 LEN=44 
[ 2246.621356] EXT4-fs (loop0): bad block size 8192
Comment 1 Alan 2012-09-07 15:32:45 UTC
This isn't a bug. It may be a feature, in which case send patches. However, the core Linux kernel code doesn't support a block size larger than the page size of the hardware (and you'd get memory fragmentation problems if you changed that).
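
The constraint described in this comment can be sketched as a small check. This is only an illustration, assuming a POSIX shell with `getconf` available; the function name `fits_page_size` is made up for this example and is not part of e2fsprogs or the kernel.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a block size (bytes) does not exceed
# the given page size (bytes).
fits_page_size() {
    block_size=$1
    page_size=$2
    [ "$block_size" -le "$page_size" ]
}

# Page size of the running system; falls back to 4096 (an assumption,
# typical for x86) if getconf is unavailable.
page_size=$(getconf PAGESIZE 2>/dev/null || echo 4096)

for bs in 1024 4096 8192; do
    if fits_page_size "$bs" "$page_size"; then
        echo "block size $bs: fits within page size $page_size"
    else
        echo "block size $bs: larger than page size $page_size"
    fi
done
```

On a system with 4096-byte pages, the 8192 case falls into the second branch, which matches the "bad block size 8192" message in the mount attempt from the description.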
Comment 2 Elmar Stellnberger 2012-09-07 18:03:21 UTC
  Perhaps a file system block size of 8KB and higher should only be supported if the user boots in an adequate paging mode. PAE paging and IA-32e paging, for instance, support a physical page size of 2MB. This may even be more efficient for certain usage scenarios (especially on machines with plenty of memory), as the page table would then be smaller.
  Another idea would simply be to keep a sufficient number of 8KB chunks and guarantee allocation in 8KB chunks, provided the allocation area or segment size is kept a multiple of 8KB and starts at an 8KB-aligned address. (I don't think it would be that hard to manage, since most segments are bigger than 4KB and 8KB chunks could also be used for these 4KB-chunked and -aligned segments.)

Btw., it was filed as an enhancement.
Comment 3 Elmar Stellnberger 2012-09-07 18:07:02 UTC
I.e. only allow a small part of the memory to hold 4KB fragments. In the worst case we would fall back to an effective emulated page size of 8KB.
Comment 4 Eric Sandeen 2012-09-07 18:12:05 UTC
(In reply to comment #0)
> Sandisk support says that its 480GB SSDs have a block size of 8KB instead of
> the usual 4KB. Thus in order to use that SSD we would have to format with a
> block size of 8KB.

Do you have one of these devices?  I'd be a little surprised if it's really not formattable with 4k blocks.  If you have the device, this might be interesting:

# blockdev --getss --getpbsz --getiomin --getioopt --getbsz /dev/sdX
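
As a rough companion to the command above, the same limits are also exposed as files under the device's sysfs queue directory. The sketch below is an illustration only: `print_io_limits` is a hypothetical helper name, and `sdX` remains a placeholder for the real device.

```shell
#!/bin/sh
# Hypothetical helper: print the I/O limits found in a sysfs queue
# directory such as /sys/block/sdX/queue. The directory is passed as an
# argument so the function can also be pointed at a copy of those files.
print_io_limits() {
    queue_dir=$1
    for attr in logical_block_size physical_block_size \
                minimum_io_size optimal_io_size; do
        if [ -r "$queue_dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$queue_dir/$attr")"
        else
            printf '%s=unknown\n' "$attr"
        fi
    done
}

# Usage (sdX is a placeholder):
# print_io_limits /sys/block/sdX/queue
```

A device that really required 8K blocks from the OS would be expected to report 8192 as its logical block size here, not just as its physical or minimum I/O size.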
Comment 5 Elmar Stellnberger 2012-09-29 12:22:20 UTC
Sorry, I don't have such a device at hand and could not get anyone who has one to respond to this report directly.

Veronika B. from support@sandisk.com writes:
>
> The only drive currently produced by SanDisk with 8K alignment is the
> Extreme SSD of 480GB. Only the drives with this capacity will be aligned
> to 8K.
>
Comment 6 Eric Sandeen 2012-10-01 14:50:12 UTC
OK, at this point we have no evidence that larger fs block sizes will be required for this SSD.  That email from SanDisk doesn't really tell us a whole lot; simply saying "8k alignment" isn't terribly informative.  Is that the minimum alignment, the physical alignment, etc.?  I think someone will need to get their hands on a device to see what it really advertises.  Scouting around the web, I've not seen any indication that this device has an 8k logical block size; if it did, there would surely be some noisy reviews out there about how it doesn't work on x86 Linux today, I think.
Comment 7 Eric Sandeen 2012-10-01 14:56:40 UTC
As Alan alluded to, block size > page size is not a trivial problem, so this isn't likely to happen any time soon.  I just wanted to clarify whether a device was really shipping that would require it for 4k page systems.

Thanks,
-Eric
Comment 8 Elmar Stellnberger 2012-10-05 17:26:56 UTC
8K is to be understood instead of the usual 4K as page size (smallest readable and writable entity). Eric: I doubt that anyone would file a bug report on it, as the only effect of using a wrong fs block size will be degraded performance and a shortened lifetime of the SSD (nobody would indeed find out about it who hasn't tried it with 8KB block size or heard about disks with page size != 4K). Feel free to care about it whenever you want, as we do not have any users requiring it yet (for me this isn't an issue, since I don't plan to use said disk at this time).
Comment 9 Elmar Stellnberger 2012-10-05 17:30:55 UTC
No, there seem to be plenty of users requiring it: my first search for 'SSD 8K file system block size' produced an immediate hit at position #1: http://forums.gentoo.org/viewtopic-t-935946.html?sid=6b0893c88505b69270c47558ff641326
Comment 10 Jeff Moyer 2012-10-05 18:07:33 UTC
(In reply to comment #8)
> 8K is to be understood instead of the usual 4K as page size (smallest
> readable and writable entity).

Internal to the device, it's the smallest I/O size.  But the device probably continues to export a 512-byte logical block size (or, more rarely, a 4k one).

> Eric: I doubt that anyone would file a bug report on it, as the only effect of
> using a wrong fs block size will be degraded performance and a shortened
> lifetime of the SSD (nobody would indeed find out about it who hasn't tried it
> with 8KB block size or heard about disks with page size != 4K).

Given that pretty much all file systems running on Linux and Windows will use a block size less than or equal to the page size (4k), you can bet that the SSD vendor took this into account when advertising both the performance of the drive and the lifetime.  I'll also note that this SSD isn't posing any new problems.  Consider RAID devices.  They would also like I/O in larger than page size quantities.  We don't have to provide that in order to get good performance, though, for a large number of reasons (which vary depending on the exact hardware and software configuration you're talking about).  Also keep in mind that the workload you intend to run will play a large role when determining what file system block size you choose (for example, to avoid unnecessary wasting of space).

> Feel free to care about it whenever you want, as we do not have any users
> requiring it yet (for me this isn't an issue, since I don't plan to use said
> disk at this time).

If you want to play around with things, try creating a 4k file system with a stripe width of 2.  Of course, you'd have to have the hardware to try this, and you don't, nor does anyone looking at this, so the discussion is moot.

The bottom line is that if someone wants to take on the work to get file system block sizes pushed beyond page size, we're all for it.  Lobbying for this to get done isn't going to help as much as doing the work or paying someone to do the work, though.
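
One way the "4k file system with a stripe width of 2" experiment suggested above might be set up is sketched below. It is an illustration under assumptions: `ext4_stripe_opts` is a hypothetical helper, the SSD is treated as a single-device stripe whose chunk is its 8K internal page, and `/dev/sdX1` is a placeholder. The `-b` flag and the `stride`/`stripe_width` extended options are documented in mke2fs(8); the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: derive ext4 stride/stripe_width (counted in file
# system blocks) from a device's internal page size and the fs block size.
# Treating the SSD as a one-device stripe is an assumption of this sketch.
ext4_stripe_opts() {
    device_page=$1   # bytes, e.g. 8192
    fs_block=$2      # bytes, e.g. 4096
    if [ $((device_page % fs_block)) -ne 0 ]; then
        echo "device page size must be a multiple of the block size" >&2
        return 1
    fi
    stride=$((device_page / fs_block))
    echo "stride=$stride,stripe_width=$stride"
}

# Print (do not run) the resulting mkfs command for an 8K-page device.
opts=$(ext4_stripe_opts 8192 4096) &&
    echo "mkfs.ext4 -b 4096 -E $opts /dev/sdX1"
```

Whether such hints make any measurable difference on a given SSD would have to be tested on real hardware, which nobody in this thread has.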
Comment 11 Eric Sandeen 2012-10-05 18:20:45 UTC
Agreed.  Random people on gentoo forums notwithstanding, I see no evidence that this device needs help from the kernel to function properly.  All SSDs have larger structures in use internally, and firmware to manage the fact that they expect to get smaller IOs from the OS.
Comment 12 Elmar Stellnberger 2012-10-10 09:28:13 UTC
SanDisk support has actually stopped responding. I think we can close this issue, as they do not seem to be interested in Linux support.
Comment 13 Elmar Stellnberger 2012-10-10 09:36:28 UTC
Oops, my fault; I overlooked the response:

> The 480GB Extreme SSD drive has to be aligned to 8K. It cannot be aligned to 
> smaller blocks, i.e. 4K. This issue will be addressed in the new firmware 
> release, however, we cannot announce any specific time frame regarding when 
> this release is going to happen.
>
Comment 14 Elmar Stellnberger 2012-10-10 09:41:25 UTC
I.e. the current drives don't have such a 4K emulation. At this time, however, I cannot say whether drives from other manufacturers may be affected by the same issue.
Comment 15 Alan 2012-10-10 10:14:16 UTC
I have talked to folks in the SSD world and the view seems to be:

- bigger than 4K block size is not a concern because the real device is just making up the disk block sizes anyway
- alignment according to the identify data is important (and is a matter for fdisk etc)
- this isn't particularly likely to change

So I think we should just close this as WONTFIX. Anyone who cares sufficiently and has a use case should instead submit patches.
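
The alignment point in the list above can be checked with simple arithmetic. The sketch below is an illustration only: `partition_is_aligned` is a hypothetical helper, and the 8 KiB boundary is taken from the SanDisk statement quoted earlier in this report, not from any identify data.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a partition starting at the given
# 512-byte sector lies on the given alignment boundary (bytes).
partition_is_aligned() {
    start_sector=$1
    alignment=$2
    [ $(( (start_sector * 512) % alignment )) -eq 0 ]
}

# Sector 63 is the traditional DOS partition start; sector 2048 is the
# common 1 MiB start used by newer partitioning tools.
for start in 63 2048; do
    if partition_is_aligned "$start" 8192; then
        echo "start sector $start: aligned to 8 KiB"
    else
        echo "start sector $start: NOT aligned to 8 KiB"
    fi
done

# For a real partition, the start sector (in 512-byte units) can be read
# from sysfs, e.g.: cat /sys/block/sdX/sdX1/start   (sdX1 is a placeholder)
```

A partition starting at sector 2048 is aligned to 8 KiB (and to any smaller power of two), so 4 KiB file system blocks inside it never straddle an 8 KiB device page.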
Comment 16 Elmar Stellnberger 2012-10-11 20:53:59 UTC
> the real device is just making up the disk block sizes anyway
Yes, in our case it would make up a block size of 8KB

> alignment according to the identify data is important 
No, only for HDDs, not for SSDs; here the alignment needs to match
the internal or exposed alignment of the SSD in order not to
ruin it by doubling or quadrupling all read-write cycles.

> this isn't particularly likely to change
Yes, folks of any chat channel are not aware of these issues.
Comment 17 Elmar Stellnberger 2012-10-11 20:58:41 UTC
The question is whether we need to support the few 8K drives out there that do not have a 4K emulation. I believe Linux support for SSDs is already very good; e.g. on Windows, only Windows 7 has TRIM support (afaik).
