Thread: Re: 8Kb or 4Kb ext4 filesystem page size

From: George Neuner

On Wed, 29 May 2019 15:10:06 -0300, Alexandre hadjinlian guerra 
<alexhguerra@gmail.com> wrote:
> Hi
>
> Given that postgres uses 8Kb pages, im wondering why i couldnt see any
> tests at all which would format ext4 partition to 8Kb pages. Im about to do
> some tests, but any knowledge about such lack of tests on the internet
> makes me wonder if im looking poorly or just lack of testing. besides, i do
> ask if the following link remain true given XFS and EXT4 evolution since
> 2015
> https://blog.pgaddict.com/posts/postgresql-performance-on-ext4-and-xfs
>
> Thanks

One thing no one has yet mentioned is that I/O performance could suffer 
greatly if the disk page size is larger than the memory page size, 
because a single disk page is then split over multiple VMM page frames, 
which may be discontiguous in physical memory.

That is a problem for bus-mastering disk controllers because their DMA 
can operate only on *physical* addresses - not the logical addresses 
used by the programs.  Pages touched by external DMA need to be pinned 
(locked in place) for the duration.

There is also DMA built-in on the system board.  Typically built-in DMA 
can work through the MMU with logical addresses and so (usually) does 
not need to pin memory pages to access them.

But it's up to the device driver which DMA (if any) is used.  Most 
bus-mastering devices prefer to use their own DMA hardware, and their 
drivers either have to pin memory pages or work through a small(ish) 
buffer in a fixed location (which entails extraneous copying of data).


PostgreSQL's 8KB logical disk pages take up two 4KB memory pages - which 
may not be adjacent - but since the filesystem and memory pages are the 
same size (4KB), DMA (built-in or external) can access the pages in any 
order, without employing (or even needing) scatter/gather ability to 
coalesce or distribute partial pages to/from non-contiguous locations.
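
To illustrate the contrast, here is a minimal sketch (Python, not real 
driver code) of what a driver would have to work out if one 8KB disk 
page had to be moved as a single unit into 4KB frames.  The function 
name sg_segments and the frame numbers are made up for the example, and 
the 4KB page size is an assumption.

```python
PAGE_SIZE = 4096  # assumed memory page size in bytes


def sg_segments(frame_numbers, page_size=PAGE_SIZE):
    """Coalesce page frames into (physical_address, length) DMA segments.

    frame_numbers: hypothetical physical frame numbers backing one
    buffer, listed in buffer order.
    """
    segments = []
    for frame in frame_numbers:
        addr = frame * page_size
        if segments and segments[-1][0] + segments[-1][1] == addr:
            # This frame directly follows the previous segment: extend it.
            start, length = segments[-1]
            segments[-1] = (start, length + page_size)
        else:
            # Not adjacent: a new segment (one more scatter/gather entry).
            segments.append((addr, page_size))
    return segments


# An 8KB disk page held in two adjacent frames is one contiguous transfer.
print(sg_segments([10, 11]))  # [(40960, 8192)]
# Held in two non-adjacent frames, it needs two separate segments.
print(sg_segments([10, 57]))  # [(40960, 4096), (233472, 4096)]
```

With 4KB filesystem pages each transfer is already a single frame, so 
the second case never arises for one disk page.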

Another consideration for disk page size is that program code typically 
is paged in directly from the executable file.  If the disk and memory 
pages aren't the same size, the OS page fault handler needs to be aware 
of the difference and able to deal with it. Obviously this could be addressed 
simply by segregating "large" pages to separate data-only volumes.
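
If you want to see how the two sizes compare on a given volume, 
something like the following should work on a Unix-like system.  It is 
only a sketch: the function name is made up, and it assumes that the 
f_bsize value reported by statvfs reflects the filesystem block size, 
which is not guaranteed for every filesystem.

```python
import mmap
import os


def compare_block_and_page_size(path="."):
    """Return (fs_block_size, memory_page_size, same_size) for path's volume."""
    # Assumption: f_bsize is the block size of the filesystem holding `path`.
    fs_block = os.statvfs(path).f_bsize
    # Memory page size as seen by this process.
    mem_page = mmap.PAGESIZE
    return fs_block, mem_page, fs_block == mem_page


fs_block, mem_page, same = compare_block_and_page_size(".")
print("filesystem block:", fs_block, "memory page:", mem_page, "same:", same)
```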

AFAIK, few CPUs offer 8KB memory pages: the Itanium has it as an 
option, and SPARC and Alpha use 8KB as their base page size, but x86 
does not.  The "large" / "huge" memory pages available in most CPUs 
today are too big to be used effectively by a filesystem.
https://en.wikipedia.org/wiki/Page_(computer_memory)#Multiple_page_sizes
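
For a sense of scale, here is the arithmetic as a small sketch.  The 
2MB and 1GB figures are the x86-64 huge page sizes, used here only as an 
illustration; the function name is made up.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def blocks_per_memory_page(memory_page_bytes, disk_block_bytes=8 * KIB):
    """How many disk blocks of the given size fit in one memory page."""
    return memory_page_bytes // disk_block_bytes


# One 2MB huge page would span 256 of PostgreSQL's 8KB pages.
print(blocks_per_memory_page(2 * MIB))  # 256
# One 1GB huge page would span 131072 of them.
print(blocks_per_memory_page(1 * GIB))  # 131072
```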

Rewriting filesystem drivers and the memory manager so that 8KB or 
larger disk pages could be treated as a sort of "huge" memory page - 
overlaid on adjacent 4KB physical memory pages - would be a massive 
job.  Since few programs other than DBMS really would benefit from it, 
it isn't likely to happen.

YMMV,
George