Re: Initdb-time block size specification - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: Initdb-time block size specification
Date	June 30, 2023 22:51:18
Msg-id	20230630225118.nfe2sx6pso6vrtdn@awork3.anarazel.de
In response to	Re: Initdb-time block size specification (Bruce Momjian <bruce@momjian.us>)
Responses	Re: Initdb-time block size specification
List	pgsql-hackers

Hi,

On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:11, Andres Freund wrote:
> > > I suspect you're going to see more benefits from going to a *lower* setting
> > > than a higher one. Some practical issues aside, plenty of storage hardware
> > > these days would allow to get rid of FPIs if you go to 4k blocks (although it
> > > often requires explicit sysadmin action to reformat the drive into that mode
> > > etc).  But obviously that's problematic from the "postgres limits" POV.
> > >
> >
> > I wonder what are the conditions/options for disabling FPI. I kinda
> > assume it'd apply to new drives with 4k sectors, with properly aligned
> > partitions etc. But I haven't seen any particularly clear confirmation
> > that's correct.
>
> I don't think we have ever had to study this --- we just request the
> write to the operating system, and we either get a successful reply or
> we go into WAL recovery to reread the pre-image.  We never really care
> if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4
> 2kB writes --- we don't care --- we only care if they are all done or
> not.

Well, that works because we have FPI. This sub-discussion is motivated by
getting rid of FPIs.

> For a 4kB write, to say it is not partially written would be to require
> the operating system to guarantee that the 4kB write is not split into
> smaller writes which might each be atomic because smaller atomic writes
> would not help us.

That's why we're talking about drives with 4k sector size - you *can't* split
the writes below that.

The problem is that, as far as I know, it's not always obvious what block size
is being used on the actual storage level.  It's not even trivial when
operating on a filesystem directly stored on a single block device ([1]). Once
there's things like LVM or disk encryption involved, it gets pretty hairy
([2]).  Once you know all the block devices, it's not too bad, but ...

Greetings,

Andres Freund

[1] On Linux I think you need to use stat() to figure out the st_dev for a
file, then look in /proc/self/mountinfo for the block device, then use the
name of that device to look in /sys/block/$d/queue/physical_block_size.
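
The recipe in [1] can be sketched roughly as follows. This is a minimal Linux-only illustration, not anything PostgreSQL does today. The function names are hypothetical, and two details are assumptions beyond the text: the lookup goes through /sys/class/block (so that a partition name can be resolved too) and falls back to the parent disk's queue directory when the device is a partition. It also assumes st_dev matches an entry in mountinfo, which is not the case on every filesystem.

```python
import os
from pathlib import Path


def find_mount_source(mountinfo_text, major, minor):
    """Return the mount source of the first mountinfo entry with this device number."""
    wanted = f"{major}:{minor}"
    for line in mountinfo_text.splitlines():
        # Fields before " - " describe the mount, fields after it the filesystem.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        left_fields = left.split()
        right_fields = right.split()
        # left_fields[2] is "major:minor", right_fields[1] is the mount source.
        if len(left_fields) > 2 and len(right_fields) > 1 and left_fields[2] == wanted:
            return right_fields[1]
    return None


def physical_block_size(dev_name, sysfs_root="/sys"):
    """Read queue/physical_block_size for a block device name such as 'sda' or 'sda1'."""
    dev_dir = (Path(sysfs_root) / "class" / "block" / dev_name).resolve()
    queue = dev_dir / "queue"
    if not queue.is_dir():
        # Assumption: a partition has no queue directory of its own,
        # so use the one belonging to the parent disk.
        queue = dev_dir.parent / "queue"
    return int((queue / "physical_block_size").read_text())


def block_size_for_file(path):
    """stat() the file, map st_dev to a device via mountinfo, then ask sysfs."""
    st = os.stat(path)
    with open("/proc/self/mountinfo") as f:
        source = find_mount_source(f.read(), os.major(st.st_dev), os.minor(st.st_dev))
    if source is None or not source.startswith("/dev/"):
        return None  # no block device found (e.g. tmpfs, network filesystem)
    # Resolve symlinks such as /dev/mapper/<name> to the actual device node.
    dev_name = os.path.basename(os.path.realpath(source))
    return physical_block_size(dev_name)
```

Calling block_size_for_file() on a data directory path would return the size reported for the top-level block device only, which is exactly the limitation described in [2].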

[2] The above doesn't work because e.g. a device mapper target might only
support 4k sectors, even though the sectors on the underlying storage device
are 512-byte sectors. E.g. my root filesystem is encrypted, and if you follow the
above recipe (with the added step of resolving the symlink to know the actual
device name), you would see a 4k sector size, even though the underlying NVMe
disk only supports 512-byte sectors.
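
One way to get at "all the block devices" behind a stacked device is to walk the slaves directories that sysfs exposes for device mapper and similar devices. The sketch below is an assumption about how that could be done, not something from the thread. The function name leaf_devices is hypothetical, and it assumes every layer shows up under /sys/class/block/<name>/slaves.

```python
import os
from pathlib import Path


def leaf_devices(dev_name, sysfs_root="/sys"):
    """Return the names of the bottom-most devices underneath dev_name."""
    slaves_dir = Path(sysfs_root) / "class" / "block" / dev_name / "slaves"
    # A device without a slaves directory, or with an empty one, is treated as a leaf.
    slaves = sorted(os.listdir(slaves_dir)) if slaves_dir.is_dir() else []
    if not slaves:
        return [dev_name]
    leaves = []
    for slave in slaves:
        # Recurse, since layers can be stacked (e.g. encryption on top of LVM).
        for leaf in leaf_devices(slave, sysfs_root):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves
```

For an encrypted root filesystem this would be expected to go from the dm-N device down to the partition on the NVMe disk, whose queue/physical_block_size can then be compared with the value reported for the device mapper target.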
