On Mon, Dec 1, 2025 at 8:56 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> Commit faeedbcefd4 changed the alignment of WAL buffers from XLOG_BLCKSZ
> to PG_IO_ALIGN_SIZE.
>
> While looking around for places to apply alignas, I think I found at
> least two places that were forgotten, namely in BootStrapXLOG() and in
> pg_test_fsync.c. Patches attached for those.
/* page buffer must be aligned suitably for O_DIRECT */
- buffer = (char *) palloc(XLOG_BLCKSZ + XLOG_BLCKSZ);
- page = (XLogPageHeader) TYPEALIGN(XLOG_BLCKSZ, buffer);
+ buffer = (char *) palloc(XLOG_BLCKSZ + PG_IO_ALIGN_SIZE);
+ page = (XLogPageHeader) TYPEALIGN(PG_IO_ALIGN_SIZE, buffer);
Could save a line with palloc_aligned()?
. o O { If we did that palloc_object() trick I mentioned the other day
I suppose we could also palloc_object(PGAlignedXLogBlock)... }
> I also suspect that the TYPEALIGN call in XLOGShmemInit() should take
> PG_IO_ALIGN_SIZE into account, but it's not immediately obvious how,
> since the comment also mentions that it wants alignment on "a full xlog
> block size boundary". Maybe Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE)?
I guess it was worrying about either direct I/O (to work) or VM pages
(cf ALIGNOF_BUFFER)? PG_IO_ALIGN_SIZE should be enough I think.
> I also wonder whether the check in check_debug_io_direct() for #if
> XLOG_BLCKSZ < PG_IO_ALIGN_SIZE would be required if we fixed all those
> places?
I don't think so, because the undersized blocks in the array can't all
satisfy PG_IO_ALIGN_SIZE. Also, not only the base address but also
the length of I/Os or iovecs have to be aligned on 4KB pages to
benefit from physical I/O direct to/from user space memory, at least
in our handwavy portable model of this stuff*. It might work on some
systems though, it just didn't seem useful enough to research which
ones...
That's probably why I didn't change those places above: either
XLOG_BLCKSZ is itself already big enough, or it isn't and there's not
much we can do about it except block direct I/O from being turned on.
PG_IO_ALIGN_SIZE is probably still a better choice though, even if it
doesn't really work: if you set the block size very large, it would be
wasteful to over-align the memory.
*The real alignment requirements are way more complicated and hiding
in many layers of kernels and drivers and protocols... Linux says it
requires only 512 bytes these days (used to be 4KB IIRC), but first
gen NVMe required 4KB alignment except for the first and last entries
of its PRP for vectored I/O (huh, what does Linux do about that?),
FreeBSD works in terms of 4KB pages at various levels, and Windows
even uses 4KB page lists as arguments to its direct I/O-only
ReadFileScatter()/WriteFileGather() system calls (that we aren't using
yet, but one day). 4KB seems to be enough everywhere and also matches
common VM page size, and also helps buffered I/O too by touching fewer
pages.