Thread: Relation bulk write facility

Relation bulk write facility

From
Heikki Linnakangas
Date:
Several places bypass the buffer manager and use direct smgrextend() 
calls to populate a new relation: Index AM build methods, rewriteheap.c 
and RelationCopyStorage(). There's fair amount of duplicated code to 
WAL-log the pages, calculate checksums, call smgrextend(), and finally 
call smgrimmedsync() if needed. The duplication is tedious and 
error-prone. For example, if we want to optimize by WAL-logging multiple 
pages in one record, that needs to be implemented in each AM separately. 
Currently only sorted GiST index build does that but it would be equally 
beneficial in all of those places.

And I believe we got the smgrimmedsync() logic slightly wrong in a 
number of places [1]. And it's not great for latency, we could let the 
checkpointer do the fsyncing lazily, like Robert mentioned in the same 
thread.

The attached patch centralizes that pattern to a new bulk writing 
facility, and changes all those AMs to use it. The facility buffers 32 
pages and WAL-logs them in record, calculates checksums. You could 
imagine a lot of further optimizations, like writing those 32 pages in 
one vectored pvwrite() call [2], and not skipping the buffer manager 
when the relation is small. But the scope of this initial version is 
mostly to refactor the existing code.

One new optimization included here is to let the checkpointer do the 
fsyncing if possible. That gives a big speedup when e.g. restoring a 
schema-only dump with lots of relations.

[1] 
https://www.postgresql.org/message-id/58effc10-c160-b4a6-4eb7-384e95e6f9e3%40iki.fi

[2] 
https://www.postgresql.org/message-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 19/09/2023 17:13, Heikki Linnakangas wrote:
> The attached patch centralizes that pattern to a new bulk writing
> facility, and changes all those AMs to use it.

Here's a new rebased version of the patch.

This includes fixes to the pageinspect regression test. They were 
explained in the commit message, but I forgot to include the actual test 
changes. That should fix the cfbot failures.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Andres Freund
Date:
Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> The new facility makes it easier to optimize bulk loading, as the
> logic for buffering, WAL-logging, and syncing the relation only needs
> to be implemented once. It's also less error-prone: We have had a
> number of bugs in how a relation is fsync'd - or not - at the end of a
> bulk loading operation. By centralizing that logic to one place, we
> only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.


> The new facility is faster for small relations: Instead of of calling
> smgrimmedsync(), we register the fsync to happen at next checkpoint,
> which avoids the fsync latency. That can make a big difference if you
> are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.


> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> +
>      nblocks = smgrnblocks(src, forkNum);
>  
>      for (blkno = 0; blkno < nblocks; blkno++)
>      {
> +        Page        page;
> +
>          /* If we got a cancel signal during the copy of the data, quit */
>          CHECK_FOR_INTERRUPTS();
>  
> -        smgrread(src, forkNum, blkno, buf.data);
> +        page = bulkw_alloc_buf(bulkw);
> +        smgrread(src, forkNum, blkno, page);
>  
>          if (!PageIsVerifiedExtended(page, blkno,
>                                      PIV_LOG_WARNING | PIV_REPORT_STAT))
> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>           * page this is, so we have to log the full page including any unused
>           * space.
>           */
> -        if (use_wal)
> -            log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> -
> -        PageSetChecksumInplace(page, blkno);
> -
> -        /*
> -         * Now write the page.  We say skipFsync = true because there's no
> -         * need for smgr to schedule an fsync for this write; we'll do it
> -         * ourselves below.
> -         */
> -        smgrextend(dst, forkNum, blkno, buf.data, true);
> +        bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?


> +/*-------------------------------------------------------------------------
> + *
> + * bulk_write.c
> + *      Efficiently and reliably populate a new relation
> + *
> + * The assumption is that no other backends access the relation while we are
> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
> + * buffer manager as usual, if performance is not critical, but you must not
> + * mix operations through the buffer manager and the bulk loading interface at
> + * the same time.

From "Alternatively" onward this is is somewhat confusing.


> + * We bypass the buffer manager to avoid the locking overhead, and call
> + * smgrextend() directly.  A downside is that the pages will need to be
> + * re-read into shared buffers on first use after the build finishes.  That's
> + * usually a good tradeoff for large relations, and for small relations, the
> + * overhead isn't very significant compared to creating the relation in the
> + * first place.

FWIW, I doubt the "isn't very significant" bit is really true.


> + * One tricky point is that because we bypass the buffer manager, we need to
> + * register the relation for fsyncing at the next checkpoint ourselves, and
> + * make sure that the relation is correctly fsync by us or the checkpointer
> + * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.



> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *      src/backend/storage/smgr/bulk_write.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/xloginsert.h"
> +#include "access/xlogrecord.h"
> +#include "storage/bufmgr.h"
> +#include "storage/bufpage.h"
> +#include "storage/bulk_write.h"
> +#include "storage/proc.h"
> +#include "storage/smgr.h"
> +#include "utils/rel.h"
> +
> +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
> +
> +typedef struct BulkWriteBuffer
> +{
> +    Page        page;
> +    BlockNumber blkno;
> +    bool        page_std;
> +    int16        order;
> +} BulkWriteBuffer;
> +

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...


> +/*
> + * Bulk writer state for one relation fork.
> + */
> +typedef struct BulkWriteState
> +{
> +    /* Information about the target relation we're writing */
> +    SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
is merged?


> +    ForkNumber    forknum;
> +    bool        use_wal;
> +
> +    /* We keep several pages buffered, and WAL-log them in batches */
> +    int            nbuffered;
> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> +
> +    /* Current size of the relation */
> +    BlockNumber pages_written;
> +
> +    /* The RedoRecPtr at the time that the bulk operation started */
> +    XLogRecPtr    start_RedoRecPtr;
> +
> +    Page        zeropage;        /* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?


> +/*
> + * Write all buffered pages to disk.
> + */
> +static void
> +bulkw_flush(BulkWriteState *bulkw)
> +{
> +    int            nbuffered = bulkw->nbuffered;
> +    BulkWriteBuffer *buffers = bulkw->buffers;
> +
> +    if (nbuffered == 0)
> +        return;
> +
> +    if (nbuffered > 1)
> +    {
> +        int            o;
> +
> +        qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> +
> +        /*
> +         * Eliminate duplicates, keeping the last write of each block.
> +         * (buffer_cmp uses 'order' as the last sort key)
> +         */

Huh - which use cases would actually cause duplicate writes?


Greetings,

Andres Freund



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 19/11/2023 02:04, Andres Freund wrote:
> On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
>> The new facility makes it easier to optimize bulk loading, as the
>> logic for buffering, WAL-logging, and syncing the relation only needs
>> to be implemented once. It's also less error-prone: We have had a
>> number of bugs in how a relation is fsync'd - or not - at the end of a
>> bulk loading operation. By centralizing that logic to one place, we
>> only need to write it correctly once.
> 
> One thing I'd like to use the centralized handling for is to track such
> writes in pg_stat_io. I don't mean as part of the initial patch, just that
> that's another reason I like the facility.

Oh I didn't realize they're not counted at the moment.

>> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
>> +
>>       nblocks = smgrnblocks(src, forkNum);
>>   
>>       for (blkno = 0; blkno < nblocks; blkno++)
>>       {
>> +        Page        page;
>> +
>>           /* If we got a cancel signal during the copy of the data, quit */
>>           CHECK_FOR_INTERRUPTS();
>>   
>> -        smgrread(src, forkNum, blkno, buf.data);
>> +        page = bulkw_alloc_buf(bulkw);
>> +        smgrread(src, forkNum, blkno, page);
>>   
>>           if (!PageIsVerifiedExtended(page, blkno,
>>                                       PIV_LOG_WARNING | PIV_REPORT_STAT))
>> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>>            * page this is, so we have to log the full page including any unused
>>            * space.
>>            */
>> -        if (use_wal)
>> -            log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
>> -
>> -        PageSetChecksumInplace(page, blkno);
>> -
>> -        /*
>> -         * Now write the page.  We say skipFsync = true because there's no
>> -         * need for smgr to schedule an fsync for this write; we'll do it
>> -         * ourselves below.
>> -         */
>> -        smgrextend(dst, forkNum, blkno, buf.data, true);
>> +        bulkw_write(bulkw, blkno, page, false);
> 
> I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> change, it looks like it'll just leak memory. It also might be taken to be
> valid until freed, which I don't think is the case?

Yeah, I'm not very happy with this interface. The model is that you get 
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it 
over to bulkw_write(), which takes ownership of it and frees it later. 
There is no other function to free it, although currently the buffer is 
just palloc'd so you could call pfree on it.

However, I'd like to not expose that detail to the callers. I'm 
imagining that in the future we might optimize further, by having a 
larger e.g. 1 MB buffer, and carve the 8kB blocks from that. Then 
opportunistically, if you fill the buffers sequentially, bulk_write.c 
could do one smgrextend() to write the whole 1 MB chunk.

I renamed it to bulkw_get_buf() now, and made it return a new 
BulkWriteBuffer typedef instead of a plain Page. The point of the new 
typedef is to distinguish a buffer returned by bulkw_get_buf() from a 
Page or char[BLCKSZ] that you might palloc on your own. That indeed 
revealed some latent bugs in gistbuild.c where I had mixed up buffers 
returned by bulkw_alloc_buf() and palloc'd buffers.

(The previous version of this patch called a different struct 
BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be 
confused!)

I think this helps a little, but I'm still not very happy with it. I'll 
give it some more thought after sleeping over it, but in the meanwhile, 
I'm all ears for suggestions.

>> +/*-------------------------------------------------------------------------
>> + *
>> + * bulk_write.c
>> + *      Efficiently and reliably populate a new relation
>> + *
>> + * The assumption is that no other backends access the relation while we are
>> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
>> + * buffer manager as usual, if performance is not critical, but you must not
>> + * mix operations through the buffer manager and the bulk loading interface at
>> + * the same time.
> 
>  From "Alternatively" onward this is is somewhat confusing.

Rewrote that to just "Do not mix operations through the regular buffer 
manager and the bulk loading interface!"

>> + * One tricky point is that because we bypass the buffer manager, we need to
>> + * register the relation for fsyncing at the next checkpoint ourselves, and
>> + * make sure that the relation is correctly fsync by us or the checkpointer
>> + * even if a checkpoint happens concurrently.
> 
> "fsync'ed" or such? Otherwise this reads awkwardly for me.

Yep, fixed.

>> +typedef struct BulkWriteBuffer
>> +{
>> +    Page        page;
>> +    BlockNumber blkno;
>> +    bool        page_std;
>> +    int16        order;
>> +} BulkWriteBuffer;
>> +
> 
> The name makes it sound like this struct itself contains a buffer - but it's
> just pointing to one. *BufferRef or such maybe?
> 
> I was wondering how you dealt with the alignment of buffers given the struct
> definition, which is what lead me to look at this...

I renamed this to PendingWrite, and the field that holds these 
"pending_writes". Think of it as a queue of writes that haven't been 
performed yet.

>> +/*
>> + * Bulk writer state for one relation fork.
>> + */
>> +typedef struct BulkWriteState
>> +{
>> +    /* Information about the target relation we're writing */
>> +    SMgrRelation smgr;
> 
> Isn't there a danger of this becoming a dangling pointer? At least until
> https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> is merged?

Good point. I just added a FIXME comment to remind about that, hoping 
that that patch gets merged soon. If not, I'll come up with a different fix.

>> +    ForkNumber    forknum;
>> +    bool        use_wal;
>> +
>> +    /* We keep several pages buffered, and WAL-log them in batches */
>> +    int            nbuffered;
>> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
>> +
>> +    /* Current size of the relation */
>> +    BlockNumber pages_written;
>> +
>> +    /* The RedoRecPtr at the time that the bulk operation started */
>> +    XLogRecPtr    start_RedoRecPtr;
>> +
>> +    Page        zeropage;        /* workspace for filling zeroes */
> 
> We really should just have one such page in shared memory somewhere... For WAL
> writes as well.
> 
> But until then - why do you allocate the page? Seems like we could just use a
> static global variable?

I made it a static global variable for now. (The palloc way was copied 
over from nbtsort.c)

>> +/*
>> + * Write all buffered pages to disk.
>> + */
>> +static void
>> +bulkw_flush(BulkWriteState *bulkw)
>> +{
>> +    int            nbuffered = bulkw->nbuffered;
>> +    BulkWriteBuffer *buffers = bulkw->buffers;
>> +
>> +    if (nbuffered == 0)
>> +        return;
>> +
>> +    if (nbuffered > 1)
>> +    {
>> +        int            o;
>> +
>> +        qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
>> +
>> +        /*
>> +         * Eliminate duplicates, keeping the last write of each block.
>> +         * (buffer_cmp uses 'order' as the last sort key)
>> +         */
> 
> Huh - which use cases would actually cause duplicate writes?

Hmm, nothing anymore I guess. Many AMs used to write zero pages as a 
placeholder and come back to fill them in later, but now that 
bulk_write.c handles that,

Removed that, and replaced it with with an assertion in buffer_cmp() 
that there are no duplicates.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
Melanie just reminded about an older thread about this same thing:
https://www.postgresql.org/message-id/CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax%2Bckjig%40mail.gmail.com. 
I had completely forgotten about that.

Melanie's patches in that thread implemented the same optimization of 
avoiding the fsync() if no checkpoint has happened during the index 
build. My patch here also implements batching the WAL records of 
multiple blocks, which was not part of those older patches. OTOH, those 
patches included an additional optimization of not bypassing the shared 
buffer cache if the index is small. That seems sensible too.

In this new patch, I subconsciously implemented an API close to what I 
suggested at the end of that old thread.

So I'd like to continue this effort based on this new patch. We can add 
the bypass-buffer-cache optimization later on top of this. With the new 
API that this introduces, it should be an isolated change to the 
implementation, with no changes required to the callers.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
vignesh C
Date:
On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 19/11/2023 02:04, Andres Freund wrote:
> > On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> >> The new facility makes it easier to optimize bulk loading, as the
> >> logic for buffering, WAL-logging, and syncing the relation only needs
> >> to be implemented once. It's also less error-prone: We have had a
> >> number of bugs in how a relation is fsync'd - or not - at the end of a
> >> bulk loading operation. By centralizing that logic to one place, we
> >> only need to write it correctly once.
> >
> > One thing I'd like to use the centralized handling for is to track such
> > writes in pg_stat_io. I don't mean as part of the initial patch, just that
> > that's another reason I like the facility.
>
> Oh I didn't realize they're not counted at the moment.
>
> >> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> >> +
> >>      nblocks = smgrnblocks(src, forkNum);
> >>
> >>      for (blkno = 0; blkno < nblocks; blkno++)
> >>      {
> >> +            Page            page;
> >> +
> >>              /* If we got a cancel signal during the copy of the data, quit */
> >>              CHECK_FOR_INTERRUPTS();
> >>
> >> -            smgrread(src, forkNum, blkno, buf.data);
> >> +            page = bulkw_alloc_buf(bulkw);
> >> +            smgrread(src, forkNum, blkno, page);
> >>
> >>              if (!PageIsVerifiedExtended(page, blkno,
> >>                                                                      PIV_LOG_WARNING | PIV_REPORT_STAT))
> >> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
> >>               * page this is, so we have to log the full page including any unused
> >>               * space.
> >>               */
> >> -            if (use_wal)
> >> -                    log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> >> -
> >> -            PageSetChecksumInplace(page, blkno);
> >> -
> >> -            /*
> >> -             * Now write the page.  We say skipFsync = true because there's no
> >> -             * need for smgr to schedule an fsync for this write; we'll do it
> >> -             * ourselves below.
> >> -             */
> >> -            smgrextend(dst, forkNum, blkno, buf.data, true);
> >> +            bulkw_write(bulkw, blkno, page, false);
> >
> > I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> > change, it looks like it'll just leak memory. It also might be taken to be
> > valid until freed, which I don't think is the case?
>
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd so you could call pfree on it.
>
> However, I'd like to not expose that detail to the callers. I'm
> imagining that in the future we might optimize further, by having a
> larger e.g. 1 MB buffer, and carve the 8kB blocks from that. Then
> opportunistically, if you fill the buffers sequentially, bulk_write.c
> could do one smgrextend() to write the whole 1 MB chunk.
>
> I renamed it to bulkw_get_buf() now, and made it return a new
> BulkWriteBuffer typedef instead of a plain Page. The point of the new
> typedef is to distinguish a buffer returned by bulkw_get_buf() from a
> Page or char[BLCKSZ] that you might palloc on your own. That indeed
> revealed some latent bugs in gistbuild.c where I had mixed up buffers
> returned by bulkw_alloc_buf() and palloc'd buffers.
>
> (The previous version of this patch called a different struct
> BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
> confused!)
>
> I think this helps a little, but I'm still not very happy with it. I'll
> give it some more thought after sleeping over it, but in the meanwhile,
> I'm all ears for suggestions.
>
> >> +/*-------------------------------------------------------------------------
> >> + *
> >> + * bulk_write.c
> >> + *    Efficiently and reliably populate a new relation
> >> + *
> >> + * The assumption is that no other backends access the relation while we are
> >> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
> >> + * buffer manager as usual, if performance is not critical, but you must not
> >> + * mix operations through the buffer manager and the bulk loading interface at
> >> + * the same time.
> >
> >  From "Alternatively" onward this is is somewhat confusing.
>
> Rewrote that to just "Do not mix operations through the regular buffer
> manager and the bulk loading interface!"
>
> >> + * One tricky point is that because we bypass the buffer manager, we need to
> >> + * register the relation for fsyncing at the next checkpoint ourselves, and
> >> + * make sure that the relation is correctly fsync by us or the checkpointer
> >> + * even if a checkpoint happens concurrently.
> >
> > "fsync'ed" or such? Otherwise this reads awkwardly for me.
>
> Yep, fixed.
>
> >> +typedef struct BulkWriteBuffer
> >> +{
> >> +    Page            page;
> >> +    BlockNumber blkno;
> >> +    bool            page_std;
> >> +    int16           order;
> >> +} BulkWriteBuffer;
> >> +
> >
> > The name makes it sound like this struct itself contains a buffer - but it's
> > just pointing to one. *BufferRef or such maybe?
> >
> > I was wondering how you dealt with the alignment of buffers given the struct
> > definition, which is what lead me to look at this...
>
> I renamed this to PendingWrite, and the field that holds these
> "pending_writes". Think of it as a queue of writes that haven't been
> performed yet.
>
> >> +/*
> >> + * Bulk writer state for one relation fork.
> >> + */
> >> +typedef struct BulkWriteState
> >> +{
> >> +    /* Information about the target relation we're writing */
> >> +    SMgrRelation smgr;
> >
> > Isn't there a danger of this becoming a dangling pointer? At least until
> > https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> > is merged?
>
> Good point. I just added a FIXME comment to remind about that, hoping
> that that patch gets merged soon. If not, I'll come up with a different fix.
>
> >> +    ForkNumber      forknum;
> >> +    bool            use_wal;
> >> +
> >> +    /* We keep several pages buffered, and WAL-log them in batches */
> >> +    int                     nbuffered;
> >> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> >> +
> >> +    /* Current size of the relation */
> >> +    BlockNumber pages_written;
> >> +
> >> +    /* The RedoRecPtr at the time that the bulk operation started */
> >> +    XLogRecPtr      start_RedoRecPtr;
> >> +
> >> +    Page            zeropage;               /* workspace for filling zeroes */
> >
> > We really should just have one such page in shared memory somewhere... For WAL
> > writes as well.
> >
> > But until then - why do you allocate the page? Seems like we could just use a
> > static global variable?
>
> I made it a static global variable for now. (The palloc way was copied
> over from nbtsort.c)
>
> >> +/*
> >> + * Write all buffered pages to disk.
> >> + */
> >> +static void
> >> +bulkw_flush(BulkWriteState *bulkw)
> >> +{
> >> +    int                     nbuffered = bulkw->nbuffered;
> >> +    BulkWriteBuffer *buffers = bulkw->buffers;
> >> +
> >> +    if (nbuffered == 0)
> >> +            return;
> >> +
> >> +    if (nbuffered > 1)
> >> +    {
> >> +            int                     o;
> >> +
> >> +            qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> >> +
> >> +            /*
> >> +             * Eliminate duplicates, keeping the last write of each block.
> >> +             * (buffer_cmp uses 'order' as the last sort key)
> >> +             */
> >
> > Huh - which use cases would actually cause duplicate writes?
>
> Hmm, nothing anymore I guess. Many AMs used to write zero pages as a
> placeholder and come back to fill them in later, but now that
> bulk_write.c handles that,
>
> Removed that, and replaced it with with an assertion in buffer_cmp()
> that there are no duplicates.

There are few compilation errors reported by CFBot at [1], patch needs
to be rebased:
[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675] from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member
named ‘btps_page’
[02:38:12.675] 1309 | Assert(dstate->maxpostingsize <=
BTMaxItemSize(state->btps_page) &&
[02:38:12.675] | ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition
of macro ‘Assert’
[02:38:12.675] 864 | if (!(condition)) \
[02:38:12.675] | ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion
of macro ‘TYPEALIGN_DOWN’
[02:38:12.675] 812 | #define MAXALIGN_DOWN(LEN)
TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675] | ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in
expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675] 165 | (MAXALIGN_DOWN((PageGetPageSize(page) - \

[1] - https://cirrus-ci.com/task/5299954164432896

Regards,
Vignesh



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 09/01/2024 08:50, vignesh C wrote:
> There are few compilation errors reported by CFBot at [1], patch needs
> to be rebased:

Here you go.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Robert Haas
Date:
On Fri, Nov 24, 2023 at 10:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd so you could call pfree on it.

I think we should try to pick prefixes that are one or more words
rather than using word fragments. bulkw is an awkward prefix even for
people whose first language is English, and probably more awkward for
others.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Relation bulk write facility

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4575/
[2] https://cirrus-ci.com/task/4990764426461184

Kind Regards,
Peter Smith.



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 10/01/2024 18:17, Robert Haas wrote:
> I think we should try to pick prefixes that are one or more words
> rather than using word fragments. bulkw is an awkward prefix even for
> people whose first language is English, and probably more awkward for
> others.

Renamed the 'bulkw' variables to 'bulkstate, and the functions to have 
smgr_bulk_* prefix.

I was tempted to use just bulk_* as the prefix, but I'm afraid e.g. 
bulk_write() is too generic.

On 22/01/2024 07:50, Peter Smith wrote:
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there was a CFbot test failure last time it was run [2]. Please have a
> look and post an updated version if necessary.

Fixed the headerscheck failure by adding appropriate #includes.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
Committed this. Thanks everyone!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 23/02/2024 16:27, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

Those are mine, let me know if you need local investigation.

            regards, tom lane



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 23/02/2024 17:15, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..
> 
> Those are mine, let me know if you need local investigation.

Thanks, the error message was clear enough:

> bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
> } BulkWriteState;
>   ^
> ../../../../src/include/storage/bulk_write.h:20:31: note: previous definition is here
> typedef struct BulkWriteState BulkWriteState;
>                               ^
> 1 error generated.

Fixed now, but I'm a bit surprised other buildfarm members nor cirrus CI 
caught that. I also tried to reproduce it locally by adding 
-Wtypedef-redefinition, but my version of clang didn't produce any 
warnings. Are there any extra compiler options on those animals or 
something?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Thanks, the error message was clear enough:
>> bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 feature [-Werror,-Wtypedef-redefinition]
>> } BulkWriteState;

> Fixed now, but I'm a bit surprised other buildfarm members nor cirrus CI
> caught that. I also tried to reproduce it locally by adding
> -Wtypedef-redefinition, but my version of clang didn't produce any
> warnings. Are there any extra compiler options on those animals or
> something?

They use Apple's standard compiler (clang 15 or so), but

     'CC' => 'ccache clang -std=gnu99',

so maybe the -std has something to do with it.  I installed that
(or -std=gnu90 as appropriate to branch) on most of my build
setups back when we started the C99 push.

            regards, tom lane



Re: Relation bulk write facility

From
Noah Misch
Date:
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID:
43188608

with this stack trace:
#5  0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0,
lineNumber=16780064)at assert.c:66
 
#6  0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33,
buffer=0x306e6000,skipFsync=812539904) at md.c:472
 
#7  0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000, skipFsync=812539904)
atsmgr.c:541
 
#8  0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
#9  0x107baf24 in _bt_blwritepage (wstate=0x100d0a14 <datum_image_eq@AF65_7+404>, buf=0x304f13b0, blkno=807631240) at
nbtsort.c:638
#10 0x107bccd8 in _bt_buildadd (wstate=0x104c9384 <smgr_bulk_start_rel+132>, state=0x304eb190, itup=0xe10,
truncextra=805686672)at nbtsort.c:984
 
#11 0x107bc86c in _bt_sort_dedup_finish_pending (wstate=0x3b6, state=0x19, dstate=0x3) at nbtsort.c:1036
#12 0x107bc188 in _bt_load (wstate=0x10, btspool=0x0, btspool2=0x0) at nbtsort.c:1331
#13 0x107bd4ec in _bt_leafbuild (btspool=0x101589fc <ProcessInvalidationMessages+188>, btspool2=0x0) at nbtsort.c:571
#14 0x107be028 in btbuild (heap=0x304d2a00, index=0x4e1f, indexInfo=0x3) at nbtsort.c:329
#15 0x1013538c in index_build (heapRelation=0x2, indexRelation=0x10bdc518 <getopt_long+2464664>, indexInfo=0x2,
isreindex=10,parallel=false) at index.c:3047
 
#16 0x101389e0 in index_create (heapRelation=0x1001aac0 <palloc+192>, indexRelationName=0x20 <error: Cannot access
memoryat address 0x20>, indexRelationId=804393376, parentIndexRelid=805686672,
 
    parentConstraintId=268544704, relFileNumber=805309688, indexInfo=0x3009a328, indexColNames=0x30237a20,
accessMethodId=403,tableSpaceId=0, collationIds=0x304d29d8, opclassIds=0x304d29f8,
 
    opclassOptions=0x304d2a18, coloptions=0x304d2a38, reloptions=0, flags=0, constr_flags=0,
allow_system_table_mods=false,is_internal=false, constraintId=0x2ff211b4) at index.c:1260
 
#17 0x1050342c in DefineIndex (tableId=19994, stmt=0x2ff21370, indexRelationId=0, parentIndexId=0,
parentConstraintId=0,total_parts=0, is_alter_table=false, check_rights=true, check_not_in_use=true,
 
    skip_build=false, quiet=false) at indexcmds.c:1204
#18 0x104b4474 in ProcessUtilitySlow (pstate=<error reading variable>, pstmt=0x3009a408, queryString=0x30099730 "CREATE
INDEXdupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);",
 

If there are other ways I should poke at it, let me know.




Re: Relation bulk write facility

From
Thomas Munro
Date:
On Sun, Feb 25, 2024 at 6:24 AM Noah Misch <noah@leadboat.com> wrote:
> On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> > Committed this. Thanks everyone!
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
> TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID:
43188608
>
> with this stack trace:
> #5  0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0,
lineNumber=16780064)at assert.c:66 
> #6  0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33,
buffer=0x306e6000,skipFsync=812539904) at md.c:472 
> #7  0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000,
skipFsync=812539904)at smgr.c:541 
> #8  0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245

So that's:

static const PGIOAlignedBlock zero_buffer = {{0}};  /* worth BLCKSZ */

...
                smgrextend(bulkstate->smgr, bulkstate->forknum,
                           bulkstate->pages_written++,
                           &zero_buffer,
                           true);

... where PGIOAlignedBlock is:

typedef union PGIOAlignedBlock
{
#ifdef pg_attribute_aligned
    pg_attribute_aligned(PG_IO_ALIGN_SIZE)
#endif
    char        data[BLCKSZ];
...

We see this happen with both xlc and gcc (new enough to know how to do
this).  One idea would be that the AIX *linker* is unable to align it,
as that is the common tool-chain component here (and unlike stack and
heap objects, this scope is the linker's job).  There is a
pre-existing example of a zero-buffer that is at file scope like that:
pg_prewarm.c.  Perhaps it doesn't get tested?

Hmm.



Re: Relation bulk write facility

From
Noah Misch
Date:
On Sun, Feb 25, 2024 at 07:52:16AM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 6:24 AM Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> > > Committed this. Thanks everyone!
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-02-24%2015%3A13%3A14 got:
> > TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID:
43188608
> >
> > with this stack trace:
> > #5  0x10005cf0 in ExceptionalCondition (conditionName=0x1015d790 <XLogBeginInsert+80> "`", fileName=0x0,
lineNumber=16780064)at assert.c:66
 
> > #6  0x102daba8 in mdextend (reln=0x1042628c <PageSetChecksumInplace+44>, forknum=812540744, blocknum=33,
buffer=0x306e6000,skipFsync=812539904) at md.c:472
 
> > #7  0x102d6760 in smgrextend (reln=0x306e6670, forknum=812540744, blocknum=33, buffer=0x306e6000,
skipFsync=812539904)at smgr.c:541
 
> > #8  0x104c8dac in smgr_bulk_flush (bulkstate=0x306e6000) at bulk_write.c:245
> 
> So that's:
> 
> static const PGIOAlignedBlock zero_buffer = {{0}};  /* worth BLCKSZ */
> 
> ...
>                 smgrextend(bulkstate->smgr, bulkstate->forknum,
>                            bulkstate->pages_written++,
>                            &zero_buffer,
>                            true);
> 
> ... where PGIOAlignedBlock is:
> 
> typedef union PGIOAlignedBlock
> {
> #ifdef pg_attribute_aligned
>     pg_attribute_aligned(PG_IO_ALIGN_SIZE)
> #endif
>     char        data[BLCKSZ];
> ...
> 
> We see this happen with both xlc and gcc (new enough to know how to do
> this).  One idea would be that the AIX *linker* is unable to align it,
> as that is the common tool-chain component here (and unlike stack and
> heap objects, this scope is the linker's job).  There is a
> pre-existing example of a zero-buffer that is at file scope like that:
> pg_prewarm.c.  Perhaps it doesn't get tested?
> 
> Hmm.

GCC docs do say "For some linkers, the maximum supported alignment may be very
very small.", but AIX "man LD" says "data sections are aligned on a boundary
so as to satisfy the alignment of all CSECTs in the sections".  It also has -H
and -K flags to force some particular higher alignment.

On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is marking
the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers
~/farm/*/HEAD/pgsqlkeep.*/src/backend/storage/smgr/bulk_write.o

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-24_00-03-22/src/backend/storage/smgr/bulk_write.o:     file format
aix5coff64-rs6000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000277c  0000000000000000  0000000000000000  000000f0  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data         000000e4  000000000000277c  000000000000277c  0000286c  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .debug        0001f7ea  0000000000000000  0000000000000000  00002950  2**3
                  CONTENTS

/home/nm/farm/xlc32/HEAD/pgsqlkeep.2024-02-24_15-12-23/src/backend/storage/smgr/bulk_write.o:     file format
aixcoff-rs6000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000880  00000000  00000000  00000180  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data         0000410c  00000880  00000880  00000a00  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          00000000  0000498c  0000498c  00000000  2**3
                  ALLOC
  3 .debug        00008448  00000000  00000000  00004b24  2**3
                  CONTENTS
  4 .except       00000018  00000000  00000000  00004b0c  2**3
                  CONTENTS, LOAD

$ /opt/cfarm/binutils-latest/bin/objdump --section-headers ~/farm/*/HEAD/pgsqlkeep.*/contrib/pg_prewarm/pg_prewarm.o

/home/nm/farm/gcc32/HEAD/pgsqlkeep.2024-01-21_03-16-12/contrib/pg_prewarm/pg_prewarm.o:     file format aixcoff-rs6000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000a6c  00000000  00000000  000000b4  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data         00000044  00000a6c  00000a6c  00000b20  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          00002550  00000ab0  00000ab0  00000000  2**3
                  ALLOC
  3 .debug        0001c50e  00000000  00000000  00000b64  2**3
                  CONTENTS

/home/nm/farm/gcc64/HEAD/pgsqlkeep.2024-02-15_17-13-04/contrib/pg_prewarm/pg_prewarm.o:     file format
aix5coff64-rs6000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000948  0000000000000000  0000000000000000  00000138  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, CODE
  1 .data         00000078  0000000000000948  0000000000000948  00000a80  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          00002640  00000000000009c0  00000000000009c0  00000000  2**3
                  ALLOC
  3 .debug        0001d887  0000000000000000  0000000000000000  00000af8  2**3
                  CONTENTS



Re: Relation bulk write facility

From
Thomas Munro
Date:
On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:
> On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
> section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
> blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is marking
> the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:

Ah, that is a bit of a hazard that we should probably document.

I guess the ideas to fix this would be: use smgrzeroextend() instead
of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
(function-local static) for any other place that needs such a thing,
if it would be satisfied by function-local scope?



Re: Relation bulk write facility

From
Thomas Munro
Date:
On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:
> > On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
> > section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
> > blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is marking
> > the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:
>
> Ah, that is a bit of a hazard that we should probably document.
>
> I guess the ideas to fix this would be: use smgrzeroextend() instead
> of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
> (function-local static) for any other place that needs such a thing,
> if it would be satisfied by function-local scope?

Erm, wait, how does that function-local static object work differently?



Re: Relation bulk write facility

From
Noah Misch
Date:
On Sun, Feb 25, 2024 at 09:13:47AM +1300, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 9:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sun, Feb 25, 2024 at 8:50 AM Noah Misch <noah@leadboat.com> wrote:
> > > On GNU/Linux x64, gcc correctly records alignment=2**12 for the associated
> > > section (.rodata for bulk_write.o zero_buffer, .bss for pg_prewarm.o
> > > blockbuffer).  If I'm reading this right, neither AIX gcc nor xlc is marking
> > > the section with sufficient alignment, in bulk_write.o or pg_prewarm.o:
> >
> > Ah, that is a bit of a hazard that we should probably document.
> >
> > I guess the ideas to fix this would be: use smgrzeroextend() instead
> > of this coding, and/or perhaps look at the coding of pg_pwrite_zeros()
> > (function-local static) for any other place that needs such a thing,
> > if it would be satisfied by function-local scope?

True.  Alternatively, could arrange for "#define PG_O_DIRECT 0" on AIX, which
disables the alignment assertions (and debug_io_direct).

> Erm, wait, how does that function-local static object work differently?

I don't know specifically, but I expect they're different parts of the gcc
implementation.  Aligning an xcoff section may entail some xcoff-specific gcc
component.  Aligning a function-local object just changes the early
instructions of the function; it's independent of the object format.



Re: Relation bulk write facility

From
Andres Freund
Date:
Hi,

On 2024-02-24 11:50:24 -0800, Noah Misch wrote:
> > We see this happen with both xlc and gcc (new enough to know how to do
> > this).  One idea would be that the AIX *linker* is unable to align it,
> > as that is the common tool-chain component here (and unlike stack and
> > heap objects, this scope is the linker's job).  There is a
> > pre-existing example of a zero-buffer that is at file scope like that:
> > pg_prewarm.c.  Perhaps it doesn't get tested?
> >
> > Hmm.
>
> GCC docs do say "For some linkers, the maximum supported alignment may be very
> very small.", but AIX "man LD" says "data sections are aligned on a boundary
> so as to satisfy the alignment of all CSECTs in the sections".  It also has -H
> and -K flags to force some particular higher alignment.

Some xlc manual [1] states that

  n must be a positive power of 2, or NIL. NIL can be specified as either
  __attribute__((aligned())) or __attribute__((aligned)); this is the same as
  specifying the maximum system alignment (16 bytes on all UNIX platforms).

Which does seems to suggest that this is a platform restriction.


Let's just drop AIX. This isn't the only alignment issue we've found and the
solution for those isn't so much a fix as forcing everyone to carefully only
look into one direction and not notice the cliffs to either side.

Greetings,

Andres Freund

[1] https://www.ibm.com/docs/en/SSGH2K_13.1.2/com.ibm.compilers.aix.doc/proguide.pdf



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 24 February 2024 23:29:36 EET, Andres Freund <andres@anarazel.de> wrote:
>Hi,
>
>On 2024-02-24 11:50:24 -0800, Noah Misch wrote:
>> > We see this happen with both xlc and gcc (new enough to know how to do
>> > this).  One idea would be that the AIX *linker* is unable to align it,
>> > as that is the common tool-chain component here (and unlike stack and
>> > heap objects, this scope is the linker's job).  There is a
>> > pre-existing example of a zero-buffer that is at file scope like that:
>> > pg_prewarm.c.  Perhaps it doesn't get tested?
>> >
>> > Hmm.
>>
>> GCC docs do say "For some linkers, the maximum supported alignment may be very
>> very small.", but AIX "man LD" says "data sections are aligned on a boundary
>> so as to satisfy the alignment of all CSECTs in the sections".  It also has -H
>> and -K flags to force some particular higher alignment.
>
>Some xlc manual [1] states that
>
>  n must be a positive power of 2, or NIL. NIL can be specified as either
>  __attribute__((aligned())) or __attribute__((aligned)); this is the same as
>  specifying the maximum system alignment (16 bytes on all UNIX platforms).
>
>Which does seems to suggest that this is a platform restriction.

My reading of that paragraph is that you can set it to any powet of two, and it should work. 16 bytes is just what you
getif you set it to NIL. 

>Let's just drop AIX. This isn't the only alignment issue we've found and the
>solution for those isn't so much a fix as forcing everyone to carefully only
>look into one direction and not notice the cliffs to either side.

I think the way that decision should go is that as long as someone is willing to step up and do the work keep AIX
supportgoing, we support it. To be clear, that someone is not me. Anyone willing to do it? 

Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker can
doon that platform. 

We could also make the allocation 2*PG_IO_ALIGN_SIZE and round up the starting address ourselves to PG_IO_ALIGN_SIZE.
Orallocate it in the heap. 

- Heikki



Re: Relation bulk write facility

From
Thomas Munro
Date:
On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker
cando on that platform. 

You'll probably get either an error or silently fall back to buffered
I/O, if direct I/O is enabled and you try to read/write a badly
aligned buffer.  That's documented (they offer finfo() to query it,
but it's always 4KB for the same sort of reasons as it is on every
other OS).



Re: Relation bulk write facility

From
Thomas Munro
Date:
On Sun, Feb 25, 2024 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker
cando on that platform. 
>
> You'll probably get either an error or silently fall back to buffered
> I/O, if direct I/O is enabled and you try to read/write a badly
> aligned buffer.  That's documented (they offer finfo() to query it,
> but it's always 4KB for the same sort of reasons as it is on every
> other OS).

I guess it's the latter ("to work efficiently" sounds like it isn't
going to reject the request):

https://www.ibm.com/docs/en/aix/7.3?topic=tuning-direct-io

If you make it < 4KB then all direct I/O would be affected, not just
this one place, so then you might as well just not allow direct I/O on
AIX at all, to avoid giving a false impression that it does something.
(Note that if we think the platform lacks O_DIRECT we don't make those
assertions about alignment).

FWIW I'm aware of one other thing that is wrong with our direct I/O
support on AIX: it should perhaps be using a different flag.  I
created a wiki page to defer thinking about any AIX issues
until/unless at least one real, live user shows up, which hasn't
happened yet:  https://wiki.postgresql.org/wiki/AIX



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 25/02/2024 00:37, Thomas Munro wrote:
> On Sun, Feb 25, 2024 at 11:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Sun, Feb 25, 2024 at 11:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Regarding the issue at hand, perhaps we should define PG_IO_ALIGN_SIZE as 16 on AIX, if that's the best the linker
cando on that platform.
 
>>
>> You'll probably get either an error or silently fall back to buffered
>> I/O, if direct I/O is enabled and you try to read/write a badly
>> aligned buffer.  That's documented (they offer finfo() to query it,
>> but it's always 4KB for the same sort of reasons as it is on every
>> other OS).
> 
> I guess it's the latter ("to work efficiently" sounds like it isn't
> going to reject the request):
> 
> https://www.ibm.com/docs/en/aix/7.3?topic=tuning-direct-io
> 
> If you make it < 4KB then all direct I/O would be affected, not just
> this one place, so then you might as well just not allow direct I/O on
> AIX at all, to avoid giving a false impression that it does something.
> (Note that if we think the platform lacks O_DIRECT we don't make those
> assertions about alignment).
> 
> FWIW I'm aware of one other thing that is wrong with our direct I/O
> support on AIX: it should perhaps be using a different flag.  I
> created a wiki page to defer thinking about any AIX issues
> until/unless at least one real, live user shows up, which hasn't
> happened yet:  https://wiki.postgresql.org/wiki/AIX

Here's a patch that effectively disables direct I/O on AIX. I'm inclined 
to commit this as a quick fix to make the buildfarm green again.

I agree with Andres though, that unless someone raises their hand and 
volunteers to properly maintain the AIX support, we should drop it. The 
current AIX buildfarm members are running AIX 7.1, which has been out of 
support since May 2023 
(https://www.ibm.com/support/pages/aix-support-lifecycle-information). 
See also older thread on this [0].

Noah, you're running the current AIX buildfarm animals. How much effort 
are you interested to put into AIX support?

[0] 
https://www.postgresql.org/message-id/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Noah Misch
Date:
On Sun, Feb 25, 2024 at 04:34:47PM +0200, Heikki Linnakangas wrote:
> I agree with Andres though, that unless someone raises their hand and
> volunteers to properly maintain the AIX support, we should drop it.

There's no way forward in which AIX support stops doing net harm.  Even if AIX
enthusiasts intercepted would-be buildfarm failures and fixed them before
buildfarm.postgresql.org could see them, the damage from the broader community
seeing the AIX-specific code would outweigh the benefits of AIX support.  I've
now disabled the animals for v17+, though each may do one more run before
picking up the disable.

My upthread observation about xcoff section alignment was a red herring.  gcc
populates symbol-level alignment, and section-level alignment is unnecessary
if symbol-level alignment is correct.  The simplest workaround for $SUBJECT
AIX failure would be to remove the "const", based on the results of the
attached test program.  The pg_prewarm.c var is like al4096_static in the
outputs below, hence the lack of trouble there.  The bulk_write.c var is like
al4096_static_const_initialized.

==== gcc 8.3.0
al4096                           4096 @ 0x11000c000 (mod 0)
al4096_initialized               4096 @ 0x110000fd0 (mod 4048 - BUG)
al4096_const                     4096 @ 0x11000f000 (mod 0)
al4096_const_initialized         4096 @ 0x10000cd00 (mod 3328 - BUG)
al4096_static                    4096 @ 0x110005000 (mod 0)
al4096_static_initialized        4096 @ 0x110008000 (mod 0)
al4096_static_const              4096 @ 0x100000c10 (mod 3088 - BUG)
al4096_static_const_initialized  4096 @ 0x100003c10 (mod 3088 - BUG)
==== xlc 12.01.0000.0000
al4096                           4096 @ 0x110008000 (mod 0)
al4096_initialized               4096 @ 0x110004000 (mod 0)
al4096_const                     4096 @ 0x11000b000 (mod 0)
al4096_const_initialized         4096 @ 0x100007000 (mod 0)
al4096_static                    4096 @ 0x11000e000 (mod 0)
al4096_static_initialized        4096 @ 0x110001000 (mod 0)
al4096_static_const              4096 @ 0x110011000 (mod 0)
al4096_static_const_initialized  4096 @ 0x1000007d0 (mod 2000 - BUG)
==== ibm-clang 17.1.1.2
al4096                           4096 @ 0x110001000 (mod 0)
al4096_initialized               4096 @ 0x110004000 (mod 0)
al4096_const                     4096 @ 0x100001000 (mod 0)
al4096_const_initialized         4096 @ 0x100005000 (mod 0)
al4096_static                    4096 @ 0x110008000 (mod 0)
al4096_static_initialized        4096 @ 0x11000b000 (mod 0)
al4096_static_const              4096 @ 0x100009000 (mod 0)
al4096_static_const_initialized  4096 @ 0x10000d000 (mod 0)

Attachment

Re: Relation bulk write facility

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sun, Feb 25, 2024 at 04:34:47PM +0200, Heikki Linnakangas wrote:
>> I agree with Andres though, that unless someone raises their hand and
>> volunteers to properly maintain the AIX support, we should drop it.

> There's no way forward in which AIX support stops doing net harm.  Even if AIX
> enthusiasts intercepted would-be buildfarm failures and fixed them before
> buildfarm.postgresql.org could see them, the damage from the broader community
> seeing the AIX-specific code would outweigh the benefits of AIX support.  I've
> now disabled the animals for v17+, though each may do one more run before
> picking up the disable.

So, we now need to strip the remnants of AIX support from the code and
docs?  I don't see that much of it, but it's misleading to leave it
there.

(BTW, I still want to nuke the remaining snippets of HPPA support.
I don't think it does anybody any good to make it look like that's
still expected to work.)

            regards, tom lane



Re: Relation bulk write facility

From
Robert Haas
Date:
On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So, we now need to strip the remnants of AIX support from the code and
> docs?  I don't see that much of it, but it's misleading to leave it
> there.
>
> (BTW, I still want to nuke the remaining snippets of HPPA support.
> I don't think it does anybody any good to make it look like that's
> still expected to work.)

+1 for removing things that don't work (or that we think probably don't work).

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Relation bulk write facility

From
Michael Paquier
Date:
On Mon, Feb 26, 2024 at 09:42:03AM +0530, Robert Haas wrote:
> On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So, we now need to strip the remnants of AIX support from the code and
>> docs?  I don't see that much of it, but it's misleading to leave it
>> there.
>>
>> (BTW, I still want to nuke the remaining snippets of HPPA support.
>> I don't think it does anybody any good to make it look like that's
>> still expected to work.)
>
> +1 for removing things that don't work (or that we think probably don't work).

Seeing this stuff eat developer time because of the debugging of weird
issues while having a very limited impact for end-users is sad, so +1
for a cleanup of any remnants if this disappears.
--
Michael

Attachment

Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 26/02/2024 06:18, Michael Paquier wrote:
> On Mon, Feb 26, 2024 at 09:42:03AM +0530, Robert Haas wrote:
>> On Mon, Feb 26, 2024 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So, we now need to strip the remnants of AIX support from the code and
>>> docs?  I don't see that much of it, but it's misleading to leave it
>>> there.
>>>
>>> (BTW, I still want to nuke the remaining snippets of HPPA support.
>>> I don't think it does anybody any good to make it look like that's
>>> still expected to work.)
>>
>> +1 for removing things that don't work (or that we think probably don't work).
> 
> Seeing this stuff eat developer time because of the debugging of weird
> issues while having a very limited impact for end-users is sad, so +1
> for a cleanup of any remnants if this disappears.

Here's a patch to fully remove AIX support.

One small issue that warrants some discussion (in sanity_check.sql):

> --- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> +-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
>  -- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
>  -- catalog C struct layout matches catalog tuple layout, arrange for the tuple
>  -- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
>  -- unconditionally.  Keep such columns before the first NameData column of the
>  -- catalog, since packagers can override NAMEDATALEN to an odd number.
> +-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
> +-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
> +-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)

What do y'all think of adding a check for 
ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF to configure.ac and meson.build? It's 
not a requirement today, but I believe AIX was the only platform where 
that was not true. With AIX gone, that combination won't be tested, and 
we will probably break it sooner or later.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> What do y'all think of adding a check for
> ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF to configure.ac and meson.build? It's
> not a requirement today, but I believe AIX was the only platform where
> that was not true. With AIX gone, that combination won't be tested, and
> we will probably break it sooner or later.

+1, and then probably revert the whole test addition of 79b716cfb7a.

I did a quick scrape of the buildfarm, and identified these as the
only animals reporting ALIGNOF_DOUBLE less than 8:

$ grep 'alignment of double' alignments  | grep -v ' 8$'
 hornet        | 2024-02-22 16:26:16 | checking alignment of double... 4
 lapwing       | 2024-02-27 12:40:15 | checking alignment of double... (cached) 4
 mandrill      | 2024-02-19 01:03:47 | checking alignment of double... 4
 sungazer      | 2024-02-21 00:22:48 | checking alignment of double... 4
 tern          | 2024-02-22 13:25:12 | checking alignment of double... 4

With AIX out of the picture, lapwing will be the only remaining
animal testing MAXALIGN less than 8.  That seems like a single
point of failure ... should we spin up another couple 32-bit
animals?  I had supposed that my faithful old PPC animal mamba
was helping to check this, but I see that under NetBSD it's
joined the ALIGNOF_DOUBLE==8 crowd.

            regards, tom lane



Re: Relation bulk write facility

From
Andres Freund
Date:
Hi,

On 2024-02-27 15:45:45 -0500, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> With AIX out of the picture, lapwing will be the only remaining
> animal testing MAXALIGN less than 8.  That seems like a single
> point of failure ... should we spin up another couple 32-bit
> animals?  I had supposed that my faithful old PPC animal mamba
> was helping to check this, but I see that under NetBSD it's
> joined the ALIGNOF_DOUBLE==8 crowd.

I can set up a i386 animal, albeit on an amd64 kernel. But I don't think the
latter matters.

Greetings,

Andres Freund



Re: Relation bulk write facility

From
Thomas Munro
Date:
On Wed, Feb 28, 2024 at 9:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a patch to fully remove AIX support.

--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3401,7 +3401,7 @@ export MANPATH
   <para>
    <productname>PostgreSQL</productname> can be expected to work on current
    versions of these operating systems: Linux, Windows,
-   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX, Solaris, and illumos.
+   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, Solaris, and illumos.

There is also a little roll-of-honour of operating systems we used to
support, just a couple of paragraphs down, where AIX should appear.



Re: Relation bulk write facility

From
Noah Misch
Date:
On Wed, Feb 28, 2024 at 12:24:01AM +0400, Heikki Linnakangas wrote:
> Here's a patch to fully remove AIX support.

> Subject: [PATCH 1/1] Remove AIX support
> 
> There isn't a lot of user demand for AIX support, no one has stepped
> up to the plate to properly maintain it, so it's best to remove it

Regardless of how someone were to step up to maintain it, we'd be telling them
such contributions have negative value and must stop.  We're expelling AIX due
to low demand, compiler bugs, its ABI, and its shlib symbol export needs.

> altogether. AIX is still supported for stable versions.
> 
> The acute issue that triggered this decision was that after commit
> 8af2565248, the AIX buildfarm members have been hitting this
> assertion:
> 
>     TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID:
2949728
> 
> Apperently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
> (linker?) for values larger than PG_IO_ALIGN_SIZE.

No; see https://postgr.es/m/20240225194322.a5%40rfd.leadboat.com



Re: Relation bulk write facility

From
Andres Freund
Date:
Hi,

On 2024-02-28 00:24:01 +0400, Heikki Linnakangas wrote:
> Here's a patch to fully remove AIX support.

Thomas mentioned to me that cfbot failed with this applied:
https://cirrus-ci.com/task/6348635416297472
https://api.cirrus-ci.com/v1/artifact/task/6348635416297472/log/tmp_install/log/initdb-template.log

initdb: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory


While I couldn't reproduce the failure, I did notice that locally with the
patch applied, system libpq ended up getting used. Which isn't pre-installed
in the CI environment, explaining the failure.

The problem is due to this hunk:
> @@ -401,10 +376,6 @@ install-lib-static: $(stlib) installdirs-lib
>  
>  install-lib-shared: $(shlib) installdirs-lib
>  ifdef soname
> -# we don't install $(shlib) on AIX
> -# (see
http://archives.postgresql.org/message-id/52EF20B2E3209443BC37736D00C3C1380A6E79FE@EXADV1.host.magwien.gv.at)
> -ifneq ($(PORTNAME), aix)
> -    $(INSTALL_SHLIB) $< '$(DESTDIR)$(libdir)/$(shlib)'
>  ifneq ($(PORTNAME), cygwin)
>  ifneq ($(PORTNAME), win32)
>  ifneq ($(shlib), $(shlib_major))

So the versioned name didn't end up getting installed anymore, leading to
broken symlinks in the install directory.



> diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
> index 86cc01a640b..fc6b00224f6 100644
> --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
> +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
> @@ -400,9 +400,6 @@ is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
>  SKIP:
>  {
>      my $tar = $ENV{TAR};
> -    # don't check for a working tar here, to accommodate various odd
> -    # cases such as AIX. If tar doesn't work the init_from_backup below
> -    # will fail.
>      skip "no tar program available", 1
>        if (!defined $tar || $tar eq '');

Maybe better to not remove the whole comment, just the reference to AIX?


> diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
> index 7f338d191c6..2e9d5ebef3f 100644
> --- a/src/test/regress/sql/sanity_check.sql
> +++ b/src/test/regress/sql/sanity_check.sql
> @@ -21,12 +21,15 @@ SELECT relname, relkind
>         AND relfilenode <> 0;
>  
>  --
> --- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> +-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
>  -- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
>  -- catalog C struct layout matches catalog tuple layout, arrange for the tuple
>  -- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
>  -- unconditionally.  Keep such columns before the first NameData column of the
>  -- catalog, since packagers can override NAMEDATALEN to an odd number.
> +-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
> +-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
> +-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
>  --
>  WITH check_columns AS (
>   SELECT relname, attname,

I agree, this should be an error, and we should then remove the test.


Greetings,

Andres Freund



Re: Relation bulk write facility

From
Andres Freund
Date:
Hi,

On 2024-02-27 12:59:14 -0800, Andres Freund wrote:
> On 2024-02-27 15:45:45 -0500, Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > With AIX out of the picture, lapwing will be the only remaining
> > animal testing MAXALIGN less than 8.  That seems like a single
> > point of failure ... should we spin up another couple 32-bit
> > animals?  I had supposed that my faithful old PPC animal mamba
> > was helping to check this, but I see that under NetBSD it's
> > joined the ALIGNOF_DOUBLE==8 crowd.
>
> I can set up a i386 animal, albeit on an amd64 kernel. But I don't think the
> latter matters.

That animal is now running, named "adder". Due to a typo there are still
spurious errors on the older branches, but I've triggered those to be re-run.

Currently adder builds with autconf on older branches and with meson on newer
ones. Is it worth setting up two animals so we cover both ac and meson with 32
bit on 16/HEAD?

There's something odd about how we fail when not specifying the correct PERL
at configure time:
/home/bf/bf-build/adder/REL_13_STABLE/pgsql.build/../pgsql/src/pl/plperl/Util.c: loadable library and perl binaries are
mismatched(got first handshake key 0x93c0080, needed 0x9580080)
 

Not sure what gets linked against what wrongly. But I'm also not sure it's
worth the energy to investigate.

Greetings,

Andres Freund



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
Committed, after fixing the various little things you pointed out:

On 28/02/2024 00:30, Thomas Munro wrote:
> --- a/doc/src/sgml/installation.sgml
> +++ b/doc/src/sgml/installation.sgml
> @@ -3401,7 +3401,7 @@ export MANPATH
>     <para>
>      <productname>PostgreSQL</productname> can be expected to work on current
>      versions of these operating systems: Linux, Windows,
> -   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX, Solaris, and illumos.
> +   FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, Solaris, and illumos.
> 
> There is also a little roll-of-honour of operating systems we used to
> support, just a couple of paragraphs down, where AIX should appear.

Added.

On 28/02/2024 05:52, Noah Misch wrote:
> Regardless of how someone were to step up to maintain it, we'd be telling them
> such contributions have negative value and must stop.  We're expelling AIX due
> to low demand, compiler bugs, its ABI, and its shlib symbol export needs.

Reworded.

>> Apperently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
>> (linker?) for values larger than PG_IO_ALIGN_SIZE.
> 
> No; see https://postgr.es/m/20240225194322.a5%40rfd.leadboat.com

Ok, reworded.

On 28/02/2024 11:26, Andres Freund wrote:
> On 2024-02-28 00:24:01 +0400, Heikki Linnakangas wrote:
> The problem is due to this hunk:
>> @@ -401,10 +376,6 @@ install-lib-static: $(stlib) installdirs-lib
>>   
>>   install-lib-shared: $(shlib) installdirs-lib
>>   ifdef soname
>> -# we don't install $(shlib) on AIX
>> -# (see
http://archives.postgresql.org/message-id/52EF20B2E3209443BC37736D00C3C1380A6E79FE@EXADV1.host.magwien.gv.at)
>> -ifneq ($(PORTNAME), aix)
>> -    $(INSTALL_SHLIB) $< '$(DESTDIR)$(libdir)/$(shlib)'
>>   ifneq ($(PORTNAME), cygwin)
>>   ifneq ($(PORTNAME), win32)
>>   ifneq ($(shlib), $(shlib_major))
> 
> So the versioned name didn't end up getting installed anymore, leading to
> broken symlinks in the install directory.

Fixed, thanks!

>> diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
>> index 86cc01a640b..fc6b00224f6 100644
>> --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
>> +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
>> @@ -400,9 +400,6 @@ is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
>>   SKIP:
>>   {
>>       my $tar = $ENV{TAR};
>> -    # don't check for a working tar here, to accommodate various odd
>> -    # cases such as AIX. If tar doesn't work the init_from_backup below
>> -    # will fail.
>>       skip "no tar program available", 1
>>         if (!defined $tar || $tar eq '');
> 
> Maybe better to not remove the whole comment, just the reference to AIX?

Ok, done

>> diff --git a/src/test/regress/sql/sanity_check.sql b/src/test/regress/sql/sanity_check.sql
>> index 7f338d191c6..2e9d5ebef3f 100644
>> --- a/src/test/regress/sql/sanity_check.sql
>> +++ b/src/test/regress/sql/sanity_check.sql
>> @@ -21,12 +21,15 @@ SELECT relname, relkind
>>          AND relfilenode <> 0;
>>   
>>   --
>> --- When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
>> +-- When MAXIMUM_ALIGNOF==8 but ALIGNOF_DOUBLE==4, the C ABI may impose 8-byte alignment
>>   -- some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
>>   -- catalog C struct layout matches catalog tuple layout, arrange for the tuple
>>   -- offset of each fixed-width, attalign='d' catalog column to be divisible by 8
>>   -- unconditionally.  Keep such columns before the first NameData column of the
>>   -- catalog, since packagers can override NAMEDATALEN to an odd number.
>> +-- (XXX: I'm not sure if any of the supported platforms have MAXIMUM_ALIGNOF==8 and
>> +-- ALIGNOF_DOUBLE==4.  Perhaps we should just require that
>> +-- ALIGNOF_DOUBLE==MAXIMUM_ALIGNOF)
>>   --
>>   WITH check_columns AS (
>>    SELECT relname, attname,
> 
> I agree, this should be an error, and we should then remove the test.

Done.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Relation bulk write facility

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 28/02/2024 00:30, Thomas Munro wrote:
>> I agree, this should be an error, and we should then remove the test.

> Done.

The commit that added that test added a support function
"get_columns_length" which is now unused.  Should we get rid of that
as well?

            regards, tom lane



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 28/02/2024 18:04, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 28/02/2024 00:30, Thomas Munro wrote:
>>> I agree, this should be an error, and we should then remove the test.
> 
>> Done.
> 
> The commit that added that test added a support function
> "get_columns_length" which is now unused.  Should we get rid of that
> as well?

I see you just removed it; thanks!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 28/02/2024 18:04, Tom Lane wrote:
>> The commit that added that test added a support function
>> "get_columns_length" which is now unused.  Should we get rid of that
>> as well?

> I see you just removed it; thanks!

In the no-good-deed-goes-unpunished department: crake is reporting
that this broke our cross-version upgrade tests.  I'll go fix that.

            regards, tom lane



Remove AIX Support (was: Re: Relation bulk write facility)

From
Michael Banck
Date:
Hi,

On Sat, Feb 24, 2024 at 01:29:36PM -0800, Andres Freund wrote:
> Let's just drop AIX. This isn't the only alignment issue we've found and the
> solution for those isn't so much a fix as forcing everyone to carefully only
> look into one direction and not notice the cliffs to either side.

While I am not against dropping AIX (and certainly won't step up to
maintain it just for fun), I don't think burying this inside some
"Relation bulk write facility" thread is helpful; I have changed the
thread title as a first step.

The commit message says there is not a lot of user demand and that might
be right, but contrary to other fringe OSes that got removed like HPPA
or Irix, I believe Postgres on AIX is still used in production and if
so, probably in a mission-critical manner at some old-school
institutions (in fact, one of our customers does just that) and not as a
thought-experiment. It is probably well-known among Postgres hackers
that AIX support is problematic/a burden, but the current users might
not be aware of this.

Not sure what to do about this (especially now that this has been
committed), maybe there should have been be a public deprecation notice
first for v17... On the other hand, that might not work if important
features like direct-IO would have to be bumped from v17 just because of
AIX.

I posted about this on Twitter and Mastodon to see whether anybody
complains and did not get a lot of feedback.

In any case, users will have a couple of years to migrate as usual if
they upgrade to v16.


Michael



Re: Remove AIX Support (was: Re: Relation bulk write facility)

From
Daniel Gustafsson
Date:
> On 29 Feb 2024, at 09:13, Michael Banck <mbanck@gmx.net> wrote:

> In any case, users will have a couple of years to migrate as usual if
> they upgrade to v16.

As you say, there are many years left of AIX being supported so there is plenty
of runway for planning a migration.

--
Daniel Gustafsson




Re: Remove AIX Support (was: Re: Relation bulk write facility)

From
Andres Freund
Date:
Hi,

On 2024-02-29 09:13:04 +0100, Michael Banck wrote:
> The commit message says there is not a lot of user demand and that might
> be right, but contrary to other fringe OSes that got removed like HPPA
> or Irix, I believe Postgres on AIX is still used in production and if
> so, probably in a mission-critical manner at some old-school
> institutions (in fact, one of our customers does just that) and not as a
> thought-experiment. It is probably well-known among Postgres hackers
> that AIX support is problematic/a burden, but the current users might
> not be aware of this.

Then these users should have paid somebody to actually do maintenance work on
the AIX support,o it doesn't regularly stand in the way of implementing
various things.

Greetings,

Andres Freund



Re: Remove AIX Support (was: Re: Relation bulk write facility)

From
Michael Banck
Date:
Hi,

On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:
> On 2024-02-29 09:13:04 +0100, Michael Banck wrote:
> > The commit message says there is not a lot of user demand and that might
> > be right, but contrary to other fringe OSes that got removed like HPPA
> > or Irix, I believe Postgres on AIX is still used in production and if
> > so, probably in a mission-critical manner at some old-school
> > institutions (in fact, one of our customers does just that) and not as a
> > thought-experiment. It is probably well-known among Postgres hackers
> > that AIX support is problematic/a burden, but the current users might
> > not be aware of this.
> 
> Then these users should have paid somebody to actually do maintenance work on
> the AIX support,o it doesn't regularly stand in the way of implementing
> various things.

Right, absolutely.

But: did we ever tell them to do that? I don't think it's reasonable for
them to expect to follow -hackers and jump in when somebody grumbles
about AIX being a burden somewhere deep down a thread...


Michael



Re: Remove AIX Support (was: Re: Relation bulk write facility)

From
Andres Freund
Date:
Hi,

On 2024-02-29 10:24:24 +0100, Michael Banck wrote:
> On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:
> > On 2024-02-29 09:13:04 +0100, Michael Banck wrote:
> > > The commit message says there is not a lot of user demand and that might
> > > be right, but contrary to other fringe OSes that got removed like HPPA
> > > or Irix, I believe Postgres on AIX is still used in production and if
> > > so, probably in a mission-critical manner at some old-school
> > > institutions (in fact, one of our customers does just that) and not as a
> > > thought-experiment. It is probably well-known among Postgres hackers
> > > that AIX support is problematic/a burden, but the current users might
> > > not be aware of this.
> > 
> > Then these users should have paid somebody to actually do maintenance work on
> > the AIX support,o it doesn't regularly stand in the way of implementing
> > various things.
> 
> Right, absolutely.
> 
> But: did we ever tell them to do that? I don't think it's reasonable for
> them to expect to follow -hackers and jump in when somebody grumbles
> about AIX being a burden somewhere deep down a thread...

Well, the thing is that it's commonly going to be deep down some threads that
portability problems cause pain.  This is far from the only time. Just a few
threads:

https://postgr.es/m/CA+TgmoauCAv+p4Z57PqgVgNxsApxKs3Yh9mDLdUDB8fep-s=1w@mail.gmail.com
https://postgr.es/m/CA+hUKGK=DOC+hE-62FKfZy=Ybt5uLkrg3zCZD-jFykM-iPn8yw@mail.gmail.com
https://postgr.es/m/20230124165814.2njc7gnvubn2amh6@awork3.anarazel.de
https://postgr.es/m/2385119.1696354473@sss.pgh.pa.us
https://postgr.es/m/20221005200710.luvw5evhwf6clig6@awork3.anarazel.de
https://postgr.es/m/20220820204401.vrf5kejih6jofvqb%40awork3.anarazel.de
https://postgr.es/m/E1oWpzF-002EG4-AG%40gemulon.postgresql.org

This is far from all.

The only platform rivalling AIX on the pain-caused metric is windows. And at
least that can be tested via CI (or locally).  We've been relying on the gcc
buildfarm to be able to maintain AIX at all, and that's not a resource that
scales to many users.

Greetings,

Andres Freund



Re: Remove AIX Support (was: Re: Relation bulk write facility)

From
Daniel Gustafsson
Date:
> On 29 Feb 2024, at 10:24, Michael Banck <mbanck@gmx.net> wrote:
> On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:

>> Then these users should have paid somebody to actually do maintenance work on
>> the AIX support,o it doesn't regularly stand in the way of implementing
>> various things.
>
> Right, absolutely.
>
> But: did we ever tell them to do that? I don't think it's reasonable for
> them to expect to follow -hackers and jump in when somebody grumbles
> about AIX being a burden somewhere deep down a thread...

Having spent a fair bit of time within open source projects that companies rely
on, my experience is that those companies who need to hear such news have zero
interaction with the project and most of time don't even know the project
community exist.  That conversely also means that the project don't know they
exist.  If their consultants and suppliers, who have a higher probability of
knowing this, hasn't told them then it's highly unlikely that anything we say
will get across.

--
Daniel Gustafsson




RE: Remove AIX Support (was: Re: Relation bulk write facility)

From
Phil Florent
Date:
Hi,
Historically many public hospitals I work for had IBM Power hardware.
The SMT8 (8 threads/cores) capabilities of Power CPU are useful to lower Oracle licence & support cost. We migrate to PostgreSQL and it runs very well on Power, especially since the (relatively) recent parallel executions features of the RDBMS match very well the CPU capabilities. 
We chose to run PostgreSQL on Debian/Power (Little Endian) since ppc64le is an official Debian port. No AIX then. Only problem is that we still need to access Oracle databases and it can be useful to read directly with oracle_fdw but this tool needs an instant client and it's not open source of course. Oracle provides a binary but they don't provide patches for Debian/Power Little Endian (strange situation...) Just to say that of course we chose Linux for PostgreSQL but sometimes things are not so easy... We could have chosen AIX and we still have a ???? about interoperability.
Best regards,
Phil

De : Andres Freund <andres@anarazel.de>
Envoyé : jeudi 29 février 2024 10:35
À : Michael Banck <mbanck@gmx.net>
Cc : Noah Misch <noah@leadboat.com>; Thomas Munro <thomas.munro@gmail.com>; Heikki Linnakangas <hlinnaka@iki.fi>; Peter Smith <smithpb2250@gmail.com>; Robert Haas <robertmhaas@gmail.com>; vignesh C <vignesh21@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>; Melanie Plageman <melanieplageman@gmail.com>
Objet : Re: Remove AIX Support (was: Re: Relation bulk write facility)
 
Hi,

On 2024-02-29 10:24:24 +0100, Michael Banck wrote:
> On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:
> > On 2024-02-29 09:13:04 +0100, Michael Banck wrote:
> > > The commit message says there is not a lot of user demand and that might
> > > be right, but contrary to other fringe OSes that got removed like HPPA
> > > or Irix, I believe Postgres on AIX is still used in production and if
> > > so, probably in a mission-critical manner at some old-school
> > > institutions (in fact, one of our customers does just that) and not as a
> > > thought-experiment. It is probably well-known among Postgres hackers
> > > that AIX support is problematic/a burden, but the current users might
> > > not be aware of this.
> >
> > Then these users should have paid somebody to actually do maintenance work on
> > the AIX support,o it doesn't regularly stand in the way of implementing
> > various things.
>
> Right, absolutely.
>
> But: did we ever tell them to do that? I don't think it's reasonable for
> them to expect to follow -hackers and jump in when somebody grumbles
> about AIX being a burden somewhere deep down a thread...

Well, the thing is that it's commonly going to be deep down some threads that
portability problems cause pain.  This is far from the only time. Just a few
threads:

https://postgr.es/m/CA+TgmoauCAv+p4Z57PqgVgNxsApxKs3Yh9mDLdUDB8fep-s=1w@mail.gmail.com
https://postgr.es/m/CA+hUKGK=DOC+hE-62FKfZy=Ybt5uLkrg3zCZD-jFykM-iPn8yw@mail.gmail.com
https://postgr.es/m/20230124165814.2njc7gnvubn2amh6@awork3.anarazel.de
https://postgr.es/m/2385119.1696354473@sss.pgh.pa.us
https://postgr.es/m/20221005200710.luvw5evhwf6clig6@awork3.anarazel.de
https://postgr.es/m/20220820204401.vrf5kejih6jofvqb%40awork3.anarazel.de
https://postgr.es/m/E1oWpzF-002EG4-AG%40gemulon.postgresql.org

This is far from all.

The only platform rivalling AIX on the pain-caused metric is windows. And at
least that can be tested via CI (or locally).  We've been relying on the gcc
buildfarm to be able to maintain AIX at all, and that's not a resource that
scales to many users.

Greetings,

Andres Freund


Re: Relation bulk write facility

From
Noah Misch
Date:
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Commit 8af2565 wrote:
> --- /dev/null
> +++ b/src/backend/storage/smgr/bulk_write.c

> +/*
> + * Finish bulk write operation.
> + *
> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> + * the relation if needed.
> + */
> +void
> +smgr_bulk_finish(BulkWriteState *bulkstate)
> +{
> +    /* WAL-log and flush any remaining pages */
> +    smgr_bulk_flush(bulkstate);
> +
> +    /*
> +     * When we wrote out the pages, we passed skipFsync=true to avoid the
> +     * overhead of registering all the writes with the checkpointer.  Register
> +     * the whole relation now.
> +     *
> +     * There is one hole in that idea: If a checkpoint occurred while we were
> +     * writing the pages, it already missed fsyncing the pages we had written
> +     * before the checkpoint started.  A crash later on would replay the WAL
> +     * starting from the checkpoint, therefore it wouldn't replay our earlier
> +     * WAL records.  So if a checkpoint started after the bulk write, fsync
> +     * the files now.
> +     */
> +    if (!SmgrIsTemp(bulkstate->smgr))
> +    {

Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).  I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here.  (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

> +        /*
> +         * Prevent a checkpoint from starting between the GetRedoRecPtr() and
> +         * smgrregistersync() calls.
> +         */
> +        Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> +        MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +
> +        if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
> +        {
> +            /*
> +             * A checkpoint occurred and it didn't know about our writes, so
> +             * fsync() the relation ourselves.
> +             */
> +            MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +            smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
> +            elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
> +        }
> +        else
> +        {
> +            smgrregistersync(bulkstate->smgr, bulkstate->forknum);
> +            MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +        }
> +    }
> +}

This is an elegant optimization.



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
Thanks for poking at this!

On 01/07/2024 23:52, Noah Misch wrote:
> Commit 8af2565 wrote:
>> --- /dev/null
>> +++ b/src/backend/storage/smgr/bulk_write.c
> 
>> +/*
>> + * Finish bulk write operation.
>> + *
>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
>> + * the relation if needed.
>> + */
>> +void
>> +smgr_bulk_finish(BulkWriteState *bulkstate)
>> +{
>> +    /* WAL-log and flush any remaining pages */
>> +    smgr_bulk_flush(bulkstate);
>> +
>> +    /*
>> +     * When we wrote out the pages, we passed skipFsync=true to avoid the
>> +     * overhead of registering all the writes with the checkpointer.  Register
>> +     * the whole relation now.
>> +     *
>> +     * There is one hole in that idea: If a checkpoint occurred while we were
>> +     * writing the pages, it already missed fsyncing the pages we had written
>> +     * before the checkpoint started.  A crash later on would replay the WAL
>> +     * starting from the checkpoint, therefore it wouldn't replay our earlier
>> +     * WAL records.  So if a checkpoint started after the bulk write, fsync
>> +     * the files now.
>> +     */
>> +    if (!SmgrIsTemp(bulkstate->smgr))
>> +    {
> 
> Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
> decision is irrelevant to the !wal case.  Either we don't need fsync at all
> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).

The point of GetRedoRecPtr() is to detect if a checkpoint has started 
concurrently. It works for that purpose whether or not the bulk load is 
WAL-logged. It is not compared with the LSNs of WAL records written by 
the bulk load.

Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not 
fsync'd" thread [1]. I had already forgotten about that; that bug 
actually still exists in back branches, and we should fix it..

[1] 
https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

> I don't see any functional problem, but this likely arranges for an
> unnecessary sync when a checkpoint starts between mdcreate() and
> here.  (The mdcreate() sync may also be unnecessary, but that's
> longstanding.)
Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. 
It seems hard to eliminate the redundancy. smgr_bulk_finish() could skip 
the fsync, if it knew that smgrDoPendingSyncs() will do it later. 
However, smgrDoPendingSyncs() might also decide to WAL-log the relation 
instead of fsyncing it, and in that case we do still need the fsync.

Fortunately, fsync() on a file that's already flushed to disk is pretty 
cheap.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Relation bulk write facility

From
Noah Misch
Date:
On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
> On 01/07/2024 23:52, Noah Misch wrote:
> > Commit 8af2565 wrote:
> > > --- /dev/null
> > > +++ b/src/backend/storage/smgr/bulk_write.c
> > 
> > > +/*
> > > + * Finish bulk write operation.
> > > + *
> > > + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> > > + * the relation if needed.
> > > + */
> > > +void
> > > +smgr_bulk_finish(BulkWriteState *bulkstate)
> > > +{
> > > +    /* WAL-log and flush any remaining pages */
> > > +    smgr_bulk_flush(bulkstate);
> > > +
> > > +    /*
> > > +     * When we wrote out the pages, we passed skipFsync=true to avoid the
> > > +     * overhead of registering all the writes with the checkpointer.  Register
> > > +     * the whole relation now.
> > > +     *
> > > +     * There is one hole in that idea: If a checkpoint occurred while we were
> > > +     * writing the pages, it already missed fsyncing the pages we had written
> > > +     * before the checkpoint started.  A crash later on would replay the WAL
> > > +     * starting from the checkpoint, therefore it wouldn't replay our earlier
> > > +     * WAL records.  So if a checkpoint started after the bulk write, fsync
> > > +     * the files now.
> > > +     */
> > > +    if (!SmgrIsTemp(bulkstate->smgr))
> > > +    {
> > 
> > Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
> > decision is irrelevant to the !wal case.  Either we don't need fsync at all
> > (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
> 
> The point of GetRedoRecPtr() is to detect if a checkpoint has started
> concurrently. It works for that purpose whether or not the bulk load is
> WAL-logged. It is not compared with the LSNs of WAL records written by the
> bulk load.

I think the significance of start_RedoRecPtr is it preceding all records
needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash.  If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists.  What am I missing?

> Unlogged tables do need to be fsync'd. The scenario is:
> 
> 1. Bulk load an unlogged table.
> 2. Shut down Postgres cleanly
> 3. Pull power plug from server, and restart.
> 
> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
> thread [1]. I had already forgotten about that; that bug actually still
> exists in back branches, and we should fix it..
> 
> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

Ah, that's right.  I agree this code suffices for unlogged.  As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
a shutdown checkpoint, unlogged rels get reset anyway.)

> > I don't see any functional problem, but this likely arranges for an
> > unnecessary sync when a checkpoint starts between mdcreate() and
> > here.  (The mdcreate() sync may also be unnecessary, but that's
> > longstanding.)
> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
> fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com

So maybe like this:

  if (use_wal) /* includes init forks */
    current logic;
  else if (unlogged)
    smgrregistersync;
  /* else temp || (permanent && wal_level=minimal): nothing to do */

> Fortunately, fsync() on a file that's already flushed to disk is pretty
> cheap.

Yep.  I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL.  A
comment could be another way to fix that, though.



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 02/07/2024 02:24, Noah Misch wrote:
> On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
>> On 01/07/2024 23:52, Noah Misch wrote:
>>> Commit 8af2565 wrote:
>>>> --- /dev/null
>>>> +++ b/src/backend/storage/smgr/bulk_write.c
>>>
>>>> +/*
>>>> + * Finish bulk write operation.
>>>> + *
>>>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
>>>> + * the relation if needed.
>>>> + */
>>>> +void
>>>> +smgr_bulk_finish(BulkWriteState *bulkstate)
>>>> +{
>>>> +    /* WAL-log and flush any remaining pages */
>>>> +    smgr_bulk_flush(bulkstate);
>>>> +
>>>> +    /*
>>>> +     * When we wrote out the pages, we passed skipFsync=true to avoid the
>>>> +     * overhead of registering all the writes with the checkpointer.  Register
>>>> +     * the whole relation now.
>>>> +     *
>>>> +     * There is one hole in that idea: If a checkpoint occurred while we were
>>>> +     * writing the pages, it already missed fsyncing the pages we had written
>>>> +     * before the checkpoint started.  A crash later on would replay the WAL
>>>> +     * starting from the checkpoint, therefore it wouldn't replay our earlier
>>>> +     * WAL records.  So if a checkpoint started after the bulk write, fsync
>>>> +     * the files now.
>>>> +     */
>>>> +    if (!SmgrIsTemp(bulkstate->smgr))
>>>> +    {
>>>
>>> Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
>>> decision is irrelevant to the !wal case.  Either we don't need fsync at all
>>> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
>>
>> The point of GetRedoRecPtr() is to detect if a checkpoint has started
>> concurrently. It works for that purpose whether or not the bulk load is
>> WAL-logged. It is not compared with the LSNs of WAL records written by the
>> bulk load.
> 
> I think the significance of start_RedoRecPtr is it preceding all records
> needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
> we crash after commit, we're indifferent to whether the rel gets synced at a
> checkpoint before that crash or rebuilt from WAL after that crash.  If
> start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
> deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
> with LSNs in WAL records, it's significant only to the extent that such a WAL
> record exists.  What am I missing?

You're right. You pointed out below that we don't need to register or 
immediately fsync the relation if it was not WAL-logged, I missed that.

In the alternative universe that we did need to fsync() even in !use_wal 
case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect 
whether the last checkpoint "missed" fsyncing the files that we wrote. 
But the point is moot now.

>> Unlogged tables do need to be fsync'd. The scenario is:
>>
>> 1. Bulk load an unlogged table.
>> 2. Shut down Postgres cleanly
>> 3. Pull power plug from server, and restart.
>>
>> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
>> thread [1]. I had already forgotten about that; that bug actually still
>> exists in back branches, and we should fix it..
>>
>> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi
> 
> Ah, that's right.  I agree this code suffices for unlogged.  As a further
> optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
> always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
> smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
> a shutdown checkpoint, unlogged rels get reset anyway.)
> 
>>> I don't see any functional problem, but this likely arranges for an
>>> unnecessary sync when a checkpoint starts between mdcreate() and
>>> here.  (The mdcreate() sync may also be unnecessary, but that's
>>> longstanding.)
>> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
>> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
>> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
>> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
>> fsyncing it, and in that case we do still need the fsync.
> 
> We do not need the fsync in the "WAL-log the relation instead" case; see
> https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com

Ah, true, I missed that log_newpage_range() loads the pages to the 
buffer cache and dirties them. That kinds of sucks actually, I wish it 
didn't need to dirty the buffers.

> So maybe like this:
> 
>    if (use_wal) /* includes init forks */
>      current logic;
>    else if (unlogged)
>      smgrregistersync;
>    /* else temp || (permanent && wal_level=minimal): nothing to do */

Makes sense, except that we cannot distinguish between unlogged 
relations and permanent relations with !use_wal here.

It would be nice to have relpersistence flag in SMgrRelation. I remember 
wanting to have that before, although I don't remember what the context 
was exactly.

>> Fortunately, fsync() on a file that's already flushed to disk is pretty
>> cheap.
> 
> Yep.  I'm more concerned about future readers wondering why the function is
> using LSNs to decide what to do about data that doesn't appear in WAL.  A
> comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do 
you think of the attached?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Relation bulk write facility

From
Noah Misch
Date:
On Tue, Jul 02, 2024 at 02:42:50PM +0300, Heikki Linnakangas wrote:
> On 02/07/2024 02:24, Noah Misch wrote:
> > On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

> log_newpage_range() loads the pages to the buffer
> cache and dirties them. That kinds of sucks actually, I wish it didn't need
> to dirty the buffers.

Agreed.

> > > Fortunately, fsync() on a file that's already flushed to disk is pretty
> > > cheap.
> > 
> > Yep.  I'm more concerned about future readers wondering why the function is
> > using LSNs to decide what to do about data that doesn't appear in WAL.  A
> > comment could be another way to fix that, though.
> 
> Agreed, this is all very subtle, and deserves a good comment. What do you
> think of the attached?

Looks good.  Thanks.  pgindent doesn't preserve all your indentation, but it
doesn't make things objectionable, either.



Re: Relation bulk write facility

From
Heikki Linnakangas
Date:
On 03/07/2024 06:41, Noah Misch wrote:
> On Tue, Jul 02, 2024 at 02:42:50PM +0300, Heikki Linnakangas wrote:
>> On 02/07/2024 02:24, Noah Misch wrote:
>>> On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
>>>> Fortunately, fsync() on a file that's already flushed to disk is pretty
>>>> cheap.
>>>
>>> Yep.  I'm more concerned about future readers wondering why the function is
>>> using LSNs to decide what to do about data that doesn't appear in WAL.  A
>>> comment could be another way to fix that, though.
>>
>> Agreed, this is all very subtle, and deserves a good comment. What do you
>> think of the attached?
> 
> Looks good.  Thanks.  pgindent doesn't preserve all your indentation, but it
> doesn't make things objectionable, either.

Committed, thanks!

(Sorry for the delay, I had forgotten about this already and found it 
only now sedimented at the bottom of my inbox)

-- 
Heikki Linnakangas
Neon (https://neon.tech)