Re: Yet another fast GiST build - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Yet another fast GiST build
Date
Msg-id 932846.1600698019@sss.pgh.pa.us
Whole thread Raw
In response to Re: Yet another fast GiST build  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 21/09/2020 02:06, Tom Lane wrote:
>> Another interesting point is that all the other index AMs seem to WAL-log
>> the new page before the smgrextend call, whereas this code is doing it
>> in the other order.  I strongly doubt that both patterns are equally
>> correct.  Could be that the other AMs are in the wrong though.

> My thinking was that it's better to call smgrextend() first, so that if 
> you run out of disk space, you get the error before WAL-logging it. That 
> reduces the chance that WAL replay will run out of disk space. A lot of 
> things are different during WAL replay, so it's quite likely that WAL 
> replay runs out of disk space anyway if you're living on the edge, but 
> still.

Yeah.  access/transam/README points out that such failures need to be
planned for, and explains what we do for heap pages;

    1. Adding a disk page to an existing table.

    This action isn't WAL-logged at all.  We extend a table by writing a page
    of zeroes at its end.  We must actually do this write so that we are sure
    the filesystem has allocated the space.  If the write fails we can just
    error out normally.  Once the space is known allocated, we can initialize
    and fill the page via one or more normal WAL-logged actions.  Because it's
    possible that we crash between extending the file and writing out the WAL
    entries, we have to treat discovery of an all-zeroes page in a table or
    index as being a non-error condition.  In such cases we can just reclaim
    the space for re-use.

So GIST seems to be acting according to that design.  (Someday we need
to update this para to acknowledge that not all filesystems behave as
it's assuming.)

> I didn't notice that the other callers are doing it the other way round, 
> though. I think they need to, so that they can stamp the page with the 
> LSN of the WAL record. But GiST build is special in that regard, because 
> it stamps all pages with GistBuildLSN.

Kind of unpleasant; that means they risk what the README points out:

    In all of these cases, if WAL replay fails to redo the original action
    we must panic and abort recovery.  The DBA will have to manually clean up
    (for instance, free up some disk space or fix directory permissions) and
    then restart recovery.  This is part of the reason for not writing a WAL
    entry until we've successfully done the original action.

I'm not sufficiently motivated to go and change it right now, though.

            regards, tom lane



pgsql-hackers by date:

Previous
From: "Andrey M. Borodin"
Date:
Subject: Re: Yet another fast GiST build
Next
From: "Namrata Bhave"
Date:
Subject: RE: Binaries on s390x arch