Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [HACKERS] WAL logging problem in 9.4.3?
Date
Msg-id 20190818035230.GB3021338@rfd.leadboat.com
Whole thread Raw
In response to Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [HACKERS] WAL logging problem in 9.4.3?
List pgsql-hackers
For two-phase commit, PrepareTransaction() needs to execute pending syncs.

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>      /* Remember if it's a system catalog */
>      is_system_catalog = IsSystemRelation(OldHeap);
>  
> -    /*
> -     * We need to log the copied data in WAL iff WAL archiving/streaming is
> -     * enabled AND it's a WAL-logged rel.
> -     */
> -    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> -
>      /* use_wal off requires smgr_targblock be initially invalid */
>      Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);

Since you're deleting the use_wal variable, update that last comment.

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
>              {
>                  SMgrRelation srel;
>  
> -                srel = smgropen(pending->relnode, pending->backend);
> -
> -                /* allocate the initial array, or extend it, if needed */
> -                if (maxrels == 0)
> +                if (pending->dosync)
>                  {
> -                    maxrels = 8;
> -                    srels = palloc(sizeof(SMgrRelation) * maxrels);
> +                    /* Perform pending sync of WAL-skipped relation */
> +                    FlushRelationBuffersWithoutRelcache(pending->relnode,
> +                                                        false);
> +                    srel = smgropen(pending->relnode, pending->backend);
> +                    smgrimmedsync(srel, MAIN_FORKNUM);

This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.

The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point."  Please write the part to WAL-log the contents of small files
instead of syncing them.

> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
>       * If it does commit, we'll have done the table_finish_bulk_insert() at
>       * the bottom of this routine first.
>       *
> -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> -     * can be cleared before the end of the transaction. The exact case is
> -     * when a relation sets a new relfilenode twice in same transaction, yet
> -     * the second one fails in an aborted subtransaction, e.g.
> -     *
> -     * BEGIN;
> -     * TRUNCATE t;
> -     * SAVEPOINT save;
> -     * TRUNCATE t;
> -     * ROLLBACK TO save;
> -     * COPY ...

The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.

> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -74,11 +74,13 @@ typedef struct RelationData
>      SubTransactionId rd_createSubid;    /* rel was created in current xact */
>      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
>                                                   * current xact */
> +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> +                                                 * first in current xact */

In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed.  Many bits of code need to look at all three,
e.g. RelationClose().  This field needs to be 100% reliable.  In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.

nm

Attachment

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)
Next
From: Konstantin Knizhnik
Date:
Subject: Re: Global temporary tables