Re: [HACKERS] snapbuild woes - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] snapbuild woes
Date
Msg-id 0d06dfc1-0bc5-c3ab-3555-24e2f275d7fa@2ndquadrant.com
In response to Re: [HACKERS] snapbuild woes  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] snapbuild woes  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
List pgsql-hackers
Thanks for looking at this!

On 13/04/17 02:29, Andres Freund wrote:
> Hi,
> On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:
> 
>> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Fri, 24 Feb 2017 21:39:03 +0100
>> Subject: [PATCH 1/5] Reserve global xmin for create slot snapshot export
>>
>> Otherwise the VACUUM or pruning might remove tuples still needed by the
>> exported snapshot.
>> ---
>>  src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
>> index 5529ac8..57c392c 100644
>> --- a/src/backend/replication/logical/logical.c
>> +++ b/src/backend/replication/logical/logical.c
>> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
>>       * the slot machinery about the new limit. Once that's done the
>>       * ProcArrayLock can be released as the slot machinery now is
>>       * protecting against vacuum.
>> +     *
>> +     * Note that we only store the global xmin temporarily in the in-memory
>> +     * state so that the initial snapshot can be exported. After initial
>> +     * snapshot is done global xmin should be reset and not tracked anymore
>> +     * so we are fine with losing the global xmin after crash.
>>       * ----
>>       */
>>      LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>  
>>      slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
>>      slot->data.catalog_xmin = slot->effective_catalog_xmin;
>> +    slot->effective_xmin = slot->effective_catalog_xmin;
> 
> 
>>  void
>>  FreeDecodingContext(LogicalDecodingContext *ctx)
>>  {
>> +    ReplicationSlot *slot = MyReplicationSlot;
>> +
>>      if (ctx->callbacks.shutdown_cb != NULL)
>>          shutdown_cb_wrapper(ctx);
>>  
>> +    /*
>> +     * Cleanup global xmin for the slot that we may have set in
>> +     * CreateInitDecodingContext().
> 
> Hm.  Is that actually a meaningful point to do so? For one, it gets
> called by pg_logical_slot_get_changes_guts(), but more importantly, the
> snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
> next command?  If we rely on the snapshot magic done by ExportSnapshot()
> it'd be worthwhile to mention that...
> 

(I haven't looked at the patch for a couple of months, so I don't
remember all the detailed decisions anymore.)

Yes, we rely on the backend's xmin being set for the exported snapshot.
We really only care about the global xmin for exported snapshots; I
assumed that was clear enough from "so that the initial snapshot can be
exported", but I guess there should be a clearer comment about this
where we actually clean it up.

> 
>> We do not take ProcArrayLock or similar
>> +     * since we only reset xmin here and there's not much harm done by a
>> +     * concurrent computation missing that.
>> +     */
> 
> Hum.  I was prepared to complain about this, but ISTM, that there's
> absolutely no risk because the following
> ReplicationSlotsComputeRequiredXmin(false); actually does all the
> necessary locking?  But still, I don't see much point in the
> optimization.
> 

Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
why we need it here. Please enlighten me.

> 
> 
>> This patch changes the code so that stored snapshots are only used for
>> logical decoding restart but not for initial slot snapshot.
> 
> Yea, that's a very good point...
> 
>> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>  
>>          return false;
>>      }
>> -    /* c) valid on disk state */
>> -    else if (SnapBuildRestore(builder, lsn))
>> +    /* c) valid on disk state and not exported snapshot */
>> +    else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
>> +             SnapBuildRestore(builder, lsn))
> 
> Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
> interface, where it'd strictly speaking not be required, right?
> 

Good point. Maybe we should really tell the snapshot builder explicitly
whether the snapshot is going to be exported or not (see the rant all
the way down).

> 
>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Sun, 26 Feb 2017 01:07:33 +0100
>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
> 
> A bit more commentary would be good. What does that protect us against?
> 

I think I explained that in the email: otherwise we might export a
snapshot with an xmin smaller than the global xmin.

> 
> 
>> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Wed, 22 Feb 2017 00:57:33 +0100
>> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder
>>
>> Due to race condition, the xl_running_xacts might contain no longer
>> running transactions. Previous coding tried to get around this by
>> additional locking but that did not work correctly for committs. Instead
>> try combining decoded commits and multiple xl_running_xacts to get the
>> consistent snapshot.
> 
> Needs more explanation about approach.
> 
> 
>> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>       *      simply track the number of in-progress toplevel transactions and
>>       *      lower it whenever one commits or aborts. When that number
>>       *      (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
>> -     *      to CONSISTENT.
>> +     *      to CONSISTENT. Sometimes we might get xl_running_xacts which has
>> +     *      all tracked transactions as finished. We'll need to restart tracking
>> +     *      in that case and use previously collected committed transactions to
>> +     *      purge transactions mistakenly marked as running in the
>> +     *      xl_running_xacts which exist as a result of race condition in
>> +     *      LogStandbySnapshot().
> 
> I'm not following this yet.
> 

Let me try to explain:
We get an xl_running_xacts with txes 1, 3, 4, but 1 already committed
before the record was written, so decoding will never see its commit
and we never reach a consistent snapshot. At some point we may get an
xl_running_xacts with txes 6, 7, 8, which tells us that all
transactions from the initial xl_running_xacts must be closed. We then
restart the tracking from the beginning, as if this were the first
xl_running_xacts we've seen, with one exception: we look into the past,
and if we have already seen any of transactions 6, 7, 8 finish, we mark
them as finished immediately (hence avoiding the same issue with, say,
transaction 6 having already committed before the xl_running_xacts was
written).

> 
>> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>       * b) first encounter of a useable xl_running_xacts record. If we had
>>       * found one earlier we would either track running transactions (i.e.
>>       * builder->running.xcnt != 0) or be consistent (this function wouldn't
>> -     * get called).
>> +     * get called). However it's possible that we could not see all
>> +     * transactions that were marked as running in xl_running_xacts, so if
>> +     * we get new one that says all were closed but we are not consistent
>> +     * yet, we need to restart the tracking while taking previously seen
>> +     * transactions into account.
> 
> This needs to revise the preceding comment more heavily.  "This is the
> first!!! Or maybe not!" isn't easy to understand.
> 

Yeah, I found it a bit hard to make this sound correct without being
confusing. I even wondered if I should split this code in two because
of that, but that would lead to quite a bit of code duplication; I
don't know if that's better. Maybe we could move the "reset" code into
a separate function to avoid most of the duplication.

> 
>>       */
>> -    else if (!builder->running.xcnt)
>> +    else if (!builder->running.xcnt ||
>> +             running->oldestRunningXid > builder->running.xmax)
> 
> Isn't that wrong under wraparound?
> 

Right, should use TransactionIdFollows.

> 
>>      {
>>          int            off;
>> +        bool        first = builder->running.xcnt == 0;
>>  
>>          /*
>>           * We only care about toplevel xids as those are the ones we
>> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>          builder->running.xmin = builder->running.xip[0];
>>          builder->running.xmax = builder->running.xip[running->xcnt - 1];
>>  
>> +
>>          /* makes comparisons cheaper later */
>>          TransactionIdRetreat(builder->running.xmin);
>>          TransactionIdAdvance(builder->running.xmax);
>>  
>>          builder->state = SNAPBUILD_FULL_SNAPSHOT;
>>  
>> -        ereport(LOG,
>> -            (errmsg("logical decoding found initial starting point at %X/%X",
>> -                    (uint32) (lsn >> 32), (uint32) lsn),
>> -             errdetail_plural("%u transaction needs to finish.",
>> -                              "%u transactions need to finish.",
>> -                              builder->running.xcnt,
>> -                              (uint32) builder->running.xcnt)));
>> -
>> +        /*
>> +         * If this is the first time we've seen xl_running_xacts, we are done.
>> +         */
>> +        if (first)
>> +        {
>> +            ereport(LOG,
>> +                (errmsg("logical decoding found initial starting point at %X/%X",
>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>> +                 errdetail_plural("%u transaction needs to finish.",
>> +                                  "%u transactions need to finish.",
>> +                                  builder->running.xcnt,
>> +                                  (uint32) builder->running.xcnt)));
>> +        }
>> +        else
>> +        {
>> +            /*
>> +             * Because of the race condition in LogStandbySnapshot() the
>> +             * transactions recorded in xl_running_xacts as running might have
>> +             * already committed by the time the xl_running_xacts was written
>> +             * to WAL. Use the information about decoded transactions that we
>> +             * gathered so far to update our idea about what's still running.
>> +             *
>> +             * We can use SnapBuildEndTxn directly as it only does the
>> +             * transaction running check and handling without any additional
>> +             * side effects.
>> +             */
>> +            for (off = 0; off < builder->committed.xcnt; off++)
>> +                SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
>> +            if (builder->state == SNAPBUILD_CONSISTENT)
>> +                return false;
>> +
>> +            ereport(LOG,
>> +                (errmsg("logical decoding moved initial starting point to %X/%X",
>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>> +                 errdetail_plural("%u transaction needs to finish.",
>> +                                  "%u transactions need to finish.",
>> +                                  builder->running.xcnt,
>> +                                  (uint32) builder->running.xcnt)));
>> +        }
> 
> Hm, this is not pretty.
> 

Hmm? There are two possible scenarios here that need to be handled
differently. Possibly another reason to split this out completely, as
mentioned above.

> 
>> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Tue, 21 Feb 2017 19:58:18 +0100
>> Subject: [PATCH 5/5] Skip unnecessary snapshot builds
>>
>> When doing initial snapshot build during logical decoding
>> initialization, don't build snapshots for transactions where we know the
>> transaction didn't do any catalog changes. Otherwise we might end up
>> with thousands of useless snapshots on busy server which can be quite
>> expensive.
>> ---
>>  src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
>>  1 file changed, 53 insertions(+), 29 deletions(-)
>>
>> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
>> index 1a1c9ba..c800aa5 100644
>> --- a/src/backend/replication/logical/snapbuild.c
>> +++ b/src/backend/replication/logical/snapbuild.c
>> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>      bool        forced_timetravel = false;
>>      bool        sub_needs_timetravel = false;
>>      bool        top_needs_timetravel = false;
>> +    bool        skip_forced_snapshot = false;
>>  
>>      TransactionId xmax = xid;
>>  
>> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>          /*
>>           * We could avoid treating !SnapBuildTxnIsRunning transactions as
>>           * timetravel ones, but we want to be able to export a snapshot when
>> -         * we reached consistency.
>> +         * we reached consistency so we need to keep track of them.
>>           */
>>          forced_timetravel = true;
>>          elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
>> +
>> +        /*
>> +         * It is however desirable to skip building new snapshot for
>> +         * !SnapBuildTxnIsRunning transactions as otherwise we might end up
>> +         * building thousands of unused snapshots on busy servers which can
>> +         * be very expensive.
>> +         */
>> +        if (!SnapBuildTxnIsRunning(builder, xid))
>> +            skip_forced_snapshot = true;
>>      }
> 
> That's pretty crudely bolted on the existing logic, isn't there a
> simpler way?
> 

Agreed; however, every time I tried to make this prettier I ended up
producing subtle bugs (see the initial email in this thread for an
example), so I eventually gave up on pretty.

As a side note, my opinion after all this is that it's probably a
mistake to try to use various situational conditions to make sure we
are building an exportable snapshot. ISTM we should tell the snapshot
builder explicitly that the snapshot will be exported, and it should
behave accordingly based on that. For example, we should also track
aborted transactions in a snapshot that is to be exported, because
otherwise enough of them happening during snapshot building will
result in an export failure due to a too-big snapshot. But that seems
like too invasive a change to be back-portable.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


