Re: Skip partition tuple routing with constant partition key - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: Skip partition tuple routing with constant partition key
Msg-id: CAD21AoAvTTTeG7+7UCENYhydHJQNETZ2PVyVogQn=PkZ-kEG=Q@mail.gmail.com
In response to: Re: Skip partition tuple routing with constant partition key (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On Thu, Apr 7, 2022 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-04-06 00:07:07 -0400, Tom Lane wrote:
> > Amit Langote <amitlangote09@gmail.com> writes:
> > > On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:
> > >> Is this a problem with the patch or its tests?
> > >> [18:14:20.798] Test Summary Report
> > >> [18:14:20.798] -------------------
> > >> [18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)
> >
> > > Hmm, make check-world passes for me after rebasing the patch (v10) to
> > > the latest HEAD (clean), nor do I see a failure on cfbot:
> > > http://cfbot.cputube.org/amit-langote.html
> >
> > 013_partition.pl has been failing regularly in the buildfarm,
> > most recently here:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45
>
> Just failed locally on my machine as well.
>
>
> > I don't think there's room to blame any uncommitted patches
> > for that.  Somebody broke it a short time before here:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19
>
> The obvious thing to point a finger at is
>
> commit c91f71b9dc91ef95e1d50d6d782f477258374fc6
> Author: Tomas Vondra <tomas.vondra@postgresql.org>
> Date:   2022-03-16 16:42:47 +0100
>
>     Fix publish_as_relid with multiple publications
>

I've not managed to reproduce this issue on my machine, but while
reviewing the code and the server logs[1] I may have found a possible
bug:

2022-04-08 12:59:30.701 EDT [91997:1] LOG:  logical replication apply
worker for subscription "sub2" has started
2022-04-08 12:59:30.702 EDT [91998:3] 013_partition.pl LOG:
statement: ALTER SUBSCRIPTION sub2 SET PUBLICATION pub_lower_level,
pub_all
2022-04-08 12:59:30.733 EDT [91998:4] 013_partition.pl LOG:
disconnection: session time: 0:00:00.036 user=buildfarm
database=postgres host=[local]
2022-04-08 12:59:30.740 EDT [92001:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab4_1" has
started
2022-04-08 12:59:30.744 EDT [91997:2] LOG:  logical replication apply
worker for subscription "sub2" will restart because of a parameter
change
2022-04-08 12:59:30.750 EDT [92003:1] LOG:  logical replication table
synchronization worker for subscription "sub2", table "tab3" has
started

The logs show that the apply worker for "sub2" exited (to be restarted
because of the parameter change) whereas the tablesync workers for
"tab4_1" and "tab3" started. After these entries there are no logs
showing that the tablesync workers finished or that the apply worker
for "sub2" actually restarted, until the timeout. While reviewing the
code, I realized that a tablesync worker can proceed even without any
intervention from the apply worker.

After a tablesync worker copies the table, it sets its relstate to
SUBREL_STATE_SYNCWAIT and then waits for the apply worker to update
the relstate to SUBREL_STATE_CATCHUP. If the apply worker has already
died, wait_for_worker_state_change() breaks out of the wait loop and
returns false:

wait_for_worker_state_change():

    for (;;)
    {
        LogicalRepWorker *worker;

        :

        /*
         * Bail out if the apply worker has died, else signal it we're
         * waiting.
         */
        LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
        worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
                                        InvalidOid, false);
        if (worker && worker->proc)
            logicalrep_worker_wakeup_ptr(worker);
        LWLockRelease(LogicalRepWorkerLock);
        if (!worker)
            break;

        :
    }

    return false;

However, the caller doesn't check the return value at all:

    /*
     * We are done with the initial data synchronization, update the state.
     */
    SpinLockAcquire(&MyLogicalRepWorker->relmutex);
    MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
    MyLogicalRepWorker->relstate_lsn = *origin_startpos;
    SpinLockRelease(&MyLogicalRepWorker->relmutex);

    /*
     * Finally, wait until the main apply worker tells us to catch up and then
     * return to let LogicalRepApplyLoop do it.
     */
    wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
    return slotname;

Therefore, the tablesync worker starts logical replication while its
relstate remains SUBREL_STATE_SYNCWAIT.
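
Just to illustrate the gap, the caller could react to the return
value, for example along these lines (an untested sketch only, not a
concrete proposal, and the error message wording is made up):

    /*
     * Untested sketch: bail out if wait_for_worker_state_change()
     * reports that the apply worker has died, instead of starting to
     * apply changes with relstate still SUBREL_STATE_SYNCWAIT.
     */
    if (!wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
        ereport(ERROR,
                (errmsg("logical replication apply worker for subscription \"%s\" has exited",
                        MySubscription->name)));

Whether erroring out (and thus restarting the tablesync worker) is the
right reaction is a separate question; the point is only that the
current code ignores this case entirely.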

Given the server logs, it's likely that both tablesync workers, for
"tab4_1" and "tab3", were in this situation. That is, there were two
tablesync workers applying changes for their target relations while
their relstate remained SUBREL_STATE_SYNCWAIT.

As for restarting the apply worker, that probably never happened
because the number of running tablesync workers had already reached
max_sync_workers_per_subscription (2 by default):

logicalrep_worker_launch():

    /*
     * If we reached the sync worker limit per subscription, just exit
     * silently as we might get here because of an otherwise harmless race
     * condition.
     */
    if (nsyncworkers >= max_sync_workers_per_subscription)
    {
        LWLockRelease(LogicalRepWorkerLock);
        return;
    }

This scenario seems possible in principle, but since I've not managed
to reproduce the issue I might be wrong. In particular, according to
the server logs it looks like the tablesync workers were launched
before the apply worker restarted due to the parameter change, and
this is a common pattern among the other failure logs, but I'm not
sure how that could really happen. IIUC the apply worker always
re-reads the subscription (and exits if a parameter has changed) and
only then requests to launch tablesync workers. Also, the fact that
we don't check the return value of wait_for_worker_state_change() is
not new; we have been living with this behavior since v10. So I'm not
really sure why this problem has appeared only recently, if my
hypothesis is correct.
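
For reference, my understanding of how the apply worker reacts to a
parameter change is roughly the following. This is a paraphrase of
maybe_reread_subscription() in worker.c, so the set of parameters
checked here is only illustrative and the exact code differs:

    /*
     * Rough paraphrase (illustrative only): when a parameter relevant
     * to the apply worker has changed, it logs the "will restart"
     * message and exits, leaving the launcher to start a new one.
     */
    if (!equal(newsub->publications, MySubscription->publications) ||
        strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
        strcmp(newsub->slotname, MySubscription->slotname) != 0)
    {
        ereport(LOG,
                (errmsg("logical replication apply worker for subscription \"%s\" will "
                        "restart because of a parameter change",
                        MySubscription->name)));
        proc_exit(0);
    }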

Regards,

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=grassquit&dt=2022-04-08%2014%3A13%3A27&stg=subscription-check

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


