Re: wake up logical workers after ALTER SUBSCRIPTION - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: wake up logical workers after ALTER SUBSCRIPTION
Date
Msg-id 20230104181219.GB296349@nathanxps13
Whole thread Raw
In response to Re: wake up logical workers after ALTER SUBSCRIPTION  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: wake up logical workers after ALTER SUBSCRIPTION
List pgsql-hackers
On Wed, Jan 04, 2023 at 10:57:43AM +0530, Amit Kapila wrote:
> On Tue, Jan 3, 2023 at 11:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> My approach was to add a variable to LogicalRepWorker that indicated
>> whether a worker needed to be restarted immediately.  While this is a
>> little weird because the workers array is treated as slots, it worked
>> nicely for ALTER SUBSCRIPTION.
> 
> So, are you planning to keep its in_use and subid flag as it is in
> logicalrep_worker_cleanup()? Otherwise, without that it could be
> reused for some other subscription.

I believe I did something like this in my proof-of-concept.  I might have
used the new flag as another indicator that the slot was still "in use".
In any case, you are right that we need to prevent the slot from being
reused.

> What if we maintain a hash table similar to 'last_start_times'
> maintained in tablesync.c? It won't have entries for new
> subscriptions, so for those we may not need to wait till
> wal_retrieve_retry_interval.

I proposed this upthread [0].  I still think it is a worthwhile change.
Right now, if a worker needs to be restarted but another unrelated worker
was restarted less than wal_retrieve_retry_interval milliseconds ago, the
launcher waits to restart it.  I think it makes more sense for each worker
to have its own restart interval tracked.

>> IIUC you are suggesting just one variable that would bypass
>> wal_retrieve_retry_interval for all subscriptions, not just those newly
>> altered or created.  This definitely seems like it would prevent delays,
>> but it would also cause wal_retrieve_retry_interval to be incorrectly
>> bypassed for the other workers in some cases.
>
> Right, but I guess it would be rare in practical cases that someone
> Altered/Created a subscription, and also some workers are restarted
> due to errors/crashes as only in those cases launcher can restart the
> worker when it shouldn't. However, in that case, also, it won't
> restart the apply worker again and again unless there are concurrent
> Create/Alter Subscription operations going on. IIUC, currently also it
> can always first time restart the worker immediately after ERROR/CRASH
> because we don't maintain last_start_time for each worker. I think
> this is probably okay as we want to avoid repeated restarts after the
> ERROR.

This line of thinking is why I felt that lowering
wal_retrieve_retry_interval for the tests might be sufficient.  Besides the
fact that it revealed multiple bugs, I don't see the point in adding much
more complexity here.  In practice, workers will usually start right away,
unless of course there are other worker starts happening around the same
time.  This consistently causes testing delays because the tests stress
these code paths, but I don't think what the tests are doing is a typical
use-case.

From the discussion thus far, it sounds like the alternatives are to 1) add
a global flag that causes wal_retrieve_retry_interval to be bypassed for
all workers or to 2) add a hash map in the launcher and a
restart_immediately flag in each worker slot.  I'll go ahead and create a
patch for 2 since it seems like the most complete solution, and we can
evaluate whether the complexity seems appropriate.

[0] https://postgr.es/m/20221214171023.GA689106%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: pgbench - adding pl/pgsql versions of tests
Next
From: Nathan Bossart
Date:
Subject: Re: fix and document CLUSTER privileges