Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Skipping logical replication transactions on subscriber side
Date
Msg-id CAKFQuwaLfgeForu2XYgaCuzRg5=8=5vPe13kKwxGbwu0LyeJgQ@mail.gmail.com
Whole thread Raw
In response to Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Skipping logical replication transactions on subscriber side
Re: Skipping logical replication transactions on subscriber side
List pgsql-hackers
On Sat, Jan 22, 2022 at 2:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jan 22, 2022 at 12:41 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Fri, Jan 21, 2022 at 10:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Jan 21, 2022 at 10:00 PM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Fri, Jan 21, 2022 at 4:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>

>
> I agree with incorporating "reset" into the paragraph somehow - does not have to mention NONE, just that ALTER SUBSCRIPTION SKIP (not a family friendly abbreviation...) is what does it.
>

It is not clear to me what you have in mind here but to me in this
context saying "Setting <literal>NONE</literal> resets the transaction
ID." seems quite reasonable.

OK

Yeah, we will error in that situation. The only invalid values are
system reserved values (1,2).

So long as the ALTER command errors when asked to skip those IDs there isn't any reason for an end-user, who likely doesn't know or care that 1 and 2 are special, to be concerned about them (the only two invalid values) while reading the docs.


> Or, why even force them to specify a number instead of just saying SKIP and if there is a current error we skip its transaction, otherwise we warn them that nothing happened because there is no last error.
>

The idea is that we might extend this feature to skip specific
operations on relations or maybe by having other identifiers.

Again, you've already got syntax reserved that lets you add more features to this command in the future; and removing warnings or errors because new features make them moot is easy.  Lets document and code what we are willing to implement today.  A single top-level transaction xid that is presently blocking the worker from applying any more WAL.

One idea
we discussed was to automatically fetch the last error xid but then
decided it can be done as a later patch.

This seems backwards.  The user-friendly approach is to not make them type in anything at all.  That said, this particular UX seems like it could use some safety.  Thus I would propose at this time that attempting to set the skip_option to anything but THE active_error_xid for the named subscription results in an error.  Once you add new features the user can set the skip_option to other things without provoking errors.  Again, I consider this a safety feature since the user now has to accurately match the xid to the name in the SQL in order to perform a successful skip - and the to-be affected transaction has to be one that is preventing replication from moving forward.  I'm not interested in providing a foot-gun where an arbitrary future transaction can be scheduled to be skipped.  Running the command twice with the same values should provoke an error since the first run should be allowed to finish (?).  Also, we handle the situation where the state of the worker changes between when the user saw the error and wrote down the xid to skip and the actual execution of the alter command.  Maybe not highly anticipated scenarios but this is an easy win to deal with them.


> Additionally, the description for pg_stat_subscription_workers should describe what happens once the transaction represented by last_error_xid has either been successfully processed or skipped.  Does this "last error" stick around until another error happens (which is hopefully very rare) or does it reset to blanks?
>

It will be reset only on subscription drop, otherwise, it will stick
around until another error happens.
 
I really dislike the user experience this provides, and given it is new in v15 (and right now this table seems to exist solely to support this feature) changing this seems within the realm of possibility. I have to imagine these workers have a sense of local state that would just be "no errors, no need to touch pg_stat_subscription_workers at the end of this transaction's commit".  It would save a local state of the error_xid and if a successfully committed transaction has that xid it would clear the error.  The skip code path would also check for and see the matching xid value and clear the error.  Even if the local state thing doesn't work, one catalog lookup per transaction seems like potentially reasonable overhead to incur here.

David J.

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: fairywren is generating bogus BASE_BACKUP commands
Next
From: "David G. Johnston"
Date:
Subject: Re: Skipping logical replication transactions on subscriber side