Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: Skipping logical replication transactions on subscriber side
Date:
Msg-id: CAD21AoARD+4oB=Uwcr5-QG-Qf_gF_OdmY1R4cLh=vT6a6NH4TQ@mail.gmail.com
In response to: Re: Skipping logical replication transactions on subscriber side (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Skipping logical replication transactions on subscriber side
List: pgsql-hackers
On Sat, May 29, 2021 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, May 29, 2021 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, May 27, 2021 at 7:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, May 27, 2021 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > 1. the worker records the XID and commit LSN of the failed
> > > > transaction to a catalog.
> > >
> > > When will you record this info? I am not sure if we can try to
> > > update this when an error has occurred. We can think of using
> > > try..catch in the apply worker and then record it in the catch on
> > > error, but would that be advisable? One random thought that
> > > occurred to me is that the apply worker notifies such information
> > > to the launcher (or maybe another process), which will log this
> > > information.
> >
> > Yeah, I was concerned about that too and had the same idea. The
> > information still could not be written if the server crashes before
> > the launcher writes it. But I think that's acceptable.
>
> True, because even if the launcher restarts, the apply worker will
> error out again and resend the information. I guess we can have an
> error queue where apply workers can add their information and the
> launcher will then process those. If we do that, then we probably
> need to define what we want to do if the queue gets full: either the
> apply worker nudges the launcher and waits, or it just throws an
> error and continues. If you have any better ideas for sharing this
> information, we can consider those as well.

+1 for using an error queue. Maybe we need to avoid queuing the same
error more than once, to avoid updating the catalog too frequently?
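For example, something along these lines, as a rough standalone sketch
(the structure and all names here are made up; the real queue would be
a fixed-size queue in shared memory, protected by a lock):

/*
 * Rough standalone model of the proposed error queue.  Apply workers
 * push (subid, xid, message) entries; the launcher drains them into
 * the conflicts catalog.
 */
#include <stdbool.h>
#include <stdio.h>

#define ERRQ_SIZE 64

typedef struct ErrQueueEntry
{
    unsigned int subid;         /* subscription OID */
    unsigned int xid;           /* XID of the failed remote transaction */
    char         message[256];
} ErrQueueEntry;

static ErrQueueEntry errq[ERRQ_SIZE];
static int errq_len = 0;

/*
 * Called by an apply worker after a failure.  Returns false if the
 * queue is full; the caller would then either nudge the launcher and
 * wait, or just throw an error and retry, as discussed above.  The
 * same (subid, xid) is queued at most once, so a transaction that
 * keeps failing doesn't cause repeated catalog updates.
 */
static bool
errq_push(unsigned int subid, unsigned int xid, const char *msg)
{
    for (int i = 0; i < errq_len; i++)
    {
        if (errq[i].subid == subid && errq[i].xid == xid)
            return true;        /* duplicate; nothing to do */
    }

    if (errq_len >= ERRQ_SIZE)
        return false;           /* full */

    errq[errq_len].subid = subid;
    errq[errq_len].xid = xid;
    snprintf(errq[errq_len].message, sizeof(errq[errq_len].message),
             "%s", msg);
    errq_len++;
    return true;
}

/* Launcher side: drain the queue into the conflicts catalog. */
static void
errq_drain(void)
{
    for (int i = 0; i < errq_len; i++)
        printf("record in catalog: subid=%u xid=%u: %s\n",
               errq[i].subid, errq[i].xid, errq[i].message);
    errq_len = 0;
}

int
main(void)
{
    /* The second push with the same XID is a no-op. */
    errq_push(16394, 740, "duplicate key value violates unique constraint");
    errq_push(16394, 740, "duplicate key value violates unique constraint");
    errq_drain();
    return 0;
}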
> > > > 2. the user specifies how to resolve that conflict transaction
> > > > (currently only 'skip' is supported) and writes to the catalog.
> > > > 3. the worker does the resolution method according to the
> > > > catalog. If the worker didn't start to apply those changes, it
> > > > can skip the entire transaction. If it did, it rolls back the
> > > > transaction and ignores the remaining changes.
> > > >
> > > > The worker needs neither to reset the information of the last
> > > > failed transaction nor to mark the conflicted transaction as
> > > > resolved. The worker will ignore that information when checking
> > > > the catalog if the commit LSN has already been passed.
> > >
> > > So won't this require us to check the required info in the catalog
> > > before applying each transaction? If so, that might be overhead;
> > > maybe we can build some cache of the highest commit LSN that can
> > > be consulted rather than the catalog table.
> >
> > I think workers can cache that information when they start, and
> > invalidate and reload the cache when the catalog gets updated.
> > Specifying an XID to skip will update the catalog, invalidating the
> > cache.
> >
> > > I think we need to think about when to remove rows for which the
> > > conflict has been resolved, as we can't let that information grow
> > > infinitely.
> >
> > I guess we can update catalog tuples in place when another conflict
> > happens next time. The catalog tuple should be fixed size. An
> > already-resolved conflict will have a commit LSN older than its
> > replication origin's LSN.
>
> Okay, but I have a slight concern that we will keep an XID in the
> system which might no longer be valid. So we will keep this info
> about subscribers around until one performs DROP SUBSCRIPTION;
> hopefully that doesn't lead to too many rows. This will be okay as
> per the current design, but say tomorrow we decide to parallelize the
> apply for a subscription: then there could be multiple errors
> corresponding to a subscription, and in that case such a design might
> appear quite limiting. One possibility could be that when the
> launcher is periodically checking for new error messages, it can
> clean up the conflicts catalog as well, or maybe autovacuum does this
> periodically as it does for stats (via pgstat_vacuum_stat).

Yeah, it's better to have a way to clean up no-longer-valid entries in
the catalog in case the worker failed to remove them. I prefer the
former idea so far, so I'll implement it in a PoC patch.
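To make that concrete, here is a minimal standalone sketch of such a
cleanup pass (all names and types are hypothetical; the real code
would scan the conflicts catalog and compare each entry's commit LSN
against the subscription origin's progress, deleting passed entries
with CatalogTupleDelete()):

/*
 * Minimal standalone sketch of the launcher-side cleanup.  An entry
 * whose commit LSN is at or below the origin's replayed LSN has
 * already been applied or skipped, so it can be dropped.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

typedef struct ConflictEntry
{
    unsigned int subid;         /* subscription OID */
    unsigned int xid;           /* failed remote transaction */
    XLogRecPtr   commit_lsn;    /* its commit LSN on the publisher */
    bool         valid;
} ConflictEntry;

/*
 * Drop entries that the origin's progress has already passed.  Here we
 * just mark the array slot invalid; the real thing would delete the
 * catalog tuple.
 */
static void
cleanup_conflicts(ConflictEntry *entries, int n, XLogRecPtr origin_lsn)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i].valid && entries[i].commit_lsn <= origin_lsn)
            entries[i].valid = false;
    }
}

int
main(void)
{
    ConflictEntry cat[] = {
        {16394, 740, 0x16B3748, true},  /* resolved: origin is past it */
        {16394, 801, 0x16C0000, true},  /* still pending */
    };

    cleanup_conflicts(cat, 2, 0x16B4000);

    for (int i = 0; i < 2; i++)
        printf("xid=%u %s\n", cat[i].xid,
               cat[i].valid ? "kept" : "removed");
    return 0;
}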
Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/