Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAHut+PuQhKEeFp6R1d4DKwP3sS80KryZnzBx3EngDAgdRVtOyQ@mail.gmail.com
Whole thread Raw
In response to Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Single transaction in the tablesync worker?  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Jan 6, 2021 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 5, 2021 at 3:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 5. Is it possible to write a testcase where we fail (say due to pk
> > > violation or some other error) after the initial copy is done, then
> > > remove the conflicting row and allow a copy to be completed? If we
> > > already have any such test then it is fine.
> > >
> >
> > Causing a PK violation during the initial copy is not a problem to
> > test, but doing it after the initial copy is difficult. I have done
> > exactly this test scenario before but I thought it cannot be
> > automated. E.g. To cause an PK violation error somewhere between
> > COPYDONE and SYNDONE means that the offending insert (the one which
> > tablesync will fail to replicate) has to be sent while the tablesync
> > is in CATCHUP mode. But AFAIK that can only be achieved using the
> > debugger to get the timing right.
> >
>
> Yeah, I am also not able to think of any way to automate such a test.
> I was thinking about what could go wrong if we error out in that
> stage. The only thing that could be problematic is if we somehow make
> the slot and replication origin used during copy dangling. I think if
> tablesync is restarted after error then we will clean up those which
> will be normally the case but what if the tablesync worker is not
> started again? I think the only possibility of tablesync worker not
> started again is if during Alter Subscription ... Refresh Publication,
> we remove the corresponding subscription rel (see
> AlterSubscription_refresh, I guess it could happen if one has dropped
> the relation from publication). I haven't tested this with your patch
> but if such a possibility exists then we need to think of cleaning up
> slot and origin when we remove subscription rel. What do you think?
>

I think it makes sense. If there can be a race between the tablesync
re-launching (after error), and the AlterSubscription_refresh removing
some table’s relid from the subscription then there could be lurking
slot/origin tablesync resources (of the removed table) which a
subsequent DROP SUBSCRIPTION cannot discover. I will think more about
how/if it is possible to make this happen. Anyway, I suppose I ought
to refactor/isolate some of the tablesync cleanup code in case it
needs to be commonly called from DropSubscription and/or from
AlterSubscription_refresh.

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



pgsql-hackers by date:

Previous
From: Antonin Houska
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Next
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: Enhance traceability of wal_level changes for backup management