On Sat, Apr 2, 2022 at 5:41 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Fri, Apr 01, 2022 at 09:25:52PM +0900, Masahiko Sawada wrote:
> > > On Fri, Apr 1, 2022 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > > src/test/subscription/t/029_on_error.pl has been failing reliably on the five
> > > > AIX buildfarm members:
> > > >
> > > > # poll_query_until timed out executing this query:
> > > > # SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'
> > > > # expecting this output:
> > > > # t
> > > > # last actual query output:
> > > > # f
> > > > # with stderr:
> > > > timed out waiting for match: (?^:LOG: done skipping logical replication transaction finished at 0/1D30788) at
t/029_on_error.plline 50.
> > > >
> > > > I've posted five sets of logs (2.7 MiB compressed) here:
> > > > https://drive.google.com/file/d/16NkyNIV07o0o8WM7GwcaAYFQDPTkULkR/view?usp=sharing
>
> > Given that "SELECT subskiplsn = '0/0'
> > FROM pg_subscription WHERE subname = 'sub’” didn't return true, some
> > value was set to subskiplsn even after the unique key error.
> >
> > So I'm guessing that the apply worker could not get the updated value
> > of the subskiplsn or its MySubscription->skiplsn could not match with
> > the transaction's finish LSN. Also, given that the test is failing on
> > all AIX buildfarm members, there might be something specific to AIX.
> >
> > Noah, to investigate this issue further, is it possible for you to
> > apply the attached patch and run the 029_on_error.pl test? The patch
> > adds some logs to get additional information.
>
> Logs attached.
>
Thank you.
By seeing the below Logs:
----
....
2022-04-01 18:19:34.710 CUT [58327402] LOG: not started skipping
changes: my_skiplsn 14EB7D8/B0706F72 finish_lsn 0/14EB7D8
...
----
It seems that the value of skiplsn read in GetSubscription is wrong
which makes the apply worker think it doesn't need to skip the
transaction. Now, in Alter/Create Subscription, we are using
LSNGetDatum() to store skiplsn value in pg_subscription but while
reading it in GetSubscription(), we are not converting back the datum
to LSN by using DatumGetLSN(). Is it possible that on this machine it
might be leading to not getting the right value for skiplsn? I think
it is worth trying to see if this fixes the problem.
Any other thoughts?
--
With Regards,
Amit Kapila.