Thread: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Thomas Munro
Date:
Hi,

While scanning for assertion failures on the build farm that don't
appear to have been discussed, I found this[1] in
010_truncate_publisher.log on the 13 branch:

TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File:
"/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c",
Line: 321)
2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:4] LOG:  server process
(PID 1014658) was terminated by signal 11: Segmentation fault
2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:5] DETAIL:  Failed
process was running: SELECT pg_catalog.set_config('search_path', '',
false);

The last thing the segfaulting process said was:

2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:8] LOG:  logical decoding
found consistent point at 0/157D538
2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:9] DETAIL:  There are no
running transactions.

Unfortunately 13 doesn't log PIDs for assertion failures (hmm, commit
18c170a08ee could be back-patched?) so it's not clear which process
that was, and there's no backtrace.

I don't know if "pg_catalog.set_config('search_path', '', false)" is a
clue that this is related to another recent crash[2] I mentioned, also
from the 13 branch.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-16%2023:58:47
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BqdF6QE6rcj_Zj5h2qVARM--%2B92sqdmr-0DUSM_0Qu_g%40mail.gmail.com



Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Amit Kapila
Date:
On Tue, Jun 22, 2021 at 10:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> While scanning for assertion failures on the build farm that don't
> appear to have been discussed, I found this[1] in
> 010_truncate_publisher.log on the 13 branch:
>
> TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File:
> "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c",
> Line: 321)
>

I guess this could be similar to what we see at:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-06-15%2020%3A49%3A26

We have discussed this in another thread at:
https://www.postgresql.org/message-id/648020.1623854904%40sss.pgh.pa.us

The reason why I think it is the same is that assertion failure shown
in the report is from function FreeTupleDesc() which we can call from
pgoutput.c while processing the invalidation. Ideally, we shouldn't
call invalidation before initializing the tuple conversion map for
partitions but in some rare cases, that was happening which we have
fixed in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=357cb8f07f95665ea533ff534821c22c35b01288).

I see this report is from 16th June 2021 and the commit is on 18th
June 2021. So, I am hoping this should have been fixed but if we see
it again then probably we need to investigate it further.

-- 
With Regards,
Amit Kapila.



Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Thomas Munro
Date:
On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jun 22, 2021 at 10:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > While scanning for assertion failures on the build farm that don't
> > appear to have been discussed, I found this[1] in
> > 010_truncate_publisher.log on the 13 branch:
> >
> > TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File:
> > "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c",
> > Line: 321)
>
> I guess this could be similar to what we see at:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-06-15%2020%3A49%3A26
>
> We have discussed this in another thread at:
> https://www.postgresql.org/message-id/648020.1623854904%40sss.pgh.pa.us
>
> The reason why I think it is the same is that assertion failure shown
> in the report is from function FreeTupleDesc() which we can call from
> pgoutput.c while processing the invalidation. Ideally, we shouldn't
> call invalidation before initializing the tuple conversion map for
> partitions but in some rare cases, that was happening which we have
> fixed in commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=357cb8f07f95665ea533ff534821c22c35b01288).
>
> I see this report is from 16th June 2021 and the commit is on 18th
> June 2021. So, I am hoping this should have been fixed but if we see
> it again then probably we need to investigate it further.

Ahh, that makes sense.  Thanks for checking, and sorry for the noise.



Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I see this report is from 16th June 2021 and the commit is on 18th
>> June 2021. So, I am hoping this should have been fixed but if we see
>> it again then probably we need to investigate it further.

> Ahh, that makes sense.  Thanks for checking, and sorry for the noise.

BTW, the reason that the walsender is still showing its "query" as
"SELECT pg_config..." is that pre-v14 versions don't update the
reported query for replication commands, only plain-SQL commands.
I recall that we fixed that in HEAD awhile ago; should we consider
back-patching something for it?  It seems quite confusing.

            regards, tom lane



Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Amit Kapila
Date:
On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I see this report is from 16th June 2021 and the commit is on 18th
> >> June 2021. So, I am hoping this should have been fixed but if we see
> >> it again then probably we need to investigate it further.
>
> > Ahh, that makes sense.  Thanks for checking, and sorry for the noise.
>
> BTW, the reason that the walsender is still showing its "query" as
> "SELECT pg_config..." is that pre-v14 versions don't update the
> reported query for replication commands, only plain-SQL commands.
> I recall that we fixed that in HEAD awhile ago; should we consider
> back-patching something for it?
>

I think it would be great if we can do that. Analyzing such failures
and in general for replication errors that will be a nice improvement
and make the jobs of many people a bit easier.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, the reason that the walsender is still showing its "query" as
>> "SELECT pg_config..." is that pre-v14 versions don't update the
>> reported query for replication commands, only plain-SQL commands.
>> I recall that we fixed that in HEAD awhile ago; should we consider
>> back-patching something for it?

> I think it would be great if we can do that. Analyzing such failures
> and in general for replication errors that will be a nice improvement
> and make the jobs of many people a bit easier.

Checking the git history, this was fixed in f560209c6, which also
included some other mostly-cosmetic cleanup.  I'm inclined to
propose back-patching that whole commit, rather than allowing the
code in exec_replication_command() to diverge in different branches.
It looks like it applies cleanly as far back as v10.  Maybe something
could be done for 9.6 as well, but since that branch is so close to
EOL, I doubt it's worth spending extra effort on it.

            regards, tom lane



Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Amit Kapila
Date:
On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, the reason that the walsender is still showing its "query" as
> >> "SELECT pg_config..." is that pre-v14 versions don't update the
> >> reported query for replication commands, only plain-SQL commands.
> >> I recall that we fixed that in HEAD awhile ago; should we consider
> >> back-patching something for it?
>
> > I think it would be great if we can do that. Analyzing such failures
> > and in general for replication errors that will be a nice improvement
> > and make the jobs of many people a bit easier.
>
> Checking the git history, this was fixed in f560209c6, which also
> included some other mostly-cosmetic cleanup.  I'm inclined to
> propose back-patching that whole commit, rather than allowing the
> code in exec_replication_command() to diverge in different branches.
> It looks like it applies cleanly as far back as v10.  Maybe something
> could be done for 9.6 as well, but since that branch is so close to
> EOL, I doubt it's worth spending extra effort on it.
>

+1.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Checking the git history, this was fixed in f560209c6, which also
>> included some other mostly-cosmetic cleanup.  I'm inclined to
>> propose back-patching that whole commit, rather than allowing the
>> code in exec_replication_command() to diverge in different branches.
>> It looks like it applies cleanly as far back as v10.  Maybe something
>> could be done for 9.6 as well, but since that branch is so close to
>> EOL, I doubt it's worth spending extra effort on it.

> +1.

Done that way.

            regards, tom lane



Re: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

From
Amit Kapila
Date:
On Fri, Jun 25, 2021 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Checking the git history, this was fixed in f560209c6, which also
> >> included some other mostly-cosmetic cleanup.  I'm inclined to
> >> propose back-patching that whole commit, rather than allowing the
> >> code in exec_replication_command() to diverge in different branches.
> >> It looks like it applies cleanly as far back as v10.  Maybe something
> >> could be done for 9.6 as well, but since that branch is so close to
> >> EOL, I doubt it's worth spending extra effort on it.
>
> > +1.
>
> Done that way.
>

Thanks!

-- 
With Regards,
Amit Kapila.