Thread: subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE
Hi, While scanning for assertion failures on the build farm that don't appear to have been discussed, I found this[1] in 010_truncate_publisher.log on the 13 branch: TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File: "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c", Line: 321) 2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:4] LOG: server process (PID 1014658) was terminated by signal 11: Segmentation fault 2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false); The last thing the segfaulting process said was: 2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:8] LOG: logical decoding found consistent point at 0/157D538 2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:9] DETAIL: There are no running transactions. Unfortunately 13 doesn't log PIDs for assertion failures (hmm, commit 18c170a08ee could be back-patched?) so it's not clear which process that was, and there's no backtrace. I don't know if "pg_catalog.set_config('search_path', '', false)" is a clue that this is related to another recent crash[2] I mentioned, also from the 13 branch. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-16%2023:58:47 [2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BqdF6QE6rcj_Zj5h2qVARM--%2B92sqdmr-0DUSM_0Qu_g%40mail.gmail.com
On Tue, Jun 22, 2021 at 10:33 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > While scanning for assertion failures on the build farm that don't > appear to have been discussed, I found this[1] in > 010_truncate_publisher.log on the 13 branch: > > TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File: > "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c", > Line: 321) > I guess this could be similar to what we see at: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-06-15%2020%3A49%3A26 We have discussed this in another thread at: https://www.postgresql.org/message-id/648020.1623854904%40sss.pgh.pa.us The reason why I think it is the same is that assertion failure shown in the report is from function FreeTupleDesc() which we can call from pgoutput.c while processing the invalidation. Ideally, we shouldn't call invalidation before initializing the tuple conversion map for partitions but in some rare cases, that was happening which we have fixed in commit (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=357cb8f07f95665ea533ff534821c22c35b01288). I see this report is from 16th June 2021 and the commit is on 18th June 2021. So, I am hoping this should have been fixed but if we see it again then probably we need to investigate it further. -- With Regards, Amit Kapila.
On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jun 22, 2021 at 10:33 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > While scanning for assertion failures on the build farm that don't > > appear to have been discussed, I found this[1] in > > 010_truncate_publisher.log on the 13 branch: > > > > TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File: > > "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c", > > Line: 321) > > I guess this could be similar to what we see at: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-06-15%2020%3A49%3A26 > > We have discussed this in another thread at: > https://www.postgresql.org/message-id/648020.1623854904%40sss.pgh.pa.us > > The reason why I think it is the same is that assertion failure shown > in the report is from function FreeTupleDesc() which we can call from > pgoutput.c while processing the invalidation. Ideally, we shouldn't > call invalidation before initializing the tuple conversion map for > partitions but in some rare cases, that was happening which we have > fixed in commit > (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=357cb8f07f95665ea533ff534821c22c35b01288). > > I see this report is from 16th June 2021 and the commit is on 18th > June 2021. So, I am hoping this should have been fixed but if we see > it again then probably we need to investigate it further. Ahh, that makes sense. Thanks for checking, and sorry for the noise.
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> I see this report is from 16th June 2021 and the commit is on 18th >> June 2021. So, I am hoping this should have been fixed but if we see >> it again then probably we need to investigate it further. > Ahh, that makes sense. Thanks for checking, and sorry for the noise. BTW, the reason that the walsender is still showing its "query" as "SELECT pg_config..." is that pre-v14 versions don't update the reported query for replication commands, only plain-SQL commands. I recall that we fixed that in HEAD awhile ago; should we consider back-patching something for it? It seems quite confusing. regards, tom lane
On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > On Tue, Jun 22, 2021 at 10:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> I see this report is from 16th June 2021 and the commit is on 18th > >> June 2021. So, I am hoping this should have been fixed but if we see > >> it again then probably we need to investigate it further. > > > Ahh, that makes sense. Thanks for checking, and sorry for the noise. > > BTW, the reason that the walsender is still showing its "query" as > "SELECT pg_config..." is that pre-v14 versions don't update the > reported query for replication commands, only plain-SQL commands. > I recall that we fixed that in HEAD awhile ago; should we consider > back-patching something for it? > I think it would be great if we can do that. Analyzing such failures and in general for replication errors that will be a nice improvement and make the jobs of many people a bit easier. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, the reason that the walsender is still showing its "query" as >> "SELECT pg_config..." is that pre-v14 versions don't update the >> reported query for replication commands, only plain-SQL commands. >> I recall that we fixed that in HEAD awhile ago; should we consider >> back-patching something for it? > I think it would be great if we can do that. Analyzing such failures > and in general for replication errors that will be a nice improvement > and make the jobs of many people a bit easier. Checking the git history, this was fixed in f560209c6, which also included some other mostly-cosmetic cleanup. I'm inclined to propose back-patching that whole commit, rather than allowing the code in exec_replication_command() to diverge in different branches. It looks like it applies cleanly as far back as v10. Maybe something could be done for 9.6 as well, but since that branch is so close to EOL, I doubt it's worth spending extra effort on it. regards, tom lane
On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Tue, Jun 22, 2021 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> BTW, the reason that the walsender is still showing its "query" as > >> "SELECT pg_config..." is that pre-v14 versions don't update the > >> reported query for replication commands, only plain-SQL commands. > >> I recall that we fixed that in HEAD awhile ago; should we consider > >> back-patching something for it? > > > I think it would be great if we can do that. Analyzing such failures > > and in general for replication errors that will be a nice improvement > > and make the jobs of many people a bit easier. > > Checking the git history, this was fixed in f560209c6, which also > included some other mostly-cosmetic cleanup. I'm inclined to > propose back-patching that whole commit, rather than allowing the > code in exec_replication_command() to diverge in different branches. > It looks like it applies cleanly as far back as v10. Maybe something > could be done for 9.6 as well, but since that branch is so close to > EOL, I doubt it's worth spending extra effort on it. > +1. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Checking the git history, this was fixed in f560209c6, which also >> included some other mostly-cosmetic cleanup. I'm inclined to >> propose back-patching that whole commit, rather than allowing the >> code in exec_replication_command() to diverge in different branches. >> It looks like it applies cleanly as far back as v10. Maybe something >> could be done for 9.6 as well, but since that branch is so close to >> EOL, I doubt it's worth spending extra effort on it. > +1. Done that way. regards, tom lane
On Fri, Jun 25, 2021 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Thu, Jun 24, 2021 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Checking the git history, this was fixed in f560209c6, which also > >> included some other mostly-cosmetic cleanup. I'm inclined to > >> propose back-patching that whole commit, rather than allowing the > >> code in exec_replication_command() to diverge in different branches. > >> It looks like it applies cleanly as far back as v10. Maybe something > >> could be done for 9.6 as well, but since that branch is so close to > >> EOL, I doubt it's worth spending extra effort on it. > > > +1. > > Done that way. > Thanks! -- With Regards, Amit Kapila.