Re: segmentation fault when cassert enabled - Mailing list pgsql-hackers
From:           Jehan-Guillaume de Rorthais
Subject:        Re: segmentation fault when cassert enabled
Date:
Msg-id:         20191216164000.02f06490@firost
In response to: Re: segmentation fault when cassert enabled (vignesh C <vignesh21@gmail.com>)
List:           pgsql-hackers
On Fri, 13 Dec 2019 12:10:07 +0530
vignesh C <vignesh21@gmail.com> wrote:

> On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> > >
> > > On Wed, 6 Nov 2019 14:34:38 +0100
> > > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > >
> > > > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:
> > > > > My best bet so far is that logicalrep_relmap_invalidate_cb is not
> > > > > called after the DDL on the subscriber, so the relmap cache is not
> > > > > invalidated. So we end up with slot->tts_tupleDescriptor->natts
> > > > > greater than rel->remoterel->natts in slot_store_cstrings, leading
> > > > > to the overflow on attrmap and the sigsegv.
> > > >
> > > > It looks like something like that is happening. But it shouldn't.
> > > > Different table schemas on publisher and subscriber are well
> > > > supported, so this must be an edge case of some kind. Please continue
> > > > investigating.
> > >
> > > I've been able to find the origin of the crash, but it was a long
> > > journey.
> > >
> > > <debugger hard life>
> > >
> > > I was unable to debug using gdb record because of this famous error:
> > >
> > >   Process record does not support instruction 0xc5 at address
> > >   0x1482758a4b30.
> > >
> > >   Program stopped.
> > >   __memset_avx2_unaligned_erms ()
> > >     at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
> > >   168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
> > >   file or directory.
> > >
> > > Trying with rr, I had constant "stack depth limit exceeded" errors,
> > > even with the limit set to unlimited. Is it worth opening a discussion
> > > or a wiki page about these tools? Peter, it looks like you have some
> > > experience with rr, any feedback?
> > >
> > > Finally, Julien Rouhaud spent some time with me after work hours,
> > > answering my questions about some parts of the code, and pointed me to
> > > the excellent backtrace_functions GUC added a few days ago. This
> > > finally did the trick to find out what was happening. Many thanks
> > > Julien!
> > >
> > > </debugger hard life>
> > >
> > > Back to the bug itself. Consider a working logical replication setup
> > > with constant update/insert activity, e.g. pgbench running against the
> > > provider.
> > >
> > > Now, on the subscriber side, a session issues an "ALTER TABLE ADD
> > > COLUMN" on a subscribed table, e.g. pgbench_branches. A cache
> > > invalidation message is then pending for this table.
> > >
> > > In the meantime, the logical replication worker receives an UPDATE to
> > > apply. It opens the local relation using "logicalrep_rel_open". It
> > > finds that the related entry in LogicalRepRelMap is valid, so it does
> > > not update its attrmap and directly opens and locks the local relation:
> > >
> > > -   /* Try to find and lock the relation by name. */
> > > +   /* Try to find the relation by name */
> > >     relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
> > >                                           remoterel->relname, -1),
> > > -                            lockmode, true);
> > > +                            NoLock, true);
> >
> > I think we can't do this because it could lead to locking the wrong
> > reloid. See RangeVarGetRelidExtended. It ensures that, after locking
> > the relation (which includes accepting invalidation messages), the
> > reloid is correct. I think changing the code in the way you are
> > suggesting can lead to locking an incorrect reloid.

Sorry for the delay, I couldn't answer earlier.

To be honest, I was wondering about that. Since we keep the relid in the
cache and use it for cache invalidation, I thought it might be fragile. But
then, as far as I could find, the only way to change the relid is to drop
and recreate the table. I wasn't sure it could really cause a race condition
there, given the impact of such commands on logical replication. But now I
realize I should have gone all the way through and closed this potential bug
as well. Thank you.
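To make this concrete for the archives, here is the lock-then-recheck idea
as I understand it, written as a small standalone sketch. It is only a
paraphrase of the pattern, not the actual RangeVarGetRelidExtended code, and
every name in it (lookup_relid_by_name, lock_relation, the fake catalog) is
made up for the illustration:

/*
 * Standalone sketch of the "lock, then re-check the OID" idea.
 * Everything here (the fake catalog, lookup_relid_by_name, lock_relation)
 * is made up for illustration; it is not PostgreSQL code.
 */
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* fake catalog: the OID the name currently resolves to */
static Oid current_oid_for_name = 16384;

static Oid
lookup_relid_by_name(const char *name)
{
    (void) name;                /* single hard-coded name in this sketch */
    return current_oid_for_name;
}

static void
lock_relation(Oid relid)
{
    /*
     * Acquiring the lock is the point where pending invalidation messages
     * get processed.  Simulate a concurrent DROP + CREATE happening while
     * we waited for the first lock: the name now points to a new OID.
     */
    static int  first_call = 1;

    printf("locking OID %u\n", relid);
    if (first_call)
    {
        first_call = 0;
        current_oid_for_name = 16999;
    }
}

static void
unlock_relation(Oid relid)
{
    printf("unlocking OID %u\n", relid);
}

int
main(void)
{
    Oid         relid;
    Oid         old_relid = InvalidOid;

    /* Loop until the OID we hold a lock on is still what the name maps to. */
    for (;;)
    {
        relid = lookup_relid_by_name("pgbench_branches");
        if (relid == old_relid)
            break;              /* lock already held on the right relation */
        if (old_relid != InvalidOid)
            unlock_relation(old_relid);
        lock_relation(relid);
        old_relid = relid;
    }
    printf("using OID %u\n", relid);
    return 0;
}

The point, as I read it, is that the name lookup is repeated once the lock
is granted, because waiting for the lock is what forces pending invalidation
messages to be processed, so the OID finally used still matches the name.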
> I have made changes to fix the comment provided. The patch for the same is
> attached. I could not add a test case for this scenario since it is based
> on a timing issue.

Thank you for this fix, Vignesh!
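PS: for anyone digging this thread out of the archives, the failure mode
boils down to the loop below. It is deliberately simplified (made-up names
and sizes, not the actual slot_store_cstrings code); the point is only that
an attribute map sized for the old column count cannot be indexed with the
post-DDL natts:

/*
 * Standalone illustration of the overrun: the attribute map was built when
 * the local table had three columns, but after ALTER TABLE ... ADD COLUMN
 * the local tuple descriptor reports four.  Names and sizes are made up;
 * this is not the actual slot_store_cstrings code.
 */
#include <stdio.h>

int
main(void)
{
    int     attrmap[3] = {0, 1, 2}; /* built for the old, 3-column layout */
    int     map_natts = 3;          /* what the stale cache entry remembers */
    int     local_natts = 4;        /* tuple descriptor natts after the DDL */

    for (int i = 0; i < local_natts; i++)
    {
        if (i >= map_natts)
        {
            /* in the real code this index would run past the end of the map */
            printf("attnum %d: no attrmap entry, out-of-bounds access\n", i);
            continue;
        }
        printf("local attnum %d maps to remote attnum %d\n", i, attrmap[i]);
    }
    return 0;
}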