Re: hyrax vs. RelationBuildPartitionDesc - Mailing list pgsql-hackers

From Amit Langote
Subject Re: hyrax vs. RelationBuildPartitionDesc
Date
Msg-id CA+HiwqGZSJS+gB_6EdffMUxPuyJ-H8j9-HFiCWNR9+=Khi0qvw@mail.gmail.com
Whole thread Raw
In response to Re: hyrax vs. RelationBuildPartitionDesc  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: hyrax vs. RelationBuildPartitionDesc
Re: hyrax vs. RelationBuildPartitionDesc
List pgsql-hackers
On Wed, Jun 5, 2019 at 8:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I noticed a crash with one of the scenarios that I think the patch is
> meant to address.  Let me describe the steps:
>
> Attach gdb (or whatever) to session 1 in which we'll run a query that
> will scan at least two partitions and set a breakpoint in
> expand_partitioned_rtentry().  Run the query, so the breakpoint will
> be hit.  Step through up to the start of the following loop in this
> function:
>
>     i = -1;
>     while ((i = bms_next_member(live_parts, i)) >= 0)
>     {
>         Oid         childOID = partdesc->oids[i];
>         Relation    childrel;
>         RangeTblEntry *childrte;
>         Index       childRTindex;
>         RelOptInfo *childrelinfo;
>
>         /* Open rel, acquiring required locks */
>         childrel = table_open(childOID, lockmode);
>
> Note that 'partdesc' in the above code is from the partition
> directory.  Before stepping through into the loop, start another
> session and attach a new partition.  On into the loop.  When the 1st
> partition is opened, its locking will result in
> RelationClearRelation() being called on the parent relation due to
> partition being attached concurrently, which I observed, is actually
> invoked a couple of times due to recursion.  Parent relation's
> rd_oldpdcxt will be set in this process, which contains the
> aforementioned partition descriptor.
>
> Before moving the loop to process the 2nd partition, attach another
> partition in session 2.  When the 2nd partition is opened,
> RelationClearRelation() will run again and in one of its recursive
> invocations, it destroys the rd_oldpdcxt that was set previously,
> taking the partition directory's partition descriptor with it.
> Anything that tries to access it later crashes.
>
> I couldn't figure out what to blame here though -- the design of
> rd_pdoldcxt or the fact that RelationClearRelation() is invoked
> multiple times.  I will try to study it more closely tomorrow.

On further study, I concluded that it's indeed the multiple
invocations of RelationClearRelation() on the parent relation that
causes rd_pdoldcxt to be destroyed prematurely.  While it's
problematic that the session in which a new partition is attached to
the parent relation broadcasts 2 SI messages to invalidate the parent
relation [1], it's obviously better to fix how RelationClearRelation
manipulates rd_pdoldcxt so that duplicate SI messages are not harmful,
fixing the latter is hardly an option.

Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
which seems to fix this issue for me.  In the rare case where inval
messages due to multiple concurrent attachments get processed in a
session holding a reference on a partitioned table, there will be
multiple "old" partition descriptors and corresponding "old" contexts.
All of the old contexts are chained together via context-re-parenting,
with the newest "old" context being accessible from the table's
rd_pdoldcxt.

Thanks,
Amit

[1]: inval.c: AddRelcacheInvalidationMessage() does try to prevent
duplicate messages from being emitted, but the logic to detect
duplicates doesn't work across CommandCounterIncrement().  There are
at two relcache inval requests in ATTACH PARTITION code path, emitted
by SetRelationHasSubclass and StorePartitionBound, resp., which are
separated by at least one CCI, so the 2nd request isn't detected as
the duplicate of the 1st.

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: Should we warn against using too many partitions?
Next
From: Michael Paquier
Date:
Subject: Re: pg_basebackup failure after setting default_table_access_methodoption