Re: Declarative partitioning - another take - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Declarative partitioning - another take
Date
Msg-id a002d027-66cb-a221-4069-c1ebcc8dfd0e@lab.ntt.co.jp
Whole thread Raw
In response to Re: Declarative partitioning - another take  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Declarative partitioning - another take  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2016/08/18 5:23, Robert Haas wrote:
> On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I am slightly tempted to eliminate the pg_partition catalog and associated
>> syscache altogether and add a column to pg_class as Robert suggested.
>> That way, all relid_is_partition() calls will be replaced by
>> rel->rd_partbound != NULL check.  But one potential problem with that
>> approach is that now whenever a parent relation is opened, all the
>> partition relations must be opened to get the partbound value (to form the
>> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
>> currently, we just look up the pg_partition catalog (or the associated
>> cache) for every partition and that gets us the partbound.
> 
> Well, you could just look up the pg_class row without opening the
> relation, too.  There is a system cache on pg_class.oid, after all.  I

Yes, I somehow didn't think of that.

> think the issue is whether it's safe to read either one of those
> things without a lock on the child relation.  If altering the
> partitioning information for a relation requires holding only
> AccessExclusiveLock on that relation, and no lock on the parent, then
> you really can't read the information for any child relation without
> taking at least AccessShareLock.  Otherwise, it might change under
> you, and that would be bad.

I'd imagine this won't be a problem because we take an AccessExclusiveLock
on the parent when adding/removing a partition.

> I'm inclined to think that changing the partitioning information for a
> child is going to require AccessExclusiveLock on both the child and
> the parent.  That seems unfortunate from a concurrency point of view,
> but we may be stuck with it: suppose you require only
> ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
> transaction might see the partition boundaries change when it does a
> relcache rebuild, which would cause it to suddenly start expecting the
> data to be in a different plan in mid-transaction, perhaps even in
> mid-scan.  Maybe that's survivable with really careful coding, but it
> seems like it's probably a bad thing.  For example, it would mean that
> the executor would be unable to rely on the partitioning information
> in the relcache remaining stable underneath it.  Moreover, the
> relcache is always going to be scanned with the most recent possible
> MVCC snapshot, but the transaction snapshot may be older, so such a
> system creates all sorts of nasty possibilities for there to be skew
> between the snapshot being used to via the data and the snapshot being
> used to read the metadata that says where the data is.

We do take a lock on the parent because we would be changing its partition
descriptor (relcache).  I changed MergeAttributes() such that an
AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the
parent is a partitioned table.

> This may need some more thought, but if we go with that approach of
> requiring an AccessExclusiveLock on both parent and child, then it
> seems to me that maybe we should consider the partitioning information
> to be a property of the parent rather than the child.  Just take all
> the partitioning information for all children and put it in one big
> node tree and store it in the pg_class or pg_partition_root entry for
> the parent as one big ol' varlena.  Now you can open the parent and
> get all of the partitioning information for all of the children
> without needing any lock on any child, and that's *really* good,
> because it means that some day we might be able to do partition
> elimination before locking any of the children!  That would be
> excellent.

If we need an AccessExclusiveLock on parent to add/remove a partition
(IOW, changing that child table's partitioning information), then do we
need to lock the individual partitions when reading partition's
information?  I mean to ask why the simple syscache look-ups to get each
partition's bound wouldn't do.

Thanks,
Amit





pgsql-hackers by date:

Previous
From: Venkata B Nagothi
Date:
Subject: Re: patch proposal
Next
From: Amit Langote
Date:
Subject: Re: Declarative partitioning - another take