Re: Declarative partitioning - another take - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Declarative partitioning - another take |
Date | |
Msg-id | CA+Tgmoakz6BYWvhcb5AJ0TsLGMu3Nb5huKJX+ySw4bdfewB70Q@mail.gmail.com Whole thread Raw |
In response to | Re: Declarative partitioning - another take (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>) |
Responses |
Re: Declarative partitioning - another take
|
List | pgsql-hackers |
On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I am slightly tempted to eliminate the pg_partition catalog and associated > syscache altogether and add a column to pg_class as Robert suggested. > That way, all relid_is_partition() calls will be replaced by > rel->rd_partbound != NULL check. But one potential problem with that > approach is that now whenever a parent relation is opened, all the > partition relations must be opened to get the partbound value (to form the > PartitionDesc to be stored in parent relation's rd_partdesc). Whereas > currently, we just look up the pg_partition catalog (or the associated > cache) for every partition and that gets us the partbound. Well, you could just look up the pg_class row without opening the relation, too. There is a system cache on pg_class.oid, after all. I think the issue is whether it's safe to read either one of those things without a lock on the child relation. If altering the partitioning information for a relation requires holding only AccessExclusiveLock on that relation, and no lock on the parent, then you really can't read the information for any child relation without taking at least AccessShareLock. Otherwise, it might change under you, and that would be bad. I'm inclined to think that changing the partitioning information for a child is going to require AccessExclusiveLock on both the child and the parent. That seems unfortunate from a concurrency point of view, but we may be stuck with it: suppose you require only ShareUpdateExclusiveLock on the parent. Well, then a concurrent read transaction might see the partition boundaries change when it does a relcache rebuild, which would cause it to suddenly start expecting the data to be in a different plan in mid-transaction, perhaps even in mid-scan. Maybe that's survivable with really careful coding, but it seems like it's probably a bad thing. For example, it would mean that the executor would be unable to rely on the partitioning information in the relcache remaining stable underneath it. Moreover, the relcache is always going to be scanned with the most recent possible MVCC snapshot, but the transaction snapshot may be older, so such a system creates all sorts of nasty possibilities for there to be skew between the snapshot being used to via the data and the snapshot being used to read the metadata that says where the data is. This may need some more thought, but if we go with that approach of requiring an AccessExclusiveLock on both parent and child, then it seems to me that maybe we should consider the partitioning information to be a property of the parent rather than the child. Just take all the partitioning information for all children and put it in one big node tree and store it in the pg_class or pg_partition_root entry for the parent as one big ol' varlena. Now you can open the parent and get all of the partitioning information for all of the children without needing any lock on any child, and that's *really* good, because it means that some day we might be able to do partition elimination before locking any of the children! That would be excellent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: