Re: Declarative partitioning - another take - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Declarative partitioning - another take
Date
Msg-id CA+Tgmoakz6BYWvhcb5AJ0TsLGMu3Nb5huKJX+ySw4bdfewB70Q@mail.gmail.com
Whole thread Raw
In response to Re: Declarative partitioning - another take  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: Declarative partitioning - another take
List pgsql-hackers
On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I am slightly tempted to eliminate the pg_partition catalog and associated
> syscache altogether and add a column to pg_class as Robert suggested.
> That way, all relid_is_partition() calls will be replaced by
> rel->rd_partbound != NULL check.  But one potential problem with that
> approach is that now whenever a parent relation is opened, all the
> partition relations must be opened to get the partbound value (to form the
> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
> currently, we just look up the pg_partition catalog (or the associated
> cache) for every partition and that gets us the partbound.

Well, you could just look up the pg_class row without opening the
relation, too.  There is a system cache on pg_class.oid, after all.  I
think the issue is whether it's safe to read either one of those
things without a lock on the child relation.  If altering the
partitioning information for a relation requires holding only
AccessExclusiveLock on that relation, and no lock on the parent, then
you really can't read the information for any child relation without
taking at least AccessShareLock.  Otherwise, it might change under
you, and that would be bad.

I'm inclined to think that changing the partitioning information for a
child is going to require AccessExclusiveLock on both the child and
the parent.  That seems unfortunate from a concurrency point of view,
but we may be stuck with it: suppose you require only
ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
transaction might see the partition boundaries change when it does a
relcache rebuild, which would cause it to suddenly start expecting the
data to be in a different plan in mid-transaction, perhaps even in
mid-scan.  Maybe that's survivable with really careful coding, but it
seems like it's probably a bad thing.  For example, it would mean that
the executor would be unable to rely on the partitioning information
in the relcache remaining stable underneath it.  Moreover, the
relcache is always going to be scanned with the most recent possible
MVCC snapshot, but the transaction snapshot may be older, so such a
system creates all sorts of nasty possibilities for there to be skew
between the snapshot being used to via the data and the snapshot being
used to read the metadata that says where the data is.

This may need some more thought, but if we go with that approach of
requiring an AccessExclusiveLock on both parent and child, then it
seems to me that maybe we should consider the partitioning information
to be a property of the parent rather than the child.  Just take all
the partitioning information for all children and put it in one big
node tree and store it in the pg_class or pg_partition_root entry for
the parent as one big ol' varlena.  Now you can open the parent and
get all of the partitioning information for all of the children
without needing any lock on any child, and that's *really* good,
because it means that some day we might be able to do partition
elimination before locking any of the children!  That would be
excellent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: PATCH: Exclude additional directories in pg_basebackup
Next
From: Jim Nasby
Date:
Subject: Add -c to rsync commands on SR tutorial wiki page