Thread: ATTACH/DETACH PARTITION CONCURRENTLY

ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
Hi,

One of the downsides of declarative partitioning vs old school
inheritance partitioning is that a new partition cannot be added to
the partitioned table without taking an AccessExclusiveLock on the
partitioned table.  We've obviously got a bunch of features for
various other things where we work a bit harder to get around that
problem, e.g. creating indexes concurrently.

I've started working on allowing partitions to be attached and
detached with just a ShareUpdateExclusiveLock on the table. If I'm
correct, then we can do this in a similar, but simpler, way to how
CREATE INDEX CONCURRENTLY works. We just need to pencil in that the
new partition exists but is not yet valid, then wait for snapshots
older than our own to finish before marking the partition as valid.

One problem I had with doing this is that there was not really a good
place to store that "isvalid" flag for partitions. We have pg_index
for indexes, but partition details are just spread over pg_inherits
and pg_class. So step 1 was to move all that into a new table called
pg_partition. I think this is quite nice as it also gets relpartbound
out of pg_class. It's probably just a matter of time before someone
complains that they can't create some partition with some pretty
large Datum due to it not being able to fit on a single heap page
(pg_class has no TOAST table).  I ended up getting rid of
pg_class.relispartition, replacing it with relpartitionparent, which
is just InvalidOid when the table or index is not a partition.  This
allows various pieces of code to be more efficient since we can look
at the relcache instead of scanning pg_inherits all the time. It's
now also much faster to get a partition's ancestors.

So, patch 0001 is just one I've already submitted for the July
'fest. Nothing new. It was just required to start this work.

0002 migrates partitions out of pg_inherits into pg_partition. This
patch is at a stage where it appears to work, but is very unpolished
and requires me to stare at it much longer than I've done so far.
There's a bunch of code that gets repeated way too many times in
tablecmds.c, for example.

0003 does the same for partitioned indexes. The patch is in a similar,
maybe slightly worse state than 0002. Various comments will be out of
date.

0004 is the early workings of what I have in mind for the concurrent
ATTACH code. It's vastly incomplete. It does pass make check, but
really only because there are no tests doing any concurrent attaches.
There's a mountain of code missing that ignores invalid partitions. I
just have a very simple case working. Partition-wise joins will be
very much broken by what I have so far, and likely a whole bunch of
other stuff.

The extent of my tests so far is the following:

--setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);

-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 (waits for release of session 1's snapshot)
select * from listp; -- Session 1
 a
---
 1
(1 row)

commit; -- session 1 (session 2's alter table now finishes waiting)
select * from listp; -- Session 1 (new partition now valid)
 a
---
 1
 2
(2 rows)

-- example 2.
start transaction isolation level read committed; -- session 1;
select * from listp; -- session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 completes without waiting.

select * from listp; -- Session 1 (new partition visible while in transaction)
 a
---
 1
 2
(2 rows)

This basically works by the following steps (a rough sketch of the
snapshot wait in step 7 is shown after the list):

1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid
9. Invalidate relcache for the partitioned table.
10. release session-level locks.
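
Roughly, the wait in step 7 would follow the same pattern CREATE INDEX
CONCURRENTLY uses. The sketch below shows only that pattern
(simplified; the real CIC code does more, e.g. re-checks), and exactly
how limitXmin gets plumbed through in the patch is still up in the air:

static void
wait_for_older_snapshots(TransactionId limitXmin)
{
	VirtualTransactionId *old_snapshots;
	int			n_old_snapshots;
	int			i;

	/* Find transactions whose snapshots might be older than ours. */
	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
										  PROC_IN_VACUUM | PROC_IN_ANALYZE,
										  &n_old_snapshots);

	/* Block until each of them has finished. */
	for (i = 0; i < n_old_snapshots; i++)
	{
		if (VirtualTransactionIdIsValid(old_snapshots[i]))
			VirtualXactLock(old_snapshots[i], true);
	}
}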

I've disallowed the feature when the partitioned table has a default
partition. I don't see how this can be made to work.

At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain
any other sub-commands in the ALTER TABLE, so performing the
additional transaction commit and begin inside the single sub-command
might be okay.  It does mean that 'rel' which is passed down to
ATExecAttachPartition() must be closed and reopened again which
results in the calling function having a pointer into a closed
Relation. I worked around this by changing the code so it passes a
pointer to the Relation, and I've got ATExecAttachPartition() updating
that pointer before returning. It's not particularly pretty, but I
didn't really see how else this can be done.

I've not yet done anything about the DETACH CONCURRENTLY case. I think
it should just be the same steps in roughly reverse order.  We can
skip the waiting part if the partition being detached is still marked
as invalid from some failed concurrent ATTACH.

I've not thought much about pg_dump beyond just having it ignore
invalid partitions. I don't think it's very useful to support some
command that attaches an invalid partition, since there will be no
command to revalidate an invalid partition. It's probably best to
resolve that with a DETACH followed by a new ATTACH. So pg_dump can
probably just do nothing for invalid partitions.

So anyway, my intention in posting this patch now, rather than when
it's closer to being finished, is to get design review. I'm interested
in hearing objections, comments, and constructive criticism for
patches 0002-0004. Comments on patch 0001 can go to [1].

Are there any blockers on this that I've overlooked?

[1] https://www.postgresql.org/message-id/CAKJS1f81TpxZ8twugrWCo%3DVDHEkmagxRx7a%2B1z4aaMeQy%3DnA7w%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 3 August 2018 at 01:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 1. Do all the normal partition attach partition validation.
> 2. Insert a record into pg_partition with partisvalid=false
> 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
> 4. Obtain a session-level AccessExclusiveLock on the partition being attached.
> 5. Commit.
> 6. Start a new transaction.
> 7. Wait for snapshots older than our own to be released.
> 8. Mark the partition as valid
> 9. Invalidate relcache for the partitioned table.
> 10. release session-level locks.

So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed.  This snapshot will have the new
partition attached while it's possibly still open which could lead to
non-repeatable reads in a repeatable read transaction.  The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get
around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) \
	(((pd)->is_valid[(i)] && \
	  TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) || \
	 (!(pd)->is_valid[(i)] && \
	  TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))
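
Spelled out as a function with comments, the same logic reads as below.
is_valid/xmin are the arrays in my WIP PartitionDesc, not anything that
exists in core, and note the reliance on GetCurrentTransactionId():

static inline bool
PartitionIsValidSpelledOut(PartitionDesc pd, int i)
{
	TransactionId	xid = GetCurrentTransactionId();

	if (pd->is_valid[i])
	{
		/*
		 * Attached and marked valid: only transactions at least as new
		 * as the attaching transaction should see the partition.
		 */
		return TransactionIdPrecedesOrEquals(pd->xmin[i], xid);
	}
	else
	{
		/*
		 * Marked invalid (e.g. mid-detach): keep the partition visible
		 * to transactions no newer than the one that flipped the flag.
		 */
		return TransactionIdPrecedesOrEquals(xid, pd->xmin[i]);
	}
}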

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false (see the
sketch after this list).
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.
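
Step 2 would just be the usual catalog-update dance, roughly as
sketched below. The names derived from the proposed pg_partition
catalog (PartitionRelationId, Form_pg_partition, a PARTRELID syscache)
are of course not existing symbols:

static void
mark_partition_invalid(Oid partrelid, Oid parentrelid)
{
	Relation	pgpart;
	HeapTuple	tuple;
	Form_pg_partition partform;

	pgpart = heap_open(PartitionRelationId, RowExclusiveLock);

	tuple = SearchSysCacheCopy1(PARTRELID, ObjectIdGetDatum(partrelid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for partition %u", partrelid);

	partform = (Form_pg_partition) GETSTRUCT(tuple);
	partform->partvalid = false;
	CatalogTupleUpdate(pgpart, &tuple->t_self, tuple);

	/* make sure other backends rebuild the parent's PartitionDesc */
	CacheInvalidateRelcacheByRelid(parentrelid);

	heap_close(pgpart, RowExclusiveLock);
}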

Handling a DETACH PARTITION CONCURRENTLY failure after step 3 (above):

1. Make vacuum of a partition check for pg_partition.partvalid =
false; if the xmin of the tuple is old enough, perform a partition
cleanup by doing steps 6+7 above.

A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.

I believe I've got the attach and detach working correctly now, and
also isolation tests that appear to prove it works. I've also written
the failed-detach cleanup code into vacuum. Unusually, since foreign
tables can also be partitions, this required teaching auto-vacuum to
look at foreign tables, but only in the sense of checking for failed
detached partitions. It also requires adding vacuum support for
foreign tables.  It feels a little bit weird to modify auto-vacuum to
look at foreign tables, but I really couldn't see another way to do
this.

I'm now considering whether this all holds together in the event the
pg_partition tuple of an invalid partition becomes frozen. The problem
would be that PartitionIsValid() could return the wrong value due to
TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). This code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible.  However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have had failed detaches cleaned up earlier in the
vacuum process.

Also, we must disallow a DEFAULT partition from being attached to a
partitioned table with a failed DETACH CONCURRENTLY, as it wouldn't be
very clear what the default partition's partition qual should be,
since that is built based on the quals of all attached partitions.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Andres Freund
Date:
Hi,

On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> than an AccessExclusiveLock.
> 2. Do all the normal partition attach partition validation.
> 3. Insert pg_partition record with partvalid = true.
> 4. Invalidate relcache entry for the partitioned table
> 5. Any loops over a partitioned table's PartitionDesc must check
> PartitionIsValid(). This will return true if the current snapshot
> should see the partition or not. The partition is valid if partisvalid
> = true and the xmin precedes or is equal to the current snapshot.

How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.

- Andres


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Simon Riggs
Date:
On 7 August 2018 at 13:47, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I think you may be right in the general case, but ISTM possible to
invalidate/refresh just the list of partitions.

If so, that idea would seem to require some new, as-yet not invented mechanism.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote:
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not.  I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.

I'm open to better ideas, of course.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Andres Freund
Date:
On 2018-08-08 01:23:51 +1200, David Rowley wrote:
> On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> >> than an AccessExclusiveLock.
> >> 2. Do all the normal partition attach partition validation.
> >> 3. Insert pg_partition record with partvalid = true.
> >> 4. Invalidate relcache entry for the partitioned table
> >> 5. Any loops over a partitioned table's PartitionDesc must check
> >> PartitionIsValid(). This will return true if the current snapshot
> >> should see the partition or not. The partition is valid if partisvalid
> >> = true and the xmin precedes or is equal to the current snapshot.
> >
> > How does this protect against other sessions actively using the relcache
> > entry? Currently it is *NOT* safe to receive invalidations for
> > e.g. partitioning contents afaics.
> 
> I'm not proposing that sessions running older snapshots can't see that
> there's a new partition. The code I have uses PartitionIsValid() to
> test if the partition should be visible to the snapshot. The
> PartitionDesc will always contain details for all partitions stored in
> pg_partition whether they're valid to the current snapshot or not.  I
> did it this way as there's no way to invalidate the relcache based on
> a point in transaction, only a point in time.

I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.

You'd need to build infrastructure to prevent that.

One approach would be to make sure that everything relying on
rd_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rd_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrent
handling.

Regards,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote:
> On 2018-08-08 01:23:51 +1200, David Rowley wrote:
>> I'm not proposing that sessions running older snapshots can't see that
>> there's a new partition. The code I have uses PartitionIsValid() to
>> test if the partition should be visible to the snapshot. The
>> PartitionDesc will always contain details for all partitions stored in
>> pg_partition whether they're valid to the current snapshot or not.  I
>> did it this way as there's no way to invalidate the relcache based on
>> a point in transaction, only a point in time.
>
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.
>
> You'd need to build infrastructure to prevent that.
>
> One approach would be to make sure that everything relying on
> rd_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rd_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

I'm not so sure that not freeing the partdesc until the refcount
reaches 0 is safe. As you'd expect, we hold a lock on a partitioned
table between the planner and executor, but the relation is closed and
its refcount returns to 0, which means that when the relation is first
opened in the executor, the updated PartitionDesc is obtained.  A
non-concurrent attach would have been blocked in this case due to the
lock being held by the planner. Instead of using refcount == 0,
perhaps we can release the original partdesc only when there are no
locks held by us on the relation.

It's late here now, so I'll look at that tomorrow.

I've attached what I was playing around with. I think I'll also need
to change RelationGetPartitionDesc() to have it return the original
partdesc, if it's non-NULL.
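
To make the lifetime rule concrete, what's being discussed is roughly
the following; rd_oldpartdescs and free_partition_desc() are made-up
names, and the refcount test could instead be a check that we hold no
lock on the relation, per my concern above:

/* On a relcache rebuild, stash a possibly-referenced partdesc instead
 * of freeing it. */
if (!RelationHasReferenceCountZero(relation) &&
	relation->rd_partdesc != NULL)
	relation->rd_oldpartdescs = lappend(relation->rd_oldpartdescs,
										relation->rd_partdesc);

/* Later, e.g. from RelationClose(), once nothing can hold a pointer: */
static void
ReleaseOldPartitionDescs(Relation relation)
{
	ListCell   *lc;

	if (!RelationHasReferenceCountZero(relation))
		return;

	foreach(lc, relation->rd_oldpartdescs)
		free_partition_desc((PartitionDesc) lfirst(lc));	/* made up */
	list_free(relation->rd_oldpartdescs);
	relation->rd_oldpartdescs = NIL;
}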

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Peter Eisentraut
Date:
On 07/08/2018 15:29, Andres Freund wrote:
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.

The problem is more generally that a relcache invalidation changes all
pointers that might be in use.  So it's currently not safe to trigger a
relcache invalidation (on tables) without some kind of exclusive lock.
One possible solution to this is outlined here:
<https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com>

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Andres Freund
Date:
Hi,

On 2018-08-09 20:57:35 +0200, Peter Eisentraut wrote:
> On 07/08/2018 15:29, Andres Freund wrote:
> > I don't think that solves the problem that an arriving relcache
> > invalidation would trigger a rebuild of rd_partdesc, while it actually
> > is referenced by running code.
> 
> The problem is more generally that a relcache invalidation changes all
> pointers that might be in use.

I don't think that's quite right. We'd already better be OK with
superfluous invals that do not change anything, because there are
already sources of those (just think of vacuum, analyze, relation
extension, whatnot).


> So it's currently not safe to trigger a relcache invalidation (on
> tables) without some kind of exclusive lock.

I don't think that's true in as general a sense as you're stating it.
It's not OK to send relcache invalidations for things that people rely
on and that cannot be updated in-place, because of the dangling
pointer issue etc.

The fact that currently it is not safe to *change* partition related
stuff without an AEL and how to make it safe is precisely what I was
talking about in the thread. It won't be a general solution, but the
infrastructure I'm talking about should get us closer.


> One possible solution to this is outlined here:
> <https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com>

I don't see anything in here that addresses the issue structurally?

Greetings,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote:
> One approach would be to make sure that everything relying on
> rd_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rd_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

This is not a fully baked idea, but I'm wondering if a better way to
do this, instead of having this PartitionIsValid macro to determine if
the partition should be visible to the current transaction ID, we
could, when we invalidate a relcache entry, send along the transaction
ID that it's invalid from.  Other backends, when they process the
invalidation message, could wipe out the cache entry only if their
xid is >= the invalidation's "xmax" value. Otherwise, just tag the
xmax onto the cache somewhere and always check it before using the
cache (perhaps make it part of the RelationIsValid macro).  This would
also require that we move away from SnapshotAny type catalogue scans
in favour of MVCC scans so that backends populating their relcache
build it based on their current xid.  Unless I'm mistaken, it should
not make any difference for all DDL that takes an AEL on the relation,
since there can be no older transactions running when the catalogue is
modified, but for DDL that's not taking an AEL, we could effectively
have an MVCC relcache.

It would need careful thought about how it might affect CREATE INDEX
CONCURRENTLY and all the other DDL that can be performed without an
AEL.

I'm unsure how this would work for the catcache as I've studied that
code in even less detail, but I'm throwing this out there in case
there's some major flaw in this idea, so that I don't go wasting time
looking into it further.

I think the PartitionIsValid idea was not that great as it really
complicates run-time partition pruning since it's quite critical about
partition indexes being the same between the planner and executor.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Sun, Aug 12, 2018 at 9:05 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> This is not a fully baked idea, but I'm wondering if a better way to
> do this, instead of having this PartitionIsValid macro to determine if
> the partition should be visible to the current transaction ID, we
> could, when we invalidate a relcache entry, send along the transaction
> ID that it's invalid from.  Other backends, when they process the
> invalidation message, could wipe out the cache entry only if their
> xid is >= the invalidation's "xmax" value. Otherwise, just tag the
> xmax onto the cache somewhere and always check it before using the
> cache (perhaps make it part of the RelationIsValid macro).

Transactions don't necessarily commit in XID order, so this might be
an optimization to keep older transactions from having to do
unnecessary rebuilds -- which I actually doubt is a major problem, but
maybe I'm wrong -- but you can't rely solely on this as a way of
deciding which transactions will see the effects of some change.  If
transactions 100, 101, and 102 begin in that order, and transaction
101 commits, there's no principled justification for 102 seeing its
effects but 100 not seeing it.

> This would
> also require that we move away from SnapshotAny type catalogue scans
> in favour of MVCC scans so that backends populating their relcache
> build it based on their current xid.

I think this is a somewhat confused analysis.  We don't use
SnapshotAny for catalog scans, and we never have.  We used to use
SnapshotNow, and we now use a current MVCC snapshot.  What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.

I've thought about similar things, but I think there's a pretty deep
can of worms.  For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more.  Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.

It seems to me that overall you're not really focusing on the right
set of issues here.  I think the very first thing we need to worry
about is how we're going to keep the executor from following a bad
pointer and crashing.  Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom.  Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.

Once we keep it from blowing up, the second question is what the
semantics are supposed to be.  It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly.  I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic.  That means that when a new partition is attached,
already-running queries aren't going to scan it.  If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached.  There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it.  Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards.  That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes.  Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.

Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing.  For instance, suppose we're
inserting a long stream of COPY data.  At some point, we attach a new
partition from another session.  If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got.  If so, we could route the tuple
to it.  But all of this looks optional.  If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe.  The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it.  On the flip side, what would
break, really?
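
In control-flow terms, I'm imagining something like the sketch below;
find_partition_for_tuple() and reload_partition_dispatch() are
placeholders for the real tuple-routing machinery in execPartition.c,
not actual functions:

static int
route_tuple_with_retry(PartitionTupleRouting *proute, TupleTableSlot *slot)
{
	int			partidx;

	partidx = find_partition_for_tuple(proute, slot);	/* placeholder */
	if (partidx < 0)
	{
		/* Miss: a suitable partition may have been attached concurrently. */
		reload_partition_dispatch(proute);				/* placeholder */
		partidx = find_partition_for_tuple(proute, slot);
		if (partidx < 0)
			ereport(ERROR,
					(errcode(ERRCODE_CHECK_VIOLATION),
					 errmsg("no partition found for row")));
	}
	return partidx;
}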

Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index.  I don't see that we have the same kind of
problem in this case.  Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies.  And it's not clear to me what other problems we
actually have here.  To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid?  I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.

I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach.  As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine.  If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 14 August 2018 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote:
> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

hmm, I guess for that to work correctly we'd need some way to allow
older snapshots to see the changes if they've not already taken a lock
on the table. If the lock had already been obtained then the ALTER
TABLE to change the type of the column would get blocked by the
existing lock. That kinda blows holes in applying the change only to
snapshots newer than the ATTACH/DETACH's.

> It seems to me that overall you're not really focusing on the right
> set of issues here.  I think the very first thing we need to worry
> about is how we're going to keep the executor from following a bad
> pointer and crashing.  Any time the partition descriptor changes, the
> next relcache rebuild is going to replace rd_partdesc and free the old
> one, but the executor may still have the old pointer cached in a
> structure or local variable; the next attempt to dereference it will
> be looking at freed memory, and kaboom.  Right now, we prevent this by
> not allowing the partition descriptor to be modified while there are
> any queries running against the partition, so while there may be a
> rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
> think that whatever scheme you/we choose here should be tested with a
> combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
> -- one of them doing DDL on the table while the other runs a long
> query.

I did focus on that and did write a patch to solve the issue. After
writing that I discovered another problem: if the PartitionDesc
differed between planning and execution then run-time pruning did the
wrong thing (see find_matching_subplans_recurse). The
PartitionPruneInfo is built assuming the PartitionDesc matches between
planning and execution. I moved on from the dangling pointer issue to
trying to figure out a way to ensure these are the same between
planning and execution.

> Once we keep it from blowing up, the second question is what the
> semantics are supposed to be.  It seems inconceivable to me that the
> set of partitions that an in-progress query will scan can really be
> changed on the fly.  I think we're going to have to rule that if you
> add or remove partitions while a query is running, we're going to scan
> exactly the set we had planned to scan at the beginning of the query;
> anything else would require on-the-fly plan surgery to a degree that
> seems unrealistic.

Trying to do that for in-progress queries would be pretty insane. I'm
not planning on doing anything there.

> That means that when a new partition is attached,
> already-running queries aren't going to scan it.  If they did, we'd
> have big problems, because the transaction snapshot might see rows in
> those tables from an earlier time period during which that table
> wasn't attached.  There's no guarantee that the data at that time
> conformed to the partition constraint, so it would be pretty
> problematic to let users see it.  Conversely, when a partition is
> detached, there may still be scans from existing queries hitting it
> for a fairly arbitrary length of time afterwards.  That may be
> surprising from a locking point of view or something, but it's correct
> as far as MVCC goes.  Any changes made after the DETACH operation
> can't be visible to the snapshot being used for the scan.
>
> Now, what we could try to change on the fly is the set of partitions
> that are used for tuple routing.  For instance, suppose we're
> inserting a long stream of COPY data.  At some point, we attach a new
> partition from another session.  If we then encounter a row that
> doesn't route to any of the partitions that existed at the time the
> query started, we could - instead of immediately failing - go and
> reload the set of partitions that are available for tuple routing and
> see if the new partition which was concurrently added happens to be
> appropriate to the tuple we've got.  If so, we could route the tuple
> to it.  But all of this looks optional.  If new partitions aren't
> available for insert/update tuple routing until the start of the next
> query, that's not a catastrophe.  The reverse direction might be more
> problematic: if a partition is detached, I'm not sure how sensible it
> is to keep routing tuples into it.  On the flip side, what would
> break, really?

Unsure about that; I don't really see what it would buy us, so
presumably you're just noting that this might not be a roadblocking
side-effect. However, I think the PartitionDesc needs to not change
between planning and execution due to run-time pruning requirements,
so if that's the case then what you're saying here is probably not an
issue we need to think about.

> Given the foregoing, I don't see why you need something like
> PartitionIsValid() at all, or why you need an algorithm similar to
> CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
> the case of CREATE INDEX CONCURRENTLY, the issue is that any new
> tuples that get inserted while the index creation is in progress need
> to end up in the index, so you'd better not start building the index
> on the existing tuples until everybody who might insert new tuples
> knows about the index.  I don't see that we have the same kind of
> problem in this case.  Each partition is basically a separate table
> with its own set of indexes; as long as queries don't end up with one
> notion of which tables are relevant and a different notion of which
> indexes are relevant, we shouldn't end up with any table/index
> inconsistencies.  And it's not clear to me what other problems we
> actually have here.  To put it another way, if we've got the notion of
> "isvalid" for a partition, what's the difference between a partition
> that exists but is not yet valid and one that exists and is valid?  I
> can't think of anything, and I have a feeling that you may therefore
> be inventing a lot of infrastructure that is not necessary.

Well, the problem is that you want REPEATABLE READ transactions to be
exactly that. A concurrent attach/detach should not change the output
of a query. I don't know for sure that some isvalid flag is required,
but we do need something to ensure we don't change the results of
queries run inside a repeatable read transaction. I did try to start
moving away from the isvalid flag in favour of having a PartitionDesc
just not change within the same snapshot but you've pointed out a few
problems with what I tried to come up with for that.

> I'm inclined to think that we could drop the name CONCURRENTLY from
> this feature altogether and recast it as work to reduce the lock level
> associated with partition attach/detach.  As long as we have a
> reasonable definition of what the semantics are for already-running
> queries, and clear documentation to go with those semantics, that
> seems fine.  If a particular user finds the concurrent behavior too
> strange, they can always perform the DDL in a transaction that uses
> LOCK TABLE first, removing the concurrency.

I did have similar thoughts but that seems like something to think
about once the semantics are determined, not before.

Thanks for your input on this. I clearly don't have all the answers on
this so your input and thoughts are very valuable.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Aug-13, Robert Haas wrote:

> I think this is a somewhat confused analysis.  We don't use
> SnapshotAny for catalog scans, and we never have.  We used to use
> SnapshotNow, and we now use a current MVCC snapshot.  What you're
> talking about, I think, is possibly using the transaction snapshot
> rather than a current MVCC snapshot for the catalog scans.
> 
> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

I wonder if this all stems from a misunderstanding of what I suggested
to David offlist.  My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not.  So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry.  But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.
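
In code, the test being described might look roughly like the
following; xpart is the hypothetical per-partition field, and this
glosses over whether the creating transaction actually committed
(assuming XidInMVCCSnapshot() can be used here):

static bool
partition_visible_to_snapshot(TransactionId xpart, Snapshot snapshot)
{
	if (!TransactionIdIsValid(xpart))
		return true;			/* partition predates this scheme */
	if (TransactionIdIsCurrentTransactionId(xpart))
		return true;			/* we attached it ourselves */

	/* Visible iff the attaching transaction is not in progress for us. */
	return !XidInMVCCSnapshot(xpart, snapshot);
}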

I think that works fine for ATTACH without any further changes.  I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction).  ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash).  I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything.  Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone).  Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I wonder if this all stems from a misunderstanding of what I suggested
> to David offlist.  My suggestion was that the catalog scans would
> continue to use the catalog MVCC snapshot, and that the relcache entries
> would contain all the partitions that appear to the catalog; but each
> partition's entry would carry the Xid of the creating transaction in a
> field (say xpart), and that field is compared to the regular transaction
> snapshot: if xpart is visible to the transaction snapshot, then the
> partition is visible, otherwise not.  So you never try to access a
> partition that doesn't exist, because those just don't appear at all in
> the relcache entry.  But if you have an old transaction running with an
> old snapshot, and the partitioned table just acquired a new partition,
> then whether the partition will be returned as part of the partition
> descriptor or not depends on the visibility of its entry.

Hmm.  One question is where you're going to get the XID of the
creating transaction.  If it's taken from the pg_class row or the
pg_inherits row or something of that sort, then you risk getting a
bogus value if something updates that row other than what you expect
-- and the consequences of that are pretty bad here; for this to work
as you intend, you need an exactly-correct value, not newer or older.
An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.

A further problem is that there could be multiple changes in quick
succession.  Suppose that a partition is attached, then detached
before the attach operation is all-visible, then reattached, perhaps
with different partition bounds.

> I think that works fine for ATTACH without any further changes.  I'm not
> so sure about DETACH, particularly when snapshots persist for a "long
> time" (a repeatable-read transaction).  ISTM that in the above design,
> the partition descriptor would lose the entry for the detached partition
> ahead of time, which means queries would silently fail to see their data
> (though they wouldn't crash).

I don't see why they wouldn't crash.  If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.

> I first thought this could be fixed by
> waiting for those snapshots to finish, but then I realized that there's
> no actual place where waiting achieves anything.  Certainly it's not
> useful to wait before commit (because other snapshots are going to be
> starting all the time), and it's not useful to start after the commit
> (because by then the catalog tuple is already gone).  Maybe we need two
> transactions: mark partition as removed with an xmax of sorts, commit,
> wait for all snapshots, start transaction, remove partition catalog
> tuple, commit.

And what would that accomplish, exactly?  Waiting for all snapshots
would ensure that all still-running transactions see the xmax with
which the partition has been marked as removed, but what good
does that do?  In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do and how those things interact.  Otherwise,
it's like saying that we're going to solve a problem with X and Y
overlapping by having X take a lock.  If Y doesn't take a conflicting
lock, this does nothing.

Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions.  I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement.  We
have no other DDL that has such semantics, and there's no reason we
couldn't; for example, TRUNCATE could work with SUEL and transactions
that can't see the TRUNCATE as committed continue to operate on the
old heap.  While we could do such things, we don't.  If you decide to
do them here, you've probably got a lot of work ahead of you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 21 August 2018 at 13:59, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I wonder if this all stems from a misunderstanding of what I suggested
>> to David offlist.  My suggestion was that the catalog scans would
>> continue to use the catalog MVCC snapshot, and that the relcache entries
>> would contain all the partitions that appear to the catalog; but each
>> partition's entry would carry the Xid of the creating transaction in a
>> field (say xpart), and that field is compared to the regular transaction
>> snapshot: if xpart is visible to the transaction snapshot, then the
>> partition is visible, otherwise not.  So you never try to access a
>> partition that doesn't exist, because those just don't appear at all in
>> the relcache entry.  But if you have an old transaction running with an
>> old snapshot, and the partitioned table just acquired a new partition,
>> then whether the partition will be returned as part of the partition
>> descriptor or not depends on the visibility of its entry.
>
> Hmm.  One question is where you're going to get the XID of the
> creating transaction.  If it's taken from the pg_class row or the
> pg_inherits row or something of that sort, then you risk getting a
> bogus value if something updates that row other than what you expect
> -- and the consequences of that are pretty bad here; for this to work
> as you intend, you need an exactly-correct value, not newer or older.
> An alternative is to add an xid field that stores the value
> explicitly, and that might work, but you'll have to arrange for that
> value to be frozen at the appropriate time.
>
> A further problem is that there could be multiple changes in quick
> succession.  Suppose that a partition is attached, then detached
> before the attach operation is all-visible, then reattached, perhaps
> with different partition bounds.

I should probably post the WIP patches I have here.  In those, I do have the
xmin array in the PartitionDesc. This gets taken from the new
pg_partition table, which I don't think suffers from the same issue as
taking it from pg_class, since nothing else will update the
pg_partition record.

However, I don't think the xmin array is going to work if we include
it in the PartitionDesc.  The problem, as I discovered from writing
the code, is that the PartitionDesc must remain exactly the same
between planning and execution. If there are any more or any fewer
partitions found during execution than what we saw in planning, then
run-time pruning will access the wrong element in the
PartitionPruneInfo array, or perhaps access off the end of the array.
It might be possible to work around that by identifying partitions by
Oid rather than by PartitionDesc array index, but the run-time pruning
code is already pretty complex. I think coding it to work when the
PartitionDesc does not match between planning and execution is just
going to be too difficult to get right.  Tom is already unhappy with
the complexity of ExecFindInitialMatchingSubPlans().

I think the solution will require that the PartitionDesc does not:

a) Change between planning and execution.
b) Change during a snapshot after the partitioned table has been locked.

With b, it sounds like we'll need to take the most recent
PartitionDesc even if the transaction is older than the one that did
the ATTACH/DETACH operation, since if we use an old version then, as
Robert mentions, there's nothing to stop another transaction making
changes to the table that make it an incompatible partition, e.g. DROP
COLUMN.  This wouldn't be possible if we update the PartitionDesc
right after taking the first lock on the partitioned table, since any
transaction doing DROP COLUMN would be blocked until the other
snapshot gets released.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Hello

Here's my take on this feature, building on David Rowley's version.

Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
from the syntax.  We just do it that way always.  When there's a default
partition, only that partition is locked with an AEL; all the rest is
locked with ShareUpdateExclusive only.

I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The Partition descriptor cache module (partcache.c) now contains a
long-lived hash table that lists all the current partition descriptors;
when an invalidation message is received for a relation, we unlink the
partdesc from the hash table *but do not free it*.  The hash
table-linked partdesc is rebuilt again in the future, when requested, so
many copies might exist in memory for one partitioned table.

2. Snapshots have their own cache (hash table) of partition descriptors.
If a partdesc is requested and the snapshot has already obtained that
partdesc, the original one is returned -- we don't request a new one
from partcache.
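
A loose sketch of idea 2, just to show the shape of the lookup; the
hash table hanging off Snapshot and the entry struct are assumptions,
not the actual API in the patch:

typedef struct SnapPartDescEntry
{
	Oid				relid;		/* hash key */
	PartitionDesc	partdesc;
} SnapPartDescEntry;

static PartitionDesc
snapshot_get_partdesc(Snapshot snapshot, Relation rel)
{
	SnapPartDescEntry *entry;
	bool		found;
	Oid			relid = RelationGetRelid(rel);

	if (snapshot->partdesc_cache == NULL)	/* hypothetical field */
	{
		HASHCTL		ctl;

		memset(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(Oid);
		ctl.entrysize = sizeof(SnapPartDescEntry);
		snapshot->partdesc_cache = hash_create("snapshot partdescs", 8,
											   &ctl, HASH_ELEM | HASH_BLOBS);
	}

	entry = hash_search(snapshot->partdesc_cache, &relid, HASH_ENTER, &found);
	if (!found)
		entry->partdesc = RelationGetPartitionDesc(rel);	/* first use */
	return entry->partdesc;
}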

Then there are a few other implementation details worth mentioning:

3. parallel query: when a worker starts on a snapshot that has a
partition descriptor cache, we need to transmit those partdescs from
leader via shmem ... but we cannot send the full struct, so we just send
the OID list of partitions, then rebuild the descriptor in the worker.
Side effect: if a partition is detached right between the leader taking
the partdesc and the worker starting, the partition loses its
relpartbound column, so it's not possible to reconstruct the partdesc.
In this case, we raise an error.  Hopefully this should be rare.
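
For reference, the check that raises that error in the worker could be
as simple as the following sketch (function name made up; the catalog
access itself is the ordinary syscache route):

static void
ensure_still_attached(Oid partoid)
{
	HeapTuple	tuple;
	bool		isnull;

	tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(partoid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u", partoid);

	/* a detached partition has a null pg_class.relpartbound */
	(void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, &isnull);
	ReleaseSysCache(tuple);

	if (isnull)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("relation \"%s\" was detached concurrently",
						get_rel_name(partoid))));
}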

4. If a partitioned table is dropped, but was listed in a snapshot's
partdesc cache, and then parallel query starts, the worker will try to
restore the partdesc for that table, but there are no catalog rows for
it.  The implementation choice here is to ignore the table and move on.
I would like to just remove the partdesc from the snapshot, but that
would require a relcache inval callback, and a) it'd kill us to scan all
snapshots for every relation drop; b) it doesn't work anyway because we
don't have any way to distinguish invals arriving because of DROP from
invals arriving because of anything else, say ANALYZE.

5. snapshots are copied a lot.  Copies share the same hash table as the
"original", because surely all copies should see the same partition
descriptor.  This leads to the pinning/unpinning business you see for
the structs in snapmgr.c.

Some known defects:

6. this still leaks memory.  Not as terribly as my earlier prototypes,
but clearly it's something that I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c
to prevent future memory leak mistakes.  Not done yet.  Closely related
to item 6.

8. Header changes may need some cleanup yet -- eg. I'm not sure
snapmgr.h compiles standalone.

9. David Rowley recently pointed out that we can modify
CREATE TABLE ..  PARTITION OF to likewise not obtain AEL anymore.
Apparently it just requires removal of three lines in MergeAttributes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Oct 25, 2018 at 4:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
> from the syntax.  We just do it that way always.  When there's a default
> partition, only that partition is locked with an AEL; all the rest is
> locked with ShareUpdateExclusive only.

Check.

> Then there are a few other implementation details worth mentioning:
>
> 3. parallel query: when a worker starts on a snapshot that has a
> partition descriptor cache, we need to transmit those partdescs from
> leader via shmem ... but we cannot send the full struct, so we just send
> the OID list of partitions, then rebuild the descriptor in the worker.
> Side effect: if a partition is detached right between the leader taking
> the partdesc and the worker starting, the partition loses its
> relpartbound column, so it's not possible to reconstruct the partdesc.
> In this case, we raise an error.  Hopefully this should be rare.

I don't think it's a good idea for parallel query to just randomly
fail in cases where a non-parallel query would have worked.  I tried
pretty hard to avoid that while working on the feature, and it would
be a shame to see that work undone.

It strikes me that it would be a good idea to break this work into two
phases.  In phase 1, let's support ATTACH and CREATE TABLE ..
PARTITION OF without requiring AccessExclusiveLock.  In phase 2, think
about concurrency for DETACH (and possibly DROP).

I suspect phase 1 actually isn't that hard.  It seems to me that the
only thing we REALLY need to ensure is that the executor doesn't blow
up if a relcache reload occurs.  There are probably a few different
approaches to that problem, but I think it basically boils down to (1)
making sure that the executor is holding onto pointers to the exact
objects it wants to use and not re-finding them through the relcache
and (2) making sure that the relcache doesn't free and rebuild those
objects but rather holds onto the existing copies.  With this
approach, already-running queries won't take into account the fact
that new partitions have been added, but that seems at least tolerable
and perhaps desirable.

For phase 2, we're not just talking about adding stuff that need not
be used immediately, but about removing stuff which may already be in
use.  Your email doesn't seem to describe what we want the *behavior*
to be in that case.  Leave aside for a moment the issue of not
crashing: what are the desired semantics?  I think it would be pretty
strange if you had a COPY running targeting a partitioned table,
detached a partition, and the COPY continued to route tuples to the
detached partition even though it was now an independent table.  It
also seems pretty strange if the tuples just get thrown away.  If the
COPY isn't trying to send any tuples to the now-detached partition,
then it's fine, but if it is, then I have trouble seeing any behavior
other than an error as sane, unless perhaps a new partition has been
attached or created for that part of the key space.

If you adopt that proposal, then the problem of parallel query
behaving differently from non-parallel query goes away.  You just get
an error in both cases, probably to the effect that there is no (or no
longer any) partition matching the tuple you are trying to insert (or
update).

If you're not hacking on this patch set too actively right at the
moment, I'd like to spend some time hacking on the CREATE/ATTACH side
of things and see if I can produce something committable for that
portion of the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Nov-06, Robert Haas wrote:

> If you're not hacking on this patch set too actively right at the
> moment, I'd like to spend some time hacking on the CREATE/ATTACH side
> of things and see if I can produce something committable for that
> portion of the problem.

I'm not -- feel free to hack away.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Simon Riggs
Date:
On Tue, 6 Nov 2018 at 10:10, Robert Haas <robertmhaas@gmail.com> wrote:
 
> With this
> approach, already-running queries won't take into account the fact
> that new partitions have been added, but that seems at least tolerable
> and perhaps desirable.

Desirable, imho. No data added after a query starts would be visible.
 
> If the
> COPY isn't trying to send any tuples to the now-detached partition,
> then it's fine, but if it is, then I have trouble seeing any behavior
> other than an error as sane, unless perhaps a new partition has been
> attached or created for that part of the key space.

Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.


Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a
> table-owner level action. People normally drop partitions to save space, so it could be annoying if that was
> interrupted.

Yeah, the COPY.

> Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the
> ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.

In my view, it's not just a question of it being annoying, but of
whether anything else is even sensible.  I mean, you can avoid an
error when a user types SELECT 1/0 by returning NULL or 42, but that's
not usually how we roll around here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Simon Riggs
Date:
On Tue, 6 Nov 2018 at 10:56, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> > Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.
>
> Yeah, the COPY.
>
> > Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.
>
> In my view, it's not just a question of it being annoying, but of
> whether anything else is even sensible.  I mean, you can avoid an
> error when a user types SELECT 1/0 by returning NULL or 42, but that's
> not usually how we roll around here.

If you can remove the ERROR without any other adverse effects, that sounds great.

Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> If you can remove the ERROR without any other adverse effects, that sounds great.
>
> Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks

Well, I've already written about this in two previous emails on this
thread, so I'm not sure exactly what you think is missing.  But to
state the problem again:

If you don't throw an error when a partition is concurrently detached
and then someone routes a tuple to that portion of the key space, what
DO you do?  Continue inserting tuples into the table even though it's
no longer a partition?  Throw tuples destined for that partition away?
 You can make an argument for both of those behaviors, but they're
both pretty strange.  The first one means that for an arbitrarily long
period of time after detaching a partition, the partition may continue
to receive inserts that were destined for its former parent.  The
second one means that your data can disappear into the ether.  I don't
like either of those things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Nov-06, Robert Haas wrote:

> If you don't throw an error when a partition is concurrently detached
> and then someone routes a tuple to that portion of the key space, what
> DO you do?  Continue inserting tuples into the table even though it's
> no longer a partition?

Yes -- the table was a partition when the query started, so it's still
a partition from the point of view of that query's snapshot.

> Throw tuples destined for that partition away?

Surely not.  (/me doesn't beat straw men anyway.)

> You can make an argument for both of those behaviors, but they're
> both pretty strange.  The first one means that for an arbitrarily long
> period of time after detaching a partition, the partition may continue
> to receive inserts that were destined for its former parent.

Not arbitrarily long -- only as long as those old snapshots live.  I
don't find this at all surprising.


(I think DETACH is not related to DROP in any way.  My proposal is that
DETACH can work concurrently, and if people want to drop the partition
later they can wait until snapshots/queries that could see that
partition are gone.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Nov 6, 2018 at 2:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-06, Robert Haas wrote:
> > If you don't throw an error when a partition is concurrently detached
> > and then someone routes a tuple to that portion of the key space, what
> > DO you do?  Continue inserting tuples into the table even though it's
> > no longer a partition?
>
> Yes -- the table was a partition when the query started, so it's still
> a partition from the point of view of that query's snapshot.

I think it's important to point out that DDL does not in general
respect the query snapshot.  For example, you can query a table that
was created by a transaction not visible to your query snapshot.  You
cannot query a table that was dropped by a transaction not visible to
your query snapshot.  If someone runs ALTER FUNCTION on a function
your query uses, you get the latest committed version, not the version
that was current at the time your query snapshot was created.  So, if
we go with the semantics you are proposing here, we will be making
this DDL behave differently from pretty much all other DDL.

Possibly that's OK in this case, but it's easy to think of other cases
where it could cause problems.  To take an example that I believe was
discussed on-list a number of years ago, suppose that ADD CONSTRAINT
worked according to the model that you are proposing for ATTACH
PARTITION.  If it did, then one transaction could be concurrently
inserting a tuple while another transaction was adding a constraint
which the tuple fails to satisfy.  Once both transactions commit, you
have a table with a supposedly-valid constraint and a tuple inside of
it that doesn't satisfy that constraint.  Obviously, that's no good.

I'm not entirely sure whether there are any similar dangers in the
case of DETACH PARTITION.  I think it depends a lot on what can be
done with that detached partition while the overlapping transaction is
still active.  For instance, suppose you attached it to the original
table with a different set of partition bounds, or attached it to some
other table with a different set of partition bounds.  If you can do
that, then I think it effectively creates the problem described in the
previous paragraph with respect to the partition constraint.

IOW, we've got to somehow prevent this:

setup: partition is attached with bounds 1 to a million
S1: COPY begins
S2: partition is detached
S2: partition is reattached with bounds 1 to a thousand
S1: still-running copy inserts a tuple with value ten thousand

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Simon Riggs
Date:
On Tue, 6 Nov 2018 at 11:06, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> > If you can remove the ERROR without any other adverse effects, that sounds great.
> >
> > Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks
>
> Well, I've already written about this in two previous emails on this
> thread, so I'm not sure exactly what you think is missing.  But to
> state the problem again:

I was discussing the ERROR in relation to parallel query, not COPY.

I didn't understand how that would be achieved.

Thanks for working on this.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <andres@anarazel.de> wrote:
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

Some analysis of possible trouble spots:

- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc.  Presumably, this means that if
the partition descriptor gets updated on the fly, the tuple routing
and partition dispatch code could end up with different ideas about
which partitions exist.  I think this should be fixed somehow, so that
we only call RelationGetPartitionDesc once per query and use the
result for everything.

- expand_inherited_rtentry checks
RelationGetPartitionDesc(oldrelation) != NULL.  If so, it calls
expand_partitioned_rtentry which fetches the same PartitionDesc again.
We can probably just do this once in the caller and pass the result
down.

- set_relation_partition_info also calls RelationGetPartitionDesc.
Off-hand, I think this code runs after expand_inherited_rtentry.  Not
sure what to do about this.  I'm not sure what the consequences would
be if this function and that one had different ideas about the
partition descriptor.

- tablecmds.c is pretty free about calling RelationGetPartitionDesc
repeatedly, but it probably doesn't matter.  If we're doing some kind
of DDL that depends on the contents of the partition descriptor, we
*had better* be holding a lock strong enough to prevent the partition
descriptor from being changed by somebody else at the same time.
Allowing a partition to be added concurrently with DML is one thing;
allowing a partition to be added concurrently with adding another
partition is a whole different level of insanity.  I think we'd be
best advised not to go down that rathole - among other concerns, how
would you even guarantee that the partitions being added didn't
overlap?

Generally:

Is it really OK to postpone freeing the old partition descriptor until
the relation reference count goes to 0?  I wonder if there are cases
where this could lead to tons of copies of the partition descriptor
floating around at the same time, gobbling up precious cache memory.
My first thought was that this would be pretty easy: just create a lot
of new partitions one by one while some long-running transaction is
open.  But the actual result in that case depends on the behavior of
the backend running the transaction.  If it just ignores the new
partitions and sticks with the partition descriptor it has got, then
probably nothing else will request the new partition descriptor either
and there will be no accumulation of memory.  However, if it tries to
absorb the updated partition descriptor, but without being certain
that the old one can be freed, then we'd have a query-lifespan memory
leak which is quadratic in the number of new partitions.

Maybe even that would be OK -- we could suppose that the number of new
partitions probably wouldn't be all THAT crazy large, and the constant
factor not too bad, so maybe you'd leak a couple of MB for the length
of the query, but no more.  However, I wonder if it would better to
give each PartitionDescData its own refcnt, so that it can be freed
immediately when the refcnt goes to zero.  That would oblige every
caller of RelationGetPartitionDesc() to later call something like
ReleasePartitionDesc().  We could catch failures to do that by keeping
all the PartitionDesc objects so far created in a list.  When the main
entry's refcnt goes to 0, cross-check that this list is empty; if not,
then the remaining entries have non-zero refcnts that were leaked.  We
could emit a WARNING as we do in similar cases.
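
To illustrate, here is a minimal sketch of the refcounting idea, with
hypothetical type and function names (the real bookkeeping would have to
live in the relcache):

typedef struct PartitionDescSketch
{
    int         refcnt;         /* number of active users */
    bool        superseded;     /* replaced by a newer rebuild? */
    int         nparts;
    Oid        *oids;
    /* bounds etc. omitted */
} PartitionDescSketch;

static void
ReleasePartitionDescSketch(PartitionDescSketch *partdesc)
{
    Assert(partdesc->refcnt > 0);
    partdesc->refcnt--;

    /* Free only once nobody uses it and the relcache has moved on. */
    if (partdesc->refcnt == 0 && partdesc->superseded)
    {
        pfree(partdesc->oids);
        pfree(partdesc);
    }
}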

In general, I think something along the lines you are suggesting here
is the right place to start attacking this problem.  Effectively, we
immunize the system against the possibility of new entries showing up
in the partition descriptor while concurrent DML is running; the
semantics are that the new partitions are ignored for the duration of
currently-running queries.  This seems to allow for painless creation
or addition of new partitions in normal cases, but not when a default
partition exists.  In that case, using the old PartitionDesc is
outright wrong, because adding a new toplevel partition changes the
default partition's partition constraint. We can't insert into the
default partition a tuple that under the updated table definition
needs to go someplace else.  It seems like the best way to account for
that is to reduce the lock level on the partitioned table to
ShareUpdateExclusiveLock, but leave the lock level on any default
partition as AccessExclusiveLock (because we are modifying a
constraint on it).  We would also need to leave the lock level on the
new partition as AccessExclusiveLock (because we are adding a
constraint on it).  Not perfect, for sure, but not bad for a first
patch, either; it would improve things for users in a bunch of
practical cases.
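
For concreteness, the lock choices being proposed would look roughly like
this in the ATTACH path -- a sketch only, with a hypothetical function
shape; the real code goes through ALTER TABLE's machinery:

static void
attach_partition_lock_sketch(RangeVar *parent_rv, RangeVar *attach_rv)
{
    /* Reduced lock on the parent: ShareUpdateExclusive instead of AEL. */
    Relation    parent = heap_openrv(parent_rv, ShareUpdateExclusiveLock);

    /* The table being attached gets a new partition constraint: keep AEL. */
    Relation    attachrel = heap_openrv(attach_rv, AccessExclusiveLock);

    /* Any default partition's constraint also changes, so AEL there too. */
    Oid         defaultPartOid =
        get_default_oid_from_partdesc(RelationGetPartitionDesc(parent));

    if (OidIsValid(defaultPartOid))
        (void) heap_open(defaultPartOid, AccessExclusiveLock);

    /* ... validation and catalog updates would happen here ... */

    heap_close(attachrel, NoLock);
    heap_close(parent, NoLock);
}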

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Nov-06, Alvaro Herrera wrote:

> On 2018-Nov-06, Robert Haas wrote:

> > Throw tuples destined for that partition away?
> 
> Surely not.  (/me doesn't beat straw men anyway.)

Hmm, apparently this can indeed happen with my patch :-(

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Nov 6, 2018 at 10:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > Throw tuples destined for that partition away?
> > Surely not.  (/me doesn't beat straw men anyway.)
>
> Hmm, apparently this can indeed happen with my patch :-(

D'oh.  This is a hard problem, especially the part of it that involves
handling detach, so I wouldn't feel too bad about that.  However, to
beat this possibly-dead horse a little more, I think you made the
error of writing a patch that (1) tried to solve too many problems at
once and (2) didn't seem to really have a clear, well-considered idea
about what the semantics ought to be.

This is not intended as an attack; I want to work with you to solve
the problem, not have a fight about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Nov 6, 2018 at 5:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.  Presumably, this means that if
> the partition descriptor gets updated on the fly, the tuple routing
> and partition dispatch code could end up with different ideas about
> which partitions exist.  I think this should be fixed somehow, so that
> we only call RelationGetPartitionDesc once per query and use the
> result for everything.

I think there is deeper trouble here.
ExecSetupPartitionTupleRouting() calls find_all_inheritors() to
acquire RowExclusiveLock on the whole partitioning hierarchy.  It then
calls RelationGetPartitionDispatchInfo (as a non-relcache function,
this seems poorly named) which calls get_partition_dispatch_recurse,
which does this:

            /*
             * We assume all tables in the partition tree were already locked
             * by the caller.
             */
            Relation    partrel = heap_open(partrelid, NoLock);

That seems OK at present, because no new partitions can have appeared
since ExecSetupPartitionTupleRouting() acquired locks.  But if we
allow new partitions to be added with only ShareUpdateExclusiveLock,
then I think there would be a problem.  If a new partition OID creeps
into the partition descriptor after find_all_inheritors() and before
we fetch its partition descriptor, then we wouldn't have previously
taken a lock on it and would still be attempting to open it without a
lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).

Admittedly, it might be a bit hard to provoke a failure here because
I'm not exactly sure how you could trigger a relcache reload in the
critical window, but I don't think we should rely on that.

More generally, it seems a bit strange that we take the approach of
locking the entire partitioning hierarchy here regardless of which
relations the query actually knows about.  If some relations have been
pruned, presumably we don't need to lock them; if/when we permit
concurrent partition attachment, we don't need to lock any new ones that have
materialized.  We're just going to end up ignoring them anyway because
there's nothing to do with the information that they are or are not
excluded from the query when they don't appear in the query plan in
the first place.

Furthermore, this whole thing looks suspiciously like more of the sort
of redundant locking that f2343653f5b2aecfc759f36dbb3fd2a61f36853e
attempted to eliminate.  In light of that commit message, I'm
wondering whether the best approach would be to [1] get rid of the
find_all_inheritors call altogether and [2] somehow ensure that
get_partition_dispatch_recurse() doesn't open any tables that aren't
part of the query's range table.

Thoughts?  Comments?  Ideas?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Nov-06, Robert Haas wrote:

> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.

My patch deals with this by caching descriptors in the active snapshot.
So those two things would get the same partition descriptor.  There's no
RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
place.

(I tried to use different scoping than the active snapshot; I first
tried the Portal, then I tried the resource owner.  But nothing seems to
fit as precisely as the active snapshot.)

> - expand_inherited_rtentry checks
> RelationGetPartitionDesc(oldrelation) != NULL.  If so, it calls
> expand_partitioned_rtentry which fetches the same PartitionDesc again.

This can be solved by changing the test to a relkind one, as my patch
does.

> - set_relation_partition_info also calls RelationGetPartitionDesc.
> Off-hand, I think this code runs after expand_inherited_rtentry.  Not
> sure what to do about this.  I'm not sure what the consequences would
> be if this function and that one had different ideas about the
> partition descriptor.

Snapshot caching, like in my patch, again solves this problem.

> - tablecmds.c is pretty free about calling RelationGetPartitionDesc
> repeatedly, but it probably doesn't matter.  If we're doing some kind
> of DDL that depends on the contents of the partition descriptor, we
> *had better* be holding a lock strong enough to prevent the partition
> descriptor from being changed by somebody else at the same time.

My patch deals with this by unlinking the partcache entry from the hash
table on relation invalidation, so DDL code would obtain a fresh copy
each time (lookup_partcache_entry).

In other words, I already solved these problems you list.

Maybe you could give my patch a look.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Wed, Nov 7, 2018 at 12:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-06, Robert Haas wrote:
> > - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> > both call RelationGetPartitionDesc.
>
> My patch deals with this by caching descriptors in the active snapshot.
> So those two things would get the same partition descriptor.  There's no
> RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
> place.
>
> (I tried to use different scoping than the active snapshot; I first
> tried the Portal, then I tried the resource owner.  But nothing seems to
> fit as precisely as the active snapshot.)
...
> In other words, I already solved these problems you list.
>
> Maybe you could give my patch a look.

I have, a bit.  One problem I'm having is that while you explained the
design you chose in a fair amount of detail, you didn't give a lot of
explanation (that I have seen) of the reasons why you chose that
design.  If there's a README or a particularly good comment someplace
that I should be reading to understand that better, please point me in
the right direction.

And also, I just don't really understand what all the problems are
yet.  I'm only starting to study this.

I am a bit skeptical of your approach, though.  Tying it to the active
snapshot seems like an awfully big hammer.  Snapshot manipulation can
be a performance bottleneck both in terms of actual performance and
also in terms of code complexity, and I don't really like the idea of
adding more code there.  It's not a sustainable pattern for making DDL
work concurrently, either -- I'm pretty sure we don't want to add new
code to things like GetLatestSnapshot() every time we want to make a
new kind of DDL concurrent.  Also, while hash table lookups are pretty
cheap, they're not free.  In my opinion, to the extent that we can, it
would be better to refactor things to avoid duplicate lookups of the
PartitionDesc rather than to install a new subsystem that tries to
make sure they always return the same answer.

Such an approach seems to have other possible advantages.  For
example, if a COPY is running and a new partition shows up, we might
actually want to allow tuples to be routed to it.  Maybe that's too
pie in the sky, but if we want to preserve the option to do such
things in the future, a hard-and-fast rule that the apparent partition
descriptor doesn't change unless the snapshot changes seems like it
might get in the way.  It seems better to me to have a system where
code that accesses the relcache has a choice, so that at its option it
can either hang on to the PartitionDesc it has or get a new one that
may be different.  If we can do things that way, it gives us the most
flexibility.

After the poking around I've done over the last 24 hours, I do see
that there are some non-trivial problems with making it that way, but
I'm not really ready to give up yet.

Does that make sense to you, or am I all wet here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 8 November 2018 at 05:05, Robert Haas <robertmhaas@gmail.com> wrote:
> That seems OK at present, because no new partitions can have appeared
> since ExecSetupPartitionTupleRouting() acquired locks.  But if we
> allow new partitions to be added with only ShareUpdateExclusiveLock,
> then I think there would be a problem.  If a new partition OID creeps
> into the partition descriptor after find_all_inheritors() and before
> we fetch its partition descriptor, then we wouldn't have previously
> taken a lock on it and would still be attempting to open it without a
> lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).
>
> Admittedly, it might be a bit hard to provoke a failure here because
> I'm not exactly sure how you could trigger a relcache reload in the
> critical window, but I don't think we should rely on that.
>
> More generally, it seems a bit strange that we take the approach of
> locking the entire partitioning hierarchy here regardless of which
> relations the query actually knows about.  If some relations have been
> pruned, presumably we don't need to lock them; if/when we permit
> concurrent partition, we don't need to lock any new ones that have
> materialized.  We're just going to end up ignoring them anyway because
> there's nothing to do with the information that they are or are not
> excluded from the query when they don't appear in the query plan in
> the first place.

While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply put, any partition which is a descendant of the target
partitioned table could receive the tuple, regardless of what might
have been pruned.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Wed, Nov 7, 2018 at 7:06 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> While the find_all_inheritors() call is something I'd like to see
> gone, I assume it was done that way since an UPDATE might route a
> tuple to a partition that there is no subplan for and due to INSERT
> with VALUES not having any RangeTblEntry for any of the partitions.
> Simply, any partition which is a descendant of the target partition
> table could receive the tuple regardless of what might have been
> pruned.

Thanks.  I had figured out since my email of earlier today that it was
needed in the INSERT case, but I had not thought of/discovered the
case of an UPDATE that routes a tuple to a pruned partition.  I think
that latter case may not be tested in our regression tests, which is
perhaps something we ought to change.

Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks.  However, I don't think we're really achieving
that goal despite this code.  If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order.  So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock.  That's probably a worthwhile simplification and a
slight optimization regardless of anything else.
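
In code form, the simplification would look roughly like this -- a sketch
with a made-up function shape; the actual change would be inside
get_partition_dispatch_recurse():

static void
dispatch_recurse_sketch(Oid partrelid, List **pdlist)
{
    /* was: heap_open(partrelid, NoLock) -- now take the DML lock here */
    Relation    partrel = heap_open(partrelid, RowExclusiveLock);

    /*
     * ... build the PartitionDispatch entry for partrel, append it to
     * *pdlist, and recurse into any partitioned children ...
     */

    /* keep the lock; just drop the relcache reference */
    heap_close(partrel, NoLock);
}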

But I really think it would be better if we could also jigger this to
avoid reopening relations which the executor has already opened and
locked elsewhere.  Unfortunately, I don't see a really simple way to
accomplish that.  We get the OIDs of the descendants and want to know
whether there is a range table entry for that OID; but there's no data
structure which answers that question at present, I believe, and
introducing one just for this purpose seems like an awful lot of new
machinery.  Perhaps that new machinery would still have less
far-reaching consequences than the machinery Alvaro is proposing, but,
still, it's not very appealing.

Perhaps one idea is only open and lock partitions on demand - i.e. if
a tuple actually gets routed to them.  There are good reasons to do
that independently of reducing lock levels, and we certainly couldn't
do it without having some efficient way to check whether it had
already been done.  So then the mechanism wouldn't feel so much
like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
Amit Langote already working on this, or was that some other kind of
on-demand locking?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 8 November 2018 at 15:01, Robert Haas <robertmhaas@gmail.com> wrote:
> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

I'd not thought of the locks taken elsewhere case. I guess it just
perhaps reduces the chances of a deadlock then.

A "slight optimization" is one way to categorise it. There are some
benchmarks you might find interesting in [1] and [2]. Patch 0002 does
just what you mention.

[1] https://www.postgresql.org/message-id/06524959-fda8-cff9-6151-728901897b79%40redhat.com
[2] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/08 11:01, Robert Haas wrote:
> On Wed, Nov 7, 2018 at 7:06 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> While the find_all_inheritors() call is something I'd like to see
>> gone, I assume it was done that way since an UPDATE might route a
>> tuple to a partition that there is no subplan for and due to INSERT
>> with VALUES not having any RangeTblEntry for any of the partitions.
>> Simply, any partition which is a descendant of the target partition
>> table could receive the tuple regardless of what might have been
>> pruned.
> 
> Thanks.  I had figured out since my email of earlier today that it was
> needed in the INSERT case, but I had not thought of/discovered the
> case of an UPDATE that routes a tuple to a pruned partition.  I think
> that latter case may not be tested in our regression tests, which is
> perhaps something we ought to change.
> 
> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

A patch that David and I have been working on over at:

https://commitfest.postgresql.org/20/1690/

does that.  With that patch, partitions (leaf or not) are locked and
opened only if a tuple is routed to them.  In edd44738bc (Be lazier about
partition tuple routing), we postponed the opening of leaf partitions, but
we still left the RelationGetPartitionDispatchInfo machine which
recursively creates PartitionDispatch structs for all partitioned tables
in a tree.  The patch mentioned above postpones even the initialization of
partitioned partitions (sub-partition parents) to a point after a tuple is
routed to them.

The patch doesn't yet eliminate the find_all_inheritors call from
ExecSetupPartitionTupleRouting.  But that's mostly because of the fear
that if we start becoming lazier about locking individual partitions too,
we'll get a non-deterministic locking order on partitions, which we might
want to avoid for fear of deadlocks.  Maybe we don't need to be fearful,
though.

> But I really think it would be better if we could also jigger this to
> avoid reopening relations which the executor has already opened and
> locked elsewhere.  Unfortunately, I don't see a really simple way to
> accomplish that.  We get the OIDs of the descendents and want to know
> whether there is range table entry for that OID; but there's no data
> structure which answers that question at present, I believe, and
> introducing one just for this purpose seems like an awful lot of new
> machinery.  Perhaps that new machinery would still have less
> far-reaching consequences than the machinery Alvaro is proposing, but,
> still, it's not very appealing.

The newly added ExecGetRangeTableRelation opens (if not already done) and
returns a Relation pointer for tables that are present in the range table,
so it requires being passed a valid RT index.  That works for tables that the
planner touched.  UPDATE tuple routing benefits from that in cases where
the routing target is already in the range table.

For the insert itself, the planner adds only the target partitioned table to
the range table.  Partitions that the inserted tuples will be routed to may
be present in the range table via some other plan node, but the insert's
execution state won't know about them, so it cannot use
ExecGetRangeTableRelation.

> Perhaps one idea is only open and lock partitions on demand - i.e. if
> a tuple actually gets routed to them.  There are good reasons to do
> that independently of reducing lock levels, and we certainly couldn't
> do it without having some efficient way to check whether it had
> already been done.  So then the mechanism wouldn't feel like so much
> like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
> Amit Langote already working on this, or was that some other kind of
> on-demand locking?)

I think the patch mentioned above gets us closer to that goal.

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is
called RelationBuildPartitionDesc in master and BuildPartitionDesc in
Alvaro's patch has a synchronization problem as soon as we start to
reduce lock levels.  At some point, find_inheritance_children() gets
called to get a list of the OIDs of the partitions. Then, later,
SysCacheGetAttr(RELOID, ...) gets called for each one to get its
relpartbound value.  But since catalog lookups use the most current
snapshot, they might not see a compatible view of the catalogs.

That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null
because it's been detached.
- We might see a newer version of relpartbound where it now has an
unrelated value because it has been detached and then reattached to
some other partitioned table.
- We might see newer versions of relpartbound for some tables than
others.  For instance, suppose we had partition A for 1..200 and B for
201..300.  Then we realize that this is not what we actually wanted to
do, so we detach A and reattach it with a bound of 1..100 and detach
B and reattach it with a bound of 101..300.  If we perform the
syscache lookup for A before this happens and the syscache lookup for
B after this happens, we might see the old bound for A and the new
bound for B, and that would be sad, 'cuz they overlap.
- Seeing an older relpartbound for some other table is also a problem
for other reasons -- we will have the wrong idea about the bounds of
that partition and may put the wrong tuples into it.  Without
AccessExclusiveLock, I don't think there is anything that keeps us
from reading stale syscache entries.

Alvaro's patch defends against the first of these cases by throwing an
error, which, as I already said, I don't think is acceptable, but I
don't see any defense at all against the other cases. The root of the
problem is that the way catalog lookups work today - each individual
lookup uses the latest available snapshot, but there is zero guarantee
that consecutive lookups use the same snapshot.  Therefore, as soon as
you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups.  That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all.  Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.

Note that this problem still exists even if you allow concurrent
attach but not concurrent detach, but it's not as bad, because when
you encounter a concurrently-attached partition, you know it hasn't
also been concurrently-detached from someplace else.  Presumably you
either see the latest value of the partition bound or the NULL value
which preceded it, but not anything else.  If that's so, then maybe
you could get by without using a consistent snapshot for all of your
information gathering: if you see NULL, you know that the partition
was concurrently added and you just ignore it. There's still no
guarantee that all parallel workers would come to the same conclusion,
though, which doesn't feel too good.

Personally, I don't think it's right to blame that problem on parallel
query.  The problem is more general than that: we assume that holding
any kind of a lock on a relation is enough to keep the important
details of the relation static, and therefore it's fine to do
staggered lookups within one backend, and it's also fine to do
staggered lookups across different backends.  When you remove the
basic assumption that any lock is enough to prevent concurrent DDL,
then the whole idea that you can do different lookups at different
times with different snapshots (possibly in different backends) and
get sane answers also ceases to be correct.  But the idea that you can
look up different bits of catalog data at whatever time is convenient
undergirds large amounts of our current machinery -- it's built into
relcache, syscache, sinval, ...

I think that things get even crazier if we postpone locking on
individual partitions until we need to do something with that
partition, as has been proposed elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote:
> I suspect the only good way of fixing this problem is using a single
> snapshot to perform both the scan of pg_inherits and the subsequent
> pg_class lookups.  That way, you know that you are seeing the state of
> the whole partitioning hierarchy as it existed at some particular
> point in time -- every commit is either fully reflected in the
> constructed PartitionDesc or not reflected at all.  Unfortunately,
> that would mean that we can't use the syscache to perform the lookups,
> which might have unhappy performance consequences.

I do have a patch sitting around that moves the relpartbound into a
new catalogue table named pg_partition. This gets rid of the usage of
pg_inherits for partitioned tables. I wonder if that problem is easier
to solve with that.  It also solves the issue with long partition keys
and lack of toast table on pg_class.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Nov 8, 2018 at 3:59 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote:
> > I suspect the only good way of fixing this problem is using a single
> > snapshot to perform both the scan of pg_inherits and the subsequent
> > pg_class lookups.  That way, you know that you are seeing the state of
> > the whole partitioning hierarchy as it existed at some particular
> > point in time -- every commit is either fully reflected in the
> > constructed PartitionDesc or not reflected at all.  Unfortunately,
> > that would mean that we can't use the syscache to perform the lookups,
> > which might have unhappy performance consequences.
>
> I do have a patch sitting around that moves the relpartbound into a
> new catalogue table named pg_partition. This gets rid of the usage of
> pg_inherits for partitioned tables. I wonder if that problem is easier
> to solve with that.  It also solves the issue with long partition keys
> and lack of toast table on pg_class.

Yeah, I thought about that, and it does make some sense.  Not sure if
it would hurt performance to have to access another table, but maybe
it comes out in the wash if pg_inherits is gone?  Seems like a fair
amount of code rearrangement just to get around the lack of a TOAST
table on pg_class, but maybe it's worth it.

I had another idea, too.  I think we might be able to reuse the
technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
That is:

- make a note of SharedInvalidMessageCounter before doing any of the
relevant catalog lookups
- do them
- AcceptInvalidationMessages()
- if SharedInvalidMessageCounter has changed, discard all the data we
collected and retry from the top

I believe that is sufficient to guarantee that whatever we construct
will have a consistent view of the catalogs which is the most recent
available view as of the time we do the work.  And with this approach
I believe we can continue to use syscache lookups to get the data
rather than having to use actual index scans, which is nice.
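
A sketch of that retry loop -- SharedInvalidMessageCounter and
AcceptInvalidationMessages() are real, but collect_bounds_via_syscache()
is a hypothetical stand-in for the pg_inherits scan plus the per-child
relpartbound syscache lookups:

static List *
get_partition_info_consistently(Oid parentrelid)
{
    for (;;)
    {
        uint64      inval_count = SharedInvalidMessageCounter;
        List       *info;

        /* Hypothetical: pg_inherits scan + per-child relpartbound lookups. */
        info = collect_bounds_via_syscache(parentrelid);

        AcceptInvalidationMessages();
        if (inval_count == SharedInvalidMessageCounter)
            return info;        /* no invalidation arrived; data is coherent */

        /* Something changed under us: discard what we collected and retry. */
        list_free_deep(info);
    }
}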

Then again, with your approach I'm guessing that one index scan would
get us the list of children and their partition bound information.
That would be even better -- the syscache lookup per child goes away
altogether; it's just a question of deforming the pg_partition tuples.

Way back at the beginning of the partitioning work, I mulled over the
idea of storing the partition bound information in a new column in
pg_inherits rather than in pg_class.  I wonder why exactly I rejected
that idea, and whether I was wrong to do so.  One possible advantage
of that approach over a pg_partition table is that client code
which queries pg_inherits wouldn't have to be adjusted; if we stop using
pg_inherits, some of those queries are going to get more complicated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I had another idea, too.  I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work.  And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.

Here are a couple of patches to illustrate this approach to this part
of the overall problem.  0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient.  0002
adjusts things so that - I hope - the partition bounds we get for the
individual partitions have to be as of the same point in the commit
sequence as the list of children.  As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.

Another part of the problem is finding a way to make sure that if we
execute a query (or plan one), all parts of the executor (or planner)
see the same PartitionDesc for every relation.  In the case of
parallel query, I think it's important to try to get consistency not
only within a single backend but also across backends.  I'm thinking
about perhaps creating an object with a name like
PartitionDescDirectory which can optionally attach to dynamic shared
memory.  It would store an OID -> PartitionDesc mapping in local
memory, and optionally, an additional OID -> serialized-PartitionDesc
in DSA.  When given an OID, it would check the local hash table first,
and then if that doesn't find anything, check the shared hash table if
there is one.  If an entry is found there, deserialize and add to the
local hash table.  We'd then hang such a directory off of the EState
for the executor and the PlannerInfo for the planner.  As compared
with Alvaro's proposal, this approach has the advantage of not
treating parallel query as a second-class citizen, and also of keeping
partitioning considerations out of the snapshot handling, which as I
said before seems to me to be a good idea.
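
A rough sketch of what such a directory could look like, with hypothetical
type and function names and the DSA side elided:

typedef struct PartDescDirEntry
{
    Oid             reloid;     /* hash key */
    PartitionDesc   partdesc;   /* descriptor pinned for this query */
} PartDescDirEntry;

typedef struct PartitionDescDirectory
{
    HTAB       *hash;           /* backend-local OID -> PartitionDesc */
    /* an optional shared table of serialized descriptors would go here */
} PartitionDescDirectory;

static PartitionDesc
PartitionDirectoryLookup(PartitionDescDirectory *dir, Relation rel)
{
    Oid         reloid = RelationGetRelid(rel);
    bool        found;
    PartDescDirEntry *entry;

    entry = (PartDescDirEntry *) hash_search(dir->hash, &reloid,
                                             HASH_ENTER, &found);
    if (!found)
        entry->partdesc = RelationGetPartitionDesc(rel);
    return entry->partdesc;
}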

One thing which was vaguely on my mind in earlier emails but which I
think I can now articulate somewhat more clearly is this: In some
cases, a consistent but outdated view of the catalog state is just as
bad as an inconsistent view of the catalog state.  For example, it's
not OK to decide that a tuple can be placed in a certain partition
based on an outdated list of relation constraints, including the
partitioning constraint - nor is it OK to decide that a partition can
be pruned based on an old view of the partitioning constraint.  I
think this means that whenever we change a partition's partitioning
constraint, we MUST take AccessExclusiveLock on the partition.
Otherwise, a heap_insert() [or a partition pruning decision] can be in
progress on that relation in one backend at the same time that some
other backend is changing the partition constraint, which can't
possibly work.  The best we can reasonably do is to reduce the locking
level on the partitioned table itself.

A corollary is that holding the PartitionDescs that a particular query
sees for a particular relation fixed, whether by the method Alvaro
proposes or by what I am proposing here or by some other method is not
a panacea.  For example, the ExecPartitionCheck call in copy.c
sometimes gets skipped on the theory that if tuple routing has sent us
to partition X, then the tuple being routed must satisfy the partition
constraint for that partition.  But that's not true if we set up tuple
routing using one view of the catalogs, and then things changed
afterwards.  RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked those children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Wed, Nov 14, 2018 at 02:27:31PM -0500, Robert Haas wrote:
> Here are a couple of patches to illustrate this approach to this part
> of the overall problem.  0001 is, I think, a good cleanup that may as
> well be applied in isolation; it makes the code in
> RelationBuildPartitionDesc both cleaner and more efficient.  0002
> adjust things so that - I hope - the partition bounds we get for the
> individual partitions has to be as of the same point in the commit
> sequence as the list of children.  As I noted before, Alvaro's patch
> doesn't seem to have tackled this part of the problem.

You may want to rebase these patches as of b52b7dc2, and change the
first argument of partition_bounds_create() so that a list is used as
input...
--
Michael

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/15 4:27, Robert Haas wrote:
> RelationBuildPartitionDesc doesn't lock the children
> whose relpartbounds it is fetching (!), so unless we're guaranteed to
> have already locked them children earlier for some other reason, we
> could grab the partition bound at this point and then it could change
> again before we get a lock on them.

Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a
partition before fetching its relpartbound, because the latter can't
change if the caller is holding a lock on the parent, which it must be if
we're in RelationBuildPartitionDesc for the parent at all.  Am I missing
something?


As Michael pointed out, the first cleanup patch needs to be rebased due to
a recent commit [1].  I did that to see if something we did in that commit
made things worse for your patch, but seems fine.  I had to go and change
things outside RelationBuildPartitionDesc as I rebased, due to the
aforementioned commit, but they're simple changes such as changing List *
arguments of some newly added functions to PartitionBoundSpec **.  Please
find the rebased patches attached with this email.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/15 11:03, Amit Langote wrote:
> As Michael pointed out, the first cleanup patch needs to be rebased due to
> a recent commit [1].  I did that to see if something we did in that commit
> made things worse for your patch, but seems fine.  I had to go and change
> things outside RelationBuildPartitionDesc as I rebased, due to the
> aforementioned commit, but they're simple changes such as changing List *
> arguments of some newly added functions to PartitionBoundSpec **.  Please
> find the rebased patches attached with this email.
> 
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2

I noticed that the regression tests containing partitioned tables fail
randomly with the rebased patches I posted, whereas they didn't when I
applied them to HEAD without [1].

It seems to be due to the slightly confused memory context handling in
RelationBuildPartitionDesc after [1], which Alvaro had expressed some
doubts about yesterday.

I've fixed 0001 again to re-order the code so that allocations happen in
the correct context, and now tests pass with the rebased patches.

By the way, I noticed that the oids array added by Robert's original 0001
patch wasn't initialized to NULL, which could lead to calling pfree on a
garbage value of oids after the 2nd patch.

Thanks,
Amit

[2]
https://www.postgresql.org/message-id/20181113135915.v4r77tdthlajdlqq%40alvherre.pgsql

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
> I've fixed 0001 again to re-order the code so that allocations happen the
> correct context and now tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make the current
situation even messier.  Coming to my point: do we actually have any need
to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation has no
partitions?  It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.
--
Michael

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/15 14:38, Michael Paquier wrote:
> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
>> I've fixed 0001 again to re-order the code so that allocations happen the
>> correct context and now tests pass with the rebased patches.
> 
> I have been looking at 0001, and it seems to me that you make even more
> messy the current situation.  Coming to my point: do we have actually
> any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
> has no partitions?  It seems to me that we had better set rd_pdcxt and
> rd_partdesc to NULL in this case.

As things stand today, rd_partdesc of a partitioned table must always be
non-NULL.  In fact, there are many places in the backend code that Assert it:

tablecmds.c: ATPrepDropNotNull()

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        PartitionDesc partdesc = RelationGetPartitionDesc(rel);

        Assert(partdesc != NULL);

prepunion.c: expand_partitioned_rtentry()

    PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);

    check_stack_depth();

    /* A partitioned table should always have a partition descriptor. */
    Assert(partdesc);

plancat.c: set_relation_partition_info()

    partdesc = RelationGetPartitionDesc(relation);
    partkey = RelationGetPartitionKey(relation);
    rel->part_scheme = find_partition_scheme(root, relation);
    Assert(partdesc != NULL && rel->part_scheme != NULL);

Maybe there are others in a different form.

If there are no partitions, nparts is 0, and other fields are NULL, though
rd_partdesc itself is never NULL.

If we want to redesign that and allow it to be NULL until some code in the
backend wants to use it, then maybe we can consider doing what you say.
But, many non-trivial operations on partitioned tables require the
PartitionDesc, so there is perhaps not much point to designing it such
that rd_partdesc is set only when needed, because it will be referenced
sooner rather than later.  Maybe we can consider doing that sort of thing for
boundinfo, because it's expensive to build, and not all operations want
the canonicalized bounds.

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Thu, Nov 15, 2018 at 02:53:47PM +0900, Amit Langote wrote:
> As things stand today, rd_partdesc of a partitioned table must always be
> non-NULL.  In fact, there are many places in the backend code that Assert it:
>
> [...]

I have noticed those, and they actually would not care much if
rd_partdesc was actually NULL.  I find it interesting that the planner
portion actually does roughly the same thing for a partitioned table
with no partitions as for a non-partitioned table.

> Maybe there are others in a different form.
>
> If there are no partitions, nparts is 0, and other fields are NULL, though
> rd_partdesc itself is never NULL.

I find it a bit confusing that both concepts have the same meaning, i.e.
that a relation has no partitions, and that it is actually relkind which
decides whether rd_partdesc should be NULL or set up.  This stuff also
does empty allocations.

> If we want to redesign that and allow it to be NULL until some code in the
> backend wants to use it, then maybe we can consider doing what you say.
> But, many non-trivial operations on partitioned tables require the
> PartitionDesc, so there is perhaps not much point to designing it such
> that rd_partdesc is set only when needed, because it will be referenced
> sooner than later.  Maybe, we can consider doing that sort of thing for
> boundinfo, because it's expensive to build, and not all operations want
> the canonicalized bounds.

I am fine if that's the consensus of this thread.  But as far as I can
see it is possible to remove a bit of the memory handling mess by doing
so.  My 2c.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/15 15:22, Michael Paquier wrote:
>> If there are no partitions, nparts is 0, and other fields are NULL, though
>> rd_partdesc itself is never NULL.
> 
> I find it a bit confusing that both concepts have the same meaning, i.e.
> that a relation has no partitions, and that it is actually relkind which
> decides whether rd_partdesc should be NULL or set up.  This stuff also
> does empty allocations.
> 
>> If we want to redesign that and allow it to be NULL until some code in the
>> backend wants to use it, then maybe we can consider doing what you say.
>> But, many non-trivial operations on partitioned tables require the
>> PartitionDesc, so there is perhaps not much point to designing it such
>> that rd_partdesc is set only when needed, because it will be referenced
>> sooner than later.  Maybe, we can consider doing that sort of thing for
>> boundinfo, because it's expensive to build, and not all operations want
>> the canonicalized bounds.
> 
> I am fine if that's the consensus of this thread.  But as far as I can
> see it is possible to remove a bit of the memory handling mess by doing
> so.  My 2c.

Perhaps we can discuss this on another thread.  I know this thread contains
important points about partition descriptor creation and modification, but
memory context considerations seem like a separate topic.  The following
message could be a starting point, because there we were talking about
perhaps a similar design to the one you're describing:

https://www.postgresql.org/message-id/143ed9a4-6038-76d4-9a55-502035815e68%40lab.ntt.co.jp

Also, while I understand Alvaro's and your comments on the other thread
that the memory handling is messy as is, it's not clear to me why you say
this patch makes it messier.  It reduces the number of memory context
switches, so that RelationBuildPartitionDesc roughly looks like this after
the patch (a rough code sketch follows the list):

Start with CurrentMemoryContext...

1. read catalogs and make bounddescs and oids arrays

2. partition_bounds_create(...)

3. create and switch to rd_pdcxt

4. create PartitionDesc, copy partdesc->oids and partdesc->boundinfo

5. switch back to the old context
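
To make that concrete, here is a rough sketch in code of that flow
(heavily simplified: collect_partition_bound_specs() is a made-up
placeholder for the catalog scan in step 1, and the reordering of oids[]
by the mapping returned from partition_bounds_create() is omitted):

#include "postgres.h"

#include "nodes/parsenodes.h"
#include "partitioning/partbounds.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
#include "utils/rel.h"

/* placeholder for step 1; not a real backend function */
extern int collect_partition_bound_specs(Relation rel,
                                         PartitionBoundSpec ***boundspecs,
                                         Oid **oids);

static void
RelationBuildPartitionDesc_sketch(Relation rel)
{
    PartitionBoundSpec **boundspecs;
    Oid        *oids;
    int         nparts;
    int        *mapping;
    PartitionBoundInfo boundinfo;
    PartitionDesc partdesc;
    MemoryContext oldcxt;

    /* 1. Scan the catalogs; boundspecs[] and oids[] are temporary data
     *    living in CurrentMemoryContext. */
    nparts = collect_partition_bound_specs(rel, &boundspecs, &oids);

    /* 2. Canonicalize the bounds; the result also lives in
     *    CurrentMemoryContext at this point. */
    boundinfo = partition_bounds_create(boundspecs, nparts,
                                        RelationGetPartitionKey(rel),
                                        &mapping);

    /* 3. Create rd_pdcxt and switch into it. */
    rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
                                          "partition descriptor",
                                          ALLOCSET_SMALL_SIZES);
    oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);

    /* 4. Build the PartitionDesc, copying oids[] and boundinfo into
     *    rd_pdcxt. */
    partdesc = (PartitionDesc) palloc0(sizeof(PartitionDescData));
    partdesc->nparts = nparts;
    partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
    memcpy(partdesc->oids, oids, nparts * sizeof(Oid));
    partdesc->boundinfo = partition_bounds_copy(boundinfo,
                                                RelationGetPartitionKey(rel));
    rel->rd_partdesc = partdesc;

    /* 5. Switch back; the temporaries from steps 1 and 2 are left in the
     *    caller's context and freed along with it. */
    MemoryContextSwitchTo(oldcxt);
}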

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
> > I've fixed 0001 again to re-order the code so that allocations happen in
> > the correct context, and the tests now pass with the rebased patches.
>
> I have been looking at 0001, and it seems to me that it makes the current
> situation even messier.  Coming to my point: do we actually have any need
> to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation has no
> partitions?  It seems to me that we had better set rd_pdcxt and
> rd_partdesc to NULL in this case.

I think that's unrelated to this patch, as Amit also says, but I have
to say that the last few hunks of the rebased version of this patch do
not make a lot of sense to me.  This patch is supposed to be reducing
list construction, and the original version did that, but the rebased
version adds a partition_bounds_copy() operation, whereas my version
did not add any expensive operations - it only removed some cost.  I
don't see why anything I changed should necessitate such a change, nor
does it seem like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/15 22:57, Robert Haas wrote:
> On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
>>> I've fixed 0001 again to re-order the code so that allocations happen in
>>> the correct context, and the tests now pass with the rebased patches.
>>
>> I have been looking at 0001, and it seems to me that it makes the current
>> situation even messier.  Coming to my point: do we actually have any need
>> to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation has no
>> partitions?  It seems to me that we had better set rd_pdcxt and
>> rd_partdesc to NULL in this case.
> 
> I think that's unrelated to this patch, as Amit also says, but I have
> to say that the last few hunks of the rebased version of this patch do
> not make a lot of sense to me.  This patch is supposed to be reducing
> list construction, and the original version did that, but the rebased
> version adds a partition_bounds_copy() operation, whereas my version
> did not add any expensive operations - it only removed some cost.  I
> don't see why anything I changed should necessitate such a change, nor
> does it seem like a good idea.

The partition_bounds_copy() is not because of your changes, it's there in
HEAD.  The reason we do that is that partition_bounds_create() allocates
the memory for the PartitionBoundInfo it returns, along with other
temporary allocations, in CurrentMemoryContext.  But we'd need to copy it
into rd_pdcxt before making it a property of rd_partdesc, hence the
partition_bounds_copy().

Maybe partition_bounds_create() should have had a MemoryContext argument
specifying the context in which to create the PartitionBoundInfo.  That
way, we could simply pass rd_pdcxt to it and avoid making a copy.  As is,
we now allocate two copies of the PartitionBoundInfo, one in
CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
would only allocate the latter.  Maybe we should fix that, since it is a
regression.
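
For illustration, a minimal sketch of that interface, written here as a
wrapper rather than as the actual change to partition_bounds_create()
(partition_bounds_create_in() is an invented name; note that a plain
wrapper like this would also put the temporary allocations into the target
context, which the real change would presumably avoid by allocating only
the result there):

#include "postgres.h"

#include "nodes/parsenodes.h"
#include "partitioning/partbounds.h"
#include "utils/memutils.h"
#include "utils/partcache.h"

/*
 * Build the canonical PartitionBoundInfo in a caller-supplied context, so
 * that RelationBuildPartitionDesc could pass rel->rd_pdcxt and skip the
 * extra partition_bounds_copy().
 */
static PartitionBoundInfo
partition_bounds_create_in(MemoryContext targetcxt,
                           PartitionBoundSpec **boundspecs, int nparts,
                           PartitionKey key, int **mapping)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(targetcxt);
    PartitionBoundInfo boundinfo;

    boundinfo = partition_bounds_create(boundspecs, nparts, key, mapping);

    MemoryContextSwitchTo(oldcxt);
    return boundinfo;
}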

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote:
> Maybe partition_bounds_create() should have had a MemoryContext argument
> specifying the context in which to create the PartitionBoundInfo.  That
> way, we could simply pass rd_pdcxt to it and avoid making a copy.  As is,
> we now allocate two copies of the PartitionBoundInfo, one in
> CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
> would only allocate the latter.  Maybe we should fix that, since it is a
> regression.

Not sure about what you mean by regression here, but passing the memory
context as an argument makes sense, as it lets you remove the extra
partition bound copy, just as it makes sense to use an array instead of a
list for performance, which may matter when many partitions are handled
while building the cache.  So cleaning up both things at the same time
would be nice.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Fri, Nov 16, 2018 at 1:00 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote:
> > Maybe partition_bounds_create() should have had a MemoryContext argument
> > specifying the context in which to create the PartitionBoundInfo.  That
> > way, we could simply pass rd_pdcxt to it and avoid making a copy.  As is,
> > we now allocate two copies of the PartitionBoundInfo, one in
> > CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
> > would only allocate the latter.  Maybe we should fix that, since it is a
> > regression.
>
> Not sure about what you mean by regression here,

The regression is, as I mentioned, that the new code allocates two
copies of PartitionBoundInfo whereas only one would be allocated
before.

> but passing the memory context as an argument makes sense, as it lets you
> remove the extra partition bound copy, just as it makes sense to use an
> array instead of a list for performance, which may matter when many
> partitions are handled while building the cache.  So cleaning up both
> things at the same time would be nice.

Maybe the patch to add the memory context argument to
partition_bounds_create() and other related static functions in
partbounds.c should be its own patch, as that seems to be a separate
issue.  OTOH, other changes needed to implement Robert's proposal of
using PartitionBoundSpec and Oid arrays instead of the existing lists
should be in the same patch.

Thanks,
Amit


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Nov 15, 2018 at 8:58 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The partition_bounds_copy() is not because of your changes, it's there in
> HEAD.

OK, but it seems to me that your version of my patch rearranges the
code more than necessary.

How about the attached?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
> OK, but it seems to me that your version of my patch rearranges the
> code more than necessary.
>
> How about the attached?

What you are proposing here looks good to me.  Thanks!
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On 2018/11/17 9:06, Michael Paquier wrote:
> On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
>> OK, but it seems to me that your version of my patch rearranges the
>> code more than necessary.
>>
>> How about the attached?
> 
> What you are proposing here looks good to me.  Thanks!

Me too, now that I've looked at the patch closely.  The errors I'd seen in
the regression tests were due to an uninitialized oids variable, which is
fixed in the later patches, not due to the "confused memory context
switching" I'd cited [1] and made the reason for additional changes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/1be8055c-137b-5639-9bcf-8a2d5fef6e5a%40lab.ntt.co.jp



Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Sun, Nov 18, 2018 at 9:43 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/17 9:06, Michael Paquier wrote:
> > On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
> >> OK, but it seems to me that your version of my patch rearranges the
> >> code more than necessary.
> >>
> >> How about the attached?
> >
> > What you are proposing here looks good to me.  Thanks!
>
> Me too, now that I've looked at the patch closely.  The errors I'd seen in
> the regression tests were due to an uninitialized oids variable, which is
> fixed in the later patches, not due to the "confused memory context
> switching" I'd cited [1] and made the reason for additional changes.

OK.  Rebased again, and committed (although I forgot to include a link
to this discussion - sorry about that).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Wed, Nov 14, 2018 at 9:03 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/15 4:27, Robert Haas wrote:
> > RelationBuildPartitionDesc doesn't lock the children
> > whose relpartbounds it is fetching (!), so unless we're guaranteed to
> > have already locked them children earlier for some other reason, we
> > could grab the partition bound at this point and then it could change
> > again before we get a lock on them.
>
> Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a
> partition before fetching its relpartbound, because the latter can't
> change if the caller is holding a lock on the parent, which it must be if
> we're in RelationBuildPartitionDesc for parent at all.  Am I missing
> something?

After thinking about this for a bit, I think that right now it's fine,
because you can't create or drop or attach or detach a partition
without holding AccessExclusiveLock on both the parent and the child,
so if you hold even AccessShareLock on the parent, the child's
relpartbound can't be changing.  However, what we want to do is get
the lock level on the parent down to ShareUpdateExclusiveLock, at
which point the child's relpartbound could indeed change under us.  I
think, however, that what I previously posted as 0002 is sufficient to
fix that part of the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Sergei Kornilov
Date:
Hello

> OK. Rebased again, and committed (although I forgot to include a link
> to this discussion - sorry about that).

Seems we erroneously moved this thread to next CF: https://commitfest.postgresql.org/21/1842/
Can you close this entry?

regards, Sergei


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote:
> Seems we erroneously moved this thread to next CF:
> https://commitfest.postgresql.org/21/1842/
> Can you close this entry?

Robert has committed a patch in 7ee5f88e to refactor the list construction
in RelationBuildPartitionDesc a bit, but the main patch has not been
committed, so the current status looks right to me.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Sun, Dec 16, 2018 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote:
> > Seems we erroneously moved this thread to next CF:
> > https://commitfest.postgresql.org/21/1842/
> > Can you close this entry?
>
> Robert has committed a patch in 7ee5f88e to refactor the list construction
> in RelationBuildPartitionDesc a bit, but the main patch has not been
> committed, so the current status looks right to me.

I have done a bit more work on this, but need to spend some more time
on it before I have something that is worth posting.  Not sure whether
I'll get to that before the New Year at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Dec-17, Robert Haas wrote:

> I have done a bit more work on this, but need to spend some more time
> on it before I have something that is worth posting.  Not sure whether
> I'll get to that before the New Year at this point.

This patch missing the CF deadline would not be a happy way for me to
begin the new year.

I'm not sure what's the best way to move forward with this patch, but I
encourage you to post whatever version you have before the deadline,
even if you're not fully happy with it (and, heck, even if it doesn't
compile and/or is full of FIXME or TODO comments).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Mon, Dec 17, 2018 at 06:52:51PM -0300, Alvaro Herrera wrote:
> On 2018-Dec-17, Robert Haas wrote:
> This patch missing the CF deadline would not be a happy way for me to
> begin the new year.
>
> I'm not sure what's the best way to move forward with this patch, but I
> encourage you to post whatever version you have before the deadline,
> even if you're not fully happy with it (and, heck, even if it doesn't
> compile and/or is full of FIXME or TODO comments).

Agreed.  This patch has value, and somebody else could always take it
from the point where you were.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Mon, Dec 17, 2018 at 6:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> Agreed.  This patch has value, and somebody else could always take it
> from the point where you were.

OK.  I'll post what I have by the end of the week.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> OK.  I'll post what I have by the end of the week.

Thanks, Robert.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> > OK.  I'll post what I have by the end of the week.
>
> Thanks, Robert.

OK, so I got slightly delayed here by utterly destroying my laptop,
but I've mostly reconstructed what I did.  I think there are some
remaining problems, but this seems like a good time to share what I've
got so far.  I'm attaching three patches.

0001 is one which I posted before.  It attempts to fix up
RelationBuildPartitionDesc() so that this function will always return
a partition descriptor based on a consistent snapshot of the catalogs.
Without this, I think there's nothing to prevent a commit which
happens while the function is running from causing the function to
fail or produce nonsense answers.

0002 introduces the concept of a partition directory.  The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup().  Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer.  I
believe this patch still has a number of weaknesses.  More on that
below.
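
A minimal sketch of the shape I have in mind (the function names follow
the description above, but the data structure here is invented for
illustration and glosses over the hard part, namely keeping the returned
PartitionDesc alive for the directory's lifetime):

#include "postgres.h"

#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
#include "utils/rel.h"

typedef struct PartitionDirectoryEntry
{
    Oid         reloid;         /* hash key */
    PartitionDesc pd;           /* descriptor remembered at first lookup */
} PartitionDirectoryEntry;

typedef struct PartitionDirectoryData
{
    MemoryContext pdir_mcxt;
    HTAB       *pdir_hash;
} PartitionDirectoryData;

typedef PartitionDirectoryData *PartitionDirectory;

PartitionDirectory
CreatePartitionDirectory(MemoryContext mcxt)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(mcxt);
    PartitionDirectory pdir = palloc(sizeof(PartitionDirectoryData));
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(HASHCTL));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(PartitionDirectoryEntry);
    ctl.hcxt = mcxt;

    pdir->pdir_mcxt = mcxt;
    pdir->pdir_hash = hash_create("partition directory", 16, &ctl,
                                  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    MemoryContextSwitchTo(oldcxt);
    return pdir;
}

PartitionDesc
PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    bool        found;
    PartitionDirectoryEntry *pde;

    pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
    if (!found)
    {
        /*
         * First lookup for this relation within this directory: remember
         * the current descriptor.  The patch must also ensure that this
         * PartitionDesc is not freed while the directory still points at
         * it (see the memory-reclamation discussion below).
         */
        pde->pd = RelationGetPartitionDesc(rel);
    }
    return pde->pd;
}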

0003 actually lowers the lock level.  The comment here might need some
more work.

Here is a list of possible or definite problems that are known to me:

- I think we need a way to make partition directory lookups consistent
across backends in the case of parallel query.  I believe this can be
done with a dshash and some serialization and deserialization logic,
but I haven't attempted that yet.

- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock.  There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked.  Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does (a small sketch of that appears after
this list).  It also loses the infinite-loop
protection which find_all_inheritors() has.  Not sure what to do about
that.

- In order for the new PartitionDirectory machinery to avoid
use-after-free bugs, we have to either copy the PartitionDesc out of
the relcache into the partition directory or avoid freeing it while it
is still in use.  Copying it seems unappealing for performance
reasons, so I took the latter approach.  However, what I did here in
terms of reclaiming memory is just about the least aggressive strategy
short of leaking it altogether - it just keeps it around until the
next rebuild that occurs while the relcache entry is not in use.  We
might want to do better, e.g. freeing old copies immediately as soon
as the relcache reference count drops to 0. I just did it this way
because it was simple to code.

- I tried this with Alvaro's isolation tests and it fails some tests.
That's because Alvaro's tests expect that the list of accessible
partitions is based on the query snapshot.  For reasons I attempted to
explain in the comments in 0003, I think the idea that we can choose
the set of accessible partitions based on the query snapshot is very
wrong.  For example, suppose transaction 1 begins, reads an unrelated
table to establish a snapshot, and then goes idle.  Then transaction 2
comes along, detaches a partition from an important table, and then
does crazy stuff with that table -- changes the column list, drops it,
reattaches it with different bounds, whatever.  Then it commits.  If
transaction 1 now comes along and uses the query snapshot to decide
that the detached partition ought to still be seen as a partition of
that partitioned table, disaster will ensue.

- I don't have any tests here, but I think it would be good to add
some, perhaps modeled on Alvaro's, and also some that involve multiple
ATTACH and DETACH operations mixed with other interesting kinds of
DDL.  I also didn't make any attempt to update the documentation,
which is another thing that will probably have to be done at some
point. Not sure how much documentation we have of any of this now.

- I am uncertain whether the logic that builds up the partition
constraint is really safe in the face of concurrent DDL.  I kinda
suspect there are some problems there, but maybe not.  Once you hold
ANY lock on a partition, the partition constraint can't concurrently
become stricter, because no ATTACH can be done without
AccessExclusiveLock on the partition being attached; but it could
concurrently become less strict, because you only need a lesser lock
for a detach.  Not sure if/how that could foul things up for this logic.

- I have not done anything about the fact that index detach takes
AccessExclusiveLock.
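
Regarding the locking-order point in the expand_inherited_rtentry() item
above, a minimal sketch of what sorting the OIDs before locking might look
like (lock_partitions_in_oid_order() and its caller are assumptions for
this example; find_inheritance_children() uses an equivalent comparator):

#include "postgres.h"

#include "storage/lmgr.h"
#include "utils/partcache.h"

static int
cmp_oid(const void *a, const void *b)
{
    Oid         oa = *(const Oid *) a;
    Oid         ob = *(const Oid *) b;

    if (oa < ob)
        return -1;
    if (oa > ob)
        return 1;
    return 0;
}

/* Lock the partitions in partdesc in a consistent (ascending OID) order. */
static void
lock_partitions_in_oid_order(PartitionDesc partdesc, LOCKMODE lockmode)
{
    Oid        *lockoids;
    int         i;

    lockoids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
    memcpy(lockoids, partdesc->oids, partdesc->nparts * sizeof(Oid));
    qsort(lockoids, partdesc->nparts, sizeof(Oid), cmp_oid);

    for (i = 0; i < partdesc->nparts; i++)
        LockRelationOid(lockoids[i], lockmode);

    pfree(lockoids);
}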

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Thanks for this work!  I like the name "partition directory".

On 2018-Dec-20, Robert Haas wrote:

> 0002 introduces the concept of a partition directory.  The idea is
> that the planner will create a partition directory, and so will the
> executor, and all calls which occur in those places to
> RelationGetPartitionDesc() will instead call
> PartitionDirectoryLookup().  Every lookup for the same relation in the
> same partition directory is guaranteed to produce the same answer.  I
> believe this patch still has a number of weaknesses.  More on that
> below.

The commit message for this one also points out another potential
problem,

> Introduce the concept of a partition directory.
>
> Teach the optimizer and executor to use it, so that a single planning
> cycle or query execution gets the same PartitionDesc for the same table
> every time it looks it up.  This does not prevent changes between
> planning and execution, nor does it guarantee that all tables are
> expanded according to the same snapshot.

Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions?  Array
indexes of any partitions that appear later in the partition descriptor
will change.

This is the reason I used the query snapshot rather than EState.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Introduce the concept of a partition directory.
> >
> > Teach the optimizer and executor to use it, so that a single planning
> > cycle or query execution gets the same PartitionDesc for the same table
> > every time it looks it up.  This does not prevent changes between
> > planning and execution, nor does it guarantee that all tables are
> > expanded according to the same snapshot.
>
> Namely: how does this handle the case of partition pruning structure
> being passed from planner to executor, if an attach happens in the
> middle of it and puts a partition in between existing partitions?  Array
> indexes of any partitions that appear later in the partition descriptor
> will change.
>
> This is the reason I used the query snapshot rather than EState.

I didn't handle that.  If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this?  It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem.  But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time.  Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.

Maybe I'm all wet, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2018-Dec-20, Robert Haas wrote:

> I didn't handle that.  If partition pruning relies on nothing changing
> between planning and execution, isn't that broken regardless of any of
> this?  It's true that with the simple query protocol we'll hold locks
> continuously from planning into execution, and therefore with the
> current locking regime we couldn't really have a problem.  But unless
> I'm confused, with the extended query protocol it's quite possible to
> generate a plan, release locks, and then reacquire locks at execution
> time.  Unless we have some guarantee that a new plan will always be
> generated if any DDL has happened in the middle, I think we've got
> trouble, and I don't think that is guaranteed in all cases.

Oh, so maybe this case is already handled by plan invalidation -- I
mean, if we run DDL, the stored plan is thrown away and a new one
recomputed.  IOW this was already a solved problem and I didn't need to
spend effort on it. /me slaps own forehead

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Oh, so maybe this case is already handled by plan invalidation -- I
> mean, if we run DDL, the stored plan is thrown away and a new one
> recomputed.  IOW this was already a solved problem and I didn't need to
> spend effort on it. /me slaps own forehead

I'm kinda saying the opposite - I'm not sure that it's safe even with
the higher lock levels.  If the plan is relying on the same partition
descriptor being in effect at plan time as at execution time, that
sounds kinda dangerous to me.

Lowering the lock level might also make something that was previously
safe into something unsafe, because now there's no longer a guarantee
that invalidation messages are received soon enough. With
AccessExclusiveLock, we'll send invalidation messages before releasing
the lock, and other processes will acquire the lock and then
AcceptInvalidationMessages().  But with ShareUpdateExclusiveLock the
locks can coexist, so now there might be trouble.  I think this is an
area where we need to do some more investigation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Dec 20, 2018 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Lowering the lock level might also make something that was previously
> safe into something unsafe, because now there's no longer a guarantee
> that invalidation messages are received soon enough. With
> AccessExclusiveLock, we'll send invalidation messages before releasing
> the lock, and other processes will acquire the lock and then
> AcceptInvalidationMessages().  But with ShareUpdateExclusiveLock the
> locks can coexist, so now there might be trouble.  I think this is an
> area where we need to do some more investigation.

So there are definitely problems here.  With my patch:

create table tab (a int, b text) partition by range (a);
create table tab1 partition of tab for values from (0) to (10);
prepare t as select * from tab;
begin;
explain execute t; -- seq scan on tab1
execute t; -- no rows

Then, in another session:

alter table tab detach partition tab1;
insert into tab1 values (300, 'oops');

Back to the first session:

execute t; -- shows (300, 'oops')
explain execute t; -- still planning to scan tab1
commit;
explain execute t; -- now it got the memo, and plans to scan nothing
execute t; -- no rows

Well, that's not good.  We're showing a value that was never within
the partition bounds of any partition of tab.  The problem is that,
since we already have locks on all relevant objects, nothing triggers
the second 'explain execute' to process invalidation messages, so we
don't update the plan.  Generally, any DDL with less than
AccessExclusiveLock has this issue.  On another thread, I was
discussing with Tom and Peter the possibility of trying to rejigger
things so that we always AcceptInvalidationMessages() at least once
per command, but I think that just turns this into a race: if a
concurrent commit happens after 'explain execute t' decides not to
re-plan but before it begins executing, we have the same problem.

This example doesn't involve partition pruning, and in general I don't
think that the problem is confined to partition pruning.  It's rather
that if there's no conflict between the lock that is needed to change
the set of partitions and the lock that is needed to run a query, then
there's no way to guarantee that the query runs with the same set of
partitions for which it was planned. Unless I'm mistaken, which I
might be, this is also a problem with your approach -- if you repeat
the same prepared query in the same transaction, the transaction
snapshot will be updated, and thus the PartitionDesc will be expanded
differently at execution time, but the plan will not have changed,
because invalidation messages have not been processed.

Anyway, I think the only fix here is likely to be making the executor
resilient against concurrent changes in the PartitionDesc. I don't
think there's going to be any easy way to compensate for added
partitions without re-planning, but maybe we could find a way to flag
detached partitions so that they return no rows without actually
touching the underlying relation.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote:
> - I refactored expand_inherited_rtentry() to drive partition expansion
> entirely off of PartitionDescs. The reason why this is necessary is
> that it clearly will not work to have find_all_inheritors() use a
> current snapshot to decide what children we have and lock them, and
> then consult a different source of truth to decide which relations to
> open with NoLock.  There's nothing to keep the lists of partitions
> from being different in the two cases, and that demonstrably causes
> assertion failures if you SELECT with an ATTACH/DETACH loop running in
> the background. However, it also changes the order in which tables get
> locked.  Possibly that could be fixed by teaching
> expand_partitioned_rtentry() to qsort() the OIDs the way
> find_inheritance_children() does.  It also loses the infinite-loop
> protection which find_all_inheritors() has.  Not sure what to do about
> that.

I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent.  Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.

[1] https://commitfest.postgresql.org/21/1778/
[2] https://commitfest.postgresql.org/21/1887/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Fri, 21 Dec 2018 at 10:05, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Namely: how does this handle the case of partition pruning structure
> > being passed from planner to executor, if an attach happens in the
> > middle of it and puts a partition in between existing partitions?  Array
> > indexes of any partitions that appear later in the partition descriptor
> > will change.
> >
> > This is the reason I used the query snapshot rather than EState.
>
> I didn't handle that.  If partition pruning relies on nothing changing
> between planning and execution, isn't that broken regardless of any of
> this?  It's true that with the simple query protocol we'll hold locks
> continuously from planning into execution, and therefore with the
> current locking regime we couldn't really have a problem.  But unless
> I'm confused, with the extended query protocol it's quite possible to
> generate a plan, release locks, and then reacquire locks at execution
> time.  Unless we have some guarantee that a new plan will always be
> generated if any DDL has happened in the middle, I think we've got
> trouble, and I don't think that is guaranteed in all cases.

Today the plan would be invalidated if a partition was ATTACHED or
DETACHED. The newly built plan would get the updated list of
partitions.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I don't think you need to qsort() the Oids before locking. What the
> qsort() does today is ensure we get a consistent locking order. Any
> other order would surely do, providing we stick to it consistently. I
> think PartitionDesc order is fine, as it's consistent.  Having it
> locked in PartitionDesc order I think is what's needed for [1] anyway.
> [2] proposes to relax the locking order taken during execution.

If queries take locks in one order and DDL takes them in some other
order, queries and DDL starting around the same time could deadlock.
Unless we convert the whole system to lock everything in PartitionDesc
order the issue doesn't go away completely. But maybe we just have to
live with that. Surely we're not going to pay the cost of locking
partitions that we don't otherwise need to avoid a deadlock-vs-DDL
risk, and once we've decided to assume that risk, I'm not sure a
qsort() here helps anything much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Dec 21, 2018 at 6:06 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> > I didn't handle that.  If partition pruning relies on nothing changing
> > between planning and execution, isn't that broken regardless of any of
> > this?  It's true that with the simple query protocol we'll hold locks
> > continuously from planning into execution, and therefore with the
> > current locking regime we couldn't really have a problem.  But unless
> > I'm confused, with the extended query protocol it's quite possible to
> > generate a plan, release locks, and then reacquire locks at execution
> > time.  Unless we have some guarantee that a new plan will always be
> > generated if any DDL has happened in the middle, I think we've got
> > trouble, and I don't think that is guaranteed in all cases.
>
> Today the plan would be invalidated if a partition was ATTACHED or
> DETACHED. The newly built plan would get the updated list of
> partitions.

I think you're right, and that won't be true any more once we lower
the lock level, so it has to be handled somehow. The entire plan
invalidation mechanism seems to depend fundamentally on
AccessExclusiveLock being used everywhere, so this is likely to be an
ongoing issue every time we want to reduce a lock level anywhere.  I
wonder if there is any kind of systematic fix or if we are just going
to have to keep inventing ad-hoc solutions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Tue, 25 Dec 2018 at 08:15, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 21, 2018 at 6:04 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > I don't think you need to qsort() the Oids before locking. What the
> > qsort() does today is ensure we get a consistent locking order. Any
> > other order would surely do, providing we stick to it consistently. I
> > think PartitionDesc order is fine, as it's consistent.  Having it
> > locked in PartitionDesc order I think is what's needed for [1] anyway.
> > [2] proposes to relax the locking order taken during execution.
>
> If queries take locks in one order and DDL takes them in some other
> order, queries and DDL starting around the same time could deadlock.
> Unless we convert the whole system to lock everything in PartitionDesc
> order the issue doesn't go away completely. But maybe we just have to
> live with that. Surely we're not going to pay the cost of locking
> partitions that we don't otherwise need to avoid a deadlock-vs-DDL
> risk, and once we've decided to assume that risk, I'm not sure a
> qsort() here helps anything much.

When I said "consistent" I meant consistent over all places where we
obtain locks on all partitions. My original v1-0002 patch attempted
something like this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Namely: how does this handle the case of partition pruning structure
> being passed from planner to executor, if an attach happens in the
> middle of it and puts a partition in between existing partitions?  Array
> indexes of any partitions that appear later in the partition descriptor
> will change.

I've finally gotten a little more time to work on this.  It took me a
while to understand that a PartitionedRelPruneInfo assumes that the
indexes of partitions in the PartitionDesc don't change between
planning and execution, because subplan_map[] and subpart_map[] are
indexed by PartitionDesc offset.  I suppose the reason for this is so
that we don't have to go to the expense of copying the partition
bounds from the PartitionDesc into the final plan, but it somehow
seems a little scary to me.  Perhaps I am too easily frightened, but
it's certainly a problem from the point of view of this project, which
wants to let the PartitionDesc change concurrently.

I wrote a little patch that stores the relation OIDs of the partitions
into the PartitionedRelPruneInfo and then, at execution time, does an
Assert() that what it gets matches what existed at plan time.  I
figured that a good start would be to find a test case where this
fails with concurrent DDL allowed, but I haven't so far succeeded in
devising one.  To make the Assert() fail, I need to come up with a
case where concurrent DDL has caused the PartitionDesc to be rebuilt
but without causing an update to the plan.  If I use prepared queries
inside of a transaction block, I can continue to run old plans after
concurrent DDL has changed things, but I can't actually make the
Assert() fail, because the queries continue to use the old plans
precisely because they haven't processed invalidation messages, and
therefore they also have the old PartitionDesc and everything works.
Maybe if I try it with CLOBBER_CACHE_ALWAYS...
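
Roughly, the cross-check amounts to something like this at executor
startup (a sketch only; plan_time_nparts and plan_time_relids stand for
whatever the little patch actually stores in the PartitionedRelPruneInfo):

#include "postgres.h"

#include "utils/partcache.h"
#include "utils/rel.h"

/*
 * Verify that the partitions we see now match, one for one, the partitions
 * that existed when the plan was built.
 */
static void
check_partdesc_unchanged(Relation parentrel, int plan_time_nparts,
                         Oid *plan_time_relids)
{
    PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
    int         i;

    Assert(partdesc->nparts == plan_time_nparts);
    for (i = 0; i < partdesc->nparts; i++)
        Assert(partdesc->oids[i] == plan_time_relids[i]);
}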

I also had the idea of trying to use a cursor, because if I could
start execution of a query, then force a relcache rebuild, then
continue executing the query, maybe something would blow up somehow.
But that's not so easy, because I don't think we have any way using SQL
to declare a cursor for a prepared query, so I would need to get a
query plan that involves run-time pruning without using parameters,
which I'm pretty sure is possible but I haven't figured it out yet.
And even there the PartitionDirectory concept might preserve us from
any damage if the change happens after the executor is initialized,
though I'm not sure if there are any cases where we don't do the first
PartitionDesc lookup for a particular table until mid-execution.

Anyway, I think this idea of passing a list of relation OIDs that we
saw at planning time through to the executor and cross-checking them
might have some legs.  If we only allowed concurrently *adding*
partitions and not concurrently *removing* them, then even if we find
the case(s) where the PartitionDesc can change under us, we can
probably just adjust subplan_map and subpart_map to compensate, since
we can iterate through the old and new arrays of relation OIDs and
just figure out which things have shifted to higher indexes in the
PartitionDesc.  This is all kind of hand-waving at the moment; tips
appreciated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2019-Jan-25, Robert Haas wrote:

> I've finally gotten a little more time to work on this.  It took me a
> while to understand that a PartitionedRelPruneInfo assumes that the
> indexes of partitions in the PartitionDesc don't change between
> planning and execution, because subplan_map[] and subpart_map[] are
> indexed by PartitionDesc offset.

Right, the planner/executor "disconnect" is one of the challenges, and
why I was trying to keep the old copy of the PartitionDesc around
instead of building updated ones as needed.

> I suppose the reason for this is so
> that we don't have to go to the expense of copying the partition
> bounds from the PartitionDesc into the final plan, but it somehow
> seems a little scary to me.  Perhaps I am too easily frightened, but
> it's certainly a problem from the point of view of this project, which
> wants to let the PartitionDesc change concurrently.

Well, my definition of the problem started with the assumption that we
would keep the partition array indexes unchanged, so "change
concurrently" is what we needed to avoid.  Yes, I realize that you've
opted to change that definition.

I may have forgotten some of your earlier emails on this, but one aspect
(possibly a key one) is that I'm not sure we really need to cope, other
than with an ERROR, with queries that continue to run across an
attach/detach -- moreso in absurd scenarios such as the ones you
described where the detached table is later re-attached, possibly to a
different partitioned table.  I mean, if we can just detect the case and
raise an error, and this lets us make it all work reasonably, that might
be better.

> I wrote a little patch that stores the relation OIDs of the partitions
> into the PartitionedRelPruneInfo and then, at execution time, does an
> Assert() that what it gets matches what existed at plan time.  I
> figured that a good start would be to find a test case where this
> fails with concurrent DDL allowed, but I haven't so far succeeded in
> devising one.  To make the Assert() fail, I need to come up with a
> case where concurrent DDL has caused the PartitionDesc to be rebuilt
> but without causing an update to the plan.  If I use prepared queries
> inside of a transaction block, [...]

> I also had the idea of trying to use a cursor, because if I could
> start execution of a query, [...]

Those are the ways I thought of, and the reason for the shape of some of
those .spec tests.  I wasn't able to hit the situation.

> Maybe if I try it with CLOBBER_CACHE_ALWAYS...

I didn't try this one.

> Anyway, I think this idea of passing a list of relation OIDs that we
> saw at planning time through to the executor and cross-checking them
> might have some legs.  If we only allowed concurrently *adding*
> partitions and not concurrently *removing* them, then even if we find
> the case(s) where the PartitionDesc can change under us, we can
> probably just adjust subplan_map and subpart_map to compensate, since
> we can iterate through the old and new arrays of relation OIDs and
> just figure out which things have shifted to higher indexes in the
> PartitionDesc.  This is all kind of hand-waving at the moment; tips
> appreciated.

I think detaching partitions concurrently is a necessary part of this
feature, so I would prefer not to go with a solution that works for
attaching partitions but not for detaching them.  That said, I don't see
why it's impossible to adjust the partition maps in both cases.  But I
don't have anything better than hand-waving ATM.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Right, the planner/executor "disconnect" is one of the challenges, and
> why I was trying to keep the old copy of the PartitionDesc around
> instead of building updated ones as needed.

I agree that would be simpler, but I just don't see how to make it
work.  For one thing, keeping the old copy around can't work in
parallel workers, which never had a copy in the first place.  For two
things, we don't have a really good mechanism to keep the
PartitionDesc that was used at plan time around until execution time.
Keeping the relation open would do it, but I'm pretty sure that causes
other problems; the system doesn't expect any residual references.

I know you had a solution to this problem, but I don't see how it can
work.  You said "Snapshots have their own cache (hash table) of
partition descriptors. If a partdesc is requested and the snapshot has
already obtained that partdesc, the original one is returned -- we
don't request a new one from partcache."  But presumably this means
when the last snapshot is unregistered, the cache is flushed
(otherwise, when?) and if that's so then this relies on the snapshot
that was used for planning still being around at execution time, which
I am pretty sure isn't guaranteed.

Also, and I think this point is worthy of some further discussion, the
thing that really seems broken to me about your design is the idea
that it's OK to use the query or transaction snapshot to decide which
partitions exist.  The problem with that is that some query or
transaction with an old snapshot might see as a partition some table
that has been dropped or radically altered - different column types,
attached to some other table now, attached to same table but with
different bounds, or just dropped.  And therefore it might try to
insert data into that table and fail in all kinds of crazy ways, about
the mildest of which is inserting data that doesn't match the current
partition constraint.  I'm willing to be told that I've misunderstood
the way it all works and this isn't really a problem for some reason,
but my current belief is that not only is it a problem with your
design, but that it's such a bad problem that there's really no way to
fix it and we have to abandon your entire approach and go a different
route.  If you don't think that's true, then perhaps we should discuss
it further.

> > I suppose the reason for this is so
> > that we don't have to go to the expense of copying the partition
> > bounds from the PartitionDesc into the final plan, but it somehow
> > seems a little scary to me.  Perhaps I am too easily frightened, but
> > it's certainly a problem from the point of view of this project, which
> > wants to let the PartitionDesc change concurrently.
>
> Well, my definition of the problem started with the assumption that we
> would keep the partition array indexes unchanged, so "change
> concurrently" is what we needed to avoid.  Yes, I realize that you've
> opted to change that definition.

I don't think I made a conscious decision to change this, and I'm kind
of wondering whether I have missed some better approach here.  I feel
like the direction I'm pursuing is an inevitable consequence of having
no good way to keep the PartitionDesc around from plan-time to
execution-time, which in turn feels like an inevitable consequence of
the two points I made above: there's no guarantee that the plan-time
snapshot is still registered anywhere by the time we get to execution
time, and even if there were, the associated PartitionDesc may point
to tables that have been drastically modified or don't exist any more.
But it's possible that my chain-of-inevitable-consequences has a weak
link, in which case I would surely like it if you (or someone else)
would point it out to me.

> I may have forgotten some of your earlier emails on this, but one aspect
> (possibly a key one) is that I'm not sure we really need to cope, other
> than with an ERROR, with queries that continue to run across an
> attach/detach -- moreso in absurd scenarios such as the ones you
> described where the detached table is later re-attached, possibly to a
> different partitioned table.  I mean, if we can just detect the case and
> raise an error, and this lets us make it all work reasonably, that might
> be better.

Well, that's an interesting idea.  I assumed that users would hate
that kind of behavior with a fiery passion that could never be
quenched.  If not, then the problem changes from *coping* with
concurrent changes to *detecting* concurrent changes, which may be
easier, but see below.

> I think detaching partitions concurrently is a necessary part of this
> feature, so I would prefer not to go with a solution that works for
> attaching partitions but not for detaching them.  That said, I don't see
> why it's impossible to adjust the partition maps in both cases.  But I
> don't have anything better than hand-waving ATM.

The general problem here goes back to what I wrote in the third
paragraph of this email: a PartitionDesc that was built with a
particular snapshot can't be assumed to be usable after any subsequent
DDL has occurred that might affect the shape of the PartitionDesc.
For example, if somebody detaches a partition and reattaches it with
different partition bounds, and we use the old PartitionDesc for tuple
routing, we'll route tuples into that partition that do not satisfy
its partition constraint.  And we won't even get an ERROR, because the
system assumes that any tuple which arrives at a partition as a result
of tuple routing must necessarily satisfy the partition constraint.

If only concurrent CREATE/ATTACH operations are allowed and
DROP/DETACH is not, then that kind of thing isn't possible.  Any new
partitions which have shown up since the plan was created can just be
ignored, and the old ones must still have the same partition bounds
that they did before, and everything is fine.  Or, if we're OK with a
less-nice solution, we could just ERROR out when the number of
partitions have changed.  Some people will get errors they don't like,
but they won't end up with rows in their partitions that violate the
constraints.

But as soon as you allow concurrent DETACH, then things get really
crazy.  Even if, at execution time, there are the same number of
partitions as I had at plan time, and even if those partitions have
the same OIDs as what I had at plan time, and even if those OIDs are
in the same order in the PartitionDesc, it does not prove that things
are OK.  The partition could have been detached and reattached with a
narrower set of partition bounds.  And if so, then we might route a
tuple to it that doesn't fit that narrower set of bounds, and there
will be no error, just database corruption.

I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached.  At plan time, save the counter value; if it has changed at
execution time, ERROR.  If not, then you have only added partitions to
worry about, and that's an easier problem.
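
At executor startup that check might look something like this (everything
here is hypothetical -- no such counter exists today, and
get_rel_detach_count() and plan_detach_count are invented names):

#include "postgres.h"

#include "utils/lsyscache.h"

/* Hypothetical accessor for the proposed per-table detach counter. */
extern int64 get_rel_detach_count(Oid parentrelid);

static void
check_no_concurrent_detach(Oid parentrelid, int64 plan_detach_count)
{
    if (get_rel_detach_count(parentrelid) != plan_detach_count)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("partitions of \"%s\" were detached concurrently",
                        get_rel_name(parentrelid))));
}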

But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place.  If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent.  And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child.  Now you might say: OK, well, just
reduce that lock level, too.

But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release.  The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint.  Those two things have to be synchronized somehow.  You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching.  But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:

                /*
                 * Removing constraints can affect SELECTs that have been
                 * optimised assuming the constraint holds true.
                 */
            case AT_DropConstraint: /* as DROP INDEX */
            case AT_DropNotNull:    /* may change some SQL plans */
                cmd_lockmode = AccessExclusiveLock;
                break;

Dropping a partition implicitly involves dropping a constraint.  We
could gamble that the above has no consequences that are really
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me.  We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Simon Riggs
Date:
On Mon, 28 Jan 2019 at 20:15, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Right, the planner/executor "disconnect" is one of the challenges, and
> why I was trying to keep the old copy of the PartitionDesc around
> instead of building updated ones as needed.

I agree that would be simpler, but I just don't see how to make it
work.  For one thing, keeping the old copy around can't work in
parallel workers, which never had a copy in the first place.  For two
things, we don't have a really good mechanism to keep the
PartitionDesc that was used at plan time around until execution time.
Keeping the relation open would do it, but I'm pretty sure that causes
other problems; the system doesn't expect any residual references.

I know you had a solution to this problem, but I don't see how it can
work.  You said "Snapshots have their own cache (hash table) of
partition descriptors. If a partdesc is requested and the snapshot has
already obtained that partdesc, the original one is returned -- we
don't request a new one from partcache."  But presumably this means
when the last snapshot is unregistered, the cache is flushed
(otherwise, when?) and if that's so then this relies on the snapshot
that was used for planning still being around at execution time, which
I am pretty sure isn't guaranteed.

Also, and I think this point is worthy of some further discussion, the
thing that really seems broken to me about your design is the idea
that it's OK to use the query or transaction snapshot to decide which
partitions exist.  The problem with that is that some query or
transaction with an old snapshot might see as a partition some table
that has been dropped or radically altered - different column types,
attached to some other table now, attached to same table but with
different bounds, or just dropped.  And therefore it might try to
insert data into that table and fail in all kinds of crazy ways, about
the mildest of which is inserting data that doesn't match the current
partition constraint.  I'm willing to be told that I've misunderstood
the way it all works and this isn't really a problem for some reason,
but my current belief is that not only is it a problem with your
design, but that it's such a bad problem that there's really no way to
fix it and we have to abandon your entire approach and go a different
route.  If you don't think that's true, then perhaps we should discuss
it further.

> > I suppose the reason for this is so
> > that we don't have to go to the expense of copying the partition
> > bounds from the PartitionDesc into the final plan, but it somehow
> > seems a little scary to me.  Perhaps I am too easily frightened, but
> > it's certainly a problem from the point of view of this project, which
> > wants to let the PartitionDesc change concurrently.
>
> Well, my definition of the problem started with the assumption that we
> would keep the partition array indexes unchanged, so "change
> concurrently" is what we needed to avoid.  Yes, I realize that you've
> opted to change that definition.

I don't think I made a conscious decision to change this, and I'm kind
of wondering whether I have missed some better approach here.  I feel
like the direction I'm pursuing is an inevitable consequence of having
no good way to keep the PartitionDesc around from plan-time to
execution-time, which in turn feels like an inevitable consequence of
the two points I made above: there's no guarantee that the plan-time
snapshot is still registered anywhere by the time we get to execution
time, and even if there were, the associated PartitionDesc may point
to tables that have been drastically modified or don't exist any more.
But it's possible that my chain-of-inevitable-consequences has a weak
link, in which case I would surely like it if you (or someone else)
would point it out to me.

> I may have forgotten some of your earlier emails on this, but one aspect
> (possibly a key one) is that I'm not sure we really need to cope, other
> than with an ERROR, with queries that continue to run across an
> attach/detach -- moreso in absurd scenarios such as the ones you
> described where the detached table is later re-attached, possibly to a
> different partitioned table.  I mean, if we can just detect the case and
> raise an error, and this let us make it all work reasonably, that might
> be better.

Well, that's an interesting idea.  I assumed that users would hate
that kind of behavior with a fiery passion that could never be
quenched.  If not, then the problem changes from *coping* with
concurrent changes to *detecting* concurrent changes, which may be
easier, but see below.

> I think detaching partitions concurrently is a necessary part of this
> feature, so I would prefer not to go with a solution that works for
> attaching partitions but not for detaching them.  That said, I don't see
> why it's impossible to adjust the partition maps in both cases.  But I
> don't have anything better than hand-waving ATM.

The general problem here goes back to what I wrote in the third
paragraph of this email: a PartitionDesc that was built with a
particular snapshot can't be assumed to be usable after any subsequent
DDL has occurred that might affect the shape of the PartitionDesc.
For example, if somebody detaches a partition and reattaches it with
different partition bounds, and we use the old PartitionDesc for tuple
routing, we'll route tuples into that partition that do not satisfy
its partition constraint.  And we won't even get an ERROR, because the
system assumes that any tuple which arrives at a partition as a result
of tuple routing must necessarily satisfy the partition constraint.

If only concurrent CREATE/ATTACH operations are allowed and
DROP/DETACH is not, then that kind of thing isn't possible.  Any new
partitions which have shown up since the plan was created can just be
ignored, and the old ones must still have the same partition bounds
that they did before, and everything is fine.  Or, if we're OK with a
less-nice solution, we could just ERROR out when the number of
partitions have changed.  Some people will get errors they don't like,
but they won't end up with rows in their partitions that violate the
constraints.

But as soon as you allow concurrent DETACH, then things get really
crazy.  Even if, at execution time, there are the same number of
partitions as I had at plan time, and even if those partitions have
the same OIDs as what I had at plan time, and even if those OIDs are
in the same order in the PartitionDesc, it does not prove that things
are OK.  The partition could have been detached and reattached with a
narrower set of partition bounds.  And if so, then we might route a
tuple to it that doesn't fit that narrower set of bounds, and there
will be no error, just database corruption.

I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached.  At plan time, save the counter value; if it has changed at
execution time, ERROR.  If not, then you have only added partitions to
worry about, and that's an easier problem.

Yes, a version number would solve that issue.
 
But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place.  If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent.  And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child.  Now you might say: OK, well, just
reduce that lock level, too.
 
The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just marking it for later disuse.

But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release.  The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint. 

Spurious.

This would only be true if we were adding a constraint that affected existing partitions.

The constraint being added affects the newly added partition, not existing ones.
 
Those two things have to be synchronized somehow.  You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching.  But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:

                /*
                 * Removing constraints can affect SELECTs that have been
                 * optimised assuming the constraint holds true.
                 */
            case AT_DropConstraint: /* as DROP INDEX */
            case AT_DropNotNull:    /* may change some SQL plans */
                cmd_lockmode = AccessExclusiveLock;
                break;

Dropping a partition implicitly involves dropping a constraint.  We
could gamble that the above has no consequences that are really

It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.
 
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me.  We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.

 I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Jan 29, 2019 at 12:29 AM Simon Riggs <simon@2ndquadrant.com> wrote:
>> But I kind of wonder whether we're really gaining as much as you think
>> by trying to support concurrent DETACH in the first place.  If there
>> are queries running against the table, there's probably at least
>> AccessShareLock on the partition itself, not just the parent.  And
>> that means that reducing the necessary lock on the parent from
>> AccessExclusiveLock to something less doesn't really help that much,
>> because we're still going to block trying to acquire
>> AccessExclusiveLock on the child.  Now you might say: OK, well, just
>> reduce that lock level, too.
>
> The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just
> marking it for later disuse.
 

Well, I don't think that's the way any patch so far proposed actually works.

>> But that seems to me to be opening a giant can of worms which we are
>> unlikely to get resolved in time for this release.  The worst of those
>> problems is that if you also reduce the lock level on the partition
>> when attaching it, then you are adding a constraint while somebody
>> might at the exact same time be inserting a tuple that violates that
>> constraint.
>
> Spurious.
>
> This would only be true if we were adding a constraint that affected existing partitions.
>
> The constraint being added affects the newly added partition, not existing ones.

I agree that it affects the newly added partition, not existing ones.
But if you don't hold an AccessExclusiveLock on that partition while
you are adding that constraint to it, then somebody could be
concurrently inserting a tuple that violates that constraint.  This
would be an INSERT targeting the partition directly, not somebody
operating on the partitioning hierarchy to which it is being attached.
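
A sketch of the interleaving I have in mind, assuming a hypothetical
ATTACH that takes only a weak lock on the partition being attached (so
this is not something you can reproduce today):

create table parent (a int) partition by list (a);
create table child (a int);

-- session 1: starts attaching child for values in (1), holding only a
-- weak lock on child, so direct writes to child are not blocked
alter table parent attach partition child for values in (1);

-- session 2: meanwhile, inserts directly into child
insert into child values (2);

-- if the attach then completes without rechecking, child contains a
-- row that violates its brand-new partition constraint (a = 1)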

>> Those two things have to be synchronized somehow.  You
>> could avoid that by reducing the lock level on the partition when
>> detaching and not when attaching.  But even then, detaching a
>> partition can involve performing a whole bunch of operations for which
>> we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>>
>>                 /*
>>                  * Removing constraints can affect SELECTs that have been
>>                  * optimised assuming the constraint holds true.
>>                  */
>>             case AT_DropConstraint: /* as DROP INDEX */
>>             case AT_DropNotNull:    /* may change some SQL plans */
>>                 cmd_lockmode = AccessExclusiveLock;
>>                 break;
>>
>> Dropping a partition implicitly involves dropping a constraint.  We
>> could gamble that the above has no consequences that are really
>
> It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.

That's not true, but I can't refute your argument any more than that
because you haven't made one.

>  I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean
> they sound like they could apply but in fact do not.
 

I think they do apply, and until somebody explains convincingly why
they don't, I'm going to keep thinking that they do.   Telling me that
my points are wrong without making any kind of argument about why they
are wrong is not constructive.  I've put a lot of energy into
analyzing this topic, both recently and in previous release cycles,
and I'm not inclined to just say "OK, well, Simon says I'm wrong, so
that's the end of it."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I wrote a little patch that stores the relation OIDs of the partitions
> > into the PartitionedPruneRelInfo and then, at execution time, does an
> > Assert() that what it gets matches what existed at plan time.  I
> > figured that a good start would be to find a test case where this
> > fails with concurrent DDL allowed, but I haven't so far succeeded in
> > devising one.  To make the Assert() fail, I need to come up with a
> > case where concurrent DDL has caused the PartitionDesc to be rebuilt
> > but without causing an update to the plan.  If I use prepared queries
> > inside of a transaction block, [...]
>
> > I also had the idea of trying to use a cursor, because if I could
> > start execution of a query, [...]
>
> Those are the ways I thought of, and the reason for the shape of some of
> those .spec tests.  I wasn't able to hit the situation.

I've managed to come up with a test case that seems to hit this case.

Preparation:

create table foo (a int, b text, primary key (a)) partition by range (a);
create table foo1 partition of foo for values from (0) to (1000);
create table foo2 partition of foo for values from (1000) to (2000);
insert into foo1 values (1, 'one');
insert into foo2 values (1001, 'two');
alter system set plan_cache_mode = force_generic_plan;
select pg_reload_conf();

$ cat >x
alter table foo detach partition foo2;
alter table foo attach partition foo2 for values from (1000) to (2000);
^D

Window #1:

prepare foo as select * from foo where a = $1;
explain execute foo(1500);
\watch 0.01

Window #2:

$ pgbench -n -f x -T 60

Boom:

TRAP: FailedAssertion("!(partdesc->nparts == pinfo->nparts)", File:
"execPartition.c", Line: 1631)

I don't know how to reduce this to something reliable enough to
include it in the regression tests, and maybe we don't really need
that, but it's good to know that this is not a purely theoretical
problem.  I think next I'll try to write some code to make
execPartition.c able to cope with the situation when it arises.

(My draft/WIP patches attached, if you're interested.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Jan 29, 2019 at 1:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know how to reduce this to something reliable enough to
> include it in the regression tests, and maybe we don't really need
> that, but it's good to know that this is not a purely theoretical
> problem.  I think next I'll try to write some code to make
> execPartition.c able to cope with the situation when it arises.

OK, that seems to be pretty easy.  New patch series attached.  The
patch with that new logic is 0004.  I've consolidated some of the
things I had as separate patches in my last post and rewritten the
commit messages to explain more clearly the purpose of each patch.

Open issues:

- For now, I haven't tried to handle the DETACH PARTITION case.  I
don't think there's anything preventing someone - possibly even me -
from implementing the counter-based approach that I described in the
previous message, but I think it would be good to have some more
discussion first on whether it's acceptable to make concurrent queries
error out.  I think any queries that were already up and running would
be fine, but any that were planned before the DETACH and tried to
execute afterwards would get an ERROR.  That's fairly low-probability,
because normally the shared invalidation machinery would cause
replanning, but there's a race condition, so we'd have to document
something like: if you use this feature, it'll probably just work, but
you might get some funny errors in other sessions if you're unlucky.
That kinda sucks but maybe we should just suck it up.  Possibly we
should consider making the concurrent behavior optional, so that if
you'd rather take blocking locks than risk errors, you have that
option.  Of course I guess you could also just let people do an
explicit LOCK TABLE if that's what they want.  Or we could try to
actually make it work in that case, I guess by ignoring the detached
partitions, but that seems a lot harder.

- 0003 doesn't have any handling for parallel query at this point, so
even though within a single backend a single query execution will
always get the same PartitionDesc for the same relation, the answers
might not be consistent across the parallel group.  I keep going back
and forth on whether this really matters.  It's too late to modify the
plan, so any relations attached since it was generated are not going
to get scanned.  As for detached relations, we're talking about making
them error out, so we don't have to worry about different backends
coming to different conclusions about whether they should be scanned.
But maybe we should try to be smarter instead.  One concern is that
even relations that aren't scanned could still be affected because of
tuple routing, but right now parallel queries can't INSERT or UPDATE
rows anyway.  Then again, maybe we should try not to add more
obstacles in the way of lifting that restriction. Then again again,
AFAICT we wouldn't be able to test that the new code is actually
solving a problem right now today, and how much untested code do we
really want in the tree?  And then on the eleventh hand, maybe there
are other reasons why it's important to use the same PartitionDesc
across all parallel workers that I'm not thinking about at the moment.

- 0003 also changes the order in which locks are acquired.  I am not
sure whether we care about this, especially in view of other pending
changes.

If you know of other problems, have solutions to or opinions about
these, or think the whole approach is wrong, please speak up!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2019-Jan-31, Robert Haas wrote:

> OK, that seems to be pretty easy.  New patch series attached.  The
> patch with that new logic is 0004.  I've consolidated some of the
> things I had as separate patches in my last post and rewritten the
> commit messages to explain more clearly the purpose of each patch.

Looks awesome.

> - For now, I haven't tried to handle the DETACH PARTITION case.  I
> don't think there's anything preventing someone - possibly even me -
> from implementing the counter-based approach that I described in the
> previous message, but I think it would be good to have some more
> discussion first on whether it's acceptable to make concurrent queries
> error out.  I think any queries that were already up and running would
> be fine, but any that were planned before the DETACH and tried to
> execute afterwards would get an ERROR.  That's fairly low-probability,
> because normally the shared invalidation machinery would cause
> replanning, but there's a race condition, so we'd have to document
> something like: if you use this feature, it'll probably just work, but
> you might get some funny errors in other sessions if you're unlucky.
> That kinda sucks but maybe we should just suck it up.  Possibly we
> should consider making the concurrent behavior optional, so that if
> you'd rather take blocking locks than risk errors, you have that
> option.  Of course I guess you could also just let people do an
> explicit LOCK TABLE if that's what they want.  Or we could try to
> actually make it work in that case, I guess by ignoring the detached
> partitions, but that seems a lot harder.

I think telling people to do LOCK TABLE beforehand if they care about
errors is sufficient.  On the other hand, I do hope that we're only
going to cause queries to fail if they would affect the partition that's
being detached and not other partitions in the table.  Or maybe because
of the replanning on invalidation this doesn't matter as much as I think
it does.
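
Something like this is what I have in mind -- the lock level here is a
guess, since it just has to be whatever conflicts with the lock a
concurrent DETACH would take (I'm assuming ShareUpdateExclusiveLock):

begin;
lock table foo in share update exclusive mode;   -- held until commit
select * from foo where a = 1500;
commit;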

> - 0003 doesn't have any handling for parallel query at this point, so
> even though within a single backend a single query execution will
> always get the same PartitionDesc for the same relation, the answers
> might not be consistent across the parallel group.

That doesn't sound good.  I think the easiest would be to just serialize
the PartitionDesc and send it to the workers instead of them recomputing
it, but then I worry that this might have bad performance when the
partition desc is large.  (Or maybe sending bytes over pqmq is faster
than reading all those catalog entries and so this isn't a concern
anyway.)

> - 0003 also changes the order in which locks are acquired.  I am not
> sure whether we care about this, especially in view of other pending
> changes.

Yeah, the drawbacks of the unpredictable locking order are worrisome,
but then the performance gain is hard to dismiss.  Not this patch only
but the others too.  If we're okay with the others going in, I guess we
don't have concerns about this one either.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Jan 31, 2019 at 6:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > - 0003 doesn't have any handling for parallel query at this point, so
> > even though within a single backend a single query execution will
> > always get the same PartitionDesc for the same relation, the answers
> > might not be consistent across the parallel group.
>
> That doesn't sound good.  I think the easiest would be to just serialize
> the PartitionDesc and send it to the workers instead of them recomputing
> it, but then I worry that this might have bad performance when the
> partition desc is large.  (Or maybe sending bytes over pqmq is faster
> than reading all those catalog entries and so this isn't a concern
> anyway.)

I don't think we'd be using pqmq here, or shm_mq either, but I think
the bigger issues is that starting a parallel query is already a
pretty heavy operation, and so the added overhead of this is probably
not very noticeable.  I agree that it seems a bit expensive, but since
we're already waiting for the postmaster to fork() a new process which
then has to initialize itself, this probably won't break the bank.
What bothers me more is that it's adding a substantial amount of code
that could very well contain bugs to fix something that isn't clearly
a problem in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Feb 1, 2019 at 9:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think we'd be using pqmq here, or shm_mq either, but I think
> the bigger issue is that starting a parallel query is already a
> pretty heavy operation, and so the added overhead of this is probably
> not very noticeable.  I agree that it seems a bit expensive, but since
> we're already waiting for the postmaster to fork() a new process which
> then has to initialize itself, this probably won't break the bank.
> What bothers me more is that it's adding a substantial amount of code
> that could very well contain bugs to fix something that isn't clearly
> a problem in the first place.

I spent most of the last 6 hours writing and debugging a substantial
chunk of the code that would be needed.  Here's an 0006 patch that
adds functions to serialize and restore PartitionDesc in a manner
similar to what parallel query does for other object types.  Since a
PartitionDesc includes a pointer to a PartitionBoundInfo, that meant
also writing functions to serialize and restore those.  If we want to
go this route, I think the next thing to do would be to integrate this
into the PartitionDirectory infrastructure.

Basically what I'm imagining we would do there is have a hash table
stored in shared memory to go with the one that is already stored in
backend-private memory.  The shared table stores serialized entries,
and the local table stores normal ones.  Any lookups try the local
table first, then the shared table.  If we get a hit in the shared
table, we deserialize whatever we find there and stash the result in
the local table.  If we find it neither place, we generate a new entry
in the local table and then serialize it into the shared table. It's
not quite clear to me at the moment how to solve the concurrency
problems associated with this design, but it's probably not too hard.
I don't have enough mental energy left to figure it out today, though.

After having written this code, I'm still torn about whether to go
further with this design.  On the one hand, this is such boilerplate
code that it's kinda hard to imagine it having too many more bugs; on
the other hand, as you can see, it's a non-trivial amount of code to
add without a real clear reason, and I'm not sure we have one, even
though in the abstract it seems like a better way to go.

Still interesting in hearing more opinions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Sat, 2 Feb 2019 at 09:31, Robert Haas <robertmhaas@gmail.com> wrote:
> After having written this code, I'm still torn about whether to go
> further with this design.  On the one hand, this is such boilerplate
> code that it's kinda hard to imagine it having too many more bugs; on
> the other hand, as you can see, it's a non-trivial amount of code to
> add without a real clear reason, and I'm not sure we have one, even
> though in the abstract it seems like a better way to go.

I think we do need to ensure that the PartitionDesc matches between
worker and leader. Have a look at choose_next_subplan_for_worker() in
nodeAppend.c. Notice that a call is made to
ExecFindMatchingSubPlans().


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Sat, Feb 2, 2019 at 7:19 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I think we do need to ensure that the PartitionDesc matches between
> worker and leader. Have a look at choose_next_subplan_for_worker() in
> nodeAppend.c. Notice that a call is made to
> ExecFindMatchingSubPlans().

Thanks for the tip.  I see that code, but I'm not sure that I
understand why it matters here.  First, if I'm not mistaken, what's
being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan
indexes, not anything that refers to a PartitionDesc directly.  And
second, even if it did, it looks like the computation is done
separately in every backend and not shared among backends, so even if
it were directly referring to PartitionDesc indexes, it still won't be
assuming that they're the same in every backend.  Can you further
explain your thinking?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Mon, 4 Feb 2019 at 16:45, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Feb 2, 2019 at 7:19 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > I think we do need to ensure that the PartitionDesc matches between
> > worker and leader. Have a look at choose_next_subplan_for_worker() in
> > nodeAppend.c. Notice that a call is made to
> > ExecFindMatchingSubPlans().
>
> Thanks for the tip.  I see that code, but I'm not sure that I
> understand why it matters here.  First, if I'm not mistaken, what's
> being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan
> indexes, not anything that refers to a PartitionDesc directly.  And
> second, even if it did, it looks like the computation is done
> separately in every backend and not shared among backends, so even if
> it were directly referring to PartitionDesc indexes, it still won't be
> assuming that they're the same in every backend.  Can you further
> explain your thinking?

In a Parallel Append, each parallel worker will call ExecInitAppend(),
which calls ExecCreatePartitionPruneState(). That function makes a
call to RelationGetPartitionDesc() and records the partdesc's
boundinfo in context->boundinfo.  This means that if we perform any
pruning in the parallel worker in choose_next_subplan_for_worker()
then find_matching_subplans_recurse() will use the PartitionDesc from
the parallel worker to translate the partition indexes into the
Append's subnodes.

If the PartitionDesc from the parallel worker has an extra partition
than what was there when the plan was built then the partition index
to subplan index translation will be incorrect as the
find_matching_subplans_recurse() will call get_matching_partitions()
using the context with the PartitionDesc containing the additional
partition. The return value from get_matching_partitions() is fine,
it's just that the code inside the while ((i =
bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
It could even crash if partset has an index out of bounds of the
subplan_map or subpart_map arrays.
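
For reference, the kind of plan I'm talking about can be coaxed out of
your foo table from upthread with something like the following (the GUC
settings are just to force a Parallel Append and a generic plan; the
exact plan shape may of course vary):

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set plan_cache_mode = force_generic_plan;
prepare q(int) as select count(*) from foo where a < $1;
explain (costs off) execute q(1500);

-- each worker builds its own PartitionDesc when its ExecInitAppend()
-- runs, and any pruning it performs translates partition indexes into
-- Append subplan indexes using that PartitionDesc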

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Mon, Feb 4, 2019 at 12:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> If the PartitionDesc from the parallel worker has an extra partition
> than what was there when the plan was built then the partition index
> to subplan index translation will be incorrect as the
> find_matching_subplans_recurse() will call get_matching_partitions()
> using the context with the PartitionDesc containing the additional
> partition. The return value from get_matching_partitions() is fine,
> it's just that the code inside the while ((i =
> bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
> It could even crash if partset has an index out of bounds of the
> subplan_map or subpart_map arrays.

Is there any chance you've missed the fact that in one of the later
patches in the series I added code to adjust the subplan_map and
subpart_map arrays to compensate for any extra partitions?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Mon, Feb 4, 2019 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 4, 2019 at 12:02 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > If the PartitionDesc from the parallel worker has an extra partition
> > than what was there when the plan was built then the partition index
> > to subplan index translation will be incorrect as the
> > find_matching_subplans_recurse() will call get_matching_partitions()
> > using the context with the PartitionDesc containing the additional
> > partition. The return value from get_matching_partitions() is fine,
> > it's just that the code inside the while ((i =
> > bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
> > It could even crash if partset has an index out of bounds of the
> > subplan_map or subpart_map arrays.
>
> Is there any chance you've missed the fact that in one of the later
> patches in the series I added code to adjust the subplan_map and
> subpart_map arrays to compensate for any extra partitions?

In case that wasn't clear enough, my point here is that while the
leader and workers could end up with different ideas about the shape
of the PartitionDesc, each would end up with a subplan_map and
subpart_map array adapted to the view of the PartitionDesc with which
they ended up, and therefore, I think, everything should work.  So far
there is, to my knowledge, no situation in which a PartitionDesc index
gets passed between one backend and another, and as long as we don't
do that, it's not really necessary for them to agree; each backend
needs to individually ignore any concurrently added partitions not
contemplated by the plan, but it doesn't matter whether backend A and
backend B agree on which partitions were concurrently added, just that
each ignores the ones it knows about.

Since time is rolling along here, I went ahead and committed 0001
which seems harmless even if somebody finds a huge problem with some
other part of this.  If anybody wants to review the approach or the
code before I proceed further, that would be great, but please speak
up soon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Tue, 5 Feb 2019 at 01:54, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 4, 2019 at 12:02 AM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > If the PartitionDesc from the parallel worker has an extra partition
> > than what was there when the plan was built then the partition index
> > to subplan index translation will be incorrect as the
> > find_matching_subplans_recurse() will call get_matching_partitions()
> > using the context with the PartitionDesc containing the additional
> > partition. The return value from get_matching_partitions() is fine,
> > it's just that the code inside the while ((i =
> > bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
> > It could even crash if partset has an index out of bounds of the
> > subplan_map or subpart_map arrays.
>
> Is there any chance you've missed the fact that in one of the later
> patches in the series I added code to adjust the subplan_map and
> subpart_map arrays to compensate for any extra partitions?

I admit that I hadn't looked at the patch, I was just going on what I
had read here.   I wasn't sure how the re-map would have been done as
some of the information is unavailable during execution, but I see now
that you're modified it so we send a list of Oids that we expect and
remap based on if an unexpected Oid is found.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote:
> > - I refactored expand_inherited_rtentry() to drive partition expansion
> > entirely off of PartitionDescs. The reason why this is necessary is
> > that it clearly will not work to have find_all_inheritors() use a
> > current snapshot to decide what children we have and lock them, and
> > then consult a different source of truth to decide which relations to
> > open with NoLock.  There's nothing to keep the lists of partitions
> > from being different in the two cases, and that demonstrably causes
> > assertion failures if you SELECT with an ATTACH/DETACH loop running in
> > the background. However, it also changes the order in which tables get
> > locked.  Possibly that could be fixed by teaching
> > expand_partitioned_rtentry() to qsort() the OIDs the way
> > find_inheritance_children() does.  It also loses the infinite-loop
> > protection which find_all_inheritors() has.  Not sure what to do about
> > that.
>
> I don't think you need to qsort() the Oids before locking. What the
> qsort() does today is ensure we get a consistent locking order. Any
> other order would surely do, providing we stick to it consistently. I
> think PartitionDesc order is fine, as it's consistent.  Having it
> locked in PartitionDesc order I think is what's needed for [1] anyway.
> [2] proposes to relax the locking order taken during execution.
>
> [1] https://commitfest.postgresql.org/21/1778/
> [2] https://commitfest.postgresql.org/21/1887/

Based on this feedback, I went ahead and committed the part of the
previously-posted patch set that makes this change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Jan 31, 2019 at 1:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
> New patch series attached.

And here's yet another new patch series, rebased over today's commit
and with a couple of other fixes:

1. I realized that the PartitionDirectory for the planner ought to be
attached to the PlannerGlobal, not the PlannerInfo; we don't want to
create more than one partition directory per query planning cycle, and
we do want our notion of the PartitionDesc for a given relation to be
stable between the outer query and any subqueries (see the sketch after this list).

2. I discovered - via CLOBBER_CACHE_ALWAYS testing - that the
PartitionDirectory has to hold a reference count on the relcache
entry.  In hindsight, this should have been obvious: the planner keeps
the locks when it closes a relation and later reopens it, but it
doesn't keep the relation open, which is what prevents recycling of
the old PartitionDesc.  Unfortunately these additional reference count
manipulations are probably not free.  I don't know how expensive they are,
though; maybe it's not too bad.
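
To illustrate what I mean in point 1, this is the kind of query where
the outer query and a subquery both need to see the same PartitionDesc
for foo (reusing the test table from upthread):

select *
  from foo
 where a in (select a from foo where b = 'two');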

Aside from these problems, I think I have spotted a subtle problem in
0001. I'll think about that some more and post another update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Tue, Feb 26, 2019 at 5:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Aside from these problems, I think I have spotted a subtle problem in
> 0001. I'll think about that some more and post another update.

0001 turned out to be guarding against the wrong problem.  It supposed
that if we didn't get a coherent view of the system catalogs due to
concurrent DDL, we could just AcceptInvalidationMessages() and retry.
But that turns out to be wrong, because there's a (very) narrow window
after a process removes itself from the ProcArray and before it sends
invalidation messages.  It wasn't difficult to engineer an alternative
solution that works, but unfortunately it's only good enough to handle
the ATTACH case, so this is another thing that will need more thought
for concurrent DETACH.  Anyway, the updated 0001 contains that code
and some explanatory comments.  The rest of the series is
substantially unchanged.

I'm not currently aware of any remaining correctness issues with this
code, although certainly there may be some.  There has been a certain
dearth of volunteers to review any of this.  I do plan to poke at it a
bit to see whether it has any significant performance impact, but not
today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Feb 28, 2019 at 3:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not currently aware of any remaining correctness issues with this
> code, although certainly there may be some.  There has been a certain
> dearth of volunteers to review any of this.  I do plan to poke at it a
> bit to see whether it has any significant performance impact, but not
> today.

Today, did some performance testing.  I created a table with 100
partitions and randomly selected rows from it using pgbench, with and
without -M prepared.  The results show a small regression, but I
believe it's below the noise floor.  Five minute test runs.
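
For anyone who wants to reproduce this, here's a sketch of a comparable
schema and pgbench script; the exact partition bounds and script I
actually used may have differed slightly:

create table t (a int not null) partition by range (a);
select format(
  'create table t_%s partition of t for values from (%s) to (%s)',
  i, i * 1000, (i + 1) * 1000)
from generate_series(0, 99) i
\gexec
insert into t select generate_series(0, 99999);

-- select.sql, for pgbench:
--   \set id random(0, 99999)
--   select * from t where a = :id;

$ pgbench -n -f select.sql -T 300        # and again with -M prepared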

with prepared queries

master:
tps = 10919.914458 (including connections establishing)
tps = 10876.271217 (including connections establishing)
tps = 10761.586160 (including connections establishing)

concurrent-attach:
tps = 10883.535582 (including connections establishing)
tps = 10868.471805 (including connections establishing)
tps = 10761.586160 (including connections establishing)

with simple queries

master:
tps = 1486.120530 (including connections establishing)
tps = 1486.797251 (including connections establishing)
tps = 1494.129256 (including connections establishing)

concurrent-attach:
tps = 1481.774212 (including connections establishing)
tps = 1472.159016 (including connections establishing)
tps = 1476.444097 (including connections establishing)

Looking at the total of the three results, that's about a 0.8%
regression with simple queries and a 0.2% regression with prepared
queries.  Looking at the median, it's about 0.7% and 0.07%.  Would
anybody like to argue that's a reason not to commit these patches?

Would anyone like to argue that there is any other reason not to
commit these patches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote:
> Would anyone like to argue that there is any other reason not to
> commit these patches?

Hi Robert,

Thanks for working on this. I'm sorry I didn't get a chance to
dedicate some time to look at it.

It looks like you've pushed all of this now.  Can the CF entry be
marked as committed?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Mar 14, 2019 at 6:12 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote:
> > Would anyone like to argue that there is any other reason not to
> > commit these patches?
>
> Hi Robert,
>
> Thanks for working on this. I'm sorry I didn't get a chance to
> dedicate some time to look at it.
>
> It looks like you've pushed all of this now.  Can the CF entry be
> marked as committed?

Yeah, done now, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company