Thread: ATTACH/DETACH PARTITION CONCURRENTLY
Hi,

One of the downsides of declarative partitioning vs old school inheritance partitioning is that a new partition cannot be added to the partitioned table without taking an AccessExclusiveLock on the partitioned table. We've obviously got a bunch of features for various other things where we work a bit harder to get around that problem, e.g. creating indexes concurrently. I've started working on allowing partitions to be attached and detached with just a ShareUpdateExclusiveLock on the table. If I'm correct, then we can do this in a similar, but simpler, way to how CREATE INDEX CONCURRENTLY works. We just need to pencil in that the new partition exists but is not yet valid, then wait for snapshots older than our own to finish before marking the partition as valid.

One problem I had with doing this is that there was not really a good place to store that "isvalid" flag for partitions. We have pg_index for indexes, but partition details are just spread over pg_inherits and pg_class. So step 1 was to move all that into a new table called pg_partition. I think this is quite nice as it also gets rid of relpartbound out of pg_class. It's probably just a matter of time before someone complains that they can't create some partition with some pretty large Datum due to it not being able to fit on a single heap page (pg_class has no TOAST table). I ended up getting rid of pg_class.relispartition, replacing it with relpartitionparent, which is just InvalidOid when the table or index is not a partition. This allows various pieces of code to be more efficient since we can look at the relcache instead of scanning pg_inherits all the time. It's now also much faster to get a partition's ancestors.

So, patch 0001 is just one I've already submitted for the July 'fest. Nothing new. It was just required to start this work. 0002 migrates partitions out of pg_inherits into pg_partition. This patch is at a stage where it appears to work, but is very unpolished and requires me to stare at it much longer than I've done so far. There's a bunch of code that gets repeated way too many times in tablecmds.c, for example. 0003 does the same for partitioned indexes. The patch is in a similar, maybe slightly worse state than 0002. Various comments will be out of date. 0004 is the early workings of what I have in mind for the concurrent ATTACH code. It's vastly incomplete. It does pass make check, but really only because there are no tests doing any concurrent attaches. There's a mountain of code still missing to make things ignore invalid partitions. I just have a very simple case working. Partition-wise joins will be very much broken by what I have so far, and likely a whole bunch of other stuff.

About the extent of my tests so far is the following:

-- setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);

-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in (2); -- Session 2 (waits for release of session 1's snapshot)

select * from listp; -- Session 1
 a
---
 1

commit; -- session 1 (session 2's alter table now finishes waiting)

select * from listp; -- Session 1 (new partition now valid)
 a
---
 1
 2
(2 rows)

-- example 2.
start transaction isolation level read committed; -- session 1
select * from listp; -- session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in (2); -- Session 2 completes without waiting.

select * from listp; -- Session 1 (new partition visible while in transaction)
 a
---
 1
 2
(2 rows)

This basically works by:

1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false.
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid.
9. Invalidate relcache for the partitioned table.
10. Release session-level locks.

I've disallowed the feature when the partitioned table has a default partition. I don't see how this can be made to work. At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain any other sub-commands in the ALTER TABLE, so performing the additional transaction commit and begin inside the single sub-command might be okay. It does mean that 'rel', which is passed down to ATExecAttachPartition(), must be closed and reopened again, which results in the calling function having a pointer into a closed Relation. I worked around this by changing the code so it passes a pointer to the Relation, and I've got ATExecAttachPartition() updating that pointer before returning. It's not particularly pretty, but I didn't really see how else this can be done.

I've not yet done anything about the DETACH CONCURRENTLY case. I think it should just be the same steps in some roughly reverse order. We can skip the waiting part if the partition being detached is still marked as invalid from some failed concurrent ATTACH.

I've not thought much about pg_dump beyond just having it ignore invalid partitions. I don't think it's very useful to support some command that attaches an invalid partition, since there will be no command to revalidate an invalid partition. It's probably best to resolve that with a DETACH followed by a new ATTACH. So probably pg_dump can just do nothing for invalid partitions.

So anyway, my intention in posting this patch now, rather than when it's closer to being finished, is design review. I'm interested in hearing objections, comments, and constructive criticism for patches 0002-0004. Patch 0001 comments can go to [1]. Are there any blockers on this that I've overlooked?

[1] https://www.postgresql.org/message-id/CAKJS1f81TpxZ8twugrWCo%3DVDHEkmagxRx7a%2B1z4aaMeQy%3DnA7w%40mail.gmail.com

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
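[Editor's note: to make step 7 of the list above concrete, here is a minimal standalone C model of the wait condition. It is not backend code; SnapshotModel, can_mark_partition_valid and the sample xids are invented stand-ins. The idea being modeled: the attaching transaction may only flip partisvalid to true once no active snapshot is older than the transaction that inserted the not-yet-valid pg_partition row.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;         /* stand-in for PostgreSQL's TransactionId */

typedef struct SnapshotModel
{
    TransactionId xmin;                 /* oldest xid this snapshot considers running */
    bool          active;               /* still held by some backend? */
} SnapshotModel;

/*
 * Step 7: the partition may only be marked valid once every snapshot that is
 * older than the attaching transaction (and so must not see the partition)
 * has been released.
 */
static bool
can_mark_partition_valid(TransactionId attach_xid,
                         const SnapshotModel *snaps, int nsnaps)
{
    for (int i = 0; i < nsnaps; i++)
    {
        if (snaps[i].active && snaps[i].xmin <= attach_xid)
            return false;               /* an older snapshot is still alive; keep waiting */
    }
    return true;
}

int
main(void)
{
    SnapshotModel snaps[] = {
        {.xmin = 95, .active = true},   /* session 1 of "example 1": started before the attach */
        {.xmin = 120, .active = true},  /* a snapshot taken after the attach began */
    };
    TransactionId attach_xid = 100;     /* xid that inserted the invalid pg_partition row */

    printf("may mark valid now? %s\n",
           can_mark_partition_valid(attach_xid, snaps, 2) ? "yes" : "no");

    snaps[0].active = false;            /* session 1 commits; its snapshot goes away */
    printf("after the old snapshot is released: %s\n",
           can_mark_partition_valid(attach_xid, snaps, 2) ? "yes" : "no");
    return 0;
}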
On 3 August 2018 at 01:25, David Rowley <david.rowley@2ndquadrant.com> wrote: > 1. Do all the normal partition attach partition validation. > 2. Insert a record into pg_partition with partisvalid=false. > 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table. > 4. Obtain a session-level AccessExclusiveLock on the partition being attached. > 5. Commit. > 6. Start a new transaction. > 7. Wait for snapshots older than our own to be released. > 8. Mark the partition as valid. > 9. Invalidate relcache for the partitioned table. > 10. Release session-level locks.

So I was thinking about this again and realised this logic is broken. All it takes is a snapshot that starts after the ATTACH PARTITION started and before it completed. That snapshot will see the new partition appear while the snapshot is possibly still open, which could lead to non-repeatable reads in a repeatable read transaction. The window for this to occur is possibly quite large given that the ATTACH CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check PartitionIsValid(). This will return true if the current snapshot should see the partition or not. The partition is valid if partisvalid = true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \
    && TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \
    || (!(pd)->is_valid[(i)] \
    && TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on the partition being detached (instead of the AccessShareLock that non-concurrent detach uses).
2. Update the pg_partition record, set partvalid = false.
3. Commit.
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.

DETACH PARTITION CONCURRENTLY failure handling when it fails after step 3 (above):

1. Make vacuum of a partition check for pg_partition.partvalid = false; if the xmin of the tuple is old enough, perform a partition cleanup by doing steps 6+7 above. A VACUUM FREEZE must run before transaction wraparound, so this means a partition can never reattach itself when the transaction counter wraps.

I believe I've got the attach and detach working correctly now and also isolation tests that appear to prove it works. I've also written the failed detach cleanup code into vacuum. Unusually, since foreign tables can also be partitions, this required teaching auto-vacuum to look at foreign tables, only in the sense of checking for failed detached partitions. It also requires adding vacuum support for foreign tables. It feels a little bit weird to modify auto-vacuum to look at foreign tables, but I really couldn't see another way to do this.

I'm now considering if this all holds together in the event the pg_partition tuple of an invalid partition becomes frozen.
The problem would be that PartitionIsValid() could return the wrong value due to TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)]). That code is trying to keep the detached partition visible to older snapshots, but if pd->xmin[i] becomes frozen, then the partition would become invisible. However, I think this won't be a problem, since a VACUUM FREEZE would only freeze tuples that are also old enough to have failed detaches cleaned up earlier in the vacuum process. Also, we must disallow a DEFAULT partition from being attached to a partitioned table with a failed DETACH CONCURRENTLY, as it wouldn't be very clear what the default partition's partition qual would be, as this is built based on the quals of all attached partitions. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
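[Editor's note: here is a small standalone C illustration of the PartitionIsValid() rule defined above. PartitionDescModel, partition_is_visible and the sample xids are invented stand-ins, and the frozen-xmin caveat just discussed is not modeled. A valid partition becomes visible once the observer's xid is at or past the attaching xmin, while an invalidated partition stays visible only to xids at or before the detaching xmin.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;                 /* stand-in */

typedef struct PartitionDescModel
{
    int                  nparts;
    const bool          *is_valid;              /* pg_partition.partisvalid */
    const TransactionId *xmin;                  /* xmin of the pg_partition tuple */
} PartitionDescModel;

/*
 * Same shape as the proposed PartitionIsValid() macro: valid partitions are
 * visible to xids at or after their xmin; partitions undergoing a concurrent
 * detach (is_valid = false) remain visible only to xids at or before the
 * xmin of the invalidating update.
 */
static bool
partition_is_visible(const PartitionDescModel *pd, int i, TransactionId my_xid)
{
    if (pd->is_valid[i])
        return pd->xmin[i] <= my_xid;
    return my_xid <= pd->xmin[i];
}

int
main(void)
{
    const bool          valid[] = {true, true, false};
    const TransactionId xmins[] = {50, 120, 110};   /* old partition, freshly attached, being detached */
    const PartitionDescModel pd = {3, valid, xmins};
    const TransactionId xids[] = {100, 130};        /* an old transaction and a newer one */

    for (int x = 0; x < 2; x++)
        for (int i = 0; i < pd.nparts; i++)
            printf("xid %u sees partition %d: %s\n",
                   (unsigned) xids[x], i,
                   partition_is_visible(&pd, i, xids[x]) ? "yes" : "no");
    return 0;
}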
Hi, On 2018-08-08 00:40:12 +1200, David Rowley wrote: > 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather > than an AccessExclusiveLock. > 2. Do all the normal partition attach partition validation. > 3. Insert pg_partition record with partvalid = true. > 4. Invalidate relcache entry for the partitioned table > 5. Any loops over a partitioned table's PartitionDesc must check > PartitionIsValid(). This will return true if the current snapshot > should see the partition or not. The partition is valid if partisvalid > = true and the xmin precedes or is equal to the current snapshot. How does this protect against other sessions actively using the relcache entry? Currently it is *NOT* safe to receive invalidations for e.g. partitioning contents afaics. - Andres
On 7 August 2018 at 13:47, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-08-08 00:40:12 +1200, David Rowley wrote: >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather >> than an AccessExclusiveLock. >> 2. Do all the normal partition attach partition validation. >> 3. Insert pg_partition record with partvalid = true. >> 4. Invalidate relcache entry for the partitioned table >> 5. Any loops over a partitioned table's PartitionDesc must check >> PartitionIsValid(). This will return true if the current snapshot >> should see the partition or not. The partition is valid if partisvalid >> = true and the xmin precedes or is equal to the current snapshot. > > How does this protect against other sessions actively using the relcache > entry? Currently it is *NOT* safe to receive invalidations for > e.g. partitioning contents afaics. I think you may be right in the general case, but ISTM possible to invalidate/refresh just the list of partitions. If so, that idea would seem to require some new, as-yet not invented mechanism. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote: > On 2018-08-08 00:40:12 +1200, David Rowley wrote: >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather >> than an AccessExclusiveLock. >> 2. Do all the normal partition attach partition validation. >> 3. Insert pg_partition record with partvalid = true. >> 4. Invalidate relcache entry for the partitioned table >> 5. Any loops over a partitioned table's PartitionDesc must check >> PartitionIsValid(). This will return true if the current snapshot >> should see the partition or not. The partition is valid if partisvalid >> = true and the xmin precedes or is equal to the current snapshot. > > How does this protect against other sessions actively using the relcache > entry? Currently it is *NOT* safe to receive invalidations for > e.g. partitioning contents afaics. I'm not proposing that sessions running older snapshots can't see that there's a new partition. The code I have uses PartitionIsValid() to test if the partition should be visible to the snapshot. The PartitionDesc will always contain details for all partitions stored in pg_partition whether they're valid to the current snapshot or not. I did it this way as there's no way to invalidate the relcache based on a point in transaction, only a point in time. I'm open to better ideas, of course. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-08-08 01:23:51 +1200, David Rowley wrote: > On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote: > > On 2018-08-08 00:40:12 +1200, David Rowley wrote: > >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather > >> than an AccessExclusiveLock. > >> 2. Do all the normal partition attach partition validation. > >> 3. Insert pg_partition record with partvalid = true. > >> 4. Invalidate relcache entry for the partitioned table > >> 5. Any loops over a partitioned table's PartitionDesc must check > >> PartitionIsValid(). This will return true if the current snapshot > >> should see the partition or not. The partition is valid if partisvalid > >> = true and the xmin precedes or is equal to the current snapshot. > > > > How does this protect against other sessions actively using the relcache > > entry? Currently it is *NOT* safe to receive invalidations for > > e.g. partitioning contents afaics. > > I'm not proposing that sessions running older snapshots can't see that > there's a new partition. The code I have uses PartitionIsValid() to > test if the partition should be visible to the snapshot. The > PartitionDesc will always contain details for all partitions stored in > pg_partition whether they're valid to the current snapshot or not. I > did it this way as there's no way to invalidate the relcache based on > a point in transaction, only a point in time. I don't think that solves the problem that an arriving relcache invalidation would trigger a rebuild of rd_partdesc, while it actually is referenced by running code. You'd need to build infrastructure to prevent that. One approach would be to make sure that everything relying on rd_partdesc staying the same stores its value in a local variable, and then *not* free the old version of rd_partdesc (etc) when the refcount > 0, but delay that to the RelationClose() that makes refcount reach 0. That'd be the start of a framework for more such concurrent handling. Regards, Andres Freund
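[Editor's note: a standalone C model of the deferred-free idea above, for illustration only. RelModel, PartDescModel and the helper functions are invented stand-ins for Relation/PartitionDesc, not actual relcache code. An invalidation that arrives while the relation is referenced parks the old descriptor instead of freeing it, and the parked copies are only released when the last reference goes away.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdio.h>
#include <stdlib.h>

typedef struct PartDescModel
{
    int                   nparts;
    struct PartDescModel *next_old;     /* chain of superseded descriptors */
} PartDescModel;

typedef struct RelModel
{
    int            refcount;
    PartDescModel *partdesc;            /* current descriptor */
    PartDescModel *old_partdescs;       /* kept alive while refcount > 0 */
} RelModel;

/* Invalidation arrives: rebuild the descriptor, but park the old one if in use. */
static void
rel_invalidate(RelModel *rel, int new_nparts)
{
    if (rel->refcount > 0 && rel->partdesc != NULL)
    {
        rel->partdesc->next_old = rel->old_partdescs;
        rel->old_partdescs = rel->partdesc;         /* defer the free */
    }
    else
        free(rel->partdesc);

    rel->partdesc = calloc(1, sizeof(PartDescModel));
    rel->partdesc->nparts = new_nparts;
}

/* Model of RelationClose(): the last closer releases the parked descriptors. */
static void
rel_close(RelModel *rel)
{
    if (--rel->refcount == 0)
    {
        while (rel->old_partdescs != NULL)
        {
            PartDescModel *next = rel->old_partdescs->next_old;

            free(rel->old_partdescs);
            rel->old_partdescs = next;
        }
    }
}

int
main(void)
{
    RelModel rel = {0, NULL, NULL};

    rel_invalidate(&rel, 2);            /* initial build of the descriptor */
    rel.refcount = 1;                   /* "open": executor caches a pointer */
    PartDescModel *in_use = rel.partdesc;

    rel_invalidate(&rel, 3);            /* concurrent ATTACH triggers an inval */
    printf("executor's cached copy: %d parts; rebuilt copy: %d parts\n",
           in_use->nparts, rel.partdesc->nparts);

    rel_close(&rel);                    /* refcount hits 0: parked copies freed */
    free(rel.partdesc);
    return 0;
}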
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote: > On 2018-08-08 01:23:51 +1200, David Rowley wrote: >> I'm not proposing that sessions running older snapshots can't see that >> there's a new partition. The code I have uses PartitionIsValid() to >> test if the partition should be visible to the snapshot. The >> PartitionDesc will always contain details for all partitions stored in >> pg_partition whether they're valid to the current snapshot or not. I >> did it this way as there's no way to invalidate the relcache based on >> a point in transaction, only a point in time. > > I don't think that solves the problem that an arriving relcache > invalidation would trigger a rebuild of rd_partdesc, while it actually > is referenced by running code. > > You'd need to build infrastructure to prevent that. > > One approach would be to make sure that everything relying on > rd_partdesc staying the same stores its value in a local variable, and > then *not* free the old version of rd_partdesc (etc) when the refcount > > 0, but delay that to the RelationClose() that makes refcount reach > 0. That'd be the start of a framework for more such concurrent > handling. I'm not so sure not freeing the partdesc until the refcount reaches 0 is safe. As you'd expect, we hold a lock on a partitioned table between the planner and executor, but the relation is closed and its refcount returns to 0 between the two, which means that when the relation is first opened in the executor, the updated PartitionDesc is obtained. A non-concurrent attach would have been blocked in this case due to the lock being held by the planner. Instead of using refcount == 0, perhaps we can release the original partdesc only when there are no locks held by us on the relation. It's late here now, so I'll look at that tomorrow. I've attached what I was playing around with. I think I'll also need to change RelationGetPartitionDesc() to have it return the original partdesc, if it's non-NULL. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 07/08/2018 15:29, Andres Freund wrote: > I don't think that solves the problem that an arriving relcache > invalidation would trigger a rebuild of rd_partdesc, while it actually > is referenced by running code. The problem is more generally that a relcache invalidation changes all pointers that might be in use. So it's currently not safe to trigger a relcache invalidation (on tables) without some kind of exclusive lock. One possible solution to this is outlined here: <https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com> -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-08-09 20:57:35 +0200, Peter Eisentraut wrote: > On 07/08/2018 15:29, Andres Freund wrote: > > I don't think that solves the problem that an arriving relcache > > invalidation would trigger a rebuild of rd_partdesc, while it actually > > is referenced by running code. > > The problem is more generally that a relcache invalidation changes all > pointers that might be in use. I don't think that's quite right. We already better be OK with superfluous invals that do not change anything, because there are already sources of those (just think of vacuum, analyze, relation extension, whatnot). > So it's currently not safe to trigger a relcache invalidation (on > tables) without some kind of exclusive lock. I don't think that's true in as general a sense as you're stating it. It's not OK to send relcache invalidations for things that people rely on, and that cannot be updated in-place. Because of the dangling pointer issue etc. The fact that currently it is not safe to *change* partition-related stuff without an AEL, and how to make it safe, is precisely what I was talking about in the thread. It won't be a general solution, but the infrastructure I'm talking about should get us closer. > One possible solution to this is outlined here: > <https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com> I don't see anything in here that addresses the issue structurally? Greetings, Andres Freund
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote: > One approach would be to make sure that everything relying on > rd_partdesc staying the same stores its value in a local variable, and > then *not* free the old version of rd_partdesc (etc) when the refcount > > 0, but delay that to the RelationClose() that makes refcount reach > 0. That'd be the start of a framework for more such concurrent > handling. This is not a fully baked idea, but I'm wondering if a better way to do this, instead of having this PartitionIsValid macro to determine if the partition should be visible to the current transaction ID, we could, when we invalidate a relcache entry, send along the transaction ID that it's invalid from. Other backends, when they process the invalidation message, could wipe out the cache entry only if their xid is >= the invalidation's "xmax" value. Otherwise, just tag the xmax onto the cache somewhere and always check it before using the cache (perhaps make it part of the RelationIsValid macro). This would also require that we move away from SnapshotAny type catalogue scans in favour of MVCC scans so that backends populating their relcache build it based on their current xid. Unless I'm mistaken, it should not make any difference for all DDL that takes an AEL on the relation, since there can be no older transactions running when the catalogue is modified, but for DDL that's not taking an AEL, we could effectively have an MVCC relcache. It would need careful thought about how it might affect CREATE INDEX CONCURRENTLY and all the other DDL that can be performed without an AEL. I'm unsure how this would work for the catcache as I've studied that code in even less detail, but I'm throwing this out there in case there is some major flaw in this idea, so that I don't go wasting time looking into it further. I think the PartitionIsValid idea was not that great as it really complicates run-time partition pruning, since it's quite critical about partition indexes being the same between the planner and executor. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
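[Editor's note: a standalone C sketch of the xid-tagged invalidation idea above. CacheEntryModel and the helpers are invented stand-ins; real invalidation messages carry no xid today, so this is purely hypothetical. A backend newer than the change wipes the entry, while an older backend merely records the cutoff xid and keeps using the entry until it crosses it.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;         /* stand-in */

typedef struct CacheEntryModel
{
    bool          populated;            /* false = must be rebuilt before use */
    int           nparts;               /* stand-in for the cached contents */
    TransactionId invalid_from;         /* 0 = no pending invalidation */
} CacheEntryModel;

/* Receive an invalidation message that carries the originating xid. */
static void
entry_receive_inval(CacheEntryModel *e, TransactionId inval_xid, TransactionId my_xid)
{
    if (my_xid >= inval_xid)
        e->populated = false;           /* we must see the change: rebuild */
    else
        e->invalid_from = inval_xid;    /* too old to see it: keep the old entry */
}

/* The check the message proposes folding into something like RelationIsValid(). */
static bool
entry_usable(const CacheEntryModel *e, TransactionId my_xid)
{
    return e->populated &&
           (e->invalid_from == 0 || my_xid < e->invalid_from);
}

int
main(void)
{
    CacheEntryModel entry = {true, 2, 0};

    entry_receive_inval(&entry, 120, 100);      /* our xid 100 predates the change */
    printf("old transaction may keep using entry: %s\n",
           entry_usable(&entry, 100) ? "yes" : "no");
    printf("new transaction (xid 130) may use it: %s\n",
           entry_usable(&entry, 130) ? "yes" : "no");
    return 0;
}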
On Sun, Aug 12, 2018 at 9:05 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > This is not a fully baked idea, but I'm wondering if a better way to > do this, instead of having this PartitionIsValid macro to determine if > the partition should be visible to the current transaction ID, we > could, when we invalidate a relcache entry, send along the transaction > ID that it's invalid from. Other backends, when they process the > invalidation message, could wipe out the cache entry only if their > xid is >= the invalidation's "xmax" value. Otherwise, just tag the > xmax onto the cache somewhere and always check it before using the > cache (perhaps make it part of the RelationIsValid macro). Transactions don't necessarily commit in XID order, so this might be an optimization to keep older transactions from having to do unnecessary rebuilds -- which I actually doubt is a major problem, but maybe I'm wrong -- but you can't rely solely on this as a way of deciding which transactions will see the effects of some change. If transactions 100, 101, and 102 begin in that order, and transaction 101 commits, there's no principled justification for 102 seeing its effects but 100 not seeing it. > This would > also require that we move away from SnapshotAny type catalogue scans > in favour of MVCC scans so that backends populating their relcache > build it based on their current xid. I think this is a somewhat confused analysis. We don't use SnapshotAny for catalog scans, and we never have. We used to use SnapshotNow, and we now use a current MVCC snapshot. What you're talking about, I think, is possibly using the transaction snapshot rather than a current MVCC snapshot for the catalog scans. I've thought about similar things, but I think there's a pretty deep can of worms. For instance, if you built a relcache entry using the transaction snapshot, you might end up building a seemingly-valid relcache entry for a relation that has been dropped or rewritten. When you try to access the relation data, you'll attempt to access a relfilenode that's not there any more. Similarly, if you use an older snapshot to build a partition descriptor, you might think that relation OID 12345 is still a partition of that table when in fact it's been detached - and, maybe, altered in other ways, such as changing column types. It seems to me that overall you're not really focusing on the right set of issues here. I think the very first thing we need to worry about is how we're going to keep the executor from following a bad pointer and crashing. Any time the partition descriptor changes, the next relcache rebuild is going to replace rd_partdesc and free the old one, but the executor may still have the old pointer cached in a structure or local variable; the next attempt to dereference it will be looking at freed memory, and kaboom. Right now, we prevent this by not allowing the partition descriptor to be modified while there are any queries running against the partition, so while there may be a rebuild, the old pointer will remain valid (cf. keep_partdesc). I think that whatever scheme you/we choose here should be tested with a combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions -- one of them doing DDL on the table while the other runs a long query. Once we keep it from blowing up, the second question is what the semantics are supposed to be. It seems inconceivable to me that the set of partitions that an in-progress query will scan can really be changed on the fly.
I think we're going to have to rule that if you add or remove partitions while a query is running, we're going to scan exactly the set we had planned to scan at the beginning of the query; anything else would require on-the-fly plan surgery to a degree that seems unrealistic. That means that when a new partition is attached, already-running queries aren't going to scan it. If they did, we'd have big problems, because the transaction snapshot might see rows in those tables from an earlier time period during which that table wasn't attached. There's no guarantee that the data at that time conformed to the partition constraint, so it would be pretty problematic to let users see it. Conversely, when a partition is detached, there may still be scans from existing queries hitting it for a fairly arbitrary length of time afterwards. That may be surprising from a locking point of view or something, but it's correct as far as MVCC goes. Any changes made after the DETACH operation can't be visible to the snapshot being used for the scan. Now, what we could try to change on the fly is the set of partitions that are used for tuple routing. For instance, suppose we're inserting a long stream of COPY data. At some point, we attach a new partition from another session. If we then encounter a row that doesn't route to any of the partitions that existed at the time the query started, we could - instead of immediately failing - go and reload the set of partitions that are available for tuple routing and see if the new partition which was concurrently added happens to be appropriate to the tuple we've got. If so, we could route the tuple to it. But all of this looks optional. If new partitions aren't available for insert/update tuple routing until the start of the next query, that's not a catastrophe. The reverse direction might be more problematic: if a partition is detached, I'm not sure how sensible it is to keep routing tuples into it. On the flip side, what would break, really? Given the foregoing, I don't see why you need something like PartitionIsValid() at all, or why you need an algorithm similar to CREATE INDEX CONCURRENTLY. The problem seems mostly different. In the case of CREATE INDEX CONCURRENTLY, the issue is that any new tuples that get inserted while the index creation is in progress need to end up in the index, so you'd better not start building the index on the existing tuples until everybody who might insert new tuples knows about the index. I don't see that we have the same kind of problem in this case. Each partition is basically a separate table with its own set of indexes; as long as queries don't end up with one notion of which tables are relevant and a different notion of which indexes are relevant, we shouldn't end up with any table/index inconsistencies. And it's not clear to me what other problems we actually have here. To put it another way, if we've got the notion of "isvalid" for a partition, what's the difference between a partition that exists but is not yet valid and one that exists and is valid? I can't think of anything, and I have a feeling that you may therefore be inventing a lot of infrastructure that is not necessary. I'm inclined to think that we could drop the name CONCURRENTLY from this feature altogether and recast it as work to reduce the lock level associated with partition attach/detach. As long as we have a reasonable definition of what the semantics are for already-running queries, and clear documentation to go with those semantics, that seems fine. 
If a particular user finds the concurrent behavior too strange, they can always perform the DDL in a transaction that uses LOCK TABLE first, removing the concurrency. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
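[Editor's note: the optional tuple-routing idea above could look roughly like the following standalone C model. route_tuple, reload_bounds and the list-partitioning arrays are invented stand-ins; the real code works on PartitionBoundInfo. On a routing miss, reload the bounds once in case a partition was attached concurrently, and only error out if there is still no home for the tuple.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdio.h>

#define NO_PARTITION (-1)

/* Stand-in for LIST partition bounds: partition i accepts value bounds[i]. */
static int bounds[8];
static int nparts;

static int
route_tuple(int value)
{
    for (int i = 0; i < nparts; i++)
        if (bounds[i] == value)
            return i;
    return NO_PARTITION;
}

/* Pretend to re-read the catalogs; here a partition for value 2 was just attached. */
static void
reload_bounds(void)
{
    if (nparts == 1)
        bounds[nparts++] = 2;           /* simulate one concurrent ATTACH, exactly once */
}

/* Route with one retry after a reload; otherwise report the error case. */
static void
insert_value(int value)
{
    int part = route_tuple(value);

    if (part == NO_PARTITION)
    {
        reload_bounds();                /* a concurrent ATTACH may have added a home */
        part = route_tuple(value);
    }

    if (part == NO_PARTITION)
        printf("ERROR: no partition found for value %d\n", value);
    else
        printf("value %d routed to partition %d\n", value, part);
}

int
main(void)
{
    bounds[0] = 1;
    nparts = 1;                         /* only one partition existed when the COPY started */

    insert_value(1);                    /* routes normally */
    insert_value(2);                    /* succeeds after reloading the bounds */
    insert_value(3);                    /* still no partition: error */
    return 0;
}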
On 14 August 2018 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote: > I've thought about similar things, but I think there's a pretty deep > can of worms. For instance, if you built a relcache entry using the > transaction snapshot, you might end up building a seemingly-valid > relcache entry for a relation that has been dropped or rewritten. > When you try to access the relation data, you'll attempt to access > a relfilenode that's not there any more. Similarly, if you use an > older snapshot to build a partition descriptor, you might think that > relation OID 12345 is still a partition of that table when in fact > it's been detached - and, maybe, altered in other ways, such as > changing column types. Hmm, I guess for that to work correctly we'd need some way to allow older snapshots to see the changes if they've not already taken a lock on the table. If the lock had already been obtained then the ALTER TABLE to change the type of the column would get blocked by the existing lock. That kinda blows holes in applying the change only to snapshots newer than the ATTACH/DETACH's. > It seems to me that overall you're not really focusing on the right > set of issues here. I think the very first thing we need to worry > about is how we're going to keep the executor from following a bad > pointer and crashing. Any time the partition descriptor changes, the > next relcache rebuild is going to replace rd_partdesc and free the old > one, but the executor may still have the old pointer cached in a > structure or local variable; the next attempt to dereference it will > be looking at freed memory, and kaboom. Right now, we prevent this by > not allowing the partition descriptor to be modified while there are > any queries running against the partition, so while there may be a > rebuild, the old pointer will remain valid (cf. keep_partdesc). I > think that whatever scheme you/we choose here should be tested with a > combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions > -- one of them doing DDL on the table while the other runs a long > query. I did focus on that and did write a patch to solve the issue. After writing that I discovered another problem where if the PartitionDesc differed between planning and execution then run-time pruning did the wrong thing (see find_matching_subplans_recurse). The PartitionPruneInfo is built assuming the PartitionDesc matches between planning and execution. I moved on from the dangling pointer issue onto trying to figure out a way to ensure these are the same between planning and execution. > Once we keep it from blowing up, the second question is what the > semantics are supposed to be. It seems inconceivable to me that the > set of partitions that an in-progress query will scan can really be > changed on the fly. I think we're going to have to rule that if you > add or remove partitions while a query is running, we're going to scan > exactly the set we had planned to scan at the beginning of the query; > anything else would require on-the-fly plan surgery to a degree that > seems unrealistic. Trying to do that for in-progress queries would be pretty insane. I'm not planning on doing anything there. > That means that when a new partition is attached, > already-running queries aren't going to scan it. If they did, we'd > have big problems, because the transaction snapshot might see rows in > those tables from an earlier time period during which that table > wasn't attached.
There's no guarantee that the data at that time > conformed to the partition constraint, so it would be pretty > problematic to let users see it. Conversely, when a partition is > detached, there may still be scans from existing queries hitting it > for a fairly arbitrary length of time afterwards. That may be > surprising from a locking point of view or something, but it's correct > as far as MVCC goes. Any changes made after the DETACH operation > can't be visible to the snapshot being used for the scan. > > Now, what we could try to change on the fly is the set of partitions > that are used for tuple routing. For instance, suppose we're > inserting a long stream of COPY data. At some point, we attach a new > partition from another session. If we then encounter a row that > doesn't route to any of the partitions that existed at the time the > query started, we could - instead of immediately failing - go and > reload the set of partitions that are available for tuple routing and > see if the new partition which was concurrently added happens to be > appropriate to the tuple we've got. If so, we could route the tuple > to it. But all of this looks optional. If new partitions aren't > available for insert/update tuple routing until the start of the next > query, that's not a catastrophe. The reverse direction might be more > problematic: if a partition is detached, I'm not sure how sensible it > is to keep routing tuples into it. On the flip side, what would > break, really? Unsure about that, I don't really see what it would buy us, so presumably you're just considering that this might not be a roadblocking side-effect. However, I think the PartitionDesc needs to not change between planning and execution due to run-time pruning requirements, so if that's the case then what you're saying here is probably not an issue we need to think about. > Given the foregoing, I don't see why you need something like > PartitionIsValid() at all, or why you need an algorithm similar to > CREATE INDEX CONCURRENTLY. The problem seems mostly different. In > the case of CREATE INDEX CONCURRENTLY, the issue is that any new > tuples that get inserted while the index creation is in progress need > to end up in the index, so you'd better not start building the index > on the existing tuples until everybody who might insert new tuples > knows about the index. I don't see that we have the same kind of > problem in this case. Each partition is basically a separate table > with its own set of indexes; as long as queries don't end up with one > notion of which tables are relevant and a different notion of which > indexes are relevant, we shouldn't end up with any table/index > inconsistencies. And it's not clear to me what other problems we > actually have here. To put it another way, if we've got the notion of > "isvalid" for a partition, what's the difference between a partition > that exists but is not yet valid and one that exists and is valid? I > can't think of anything, and I have a feeling that you may therefore > be inventing a lot of infrastructure that is not necessary. Well, the problem is that you want REPEATABLE READ transactions to be exactly that. A concurrent attach/detach should not change the output of a query. I don't know for sure that some isvalid flag is required, but we do need something to ensure we don't change the results of queries run inside a repeatable read transaction. 
I did try to start moving away from the isvalid flag in favour of having a PartitionDesc just not change within the same snapshot but you've pointed out a few problems with what I tried to come up with for that. > I'm inclined to think that we could drop the name CONCURRENTLY from > this feature altogether and recast it as work to reduce the lock level > associated with partition attach/detach. As long as we have a > reasonable definition of what the semantics are for already-running > queries, and clear documentation to go with those semantics, that > seems fine. If a particular user finds the concurrent behavior too > strange, they can always perform the DDL in a transaction that uses > LOCK TABLE first, removing the concurrency. I did have similar thoughts but that seems like something to think about once the semantics are determined, not before. Thanks for your input on this. I clearly don't have all the answers on this so your input and thoughts are very valuable. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-Aug-13, Robert Haas wrote: > I think this is a somewhat confused analysis. We don't use > SnapshotAny for catalog scans, and we never have. We used to use > SnapshotNow, and we now use a current MVCC snapshot. What you're > talking about, I think, is possibly using the transaction snapshot > rather than a current MVCC snapshot for the catalog scans. > > I've thought about similar things, but I think there's a pretty deep > can of worms. For instance, if you built a relcache entry using the > transaction snapshot, you might end up building a seemingly-valid > relcache entry for a relation that has been dropped or rewritten. > When you try to access the relation data, you'll attempt to access > a relfilenode that's not there any more. Similarly, if you use an > older snapshot to build a partition descriptor, you might think that > relation OID 12345 is still a partition of that table when in fact > it's been detached - and, maybe, altered in other ways, such as > changing column types. I wonder if this all stems from a misunderstanding of what I suggested to David offlist. My suggestion was that the catalog scans would continue to use the catalog MVCC snapshot, and that the relcache entries would contain all the partitions that appear to the catalog; but each partition's entry would carry the Xid of the creating transaction in a field (say xpart), and that field is compared to the regular transaction snapshot: if xpart is visible to the transaction snapshot, then the partition is visible, otherwise not. So you never try to access a partition that doesn't exist, because those just don't appear at all in the relcache entry. But if you have an old transaction running with an old snapshot, and the partitioned table just acquired a new partition, then whether the partition will be returned as part of the partition descriptor or not depends on the visibility of its entry. I think that works fine for ATTACH without any further changes. I'm not so sure about DETACH, particularly when snapshots persist for a "long time" (a repeatable-read transaction). ISTM that in the above design, the partition descriptor would lose the entry for the detached partition ahead of time, which means queries would silently fail to see their data (though they wouldn't crash). I first thought this could be fixed by waiting for those snapshots to finish, but then I realized that there's no actual place where waiting achieves anything. Certainly it's not useful to wait before commit (because other snapshots are going to be starting all the time), and it's not useful to wait after the commit (because by then the catalog tuple is already gone). Maybe we need two transactions: mark partition as removed with an xmax of sorts, commit, wait for all snapshots, start transaction, remove partition catalog tuple, commit. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
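[Editor's note: a standalone C model of the xpart idea above. SnapshotModel, xid_visible_in_snapshot and the sample data are invented stand-ins; commit status and subtransactions are not modeled. The relcache-level list holds every partition, and each snapshot filters it by whether the creating xid is visible to that snapshot.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;                 /* stand-in */

typedef struct SnapshotModel
{
    TransactionId        xmin;                  /* all xids < xmin are decided */
    TransactionId        xmax;                  /* all xids >= xmax are invisible */
    const TransactionId *xip;                   /* in-progress xids in [xmin, xmax) */
    int                  xcnt;
} SnapshotModel;

/* Visibility of an (assumed committed) xid to a snapshot; clog is not modeled. */
static bool
xid_visible_in_snapshot(TransactionId xid, const SnapshotModel *snap)
{
    if (xid >= snap->xmax)
        return false;
    if (xid < snap->xmin)
        return true;
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return false;                       /* still running when the snapshot was taken */
    return true;
}

/* Build the partition list this snapshot should see from the full catalog list. */
static int
visible_partitions(const TransactionId *xpart, int nparts,
                   const SnapshotModel *snap, int *visible)
{
    int n = 0;

    for (int i = 0; i < nparts; i++)
        if (xid_visible_in_snapshot(xpart[i], snap))
            visible[n++] = i;
    return n;
}

int
main(void)
{
    const TransactionId xpart[] = {40, 150};    /* old partition, newly attached one */
    const SnapshotModel old_snap = {.xmin = 100, .xmax = 140, .xip = NULL, .xcnt = 0};
    const SnapshotModel new_snap = {.xmin = 160, .xmax = 200, .xip = NULL, .xcnt = 0};
    int visible[2];

    int n = visible_partitions(xpart, 2, &old_snap, visible);
    printf("old snapshot sees %d partition(s)\n", n);
    n = visible_partitions(xpart, 2, &new_snap, visible);
    printf("new snapshot sees %d partition(s)\n", n);
    return 0;
}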
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I wonder if this all stems from a misunderstanding of what I suggested > to David offlist. My suggestion was that the catalog scans would > continue to use the catalog MVCC snapshot, and that the relcache entries > would contain all the partitions that appear to the catalog; but each > partition's entry would carry the Xid of the creating transaction in a > field (say xpart), and that field is compared to the regular transaction > snapshot: if xpart is visible to the transaction snapshot, then the > partition is visible, otherwise not. So you never try to access a > partition that doesn't exist, because those just don't appear at all in > the relcache entry. But if you have an old transaction running with an > old snapshot, and the partitioned table just acquired a new partition, > then whether the partition will be returned as part of the partition > descriptor or not depends on the visibility of its entry. Hmm. One question is where you're going to get the XID of the creating transaction. If it's taken from the pg_class row or the pg_inherits row or something of that sort, then you risk getting a bogus value if something updates that row other than what you expect -- and the consequences of that are pretty bad here; for this to work as you intend, you need an exactly-correct value, not newer or older. An alternative is to add an xid field that stores the value explicitly, and that might work, but you'll have to arrange for that value to be frozen at the appropriate time. A further problem is that there could be multiple changes in quick succession. Suppose that a partition is attached, then detached before the attach operation is all-visible, then reattached, perhaps with different partition bounds. > I think that works fine for ATTACH without any further changes. I'm not > so sure about DETACH, particularly when snapshots persist for a "long > time" (a repeatable-read transaction). ISTM that in the above design, > the partition descriptor would lose the entry for the detached partition > ahead of time, which means queries would silently fail to see their data > (though they wouldn't crash). I don't see why they wouldn't crash. If the partition descriptor gets rebuilt and some partitions disappear out from under you, the old partition descriptor is going to get freed, and the executor has a cached pointer to it, so it seems like you are in trouble. > I first thought this could be fixed by > waiting for those snapshots to finish, but then I realized that there's > no actual place where waiting achieves anything. Certainly it's not > useful to wait before commit (because other snapshots are going to be > starting all the time), and it's not useful to wait after the commit > (because by then the catalog tuple is already gone). Maybe we need two > transactions: mark partition as removed with an xmax of sorts, commit, > wait for all snapshots, start transaction, remove partition catalog > tuple, commit. And what would that accomplish, exactly? Waiting for all snapshots would ensure that all still-running transactions see the xmax with which the partition has been marked as removed, but what good does that do? In order to have a plausible algorithm, you have to describe both what the ATTACH/DETACH operation does and what the other concurrent transactions do and how those things interact. Otherwise, it's like saying that we're going to solve a problem with X and Y overlapping by having X take a lock.
If Y doesn't take a conflicting lock, this does nothing. Generally, I think I see what you're aiming at: make ATTACH and DETACH have MVCC-like semantics with respect to concurrent transactions. I don't think that's a dumb idea from a theoretical perspective, but in practice I think it's going to be very difficult to implement. We have no other DDL that has such semantics, and there's no reason we couldn't; for example, TRUNCATE could work with SUEL and transactions that can't see the TRUNCATE as committed continue to operate on the old heap. While we could do such things, we don't. If you decide to do them here, you've probably got a lot of work ahead of you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 August 2018 at 13:59, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I wonder if this all stems from a misunderstanding of what I suggested >> to David offlist. My suggestion was that the catalog scans would >> continue to use the catalog MVCC snapshot, and that the relcache entries >> would contain all the partitions that appear to the catalog; but each >> partition's entry would carry the Xid of the creating transaction in a >> field (say xpart), and that field is compared to the regular transaction >> snapshot: if xpart is visible to the transaction snapshot, then the >> partition is visible, otherwise not. So you never try to access a >> partition that doesn't exist, because those just don't appear at all in >> the relcache entry. But if you have an old transaction running with an >> old snapshot, and the partitioned table just acquired a new partition, >> then whether the partition will be returned as part of the partition >> descriptor or not depends on the visibility of its entry. > > Hmm. One question is where you're going to get the XID of the > creating transaction. If it's taken from the pg_class row or the > pg_inherits row or something of that sort, then you risk getting a > bogus value if something updates that row other than what you expect > -- and the consequences of that are pretty bad here; for this to work > as you intend, you need an exactly-correct value, not newer or older. > An alternative is to add an xid field that stores the value > explicitly, and that might work, but you'll have to arrange for that > value to be frozen at the appropriate time. > > A further problem is that there could be multiple changes in quick > succession. Suppose that a partition is attached, then detached > before the attach operation is all-visible, then reattached, perhaps > with different partition bounds. I should probably post the WIP I have here. In those patches, I do have the xmin array in the PartitionDesc. This gets taken from the new pg_partition table, which I don't think suffers from the same issue as taking it from pg_class, since nothing else will update the pg_partition record. However, I don't think the xmin array is going to work if we include it in the PartitionDesc. The problem, as I discovered from writing the code, is that the PartitionDesc must remain exactly the same between planning and execution. If there are any more or any fewer partitions found during execution than what we saw in planning, then run-time pruning will access the wrong element in the PartitionPruneInfo array, or perhaps access off the end of the array. It might be possible to work around that by identifying partitions by Oid rather than PartitionDesc array index, but the run-time pruning code is already pretty complex. I think coding it to work when the PartitionDesc does not match between planning and execution is just going to be too difficult to get right. Tom is already unhappy with the complexity of ExecFindInitialMatchingSubPlans().

I think the solution will require that the PartitionDesc does not:
a) change between planning and execution; or
b) change during a snapshot after the partitioned table has been locked.
With b, it sounds like we'll need to take the most recent PartitionDesc even if the transaction is older than the one that did the ATTACH/DETACH operation, since if we use an old version then, as Robert mentions, there's nothing to stop another transaction making changes to the table that make it an incompatible partition, e.g. DROP COLUMN. This wouldn't be possible if we update the PartitionDesc right after taking the first lock on the partitioned table, since any transaction doing DROP COLUMN would be blocked until the other snapshot gets released. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
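[Editor's note: a standalone C sketch of the Oid-matching alternative mentioned above (not the actual run-time pruning code; build_partition_map and the sample Oid arrays are invented stand-ins). It maps each planner-time partition index to its executor-time index by Oid, so a partition detached in between shows up as a hole rather than shifting every later index.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;                   /* stand-in for PostgreSQL's Oid */

/*
 * For each partition the planner saw, find its position in the PartitionDesc
 * the executor sees; -1 means it was detached between planning and execution.
 */
static void
build_partition_map(const Oid *plan_oids, int nplan,
                    const Oid *exec_oids, int nexec,
                    int *plan_to_exec)
{
    for (int i = 0; i < nplan; i++)
    {
        plan_to_exec[i] = -1;
        for (int j = 0; j < nexec; j++)
        {
            if (plan_oids[i] == exec_oids[j])
            {
                plan_to_exec[i] = j;
                break;
            }
        }
    }
}

int
main(void)
{
    /* Planner saw three partitions; the middle one was detached before execution. */
    const Oid plan_oids[] = {16401, 16405, 16409};
    const Oid exec_oids[] = {16401, 16409, 16413};      /* and a new one was attached */
    int       plan_to_exec[3];

    build_partition_map(plan_oids, 3, exec_oids, 3, plan_to_exec);
    for (int i = 0; i < 3; i++)
        printf("planner partition %d -> executor index %d\n", i, plan_to_exec[i]);
    return 0;
}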
Hello

Here's my take on this feature, building on David Rowley's version.

Firstly, I took Robert's advice and removed the CONCURRENTLY keyword from the syntax. We just do it that way always. When there's a default partition, only that partition is locked with an AEL; all the rest is locked with ShareUpdateExclusive only. I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The partition descriptor cache module (partcache.c) now contains a long-lived hash table that lists all the current partition descriptors; when an invalidation message is received for a relation, we unlink the partdesc from the hash table *but do not free it*. The hash-table-linked partdesc is rebuilt again in the future, when requested, so many copies might exist in memory for one partitioned table.

2. Snapshots have their own cache (hash table) of partition descriptors. If a partdesc is requested and the snapshot has already obtained that partdesc, the original one is returned -- we don't request a new one from partcache.

Then there are a few other implementation details worth mentioning:

3. Parallel query: when a worker starts on a snapshot that has a partition descriptor cache, we need to transmit those partdescs from the leader via shmem ... but we cannot send the full struct, so we just send the OID list of partitions, then rebuild the descriptor in the worker. Side effect: if a partition is detached right between the leader taking the partdesc and the worker starting, the partition loses its relpartbound column, so it's not possible to reconstruct the partdesc. In this case, we raise an error. Hopefully this should be rare.

4. If a partitioned table is dropped, but was listed in a snapshot's partdesc cache, and then parallel query starts, the worker will try to restore the partdesc for that table, but there are no catalog rows for it. The implementation choice here is to ignore the table and move on. I would like to just remove the partdesc from the snapshot, but that would require a relcache inval callback, and a) it'd kill us to scan all snapshots for every relation drop; b) it doesn't work anyway because we don't have any way to distinguish invals arriving because of DROP from invals arriving because of anything else, say ANALYZE.

5. Snapshots are copied a lot. Copies share the same hash table as the "original", because surely all copies should see the same partition descriptor. This leads to the pinning/unpinning business you see for the structs in snapmgr.c.

Some known defects:

6. This still leaks memory. Not as terribly as my earlier prototypes, but clearly it's something that I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c to prevent future memory leak mistakes. Not done yet. Closely related to item 6.

8. Header changes may need some cleanup yet -- e.g. I'm not sure snapmgr.h compiles standalone.

9. David Rowley recently pointed out that we can modify CREATE TABLE .. PARTITION OF to likewise not obtain an AEL anymore. Apparently it just requires removal of three lines in MergeAttributes.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
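[Editor's note: idea 2 above (the per-snapshot cache) could be modeled in standalone C roughly as follows. SnapshotModel, snapshot_get_partdesc and the fixed-size table are invented stand-ins for the real hash table. The first request under a snapshot builds and stores the descriptor, and every later request under the same snapshot gets back that same copy, so the set of partitions cannot change mid-snapshot.]

/* Standalone sketch; compile with: cc demo.c && ./a.out */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t Oid;                   /* stand-in */

typedef struct PartDescModel
{
    int nparts;
} PartDescModel;

#define MAX_CACHED 8

typedef struct SnapshotModel
{
    struct { Oid relid; PartDescModel *pd; } cache[MAX_CACHED];
    int ncached;
} SnapshotModel;

/* Stand-in for partcache: pretend the catalogs currently show this many parts. */
static int current_catalog_nparts = 2;

static PartDescModel *
build_partdesc(Oid relid)
{
    PartDescModel *pd = malloc(sizeof(PartDescModel));

    (void) relid;
    pd->nparts = current_catalog_nparts;
    return pd;
}

/* Return the snapshot's cached descriptor, building it on first request. */
static PartDescModel *
snapshot_get_partdesc(SnapshotModel *snap, Oid relid)
{
    for (int i = 0; i < snap->ncached; i++)
        if (snap->cache[i].relid == relid)
            return snap->cache[i].pd;           /* same copy for the whole snapshot */

    PartDescModel *pd = build_partdesc(relid);

    if (snap->ncached < MAX_CACHED)
    {
        snap->cache[snap->ncached].relid = relid;
        snap->cache[snap->ncached].pd = pd;
        snap->ncached++;
    }
    return pd;
}

int
main(void)
{
    SnapshotModel snap = {.ncached = 0};
    Oid           listp = 16400;

    PartDescModel *first = snapshot_get_partdesc(&snap, listp);

    current_catalog_nparts = 3;         /* a concurrent ATTACH happens */

    PartDescModel *second = snapshot_get_partdesc(&snap, listp);
    printf("same descriptor within the snapshot: %s (nparts = %d)\n",
           first == second ? "yes" : "no", second->nparts);

    for (int i = 0; i < snap.ncached; i++)
        free(snap.cache[i].pd);
    return 0;
}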
On Thu, Oct 25, 2018 at 4:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Firstly, I took Robert's advice and removed the CONCURRENTLY keyword > from the syntax. We just do it that way always. When there's a default > partition, only that partition is locked with an AEL; all the rest is > locked with ShareUpdateExclusive only. Check. > Then there are a few other implementation details worth mentioning: > > 3. Parallel query: when a worker starts on a snapshot that has a > partition descriptor cache, we need to transmit those partdescs from > the leader via shmem ... but we cannot send the full struct, so we just send > the OID list of partitions, then rebuild the descriptor in the worker. > Side effect: if a partition is detached right between the leader taking > the partdesc and the worker starting, the partition loses its > relpartbound column, so it's not possible to reconstruct the partdesc. > In this case, we raise an error. Hopefully this should be rare. I don't think it's a good idea for parallel query to just randomly fail in cases where a non-parallel query would have worked. I tried pretty hard to avoid that while working on the feature, and it would be a shame to see that work undone. It strikes me that it would be a good idea to break this work into two phases. In phase 1, let's support ATTACH and CREATE TABLE .. PARTITION OF without requiring AccessExclusiveLock. In phase 2, think about concurrency for DETACH (and possibly DROP). I suspect phase 1 actually isn't that hard. It seems to me that the only thing we REALLY need to ensure is that the executor doesn't blow up if a relcache reload occurs. There are probably a few different approaches to that problem, but I think it basically boils down to (1) making sure that the executor is holding onto pointers to the exact objects it wants to use and not re-finding them through the relcache and (2) making sure that the relcache doesn't free and rebuild those objects but rather holds onto the existing copies. With this approach, already-running queries won't take into account the fact that new partitions have been added, but that seems at least tolerable and perhaps desirable. For phase 2, we're not just talking about adding stuff that need not be used immediately, but about removing stuff which may already be in use. Your email doesn't seem to describe what we want the *behavior* to be in that case. Leave aside for a moment the issue of not crashing: what are the desired semantics? I think it would be pretty strange if you had a COPY running targeting a partitioned table, detached a partition, and the COPY continued to route tuples to the detached partition even though it was now an independent table. It also seems pretty strange if the tuples just get thrown away. If the COPY isn't trying to send any tuples to the now-detached partition, then it's fine, but if it is, then I have trouble seeing any behavior other than an error as sane, unless perhaps a new partition has been attached or created for that part of the key space. If you adopt that proposal, then the problem of parallel query behaving differently from non-parallel query goes away. You just get an error in both cases, probably to the effect that there is no (longer a) partition matching the tuple you are trying to insert (or update). If you're not hacking on this patch set too actively right at the moment, I'd like to spend some time hacking on the CREATE/ATTACH side of things and see if I can produce something committable for that portion of the problem.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote: > If you're not hacking on this patch set too actively right at the > moment, I'd like to spend some time hacking on the CREATE/ATTACH side > of things and see if I can produce something committable for that > portion of the problem. I'm not -- feel free to hack away. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 6 Nov 2018 at 10:10, Robert Haas <robertmhaas@gmail.com> wrote:
> With this approach, already-running queries won't take into account the fact that new partitions have been added, but that seems at least tolerable and perhaps desirable.
Desirable, imho. No data added after a query starts would be visible.
> If the COPY isn't trying to send any tuples to the now-detached partition, then it's fine, but if it is, then I have trouble seeing any behavior other than an error as sane, unless perhaps a new partition has been attached or created for that part of the key space.
Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.
Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.

Yeah, the COPY.

> Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.

In my view, it's not just a question of it being annoying, but of whether anything else is even sensible. I mean, you can avoid an error when a user types SELECT 1/0 by returning NULL or 42, but that's not usually how we roll around here.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 6 Nov 2018 at 10:56, Robert Haas <robertmhaas@gmail.com> wrote:
--
> On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>> Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.
>
> Yeah, the COPY.
>
>> Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.
>
> In my view, it's not just a question of it being annoying, but of
> whether anything else is even sensible. I mean, you can avoid an
> error when a user types SELECT 1/0 by returning NULL or 42, but that's
> not usually how we roll around here.
If you can remove the ERROR without any other adverse effects, that sounds great.
Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote: > If you can remove the ERROR without any other adverse effects, that sounds great. > > Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks Well, I've already written about this in two previous emails on this thread, so I'm not sure exactly what you think is missing. But to state the problem again: If you don't throw an error when a partition is concurrently detached and then someone routes a tuple to that portion of the key space, what DO you do? Continue inserting tuples into the table even though it's no longer a partition? Throw tuples destined for that partition away? You can make an argument for both of those behaviors, but they're both pretty strange. The first one means that for an arbitrarily long period of time after detaching a partition, the partition may continue to receive inserts that were destined for its former parent. The second one means that your data can disappear into the ether. I don't like either of those things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote: > If you don't throw an error when a partition is concurrently detached > and then someone routes a tuple to that portion of the key space, what > DO you do? Continue inserting tuples into the table even though it's > no longer a partition? Yes -- the table was a partition when the query started, so it's still a partition from the point of view of that query's snapshot. > Throw tuples destined for that partition away? Surely not. (/me doesn't beat straw men anyway.) > You can make an argument for both of those behaviors, but they're > both pretty strange. The first one means that for an arbitrarily long > period of time after detaching a partition, the partition may continue > to receive inserts that were destined for its former parent. Not arbitrarily long -- only as long as those old snapshots live. I don't find this at all surprising. (I think DETACH is not related to DROP in any way. My proposal is that DETACH can work concurrently, and if people want to drop the partition later they can wait until snapshots/queries that could see that partition are gone.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 2:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Nov-06, Robert Haas wrote:
>> If you don't throw an error when a partition is concurrently detached
>> and then someone routes a tuple to that portion of the key space, what
>> DO you do? Continue inserting tuples into the table even though it's
>> no longer a partition?
>
> Yes -- the table was a partition when the query started, so it's still
> a partition from the point of view of that query's snapshot.

I think it's important to point out that DDL does not in general respect the query snapshot. For example, you can query a table that was created by a transaction not visible to your query snapshot. You cannot query a table that was dropped by a transaction not visible to your query snapshot. If someone runs ALTER FUNCTION on a function your query uses, you get the latest committed version, not the version that was current at the time your query snapshot was created. So, if we go with the semantics you are proposing here, we will be making this DDL behave differently from pretty much all other DDL.

Possibly that's OK in this case, but it's easy to think of other cases where it could cause problems. To take an example that I believe was discussed on-list a number of years ago, suppose that ADD CONSTRAINT worked according to the model that you are proposing for ATTACH PARTITION. If it did, then one transaction could be concurrently inserting a tuple while another transaction was adding a constraint which the tuple fails to satisfy. Once both transactions commit, you have a table with a supposedly-valid constraint and a tuple inside of it that doesn't satisfy that constraint. Obviously, that's no good.

I'm not entirely sure whether there are any similar dangers in the case of DETACH PARTITION. I think it depends a lot on what can be done with that detached partition while the overlapping transaction is still active. For instance, suppose you attached it to the original table with a different set of partition bounds, or attached it to some other table with a different set of partition bounds. If you can do that, then I think it effectively creates the problem described in the previous paragraph with respect to the partition constraint. IOW, we've got to somehow prevent this:

setup: partition is attached with bounds 1 to a million
S1: COPY begins
S2: partition is detached
S2: partition is reattached with bounds 1 to a thousand
S1: still-running copy inserts a tuple with value ten thousand

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 6 Nov 2018 at 11:06, Robert Haas <robertmhaas@gmail.com> wrote:
--
> On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>> If you can remove the ERROR without any other adverse effects, that sounds great.
>>
>> Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks
>
> Well, I've already written about this in two previous emails on this
> thread, so I'm not sure exactly what you think is missing. But to
> state the problem again:
I was discussing the ERROR in relation to parallel query, not COPY.
I didn't understand how that would be achieved.
Thanks for working on this.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <andres@anarazel.de> wrote:
> One approach would be to make sure that everything relying on rt_partdesc
> staying the same stores its value in a local variable, and then *not* free
> the old version of rt_partdesc (etc) when the refcount > 0, but delay that
> to the RelationClose() that makes refcount reach 0. That'd be the start of
> a framework for more such concurrent handling.

Some analysis of possible trouble spots:

- get_partition_dispatch_recurse and ExecCreatePartitionPruneState both call RelationGetPartitionDesc. Presumably, this means that if the partition descriptor gets updated on the fly, the tuple routing and partition dispatch code could end up with different ideas about which partitions exist. I think this should be fixed somehow, so that we only call RelationGetPartitionDesc once per query and use the result for everything.

- expand_inherited_rtentry checks RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls expand_partitioned_rtentry which fetches the same PartitionDesc again. We can probably just do this once in the caller and pass the result down.

- set_relation_partition_info also calls RelationGetPartitionDesc. Off-hand, I think this code runs after expand_inherited_rtentry. Not sure what to do about this. I'm not sure what the consequences would be if this function and that one had different ideas about the partition descriptor.

- tablecmds.c is pretty free about calling RelationGetPartitionDesc repeatedly, but it probably doesn't matter. If we're doing some kind of DDL that depends on the contents of the partition descriptor, we *had better* be holding a lock strong enough to prevent the partition descriptor from being changed by somebody else at the same time. Allowing a partition to be added concurrently with DML is one thing; allowing a partition to be added concurrently with adding another partition is a whole different level of insanity. I think we'd be best advised not to go down that rathole - among other concerns, how would you even guarantee that the partitions being added didn't overlap?

Generally: Is it really OK to postpone freeing the old partition descriptor until the relation reference count goes to 0? I wonder if there are cases where this could lead to tons of copies of the partition descriptor floating around at the same time, gobbling up precious cache memory. My first thought was that this would be pretty easy: just create a lot of new partitions one by one while some long-running transaction is open. But the actual result in that case depends on the behavior of the backend running the transaction. If it just ignores the new partitions and sticks with the partition descriptor it has got, then probably nothing else will request the new partition descriptor either and there will be no accumulation of memory. However, if it tries to absorb the updated partition descriptor, but without being certain that the old one can be freed, then we'd have a query-lifespan memory leak which is quadratic in the number of new partitions. Maybe even that would be OK -- we could suppose that the number of new partitions probably wouldn't be all THAT crazy large, and the constant factor not too bad, so maybe you'd leak a couple of MB for the length of the query, but no more. However, I wonder if it would be better to give each PartitionDescData its own refcnt, so that it can be freed immediately when the refcnt goes to zero.
That would oblige every caller of RelationGetPartitionDesc() to later call something like ReleasePartitionDesc(). We could catch failures to do that by keeping all the PartitionDesc objects so far created in a list. When the main entry's refcnt goes to 0, cross-check that this list is empty; if not, then the remaining entries have non-zero refcnts that were leaked. We could emit a WARNING as we do in similar cases. In general, I think something along the lines you are suggesting here is the right place to start attacking this problem. Effectively, we immunize the system against the possibility of new entries showing up in the partition descriptor while concurrent DML is running; the semantics are that the new partitions are ignored for the duration of currently-running queries. This seems to allow for painless creation or addition of new partitions in normal cases, but not when a default partition exists. In that case, using the old PartitionDesc is outright wrong, because adding a new toplevel partition changes the default partition's partition constraint. We can't insert into the default partition a tuple that under the updated table definition needs to go someplace else. It seems like the best way to account for that is to reduce the lock level on the partitioned table to ShareUpdateExclusiveLock, but leave the lock level on any default partition as AccessExclusiveLock (because we are modifying a constraint on it). We would also need to leave the lock level on the new partition as AccessExclusiveLock (because we are adding a constraint on it). Not perfect, for sure, but not bad for a first patch, either; it would improve things for users in a bunch of practical cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
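To make the refcounting idea slightly more concrete, here is a rough, untested sketch; the refcnt and outdated fields and both helper functions are invented for illustration, not existing code:

    typedef struct PartitionDescData
    {
        int         refcnt;         /* invented: number of callers still using this copy */
        bool        outdated;       /* invented: a newer copy has superseded this one */
        int         nparts;         /* number of partitions */
        Oid        *oids;           /* nparts partition OIDs */
        PartitionBoundInfo boundinfo;
    } PartitionDescData;

    /* RelationGetPartitionDesc() would bump the count ... */
    static PartitionDesc
    GetPartitionDescRef(Relation rel)
    {
        rel->rd_partdesc->refcnt++;
        return rel->rd_partdesc;
    }

    /* ... and every caller would have to pair it with something like this. */
    static void
    ReleasePartitionDesc(PartitionDesc pd)
    {
        Assert(pd->refcnt > 0);
        if (--pd->refcnt == 0 && pd->outdated)
            pfree(pd);              /* freed as soon as the last user lets go */
    }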
On 2018-Nov-06, Alvaro Herrera wrote: > On 2018-Nov-06, Robert Haas wrote: > > Throw tuples destined for that partition away? > > Surely not. (/me doesn't beat straw men anyway.) Hmm, apparently this can indeed happen with my patch :-( -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 10:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > Throw tuples destined for that partition away? > > Surely not. (/me doesn't beat straw men anyway.) > > Hmm, apparently this can indeed happen with my patch :-( D'oh. This is a hard problem, especially the part of it that involves handling detach, so I wouldn't feel too bad about that. However, to beat this possibly-dead horse a little more, I think you made the error of writing a patch that (1) tried to solve too many problems at once and (2) didn't seem to really have a clear, well-considered idea about what the semantics ought to be. This is not intended as an attack; I want to work with you to solve the problem, not have a fight about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 6, 2018 at 5:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc. Presumably, this means that if
> the partition descriptor gets updated on the fly, the tuple routing
> and partition dispatch code could end up with different ideas about
> which partitions exist. I think this should be fixed somehow, so that
> we only call RelationGetPartitionDesc once per query and use the
> result for everything.

I think there is deeper trouble here. ExecSetupPartitionTupleRouting() calls find_all_inheritors() to acquire RowExclusiveLock on the whole partitioning hierarchy. It then calls RelationGetPartitionDispatchInfo (as a non-relcache function, this seems poorly named) which calls get_partition_dispatch_recurse, which does this:

    /*
     * We assume all tables in the partition tree were already locked
     * by the caller.
     */
    Relation    partrel = heap_open(partrelid, NoLock);

That seems OK at present, because no new partitions can have appeared since ExecSetupPartitionTupleRouting() acquired locks. But if we allow new partitions to be added with only ShareUpdateExclusiveLock, then I think there would be a problem. If a new partition OID creeps into the partition descriptor after find_all_inheritors() and before we fetch its partition descriptor, then we wouldn't have previously taken a lock on it and would still be attempting to open it without a lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).

Admittedly, it might be a bit hard to provoke a failure here because I'm not exactly sure how you could trigger a relcache reload in the critical window, but I don't think we should rely on that.

More generally, it seems a bit strange that we take the approach of locking the entire partitioning hierarchy here regardless of which relations the query actually knows about. If some relations have been pruned, presumably we don't need to lock them; if/when we permit concurrent partition, we don't need to lock any new ones that have materialized. We're just going to end up ignoring them anyway because there's nothing to do with the information that they are or are not excluded from the query when they don't appear in the query plan in the first place.

Furthermore, this whole thing looks suspiciously like more of the sort of redundant locking that f2343653f5b2aecfc759f36dbb3fd2a61f36853e attempted to eliminate. In light of that commit message, I'm wondering whether the best approach would be to [1] get rid of the find_all_inheritors call altogether and [2] somehow ensure that get_partition_dispatch_recurse() doesn't open any tables that aren't part of the query's range table.

Thoughts? Comments? Ideas?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote: > - get_partition_dispatch_recurse and ExecCreatePartitionPruneState > both call RelationGetPartitionDesc. My patch deals with this by caching descriptors in the active snapshot. So those two things would get the same partition descriptor. There's no RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its place. (I tried to use different scoping than the active snapshot; I first tried the Portal, then I tried the resource owner. But nothing seems to fit as precisely as the active snapshot.) > - expand_inherited_rtentry checks > RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls > expand_partitioned_rtentry which fetches the same PartitionDesc again. This can be solved by changing the test to a relkind one, as my patch does. > - set_relation_partition_info also calls RelationGetPartitionDesc. > Off-hand, I think this code runs after expand_inherited_rtentry. Not > sure what to do about this. I'm not sure what the consequences would > be if this function and that one had different ideas about the > partition descriptor. Snapshot caching, like in my patch, again solves this problem. > - tablecmds.c is pretty free about calling RelationGetPartitionDesc > repeatedly, but it probably doesn't matter. If we're doing some kind > of DDL that depends on the contents of the partition descriptor, we > *had better* be holding a lock strong enough to prevent the partition > descriptor from being changed by somebody else at the same time. My patch deals with this by unlinking the partcache entry from the hash table on relation invalidation, so DDL code would obtain a fresh copy each time (lookup_partcache_entry). In other words, I already solved these problems you list. Maybe you could give my patch a look. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 7, 2018 at 12:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Nov-06, Robert Haas wrote: > > - get_partition_dispatch_recurse and ExecCreatePartitionPruneState > > both call RelationGetPartitionDesc. > > My patch deals with this by caching descriptors in the active snapshot. > So those two things would get the same partition descriptor. There's no > RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its > place. > > (I tried to use different scoping than the active snapshot; I first > tried the Portal, then I tried the resource owner. But nothing seems to > fit as precisely as the active snapshot.) ... > In other words, I already solved these problems you list. > > Maybe you could give my patch a look. I have, a bit. One problem I'm having is that while you explained the design you chose in a fair amount of detail, you didn't give a lot of explanation (that I have seen) of the reasons why you chose that design. If there's a README or a particularly good comment someplace that I should be reading to understand that better, please point me in the right direction. And also, I just don't really understand what all the problems are yet. I'm only starting to study this. I am a bit skeptical of your approach, though. Tying it to the active snapshot seems like an awfully big hammer. Snapshot manipulation can be a performance bottleneck both in terms of actual performance and also in terms of code complexity, and I don't really like the idea of adding more code there. It's not a sustainable pattern for making DDL work concurrently, either -- I'm pretty sure we don't want to add new code to things like GetLatestSnapshot() every time we want to make a new kind of DDL concurrent. Also, while hash table lookups are pretty cheap, they're not free. In my opinion, to the extent that we can, it would be better to refactor things to avoid duplicate lookups of the PartitionDesc rather than to install a new subsystem that tries to make sure they always return the same answer. Such an approach seems to have other possible advantages. For example, if a COPY is running and a new partition shows up, we might actually want to allow tuples to be routed to it. Maybe that's too pie in the sky, but if we want to preserve the option to do such things in the future, a hard-and-fast rule that the apparent partition descriptor doesn't change unless the snapshot changes seems like it might get in the way. It seems better to me to have a system where code that accesses the relcache has a choice, so that at its option it can either hang on to the PartitionDesc it has or get a new one that may be different. If we can do things that way, it gives us the most flexibility. After the poking around I've done over the last 24 hours, I do see that there are some non-trivial problems with making it that way, but I'm not really ready to give up yet. Does that make sense to you, or am I all wet here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8 November 2018 at 05:05, Robert Haas <robertmhaas@gmail.com> wrote: > That seems OK at present, because no new partitions can have appeared > since ExecSetupPartitionTupleRouting() acquired locks. But if we > allow new partitions to be added with only ShareUpdateExclusiveLock, > then I think there would be a problem. If a new partition OID creeps > into the partition descriptor after find_all_inheritors() and before > we fetch its partition descriptor, then we wouldn't have previously > taken a lock on it and would still be attempting to open it without a > lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839). > > Admittedly, it might be a bit hard to provoke a failure here because > I'm not exactly sure how you could trigger a relcache reload in the > critical window, but I don't think we should rely on that. > > More generally, it seems a bit strange that we take the approach of > locking the entire partitioning hierarchy here regardless of which > relations the query actually knows about. If some relations have been > pruned, presumably we don't need to lock them; if/when we permit > concurrent partition, we don't need to lock any new ones that have > materialized. We're just going to end up ignoring them anyway because > there's nothing to do with the information that they are or are not > excluded from the query when they don't appear in the query plan in > the first place. While the find_all_inheritors() call is something I'd like to see gone, I assume it was done that way since an UPDATE might route a tuple to a partition that there is no subplan for and due to INSERT with VALUES not having any RangeTblEntry for any of the partitions. Simply, any partition which is a descendant of the target partition table could receive the tuple regardless of what might have been pruned. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 7, 2018 at 7:06 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> While the find_all_inheritors() call is something I'd like to see
> gone, I assume it was done that way since an UPDATE might route a
> tuple to a partition that there is no subplan for and due to INSERT
> with VALUES not having any RangeTblEntry for any of the partitions.
> Simply, any partition which is a descendant of the target partition
> table could receive the tuple regardless of what might have been
> pruned.

Thanks. I had figured out since my email of earlier today that it was needed in the INSERT case, but I had not thought of/discovered the case of an UPDATE that routes a tuple to a pruned partition. I think that latter case may not be tested in our regression tests, which is perhaps something we ought to change.

Honestly, I *think* that the reason that find_all_inheritors() call is there is because I had the idea that it was important to try to lock partition hierarchies in the same order in all cases so as to avoid spurious deadlocks. However, I don't think we're really achieving that goal despite this code. If we arrive at this point having already locked some relations, and then lock some more, based on whatever got pruned, we're clearly not using a deterministic locking order. So I think we could probably rip out the find_all_inheritors() call here and change the NoLock in get_partition_dispatch_recurse() to just take a lock. That's probably a worthwhile simplification and a slight optimization regardless of anything else.

But I really think it would be better if we could also jigger this to avoid reopening relations which the executor has already opened and locked elsewhere. Unfortunately, I don't see a really simple way to accomplish that. We get the OIDs of the descendants and want to know whether there is a range table entry for that OID; but there's no data structure which answers that question at present, I believe, and introducing one just for this purpose seems like an awful lot of new machinery. Perhaps that new machinery would still have less far-reaching consequences than the machinery Alvaro is proposing, but, still, it's not very appealing.

Perhaps one idea is to only open and lock partitions on demand - i.e. if a tuple actually gets routed to them. There are good reasons to do that independently of reducing lock levels, and we certainly couldn't do it without having some efficient way to check whether it had already been done. So then the mechanism wouldn't feel so much like a special-purpose hack just for concurrent ATTACH/DETACH. (Was Amit Langote already working on this, or was that some other kind of on-demand locking?)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
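For illustration, the on-demand approach might be shaped roughly like the following untested sketch; partoid, partidx and init_partition_result_rel are placeholders, not the real tuple-routing API:

    /*
     * Hypothetical: open and lock a partition only when the first tuple is
     * actually routed to it, remembering the result so we do it just once.
     */
    if (proute->partitions[partidx] == NULL)
    {
        Relation    partrel = heap_open(partoid, RowExclusiveLock);

        proute->partitions[partidx] = init_partition_result_rel(partrel);  /* invented helper */
    }
    /* route the tuple using proute->partitions[partidx] ... */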
On 8 November 2018 at 15:01, Robert Haas <robertmhaas@gmail.com> wrote: > Honestly, I *think* that the reason that find_all_inheritors() call is > there is because I had the idea that it was important to try to lock > partition hierarchies in the same order in all cases so as to avoid > spurious deadlocks. However, I don't think we're really achieving > that goal despite this code. If we arrive at this point having > already locked some relations, and then lock some more, based on > whatever got pruned, we're clearly not using a deterministic locking > order. So I think we could probably rip out the find_all_inheritors() > call here and change the NoLock in get_partition_dispatch_recurse() to > just take a lock. That's probably a worthwhile simplification and a > slight optimization regardless of anything else. I'd not thought of the locks taken elsewhere case. I guess it just perhaps reduces the chances of a deadlock then. A "slight optimization" is one way to categorise it. There are some benchmarks you might find interesting in [1] and [2]. Patch 0002 does just what you mention. [1] https://www.postgresql.org/message-id/06524959-fda8-cff9-6151-728901897b79%40redhat.com [2] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/11/08 11:01, Robert Haas wrote: > On Wed, Nov 7, 2018 at 7:06 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: >> While the find_all_inheritors() call is something I'd like to see >> gone, I assume it was done that way since an UPDATE might route a >> tuple to a partition that there is no subplan for and due to INSERT >> with VALUES not having any RangeTblEntry for any of the partitions. >> Simply, any partition which is a descendant of the target partition >> table could receive the tuple regardless of what might have been >> pruned. > > Thanks. I had figured out since my email of earlier today that it was > needed in the INSERT case, but I had not thought of/discovered the > case of an UPDATE that routes a tuple to a pruned partition. I think > that latter case may not be tested in our regression tests, which is > perhaps something we ought to change. > > Honestly, I *think* that the reason that find_all_inheritors() call is > there is because I had the idea that it was important to try to lock > partition hierarchies in the same order in all cases so as to avoid > spurious deadlocks. However, I don't think we're really achieving > that goal despite this code. If we arrive at this point having > already locked some relations, and then lock some more, based on > whatever got pruned, we're clearly not using a deterministic locking > order. So I think we could probably rip out the find_all_inheritors() > call here and change the NoLock in get_partition_dispatch_recurse() to > just take a lock. That's probably a worthwhile simplification and a > slight optimization regardless of anything else. A patch that David and I have been working on over at: https://commitfest.postgresql.org/20/1690/ does that. With that patch, partitions (leaf or not) are locked and opened only if a tuple is routed to them. In edd44738bc (Be lazier about partition tuple routing), we postponed the opening of leaf partitions, but we still left the RelationGetPartitionDispatchInfo machine which recursively creates PartitionDispatch structs for all partitioned tables in a tree. The patch mentioned above postpones even the partitioned partition initialization to a point after a tuple is routed to it. The patch doesn't yet eliminate the find_all_inheritors call from ExecSetupPartitionTupleRouting. But that's mostly because of the fear that if we start becoming lazier about locking individual partitions too, we'll get non-deterministic locking order on partitions that we might want to avoid for deadlock fears. Maybe, we don't need to be fearful though. > But I really think it would be better if we could also jigger this to > avoid reopening relations which the executor has already opened and > locked elsewhere. Unfortunately, I don't see a really simple way to > accomplish that. We get the OIDs of the descendents and want to know > whether there is range table entry for that OID; but there's no data > structure which answers that question at present, I believe, and > introducing one just for this purpose seems like an awful lot of new > machinery. Perhaps that new machinery would still have less > far-reaching consequences than the machinery Alvaro is proposing, but, > still, it's not very appealing. The newly added ExecGetRangeTableRelation opens (if not already done) and returns a Relation pointer for tables that are present in the range table, so requires to be passed a valid RT index. That works for tables that the planner touched. 
UPDATE tuple routing benefits from that in cases where the routing target is already in the range table. For insert itself, the planner adds only the target partitioned table to the range table. Partitions that the inserted tuples will route to may be present in the range table via some other plan node, but the insert's execution state won't know about them, so it cannot use ExecGetRangeTableRelation.

> Perhaps one idea is to only open and lock partitions on demand - i.e. if
> a tuple actually gets routed to them. There are good reasons to do
> that independently of reducing lock levels, and we certainly couldn't
> do it without having some efficient way to check whether it had
> already been done. So then the mechanism wouldn't feel so much like a
> special-purpose hack just for concurrent ATTACH/DETACH. (Was Amit
> Langote already working on this, or was that some other kind of
> on-demand locking?)

I think the patch mentioned above gets us closer to that goal.

Thanks,
Amit
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is called RelationBuildPartitionDesc in master and BuildPartitionDesc in Alvaro's patch has a synchronization problem as soon as we start to reduce lock levels. At some point, find_inheritance_children() gets called to get a list of the OIDs of the partitions. Then, later, SysCacheGetAttr(RELOID, ...) gets called for each one to get its relpartbound value. But since catalog lookups use the most current snapshot, they might not see a compatible view of the catalogs. That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null because it's been detached.

- We might see a newer version of relpartbound where it now has an unrelated value because it has been detached and then reattached to some other partitioned table.

- We might see newer versions of relpartbound for some tables than others. For instance, suppose we had partition A for 1..200 and B for 201..300. Then we realize that this is not what we actually wanted to do, so we detach A and reattach it with a bound of 1..100 and detach B and reattach it with a bound of 101..300. If we perform the syscache lookup for A before this happens and the syscache lookup for B after this happens, we might see the old bound for A and the new bound for B, and that would be sad, 'cuz they overlap.

- Seeing an older relpartbound for some other table is also a problem for other reasons -- we will have the wrong idea about the bounds of that partition and may put the wrong tuples into it.

Without AccessExclusiveLock, I don't think there is anything that keeps us from reading stale syscache entries. Alvaro's patch defends against the first of these cases by throwing an error, which, as I already said, I don't think is acceptable, but I don't see any defense at all against the other cases. The root of the problem is the way catalog lookups work today: each individual lookup uses the latest available snapshot, but there is zero guarantee that consecutive lookups use the same snapshot. Therefore, as soon as you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single snapshot to perform both the scan of pg_inherits and the subsequent pg_class lookups. That way, you know that you are seeing the state of the whole partitioning hierarchy as it existed at some particular point in time -- every commit is either fully reflected in the constructed PartitionDesc or not reflected at all. Unfortunately, that would mean that we can't use the syscache to perform the lookups, which might have unhappy performance consequences.

Note that this problem still exists even if you allow concurrent attach but not concurrent detach, but it's not as bad, because when you encounter a concurrently-attached partition, you know it hasn't also been concurrently-detached from someplace else. Presumably you either see the latest value of the partition bound or the NULL value which preceded it, but not anything else. If that's so, then maybe you could get by without using a consistent snapshot for all of your information gathering: if you see NULL, you know that the partition was concurrently added and you just ignore it. There's still no guarantee that all parallel workers would come to the same conclusion, though, which doesn't feel too good.
Personally, I don't think it's right to blame that problem on parallel query. The problem is more general than that: we assume that holding any kind of a lock on a relation is enough to keep the important details of the relation static, and therefore it's fine to do staggered lookups within one backend, and it's also fine to do staggered lookups across different backends. When you remove the basic assumption that any lock is enough to prevent concurrent DDL, then the whole idea that you can do different lookups at different times with different snapshots (possibly in different backends) and get sane answers also ceases to be correct. But the idea that you can look up different bits of catalog data at whatever time is convenient undergirds large amounts of our current machinery -- it's built into relcache, syscache, sinval, ... I think that things get even crazier if we postpone locking on individual partitions until we need to do something with that partition, as has been proposed elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
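As a rough, untested illustration of the single-snapshot idea (parentOid is a placeholder, and the per-child pg_class fetch is only sketched in a comment), the lookup might be shaped like this:

    Snapshot    snap = RegisterSnapshot(GetLatestSnapshot());
    Relation    inhrel = heap_open(InheritsRelationId, AccessShareLock);
    ScanKeyData key;
    SysScanDesc scan;
    HeapTuple   tup;

    ScanKeyInit(&key, Anum_pg_inherits_inhparent,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(parentOid));
    scan = systable_beginscan(inhrel, InheritsParentIndexId, true,
                              snap, 1, &key);
    while (HeapTupleIsValid(tup = systable_getnext(scan)))
    {
        Oid     childOid = ((Form_pg_inherits) GETSTRUCT(tup))->inhrelid;

        /*
         * Fetch childOid's pg_class row (and hence relpartbound) with the
         * very same snapshot, not via the syscache, so that the set of
         * children and their bounds reflect one point in the commit sequence.
         */
    }
    systable_endscan(scan);
    heap_close(inhrel, AccessShareLock);
    UnregisterSnapshot(snap);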
On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote: > I suspect the only good way of fixing this problem is using a single > snapshot to perform both the scan of pg_inherits and the subsequent > pg_class lookups. That way, you know that you are seeing the state of > the whole partitioning hierarchy as it existed at some particular > point in time -- every commit is either fully reflected in the > constructed PartitionDesc or not reflected at all. Unfortunately, > that would mean that we can't use the syscache to perform the lookups, > which might have unhappy performance consequences. I do have a patch sitting around that moves the relpartbound into a new catalogue table named pg_partition. This gets rid of the usage of pg_inherits for partitioned tables. I wonder if that problem is easier to solve with that. It also solves the issue with long partition keys and lack of toast table on pg_class. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 8, 2018 at 3:59 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote:
>> I suspect the only good way of fixing this problem is using a single
>> snapshot to perform both the scan of pg_inherits and the subsequent
>> pg_class lookups. That way, you know that you are seeing the state of
>> the whole partitioning hierarchy as it existed at some particular
>> point in time -- every commit is either fully reflected in the
>> constructed PartitionDesc or not reflected at all. Unfortunately,
>> that would mean that we can't use the syscache to perform the lookups,
>> which might have unhappy performance consequences.
>
> I do have a patch sitting around that moves the relpartbound into a
> new catalogue table named pg_partition. This gets rid of the usage of
> pg_inherits for partitioned tables. I wonder if that problem is easier
> to solve with that. It also solves the issue with long partition keys
> and lack of toast table on pg_class.

Yeah, I thought about that, and it does make some sense. Not sure if it would hurt performance to have to access another table, but maybe it comes out in the wash if pg_inherits is gone? Seems like a fair amount of code rearrangement just to get around the lack of a TOAST table on pg_class, but maybe it's worth it.

I had another idea, too. I think we might be able to reuse the technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7. That is:

- make a note of SharedInvalidMessageCounter before doing any of the relevant catalog lookups
- do them
- AcceptInvalidationMessages()
- if SharedInvalidMessageCounter has changed, discard all the data we collected and retry from the top

I believe that is sufficient to guarantee that whatever we construct will have a consistent view of the catalogs which is the most recent available view as of the time we do the work. And with this approach I believe we can continue to use syscache lookups to get the data rather than having to use actual index scans, which is nice. Then again, with your approach I'm guessing that one index scan would get us the list of children and their partition bound information. That would be even better -- the syscache lookup per child goes away altogether; it's just a question of deforming the pg_partition tuples.

Way back at the beginning of the partitioning work, I mulled over the idea of storing the partition bound information in a new column in pg_inherits rather than in pg_class. I wonder why exactly I rejected that idea, and whether I was wrong to do so. One possible advantage of that approach over a pg_partition table is that client code which queries pg_inherits will have to be adjusted if we stop using it, and some of those queries are going to get more complicated.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
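A rough, untested sketch of that retry loop (parentOid is a placeholder, and the per-child lookups are elided):

    for (;;)
    {
        uint64      inval_count = SharedInvalidMessageCounter;
        List       *children;

        children = find_inheritance_children(parentOid, NoLock);
        /* ... per-child relpartbound syscache lookups would go here ... */

        AcceptInvalidationMessages();
        if (inval_count == SharedInvalidMessageCounter)
            break;              /* nothing changed underneath us; keep the data */

        /* something changed concurrently; discard and start over */
        list_free(children);
    }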
On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I had another idea, too. I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work. And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.

Here are a couple of patches to illustrate this approach to this part of the overall problem. 0001 is, I think, a good cleanup that may as well be applied in isolation; it makes the code in RelationBuildPartitionDesc both cleaner and more efficient. 0002 adjusts things so that - I hope - the partition bounds we get for the individual partitions have to be as of the same point in the commit sequence as the list of children. As I noted before, Alvaro's patch doesn't seem to have tackled this part of the problem.

Another part of the problem is finding a way to make sure that if we execute a query (or plan one), all parts of the executor (or planner) see the same PartitionDesc for every relation. In the case of parallel query, I think it's important to try to get consistency not only within a single backend but also across backends. I'm thinking about perhaps creating an object with a name like PartitionDescDirectory which can optionally attach to dynamic shared memory. It would store an OID -> PartitionDesc mapping in local memory, and optionally, an additional OID -> serialized-PartitionDesc in DSA. When given an OID, it would check the local hash table first, and then if that doesn't find anything, check the shared hash table if there is one. If an entry is found there, deserialize and add to the local hash table. We'd then hang such a directory off of the EState for the executor and the PlannerInfo for the planner. As compared with Alvaro's proposal, this approach has the advantage of not treating parallel query as a second-class citizen, and also of keeping partitioning considerations out of the snapshot handling, which as I said before seems to me to be a good idea.

One thing which was vaguely on my mind in earlier emails but which I think I can now articulate somewhat more clearly is this: In some cases, a consistent but outdated view of the catalog state is just as bad as an inconsistent view of the catalog state. For example, it's not OK to decide that a tuple can be placed in a certain partition based on an outdated list of relation constraints, including the partitioning constraint - nor is it OK to decide that a partition can be pruned based on an old view of the partitioning constraint. I think this means that whenever we change a partition's partitioning constraint, we MUST take AccessExclusiveLock on the partition. Otherwise, a heap_insert() [or a partition pruning decision] can be in progress on that relation in one backend at the same time that some other backend is changing the partition constraint, which can't possibly work. The best we can reasonably do is to reduce the locking level on the partitioned table itself.
A corollary is that holding the PartitionDescs that a particular query sees for a particular relation fixed, whether by the method Alvaro proposes or by what I am proposing here or by some other method, is not a panacea. For example, the ExecPartitionCheck call in copy.c sometimes gets skipped on the theory that if tuple routing has sent us to partition X, then the tuple being routed must satisfy the partition constraint for that partition. But that's not true if we set up tuple routing using one view of the catalogs, and then things changed afterwards. RelationBuildPartitionDesc doesn't lock the children whose relpartbounds it is fetching (!), so unless we're guaranteed to have already locked those children earlier for some other reason, we could grab the partition bound at this point and then it could change again before we get a lock on them.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
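To make the PartitionDescDirectory idea a bit more concrete, here is a rough, untested sketch of the backend-local half only; all of the names are invented, and the optional DSM/DSA side is omitted:

    typedef struct PartitionDescDirEntry
    {
        Oid             reloid;     /* hash key */
        PartitionDesc   partdesc;   /* descriptor fixed for this query */
    } PartitionDescDirEntry;

    static HTAB *
    CreatePartitionDescDirectory(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(PartitionDescDirEntry);
        return hash_create("PartitionDescDirectory", 64, &ctl,
                           HASH_ELEM | HASH_BLOBS);
    }

    static PartitionDesc
    PartitionDirectoryLookup(HTAB *dir, Relation rel)
    {
        Oid         relid = RelationGetRelid(rel);
        bool        found;
        PartitionDescDirEntry *entry;

        entry = hash_search(dir, &relid, HASH_ENTER, &found);
        if (!found)
            entry->partdesc = RelationGetPartitionDesc(rel);    /* would need pinning/copying */
        return entry->partdesc;
    }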
On Wed, Nov 14, 2018 at 02:27:31PM -0500, Robert Haas wrote:
> Here are a couple of patches to illustrate this approach to this part
> of the overall problem. 0001 is, I think, a good cleanup that may as
> well be applied in isolation; it makes the code in
> RelationBuildPartitionDesc both cleaner and more efficient. 0002
> adjusts things so that - I hope - the partition bounds we get for the
> individual partitions have to be as of the same point in the commit
> sequence as the list of children. As I noted before, Alvaro's patch
> doesn't seem to have tackled this part of the problem.

You may want to rebase these patches as of b52b7dc2, and change the first argument of partition_bounds_create() so that a list is used as input...

-- Michael
On 2018/11/15 4:27, Robert Haas wrote: > RelationBuildPartitionDesc doesn't lock the children > whose relpartbounds it is fetching (!), so unless we're guaranteed to > have already locked them children earlier for some other reason, we > could grab the partition bound at this point and then it could change > again before we get a lock on them. Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a partition before fetching its relpartbound, because the latter can't change if the caller is holding a lock on the parent, which it must be if we're in RelationBuildPartitionDesc for parent at all. Am I missing something? As Michael pointed out, the first cleanup patch needs to be rebased due to a recent commit [1]. I did that to see if something we did in that commit made things worse for your patch, but seems fine. I had to go and change things outside RelationBuildPartitionDesc as I rebased, due to the aforementioned commit, but they're simple changes such as changing List * arguments of some newly added functions to PartitionBoundSpec **. Please find the rebased patches attached with this email. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2
On 2018/11/15 11:03, Amit Langote wrote:
> As Michael pointed out, the first cleanup patch needs to be rebased due to
> a recent commit [1]. I did that to see if something we did in that commit
> made things worse for your patch, but seems fine. I had to go and change
> things outside RelationBuildPartitionDesc as I rebased, due to the
> aforementioned commit, but they're simple changes such as changing List *
> arguments of some newly added functions to PartitionBoundSpec **. Please
> find the rebased patches attached with this email.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2

I noticed that the regression tests containing partitioned tables fail randomly with the rebased patches I posted, whereas they didn't if I apply them to HEAD without [1]. It seems to be due to the slightly confused memory context handling in RelationBuildPartitionDesc after [1], which Alvaro had expressed some doubts about yesterday [2].

I've fixed 0001 again to re-order the code so that allocations happen in the correct context and now tests pass with the rebased patches.

By the way, I noticed that the oids array added by Robert's original 0001 patch wasn't initialized to NULL, which could lead to calling pfree on a garbage value of oids after the 2nd patch.

Thanks,
Amit

[2] https://www.postgresql.org/message-id/20181113135915.v4r77tdthlajdlqq%40alvherre.pgsql
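For illustration, the initialization problem mentioned above has this general shape (a simplified sketch, not the actual patch text):

    Oid        *oids = NULL;    /* must start out NULL: only allocated when nparts > 0 */

    if (nparts > 0)
        oids = (Oid *) palloc(nparts * sizeof(Oid));

    /* ... build the descriptor ... */

    if (oids != NULL)
        pfree(oids);            /* without the initialization, this could pfree garbage */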
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
> I've fixed 0001 again to re-order the code so that allocations happen in the
> correct context and now tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make the current situation even messier. Coming to my point: do we have actually any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation has no partitions? It seems to me that we had better set rd_pdcxt and rd_partdesc to NULL in this case.

-- Michael
On 2018/11/15 14:38, Michael Paquier wrote:
> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
>> I've fixed 0001 again to re-order the code so that allocations happen in the
>> correct context and now tests pass with the rebased patches.
>
> I have been looking at 0001, and it seems to me that you make the current
> situation even messier. Coming to my point: do we have actually
> any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
> has no partitions? It seems to me that we had better set rd_pdcxt and
> rd_partdesc to NULL in this case.

As things stand today, rd_partdesc of a partitioned table must always be non-NULL. In fact, there are many places in the backend code that Assert it:

tablecmds.c: ATPrepDropNotNull()

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        PartitionDesc partdesc = RelationGetPartitionDesc(rel);

        Assert(partdesc != NULL);

prepunion.c: expand_partitioned_rtentry()

    PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);

    check_stack_depth();

    /* A partitioned table should always have a partition descriptor. */
    Assert(partdesc);

plancat.c: set_relation_partition_info()

    partdesc = RelationGetPartitionDesc(relation);
    partkey = RelationGetPartitionKey(relation);
    rel->part_scheme = find_partition_scheme(root, relation);
    Assert(partdesc != NULL && rel->part_scheme != NULL);

Maybe there are others in a different form.

If there are no partitions, nparts is 0, and other fields are NULL, though rd_partdesc itself is never NULL.

If we want to redesign that and allow it to be NULL until some code in the backend wants to use it, then maybe we can consider doing what you say. But, many non-trivial operations on partitioned tables require the PartitionDesc, so there is perhaps not much point to designing it such that rd_partdesc is set only when needed, because it will be referenced sooner rather than later. Maybe, we can consider doing that sort of thing for boundinfo, because it's expensive to build, and not all operations want the canonicalized bounds.

Thanks,
Amit
On Thu, Nov 15, 2018 at 02:53:47PM +0900, Amit Langote wrote:
> As things stand today, rd_partdesc of a partitioned table must always be
> non-NULL. In fact, there are many places in the backend code that Assert it:
>
> [...]

I have noticed those, and they would not actually care much if rd_partdesc were NULL. I find it interesting that the planner portion actually does roughly the same thing with a partitioned table with no partitions and a non-partitioned table.

> Maybe there are others in a different form.
>
> If there are no partitions, nparts is 0, and other fields are NULL, though
> rd_partdesc itself is never NULL.

I find it a bit confusing that both concepts have the same meaning, aka that a relation has no partition, and that it is actually relkind which decides whether rd_partdesc should be NULL or set up. This stuff also does empty allocations.

> If we want to redesign that and allow it to be NULL until some code in the
> backend wants to use it, then maybe we can consider doing what you say.
> But, many non-trivial operations on partitioned tables require the
> PartitionDesc, so there is perhaps not much point to designing it such
> that rd_partdesc is set only when needed, because it will be referenced
> sooner rather than later. Maybe, we can consider doing that sort of thing for
> boundinfo, because it's expensive to build, and not all operations want
> the canonicalized bounds.

I am fine if that's the consensus of this thread. But as far as I can see it is possible to remove a bit of the memory handling mess by doing so. My 2c.

-- Michael
On 2018/11/15 15:22, Michael Paquier wrote:
>> If there are no partitions, nparts is 0, and other fields are NULL, though
>> rd_partdesc itself is never NULL.
>
> I find it a bit confusing that both concepts have the same meaning, aka
> that a relation has no partition, and that it is actually relkind which
> decides whether rd_partdesc should be NULL or set up. This stuff also does
> empty allocations.
>
>> If we want to redesign that and allow it to be NULL until some code in the
>> backend wants to use it, then maybe we can consider doing what you say.
>> But, many non-trivial operations on partitioned tables require the
>> PartitionDesc, so there is perhaps not much point to designing it such
>> that rd_partdesc is set only when needed, because it will be referenced
>> sooner rather than later. Maybe, we can consider doing that sort of thing for
>> boundinfo, because it's expensive to build, and not all operations want
>> the canonicalized bounds.
>
> I am fine if that's the consensus of this thread. But as far as I can
> see it is possible to remove a bit of the memory handling mess by doing
> so. My 2c.

Perhaps we can discuss this on another thread. I know this thread contains important points about partition descriptor creation and modification, but memory context considerations seem like a separate topic. The following message could be a starting point, because there we were talking about perhaps a similar design to what you're saying:

https://www.postgresql.org/message-id/143ed9a4-6038-76d4-9a55-502035815e68%40lab.ntt.co.jp

Also, while I understood Alvaro's and your comments on the other thread that the memory handling is messy as is, sorry, it's not clear to me why you say this patch makes it messier. It reduces context switch calls so that RelationBuildPartitionDesc roughly looks like this after the patch:

Start with CurrentMemoryContext...

1. read catalogs and make bounddescs and oids arrays
2. partition_bounds_create(...)
3. create and switch to rd_pdcxt
4. create PartitionDesc, copy partdesc->oids and partdesc->boundinfo
5. switch back to the old context

Thanks,
Amit
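A simplified, untested sketch of that flow, with the bound-building call shown schematically using the array-style arguments under discussion:

    PartitionDesc       partdesc;
    PartitionBoundInfo  boundinfo;
    MemoryContext       oldcxt;

    /* steps 1-2: temporary data and boundinfo live in CurrentMemoryContext */
    boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);

    /* step 3: long-lived context, child of CacheMemoryContext */
    rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
                                          "partition descriptor",
                                          ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);

    /* step 4: copy only what we keep into rd_pdcxt */
    partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
    partdesc->nparts = nparts;
    partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
    memcpy(partdesc->oids, oids, nparts * sizeof(Oid));
    partdesc->boundinfo = partition_bounds_copy(boundinfo, key);

    /* step 5: switch back; the temporaries stay in the shorter-lived context */
    MemoryContextSwitchTo(oldcxt);
    rel->rd_partdesc = partdesc;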
On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote: > > I've fixed 0001 again to re-order the code so that allocations happen the > > correct context and now tests pass with the rebased patches. > > I have been looking at 0001, and it seems to me that you make even more > messy the current situation. Coming to my point: do we have actually > any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation > has no partitions? It seems to me that we had better set rd_pdcxt and > rd_partdesc to NULL in this case. I think that's unrelated to this patch, as Amit also says, but I have to say that the last few hunks of the rebased version of this patch do not make a lot of sense to me. This patch is supposed to be reducing list construction, and the original version did that, but the rebased version adds a partition_bounds_copy() operation, whereas my version did not add any expensive operations - it only removed some cost. I don't see why anything I changed should necessitate such a change, nor does it seem like a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/11/15 22:57, Robert Haas wrote: > On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote: >> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote: >>> I've fixed 0001 again to re-order the code so that allocations happen the >>> correct context and now tests pass with the rebased patches. >> >> I have been looking at 0001, and it seems to me that you make even more >> messy the current situation. Coming to my point: do we have actually >> any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation >> has no partitions? It seems to me that we had better set rd_pdcxt and >> rd_partdesc to NULL in this case. > > I think that's unrelated to this patch, as Amit also says, but I have > to say that the last few hunks of the rebased version of this patch do > not make a lot of sense to me. This patch is supposed to be reducing > list construction, and the original version did that, but the rebased > version adds a partition_bounds_copy() operation, whereas my version > did not add any expensive operations - it only removed some cost. I > don't see why anything I changed should necessitate such a change, nor > does it seem like a good idea. The partition_bounds_copy() is not because of your changes, it's there in HEAD. The reason we do that is because partition_bounds_create allocates the memory for the PartitionBoundInfo it returns along with other temporary allocations in CurrentMemoryContext. But we'd need to copy it into rd_pdcxt before calling it a property of rd_partdesc, so the partition_bounds_copy. Maybe partition_bounds_create() should've had a MemoryContext argument to pass it the context we want it to create the PartitionBoundInfo in. That way, we can simply pass rd_pdcxt to it and avoid making a copy. As is, we're now allocating two copies of PartitionBoundInfo, one in the CurrentMemoryContext and another in rd_pdcxt, whereas the previous code would only allocate the latter. Maybe we should fix it as being a regression. Thanks, Amit
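[Editorial sketch: if the suggestion above were adopted, the change would be roughly the shape below. This is hypothetical -- no such argument exists in the tree -- and the parameter list is guessed; the intent is only that the returned PartitionBoundInfo lands in a caller-supplied context while scratch allocations stay in CurrentMemoryContext.]

/* hypothetical signature, per the suggestion above */
extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
                                                  int nparts,
                                                  PartitionKey key,
                                                  int **mapping,
                                                  MemoryContext mcxt);

/* ...which would let RelationBuildPartitionDesc request the result directly
 * in rd_pdcxt and drop its own copy: */
boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping,
                                    rel->rd_pdcxt);
partdesc->boundinfo = boundinfo;    /* no partition_bounds_copy() needed here */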
On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote: > Maybe partition_bounds_create() should've had a MemoryContext argument to > pass it the context we want it to create the PartitionBoundInfo in. That > way, we can simply pass rd_pdcxt to it and avoid making a copy. As is, > we're now allocating two copies of PartitionBoundInfo, one in the > CurrentMemoryContext and another in rd_pdcxt, whereas the previous code > would only allocate the latter. Maybe we should fix it as being a regression. Not sure about what you mean by regression here, but passing the memory context as an argument makes sense, as you can remove the extra partition bound copy, and it also makes sense to use an array instead of a list for performance, which may matter if many partitions are handled when building the cache. So cleaning up both things at the same time would be nice. -- Michael
On Fri, Nov 16, 2018 at 1:00 PM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote: > > Maybe partition_bounds_create() should've had a MemoryContext argument to > > pass it the context we want it to create the PartitionBoundInfo in. That > > way, we can simply pass rd_pdcxt to it and avoid making a copy. As is, > > we're now allocating two copies of PartitionBoundInfo, one in the > > CurrentMemoryContext and another in rd_pdcxt, whereas the previous code > > would only allocate the latter. Maybe we should fix it as being a regression. > > Not sure about what you mean by regression here, The regression is, as I mentioned, that the new code allocates two copies of PartitionBoundInfo whereas only one would be allocated before. > but passing the memory > context as an argument makes sense, as you can remove the extra partition > bound copy, and it also makes sense to use an array instead of a list for > performance, which may matter if many partitions are handled when > building the cache. So cleaning up both things at the same time would > be nice. Maybe the patch to add the memory context argument to partition_bounds_create() and other related static functions in partbounds.c should be its own patch, as that seems to be a separate issue. OTOH, other changes needed to implement Robert's proposal of using PartitionBoundSpec and Oid arrays instead of existing lists should be in the same patch. Thanks, Amit
On Thu, Nov 15, 2018 at 8:58 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > The partition_bounds_copy() is not because of your changes, it's there in > HEAD. OK, but it seems to me that your version of my patch rearranges the code more than necessary. How about the attached? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote: > OK, but it seems to me that your version of my patch rearranges the > code more than necessary. > > How about the attached? What you are proposing here looks good to me. Thanks! -- Michael
On 2018/11/17 9:06, Michael Paquier wrote: > On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote: >> OK, but it seems to me that your version of my patch rearranges the >> code more than necessary. >> >> How about the attached? > > What you are proposing here looks good to me. Thanks! Me too, now that I've looked at the patch closely. The errors I'd seen in the regression tests were due to an uninitialized oids variable, which is fixed in the later patches, not due to "confused memory context switching" as I'd put it [1] when I made that the reason for additional changes. Thanks, Amit [1] https://www.postgresql.org/message-id/1be8055c-137b-5639-9bcf-8a2d5fef6e5a%40lab.ntt.co.jp
On Sun, Nov 18, 2018 at 9:43 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/11/17 9:06, Michael Paquier wrote: > > On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote: > >> OK, but it seems to me that your version of my patch rearranges the > >> code more than necessary. > >> > >> How about the attached? > > > > What you are proposing here looks good to me. Thanks! > > Me too, now that I see the patch closely. The errors I'd seen in the > regression tests were due to uninitialized oids variable which is fixed in > the later patches, not due to "confused memory context switching" as I'd > put it [1] and made that the reason for additional changes. OK. Rebased again, and committed (although I forgot to include a link to this discussion - sorry about that). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 14, 2018 at 9:03 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/11/15 4:27, Robert Haas wrote: > > RelationBuildPartitionDesc doesn't lock the children > > whose relpartbounds it is fetching (!), so unless we're guaranteed to > > have already locked them children earlier for some other reason, we > > could grab the partition bound at this point and then it could change > > again before we get a lock on them. > > Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a > partition before fetching its relpartbound, because the latter can't > change if the caller is holding a lock on the parent, which it must be if > we're in RelationBuildPartitionDesc for parent at all. Am I missing > something? After thinking about this for a bit, I think that right now it's fine, because you can't create or drop or attach or detach a partition without holding AccessExclusiveLock on both the parent and the child, so if you hold even AccessShareLock on the parent, the child's relpartbound can't be changing. However, what we want to do is get the lock level on the parent down to ShareUpdateExclusiveLock, at which point the child's relpartbound could indeed change under us. I think, however, that what I previously posted as 0002 is sufficient to fix that part of the problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello > OK. Rebased again, and committed (although I forgot to include a link > to this discussion - sorry about that). Seems we erroneously moved this thread to next CF: https://commitfest.postgresql.org/21/1842/ Can you close this entry? regards, Sergei
On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote: > Seems we erroneously moved this thread to next CF: > https://commitfest.postgresql.org/21/1842/ > Can you close this entry? Robert has committed a patch (7ee5f88e) that refactors the list construction of RelationBuildPartitionDesc a bit, but the main patch has not been committed, so the current status looks right to me. -- Michael
On Sun, Dec 16, 2018 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote: > On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote: > > Seems we erroneously moved this thread to next CF: > > https://commitfest.postgresql.org/21/1842/ > > Can you close this entry? > > Robert has committed a patch to refactor a bit the list contruction of > RelationBuildPartitionDesc thanks to 7ee5f88e, but the main patch has > not been committed, so the current status looks right to me. I have done a bit more work on this, but need to spend some more time on it before I have something that is worth posting. Not sure whether I'll get to that before the New Year at this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Dec-17, Robert Haas wrote: > I have done a bit more work on this, but need to spend some more time > on it before I have something that is worth posting. Not sure whether > I'll get to that before the New Year at this point. This patch missing the CF deadline would not be a happy way for me to begin the new year. I'm not sure what's the best way to move forward with this patch, but I encourage you to post whatever version you have before the deadline, even if you're not fully happy with it (and, heck, even if it doesn't compile and/or is full of FIXME or TODO comments). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 17, 2018 at 06:52:51PM -0300, Alvaro Herrera wrote: > On 2018-Dec-17, Robert Haas wrote: > This patch missing the CF deadline would not be a happy way for me to > begin the new year. > > I'm not sure what's the best way to move forward with this patch, but I > encourage you to post whatever version you have before the deadline, > even if you're not fully happy with it (and, heck, even if it doesn't > compile and/or is full of FIXME or TODO comments). Agreed. This patch has value, and somebody else could always take it from the point where you were. -- Michael
On Mon, Dec 17, 2018 at 6:44 PM Michael Paquier <michael@paquier.xyz> wrote: > Agreed. This patch has value, and somebody else could always take it > from the point where you were. OK. I'll post what I have by the end of the week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote: > OK. I'll post what I have by the end of the week. Thanks, Robert. -- Michael
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote: > > OK. I'll post what I have by the end of the week. > > Thanks, Robert. OK, so I got slightly delayed here by utterly destroying my laptop, but I've mostly reconstructed what I did. I think there are some remaining problems, but this seems like a good time to share what I've got so far. I'm attaching three patches. 0001 is one which I posted before. It attempts to fix up RelationBuildPartitionDesc() so that this function will always return a partition descriptor based on a consistent snapshot of the catalogs. Without this, I think there's nothing to prevent a commit which happens while the function is running from causing the function to fail or produce nonsense answers. 0002 introduces the concept of a partition directory. The idea is that the planner will create a partition directory, and so will the executor, and all calls which occur in those places to RelationGetPartitionDesc() will instead call PartitionDirectoryLookup(). Every lookup for the same relation in the same partition directory is guaranteed to produce the same answer. I believe this patch still has a number of weaknesses. More on that below. 0003 actually lowers the lock level. The comment here might need some more work. Here is a list of possible or definite problems that are known to me: - I think we need a way to make partition directory lookups consistent across backends in the case of parallel query. I believe this can be done with a dshash and some serialization and deserialization logic, but I haven't attempted that yet. - I refactored expand_inherited_rtentry() to drive partition expansion entirely off of PartitionDescs. The reason why this is necessary is that it clearly will not work to have find_all_inheritors() use a current snapshot to decide what children we have and lock them, and then consult a different source of truth to decide which relations to open with NoLock. There's nothing to keep the lists of partitions from being different in the two cases, and that demonstrably causes assertion failures if you SELECT with an ATTACH/DETACH loop running in the background. However, it also changes the order in which tables get locked. Possibly that could be fixed by teaching expand_partitioned_rtentry() to qsort() the OIDs the way find_inheritance_children() does. It also loses the infinite-loop protection which find_all_inheritors() has. Not sure what to do about that. - In order for the new PartitionDirectory machinery to avoid use-after-free bugs, we have to either copy the PartitionDesc out of the relcache into the partition directory or avoid freeing it while it is still in use. Copying it seems unappealing for performance reasons, so I took the latter approach. However, what I did here in terms of reclaiming memory is just about the least aggressive strategy short of leaking it altogether - it just keeps it around until the next rebuild that occurs while the relcache entry is not in use. We might want to do better, e.g. freeing old copies immediately as soon as the relcache reference count drops to 0. I just did it this way because it was simple to code. - I tried this with Alvaro's isolation tests and it fails some tests. That's because Alvaro's tests expect that the list of accessible partitions is based on the query snapshot. 
For reasons I attempted to explain in the comments in 0003, I think the idea that we can choose the set of accessible partitions based on the query snapshot is very wrong. For example, suppose transaction 1 begins, reads an unrelated table to establish a snapshot, and then goes idle. Then transaction 2 comes along, detaches a partition from an important table, and then does crazy stuff with that table -- changes the column list, drops it, reattaches it with different bounds, whatever. Then it commits. If transaction 1 now comes along and uses the query snapshot to decide that the detached partition ought to still be seen as a partition of that partitioned table, disaster will ensue. - I don't have any tests here, but I think it would be good to add some, perhaps modeled on Alvaro's, and also some that involve multiple ATTACH and DETACH operations mixed with other interesting kinds of DDL. I also didn't make any attempt to update the documentation, which is another thing that will probably have to be done at some point. Not sure how much documentation we have of any of this now. - I am uncertain whether the logic that builds up the partition constraint is really safe in the face of concurrent DDL. I kinda suspect there are some problems there, but maybe not. Once you hold ANY lock on a partition, the partition constraint can't concurrently become stricter, because no ATTACH can be done without AccessExclusiveLock on the partition being attached; but it could concurrently become less strict, because you only need a lesser lock for a detach. Not sure if/how that could foul with this logic. - I have not done anything about the fact that index detach takes AccessExclusiveLock. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
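[Editorial sketch: for readers trying to picture 0002, the partition directory boils down to something like the following. This is a simplified rendering of the idea, not code from the patch; the struct layout is guessed, and the real version also has to handle memory contexts, keeping the pinned descriptors alive (the relcache bookkeeping discussed above), and eventually sharing across parallel workers.]

typedef struct PartitionDirectoryEntry
{
    Oid             reloid;     /* hash key */
    PartitionDesc   pd;         /* descriptor pinned at first lookup */
} PartitionDirectoryEntry;

typedef struct PartitionDirectoryData
{
    HTAB       *pdir_hash;      /* reloid -> PartitionDirectoryEntry */
} PartitionDirectoryData;

typedef struct PartitionDirectoryData *PartitionDirectory;

PartitionDesc
PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    bool        found;
    PartitionDirectoryEntry *pde;

    pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
    if (!found)
    {
        /*
         * First lookup for this relation in this planning or execution
         * cycle: remember the descriptor we see now.  Every later lookup
         * in the same directory returns the same pointer, even if the
         * relcache entry is rebuilt concurrently.
         */
        pde->pd = RelationGetPartitionDesc(rel);
    }
    return pde->pd;
}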
Thanks for this work! I like the name "partition directory". On 2018-Dec-20, Robert Haas wrote: > 0002 introduces the concept of a partition directory. The idea is > that the planner will create a partition directory, and so will the > executor, and all calls which occur in those places to > RelationGetPartitionDesc() will instead call > PartitionDirectoryLookup(). Every lookup for the same relation in the > same partition directory is guaranteed to produce the same answer. I > believe this patch still has a number of weaknesses. More on that > below. The commit message for this one also points out another potential problem, > Introduce the concept of a partition directory. > > Teach the optimizer and executor to use it, so that a single planning > cycle or query execution gets the same PartitionDesc for the same table > every time it looks it up. This does not prevent changes between > planning and execution, nor does it guarantee that all tables are > expanded according to the same snapshot. Namely: how does this handle the case of partition pruning structure being passed from planner to executor, if an attach happens in the middle of it and puts a partition in between existing partitions? Array indexes of any partitions that appear later in the partition descriptor will change. This is the reason I used the query snapshot rather than EState. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Introduce the concept of a partition directory. > > > > Teach the optimizer and executor to use it, so that a single planning > > cycle or query execution gets the same PartitionDesc for the same table > > every time it looks it up. This does not prevent changes between > > planning and execution, nor does it guarantee that all tables are > > expanded according to the same snapshot. > > Namely: how does this handle the case of partition pruning structure > being passed from planner to executor, if an attach happens in the > middle of it and puts a partition in between existing partitions? Array > indexes of any partitions that appear later in the partition descriptor > will change. > > This is the reason I used the query snapshot rather than EState. I didn't handle that. If partition pruning relies on nothing changing between planning and execution, isn't that broken regardless of any of this? It's true that with the simple query protocol we'll hold locks continuously from planning into execution, and therefore with the current locking regime we couldn't really have a problem. But unless I'm confused, with the extended query protocol it's quite possible to generate a plan, release locks, and then reacquire locks at execution time. Unless we have some guarantee that a new plan will always be generated if any DDL has happened in the middle, I think we've got trouble, and I don't think that is guaranteed in all cases. Maybe I'm all wet, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Dec-20, Robert Haas wrote: > I didn't handle that. If partition pruning relies on nothing changing > between planning and execution, isn't that broken regardless of any of > this? It's true that with the simple query protocol we'll hold locks > continuously from planning into execution, and therefore with the > current locking regime we couldn't really have a problem. But unless > I'm confused, with the extended query protocol it's quite possible to > generate a plan, release locks, and then reacquire locks at execution > time. Unless we have some guarantee that a new plan will always be > generated if any DDL has happened in the middle, I think we've got > trouble, and I don't think that is guaranteed in all cases. Oh, so maybe this case is already handled by plan invalidation -- I mean, if we run DDL, the stored plan is thrown away and a new one recomputed. IOW this was already a solved problem and I didn't need to spend effort on it. /me slaps own forehead -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Oh, so maybe this case is already handled by plan invalidation -- I > mean, if we run DDL, the stored plan is thrown away and a new one > recomputed. IOW this was already a solved problem and I didn't need to > spend effort on it. /me slaps own forehead I'm kinda saying the opposite - I'm not sure that it's safe even with the higher lock levels. If the plan is relying on the same partition descriptor being in effect at plan time as at execution time, that sounds kinda dangerous to me. Lowering the lock level might also make something that was previously safe into something unsafe, because now there's no longer a guarantee that invalidation messages are received soon enough. With AccessExclusiveLock, we'll send invalidation messages before releasing the lock, and other processes will acquire the lock and then AcceptInvalidationMessages(). But with ShareUpdateExclusiveLock the locks can coexist, so now there might be trouble. I think this is an area where we need to do some more investigation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 20, 2018 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > Lowering the lock level might also make something that was previously > safe into something unsafe, because now there's no longer a guarantee > that invalidation messages are received soon enough. With > AccessExclusiveLock, we'll send invalidation messages before releasing > the lock, and other processes will acquire the lock and then > AcceptInvalidationMessages(). But with ShareUpdateExclusiveLock the > locks can coexist, so now there might be trouble. I think this is an > area where we need to do some more investigation. So there are definitely problems here. With my patch: create table tab (a int, b text) partition by range (a); create table tab1 partition of tab for values from (0) to (10); prepare t as select * from tab; begin; explain execute t; -- seq scan on tab1 execute t; -- no rows Then, in another session: alter table tab detach partition tab1; insert into tab1 values (300, 'oops'); Back to the first session: execute t; -- shows (300, 'oops') explain execute t; -- still planning to scan tab1 commit; explain execute t; -- now it got the memo, and plans to scan nothing execute t; -- no rows Well, that's not good. We're showing a value that was never within the partition bounds of any partition of tab. The problem is that, since we already have locks on all relevant objects, nothing triggers the second 'explain execute' to process invalidation messages, so we don't update the plan. Generally, any DDL with less than AccessExclusiveLock has this issue. On another thread, I was discussing with Tom and Peter the possibility of trying to rejigger things so that we always AcceptInvalidationMessages() at least once per command, but I think that just turns this into a race: if a concurrent commit happens after 'explain execute t' decides not to re-plan but before it begins executing, we have the same problem. This example doesn't involve partition pruning, and in general I don't think that the problem is confined to partition pruning. It's rather that if there's no conflict between the lock that is needed to change the set of partitions and the lock that is needed to run a query, then there's no way to guarantee that the query runs with the same set of partitions for which it was planned. Unless I'm mistaken, which I might be, this is also a problem with your approach -- if you repeat the same prepared query in the same transaction, the transaction snapshot will be updated, and thus the PartitionDesc will be expanded differently at execution time, but the plan will not have changed, because invalidation messages have not been processed. Anyway, I think the only fix here is likely to be making the executor resilient against concurrent changes in the PartitionDesc. I don't think there's going to be any easy way to compensate for added partitions without re-planning, but maybe we could find a way to flag detached partitions so that they return no rows without actually touching the underlying relation. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote: > - I refactored expand_inherited_rtentry() to drive partition expansion > entirely off of PartitionDescs. The reason why this is necessary is > that it clearly will not work to have find_all_inheritors() use a > current snapshot to decide what children we have and lock them, and > then consult a different source of truth to decide which relations to > open with NoLock. There's nothing to keep the lists of partitions > from being different in the two cases, and that demonstrably causes > assertion failures if you SELECT with an ATTACH/DETACH loop running in > the background. However, it also changes the order in which tables get > locked. Possibly that could be fixed by teaching > expand_partitioned_rtentry() to qsort() the OIDs the way > find_inheritance_children() does. It also loses the infinite-loop > protection which find_all_inheritors() has. Not sure what to do about > that. I don't think you need to qsort() the Oids before locking. What the qsort() does today is ensure we get a consistent locking order. Any other order would surely do, providing we stick to it consistently. I think PartitionDesc order is fine, as it's consistent. Having it locked in PartitionDesc order I think is what's needed for [1] anyway. [2] proposes to relax the locking order taken during execution. [1] https://commitfest.postgresql.org/21/1778/ [2] https://commitfest.postgresql.org/21/1887/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
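[Editorial sketch: to illustrate the point above, deadlock avoidance only requires that everyone who locks all the partitions of a table does so in the same order, not that the order be sorted OIDs specifically. A minimal sketch of locking in PartitionDesc (bound) order, assuming the parent is already locked so the descriptor cannot change underneath us:]

static void
lock_partitions_in_partdesc_order(Relation parent, LOCKMODE lockmode)
{
    PartitionDesc pd = RelationGetPartitionDesc(parent);
    int         i;

    /* bound order is deterministic for a given set of partitions */
    for (i = 0; i < pd->nparts; i++)
        LockRelationOid(pd->oids[i], lockmode);
}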
On Fri, 21 Dec 2018 at 10:05, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Namely: how does this handle the case of partition pruning structure > > being passed from planner to executor, if an attach happens in the > > middle of it and puts a partition in between existing partitions? Array > > indexes of any partitions that appear later in the partition descriptor > > will change. > > > > This is the reason I used the query snapshot rather than EState. > > I didn't handle that. If partition pruning relies on nothing changing > between planning and execution, isn't that broken regardless of any of > this? It's true that with the simple query protocol we'll hold locks > continuously from planning into execution, and therefore with the > current locking regime we couldn't really have a problem. But unless > I'm confused, with the extended query protocol it's quite possible to > generate a plan, release locks, and then reacquire locks at execution > time. Unless we have some guarantee that a new plan will always be > generated if any DDL has happened in the middle, I think we've got > trouble, and I don't think that is guaranteed in all cases. Today the plan would be invalidated if a partition was ATTACHED or DETACHED. The newly built plan would get the updated list of partitions. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 21, 2018 at 6:04 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > I don't think you need to qsort() the Oids before locking. What the > qsort() does today is ensure we get a consistent locking order. Any > other order would surely do, providing we stick to it consistently. I > think PartitionDesc order is fine, as it's consistent. Having it > locked in PartitionDesc order I think is what's needed for [1] anyway. > [2] proposes to relax the locking order taken during execution. If queries take locks in one order and DDL takes them in some other order, queries and DDL starting around the same time could deadlock. Unless we convert the whole system to lock everything in PartitionDesc order the issue doesn't go away completely. But maybe we just have to live with that. Surely we're not going to pay the cost of locking partitions that we don't otherwise need to avoid a deadlock-vs-DDL risk, and once we've decided to assume that risk, I'm not sure a qsort() here helps anything much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 21, 2018 at 6:06 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > I didn't handle that. If partition pruning relies on nothing changing > > between planning and execution, isn't that broken regardless of any of > > this? It's true that with the simple query protocol we'll hold locks > > continuously from planning into execution, and therefore with the > > current locking regime we couldn't really have a problem. But unless > > I'm confused, with the extended query protocol it's quite possible to > > generate a plan, release locks, and then reacquire locks at execution > > time. Unless we have some guarantee that a new plan will always be > > generated if any DDL has happened in the middle, I think we've got > > trouble, and I don't think that is guaranteed in all cases. > > Today the plan would be invalidated if a partition was ATTACHED or > DETACHED. The newly built plan would get the updated list of > partitions. I think you're right, and that won't be true any more once we lower the lock level, so it has to be handled somehow. The entire plan invalidation mechanism seems to depend fundamentally on AccessExclusiveLock being used everywhere, so this is likely to be an ongoing issue every time we want to reduce a lock level anywhere. I wonder if there is any kind of systematic fix or if we are just going to have to keep inventing ad-hoc solutions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 25 Dec 2018 at 08:15, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 21, 2018 at 6:04 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > I don't think you need to qsort() the Oids before locking. What the > > qsort() does today is ensure we get a consistent locking order. Any > > other order would surely do, providing we stick to it consistently. I > > think PartitionDesc order is fine, as it's consistent. Having it > > locked in PartitionDesc order I think is what's needed for [1] anyway. > > [2] proposes to relax the locking order taken during execution. > > If queries take locks in one order and DDL takes them in some other > order, queries and DDL starting around the same time could deadlock. > Unless we convert the whole system to lock everything in PartitionDesc > order the issue doesn't go away completely. But maybe we just have to > live with that. Surely we're not going to pay the cost of locking > partitions that we don't otherwise need to avoid a deadlock-vs-DDL > risk, and once we've decided to assume that risk, I'm not sure a > qsort() here helps anything much. When I said "consistent" I meant consistent over all places where we obtain locks on all partitions. My original v1-0002 patch attempted something like this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Namely: how does this handle the case of partition pruning structure > being passed from planner to executor, if an attach happens in the > middle of it and puts a partition in between existing partitions? Array > indexes of any partitions that appear later in the partition descriptor > will change. I've finally gotten a little more time to work on this. It took me a while to understand that a PartitionedRelPruneInfo assumes that the indexes of partitions in the PartitionDesc don't change between planning and execution, because subplan_map[] and subpart_map[] are indexed by PartitionDesc offset. I suppose the reason for this is so that we don't have to go to the expense of copying the partition bounds from the PartitionDesc into the final plan, but it somehow seems a little scary to me. Perhaps I am too easily frightened, but it's certainly a problem from the point of view of this project, which wants to let the PartitionDesc change concurrently. I wrote a little patch that stores the relation OIDs of the partitions into the PartitionedRelPruneInfo and then, at execution time, does an Assert() that what it gets matches what existed at plan time. I figured that a good start would be to find a test case where this fails with concurrent DDL allowed, but I haven't so far succeeded in devising one. To make the Assert() fail, I need to come up with a case where concurrent DDL has caused the PartitionDesc to be rebuilt but without causing an update to the plan. If I use prepared queries inside of a transaction block, I can continue to run old plans after concurrent DDL has changed things, but I can't actually make the Assert() fail, because the queries continue to use the old plans precisely because they haven't processed invalidation messages, and therefore they also have the old PartitionDesc and everything works. Maybe if I try it with CLOBBER_CACHE_ALWAYS... I also had the idea of trying to use a cursor, because if I could start execution of a query, then force a relcache rebuild, then continue executing the query, maybe something would blow up somehow. But that's not so easy because I don't think we have any way using SQL to declare a cursor for a prepared query, so somehow I need to get a query plan that involves run-time pruning without using parameters, which I'm pretty sure is possible, but I haven't figured it out yet. And even there the PartitionDirectory concept might preserve us from any damage if the change happens after the executor is initialized, though I'm not sure if there are any cases where we don't do the first PartitionDesc lookup for a particular table until mid-execution. Anyway, I think this idea of passing a list of relation OIDs that we saw at planning time through to the executor and cross-checking them might have some legs. If we only allowed concurrently *adding* partitions and not concurrently *removing* them, then even if we find the case(s) where the PartitionDesc can change under us, we can probably just adjust subplan_map and subpart_map to compensate, since we can iterate through the old and new arrays of relation OIDs and just figure out which things have shifted to higher indexes in the PartitionDesc. This is all kind of hand-waving at the moment; tips appreciated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
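[Editorial sketch: in case it helps to see the shape of the remapping idea in that last paragraph, here is a hand-wavy version. The function and parameter names are invented, and it assumes partitions can only have been added since planning, so the plan-time OIDs still appear in the run-time PartitionDesc in the same relative order, just possibly pushed to higher indexes; subplan_map and subpart_map could then be rebuilt through the resulting index map.]

static void
remap_partdesc_indexes(Oid *plan_oids, int n_plan,
                       Oid *exec_oids, int n_exec,
                       int *new_index_of_old)   /* out array of length n_plan */
{
    int         j = 0;

    for (int i = 0; i < n_plan; i++)
    {
        /* skip over partitions that were added after planning */
        while (j < n_exec && exec_oids[j] != plan_oids[i])
            j++;
        if (j >= n_exec)
            elog(ERROR, "partition %u disappeared between planning and execution",
                 plan_oids[i]);
        new_index_of_old[i] = j++;
    }
}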
On 2019-Jan-25, Robert Haas wrote: > I've finally gotten a little more time to work on this. It took me a > while to understand that a PartitionedRelPruneInfo assumes that the > indexes of partitions in the PartitionDesc don't change between > planning and execution, because subplan_map[] and subpart_map[] are > indexed by PartitionDesc offset. Right, the planner/executor "disconnect" is one of the challenges, and why I was trying to keep the old copy of the PartitionDesc around instead of building updated ones as needed. > I suppose the reason for this is so > that we don't have to go to the expense of copying the partition > bounds from the PartitionDesc into the final plan, but it somehow > seems a little scary to me. Perhaps I am too easily frightened, but > it's certainly a problem from the point of view of this project, which > wants to let the PartitionDesc change concurrently. Well, my definition of the problem started with the assumption that we would keep the partition array indexes unchanged, so "change concurrently" is what we needed to avoid. Yes, I realize that you've opted to change that definition. I may have forgotten some of your earlier emails on this, but one aspect (possibly a key one) is that I'm not sure we really need to cope, other than with an ERROR, with queries that continue to run across an attach/detach -- moreso in absurd scenarios such as the ones you described where the detached table is later re-attached, possibly to a different partitioned table. I mean, if we can just detect the case and raise an error, and this let us make it all work reasonably, that might be better. > I wrote a little patch that stores the relation OIDs of the partitions > into the PartitionedRelPruneInfo and then, at execution time, does an > Assert() that what it gets matches what existed at plan time. I > figured that a good start would be to find a test case where this > fails with concurrent DDL allowed, but I haven't so far succeeded in > devising one. To make the Assert() fail, I need to come up with a > case where concurrent DDL has caused the PartitionDesc to be rebuilt > but without causing an update to the plan. If I use prepared queries > inside of a transaction block, [...] > I also had the idea of trying to use a cursor, because if I could > start execution of a query, [...] Those are the ways I thought of, and the reason for the shape of some of those .spec tests. I wasn't able to hit the situation. > Maybe if I try it with CLOBBER_CACHE_ALWAYS... I didn't try this one. > Anyway, I think this idea of passing a list of relation OIDs that we > saw at planning time through to the executor and cross-checking them > might have some legs. If we only allowed concurrently *adding* > partitions and not concurrently *removing* them, then even if we find > the case(s) where the PartitionDesc can change under us, we can > probably just adjust subplan_map and subpart_map to compensate, since > we can iterate through the old and new arrays of relation OIDs and > just figure out which things have shifted to higher indexes in the > PartitionDesc. This is all kind of hand-waving at the moment; tips > appreciated. I think detaching partitions concurrently is a necessary part of this feature, so I would prefer not to go with a solution that works for attaching partitions but not for detaching them. That said, I don't see why it's impossible to adjust the partition maps in both cases. But I don't have anything better than hand-waving ATM.
-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Right, the planner/executor "disconnect" is one of the challenges, and > why I was trying to keep the old copy of the PartitionDesc around > instead of building updated ones as needed. I agree that would be simpler, but I just don't see how to make it work. For one thing, keeping the old copy around can't work in parallel workers, which never had a copy in the first place. For two things, we don't have a really good mechanism to keep the PartitionDesc that was used at plan time around until execution time. Keeping the relation open would do it, but I'm pretty sure that causes other problems; the system doesn't expect any residual references. I know you had a solution to this problem, but I don't see how it can work. You said "Snapshots have their own cache (hash table) of partition descriptors. If a partdesc is requested and the snapshot has already obtained that partdesc, the original one is returned -- we don't request a new one from partcache." But presumably this means when the last snapshot is unregistered, the cache is flushed (otherwise, when?) and if that's so then this relies on the snapshot that was used for planning still being around at execution time, which I am pretty sure isn't guaranteed. Also, and I think this point is worthy of some further discussion, the thing that really seems broken to me about your design is the idea that it's OK to use the query or transaction snapshot to decide which partitions exist. The problem with that is that some query or transaction with an old snapshot might see as a partition some table that has been dropped or radically altered - different column types, attached to some other table now, attached to same table but with different bounds, or just dropped. And therefore it might try to insert data into that table and fail in all kinds of crazy ways, about the mildest of which is inserting data that doesn't match the current partition constraint. I'm willing to be told that I've misunderstood the way it all works and this isn't really a problem for some reason, but my current belief is that not only is it a problem with your design, but that it's such a bad problem that there's really no way to fix it and we have to abandon your entire approach and go a different route. If you don't think that's true, then perhaps we should discuss it further. > > I suppose the reason for this is so > > that we don't have to go to the expense of copying the partition > > bounds from the PartitionDesc into the final plan, but it somehow > > seems a little scary to me. Perhaps I am too easily frightened, but > > it's certainly a problem from the point of view of this project, which > > wants to let the PartitionDesc change concurrently. > > Well, my definition of the problem started with the assumption that we > would keep the partition array indexes unchanged, so "change > concurrently" is what we needed to avoid. Yes, I realize that you've > opted to change that definition. I don't think I made a conscious decision to change this, and I'm kind of wondering whether I have missed some better approach here. 
I feel like the direction I'm pursuing is an inevitable consequence of having no good way to keep the PartitionDesc around from plan-time to execution-time, which in turn feels like an inevitable consequence of the two points I made above: there's no guarantee that the plan-time snapshot is still registered anywhere by the time we get to execution time, and even if there were, the associated PartitionDesc may point to tables that have been drastically modified or don't exist any more. But it's possible that my chain-of-inevitable-consequences has a weak link, in which case I would surely like it if you (or someone else) would point it out to me. > I may have forgotten some of your earlier emails on this, but one aspect > (possibly a key one) is that I'm not sure we really need to cope, other > than with an ERROR, with queries that continue to run across an > attach/detach -- moreso in absurd scenarios such as the ones you > described where the detached table is later re-attached, possibly to a > different partitioned table. I mean, if we can just detect the case and > raise an error, and this let us make it all work reasonably, that might > be better. Well, that's an interesting idea. I assumed that users would hate that kind of behavior with a fiery passion that could never be quenched. If not, then the problem changes from *coping* with concurrent changes to *detecting* concurrent changes, which may be easier, but see below. > I think detaching partitions concurrently is a necessary part of this > feature, so I would prefer not to go with a solution that works for > attaching partitions but not for detaching them. That said, I don't see > why it's impossible to adjust the partition maps in both cases. But I > don't have anything better than hand-waving ATM. The general problem here goes back to what I wrote in the third paragraph of this email: a PartitionDesc that was built with a particular snapshot can't be assumed to be usable after any subsequent DDL has occurred that might affect the shape of the PartitionDesc. For example, if somebody detaches a partition and reattaches it with different partition bounds, and we use the old PartitionDesc for tuple routing, we'll route tuples into that partition that do not satisfy its partition constraint. And we won't even get an ERROR, because the system assumes that any tuple which arrives at a partition as a result of tuple routing must necessarily satisfy the partition constraint. If only concurrent CREATE/ATTACH operations are allowed and DROP/DETACH is not, then that kind of thing isn't possible. Any new partitions which have shown up since the plan was created can just be ignored, and the old ones must still have the same partition bounds that they did before, and everything is fine. Or, if we're OK with a less-nice solution, we could just ERROR out when the number of partitions has changed. Some people will get errors they don't like, but they won't end up with rows in their partitions that violate the constraints. But as soon as you allow concurrent DETACH, then things get really crazy. Even if, at execution time, there are the same number of partitions as I had at plan time, and even if those partitions have the same OIDs as what I had at plan time, and even if those OIDs are in the same order in the PartitionDesc, it does not prove that things are OK. The partition could have been detached and reattached with a narrower set of partition bounds.
And if so, then we might route a tuple to it that doesn't fit that narrower set of bounds, and there will be no error, just database corruption. I suppose one idea for handling this is to stick a counter into pg_class or something that gets incremented every time a partition is detached. At plan time, save the counter value; if it has changed at execution time, ERROR. If not, then you have only added partitions to worry about, and that's an easier problem. But I kind of wonder whether we're really gaining as much as you think by trying to support concurrent DETACH in the first place. If there are queries running against the table, there's probably at least AccessShareLock on the partition itself, not just the parent. And that means that reducing the necessary lock on the parent from AccessExclusiveLock to something less doesn't really help that much, because we're still going to block trying to acquire AccessExclusiveLock on the child. Now you might say: OK, well, just reduce that lock level, too. But that seems to me to be opening a giant can of worms which we are unlikely to get resolved in time for this release. The worst of those problems is that if you also reduce the lock level on the partition when attaching it, then you are adding a constraint while somebody might at the exact same time be inserting a tuple that violates that constraint. Those two things have to be synchronized somehow. You could avoid that by reducing the lock level on the partition when detaching and not when attaching. But even then, detaching a partition can involve performing a whole bunch of operations for which we currently require AccessExclusiveLock. AlterTableGetLockLevel says: /* * Removing constraints can affect SELECTs that have been * optimised assuming the constraint holds true. */ case AT_DropConstraint: /* as DROP INDEX */ case AT_DropNotNull: /* may change some SQL plans */ cmd_lockmode = AccessExclusiveLock; break; Dropping a partition implicitly involves dropping a constraint. We could gamble that the above has no consequences that are really serious enough to care about, and that none of the other subsidiary objects that we adjust during detach (indexes, foreign keys, triggers) really need AccessExclusiveLock now, and that none of the other kinds of subsidiary objects that we might need to adjust in the future during a detach will be changes that require AccessExclusiveLock either, but that sounds awfully risky to me. We have very little DDL that runs with less than AccessExclusiveLock, and I've already found lots of subtle problems that have to be patched up just for the particular case of allowing attach/detach to take a lesser lock on the parent table, and I bet that there are a whole bunch more similar problems when you start talking about weakening the lock on the child table, and I'm not convinced that there are any reasonable solutions to some of those problems, let alone that we can come up with good solutions to all of them in the very near future. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
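[Editorial sketch: the counter idea a couple of paragraphs up might look roughly like this at execution time. It is entirely hypothetical -- no such counter exists, and GetPartitionDetachCount() is an invented accessor -- but it shows that, once the planner has stashed the value it saw, detecting a concurrent detach and erroring out would be cheap, leaving only the easier "partitions added" case to handle by remapping.]

static void
check_no_concurrent_detach(Relation parent, uint64 detach_count_at_plan)
{
    uint64      detach_count_now = GetPartitionDetachCount(parent);    /* hypothetical */

    if (detach_count_now != detach_count_at_plan)
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("partitions of \"%s\" were detached concurrently, retry the query",
                        RelationGetRelationName(parent))));
}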
On Mon, 28 Jan 2019 at 20:15, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Right, the planner/executor "disconnect" is one of the challenges, and
> > why I was trying to keep the old copy of the PartitionDesc around
> > instead of building updated ones as needed.
> I agree that would be simpler, but I just don't see how to make it
> work. For one thing, keeping the old copy around can't work in
> parallel workers, which never had a copy in the first place. For two
> things, we don't have a really good mechanism to keep the
> PartitionDesc that was used at plan time around until execution time.
> Keeping the relation open would do it, but I'm pretty sure that causes
> other problems; the system doesn't expect any residual references.
> I know you had a solution to this problem, but I don't see how it can
> work. You said "Snapshots have their own cache (hash table) of
> partition descriptors. If a partdesc is requested and the snapshot has
> already obtained that partdesc, the original one is returned -- we
> don't request a new one from partcache." But presumably this means
> when the last snapshot is unregistered, the cache is flushed
> (otherwise, when?) and if that's so then this relies on the snapshot
> that was used for planning still being around at execution time, which
> I am pretty sure isn't guaranteed.
> Also, and I think this point is worthy of some further discussion, the
> thing that really seems broken to me about your design is the idea
> that it's OK to use the query or transaction snapshot to decide which
> partitions exist. The problem with that is that some query or
> transaction with an old snapshot might see as a partition some table
> that has been dropped or radically altered - different column types,
> attached to some other table now, attached to same table but with
> different bounds, or just dropped. And therefore it might try to
> insert data into that table and fail in all kinds of crazy ways, about
> the mildest of which is inserting data that doesn't match the current
> partition constraint. I'm willing to be told that I've misunderstood
> the way it all works and this isn't really a problem for some reason,
> but my current belief is that not only is it a problem with your
> design, but that it's such a bad problem that there's really no way to
> fix it and we have to abandon your entire approach and go a different
> route. If you don't think that's true, then perhaps we should discuss
> it further.
> > > I suppose the reason for this is so
> > > that we don't have to go to the expense of copying the partition
> > > bounds from the PartitionDesc into the final plan, but it somehow
> > > seems a little scary to me. Perhaps I am too easily frightened, but
> > > it's certainly a problem from the point of view of this project, which
> > > wants to let the PartitionDesc change concurrently.
> >
> > Well, my definition of the problem started with the assumption that we
> > would keep the partition array indexes unchanged, so "change
> > concurrently" is what we needed to avoid. Yes, I realize that you've
> > opted to change that definition.
> I don't think I made a conscious decision to change this, and I'm kind
> of wondering whether I have missed some better approach here. I feel
> like the direction I'm pursuing is an inevitable consequence of having
> no good way to keep the PartitionDesc around from plan-time to
> execution-time, which in turn feels like an inevitable consequence of
> the two points I made above: there's no guarantee that the plan-time
> snapshot is still registered anywhere by the time we get to execution
> time, and even if there were, the associated PartitionDesc may point
> to tables that have been drastically modified or don't exist any more.
> But it's possible that my chain-of-inevitable-consequences has a weak
> link, in which case I would surely like it if you (or someone else)
> would point it out to me.
> > I may have forgotten some of your earlier emails on this, but one aspect
> > (possibly a key one) is that I'm not sure we really need to cope, other
> > than with an ERROR, with queries that continue to run across an
> > attach/detach -- moreso in absurd scenarios such as the ones you
> > described where the detached table is later re-attached, possibly to a
> > different partitioned table. I mean, if we can just detect the case and
> > raise an error, and this let us make it all work reasonably, that might
> > be better.
> Well, that's an interesting idea. I assumed that users would hate
> that kind of behavior with a fiery passion that could never be
> quenched. If not, then the problem changes from *coping* with
> concurrent changes to *detecting* concurrent changes, which may be
> easier, but see below.
> > I think detaching partitions concurrently is a necessary part of this
> > feature, so I would prefer not to go with a solution that works for
> > attaching partitions but not for detaching them. That said, I don't see
> > why it's impossible to adjust the partition maps in both cases. But I
> > don't have anything better than hand-waving ATM.
> The general problem here goes back to what I wrote in the third
> paragraph of this email: a PartitionDesc that was built with a
> particular snapshot can't be assumed to be usable after any subsequent
> DDL has occurred that might affect the shape of the PartitionDesc.
> For example, if somebody detaches a partition and reattaches it with
> different partition bounds, and we use the old PartitionDesc for tuple
> routing, we'll route tuples into that partition that do not satisfy
> its partition constraint. And we won't even get an ERROR, because the
> system assumes that any tuple which arrives at a partition as a result
> of tuple routing must necessarily satisfy the partition constraint.
> If only concurrent CREATE/ATTACH operations are allowed and
> DROP/DETACH is not, then that kind of thing isn't possible. Any new
> partitions which have shown up since the plan was created can just be
> ignored, and the old ones must still have the same partition bounds
> that they did before, and everything is fine. Or, if we're OK with a
> less-nice solution, we could just ERROR out when the number of
> partitions have changed. Some people will get errors they don't like,
> but they won't end up with rows in their partitions that violate the
> constraints.
> But as soon as you allow concurrent DETACH, then things get really
> crazy. Even if, at execution time, there are the same number of
> partitions as I had at plan time, and even if those partitions have
> the same OIDs as what I had at plan time, and even if those OIDs are
> in the same order in the PartitionDesc, it does not prove that things
> are OK. The partition could have been detached and reattached with a
> narrower set of partition bounds. And if so, then we might route a
> tuple to it that doesn't fit that narrower set of bounds, and there
> will be no error, just database corruption.
> I suppose one idea for handling this is to stick a counter into
> pg_class or something that gets incremented every time a partition is
> detached. At plan time, save the counter value; if it has changed at
> execution time, ERROR. If not, then you have only added partitions to
> worry about, and that's an easier problem.
Yes, a version number would solve that issue.
> But I kind of wonder whether we're really gaining as much as you think
> by trying to support concurrent DETACH in the first place. If there
> are queries running against the table, there's probably at least
> AccessShareLock on the partition itself, not just the parent. And
> that means that reducing the necessary lock on the parent from
> AccessExclusiveLock to something less doesn't really help that much,
> because we're still going to block trying to acquire
> AccessExclusiveLock on the child. Now you might say: OK, well, just
> reduce that lock level, too.
The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just marking it for later disuse.
> But that seems to me to be opening a giant can of worms which we are
> unlikely to get resolved in time for this release. The worst of those
> problem is that if you also reduce the lock level on the partition
> when attaching it, then you are adding a constraint while somebody
> might at the exact same time be inserting a tuple that violates that
> constraint.
Spurious.
This would only be true if we were adding a constraint that affected existing partitions.
The constraint being added affects the newly added partition, not existing ones.
> Those two things have to be synchronized somehow. You
> could avoid that by reducing the lock level on the partition when
> detaching and not when attaching. But even then, detaching a
> partition can involve performing a whole bunch of operations for which
> we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>
>     /*
>      * Removing constraints can affect SELECTs that have been
>      * optimised assuming the constraint holds true.
>      */
>     case AT_DropConstraint: /* as DROP INDEX */
>     case AT_DropNotNull:    /* may change some SQL plans */
>         cmd_lockmode = AccessExclusiveLock;
>         break;
>
> Dropping a partition implicitly involves dropping a constraint. We
> could gamble that the above has no consequences that are really
It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.
> serious enough to care about, and that none of the other subsidiary
> objects that we adjust during detach (indexes, foreign keys, triggers)
> really need AccessExclusiveLock now, and that none of the other kinds
> of subsidiary objects that we might need to adjust in the future
> during a detach will be changes that require AccessExclusiveLock
> either, but that sounds awfully risky to me. We have very little DDL
> that runs with less than AccessExclusiveLock, and I've already found
> lots of subtle problems that have to be patched up just for the
> particular case of allowing attach/detach to take a lesser lock on the
> parent table, and I bet that there are a whole bunch more similar
> problems when you start talking about weakening the lock on the child
> table, and I'm not convinced that there are any reasonable solutions
> to some of those problems, let alone that we can come up with good
> solutions to all of them in the very near future.
I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 29, 2019 at 12:29 AM Simon Riggs <simon@2ndquadrant.com> wrote:

>> But I kind of wonder whether we're really gaining as much as you think by trying to support concurrent DETACH in the first place. If there are queries running against the table, there's probably at least AccessShareLock on the partition itself, not just the parent. And that means that reducing the necessary lock on the parent from AccessExclusiveLock to something less doesn't really help that much, because we're still going to block trying to acquire AccessExclusiveLock on the child. Now you might say: OK, well, just reduce that lock level, too.
>
> The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just marking it for later disuse.

Well, I don't think that's the way any patch so far proposed actually works.

>> But that seems to me to be opening a giant can of worms which we are unlikely to get resolved in time for this release. The worst of those problems is that if you also reduce the lock level on the partition when attaching it, then you are adding a constraint while somebody might at the exact same time be inserting a tuple that violates that constraint.
>
> Spurious.
>
> This would only be true if we were adding a constraint that affected existing partitions.
>
> The constraint being added affects the newly added partition, not existing ones.

I agree that it affects the newly added partition, not existing ones. But if you don't hold an AccessExclusiveLock on that partition while you are adding that constraint to it, then somebody could be concurrently inserting a tuple that violates that constraint. This would be an INSERT targeting the partition directly, not somebody operating on the partitioning hierarchy to which it is being attached.

>> Those two things have to be synchronized somehow. You could avoid that by reducing the lock level on the partition when detaching and not when attaching. But even then, detaching a partition can involve performing a whole bunch of operations for which we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>>
>>     /*
>>      * Removing constraints can affect SELECTs that have been
>>      * optimised assuming the constraint holds true.
>>      */
>>     case AT_DropConstraint: /* as DROP INDEX */
>>     case AT_DropNotNull:    /* may change some SQL plans */
>>         cmd_lockmode = AccessExclusiveLock;
>>         break;
>>
>> Dropping a partition implicitly involves dropping a constraint. We could gamble that the above has no consequences that are really
>
> It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.

That's not true, but I can't refute your argument any more than that because you haven't made one.

> I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.

I think they do apply, and until somebody explains convincingly why they don't, I'm going to keep thinking that they do. Telling me that my points are wrong without making any kind of argument about why they are wrong is not constructive. I've put a lot of energy into analyzing this topic, both recently and in previous release cycles, and I'm not inclined to just say "OK, well, Simon says I'm wrong, so that's the end of it."

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
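For the record, the race being described is easy to state in SQL terms. The sketch below is illustrative only; it assumes a hypothetical ATTACH that takes less than AccessExclusiveLock on the table being attached, which no posted patch actually does:

create table parent (a int) partition by list (a);
create table candidate (a int);

-- Session 1: an INSERT directly into the future partition, still open
-- (holding only RowExclusiveLock) at the moment of the attach.
begin;
insert into candidate values (42);

-- Session 2: attaches the table with bounds that exclude 42. With only a
-- weak lock on candidate, nothing forces this to wait for session 1, and
-- the validation scan cannot see session 1's uncommitted row.
alter table parent attach partition candidate for values in (1);

-- Session 1:
commit;

-- Result: candidate is now a partition of parent whose partition constraint
-- (a = 1) is violated by the committed row (42), with no error reported.
-- Taking AccessExclusiveLock on candidate during ATTACH prevents exactly this.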
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> I wrote a little patch that stores the relation OIDs of the partitions into the PartitionedPruneRelInfo and then, at execution time, does an Assert() that what it gets matches what existed at plan time. I figured that a good start would be to find a test case where this fails with concurrent DDL allowed, but I haven't so far succeeded in devising one. To make the Assert() fail, I need to come up with a case where concurrent DDL has caused the PartitionDesc to be rebuilt but without causing an update to the plan. If I use prepared queries inside of a transaction block, [...]
>>
>> I also had the idea of trying to use a cursor, because if I could start execution of a query, [...]
>
> Those are the ways I thought of, and the reason for the shape of some of those .spec tests. I wasn't able to hit the situation.

I've managed to come up with a test case that seems to hit this case.

Preparation:

create table foo (a int, b text, primary key (a)) partition by range (a);
create table foo1 partition of foo for values from (0) to (1000);
create table foo2 partition of foo for values from (1000) to (2000);
insert into foo1 values (1, 'one');
insert into foo2 values (1001, 'two');
alter system set plan_cache_mode = force_generic_plan;
select pg_reload_conf();

$ cat >x
alter table foo detach partition foo2;
alter table foo attach partition foo2 for values from (1000) to (2000);
^D

Window #1:

prepare foo as select * from foo where a = $1;
explain execute foo(1500);
\watch 0.01

Window #2:

$ pgbench -n -f x -T 60

Boom:

TRAP: FailedAssertion("!(partdesc->nparts == pinfo->nparts)", File: "execPartition.c", Line: 1631)

I don't know how to reduce this to something reliable enough to include it in the regression tests, and maybe we don't really need that, but it's good to know that this is not a purely theoretical problem. I think next I'll try to write some code to make execPartition.c able to cope with the situation when it arises. (My draft/WIP patches attached, if you're interested.)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0008-Drop-lock-level.patch
- 0006-Adapt-the-executor-to-use-a-PartitionDirectory.patch
- 0005-Adapt-the-optimizer-to-use-a-PartitionDirectory.patch
- 0007-relid_map-crosschecks.patch
- 0004-Postpone-old-context-removal.patch
- 0003-Initial-cut-at-PartitionDirectory.patch
- 0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
- 0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch
On Tue, Jan 29, 2019 at 1:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > I don't know how to reduce this to something reliable enough to > include it in the regression tests, and maybe we don't really need > that, but it's good to know that this is not a purely theoretical > problem. I think next I'll try to write some code to make > execPartition.c able to cope with the situation when it arises. OK, that seems to be pretty easy. New patch series attached. The patch with that new logic is 0004. I've consolidated some of the things I had as separate patches in my last post and rewritten the commit messages to explain more clearly the purpose of each patch. Open issues: - For now, I haven't tried to handle the DETACH PARTITION case. I don't think there's anything preventing someone - possibly even me - from implementing the counter-based approach that I described in the previous message, but I think it would be good to have some more discussion first on whether it's acceptable to make concurrent queries error out. I think any queries that were already up and running would be fine, but any that were planned before the DETACH and tried to execute afterwards would get an ERROR. That's fairly low-probability, because normally the shared invalidation machinery would cause replanning, but there's a race condition, so we'd have to document something like: if you use this feature, it'll probably just work, but you might get some funny errors in other sessions if you're unlucky. That kinda sucks but maybe we should just suck it up. Possibly we should consider making the concurrent behavior optional, so that if you'd rather take blocking locks than risk errors, you have that option. Of course I guess you could also just let people do an explicit LOCK TABLE if that's what they want. Or we could try to actually make it work in that case, I guess by ignoring the detached partitions, but that seems a lot harder. - 0003 doesn't have any handling for parallel query at this point, so even though within a single backend a single query execution will always get the same PartitionDesc for the same relation, the answers might not be consistent across the parallel group. I keep going back and forth on whether this really matters. It's too late to modify the plan, so any relations attached since it was generated are not going to get scanned. As for detached relations, we're talking about making them error out, so we don't have to worry about different backends come to different conclusions about whether they should be scanned. But maybe we should try to be smarter instead. One concern is that even relations that aren't scanned could still be affected because of tuple routing, but right now parallel queries can't INSERT or UPDATE rows anyway. Then again, maybe we should try not to add more obstacles in the way of lifting that restriction. Then again again, AFAICT we wouldn't be able to test that the new code is actually solving a problem right now today, and how much untested code do we really want in the tree? And then on the eleventh hand, maybe there are other reasons why it's important to use the same PartitionDesc across all parallel workers that I'm not thinking about at the moment. - 0003 also changes the order in which locks are acquired. I am not sure whether we care about this, especially in view of other pending changes. If you know of other problems, have solutions to or opinions about these, or think the whole approach is wrong, please speak up! 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- 0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
- 0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch
- 0005-Reduce-the-lock-level-required-to-attach-a-partition.patch
- 0004-Teach-runtime-partition-pruning-to-cope-with-concurr.patch
- 0003-Ensure-that-repeated-PartitionDesc-lookups-return-th.patch
On 2019-Jan-31, Robert Haas wrote: > OK, that seems to be pretty easy. New patch series attached. The > patch with that new logic is 0004. I've consolidated some of the > things I had as separate patches in my last post and rewritten the > commit messages to explain more clearly the purpose of each patch. Looks awesome. > - For now, I haven't tried to handle the DETACH PARTITION case. I > don't think there's anything preventing someone - possibly even me - > from implementing the counter-based approach that I described in the > previous message, but I think it would be good to have some more > discussion first on whether it's acceptable to make concurrent queries > error out. I think any queries that were already up and running would > be fine, but any that were planned before the DETACH and tried to > execute afterwards would get an ERROR. That's fairly low-probability, > because normally the shared invalidation machinery would cause > replanning, but there's a race condition, so we'd have to document > something like: if you use this feature, it'll probably just work, but > you might get some funny errors in other sessions if you're unlucky. > That kinda sucks but maybe we should just suck it up. Possibly we > should consider making the concurrent behavior optional, so that if > you'd rather take blocking locks than risk errors, you have that > option. Of course I guess you could also just let people do an > explicit LOCK TABLE if that's what they want. Or we could try to > actually make it work in that case, I guess by ignoring the detached > partitions, but that seems a lot harder. I think telling people to do LOCK TABLE beforehand if they care about errors is sufficient. On the other hand, I do hope that we're only going to cause queries to fail if they would affect the partition that's being detached and not other partitions in the table. Or maybe because of the replanning on invalidation this doesn't matter as much as I think it does. > - 0003 doesn't have any handling for parallel query at this point, so > even though within a single backend a single query execution will > always get the same PartitionDesc for the same relation, the answers > might not be consistent across the parallel group. That doesn't sound good. I think the easiest would be to just serialize the PartitionDesc and send it to the workers instead of them recomputing it, but then I worry that this might have bad performance when the partition desc is large. (Or maybe sending bytes over pqmq is faster than reading all those catalog entries and so this isn't a concern anyway.) > - 0003 also changes the order in which locks are acquired. I am not > sure whether we care about this, especially in view of other pending > changes. Yeah, the drawbacks of the unpredictable locking order are worrisome, but then the performance gain is hard to dismiss. Not this patch only but the others too. If we're okay with the others going in, I guess we don't have concerns about this one either. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
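As a concrete illustration of the "just take the lock yourself" option mentioned above: a user who prefers the old blocking behaviour over the possibility of such errors can simply wrap the detach in an explicit lock, e.g.:

begin;
lock table foo in access exclusive mode;  -- opt back into blocking behaviour
alter table foo detach partition foo2;
commit;

-- Sessions querying foo now wait rather than risk executing a plan built
-- against a partition list that changes under them.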
On Thu, Jan 31, 2019 at 6:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > - 0003 doesn't have any handling for parallel query at this point, so > > even though within a single backend a single query execution will > > always get the same PartitionDesc for the same relation, the answers > > might not be consistent across the parallel group. > > That doesn't sound good. I think the easiest would be to just serialize > the PartitionDesc and send it to the workers instead of them recomputing > it, but then I worry that this might have bad performance when the > partition desc is large. (Or maybe sending bytes over pqmq is faster > than reading all those catalog entries and so this isn't a concern > anyway.) I don't think we'd be using pqmq here, or shm_mq either, but I think the bigger issues is that starting a parallel query is already a pretty heavy operation, and so the added overhead of this is probably not very noticeable. I agree that it seems a bit expensive, but since we're already waiting for the postmaster to fork() a new process which then has to initialize itself, this probably won't break the bank. What bothers me more is that it's adding a substantial amount of code that could very well contain bugs to fix something that isn't clearly a problem in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 1, 2019 at 9:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think we'd be using pqmq here, or shm_mq either, but I think > the bigger issues is that starting a parallel query is already a > pretty heavy operation, and so the added overhead of this is probably > not very noticeable. I agree that it seems a bit expensive, but since > we're already waiting for the postmaster to fork() a new process which > then has to initialize itself, this probably won't break the bank. > What bothers me more is that it's adding a substantial amount of code > that could very well contain bugs to fix something that isn't clearly > a problem in the first place. I spent most of the last 6 hours writing and debugging a substantial chunk of the code that would be needed. Here's an 0006 patch that adds functions to serialize and restore PartitionDesc in a manner similar to what parallel query does for other object types. Since a PartitionDesc includes a pointer to a PartitionBoundInfo, that meant also writing functions to serialize and restore those. If we want to go this route, I think the next thing to do would be to integrate this into the PartitionDirectory infrastructure. Basically what I'm imagining we would do there is have a hash table stored in shared memory to go with the one that is already stored in backend-private memory. The shared table stores serialized entries, and the local table stores normal ones. Any lookups try the local table first, then the shared table. If we get a hit in the shared table, we deserialize whatever we find there and stash the result in the local table. If we find it neither place, we generate a new entry in the local table and then serialize it into the shard table. It's not quite clear to me at the moment how to solve the concurrency problems associated with this design, but it's probably not too hard. I don't have enough mental energy left to figure it out today, though. After having written this code, I'm still torn about whether to go further with this design. On the one hand, this is such boilerplate code that it's kinda hard to imagine it having too many more bugs; on the other hand, as you can see, it's a non-trivial amount of code to add without a real clear reason, and I'm not sure we have one, even though in the abstract it seems like a better way to go. Still interesting in hearing more opinions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, 2 Feb 2019 at 09:31, Robert Haas <robertmhaas@gmail.com> wrote: > After having written this code, I'm still torn about whether to go > further with this design. On the one hand, this is such boilerplate > code that it's kinda hard to imagine it having too many more bugs; on > the other hand, as you can see, it's a non-trivial amount of code to > add without a real clear reason, and I'm not sure we have one, even > though in the abstract it seems like a better way to go. I think we do need to ensure that the PartitionDesc matches between worker and leader. Have a look at choose_next_subplan_for_worker() in nodeAppend.c. Notice that a call is made to ExecFindMatchingSubPlans(). -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 2, 2019 at 7:19 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > I think we do need to ensure that the PartitionDesc matches between > worker and leader. Have a look at choose_next_subplan_for_worker() in > nodeAppend.c. Notice that a call is made to > ExecFindMatchingSubPlans(). Thanks for the tip. I see that code, but I'm not sure that I understand why it matters here. First, if I'm not mistaken, what's being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan indexes, not anything that returns to a PartitionDesc directly. And second, even if it did, it looks like the computation is done separately in every backend and not shared among backends, so even if it were directly referring to PartitionDesc indexes, it still won't be assuming that they're the same in every backend. Can you further explain your thinking? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, 4 Feb 2019 at 16:45, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Feb 2, 2019 at 7:19 PM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > I think we do need to ensure that the PartitionDesc matches between > > worker and leader. Have a look at choose_next_subplan_for_worker() in > > nodeAppend.c. Notice that a call is made to > > ExecFindMatchingSubPlans(). > > Thanks for the tip. I see that code, but I'm not sure that I > understand why it matters here. First, if I'm not mistaken, what's > being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan > indexes, not anything that returns to a PartitionDesc directly. And > second, even if it did, it looks like the computation is done > separately in every backend and not shared among backends, so even if > it were directly referring to PartitionDesc indexes, it still won't be > assuming that they're the same in every backend. Can you further > explain your thinking? In a Parallel Append, each parallel worker will call ExecInitAppend(), which calls ExecCreatePartitionPruneState(). That function makes a call to RelationGetPartitionDesc() and records the partdesc's boundinfo in context->boundinfo. This means that if we perform any pruning in the parallel worker in choose_next_subplan_for_worker() then find_matching_subplans_recurse() will use the PartitionDesc from the parallel worker to translate the partition indexes into the Append's subnodes. If the PartitionDesc from the parallel worker has an extra partition than what was there when the plan was built then the partition index to subplan index translation will be incorrect as the find_matching_subplans_recurse() will call get_matching_partitions() using the context with the PartitionDesc containing the additional partition. The return value from get_matching_partitions() is fine, it's just that the code inside the while ((i = bms_next_member(partset, i)) >= 0) loop that will do the wrong thing. It could even crash if partset has an index out of bounds of the subplan_map or subpart_map arrays. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 4, 2019 at 12:02 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > If the PartitionDesc from the parallel worker has an extra partition > than what was there when the plan was built then the partition index > to subplan index translation will be incorrect as the > find_matching_subplans_recurse() will call get_matching_partitions() > using the context with the PartitionDesc containing the additional > partition. The return value from get_matching_partitions() is fine, > it's just that the code inside the while ((i = > bms_next_member(partset, i)) >= 0) loop that will do the wrong thing. > It could even crash if partset has an index out of bounds of the > subplan_map or subpart_map arrays. Is there any chance you've missed the fact that in one of the later patches in the series I added code to adjust the subplan_map and subpart_map arrays to compensate for any extra partitions? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 4, 2019 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Feb 4, 2019 at 12:02 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > If the PartitionDesc from the parallel worker has an extra partition > > than what was there when the plan was built then the partition index > > to subplan index translation will be incorrect as the > > find_matching_subplans_recurse() will call get_matching_partitions() > > using the context with the PartitionDesc containing the additional > > partition. The return value from get_matching_partitions() is fine, > > it's just that the code inside the while ((i = > > bms_next_member(partset, i)) >= 0) loop that will do the wrong thing. > > It could even crash if partset has an index out of bounds of the > > subplan_map or subpart_map arrays. > > Is there any chance you've missed the fact that in one of the later > patches in the series I added code to adjust the subplan_map and > subpart_map arrays to compensate for any extra partitions? In case that wasn't clear enough, my point here is that while the leader and workers could end up with different ideas about the shape of the PartitionDesc, each would end up with a subplan_map and subpart_map array adapted to the view of the PartitionDesc with which they ended up, and therefore, I think, everything should work. So far there is, to my knowledge, no situation in which a PartitionDesc index gets passed between one backend and another, and as long as we don't do that, it's not really necessary for them to agree; each backend needs to individually ignore any concurrently added partitions not contemplated by the plan, but it doesn't matter whether backend A and backend B agree on which partitions were concurrently added, just that each ignores the ones it knows about. Since time is rolling along here, I went ahead and committed 0001 which seems harmless even if somebody finds a huge problem with some other part of this. If anybody wants to review the approach or the code before I proceed further, that would be great, but please speak up soon. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 5 Feb 2019 at 01:54, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 12:02 AM David Rowley > <david.rowley@2ndquadrant.com> wrote: > > If the PartitionDesc from the parallel worker has an extra partition > > than what was there when the plan was built then the partition index > > to subplan index translation will be incorrect as the > > find_matching_subplans_recurse() will call get_matching_partitions() > > using the context with the PartitionDesc containing the additional > > partition. The return value from get_matching_partitions() is fine, > > it's just that the code inside the while ((i = > > bms_next_member(partset, i)) >= 0) loop that will do the wrong thing. > > It could even crash if partset has an index out of bounds of the > > subplan_map or subpart_map arrays. > > Is there any chance you've missed the fact that in one of the later > patches in the series I added code to adjust the subplan_map and > subpart_map arrays to compensate for any extra partitions? I admit that I hadn't looked at the patch, I was just going on what I had read here. I wasn't sure how the re-map would have been done as some of the information is unavailable during execution, but I see now that you're modified it so we send a list of Oids that we expect and remap based on if an unexpected Oid is found. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 21, 2018 at 6:04 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote: > > - I refactored expand_inherited_rtentry() to drive partition expansion > > entirely off of PartitionDescs. The reason why this is necessary is > > that it clearly will not work to have find_all_inheritors() use a > > current snapshot to decide what children we have and lock them, and > > then consult a different source of truth to decide which relations to > > open with NoLock. There's nothing to keep the lists of partitions > > from being different in the two cases, and that demonstrably causes > > assertion failures if you SELECT with an ATTACH/DETACH loop running in > > the background. However, it also changes the order in which tables get > > locked. Possibly that could be fixed by teaching > > expand_partitioned_rtentry() to qsort() the OIDs the way > > find_inheritance_children() does. It also loses the infinite-loop > > protection which find_all_inheritors() has. Not sure what to do about > > that. > > I don't think you need to qsort() the Oids before locking. What the > qsort() does today is ensure we get a consistent locking order. Any > other order would surely do, providing we stick to it consistently. I > think PartitionDesc order is fine, as it's consistent. Having it > locked in PartitionDesc order I think is what's needed for [1] anyway. > [2] proposes to relax the locking order taken during execution. > > [1] https://commitfest.postgresql.org/21/1778/ > [2] https://commitfest.postgresql.org/21/1887/ Based on this feedback, I went ahead and committed the part of the previously-posted patch set that makes this change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 31, 2019 at 1:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
> New patch series attached.

And here's yet another new patch series, rebased over today's commit and with a couple of other fixes:

1. I realized that the PartitionDirectory for the planner ought to be attached to the PlannerGlobal, not the PlannerInfo; we don't want to create more than one partition directory per query planning cycle, and we do want our notion of the PartitionDesc for a given relation to be stable between the outer query and any subqueries.

2. I discovered - via CLOBBER_CACHE_ALWAYS testing - that the PartitionDirectory has to hold a reference count on the relcache entry. In hindsight, this should have been obvious: the planner keeps the locks when it closes a relation and later reopens it, but it doesn't keep the relation open, which is what prevents recycling of the old PartitionDesc. Unfortunately these additional reference count manipulations are probably not free. I don't know how expensive they are, though; maybe it's not too bad.

Aside from these problems, I think I have spotted a subtle problem in 0001. I'll think about that some more and post another update.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Feb 26, 2019 at 5:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > Aside from these problems, I think I have spotted a subtle problem in > 0001. I'll think about that some more and post another update. 0001 turned out to be guarding against the wrong problem. It supposed that if we didn't get a coherent view of the system catalogs due to concurrent DDL, we could just AcceptInvalidationMessages() and retry. But that turns out to be wrong, because there's a (very) narrow window after a process removes itself from the ProcArray and before it sends invalidation messages. It wasn't difficult to engineer an alternative solution that works, but unfortunately it's only good enough to handle the ATTACH case, so this is another thing that will need more thought for concurrent DETACH. Anyway, the updated 0001 contains that code and some explanatory comments. The rest of the series is substantially unchanged. I'm not currently aware of any remaining correctness issues with this code, although certainly there may be some. There has been a certain dearth of volunteers to review any of this. I do plan to poke at it a bit to see whether it has any significant performance impact, but not today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Feb 28, 2019 at 3:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not currently aware of any remaining correctness issues with this code, although certainly there may be some. There has been a certain dearth of volunteers to review any of this. I do plan to poke at it a bit to see whether it has any significant performance impact, but not today.

Today, I did some performance testing. I created a table with 100 partitions and randomly selected rows from it using pgbench, with and without -M prepared. The results show a small regression, but I believe it's below the noise floor. Five minute test runs.

with prepared queries

master:
tps = 10919.914458 (including connections establishing)
tps = 10876.271217 (including connections establishing)
tps = 10761.586160 (including connections establishing)

concurrent-attach:
tps = 10883.535582 (including connections establishing)
tps = 10868.471805 (including connections establishing)
tps = 10761.586160 (including connections establishing)

with simple queries

master:
tps = 1486.120530 (including connections establishing)
tps = 1486.797251 (including connections establishing)
tps = 1494.129256 (including connections establishing)

concurrent-attach:
tps = 1481.774212 (including connections establishing)
tps = 1472.159016 (including connections establishing)
tps = 1476.444097 (including connections establishing)

Looking at the total of the three results, that's about an 0.8% regression with simple queries and an 0.2% regression with prepared queries. Looking at the median, it's about 0.7% and 0.07%. Would anybody like to argue that's a reason not to commit these patches?

Would anyone like to argue that there is any other reason not to commit these patches?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
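The benchmark scripts themselves were not posted, but a setup along the following lines reproduces the shape of the test; the table name, bounds, and script name are assumptions:

create table bench (a int primary key, b text) partition by range (a);
select format('create table bench%s partition of bench for values from (%s) to (%s)',
              i, (i - 1) * 1000, i * 1000)
from generate_series(1, 100) i
\gexec
insert into bench select g, 'row ' || g from generate_series(0, 99999) g;

-- select.sql, run as: pgbench -n -f select.sql -T 300 [-M prepared]
\set id random(0, 99999)
select * from bench where a = :id;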
On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote: > Would anyone like to argue that there is any other reason not to > commit these patches? Hi Robert, Thanks for working on this. I'm sorry I didn't get a chance to dedicate some time to look at it. It looks like you've pushed all of this now. Can the CF entry be marked as committed? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 14, 2019 at 6:12 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote: > > Would anyone like to argue that there is any other reason not to > > commit these patches? > > Hi Robert, > > Thanks for working on this. I'm sorry I didn't get a chance to > dedicate some time to look at it. > > It looks like you've pushed all of this now. Can the CF entry be > marked as committed? Yeah, done now, thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company