Thread: Table partition with primary key in 11.3
CREATE TABLE public.test1 (
    x1 integer NOT NULL,
    x2 integer NOT NULL,
    CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
) PARTITION BY RANGE (x2);

This query works in 11.1 but fails in 11.3 with messages:

ERROR: insufficient columns in PRIMARY KEY constraint definition
DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which
is part of the partition key.
SQL state: 0A000

User <ya@penek.org> writes:
> CREATE TABLE public.test1 (
>     x1 integer NOT NULL,
>     x2 integer NOT NULL,
>     CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
> ) PARTITION BY RANGE (x2);

> This query works in 11.1 but fails in 11.3 with messages:

> ERROR: insufficient columns in PRIMARY KEY constraint definition
> DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which
> is part of the partition key.
> SQL state: 0A000

Indeed, that primary key is no good.  It was a bug that 11.1
allowed it, which was fixed here:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master [0ad41cf53] 2019-01-14 19:28:10 -0300
Branch: REL_11_STABLE Release: REL_11_2 [74aa7e046] 2019-01-14 19:25:19 -0300

    Fix unique INCLUDE indexes on partitioned tables

    We were considering the INCLUDE columns as part of the key, allowing
    unicity-violating rows to be inserted in different partitions.

    Concurrent development conflict in eb7ed3f30634 and 8224de4f42cc.

    Reported-by: Justin Pryzby
    Discussion: https://postgr.es/m/20190109065109.GA4285@telsasoft.com

                        regards, tom lane

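For completeness, a definition that satisfies the rule on 11.2 and later
is one where every partition key column is a true key column of the
constraint, not merely an INCLUDE column.  A minimal sketch (note that
this enforces uniqueness on (x1, x2), not on x1 alone):

CREATE TABLE public.test1 (
    x1 integer NOT NULL,
    x2 integer NOT NULL,
    CONSTRAINT test1_pkey PRIMARY KEY (x1, x2)  -- x2 promoted from INCLUDE to key column
) PARTITION BY RANGE (x2);
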
>> CREATE TABLE public.test1 (
>>     x1 integer NOT NULL,
>>     x2 integer NOT NULL,
>>     CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
>> ) PARTITION BY RANGE (x2);
>> This query works in 11.1 but fails in 11.3 with messages:
>> ERROR: insufficient columns in PRIMARY KEY constraint definition
>> DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which
>> is part of the partition key.
>> SQL state: 0A000

> Indeed, that primary key is no good.  It was a bug that 11.1
> allowed it, which was fixed here:
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Branch: master [0ad41cf53] 2019-01-14 19:28:10 -0300
> Branch: REL_11_STABLE Release: REL_11_2 [74aa7e046] 2019-01-14 19:25:19 -0300
> Fix unique INCLUDE indexes on partitioned tables
> We were considering the INCLUDE columns as part of the key, allowing
> unicity-violating rows to be inserted in different partitions.
> Concurrent development conflict in eb7ed3f30634 and 8224de4f42cc.
> Reported-by: Justin Pryzby
> Discussion: https://postgr.es/m/20190109065109.GA4285@telsasoft.com
> regards, tom lane

I think that your position about primary keys in partitioned tables is
not right.

In a regular table, a single-column primary key is unique across the
whole table.  In a partitioned table, from the user's point of view, we
MUST likewise see a unique single-column primary key, because that is
the user's requirement; requiring other key columns breaks the
equivalence between regular and partitioned tables and makes
partitioning functionally useless.  For table administrators this is
not a MAY but a MUST: a single-column primary key enforced across all
partitions.  All other implementations are hurried hole-plugging and
are not logical when the engine is reviewed as a whole.

On 2019-Jun-06, Alex V. wrote:

> I think that your position about primary keys in partitioned tables is
> not right.
>
> In a regular table, a single-column primary key is unique across the
> whole table.  In a partitioned table, from the user's point of view,
> we MUST likewise see a unique single-column primary key, because that
> is the user's requirement; requiring other key columns breaks the
> equivalence between regular and partitioned tables and makes
> partitioning functionally useless.  For table administrators this is
> not a MAY but a MUST: a single-column primary key enforced across all
> partitions.  All other implementations are hurried hole-plugging and
> are not logical when the engine is reviewed as a whole.

If you are saying that you think that Postgres should support primary
keys that don't necessarily overlap partition keys, then I agree with
you.  Please send a patch to implement that capability.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

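For reference, the closest thing available today is to enforce the
single-column key within each partition only, which does not give
cross-partition uniqueness -- exactly the gap a global index (or a
primary key that need not overlap the partition key) would close.  A
rough sketch, with partition names, bounds, and constraint names chosen
purely for illustration:

CREATE TABLE public.test1 (
    x1 integer NOT NULL,
    x2 integer NOT NULL
) PARTITION BY RANGE (x2);

CREATE TABLE public.test1_p0 PARTITION OF public.test1
    FOR VALUES FROM (0) TO (100);
CREATE TABLE public.test1_p1 PARTITION OF public.test1
    FOR VALUES FROM (100) TO (200);

-- Enforced per partition only: the same x1 value can still appear in
-- both test1_p0 and test1_p1.
ALTER TABLE public.test1_p0 ADD CONSTRAINT test1_p0_x1_key UNIQUE (x1);
ALTER TABLE public.test1_p1 ADD CONSTRAINT test1_p1_x1_key UNIQUE (x1);
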
On Thu, 6 Jun 2019 at 18:03, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Jun-06, Alex V. wrote:
>
> > I think that your position about primary keys in partitioned tables
> > is not right.
> >
> > In a regular table, a single-column primary key is unique across the
> > whole table.  In a partitioned table, from the user's point of view,
> > we MUST likewise see a unique single-column primary key, because
> > that is the user's requirement; requiring other key columns breaks
> > the equivalence between regular and partitioned tables and makes
> > partitioning functionally useless.  For table administrators this is
> > not a MAY but a MUST: a single-column primary key enforced across
> > all partitions.  All other implementations are hurried hole-plugging
> > and are not logical when the engine is reviewed as a whole.
>
> If you are saying that you think that Postgres should support primary
> keys that don't necessarily overlap partition keys, then I agree with
> you.  Please send a patch to implement that capability.

The point is that unique indexes that overlap partition keys are rather cheap because they can be made from a set of local indexes.
A unique index that doesn't overlap partition keys would be
* unfeasibly huge
* likely to hit the 32TB limit on relations
* unable to cope with dropping partitions
It sounds like you want it because it's just a line of DDL, but in practice your colleagues wouldn't want those things.

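To make the contrast concrete, here is the cheap case described above: a
unique constraint that covers the partition key is implemented as one
local index per partition, created automatically when each partition is
created or attached.  A sketch with purely illustrative table and
partition names:

CREATE TABLE measurements (
    id      bigint NOT NULL,
    logdate date   NOT NULL,
    PRIMARY KEY (id, logdate)          -- covers the partition key
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2019_06 PARTITION OF measurements
    FOR VALUES FROM ('2019-06-01') TO ('2019-07-01');
-- measurements_2019_06 automatically gets its own local unique index;
-- there is no single cross-partition structure to grow past the 32TB
-- relation limit, and dropping the partition simply drops its index
-- along with it.
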
On Thu, Jun 6, 2019 at 10:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> If you are saying that you think that Postgres should support primary
> keys that don't necessarily overlap partition keys, then I agree with
> you.  Please send a patch to implement that capability.

FWIW, I could probably be convinced to work on the nbtree parts of
global indexes if there was a reasonably worked out design that had
some momentum behind it.  I would at least need to know what the
partition identifiers will need to look like in indexes.  Andres has
suggested that I work on variable-width table identifiers in nbtree
for the benefit of pluggable storage engines, but I don't have much
enthusiasm for the idea of doing that without delivering a clear
benefit to users in the same release.

It seems pretty obvious to me that the right general approach for
global indexes is to make the partition identifier a tiebreaker column
that comes right before the heap TID tiebreaker column (which would be
per-partition).  It probably needs to be variable-width, so that the
overhead isn't too bad most of the time.

I don't think that it's necessary to remove the restriction on the size
of relations for global indexes, though maybe global indexes could
focus our attention on that problem.

-- 
Peter Geoghegan

On Fri, 7 Jun 2019 at 08:48, Peter Geoghegan <pg@bowt.ie> wrote:
> Andres has suggested that I work on variable-width table identifiers
> in nbtree for the benefit of pluggable storage engines, but I don't
> have much enthusiasm for the idea of doing that without delivering a
> clear benefit to users in the same release.

You may already be aware, but another use case for such variable-width
identifiers was with indirect indexes as discussed in [1].

[1] https://www.postgresql.org/message-id/20161018182843.xczrxsa2yd47pnru%40alvherre.pgsql

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On Thu, Jun 6, 2019 at 3:00 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> You may already be aware, but another use case for such variable-width
> identifiers was with indirect indexes as discussed in [1]

Right.  I went with global indexes because indirect indexes are
probably a lot more risky as a project.  I'd be particularly concerned
about the complexity of VACUUM there, whereas that doesn't seem all
that bad in the case of global indexes.

-- 
Peter Geoghegan

On 2019-Jun-07, Peter Geoghegan wrote:

> On Thu, Jun 6, 2019 at 3:00 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > You may already be aware, but another use case for such variable-width
> > identifiers was with indirect indexes as discussed in [1]
>
> Right.  I went with global indexes because indirect indexes are
> probably a lot more risky as a project.  I'd be particularly concerned
> about the complexity of VACUUM there, whereas that doesn't seem all
> that bad in the case of global indexes.

I think vacuuming for global indexes is somewhat challenging as well :-)
Maybe not as much as for indirect indexes, that's true.

In order for it to be sustainable, I think you'll want to reuse
partition identifiers when the partitions are dropped/detached, which
means that you need a way to ensure that index entries to those
partitions are removed from all indexes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think vacuuming for global indexes is somewhat challenging as well :-)
> Maybe not as much as for indirect indexes, that's true.
>
> In order for it to be sustainable, I think you'll want to reuse
> partition identifiers when the partitions are dropped/detached, which
> means that you need a way to ensure that index entries to those
> partitions are removed from all indexes.

I'm not so sure about that.  I see your point, but I think that you can
also make the opposite argument.  That is, you can make a good case for
asynchronously cleaning up the dead entries that point to a dropped
partition (probably within VACUUM).  Perhaps we should offer *both* as
options.

-- 
Peter Geoghegan

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I think vacuuming for global indexes is somewhat challenging as well :-)
> > Maybe not as much as for indirect indexes, that's true.
> >
> > In order for it to be sustainable, I think you'll want to reuse
> > partition identifiers when the partitions are dropped/detached, which
> > means that you need a way to ensure that index entries to those
> > partitions are removed from all indexes.
>
> I'm not so sure about that.  I see your point, but I think that you can
> also make the opposite argument.  That is, you can make a good case for
> asynchronously cleaning up the dead entries that point to a dropped
> partition (probably within VACUUM).  Perhaps we should offer *both* as
> options.

I was thinking of asynchronously cleaning it up rather than blocking
DROP/DETACH ... which means you need to keep state somewhere.  I don't
think blocking DROP/DETACH is valuable -- a global index that blocks
DROP/DETACH until the index is clean serves no useful purpose.  (You
could think of a multi-step approach with internal transaction commits,
similar to CIC, but you still need a plan to clean that up in case the
server crashes during that operation.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

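For context, the commands whose latency is at issue are the ordinary
partition maintenance operations; reusing the illustrative measurements
table from the earlier sketch:

-- DETACH keeps the partition's data as a standalone table; DROP removes
-- it entirely.  The question is how much global-index cleanup, if any,
-- these commands should wait for before returning.
ALTER TABLE measurements DETACH PARTITION measurements_2019_06;
DROP TABLE measurements_2019_06;
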
> I was thinking of asynchronously cleaning it up rather than blocking
> DROP/DETACH ... which means you need to keep state somewhere.  I don't
> think blocking DROP/DETACH is valuable -- a global index that blocks
> DROP/DETACH until the index is clean serves no useful purpose.  (You
> could think of a multi-step approach with internal transaction commits,
> similar to CIC, but you still need a plan to clean that up in case the
> server crashes during that operation.)

In Oracle, if you have a global unique index and a partition is
dropped, the index is marked invalid and needs to be rebuilt.  IOW, an
outage.

DB2's approach is better.  When the partition is dropped, the index
entries are marked for deletion and an async cleanup process is
started, which can run for several days if the dropped partition is
large.  But at least the table stays online.

On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I was thinking of asynchronously cleaning it up rather than blocking
> DROP/DETACH ... which means you need to keep state somewhere.  I don't
> think blocking DROP/DETACH is valuable -- a global index that blocks
> DROP/DETACH until the index is clean serves no useful purpose.  (You
> could think of a multi-step approach with internal transaction commits,
> similar to CIC, but you still need a plan to clean that up in case the
> server crashes during that operation.)

The advantages of synchronous clean-up of global indexes when DROP'ing
a partition are that you can recycle the partition number (or whatever
we end up calling it) immediately and predictably, and you can reuse
the space in indexes occupied by keys from the dropped partition
immediately and predictably.  That seems less useful than asynchronous
processing on average, certainly, but those are still real advantages.
You seemed to be particularly concerned about quickly recycling
partition numbers when we drop a partition.

I hope that we can come up with a very efficient on-disk representation
for global index tuples, where only the bare minimum amount of space is
used for partition numbers.  Maybe it won't matter that much if
partition numbers cannot be recycled due to this asynchronous
processing.

-- 
Peter Geoghegan

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I was thinking of asynchronously cleaning it up rather than blocking
> > DROP/DETACH ... which means you need to keep state somewhere.  I don't
> > think blocking DROP/DETACH is valuable -- a global index that blocks
> > DROP/DETACH until the index is clean serves no useful purpose.  (You
> > could think of a multi-step approach with internal transaction commits,
> > similar to CIC, but you still need a plan to clean that up in case the
> > server crashes during that operation.)
>
> The advantages of synchronous clean-up of global indexes when DROP'ing
> a partition are that you can recycle the partition number (or whatever
> we end up calling it) immediately and predictably, and you can reuse
> the space in indexes occupied by keys from the dropped partition
> immediately and predictably.  That seems less useful than asynchronous
> processing on average, certainly, but those are still real advantages.
> You seemed to be particularly concerned about quickly recycling
> partition numbers when we drop a partition.

Well, "quickly" might mean within a week.  If it takes that long to
fully remove a monthly partition to make that partition ID available to
some future month's partition, that seems acceptable.  Blocking
DROP/DETACH for one hour is certainly not acceptable.

If this scheme means that you can keep the partition identifiers stored
in the index to, for instance, 10 bits (allowing for 1024 partitions to
exist at any one time, including those in the process of being cleaned
up) instead of having to expand to (say) 24 because that covers a couple
of years of operation before having to recreate the index, it seems
worthwhile.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Well, "quickly" might mean within a week.  If it takes that long to
> fully remove a monthly partition to make that partition ID available to
> some future month's partition, that seems acceptable.  Blocking
> DROP/DETACH for one hour is certainly not acceptable.

I agree that synchronous clean-up of global indexes wouldn't make
sense there, and might not be very compelling in practice.

It occurs to me that we could add a code path to nbtree page splits,
that considered removing dropped partition tuples to avert a page
split.  This would be a bit like the LP_DEAD/kill_prior_tuple thing.
Technically the space used by index tuples that point to a dropped
partition wouldn't become reclaimable immediately, but it might not
matter with this optimization.

> If this scheme means that you can keep the partition identifiers stored
> in the index to, for instance, 10 bits (allowing for 1024 partitions to
> exist at any one time, including those in the process of being cleaned
> up) instead of having to expand to (say) 24 because that covers a couple
> of years of operation before having to recreate the index, it seems
> worthwhile.

I think that we should have no inherent limit on the number of
partitions available at once, on general principle.  Limiting the
number of partitions is a design that probably has a lot of sharp
edges.

The nbtree heap TID column and partition number column should probably
be a single varwidth column (not two separate columns), that is often
no wider than 6 bytes, but can be wider when there are many partitions
and/or very large partitions.  That will be challenging, but it seems
like the right place to solve the problem.  I think that I could make
that happen.  Maybe this same representation could be used for all
nbtree indexes, not just global nbtree indexes.

-- 
Peter Geoghegan

Somehow we ended up discussing this topic in a rather mistitled thread
... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
because gmail.)

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Well, "quickly" might mean within a week.  If it takes that long to
> > fully remove a monthly partition to make that partition ID available to
> > some future month's partition, that seems acceptable.  Blocking
> > DROP/DETACH for one hour is certainly not acceptable.
>
> I agree that synchronous clean-up of global indexes wouldn't make
> sense there, and might not be very compelling in practice.
>
> It occurs to me that we could add a code path to nbtree page splits,
> that considered removing dropped partition tuples to avert a page
> split.  This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> Technically the space used by index tuples that point to a dropped
> partition wouldn't become reclaimable immediately, but it might not
> matter with this optimization.

This seems useful on the surface: you drop a partition, and slowly and
incrementally any index items that point to it are removed by processes
scanning the index.  You can't rely solely on this, though: as pointed
out by Robert in the indirect index thread, doing this only means that
non-scanned parts of the index retain entries for arbitrarily long,
which is bad.  Also, this adds latency to client-connected processes.

Because you can't rely on that exclusively, and you want to reuse the
partition ID eventually, you still need a cleanup process that removes
those remaining index entries.  This cleanup process is a background
process, so it doesn't affect latency.  I think it's not a good idea to
add latency to clients in order to optimize a background process.

> > If this scheme means that you can keep the partition identifiers stored
> > in the index to, for instance, 10 bits (allowing for 1024 partitions to
> > exist at any one time, including those in the process of being cleaned
> > up) instead of having to expand to (say) 24 because that covers a couple
> > of years of operation before having to recreate the index, it seems
> > worthwhile.
>
> I think that we should have no inherent limit on the number of
> partitions available at once, on general principle.  Limiting the
> number of partitions is a design that probably has a lot of sharp
> edges.

Yeah, I misspoke.  The way I see this working is this: when a new
partition is created/attached, we scan the set of partitions for that
partitioned table to determine the lowest unused one; use that as
partition ID for the new partition.  Index entries for that partition
will use the smallest possible representation for that partition ID.
When a partition is dropped, a vestigial catalog entry for it remains,
until all global index entries for it have been removed.  This prevents
reuse of the partition ID until it no longer causes harm.

This way, when a partition is dropped, we have to take the time to scan
all global indexes; when they've been scanned we can remove the catalog
entry, and at that point the partition ID becomes available to future
partitions.

> The nbtree heap TID column and partition number column should probably
> be a single varwidth column (not two separate columns), that is often
> no wider than 6 bytes, but can be wider when there are many partitions
> and/or very large partitions.  That will be challenging, but it seems
> like the right place to solve the problem.  I think that I could make
> that happen.  Maybe this same representation could be used for all
> nbtree indexes, not just global nbtree indexes.

Maybe local nbtree indexes would have a partition ID of length 0, since
that is how many bits are necessary to identify which table is pointed
to by each index item.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Somehow we ended up discussing this topic in a rather mistitled thread
> ... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
> because gmail.)

You can blame me for that, I think.

> > It occurs to me that we could add a code path to nbtree page splits,
> > that considered removing dropped partition tuples to avert a page
> > split.  This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> > Technically the space used by index tuples that point to a dropped
> > partition wouldn't become reclaimable immediately, but it might not
> > matter with this optimization.
>
> This seems useful on the surface: you drop a partition, and slowly and
> incrementally any index items that point to it are removed by processes
> scanning the index.  You can't rely solely on this, though: as pointed
> out by Robert in the indirect index thread, doing this only means that
> non-scanned parts of the index retain entries for arbitrarily long,
> which is bad.  Also, this adds latency to client-connected processes.

Well, we don't have to rely on index scans to set the LP_DEAD bit in
this case.  We probably wouldn't do that at all.  Rather, we'd have the
page split code refer to a list of dropped partition numbers as targets
for killing immediately.  Maybe we'd hint the number of distinct
partitions represented on the page, to make it a bit faster.

> Because you can't rely on that exclusively, and you want to reuse the
> partition ID eventually, you still need a cleanup process that removes
> those remaining index entries.  This cleanup process is a background
> process, so it doesn't affect latency.  I think it's not a good idea to
> add latency to clients in order to optimize a background process.

Ordinarily I would agree, but we're talking about something that takes
place at the point that we're just about to split the page, that will
probably make the page split unnecessary when we can reclaim as few as
one or two tuples.  A page split is already a very expensive thing by
any measure, and something that can rarely be "undone", so avoiding
them entirely is very compelling.  Delaying a split will often prevent
it altogether.

We're already doing foreground processing, just by having page splits
at all.  Other DB systems that don't do much foreground processing will
still do a certain amount of it if that avoids a split in some cases --
"Modern B-Tree Techniques" mentions this, and suggests quite a number
of ways that a split might be averted.

> This way, when a partition is dropped, we have to take the time to scan
> all global indexes; when they've been scanned we can remove the catalog
> entry, and at that point the partition ID becomes available to future
> partitions.

It seems worth recycling partition IDs, but it should be possible to
delay that for a very long time if necessary.  Ideally, users wouldn't
have to bother with it when they have really huge global indexes.

> > The nbtree heap TID column and partition number column should probably
> > be a single varwidth column (not two separate columns), that is often
> > no wider than 6 bytes, but can be wider when there are many partitions
> > and/or very large partitions.  That will be challenging, but it seems
> > like the right place to solve the problem.  I think that I could make
> > that happen.  Maybe this same representation could be used for all
> > nbtree indexes, not just global nbtree indexes.
>
> Maybe local nbtree indexes would have a partition ID of length 0, since
> that is how many bits are necessary to identify which table is pointed
> to by each index item.

Right -- special cases are best avoided here.

In general, we'd push as much of the new complexity as we can into this
new TID-like table identifier, while requiring it to work with our
existing requirements for TIDs, plus certain new requirements for
global indexes (and maybe other new requirements, such as relations
that are larger than 35GB).  If the complexity is well-encapsulated,
then it probably won't be too bad.  Access methods would have to be
okay with varwidth table identifiers, which is a big change, but they
at least shouldn't have to worry about anything else breaking.  They'd
probably have a pg_attribute entry for the varwidth table identifier
column, too (it would be the last column in every nbtree index).

We'd expect a space-efficient representation with real-world relations,
that at least matches what we get with heap TIDs today.  This isn't
quite as hard as it sounds.  You don't have to be Claude Shannon to
realize that it's kind of silly to reserve 16 bits for the offset
number component of a TID/ItemPointer.  We need to continue to support
offset numbers that go that high, but the implementation would optimize
for the common case where offset numbers are less than 512 (or maybe
less than 1024).

-- 
Peter Geoghegan

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Because you can't rely on that exclusively, and you want to reuse the
> > partition ID eventually, you still need a cleanup process that removes
> > those remaining index entries.  This cleanup process is a background
> > process, so it doesn't affect latency.  I think it's not a good idea to
> > add latency to clients in order to optimize a background process.
>
> Ordinarily I would agree, but we're talking about something that takes
> place at the point that we're just about to split the page, that will
> probably make the page split unnecessary when we can reclaim as few as
> one or two tuples.  A page split is already a very expensive thing by
> any measure, and something that can rarely be "undone", so avoiding
> them entirely is very compelling.

Sorry, I confused your argumentation with mine.  I agree that removing
entries to try and prevent a page split is worth doing.

> > This way, when a partition is dropped, we have to take the time to scan
> > all global indexes; when they've been scanned we can remove the catalog
> > entry, and at that point the partition ID becomes available to future
> > partitions.
>
> It seems worth recycling partition IDs, but it should be possible to
> delay that for a very long time if necessary.  Ideally, users wouldn't
> have to bother with it when they have really huge global indexes.

I envision this happening automatically -- you drop the partition, a
persistent work item is registered, autovacuum takes care of it
whenever.  The user doesn't have to do anything about it.

> You don't have to be Claude Shannon to realize that it's kind of silly
> to reserve 16 bits for the offset number component of a
> TID/ItemPointer.  We need to continue to support offset numbers that go
> that high, but the implementation would optimize for the common case
> where offset numbers are less than 512 (or maybe less than 1024).

(In many actual cases offset numbers require less than 7 bits in
typical pages, even).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Fri, Jun 7, 2019 at 2:35 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I envision this happening automatically -- you drop the partition, a
> persistent work item is registered, autovacuum takes care of it
> whenever.  The user doesn't have to do anything about it.

We don't have to agree on anything now, but I think that it's possible
that the page split thing will be very effective.  Perhaps even so
effective that it won't make much sense to vacuum global indexes just
because there is a pending dropped partition.

-- 
Peter Geoghegan