Thread: Table partition with primary key in 11.3

Table partition with primary key in 11.3

From: User
CREATE TABLE public.test1 (
     x1 integer NOT NULL,
     x2 integer NOT NULL,
     CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
) PARTITION BY RANGE (x2);

This query works in 11.1 but fails in 11.3 with these messages:

ERROR: insufficient columns in PRIMARY KEY constraint definition
DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which 
is part of the partition key.
SQL state: 0A000





Re: Table partition with primary key in 11.3

From: Tom Lane
User <ya@penek.org> writes:
> CREATE TABLE public.test1 (
>      x1 integer NOT NULL,
>      x2 integer NOT NULL,
>      CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
> ) PARTITION BY RANGE (x2);

> This query works in 11.1 but fails in 11.3 with these messages:

> ERROR: insufficient columns in PRIMARY KEY constraint definition
> DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which 
> is part of the partition key.
> SQL state: 0A000

Indeed, that primary key is no good.  It was a bug that 11.1
allowed it, which was fixed here:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master [0ad41cf53] 2019-01-14 19:28:10 -0300
Branch: REL_11_STABLE Release: REL_11_2 [74aa7e046] 2019-01-14 19:25:19 -0300

    Fix unique INCLUDE indexes on partitioned tables
    
    We were considering the INCLUDE columns as part of the key, allowing
    unicity-violating rows to be inserted in different partitions.
    
    Concurrent development conflict in eb7ed3f30634 and 8224de4f42cc.
    
    Reported-by: Justin Pryzby
    Discussion: https://postgr.es/m/20190109065109.GA4285@telsasoft.com
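
The fix on the user side is to make the partition key column part of
the key proper rather than an INCLUDE column.  A sketch of what 11.2
and later accept (same table, widened key):

CREATE TABLE public.test1 (
     x1 integer NOT NULL,
     x2 integer NOT NULL,
     CONSTRAINT test1_pkey PRIMARY KEY (x1, x2)
) PARTITION BY RANGE (x2);

Note that this enforces uniqueness of (x1, x2), not of x1 alone; the
INCLUDE form was unsound precisely because x2 carried no uniqueness
semantics, so two different partitions could each have accepted a row
with the same x1.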

            regards, tom lane



Re: Table partition with primary key in 11.3

From: "Alex V."
>> CREATE TABLE public.test1 (
>>      x1 integer NOT NULL,
>>      x2 integer NOT NULL,
>>      CONSTRAINT test1_pkey PRIMARY KEY (x1) INCLUDE(x2)
>> ) PARTITION BY RANGE (x2);

>> This query works in 11.1 but fails in 11.3 with these messages:

>> ERROR: insufficient columns in PRIMARY KEY constraint definition
>> DETAIL: PRIMARY KEY constraint on table "test1" lacks column "x2" which 
>> is part of the partition key.
>> SQL state: 0A000

> Indeed, that primary key is no good.  It was a bug that 11.1
> allowed it, which was fixed here:

> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Branch: master [0ad41cf53] 2019-01-14 19:28:10 -0300
> Branch: REL_11_STABLE Release: REL_11_2 [74aa7e046] 2019-01-14 19:25:19 -0300

>     Fix unique INCLUDE indexes on partitioned tables
>
>     We were considering the INCLUDE columns as part of the key, allowing
>     unicity-violating rows to be inserted in different partitions.
>
>     Concurrent development conflict in eb7ed3f30634 and 8224de4f42cc.
>
>     Reported-by: Justin Pryzby
>     Discussion: https://postgr.es/m/20190109065109.GA4285@telsasoft.com

>             regards, tom lane

I think your position on primary keys in partitioned tables is not right.

In a regular table, a single-column primary key is unique across the
whole table.  From the user's point of view, a partitioned table MUST
offer the same guarantee, a unique single-column primary key, because
that is what users require; demanding extra key columns breaks the
equivalence between regular and partitioned tables, and the
functionality of partitioning becomes useless.
For table administrators, a single-column primary key enforced across
all partitions is not a MAY but a MUST.
All other implementations are holes plugged in a hurry, and illogical
when you look at the engine as a whole.



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
On 2019-Jun-06, Alex V. wrote:

> I think your position on primary keys in partitioned tables is not
> right.
> 
> In a regular table, a single-column primary key is unique across the
> whole table.  From the user's point of view, a partitioned table MUST
> offer the same guarantee, a unique single-column primary key, because
> that is what users require; demanding extra key columns breaks the
> equivalence between regular and partitioned tables, and the
> functionality of partitioning becomes useless.
> For table administrators, a single-column primary key enforced across
> all partitions is not a MAY but a MUST.
> All other implementations are holes plugged in a hurry, and illogical
> when you look at the engine as a whole.

If you are saying that you think that Postgres should support primary
keys that don't necessarily overlap partition keys, then I agree with
you.  Please send a patch to implement that capability.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Simon Riggs
On Thu, 6 Jun 2019 at 18:03, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Jun-06, Alex V. wrote:
>
> > I think your position on primary keys in partitioned tables is not
> > right.
> >
> > In a regular table, a single-column primary key is unique across the
> > whole table.  From the user's point of view, a partitioned table MUST
> > offer the same guarantee, a unique single-column primary key, because
> > that is what users require; demanding extra key columns breaks the
> > equivalence between regular and partitioned tables, and the
> > functionality of partitioning becomes useless.
> > For table administrators, a single-column primary key enforced across
> > all partitions is not a MAY but a MUST.
> > All other implementations are holes plugged in a hurry, and illogical
> > when you look at the engine as a whole.
>
> If you are saying that you think that Postgres should support primary
> keys that don't necessarily overlap partition keys, then I agree with
> you.  Please send a patch to implement that capability.

The point is that unique indexes that overlap partition keys are rather cheap because they can be made from a set of local indexes.

A unique index that doesn't overlap partition keys would be
* unfeasibly huge
* likely to hit the 32TB limit on relations
* unable to cope with dropping partitions

It sounds like you want it because it's just a line of DDL, but in
practice your colleagues wouldn't want those things.
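
For example (a sketch with invented names), a primary key that
contains the partition key is enforced entirely with per-partition
local indexes:

CREATE TABLE measurement (
    id  bigint NOT NULL,
    day date   NOT NULL,
    CONSTRAINT measurement_pkey PRIMARY KEY (id, day)
) PARTITION BY RANGE (day);

CREATE TABLE measurement_2019_06 PARTITION OF measurement
    FOR VALUES FROM ('2019-06-01') TO ('2019-07-01');

Each partition gets its own small btree; uniqueness within each
partition plus non-overlapping partition bounds adds up to uniqueness
across the whole table.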

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Thu, Jun 6, 2019 at 10:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> If you are saying that you think that Postgres should support primary
> keys that don't necessarily overlap partition keys, then I agree with
> you.  Please send a patch to implement that capability.

FWIW, I could probably be convinced to work on the nbtree parts of
global indexes if there was a reasonably worked out design that had
some momentum behind it. I would at least need to know what the
partition identifiers will need to look like in indexes. Andres has
suggested that I work on variable-width table identifiers in nbtree
for the benefit of pluggable storage engines, but I don't have much
enthusiasm for the idea of doing that without delivering a clear
benefit to users in the same release.

It seems pretty obvious to me that the right general approach for
global indexes is to make the partition identifier a tiebreaker column
that comes right before the heap TID tiebreaker column (which would be
per-partition). It probably needs to be variable-width, so that the
overhead isn't too bad most of the time. I don't think that it's
necessary to remove the restriction on the size of relations for
global indexes, though maybe global indexes could focus our attention
on that problem.
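
As a purely illustrative sketch of that ordering (invented names,
plain integers standing in for the TID components):

WITH entries (user_key, partnum, blk, off) AS (
    VALUES (1, 1, 0, 2),
           (1, 1, 7, 1),
           (1, 2, 0, 3),
           (2, 1, 4, 9)
)
SELECT * FROM entries ORDER BY user_key, partnum, blk, off;

Equal user keys group by partition number and then by the
per-partition heap TID, so all entries for any one partition are
contiguous within each distinct key value.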

-- 
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: David Rowley
On Fri, 7 Jun 2019 at 08:48, Peter Geoghegan <pg@bowt.ie> wrote:
> Andres has
> suggested that I work on variable-width table identifiers in nbtree
> for the benefit of pluggable storage engines, but I don't have much
> enthusiasm for the idea of doing that without delivering a clear
> benefit to users in the same release.

You may already be aware, but another use case for such variable-width
identifiers is indirect indexes, as discussed in [1].

[1] https://www.postgresql.org/message-id/20161018182843.xczrxsa2yd47pnru%40alvherre.pgsql

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Thu, Jun 6, 2019 at 3:00 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> You may already be aware, but another use case for such variable-width
> identifiers was with indirect indexes as discussed in [1]

Right. I went with global indexes because indirect indexes are
probably a lot more risky as a project. I'd be particularly concerned
about the complexity of VACUUM there, whereas that doesn't seem all
that bad in the case of global indexes.

-- 
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
On 2019-Jun-07, Peter Geoghegan wrote:

> On Thu, Jun 6, 2019 at 3:00 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > You may already be aware, but another use case for such variable-width
> > identifiers was with indirect indexes as discussed in [1]
> 
> Right. I went with global indexes because indirect indexes are
> probably a lot more risky as a project. I'd be particularly concerned
> about the complexity of VACUUM there, whereas that doesn't seem all
> that bad in the case of global indexes.

I think vacuuming for global indexes is somewhat challenging as well :-)
Maybe not as much as for indirect indexes, that's true.

In order for it to be sustainable, I think you'll want to reuse
partition identifiers when the partitions are dropped/detached, which
means that you need a way to ensure that index entries to those
partitions are removed from all indexes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think vacuuming for global indexes is somewhat challenging as well :-)
> Maybe not as much as for indirect indexes, that's true.
>
> In order for it to be sustainable, I think you'll want to reuse
> partition identifiers when the partitions are dropped/detached, which
> means that you need a way to ensure that index entries to those
> partitions are removed from all indexes.

I'm not so sure about that. I see your point, but I think that you can
also make the opposite argument. That is, you can make a good case for
asynchronously cleaning up the dead entries that point to a dropped
partition (probably within VACUUM). Perhaps we should offer *both* as
options.

--
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I think vacuuming for global indexes is somewhat challenging as well :-)
> > Maybe not as much as for indirect indexes, that's true.
> >
> > In order for it to be sustainable, I think you'll want to reuse
> > partition identifiers when the partitions are dropped/detached, which
> > means that you need a way to ensure that index entries to those
> > partitions are removed from all indexes.
> 
> I'm not so sure about that. I see your point, but I think that you can
> also make the opposite argument. That is, you can make a good case for
> asynchronously cleaning up the dead entries that point to a dropped
> partition (probably within VACUUM). Perhaps we should offer *both* as
> options.

I was thinking of asynchronously cleaning it up rather than blocking
DROP/DETACH ... which means you need to keep state somewhere.  I don't
think blocking DROP/DETACH is valuable -- a global index that blocks
DROP/DETACH until the index is clean serves no useful purpose.  (You
could think of a multi-step approach with internal transaction commits,
similar to CIC, but you still need a plan to clean that up in case the
server crashes during that operation.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Ravi Krishna
>
> I was thinking of asynchronously cleaning it up rather than blocking
> DROP/DETACH ... which means you need to keep state somewhere.  I don't
> think blocking DROP/DETACH is valuable -- a global index that blocks
> DROP/DETACH until the index is clean serves no useful purpose.  (You
> could think of a multi-step approach with internal transaction commits,
> similar to CIC, but you still need a plan to clean that up in case the
> server crashes during that operation.)

In Oracle, if you have a global unique index and a partition is
dropped, the index is marked invalid and needs to be rebuilt.
IOW, an outage.

DB2's approach is better.  When the partition is dropped, the index
entries are marked for deletion and an async process of cleaning them
up is started, which can run for several days if the dropped partition
is large.  But at least the table stays online.




Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I was thinking of asynchronously cleaning it up rather than blocking
> DROP/DETACH ... which means you need to keep state somewhere.  I don't
> think blocking DROP/DETACH is valuable -- a global index that blocks
> DROP/DETACH until the index is clean serves no useful purpose.  (You
> could think of a multi-step approach with internal transaction commits,
> similar to CIC, but you still need a plan to clean that up in case the
> server crashes during that operation.)

The advantages of synchronous clean-up of global indexes when DROP'ing
a partition are that you can recycle the partition number (or whatever
we end up calling it) immediately and predictably, and you can reuse
the space in indexes occupied by keys from the dropped partition
immediately and predictably. That seems less useful than asynchronous
processing on average, certainly, but those are still real advantages.
You seemed to be particularly concerned about quickly recycling
partition numbers when we drop a partition.

I hope that we can come up with a very efficient on-disk
representation for global index tuples, where only the bare minimum
amount of space is used for partition numbers. Maybe it won't matter
that much if partition numbers cannot be recycled due to this
asynchronous processing.

-- 
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I was thinking of asynchronously cleaning it up rather than blocking
> > DROP/DETACH ... which means you need to keep state somewhere.  I don't
> > think blocking DROP/DETACH is valuable -- a global index that blocks
> > DROP/DETACH until the index is clean serves no useful purpose.  (You
> > could think of a multi-step approach with internal transaction commits,
> > similar to CIC, but you still need a plan to clean that up in case the
> > server crashes during that operation.)
> 
> The advantages of synchronous clean-up of global indexes when DROP'ing
> a partition are that you can recycle the partition number (or whatever
> we end up calling it) immediately and predictably, and you can reuse
> the space in indexes occupied by keys from the dropped partition
> immediately and predictably. That seems less useful than asynchronous
> processing on average, certainly, but those are still real advantages.
> You seemed to be particularly concerned about quickly recycling
> partition numbers when we drop a partition.

Well, "quickly" might mean within a week.  If it takes that long to
fully remove a monthly partition to make that partition ID available to
some future month's partition, that seems acceptable.  Blocking
DROP/DETACH for one hour is certainly not acceptable.

If this scheme means that you can keep the partition identifiers stored
in the index to, for instance, 10 bits (allowing for 1024 partitions to
exist at any one time, including those in the process of being cleaned
up) instead of having to expand to (say) 24 because that covers a couple
of years of operation before having to recreate the index, it seems
worthwhile.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Well, "quickly" might mean within a week.  If it takes that long to
> fully remove a monthly partition to make that partition ID available to
> some future month's partition, that seems acceptable.  Blocking
> DROP/DETACH for one hour is certainly not acceptable.

I agree that synchronous clean-up of global indexes wouldn't make
sense there, and might not be very compelling in practice.

It occurs to me that we could add a code path to nbtree page splits
that considers removing dropped-partition tuples to avert a page
split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
Technically the space used by index tuples that point to a dropped
partition wouldn't become reclaimable immediately, but it might not
matter with this optimization.

> If this scheme means that you can keep the partition identifiers stored
> in the index to, for instance, 10 bits (allowing for 1024 partitions to
> exist at any one time, including those in the process of being cleaned
> up) instead of having to expand to (say) 24 because that covers a couple
> of years of operation before having to recreate the index, it seems
> worthwhile.

I think that we should have no inherent limit on the number of
partitions available at once, on general principle. Limiting the
number of partitions is a design that probably has a lot of sharp
edges.

The nbtree heap TID column and partition number column should probably
be a single varwidth column (not two separate columns), that is often
no wider than 6 bytes, but can be wider when there are many partitions
and/or very large partitions. That will be challenging, but it seems
like the right place to solve the problem. I think that I could make
that happen. Maybe this same representation could be used for all
nbtree indexes, not just global nbtree indexes.

-- 
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
Somehow we ended up discussing this topic in a rather mistitled thread
... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
because gmail).

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Well, "quickly" might mean within a week.  If it takes that long to
> > fully remove a monthly partition to make that partition ID available to
> > some future month's partition, that seems acceptable.  Blocking
> > DROP/DETACH for one hour is certainly not acceptable.
> 
> I agree that synchronous clean-up of global indexes wouldn't make
> sense there, and might not be very compelling in practice.
> 
> It occurs to me that we could add a code path to nbtree page splits
> that considers removing dropped-partition tuples to avert a page
> split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> Technically the space used by index tuples that point to a dropped
> partition wouldn't become reclaimable immediately, but it might not
> matter with this optimization.

This seems useful on the surface: you drop a partition, and slowly and
incrementally any index items that point to it are removed by processes
scanning the index.  You can't rely solely on this, though: as pointed
out by Robert in the indirect index thread, doing this only means that
non-scanned parts of the index retain entries arbitrarily long,
which is bad.  Also, this adds latency to client-connected processes.

Because you can't rely on that exclusively, and you want to reuse the
partition ID eventually, you still need a cleanup process that removes
those remaining index entries.  This cleanup process is a background
process, so it doesn't affect latency.  I think it's not a good idea to
add latency to clients in order to optimize a background process.

> > If this scheme means that you can keep the partition identifiers stored
> > in the index to, for instance, 10 bits (allowing for 1024 partitions to
> > exist at any one time, including those in the process of being cleaned
> > up) instead of having to expand to (say) 24 because that covers a couple
> > of years of operation before having to recreate the index, it seems
> > worthwhile.
> 
> I think that we should have no inherent limit on the number of
> partitions available at once, on general principle. Limiting the
> number of partitions is a design that probably has a lot of sharp
> edges.

Yeah, I misspoke.  The way I see this working is this: when a new
partition is created/attached, we scan the set of partitions for that
partitioned table to determine the lowest unused one; use that as
partition ID for the new partition.  Index entries for that partition
will use the smallest possible representation for that partition ID.
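
A purely hypothetical sketch of that lookup, assuming an invented
catalog pg_partition_ids(partid int not null) that also lists
dropped-but-not-yet-cleaned partitions:

SELECT min(candidate) AS next_partid
FROM generate_series(0,
        (SELECT coalesce(max(partid), -1) + 1 FROM pg_partition_ids)
     ) AS candidate
WHERE candidate NOT IN (SELECT partid FROM pg_partition_ids);

Scanning up to max(partid) + 1 guarantees at least one free candidate,
so this always returns the lowest unused ID.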

When a partition is dropped, a vestigial catalog entry for it remains,
until all global index entries for it have been removed.  This prevents
reuse of the partition ID until it no longer causes harm.

This way, when a partition is dropped, we have to take the time to scan
all global indexes; when they've been scanned we can remove the catalog
entry, and at that point the partition ID becomes available to future
partitions.

> The nbtree heap TID column and partition number column should probably
> be a single varwidth column (not two separate columns), that is often
> no wider than 6 bytes, but can be wider when there are many partitions
> and/or very large partitions. That will be challenging, but it seems
> like the right place to solve the problem. I think that I could make
> that happen. Maybe this same representation could be used for all
> nbtree indexes, not just global nbtree indexes.

Maybe local nbtree indexes would have a partition ID of length 0, since
that is how many bits are needed to identify which table each index
item points to.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Somehow we ended up discussing this topic in a rather mistitled thread
> ... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
> because gmail).

You can blame me for that, I think.

> > It occurs to me that we could add a code path to nbtree page splits
> > that considers removing dropped-partition tuples to avert a page
> > split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> > Technically the space used by index tuples that point to a dropped
> > partition wouldn't become reclaimable immediately, but it might not
> > matter with this optimization.
>
> This seems useful on the surface: you drop a partition, and slowly and
> incrementally any index items that point to it are removed by processes
> scanning the index.  You can't rely solely on this, though: as pointed
> out by Robert in the indirect index thread, doing this only means that
> non-scanned parts of the index retain entries arbitrarily long,
> which is bad.  Also, this adds latency to client-connected processes.

Well, we don't have to rely on index scans to set the LP_DEAD bit in
this case. We probably wouldn't do that at all. Rather, we'd have the
page split code refer to a list of dropped partition numbers as
targets for killing immediately. Maybe we'd hint the number of
distinct partitions represented on the page, to make it a bit faster.

> Because you can't rely on that exclusively, and you want to reuse the
> partition ID eventually, you still need a cleanup process that removes
> those remaining index entries.  This cleanup process is a background
> process, so it doesn't affect latency.  I think it's not a good idea to
> add latency to clients in order to optimize a background process.

Ordinarily I would agree, but we're talking about something that takes
place at the point that we're just about to split the page, that will
probably make the page split unnecessary when we can reclaim as few as
one or two tuples. A page split is already a very expensive thing by
any measure, and something that can rarely be "undone", so avoiding
one entirely is very compelling. Delaying a split will often prevent
it altogether. We're already doing foreground processing, just by
having page splits at all.

Other DB systems that don't do much foreground processing will still
do a certain amount of it if that avoids a split in some cases --
"Modern B-Tree techniques" mentions this, and suggests quite a number
of ways that a split might be averted.

> This way, when a partition is dropped, we have to take the time to scan
> all global indexes; when they've been scanned we can remove the catalog
> entry, and at that point the partition ID becomes available to future
> partitions.

It seems worth recycling partition IDs, but it should be possible to
delay that for a very long time if necessary. Ideally, users wouldn't
have to bother with it when they have really huge global indexes.

> > The nbtree heap TID column and partition number column should probably
> > be a single varwidth column (not two separate columns), that is often
> > no wider than 6 bytes, but can be wider when there are many partitions
> > and/or very large partitions. That will be challenging, but it seems
> > like the right place to solve the problem. I think that I could make
> > that happen. Maybe this same representation could be used for all
> > nbtree indexes, not just global nbtree indexes.
>
> Maybe local nbtree indexes would have a partition ID of length 0, since
> that many bits are necessary to identify which table is pointed to by
> each index item.

Right -- special cases are best avoided here. In general, we'd push as
much of the new complexity as we can into this new TID-like table
identifier, while requiring it to work with our existing requirements
for TIDs, plus certain new requirements for global indexes (and maybe
other new requirements, such as relations that are larger than 32TB).
If the complexity is well-encapsulated, then it probably won't be too
bad. Access methods would have to be okay with varwidth table
identifiers, which is a big change, but they at least shouldn't have
to worry about anything else breaking. They'd probably have a
pg_attribute entry for the varwidth table identifier column, too (it
would be the last column in every nbtree index).

We'd expect a space-efficient representation with real-world
relations, one that at least matches what we get with heap TIDs today.
This isn't quite as hard as it sounds. You don't have to be Claude
Shannon to realize that it's kind of silly to reserve 16 bits for the
offset number component of a TID/ItemPointer. We need to continue to
support offset numbers that go that high, but the implementation would
optimize for the common case where offset numbers are less than 512
(or maybe less than 1024).
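
The arithmetic behind that, restated as a throwaway query:

SELECT ceil(log(2, 512))   AS bits_for_offsets_under_512,  -- 9
       ceil(log(2, 65536)) AS bits_reserved_today;         -- 16

Nine or ten bits cover realistic offset numbers; the remaining six or
seven bits reserved today are almost always wasted.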

--
Peter Geoghegan



Re: Table partition with primary key in 11.3

From: Alvaro Herrera
On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > Because you can't rely on that exclusively, and you want to reuse the
> > partition ID eventually, you still need a cleanup process that removes
> > those remaining index entries.  This cleanup process is a background
> > process, so it doesn't affect latency.  I think it's not a good idea to
> > add latency to clients in order to optimize a background process.
> 
> Ordinarily I would agree, but we're talking about something that takes
> place at the point that we're just about to split the page, that will
> probably make the page split unnecessary when we can reclaim as few as
> one or two tuples. A page split is already a very expensive thing by
> any measure, and something that can rarely be "undone", so avoiding
> one entirely is very compelling.

Sorry, I confused your argumentation with mine.  I agree that removing
entries to try and prevent a page split is worth doing.

> > This way, when a partition is dropped, we have to take the time to scan
> > all global indexes; when they've been scanned we can remove the catalog
> > entry, and at that point the partition ID becomes available to future
> > partitions.
> 
> It seems worth recycling partition IDs, but it should be possible to
> delay that for a very long time if necessary. Ideally, users wouldn't
> have to bother with it when they have really huge global indexes.

I envision this happening automatically -- you drop the partition, a
persistent work item is registered, autovacuum takes care of it
whenever.  The user doesn't have to do anything about it.

> You don't have to be Claude Shannon to realize that it's kind of silly
> to reserve 16 bits for the offset number component of a
> TID/ItemPointer. We need to continue to support offset numbers that go
> that high, but the implementation would optimize for the common case
> where offset numbers are less than 512 (or maybe less than 1024).

(In many actual cases, offset numbers require fewer than 7 bits on
typical pages, even.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Table partition with primary key in 11.3

From: Peter Geoghegan
On Fri, Jun 7, 2019 at 2:35 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I envision this happening automatically -- you drop the partition, a
> persistent work item is registered, autovacuum takes care of it
> whenever.  The user doesn't have to do anything about it.

We don't have to agree on anything now, but I think that it's possible
that the page split thing will be very effective. Perhaps even so
effective that it won't make much sense to vacuum global indexes just
because there is a pending dropped partition.

--
Peter Geoghegan