Re: Table partition with primary key in 11.3 - Mailing list pgsql-general

From Alvaro Herrera
Subject Re: Table partition with primary key in 11.3
Date
Msg-id 20190607202158.GA24592@alvherre.pgsql
Whole thread Raw
In response to Re: Table partition with primary key in 11.3  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Table partition with primary key in 11.3
List pgsql-general
Somehow we ended up discussing this topic in a rather mistitled thread
... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
because gmail).

On 2019-Jun-07, Peter Geoghegan wrote:

> On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Well, "quickly" might mean within a week.  If it takes that long to
> > fully remove a monthly partition to make that partition ID available to
> > some future month's partition, that seems acceptable.  Blocking
> > DROP/DETACH for one hour is certainly not acceptable.
> 
> I agree that synchronous clean-up of global indexes wouldn't make
> sense there, and might not be very compelling in practice.
> 
> It occurs to me that we could add a code path to nbtree page splits,
> that considered removing dropped partition tuples to avert a page
> split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> Technically the space used by index tuples that point to a dropped
> partitions wouldn't become reclaimable immediately, but it might not
> matter with this optimization.

This seems useful on the surface: you drop a partition, and slowly and
incrementally any index items that point to it are removed by processes
scanning the index.  You can't rely solely on this, though: as pointed
out by Robert in the indirect index thread, doing this only means that
non-scanned parts of the index to retain entries for arbitrary long,
which is bad.  Also, this adds latency to client-connected processes.

Because you can't rely on that exclusively, and you want to reuse the
partition ID eventually, you still need a cleanup process that removes
those remaining index entries.  This cleanup process is a background
process, so it doesn't affect latency.  I think it's not a good idea to
add latency to clients in order to optimize a background process.

> > If this scheme means that you can keep the partition identifiers stored
> > in the index to, for instance, 10 bits (allowing for 1024 partitions to
> > exist at any one time, including those in the process of being cleaned
> > up) instead of having to expand to (say) 24 because that covers a couple
> > of years of operation before having to recreate the index, it seems
> > worthwhile.
> 
> I think that we should have no inherent limit on the number of
> partitions available at once, on general principle. Limiting the
> number of partitions is a design that probably has a lot of sharp
> edges.

Yeah, I misspoke.  The way I see this working is this: when a new
partition is created/attached, we scan the set of partitions for that
partitioned table to determine the lowest unused one; use that as
partition ID for the new partition.  Index entries for that partition
will use the smallest possible representation for that partition ID.

When a partition is dropped, a vestigial catalog entry for it remains,
until all global index entries for it have been removed.  This prevents
reuse of the partition ID until it no longer causes harm.

This way, when a partition is dropped, we have to take the time to scan
all global indexes; when they've been scanned we can remove the catalog
entry, and at that point the partition ID becomes available to future
partitions.

> The nbtree heap TID column and partition number column should probably
> be a single varwidth column (not two separate columns), that is often
> no wider than 6 bytes, but can be wider when there are many partitions
> and/or very large partitions. That will be challenging, but it seems
> like the right place to solve the problem. I think that I could make
> that happen. Maybe this same representation could be used for all
> nbtree indexes, not just global nbtree indexes.

Maybe local nbtree indexes would have a partition ID of length 0, since
that many bits are necessary to identify which table is pointed to by
each index item.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-general by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Table partition with primary key in 11.3
Next
From: Peter Geoghegan
Date:
Subject: Re: Table partition with primary key in 11.3