Re: Declarative partitioning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Declarative partitioning
Date
Msg-id 55E9317C.60705@lab.ntt.co.jp
Whole thread Raw
In response to Re: Declarative partitioning  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Declarative partitioning  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Sorry about the long delay in replying, to this message or the others
posted in the last few days. I should have notified in advance of my
vacation with rather limited Internet access.

On 2015-08-26 PM 11:00, Simon Riggs wrote:
> On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> 
>> 3. Multi-level partitioning
>>
>> CREATE TABLE table_name
>> PARTITION OF partitioned_table_name
>> FOR VALUES values_spec
>> PARTITION BY {RANGE|LIST} ON (columns_list)
>>
>> This variant implements a form of so called composite or sub-partitioning
>> with arbitrarily deep partitioning structure. A table created using this
>> form has both the relkind RELKIND_PARTITIONED_REL and
>> pg_class.relispartition set to true.
>>
> 
> Multi-level partitioning is probably going to complicate things beyond
> sanity.
> 
> One RELKIND_PARTITIONED_REL with lots of partitions sounds best to me. We
> can still have N dimensions of partitioning (or partitioning and
> subpartitioning, if you prefer that term)
> 

Implementation in the patch lets RELKIND_PARTITIONED_REL under another and
so on to which appears to give a capability to have arbitrarily deep
partitioning structure. That might feel unnecessary. Initial motivation to
go for such design was to not have to complicate the partition key catalog
with details of how to accommodate some notion of sub-partition template.
ISTM, most usecases (as covered by other databases) would not require to
use more than 2 levels. Maybe, we should consider catering to only that
set of usecases using some sub-partition template based design (and then
some specialized syntax).

Or, we could just not handle sub-partitioning initially, by which I mean
composite partitioning where you can have schemes such as range-list,
list-range, range-hash, etc. We do still have multi-column
range partitioning.

In any case, we have to have a design that will cater to the multi-level
partitioning.

> 
>> The patch does not yet implement any planner changes for partitioned
>> tables, although I'm working on the same and post updates as soon as
>> possible. That means, it is not possible to run SELECT/UPDATE/DELETE
>> queries on partitioned tables without getting:
>>
>> postgres=# SELECT * FROM persons;
>> ERROR: could not open file "base/13244/106975": No such file or directory
>>
>> Given that there would be more direct ways of performing partition pruning
>> decisions with the proposed, it would be nice to utilize them.
>> Specifically, I would like to avoid having to rely on constraint exclusion
>> for partition pruning whereby subquery_planner() builds append_rel_list
>> and the later steps exclude useless partitions.
>>
> 
> This is really the heart of this patch/design. You can work for months on
> all the rest of this, but you will live or die by how the optimization
> works because that is the thing we really need to work well. Previous
> attempts ignored this aspect and didn't get committed. It's hard, perhaps
> even scary, but its critical. It's the 80/20 rule in reverse - 20% of the
> code is 80% of the difficulty.
> 
> I suggest you write a partition query test script .sql and work towards
> making this work. Not exhaustive and weird tests, but 5-10 key queries that
> need to be optimized precisely and quickly. I'm sure that's been done
> before.
> 

Yes, I am working on this and hope to have something to show soon.

> 
> I couldn't see why you invented a new form of Alter Table recursion.
> 

It was intended to keep the ALTER TABLE considerations for inherited
tables (and typed tables) separate from those for partitioned tables. But...

This begs a larger question that I did not try to answer in this
design/patch - for partitions, do we need to have any catalog entries
other than the pg_class tuple? If we manage to not require them, we would
not need any AT recursion business at all except for cases that require
pg_class tuple updates. The pg_class tuple would be the only
authoritative catalog entry for partitions. Everything else belongs with
the master table. That includes catalog entries for attributes,
constraints, triggers, etc. Alvaro had mentioned something like this in
his proposal. Although, I hope that such radical design is not very
difficult to realize/code.

Also, we might have to go back to the slightly controversial question of
whether partitions share the same namespace as normal tables (partitioned
or not). In the patch, partitions are created using CREATE TABLE, altered
with ALTER TABLE (albeit with several restrictions). How about rather
partitions are created/altered using:

ALTER TABLE master CREATE PARTITION <name> ...,

ALTER TABLE master MODIFY PARTITION <name> ... (as mentioned above)

AT commands covered by the latter should only ever require updating the
pg_class tuple for the named partition.

ALTER TABLE <name> directly on a partition can be made to say something
like the following under this scheme:
 ERROR: <name> is not a table

Although, I'm not thinking of hiding partitions from everyone. For
example, maintenance commands like VACUUM/ANALYZE (including autovacuum,
of course) would be able to see them.

> We will need to support multi-row batched COPY.

Currently, tuple-routing may switch the ResultRelInfo for every
consecutive tuple. So, at once I abandoned any hope of adapting the
routing mechanism to heap_multi_insert() API. But, perhaps there is a
smarter way. Let me think about that.

> 
> I'm pleased to see this patch and will stay with it to completion, perhaps
> others also. We have 3 more CFs in this release, Nov, Jan, Mar - so this
> has a great chance of making it into 9.6. The current patch implements a
> bunch of stuff, but its hard to say what, how or why it does it and without
> the planner stuff its all moot. My recommendation is we say "Returned with
> Feedback" on this now, looking forward to next patch.
> 
> If you submit another patch before Nov, I will review it without waiting
> for Nov 1.
> 
> There will be much discussion on syntax, but that is not the key point. DDL
> Support routines are usually pretty straightforward too, so that can be
> left for now.
> 

Sure, let me address number of points you have raised. In the upcoming
versions, I will try to provide extensive documentation (both internal and
user). As for the syntax and DDL changes you think can be dealt with
later, I tend to think let's keep the bare essentials necessary.

Thanks,
Amit




pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Allow replication roles to use file access functions
Next
From: Masahiko Sawada
Date:
Subject: Re: Freeze avoidance of very large table.