Re: Partitioning option for COPY - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Partitioning option for COPY
Date
Msg-id 1258969003.27757.4259.camel@ebony
In response to Re: Partitioning option for COPY  (Emmanuel Cecchet <manu@asterdata.com>)
Responses Re: Partitioning option for COPY
Re: Partitioning option for COPY
List pgsql-hackers
On Wed, 2009-11-11 at 19:53 -0500, Emmanuel Cecchet wrote:
> Hi,
> >> I have extracted the partitioning option for COPY (removed the error 
> >> logging part) from the previous patch.
> >>     
> >
> > We can use an INSERT trigger to route tuples into partitions even now.
> > Why do you need an additional router for COPY? 
> Tom has already explained on the list why using a trigger was a bad idea 
> (and I know we can use a trigger since I am the one who wrote it).
> If you look at the code you will see that you can do optimizations in 
> the COPY code that you cannot do in the trigger.
> 
> >  Also, it would be nicer
> > that the router can work not only in COPY but also in INSERT.
> >   
> As 8.5 will at best provide a syntactic hack on top of the existing 
> constraint implementation, I think that it will not hurt to have routing 
> in COPY since we will not have it anywhere otherwise.
> > BTW, I'm working on partitioning metadata now. Your "partitioning"
> > option in COPY could be replaced with the catalog.
> >   
> This implementation is only for the current 8.5 and it will not be 
> needed anymore once we get fully functional partitioning in Postgres, 
> which seems to be slated for a future version.

Yes, the trigger way of doing this is a bad way. 

I regret to say that the way proposed here isn't much better, AFAICS.
Let me explain why I think that; in the meantime, -1 to anyone applying
this patch.

This patch proposes keeping a cache of last visited partitions to reduce
the overhead of data routing.

What I've requested is that partitioning work by using a data structure
held in the relcache for inheritance parents. That approach differs from
this patch in three ways:
a) it has a clearly defined location for the cached metadata, with
clearly identified and well worked out mechanisms for cache invalidation
b) the cache can be built once when it is first needed, not slowly grown
as parts of the metadata are used
c) it would be available for all parts of the server, not just COPY.

The easiest way to build that metadata is when structured partitioning
info is available, i.e. the best next action is to complete and commit
Itagaki's partitioning syntax patch. Then we can easily build the
partitioning metadata, which can then be used in COPY for data routing.

Anyway, I want data routing, as is the intention of this patch. I just
don't think this patch is a useful way to do it. It is too narrow in its
scope and potentially buggy in its approach to developing a cache and
using trigger-like stuff. 

ISTM that with the right metadata in the right place, a cleaner and
easier solution is still possible for 8.5. The code within COPY should
really just reduce to a small piece of code to derive the correct
relation for the desired row and then use that during heap_insert().

I have just discussed partitioning with Itagaki-san at JPUG, so I know
his plans. Itagaki-san and Manu, please can you work together to make
this work for 8.5? 

---

A more detailed explanation of Partitioning Metadata:

Partitioning Metadata is information held on the relcache for a table
that has child partitions. Currently, a table does not cache info about
its children, which prevents various optimisations.

We would have an extra pointer on the Relation struct that points to a
PartitioningMetadata struct. We can fill in this information when we
construct the relcache for a relation, or we can populate it on demand
the first time we attempt to use that information (if it exists).
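
To make that concrete, here is a sketch of what such a struct might hold
(all names are invented; a single integer partitioning key is assumed so
the sketch stands alone and is not a proposal for the actual relcache
definition):

/*
 * Illustrative sketch only: a possible shape for cached partitioning
 * metadata.  Field names are invented; a single integer partitioning
 * key is assumed to keep things simple.
 */
typedef unsigned int Oid;          /* stand-in so this sketch stands alone */

typedef struct PartitioningMetadata
{
    int     nparts;                /* number of child partitions */
    long   *upper_bounds;          /* sorted (exclusive) upper bound per partition */
    Oid    *part_oids;             /* OID of the child holding each range */
} PartitioningMetadata;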

We want to hold an array of partition boundary values. This will then
allow us to use bsearch to find the partition that a specific value
applies to. Thus it can be used for routing data from INSERTs or COPY,
and for identifying which partitions need to be included in or excluded
from an Append node. Lookups will be O(log N) rather than O(N), allowing
us to support a much larger number of partitions when required. Note
that it can also be used within the executor to perform dynamic
partition elimination, thus allowing us to easily implement
partition-aware joins etc.
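
As a standalone model of that lookup (not backend code; the names and
the single-key assumption are just for illustration), the routing search
is an ordinary binary search over the sorted upper bounds:

#include <stdio.h>

typedef unsigned int Oid;          /* stand-in so the example stands alone */

/*
 * Return the index of the first partition whose (exclusive) upper bound
 * exceeds the key, or -1 if the key lies beyond the last boundary.
 * boundaries[] must be sorted ascending.
 */
static int
find_partition(const long *boundaries, int nparts, long key)
{
    int     lo = 0;
    int     hi = nparts;           /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (key < boundaries[mid])
            hi = mid;              /* key falls at mid or earlier */
        else
            lo = mid + 1;          /* key falls after mid */
    }
    return (lo < nparts) ? lo : -1;
}

int
main(void)
{
    /* three ranges: below 100, [100,200), [200,300); OIDs are made up */
    long    boundaries[] = {100, 200, 300};
    Oid     part_oids[] = {16384, 16387, 16390};
    long    keys[] = {42, 100, 250, 999};
    int     i;

    for (i = 0; i < 4; i++)
    {
        int     idx = find_partition(boundaries, 3, keys[i]);

        if (idx >= 0)
            printf("key %ld -> partition %u\n", keys[i], part_oids[idx]);
        else
            printf("key %ld -> no partition\n", keys[i]);
    }
    return 0;
}

An Append node could run the same search twice, once for each end of a
scan range, to get the contiguous slice of partitions it needs.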

To construct the array we must sort the partition boundary values and
prove that the partition definitions do not overlap. That is much easier
to do when the partitions are explicitly defined. (Plus, there is no
requirement to have, or mechanism to specify, unique partitions
currently, although most users assume this in their usage).
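
The overlap proof is cheap once the boundaries are explicit. A
standalone sketch (invented names again): sort the ranges by lower
bound, then verify that no partition begins before the previous one
ends:

#include <stdio.h>
#include <stdlib.h>

/* One range partition, covering [lower, upper).  Names are illustrative. */
typedef struct PartRange
{
    long    lower;
    long    upper;
} PartRange;

static int
cmp_by_lower(const void *a, const void *b)
{
    const PartRange *pa = (const PartRange *) a;
    const PartRange *pb = (const PartRange *) b;

    if (pa->lower < pb->lower)
        return -1;
    if (pa->lower > pb->lower)
        return 1;
    return 0;
}

/* Sorts the ranges in place; returns 1 if no two ranges overlap, else 0. */
static int
ranges_are_disjoint(PartRange *parts, int nparts)
{
    int     i;

    qsort(parts, nparts, sizeof(PartRange), cmp_by_lower);
    for (i = 1; i < nparts; i++)
    {
        /* a partition may not begin before the previous one ends */
        if (parts[i].lower < parts[i - 1].upper)
            return 0;
    }
    return 1;
}

int
main(void)
{
    PartRange   parts[] = {{200, 300}, {0, 100}, {100, 200}};

    printf("disjoint: %d\n", ranges_are_disjoint(parts, 3));   /* prints 1 */
    return 0;
}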

I imagine we would have an API called something like
RelationIdentifyPartition(), where we provide value(s) for the
partitioning key column(s) and it returns the Oid of the partition that
holds those values. That function would build the metadata, if not
already cached, then bsearch it to provide the Oid.
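
In standalone form the lookup path might look like the sketch below.
Every name is invented, the single-key assumption is just for brevity,
and in the backend the NULL-cache branch would of course build the
metadata from the partitioning catalog rather than give up:

#include <stdio.h>

typedef unsigned int Oid;              /* stand-in so the sketch stands alone */
#define InvalidOid  ((Oid) 0)

typedef struct PartitioningMetadata
{
    int     nparts;
    long   *upper_bounds;              /* sorted, exclusive upper bounds */
    Oid    *part_oids;
} PartitioningMetadata;

/* Stand-in for the relcache entry of an inheritance parent. */
typedef struct DemoRelation
{
    PartitioningMetadata *partmeta;    /* NULL until first use */
} DemoRelation;

static Oid
RelationIdentifyPartition(DemoRelation *rel, long key)
{
    PartitioningMetadata *meta = rel->partmeta;
    int     lo;
    int     hi;

    /*
     * In the backend this is where a missing cache would be built from
     * the partitioning catalog and hung off the relcache entry.
     */
    if (meta == NULL)
        return InvalidOid;

    /* bsearch for the first partition whose upper bound exceeds the key */
    lo = 0;
    hi = meta->nparts;
    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (key < meta->upper_bounds[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return (lo < meta->nparts) ? meta->part_oids[lo] : InvalidOid;
}

int
main(void)
{
    long    bounds[] = {100, 200, 300};
    Oid     oids[] = {16384, 16387, 16390};        /* made-up OIDs */
    PartitioningMetadata meta = {3, bounds, oids};
    DemoRelation rel = {&meta};

    printf("key 150 routes to partition %u\n",
           RelationIdentifyPartition(&rel, 150));
    return 0;
}

COPY and INSERT would then only need to open the relation for the
returned Oid and insert the row there.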

--
Simon Riggs           www.2ndQuadrant.com


