From Simon Riggs
Subject Re: Dynamic Partitioning using Segment Visibility Maps
Msg-id 1199891288.4266.300.camel@ebony.site
In response to Re: Dynamic Partitioning using Segment Visibility Maps  (Gavin Sherry <swm@alcove.com.au>)
List pgsql-hackers

Gavin and all,

This is quite a long reply, so apologies for that.

On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote:
> On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> > This technique would be useful for any table with historical data keyed
> > by date or timestamp. It would also be useful for data where a
> > time-of-insert component is implicit, such as many major entity tables
> > where the object ids are assigned by a sequence. e.g. an Orders table
> > with an OrderId as PK. Once all orders placed in a period have been
> > shipped/resolved/closed then the segments will be marked read-only.
> > 

> A novel approach to the problem. For me, this proposal addresses some
> of the other problems in postgres (visibility in the heap vs. index)
> rather than the problems with partitioning itself.

I think you're right that it isn't attempting to address the problems
with partitioning directly, it is attempting to address the core use
case itself. I think we have an opportunity to bypass the
legacy-of-thought that Oracle has left us and implement something more
usable.

There are some low-level technical issues associated with declarative
partitioning that I'll address last, which are in many ways the big
reason to want to go to non-declarative partitioning.

> It might seem otherwise, but for me partitioning is tool for managing
> large volumes of data. It allows the user to encode what they know about
> the nature of their data, and that's often about management. In this
> way, your proposal actually makes partitioning too smart: the user wants
> to tell the system how to organise the data, as opposed to the other way
> around.

> At Greenplum, we've been discussing this in depth. Interestingly, we
> also felt that the storage layer could be much smarter with large tables
> with predictable layouts and predictable data patterns. But the thing
> is, people with large tables like partitioning, putting different
> partitions on different storage; they like the ability to merge and
> split partitions; they like truncating old partitions. In a word, they
> seem to like the manageability partitions give them -- as well as the
> performance that comes with this.

Do people really "like" running all that DDL? There is significant
manpower cost in implementing and maintaining a partitioning scheme,
plus significant costs in getting it wrong. 

If people with large tables like partitioning, why is Oracle moving
towards automated partitioning in 11g? Automated partitioning was one of
the major goals for this next set of Postgres partitioning functionality
also, whether or not we have declarative partitioning. My thinking is:
let's go for the best ideas and skip over the stuff that will be (or
already is) obsolete before it's left the design stage.

I see many more systems in the Terabyte range now than I did 10 years
ago, but I see about the same number of DBAs. We'll always need experts,
but I feel we should be using our expertise to simplify the standard
cases, not just maintain the same level of difficulty in managing them.
One of the big benefits of the dynamic partitioning approach is that it
needs no DDL. So it will work out-of-the-box, for anybody.

Deleting older data would be optimised under the proposed scheme, so
that's not really a problem. Loading data is actually slightly harder
and slower with declarative partitioning (see below).

Merging and splitting partitions are tools for fine tuning a very
complex partitioning scheme. They do also allow a non-linear
segmentation scheme, which might aid performance in some cases.

(On a different note, I'm strongly in favour of a declarative approach
to other optimizer information to allow the DBA to say what they know
about how a table will be used. But that's off-topic for now.)

One major advantage of the dynamic approach is that it can work on
multiple dimensions simultaneously, which isn't possible with
declarative partitioning. For example if you have a table of Orders then
you will be able to benefit from Segment Exclusion on all of these
columns, rather than just one of them: OrderId, OrderDate,
RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
in the partitioning, e.g. if we fill one partition of Orders per day, then
the OrderId and OrderDate columns will start out perfectly arranged. Any
particular RequiredByDate will probably be spread out over 7 partitions,
but that's way better than being spread out over 365+ partitions.

When we look at the data in the partition we can look at any number of
columns. When we partition declaratively, we get only one connected set
of columns, which is one of the reasons you want multi-dimensional
partitioning in the first place.
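
To make the multi-column point concrete, here's a minimal sketch of the
exclusion test itself. This isn't the proposed implementation: the
SegmentSummary struct, the particular columns and the read-only flag
handling are all invented for illustration, and in practice the min/max
summaries would be collected alongside the Segment Visibility Map.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-segment summary: min/max of each tracked column.
 * Only trusted once the segment has been marked read-only. */
typedef struct SegmentSummary
{
    bool    read_only;
    int     min_order_id;
    int     max_order_id;
    int     min_order_date;     /* days since epoch, for simplicity */
    int     max_order_date;
} SegmentSummary;

/* Can a scan with "OrderDate BETWEEN lo AND hi" skip this segment? */
static bool
segment_excluded(const SegmentSummary *seg, int lo, int hi)
{
    if (!seg->read_only)
        return false;           /* still changing, must scan it */
    return (hi < seg->min_order_date || lo > seg->max_order_date);
}

int
main(void)
{
    SegmentSummary segs[] = {
        { true,     1, 1000, 7300, 7306 },  /* one week of orders  */
        { true,  1001, 2000, 7307, 7313 },  /* the following week  */
        { false, 2001, 2500, 7314, 7320 }   /* current, still open */
    };
    int     lo = 7308, hi = 7310;           /* predicate on OrderDate */

    for (int i = 0; i < 3; i++)
        printf("segment %d: %s\n", i,
               segment_excluded(&segs[i], lo, hi) ? "excluded" : "scanned");
    return 0;
}

The same test works against OrderId, RequiredByDate or LastModifiedDate
summaries without the writer doing anything different, which is the
point: the on-disk layout isn't tied to any one column.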

> To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
> partitioning in leading databases (Oracle, Informix, DB2) and came up
> with a highly expressive grammar which takes the best of each I think
> (I'll post details on the grammar in a separate thread). The idea is
> that range (for example, a date range), list (a list of distinct values)
> and hash partitioning be supported on multiple columns. Partitions can
> be added, merged, truncated. Partitions can reside on different
> tablespaces. The existing issues with the rewriter, COPY support and the
> like go away by smartening up the backend.

> To explore the grammar and semantics Jeff and I (to a small extent) have
> implemented the parsing of such a grammar in the backend. At the moment,
> this just results in the generation of a bunch of rules (i.e., it
> produces the current behaviour, as if someone had written rules
> themselves) and debugging output.
> 
> The fully fledged approach will see partitioning rules stored in
> a new system catalog which can be interrogated by the planner or COPY to
> determine which partition a given tuple belongs in or which partition(s)
> a WHERE clause matches.

When loading data we would need to examine each incoming tuple and
distribute it to the correct place in the partitioned heap. That's a
non-trivial overhead on data loading that I would like to avoid, since
if we rely on the natural locality of loaded data then that overhead is
zero.
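
As a rough sketch of that per-tuple cost (the boundary array and binary
search below are just stand-ins for whatever catalog lookup the real
COPY path would do; all the names are invented):

#include <stdio.h>

/* Hypothetical upper bounds of four range partitions, ascending.
 * Under declarative partitioning every loaded tuple pays a lookup
 * like this; under dynamic segments it simply lands in the current
 * segment and the summaries are updated as a side effect. */
static const int part_upper[] = { 1000, 2000, 3000, 4000 };
#define NPARTS  (sizeof(part_upper) / sizeof(part_upper[0]))

static int
route_tuple(int key)
{
    int     lo = 0,
            hi = NPARTS;

    while (lo < hi)             /* first boundary greater than key */
    {
        int     mid = (lo + hi) / 2;

        if (key < part_upper[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;                  /* == NPARTS means no partition matches */
}

int
main(void)
{
    printf("key 1500 -> partition %d\n", route_tuple(1500));
    printf("key 4500 -> partition %d\n", route_tuple(4500));
    return 0;
}

The lookup itself is cheap, but it runs once per row, and it is one more
place a load can fail when a row matches no partition.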

> ...partitioning rules stored in
> a new system catalog which can be interrogated...

Agreed. I think this would be the best approach whatever we do. pg_class
would eventually impose a limit on the number of partitions for us.

> Yes, this is the traditional approach but I think it still has merit. We
> shall continue working on this approach because it is what our customers
> have asked for. We would also like to see it get into postgres too.

I think there is very likely some merit hiding in the "traditional
approach", but we should be clear about what that is and is not. (I
still have half my brain in that way of doing things, so I'm working
through everything to make sure I miss as little as possible.)

I'd like to hear from anybody with descriptions of use cases where the
things declarative partitioning offers, and this proposal doesn't, would
provide a significant gain. If nobody has ever run such tests or had those
thoughts, let's examine them now to make sure we're doing the right
thing. I don't want to copy what others have done without considering
whether it is correct for us.


After thinking about what you've said, ISTM that the declarative
partitioning approach has these advantages over the dynamic approach:

- allows partitions of different sizes, for when a non-linear
partitioning scheme gives considerable performance gain

- allows partitioning to continue to work whilst UPDATEs continue within
a particular partition

and these disadvantages:

- only works with one set of columns, cannot exclude partitions by other
columns

- requires manual definition

- requires partition names and numbers to be user visible

- declarative partitioning can improve the situation, or make it worse,
depending upon whether the definitions match the reality of the data

- acquires table locks during DDL execution, to enforce validity of the
rules, possibly for extended periods while data moves/is checked

- requires loaded data to execute partitioning code to identify where it
should insert itself

My assessment is that the dynamic partitioning approach wins in most
common cases for large growing tables, though there are additional use
cases where more control is desirable.

(The two advantages of declarative partitioning might be something we
could change with the dynamic approach, but not in first phase. For
example we might see a way to break down segments into 4 sub-segments
recursively using a quad tree approach. Thus when segments are being
updated we simply isolate the issue around the changing data.)

I would like to proceed in a way that allows dynamic partitioning to
occur naturally as the default, with the option to add explicit
partitioning, if required and possible at a later stage.

At this stage I'm not sure whether those additional cases apply to truly
growing tables and whether there is sufficient additional advantage to
make declarative partitioning worth considering.


Technical Issues
----------------

The main problems I see occur in the storage layer. If we can discuss
what you think is needed at the storage layer then we might stand a
chance of laying the foundations for both options. The catalog changes
and executor changes are probably fairly similar for both.

My first problem sounds quite mundane: expandability. If you define a
partition manually and logically then it needs to cope with an arbitrary
number of rows. That means that the view of a partition as a range of
contiguous blocks disappears and is replaced by a more complex
definition. When you turn partitioning on its head and say "What data is
in this range of blocks?" you don't need to have expandability and the
partition can be a fixed set of contiguous blocks. That works well with
any shape of data, so merging/splitting operations are not required
because we never have fat or thin partitions to cope with. You can also
redefine partitioning simply by changing the segment size and
re-VACUUMing the table.
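
To illustrate what turning the question around buys us at the block
level, here's a trivial sketch, assuming purely for illustration the 1GB
segment size md.c already uses; the function names are invented:

#include <stdio.h>

typedef unsigned int BlockNumber;

/* With fixed-size segments, mapping a block to "its partition" is
 * pure arithmetic: no catalog lookup and no expandability problem.
 * 1GB of 8kB blocks, matching md.c's existing file size, is assumed
 * only for illustration. */
#define BLOCKS_PER_SEGMENT  131072

static unsigned int
block_to_segment(BlockNumber blkno)
{
    return blkno / BLOCKS_PER_SEGMENT;
}

static BlockNumber
segment_first_block(unsigned int segno)
{
    return (BlockNumber) segno * BLOCKS_PER_SEGMENT;
}

int
main(void)
{
    BlockNumber blkno = 500000;
    unsigned int segno = block_to_segment(blkno);

    printf("block %u is in segment %u, which starts at block %u\n",
           blkno, segno, segment_first_block(segno));
    return 0;
}

Changing the segment size changes only the divisor; the data doesn't
move, we just rebuild the summaries by re-VACUUMing.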

At the moment, smgr provides the executor with a single uniform view of
a table as a single contiguous range of blocks. The current
implementation of md.c allows a table to actually consist of a range of
files, without the executor knowing. If we violate that, then we'd need
to change all code that accepts a BlockNumber, e.g. indexes, the buffer
manager etc. So anything we do that is of any complexity must be below
the smgr layer. Allowing expandable partitions eventually implies a
non-contiguous range of blocks. 

In the worst case coping with a non-contiguous block range means you
need to consult a block index for each block, which would be horrible.
One better solution might be to have logical partitions broken up into
physical partitions, all of the same size. Various solutions are possible,
but most seem to detract from the thing we were seeking in the
first place and/or restrict the range of block addresses
available, unless we also go for variable-size BlockNumbers. Urgh.
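
For contrast, here's the kind of indirection an expandable logical
partition forces on us (again just a sketch with invented names): once a
logical partition can grow, some directory has to record which
fixed-size physical extents it owns, and every block access has to
consult it.

#include <stdio.h>

typedef unsigned int BlockNumber;

#define BLOCKS_PER_EXTENT   131072

/* Hypothetical directory: extent_owner[i] is the logical partition
 * that owns physical extent i.  The extents belonging to a logical
 * partition are no longer contiguous. */
static const int extent_owner[] = { 0, 0, 1, 0, 2, 1 };

static int
block_to_logical_partition(BlockNumber blkno)
{
    return extent_owner[blkno / BLOCKS_PER_EXTENT];
}

int
main(void)
{
    BlockNumber blkno = 300000;     /* falls in physical extent 2 */

    printf("block %u belongs to logical partition %d\n",
           blkno, block_to_logical_partition(blkno));
    return 0;
}

That directory is the "block index" mentioned above. It can be kept
small and cached, but it is still machinery the fixed-size scheme never
needs.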

If anybody thinks we can get around this by putting a maximum size on
partitions, then I'd say you're mad. The last thing we would want is a
mechanism that stops you from adding new data if you get the partition
definitions wrong, especially since my experience is that most people
set up partitioning in a way that is proven sub-optimal after some time
in real operation. The same comments would also apply to Oracle's Hash
Cluster and DB2's Multi-Dimensional Clustering techniques. They're great
performance features when declared correctly, but almost unusable in
real applications that can't control what business and the real world
will throw at them.

So it seems likely that we'd be heading for a level of change within the
storage layer that would not be easily accepted, given that not
everybody needs partitioning in Postgres. 

We're probably caught on the question of whether this is a general
purpose or special purpose database system. Realistically, I have to set
limits, so I'd have to describe my current goal of being able to manage
16TB easily and quickly with Postgres, without too much additional
thought on the part of the DBA, when they have one or more naturally
growing tables. Special purpose requirements for parallelism or larger
databases can use another database system, e.g. GP, XDB, ParAccel etc.


So, there are specific disadvantages that I see to declarative
partitioning with respect to its deeper implementation for us. I see no
need to rule it out completely, but I do see reasons to de-prioritise
it. Detailed comments and thoughts most welcome.

-- 
 Simon Riggs
 2ndQuadrant   http://www.2ndQuadrant.com


