Re: Dynamic Partitioning using Segment Visibility Maps - Mailing list pgsql-hackers

From Gavin Sherry
Subject Re: Dynamic Partitioning using Segment Visibility Maps
Date
Msg-id 20080110020638.GB2837@europa.idg.com.au
In response to Re: Dynamic Partitioning using Segment Visibility Maps  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Dynamic Partitioning using Segment Visibility Maps  (Simon Riggs <simon@2ndquadrant.com>)
Re: Dynamic Partitioning using Segment Visibility Maps  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
Hi Simon,

On Wed, Jan 09, 2008 at 03:08:08PM +0000, Simon Riggs wrote:
> Do people really "like" running all that DDL? There is significant
> manpower cost in implementing and maintaining a partitioning scheme,
> plus significant costs in getting it wrong. 

Well... that's impossible for me to say. Most of the people who work
with lots of data like using MDX (a large and quite complex reporting
and modelling language from MS, more complex than SQL IMHO) so...

> 
> If people with large tables like partitioning why is Oracle moving
> towards automated partitioning in 11g? Automated partitioning was one of

Have you used Oracle's partitioning? I'm not surprised they're moving
away ;-).

More seriously, I didn't know that. Do you have a URL?

> the major goals for this next set of Postgres partitioning functionality
> also, whether or not we have declarative partitioning. My thinking is
> let's go for the best ideas and skip over the stuff that will be (is)
> obsolete before it's left the design stage.

Well, you're assuming that because declarative partitioning has been
around a while, it's obsolete. I don't think that's the case. Postgres
has been around about as long...

> I see many more systems in the Terabyte range now than I did 10 years
> ago, but I see about the same number of DBAs. We'll always need experts,
> but I feel we should be using our expertise to simplify the standard
> cases, not just maintain the same level of difficulty in managing them.
> One of the big benefits of the dynamic partitioning approach is that it
> needs no DDL. So it will work out-of-the-box, for anybody.

It's fine to say that, but people have already started talking about
grammar extensions for dynamic partitions; people want tools to manage
it. There's also a learning curve in working out how and when to use
the different VACUUM options that have been discussed.
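
To give a flavour of what I mean, the sort of knobs that have been
floated would look something like this (entirely hypothetical: none of
these options exist, and the names are invented for illustration):

    -- mark quiescent segments read-only so they become candidates
    -- for Segment Exclusion (hypothetical syntax)
    VACUUM orders MARK READONLY;

    -- force a rewrite so tuples are regrouped into segments
    -- (hypothetical syntax)
    VACUUM orders REWRITE SEGMENTS;

Knowing when each applies is its own kind of burden.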

> Deleting older data would be optimised under the proposed scheme, so
> that's not really a problem. Loading data is actually slightly harder
> and slower with declarative partitioning (see below).
> 
> Merging and splitting partitions are tools for fine tuning a very
> complex partitioning scheme. They do also allow a non-linear
> segmentation scheme, which might aid performance in some cases.

What about adding partitions into an existing set of partitions? Greg's
post on that was dismissed, and it might not be representative, but we
have users doing that all the time.
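
For illustration, this is the kind of routine operation I mean (the
syntax is illustrative only, not the final grammar):

    -- add next month's partition to an existing range-partitioned table
    ALTER TABLE orders ADD PARTITION
        START ('2008-02-01') END ('2008-03-01');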

> One major advantage of the dynamic approach is that it can work on
> multiple dimensions simultaneously, which isn't possible with
> declarative partitioning. For example if you have a table of Orders then
> you will be able to benefit from Segment Exclusion on all of these
> columns, rather than just one of them: OrderId, OrderDate,
> RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> in the partitioning, e.g. if we fill one partition of Orders per day,
> then the OrderId and OrderDate columns will start out perfectly
> arranged. Any particular RequiredByDate will probably be spread out
> over 7 partitions, but that's way better than being spread out over
> 365+ partitions.
> 
> When we look at the data in the partition we can look at any number of
> columns. When we declaratively partition, you get only one connected set
> of columns, which is one of the reasons you want multi-dimensional
> partitioning in the first place.
> 
> > To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
> > partitioning in leading databases (Oracle, Informix, DB2) and came up
> > with a highly expressive grammar which takes the best of each I think
> > (I'll post details on the grammar in a separate thread). The idea is
> > that range (for example, a date range), list (a list of distinct values)
> > and hash partitioning be supported on multiple columns. Partitions can
> > be added, merged, truncated. Partitions can reside on different
> > tablespaces. The existing issues with the rewriter, COPY support and the
> > like go away by smartening up the backend.
> 
> > To explore the grammar and semantics Jeff and I (to a small extent) have
> > implemented the parsing of such a grammar in the backend. At the moment,
> > this just results in the generation of a bunch of rules (i.e., it
> > produces the current behaviour, as if someone had written rules
> > themselves) and debugging output.
> > 
> > The fully fledged approach will see partitioning rules stored in
> > a new system catalog which can be interrogated by the planner or COPY to
> > determine which partition a given tuple belongs in or which partition(s)
> > a WHERE clause matches.
> 
> When loading data we would need to examine each incoming tuple and
> distribute it to the correct place in the partitioned heap. That's a
> non-trivial overhead on data loading that I would like to avoid, since
> if we rely on the natural locality of loaded data then that overhead is
> zero.

Yes, but it's still a lot faster than what we have at the moment, and
the dynamic approach is not without its costs. I've pointed it out
before, but I'll point it out again: updating a key which occurs in all
segments would mean the whole table needs to be vacuumed to get
partitioning back. That's a MASSIVE cost.
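
A sketch of the pathological case, assuming a hypothetical orders table
whose segments are naturally arranged by load date:

    -- one logical change touching a customer whose orders appear
    -- in every segment
    UPDATE orders SET status = 'migrated' WHERE customer_id = 42;

    -- now every segment contains read-write tuples, so Segment
    -- Exclusion gains nothing until the entire table is re-vacuumed
    VACUUM orders;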

> I'd like to hear from anybody with some descriptions of use cases where
> the differences between what's been proposed and what is missing provide
> a significant gain. If nobody has ever run such tests or had those
> thoughts, let's examine them now to make sure we're doing the right
> thing. I don't want to copy what others have done without considering
> whether it is correct for us.
> 

I don't think you're stating the question clearly enough. It's not just
a case of how many people hate typing out DDL commands; it's how many
people would be adversely affected by the difficulties I've pointed
out. If you update 100 rows in your big table, you're going to have to
consider VACUUMing it, since your partition may now have 100-odd
read-write segments selected for scanning, not just 1 or 2.

> 
> After thinking about what you've said ISTM that the declarative
> partition approach has these advantages over the dynamic approach
> 
> - allows partitions of different sizes, for when a non-linear
> partitioning scheme gives considerable performance gain
> 
> - allows partitioning to continue to work whilst UPDATEs continue within
> a particular partition

What about consistency over time? High levels of control for users if
they want it? Even simple things like tablespace support? Unload and
load performance much better than now? Reliable and consistent query
plans? Support for datatypes other than those relating to time?
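
To make the tablespace point concrete, the grammar we'll post allows
something along these lines (illustrative only; the details will be in
the separate thread):

    CREATE TABLE orders (
        order_id   bigint,
        order_date date,
        region     text
    )
    PARTITION BY RANGE (order_date)
    (
        PARTITION p2007 START ('2007-01-01') END ('2008-01-01')
            TABLESPACE archive_space,
        PARTITION p2008 START ('2008-01-01') END ('2009-01-01')
            TABLESPACE fast_space
    );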

> 
> and these disadvantages
> 
> - only works with one set of columns, cannot exclude partitions by other
> columns
> 
> - requires manual definition
> 
> - requires partition names and numbers to be user visible

In fact the grammar we will propose does not make this mandatory. But
either way, how is that a disadvantage worth listing?

> 
> - declarative partitioning can improve the situation, or make it worse,
> depending upon whether the definitions match the reality of the data

Which is the same as dynamic partitioning.

> - acquires table locks during DDL execution, to enforce validity of the
> rules, possibly for extended periods while data moves/is checked
> 
> - requires loaded data to execute partitioning code to identify where it
> should insert itself

Sure, that's how it works.

> I would like to proceed in a way that allows dynamic partitioning to
> occur naturally as the default, with the option to add explicit
> partitioning, if required and possible at a later stage.

As the default? What about the effect on people who do not need
partitioning?

> Technical Issues
> ----------------
> 
> The main problems I see occur in the storage layer. If we can discuss
> what you think is needed at the storage layer then we might stand a
> chance of laying the foundations for both options. The catalog changes
> and executor changes are probably fairly similar for both.

I agree, there are major issues with the storage layer. I could list a
lot of things I'd like to see improved, but they come down to taking
advantage of how storage actually works (once you've paid to read one
block, reading 100 sequential ones costs little more), compression, and
much faster load.

> 
> My first problem sounds quite mundane: expandability. If you define a
> partition manually and logically then it needs to cope with an arbitrary
> number of rows. That means that the view of a partition as a range of
> contiguous blocks disappears and is replaced by a more complex
> definition. When you turn partitioning on its head and say "What data is
> in this range of blocks?" you don't need to have expandability and the
> partition can be a fixed set of contiguous blocks. That works well with
> any shape of data, so merging/splitting operations are not required
> because we never have fat or thin partitions to cope with. You can also
> redefine partitioning simply by changing the segment size and
> re-VACUUMing the table.

And that's what I don't like about it. It makes VACUUM more expensive,
and suddenly VACUUM becomes a data management tool. You said elsewhere
that load is faster, but it's not: it's twice as slow, because you
/must/ VACUUM after the load.
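
That is, the real cost of a load under the dynamic scheme is both of
these steps, since Segment Exclusion only kicks in after the second
pass (file path and table hypothetical):

    COPY orders FROM '/data/orders_2008.csv' CSV;
    VACUUM orders;  -- a second full pass over the newly loaded data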

> At the moment, smgr provides the executor with a single uniform view of
> a table as a single contiguous range of blocks. The current
> implementation of md.c allows a table to actually consist of a range of
> files, without the executor knowing. If we violate that, then we'd need
> to change all code that accepts a BlockNumber, i.e. indexes, buffer
> manager etc. So anything we do that is of any complexity must be below
> the smgr layer. Allowing expandable partitions eventually implies a
> non-contiguous range of blocks. 
> 
> In the worst case coping with a non-contiguous block range means you
> need to consult a block index for each block, which would be horrible.
> One better solution might be to have logical partitions broken up into
> physical partitions all of the same size. Various solutions possible,
> but most seem to detract from the thing which we were seeking in the
> first place and/or seem to restrict the range of block addresses
> available, unless we go for variable-size BlockNumbers also. Urgh.
> 
> If anybody thinks we can get around this by putting a maximum size on
> partitions, then I'd say you're mad. The last thing we would want is a
> mechanism that stops you from adding new data if you get the partition
> definitions wrong, especially since my experience is that most people
> set up partitioning in a way that is proven sub-optimal after some time
> in real operation. The same comments would also apply to Oracle's Hash
> Cluster and DB2's Multi-Dimensional Clustering techniques. They're great
> performance features when declared correctly, but almost unusable in
> real applications that can't control what business and the real world
> will throw at them.
> 
> So it seems likely that we'd be heading for a level of change within the
> storage layer that would not be easily accepted, given that not
> everybody needs partitioning in Postgres. 

Originally, when Jeff and others at Greenplum were looking at better
partitioning, they came up with a system quite similar to the one
you're proposing. There were two main reasons it was put to one side:
it undermines the planner, and it doesn't give predictable results
(performance changes, and generally degrades, over time).

I've talked about the latter elsewhere, so on to the former. If the
exclusion is executor-driven, the planner cannot help but create a seq
scan plan. The planner will think you're returning 100X rows when
really you end up returning X rows; after that, all decisions made by
the planner are totally bogus. The solution means having the planner
look at those values for each partition so that it can generate an
accurate plan. But then we have locking issues: what if the partition
range changes underneath us?
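
A sketch of the estimation problem (plan output abridged and
hypothetical):

    EXPLAIN SELECT * FROM orders WHERE order_date = '2008-01-09';

     Seq Scan on orders  (cost=0.00..437500.00 rows=10000000 width=48)
       Filter: (order_date = '2008-01-09'::date)

The planner has costed a scan of every segment and estimated 100X rows.
If exclusion later skips most segments at execution time, every
decision stacked on top of that estimate, such as join order and join
method, was made with bogus numbers.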

Postgres wouldn't be the only database with this technology. Netezza
works in a similar way, AND it degrades in the way that I've said. You
can also assume that lots of obvious approaches to doing this are
patented by them, by the way.

> 
> We're probably caught on the question of whether this is a general
> purpose or special purpose database system. Realistically, I have to set
> limits, so I'd have to describe my current goal of being able to manage
> 16TB easily and quickly with Postgres, without too much additional
> thought on the part of the DBA, when they have one or more naturally
> growing tables. Special purpose requirements for parallelism or larger
> databases can use another database system, e.g. GP, XDB, ParAccel etc.

Sure, but this still affects everyone. With the declarative approach,
those who want it, use it. With the dynamic approach, everyone
experiences the cost.

Thanks,

Gavin

