Home > mailing lists
Feature suggestions (long) - Mailing list pgsql-hackers

From	Martijn van Oosterhout
Subject	Feature suggestions (long)
Date	May 17, 2003 11:00:52
Msg-id	20030517150039.GF3336@svana.org Whole thread Raw
Responses	Re: Feature suggestions (long)
List	pgsql-hackers
Tree view
[Please CC any replies to me. Thanks.]

I'm going suggest a feature like what Oracle calls "partitions" and later on
something with indexes. The idea is to generate some discussion to see if
they are worthy of being added to the TODO list.

Partitions
==========
The idea is to split data across tables according to the contents of a
tuple. So you could split the contents of a table based on years or some
such. I'm not going to go too much into why it's a good idea, there are
several links on the web discussing it:

http://www.nyoug.org/200212nanda.pdf

Anyway, what I'm proposing is different from what Oracle does but flexible
enough such that what Oracle does could the implemented/emulated easily
enough. I was thinking along the lines of:

CREATE PARTITION sales_2003 ON sales WHERE saledate >= '2003-01-01' AND saledate < '2004-01-01';

This would produce a sequence of events equivalent to:

CREATE TABLE sales_2003 () INHERITS ( sales );
INSERT INTO sales_2003 SELECT * FROM sales WHERE <condition>;
DELETE FROM sales WHERE <condition>;

CREATE RULE rule1 AS ON INSERT TO sales
WHERE <condition> DO INSTEAD
INSERT INTO sales_2003 (NEW.*);

CREATE RULE rule2 AS ON DELETE TO sales
WHERE <condition> DO INSTEAD
DELETE FROM sales_2003 WHERE <???>;

<a pile more RULEs>

err... It gets a bit complex here. My point is that it can be done but with
proper support it can be done more efficiently, reliably and usably.

Internally you could treat them more like real inherited tables. Inserts
would be fairly straight forward (use the conditions to determine the
resulting table). Deletes require no work at all. Updates are trickier. The
problem is that after an update the tuple may end up needing to be in
another table. Adding/dropping/altering columns would have to be handled
similarly to inherited tables.

Looking at heap_update or ExecUpdate it's not clear whether it would be a
problem if the old and new tuples are in a separate tables. As long as the
descriptors are the same it shouldn't be too much of a deal hopefully.

So far the storage doesn't seem too difficult. I'm not sure how to deal with
inheritance and this. I'd suggest making partitions work only on individual
tables and not inherited.

Next stage would be teaching the planner. The conditions would be
pseudo-constraints on the partitions. Hence if the conditions and the
constraints form a non-intersecting set, you can skip that partition
altogether.

Now, semantically, should you be allowed to do inserts, updates and deletes
on partitions directly, or should they be hidden? If/when we get
table-spaces, you should be able to move them around.

Deleting a partition would amount to moving all the tuples back to the main
table. Of course, if there are other partitions the conditions will have to
be re-evaluated to place the tuple into one of the other partitions.

What do you do if multiple partitions have conflicting conditions? I'd be
tempted to do a first-come-first-serve basis, though maybe some kind of
priority? One advantage of the Oracle approach is that this issue doesn't
come up. We certainly can't write a system to detect anything but the
obvious cases.

Of course, making individual indexes for all the partitions could get
tedious, even if the system did it automatically; which me brings to my next
suggestion:

Multi-table indexes
===================

Currently an index only applies to a single table. If you use inheritance
all the indexes can get cumbersome. With something like the above partitions
it could get completely out of hand if used heavily. What I'd like is if
instead of an index mapping <index keys> -> ctid, it does <index keys> ->
(tableoid,ctid). At the same time it lets us build UNIQUE indexes for
inheritance (for primary keys).

Currently, an index is represented by a row in pg_index. To implement this
you would need to arrange a way for an index to list all its constituent
tables so that when doing an update you know which to indexes to update.
Multiple rows seems the easiest way but it does violate a uniqueness
constraint.

Obviously, the columns indexed can only be present in the root table.
However, the index would be usable for queries on subtables too. The planner
would simply have to remember to push down 'tableoid IN (oid,oid,oid,...)'
conditions.

The major problem here is statistics. What you really need is some sort of
aggregate statistics over an entire inheritance tree. Without something like
this the planner can't make sensible decisions on when to use the index.
ANALYZE currently doesn't do this. It's not entirely clear how it would go
about it either. This doesn't apply to UNIQUE indexes though.

Another issue is when a table is deleted. Normally you can just purge the
index when the table is deleted. With these indexes you can't since the
remaining data is still useful. So the index access method would have to be
careful not to return tuples from tables that don't exist anymore. VACUUM
would hopefully eventually clean them up.

There would also be an issue of the IndexScan node returning the right
fields depending on which part of the subtree is being queried. For example,
if B were inherited from A then an IndexScan on A* could return fields not
in A if the query only ever referenced B. This may already be solved as
the IndexScan node can scan multiple indexes.

=

In any case, the second feature seems like it will be required (or at least
something similar) to handle proper referential integrity and primary keys
for inherited tables. But if something like the former is implemented
(hopefully, as using rules would get very cumbersome) then the latter would
be necessary.

Have a good weekend.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organised violence.
> Westerners often forget this fact, non-Westerners never do."
>   - Samuel P. Huntington
pgsql-hackers by date:
From: Michael Meskes
Date: 17 May 2003, 05:06:12
Subject: Re: ECPG timestamp.c
From: Christopher Kings-Lynne
Date: 17 May 2003, 11:33:27
Subject: Re: Heads up: 7.3.3 this Wednesday
Feature suggestions (long) - Mailing list pgsql-hackers

Previous

Next