Thread: cluster index on a table

cluster index on a table

From
Ibrahim Harrani
Date:
Hello,

I have a table like following. To increase the performance of this
table, I would like to  create CLUSTER.
First, Which index should I use on this table for CLUSTER?
Secondly,  Can I create multiple CLUSTER  on the same table?
I will appreciate, if you can suggest other options to increase the
performance of the table.
I use this table to save metadata of the mails on my system.


mail=# \d maillogs
                                         Table "public.maillogs"
       Column       |            Type             |
   Modifiers
--------------------+-----------------------------+-------------------------------------------------------
 id                 | bigint                      | not null default
nextval('maillogs_id_seq'::regclass)
 queueid            | character varying(255)      | not null default
'*'::character varying
 recvtime           | timestamp without time zone | default now()
 remoteip           | character varying(128)      | not null default
'0.0.0.0'::character varying
 relayflag          | smallint                    | not null default
(0)::smallint
 retaction          | integer                     |
 retval             | integer                     | not null default 0
 probspam           | double precision            | not null default
(0)::double precision
 messageid          | text                        |
 fromaddress        | text                        | not null
 toaddress          | text                        | not null
 envelopesender     | text                        |
 enveloperecipients | text                        |
 messagesubject     | text                        |
 size               | bigint                      |
 logstr             | character varying(1024)     |
 destinationaddress | character varying(255)      |
 quarantinepath     | character varying(1024)     | not null default
''::character varying
 backuppath         | character varying(1024)     | not null default
''::character varying
 quarantineflag     | smallint                    | not null default
(0)::smallint
 backupflag         | smallint                    | not null default
(0)::smallint
 deletedflag        | smallint                    | not null default 0
 profileid          | integer                     | not null default 0
Indexes:
    "maillogs_pkey" PRIMARY KEY, btree (id) CLUSTER
    "idx_maillogs_backupflag" btree (backupflag)
    "idx_maillogs_deletedflag" btree (deletedflag)
    "idx_maillogs_enveloperecipients" btree (enveloperecipients)
    "idx_maillogs_envelopesender" btree (envelopesender)
    "idx_maillogs_messagesubject" btree (messagesubject)
    "idx_maillogs_quarantineflag" btree (quarantineflag)
    "idx_maillogs_recvtime" btree (recvtime)
    "idx_maillogs_remoteip" btree (remoteip)
    "idx_maillogs_revtal" btree (retval)
Foreign-key constraints:
    "maillogs_profileid_fkey" FOREIGN KEY (profileid) REFERENCES
profiles(profileid)
Triggers:
    maillogs_insert AFTER INSERT ON maillogs FOR EACH ROW EXECUTE
PROCEDURE maillogs_insert()

mail=#

Re: cluster index on a table

From
Kenneth Marshall
Date:
Clustering reorganizes the layout of a table according to
the ordering of a SINGLE index. This will place items that
are adjacent in the index adjacent in the heap. So you need
to cluster on the index that will help the locality of
reference for the queries which will benefit you the most.
Execution time sensitive queries are a good way to choose.

Cheers,
Ken

On Wed, Jun 24, 2009 at 08:32:14PM +0300, Ibrahim Harrani wrote:
> Hello,
>
> I have a table like following. To increase the performance of this
> table, I would like to  create CLUSTER.
> First, Which index should I use on this table for CLUSTER?
> Secondly,  Can I create multiple CLUSTER  on the same table?
> I will appreciate, if you can suggest other options to increase the
> performance of the table.
> I use this table to save metadata of the mails on my system.
>
>
> mail=# \d maillogs
>                                          Table "public.maillogs"
>        Column       |            Type             |
>    Modifiers
> --------------------+-----------------------------+-------------------------------------------------------
>  id                 | bigint                      | not null default
> nextval('maillogs_id_seq'::regclass)
>  queueid            | character varying(255)      | not null default
> '*'::character varying
>  recvtime           | timestamp without time zone | default now()
>  remoteip           | character varying(128)      | not null default
> '0.0.0.0'::character varying
>  relayflag          | smallint                    | not null default
> (0)::smallint
>  retaction          | integer                     |
>  retval             | integer                     | not null default 0
>  probspam           | double precision            | not null default
> (0)::double precision
>  messageid          | text                        |
>  fromaddress        | text                        | not null
>  toaddress          | text                        | not null
>  envelopesender     | text                        |
>  enveloperecipients | text                        |
>  messagesubject     | text                        |
>  size               | bigint                      |
>  logstr             | character varying(1024)     |
>  destinationaddress | character varying(255)      |
>  quarantinepath     | character varying(1024)     | not null default
> ''::character varying
>  backuppath         | character varying(1024)     | not null default
> ''::character varying
>  quarantineflag     | smallint                    | not null default
> (0)::smallint
>  backupflag         | smallint                    | not null default
> (0)::smallint
>  deletedflag        | smallint                    | not null default 0
>  profileid          | integer                     | not null default 0
> Indexes:
>     "maillogs_pkey" PRIMARY KEY, btree (id) CLUSTER
>     "idx_maillogs_backupflag" btree (backupflag)
>     "idx_maillogs_deletedflag" btree (deletedflag)
>     "idx_maillogs_enveloperecipients" btree (enveloperecipients)
>     "idx_maillogs_envelopesender" btree (envelopesender)
>     "idx_maillogs_messagesubject" btree (messagesubject)
>     "idx_maillogs_quarantineflag" btree (quarantineflag)
>     "idx_maillogs_recvtime" btree (recvtime)
>     "idx_maillogs_remoteip" btree (remoteip)
>     "idx_maillogs_revtal" btree (retval)
> Foreign-key constraints:
>     "maillogs_profileid_fkey" FOREIGN KEY (profileid) REFERENCES
> profiles(profileid)
> Triggers:
>     maillogs_insert AFTER INSERT ON maillogs FOR EACH ROW EXECUTE
> PROCEDURE maillogs_insert()
>
> mail=#
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: cluster index on a table

From
Scott Marlowe
Date:
As another poster pointed out, you cluster on ONE index and one index
only.  However, you can cluster on a multi-column index.

Re: cluster index on a table

From
Ibrahim Harrani
Date:
Hi,

thanks for your suggestion.
Is there any benefit of setting fillfactor to 70 or 80 on this table?



On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> As another poster pointed out, you cluster on ONE index and one index
> only.  However, you can cluster on a multi-column index.
>

Re: cluster index on a table

From
Scott Carey
Date:
If you have a lot of insert/update/delete activity on a table fillfactor can help.

I don’t believe that postgres will try and maintain the table in the cluster order however.


On 7/15/09 8:04 AM, "Ibrahim Harrani" <ibrahim.harrani@gmail.com> wrote:

Hi,

thanks for your suggestion.
Is there any benefit of setting fillfactor to 70 or 80 on this table?



On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> As another poster pointed out, you cluster on ONE index and one index
> only.  However, you can cluster on a multi-column index.
>

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: cluster index on a table

From
Scott Marlowe
Date:
I'd love to see it.

On Wed, Jul 15, 2009 at 8:17 PM, Justin Pitts<justinpitts@gmail.com> wrote:
> Is there any interest in adding that (continual/automatic cluster
> order maintenance) to a future release?
>
> On Wed, Jul 15, 2009 at 8:33 PM, Scott Carey<scott@richrelevance.com> wrote:
>> If you have a lot of insert/update/delete activity on a table fillfactor can
>> help.
>>
>> I don’t believe that postgres will try and maintain the table in the cluster
>> order however.
>>
>>
>> On 7/15/09 8:04 AM, "Ibrahim Harrani" <ibrahim.harrani@gmail.com> wrote:
>>
>> Hi,
>>
>> thanks for your suggestion.
>> Is there any benefit of setting fillfactor to 70 or 80 on this table?
>>
>>
>>
>> On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com>
>> wrote:
>>> As another poster pointed out, you cluster on ONE index and one index
>>> only.  However, you can cluster on a multi-column index.
>>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>>
>



--
When fascism comes to America, it will be intolerance sold as diversity.

Re: cluster index on a table

From
Scott Mead
Date:

On Wed, Jul 15, 2009 at 10:36 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
I'd love to see it.

  +1 for index organized tables 

--Scott

Re: cluster index on a table

From
Scara Maccai
Date:
>   +1 for index organized tables 

+1

I have a table:

CREATE TABLE mytab
(
  "time" timestamp without time zone NOT NULL,
  ne_id integer NOT NULL,
  values integer,
 CONSTRAINT mytab_pk PRIMARY KEY (ne_id, "time"),
  CONSTRAINT mytab_ne_id_key UNIQUE ("time", ne_id)
}

The table is written every 15 minutes (that is, every 15 minutes all 20000 ne_ids get written), so the table is
"naturally"clustered on ("time", ne_id). 

Since I need it clustered on (ne_id, "time"), I tried to cluster on a day by day basis, since clustering the whole
tablewould take too much time: that is, I'd "reorder" the table every day (say at 1:00 AM). 

I've written a function (below) that re-insterts rows in the table ordered by ne_id,time; but it doesn't work! When I
doa "select * from mytab" rows aren't ordered by (ne_id,time).... 

What am I doing wrong?


CREATE OR REPLACE FUNCTION somefunc()
  RETURNS void AS
$BODY$
DECLARE
    t1 timestamp := '2006-10-01 00:00:00';
    t2 timestamp := '2006-10-01 23:59:00';
BEGIN
lock table stscell60_60_13_2800_512_cell_0610_leo in ACCESS EXCLUSIVE MODE;
WHILE t1 < '2006-11-01 00:00:00' LOOP
    insert into mytab select time,ne_id+100000, values from mytab where time between t1 and t2  order by ne_id,time;
    DELETE from mytab where time between t1 and t2 and ne_id<100000;
    update mytab set ne_id = ne_id - 100000 where time between t1 and t2;
    t1 := t1 + interval '1 days';
    t2 := t2 + interval '1 days';
END LOOP ;
END;
$BODY$
  LANGUAGE 'plpgsql'







Re: cluster index on a table

From
Scott Carey
Date:
Either true Index Organized Tables or Clustered Indexes would be very useful
for a variety of table/query types.  The latter seems more difficult with
Postgres' MVCC model since it requires data to be stored in the index that
is authoritative.

Storing the full tuple in an index and not even having a data only page
would also be an interesting approach to this (and perhaps simpler than a
separate index file and data file if trying to keep the data in the order of
the index).



On 7/15/09 7:17 PM, "Justin Pitts" <justinpitts@gmail.com> wrote:

> Is there any interest in adding that (continual/automatic cluster
> order maintenance) to a future release?
>
> On Wed, Jul 15, 2009 at 8:33 PM, Scott Carey<scott@richrelevance.com> wrote:
>> If you have a lot of insert/update/delete activity on a table fillfactor can
>> help.
>>
>> I don¹t believe that postgres will try and maintain the table in the cluster
>> order however.
>>
>>
>> On 7/15/09 8:04 AM, "Ibrahim Harrani" <ibrahim.harrani@gmail.com> wrote:
>>
>> Hi,
>>
>> thanks for your suggestion.
>> Is there any benefit of setting fillfactor to 70 or 80 on this table?
>>
>>
>>
>> On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com>
>> wrote:
>>> As another poster pointed out, you cluster on ONE index and one index
>>> only.  However, you can cluster on a multi-column index.
>>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>>
>


Re: cluster index on a table

From
Scott Carey
Date:
I could be wrong, but I think MSSQL only keeps the data specified in the
index in the index, and the remaining columns in the data.
That is, if there is a clustered index on a table on three columns out of
five, those three columns in the index are stored in the index, while the
other two are in a data portion.  But it has been several years since I
worked with that DB.

They are certainly storing at least those columns in the index itself.  And
that feature does work very well from a performance perspective.

IOT in Oracle is a huge win in some cases, but a bit more clunky for others
than Clustered Indexes in MSSQL.  Both are highly useful.

On 7/16/09 10:52 AM, "Justin Pitts" <justinpitts@gmail.com> wrote:

> ISTR that is the approach that MSSQL follows.
>
>>
>> Storing the full tuple in an index and not even having a data only
>> page
>> would also be an interesting approach to this (and perhaps simpler
>> than a
>> separate index file and data file if trying to keep the data in the
>> order of
>> the index).
>
>


Re: cluster index on a table

From
"Kevin Grittner"
Date:
Scott Carey <scott@richrelevance.com> wrote:
> I could be wrong, but I think MSSQL only keeps the data specified in
> the index in the index, and the remaining columns in the data.

Unless it has changed recently, an MS SQL Server clustered index is
the same as the Sybase implementation: all data for the tuple is
stored in the leaf page of the clustered index.  There is no separate
heap.  The indid in sysindexes is part of the clue -- a table has
either one 0 entry for the heap (if there is no clustered index) or
one 1 entry for the clustered index.  "Normal" indexes have indid of 2
through 254, and indid 255 is reserved for out-of-line storage of text
and image data.

-Kevin

Re: cluster index on a table

From
Scott Carey
Date:
Yes, it seems as though the whole tuple is entirely in the index if it is
clustered.

From :
http://msdn.microsoft.com/en-us/library/ms177484.aspx

" Each index row in the nonclustered index contains the nonclustered key
value and a row locator. This locator points to the data row in the
clustered index or heap having the key value."

That sort of model should work with MVCC and even HOT with the same
restrictions that HOT has now.

On 7/16/09 11:35 AM, "Justin Pitts" <justinpitts@gmail.com> wrote:

>
>
> According to the books online
> http://msdn.microsoft.com/en-us/library/ms177443.aspx
>   :
>
>         "In a clustered index, the leaf nodes contain the data pages of the
> underlying table."
>
>
> Which agrees with your assertion.
>
>  From a performance perspective, it DOES work very well. Which is why
> I keep hoping for it to show up in PostgreSQL.
>
> On Jul 16, 2009, at 2:21 PM, Scott Carey wrote:
>
>> I could be wrong, but I think MSSQL only keeps the data specified in
>> the
>> index in the index, and the remaining columns in the data.
>> That is, if there is a clustered index on a table on three columns
>> out of
>> five, those three columns in the index are stored in the index,
>> while the
>> other two are in a data portion.  But it has been several years
>> since I
>> worked with that DB.
>>
>> They are certainly storing at least those columns in the index
>> itself.  And
>> that feature does work very well from a performance perspective.
>>
>> IOT in Oracle is a huge win in some cases, but a bit more clunky for
>> others
>> than Clustered Indexes in MSSQL.  Both are highly useful.
>>
>> On 7/16/09 10:52 AM, "Justin Pitts" <justinpitts@gmail.com> wrote:
>>
>>> ISTR that is the approach that MSSQL follows.
>>>
>>>>
>>>> Storing the full tuple in an index and not even having a data only
>>>> page
>>>> would also be an interesting approach to this (and perhaps simpler
>>>> than a
>>>> separate index file and data file if trying to keep the data in the
>>>> order of
>>>> the index).
>>>
>>>
>>
>
>


Re: cluster index on a table

From
Greg Stark
Date:
On Thu, Jul 16, 2009 at 8:18 PM, Scott Carey<scott@richrelevance.com> wrote:
> " Each index row in the nonclustered index contains the nonclustered key
> value and a row locator. This locator points to the data row in the
> clustered index or heap having the key value."
>
> That sort of model should work with MVCC and even HOT with the same
> restrictions that HOT has now.


The problem with this is that btree indexes need to be able to split
pages. In which case your tuple's tid changes and all hell breaks
loose. One of the fundamental design assumptions in our MVCC design is
that you can trust a tuple to stay where it is as long as it's visible
to your transaction.

For example you may want to go back and check the discussion on
getting vacuum to do a sequential scan of indexes. The solution we
found for that only works because only a single vacuum can be scanning
the index at a time.

Another scenario to think about, picture yourself in the middle of a
nested loop processing all the matches for a tuple in the outer
relation. Now someone else comes along and wants to insert a new tuple
on the same page as that outer tuple and has to split the page. How do
you do that without messing up the nested loop which may not come back
to that page for many minutes?


--
greg
http://mit.edu/~gsstark/resume.pdf

Re: cluster index on a table

From
Scott Carey
Date:


On 7/16/09 12:46 PM, "Greg Stark" <gsstark@mit.edu> wrote:

> On Thu, Jul 16, 2009 at 8:18 PM, Scott Carey<scott@richrelevance.com> wrote:
>> " Each index row in the nonclustered index contains the nonclustered key
>> value and a row locator. This locator points to the data row in the
>> clustered index or heap having the key value."
>>
>> That sort of model should work with MVCC and even HOT with the same
>> restrictions that HOT has now.
>
>
> The problem with this is that btree indexes need to be able to split
> pages. In which case your tuple's tid changes and all hell breaks
> loose. One of the fundamental design assumptions in our MVCC design is
> that you can trust a tuple to stay where it is as long as it's visible
> to your transaction.
>
> For example you may want to go back and check the discussion on
> getting vacuum to do a sequential scan of indexes. The solution we
> found for that only works because only a single vacuum can be scanning
> the index at a time.
>
> Another scenario to think about, picture yourself in the middle of a
> nested loop processing all the matches for a tuple in the outer
> relation. Now someone else comes along and wants to insert a new tuple
> on the same page as that outer tuple and has to split the page. How do
> you do that without messing up the nested loop which may not come back
> to that page for many minutes?
>

Keep the old page around or a copy of it that old transactions reference?
Just more Copy on Write.
How is that different from a nested loop on an index scan/seek currently?
Doesn't an old transaction have to reference an old heap page through an
index with the current implementation?  At least, the index references
multiple versions and their visibility must be checked.  Can a similar
solution work here?  Just reference the pre and post split pages and filter
by visibility?

>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>


Re: cluster index on a table

From
"Kevin Grittner"
Date:
Scara Maccai <m_lists@yahoo.it> wrote:

> What am I doing wrong?

> [function which uses INSERT/UPDATE/DELETE statements to try to force
> order of rows in heap]

You seem to be assuming that the rows will be in the table in the
sequence of your inserts.  You might be better off with a CLUSTER on
some index.  (There are a few other options, like TRUNCATE / INSERT or
SELECT INTO / DROP TABLE / ALTER TABLE RENAME -- but CLUSTER is
probably the safest, easiest way to go.)

-Kevin

Re: cluster index on a table

From
Greg Stark
Date:
On Thu, Jul 16, 2009 at 9:06 PM, Scott Carey<scott@richrelevance.com> wrote:
> Keep the old page around or a copy of it that old transactions reference?
> Just more Copy on Write.
> How is that different from a nested loop on an index scan/seek currently?
> Doesn't an old transaction have to reference an old heap page through an
> index with the current implementation?  At least, the index references
> multiple versions and their visibility must be checked.  Can a similar
> solution work here?  Just reference the pre and post split pages and filter
> by visibility?

That would be a completely different architecture than we have now.
We're not Oracle, Postgres does all this at the tuple level, not at
the page level. We have tuple versions, not page versions, and tuple
locks, not page locks.

The old transaction has to reference a heap page through an index with
the current implementation. But it can do so safely precisely because
the tuple will be where the index references it as long as necessary.
As long as that old transaction is live it's guaranteed not to be
removed by vacuum (well... except by VACUUM FULL but that's a whole
nother story).

Actually this is probably the clearest problem with IOT in the
Postgres universe. What do other indexes use to reference these rows
if they can move around?

I wanted to call Heikki's "grouped index item" patch that he worked on
for so long index organized tables. Basically that's what they are
except the leaf tuples are stored in the regular heap like always,
hopefully in index order. And there are no leaf tuples in the index so
the actual index is much much smaller. It doesn't look like a
traditional IOT but it behaves a lot like one in the space savings and
access patterns.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: cluster index on a table

From
Greg Stark
Date:
> Scara Maccai <m_lists@yahoo.it> wrote:
>
>> What am I doing wrong?

I didn't exactly follow the full sequence but it sounded like what's
happening is that Postgres is noticing all these empty pages from
earlier deletes and reusing that space. That's what it's designed to
do. As Kevin said, there's no guarantee that tuples will be read back
in the order you inserted them.

You might want to check your assumptions about the performance. If
you're deleting large batches the new tuples might not be where you
expect them to be in the table but they should still all end up in
chunks mostly in order. They might be located together closely enough
that they might still perform as if they're clustered.

If you're really insistent that they be clustered you would have to
make sure there are no free space map entries for them. This means
never running vacuum on the table. That will cause massive problems
down the line but if you periodically run CLUSTER you might be able to
engineer things to avoid them since cluster effectively does a vacuum
too. Keep in mind this will mean your table is massively bloated which
will make sequential scans much slower.



Also, keep in mind that Postgres is moving in the direction of
maintaining the free space map more automatically. It will get harder
and harder to ensure that space doesn't get reused. I'm not even sure
some features in 8.3 (HOT) and 8.4 (new FSM) don't already make it
nearly impossible. I've certainly never heard of anyone else trying
to.



A better option you might consider is to use a separate table for the
re-ordered tuples. If you never insert into the re-ordered table
except in the final order you want (and in the same connection), and
never update or delete records, then it should work.

You could even do this using partitions, so you have a single table
with the dynamically added records in one partition and then you
re-order the records into a new partition and swap it in to replace
the old partition.


Whatever you do you'll definitely still want an ORDER BY clause on
your queries if you need them in a certain order. Running the queries
you're doing is fine for seeing what order they're in on disk but
there are several reasons why they might still come out out of order
even if you never run vacuum and always insert them in order.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: cluster index on a table

From
Ibrahim Harrani
Date:
Hi Scott,

Which fillfactor is better 70, 80 or another value?

Thanks.

Thanks

On Thu, Jul 16, 2009 at 3:33 AM, Scott Carey<scott@richrelevance.com> wrote:
> If you have a lot of insert/update/delete activity on a table fillfactor can
> help.
>
> I don’t believe that postgres will try and maintain the table in the cluster
> order however.
>
>
> On 7/15/09 8:04 AM, "Ibrahim Harrani" <ibrahim.harrani@gmail.com> wrote:
>
> Hi,
>
> thanks for your suggestion.
> Is there any benefit of setting fillfactor to 70 or 80 on this table?
>
>
>
> On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com>
> wrote:
>> As another poster pointed out, you cluster on ONE index and one index
>> only.  However, you can cluster on a multi-column index.
>>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
>

Re: cluster index on a table

From
Scott Carey
Date:
On 7/16/09 1:49 PM, "Greg Stark" <gsstark@mit.edu> wrote:

> On Thu, Jul 16, 2009 at 9:06 PM, Scott Carey<scott@richrelevance.com> wrote:
>> Keep the old page around or a copy of it that old transactions reference?
>> Just more Copy on Write.
>> How is that different from a nested loop on an index scan/seek currently?
>> Doesn't an old transaction have to reference an old heap page through an
>> index with the current implementation?  At least, the index references
>> multiple versions and their visibility must be checked.  Can a similar
>> solution work here?  Just reference the pre and post split pages and filter
>> by visibility?
>
> That would be a completely different architecture than we have now.
> We're not Oracle, Postgres does all this at the tuple level, not at
> the page level. We have tuple versions, not page versions, and tuple
> locks, not page locks.

Yes, that is a little more tricky, but I still don't see why that is an
issue.  The Copy on Write I was referring to is at the tuple level.

>
> The old transaction has to reference a heap page through an index with
> the current implementation.

And so it could have a reference to an clustered index leaf page with tuples
in it through an index.  Essentially, clustered index leaf pages are a
special type of heap-like page.

> But it can do so safely precisely because
> the tuple will be where the index references it as long as necessary.

Can't the same guarantee be made for clustered index data?  When one splits,
keep the old one around as long as necessary.

> As long as that old transaction is live it's guaranteed not to be
> removed by vacuum (well... except by VACUUM FULL but that's a whole
> nother story).
>
> Actually this is probably the clearest problem with IOT in the
> Postgres universe. What do other indexes use to reference these rows
> if they can move around?


Indexes would point to a heap page for normal tables and clustered index
pages for clustered tables.  When new versions of data come in, it may point
to new clustered index pages, just like they currently get modified to point
to new heap pages with new data.  A split just means more data is modified.
And if multiple versions are still 'live' they point to multiple versions --
just as they now have to with the ordinary heap pages.  Page splitting only
means that there might be some copies of row versions with the same tid due
to a split, but labeling a page with a 'creation' tid to differentiate the
pre and post split pages removes the ambiguity.  I suppose that would
require some page locking.

>
> I wanted to call Heikki's "grouped index item" patch that he worked on
> for so long index organized tables. Basically that's what they are
> except the leaf tuples are stored in the regular heap like always,
> hopefully in index order. And there are no leaf tuples in the index so
> the actual index is much much smaller. It doesn't look like a
> traditional IOT but it behaves a lot like one in the space savings and
> access patterns.

Sounds like its a best-effort to maintain the heap in the index order, using
FILLFACTOR and dead space to hopefully keep it in roughly the right order.
That is an incremental improvement on the current situation.

For clustered indexes, clustered index pages would need to be treated like
the current heap pages and tuples with respect to copy on write.  Yes, that
is more than a trivial modification of the current heap or index page types
and their MVCC semantics.  Its definitely a third type of page.

Am I missing something?

>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>


Re: cluster index on a table

From
Greg Stark
Date:
On Fri, Jul 17, 2009 at 1:02 AM, Scott Carey<scott@richrelevance.com> wrote:
> Indexes would point to a heap page for normal tables and clustered index
> pages for clustered tables.  When new versions of data come in, it may point
> to new clustered index pages, just like they currently get modified to point
> to new heap pages with new data.  A split just means more data is modified.
> And if multiple versions are still 'live' they point to multiple versions --
> just as they now have to with the ordinary heap pages.  Page splitting only
> means that there might be some copies of row versions with the same tid due
> to a split, but labeling a page with a 'creation' tid to differentiate the
> pre and post split pages removes the ambiguity.  I suppose that would
> require some page locking.


None of this makes much sense to me.

If you keep the old tuples around when you split a full page where are
you going to put the new tuples you're inserting?

Also, are you clear what a tid is? It's the block number plus the
index to the tuple entry. If you split the page and move half the
tuples to the new page they'll all have different tids.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: cluster index on a table

From
Scott Carey
Date:
On 7/16/09 5:27 PM, "Greg Stark" <gsstark@mit.edu> wrote:

> On Fri, Jul 17, 2009 at 1:02 AM, Scott Carey<scott@richrelevance.com> wrote:
>> Indexes would point to a heap page for normal tables and clustered index
>> pages for clustered tables.  When new versions of data come in, it may point
>> to new clustered index pages, just like they currently get modified to point
>> to new heap pages with new data.  A split just means more data is modified.
>> And if multiple versions are still 'live' they point to multiple versions --
>> just as they now have to with the ordinary heap pages.  Page splitting only
>> means that there might be some copies of row versions with the same tid due
>> to a split, but labeling a page with a 'creation' tid to differentiate the
>> pre and post split pages removes the ambiguity.  I suppose that would
>> require some page locking.
>
>
> None of this makes much sense to me.
>
> If you keep the old tuples around when you split a full page where are
> you going to put the new tuples you're inserting?

The new page?
Simplest solution - split into two or more new pages and copy.  Now there
are two copies of all the old stuff, and one copy of the new tuples (in the
new pages).
One could optimize it to only create one new page and copy half the tuples,
depending on where the inserted ones would go, but its probably not a big
win and makes it more complicated.

In the current scheme, when an update occurs and the HOT feature is not
applicable, there is a copy-on-write for that tuple, and AFAICS all the
indexes must point to both those tuples as long as the previous one is still
visible -- so there are two or more index references for the same tuple at
different versions.

In the case I am describing, all the tuples in the page that is split would
now have multiple entries in all indexes (which isn't a cheap thing to do).
An old transaction can see all the index references, but either the tuples
or page could be marked to track visibility and remove the ambiguity.  If
the transaction id/number ('spit id'?) that caused the split and created the
new pages is stored in the new pages, all ambiguity can be resolved for a
unique index.  With a non-unique index there would have to be some tuple
level visibility check or store more information in the index (how does it
work now?).

>
> Also, are you clear what a tid is? It's the block number plus the
> index to the tuple entry. If you split the page and move half the
> tuples to the new page they'll all have different tids.
>

I was thinking tid = transaction id based on the context.  A version number
of sorts -- the one that each tuple is marked with for visibility.  But its
essentially the pointer to the tuple.  My mistake.

All tuples in a split page will likely have to be copied, and it will have
to be such that both copies exist until it can be sure the old copy is no
longer visible.  Its essentially the same as the copy on write now but
affects more than the tuple that is updated -- it affects all in the page
when there is a split.
For the actual index with the embedded tuples, that's not so bad since all
the references to these are co-located.  For other indexes, its an expensive
update.

But, that is what FILLFACTOR and the like are for -- Making sure splits
occur less often, and perhaps even reserving some empty pages in the index
at creation time for future splits.  Maintenance and bloat is the tradeoff
with these things, but the performance payoff for many workloads is huge.

Am I missing something else?  Is it not possible to move tuples that aren't
being updated without something like an exclusive table lock?

I'm definitely not saying that the above would be easy, I just don't yet see
how it is incompatible with Postgres' MVCC.  Hard?  Probably.

> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>


Re: cluster index on a table

From
Justin Pitts
Date:
Is there any interest in adding that (continual/automatic cluster
order maintenance) to a future release?

On Wed, Jul 15, 2009 at 8:33 PM, Scott Carey<scott@richrelevance.com> wrote:
> If you have a lot of insert/update/delete activity on a table fillfactor can
> help.
>
> I don’t believe that postgres will try and maintain the table in the cluster
> order however.
>
>
> On 7/15/09 8:04 AM, "Ibrahim Harrani" <ibrahim.harrani@gmail.com> wrote:
>
> Hi,
>
> thanks for your suggestion.
> Is there any benefit of setting fillfactor to 70 or 80 on this table?
>
>
>
> On Wed, Jun 24, 2009 at 8:42 PM, Scott Marlowe<scott.marlowe@gmail.com>
> wrote:
>> As another poster pointed out, you cluster on ONE index and one index
>> only.  However, you can cluster on a multi-column index.
>>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
>

Re: cluster index on a table

From
Justin Pitts
Date:
ISTR that is the approach that MSSQL follows.

>
> Storing the full tuple in an index and not even having a data only
> page
> would also be an interesting approach to this (and perhaps simpler
> than a
> separate index file and data file if trying to keep the data in the
> order of
> the index).


Re: cluster index on a table

From
Justin Pitts
Date:
According to the books online http://msdn.microsoft.com/en-us/library/ms177443.aspx
  :

    "In a clustered index, the leaf nodes contain the data pages of the
underlying table."


Which agrees with your assertion.

 From a performance perspective, it DOES work very well. Which is why
I keep hoping for it to show up in PostgreSQL.

On Jul 16, 2009, at 2:21 PM, Scott Carey wrote:

> I could be wrong, but I think MSSQL only keeps the data specified in
> the
> index in the index, and the remaining columns in the data.
> That is, if there is a clustered index on a table on three columns
> out of
> five, those three columns in the index are stored in the index,
> while the
> other two are in a data portion.  But it has been several years
> since I
> worked with that DB.
>
> They are certainly storing at least those columns in the index
> itself.  And
> that feature does work very well from a performance perspective.
>
> IOT in Oracle is a huge win in some cases, but a bit more clunky for
> others
> than Clustered Indexes in MSSQL.  Both are highly useful.
>
> On 7/16/09 10:52 AM, "Justin Pitts" <justinpitts@gmail.com> wrote:
>
>> ISTR that is the approach that MSSQL follows.
>>
>>>
>>> Storing the full tuple in an index and not even having a data only
>>> page
>>> would also be an interesting approach to this (and perhaps simpler
>>> than a
>>> separate index file and data file if trying to keep the data in the
>>> order of
>>> the index).
>>
>>
>


Re: cluster index on a table

From
Scara Maccai
Date:
> You might be better off
> with a CLUSTER on
> some index. 

I can't: table is too big, can't lock it for minutes; that's why I wanted to cluster it "one day at a time".




Re: cluster index on a table

From
Scara Maccai
Date:
> As Kevin said, there's no guarantee that tuples will be
> read back
> in the order you inserted them.

Ok, didn't know that

> A better option you might consider is to use a separate
> table for the
> re-ordered tuples.
> You could even do this using partitions

Problem is I'm already using partions: I'm partitioning on a monthly basis. I want to avoid partitioning on a daily
basis:I have 200 tables partitioned by month, 2 years of data. Partition them by day would mean 700*200 tables: what
kindof performance impacts would it mean? 


Does this other option make sense:

partition only "last month" by day; older months by month.
Day by day the tables of the current month gets clustered (say at 1.00AM next day).
Then, every 1st of the month, create a new table as

- create table mytable as select * from <parent_table> where time <in last month> (this gets all the data of last month
orderedin the "almost" correct order, because all the single tables were clustered) 
- alter mytable  add constraint "time in last month"
- alter mytable  inherit  <parent_table>

and then drop last month's tables.

Is this even doable? I mean: between

- alter mytable  inherit  <parent_table>
- drop last month's tables.

more than one table with the same constraint would inherit from the same table: that's fine unless someone can see the
"change"before the "drop tables" part, but I guess this shouldn't be a problem if I use the serializable transaction
level.

This way I could cluster the tables (not perfectly, since I would cluster data day by day, but it's enough) and still
havefew tables, say (31 for current month + 23 for the past 23 months) * 200. 












Re: cluster index on a table

From
"phb07@apra.asso.fr"
Date:
Hi all,
 
>On Wed, Jul 15, 2009 at 10:36 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
I'd love to see it.
>
> +1 for index organized tables 
>
>--Scott
 
+1 also for me...
 
I am currently working for a large customer who is migrating his main application towards PostgreSQL, this application currently using DB2 and RFM-II (a RDBMS ued on Bull GCOS 8 mainframes). With both RDBMS, "cluster index" are used and data rows are stored taking into account these indexes. The benefits are :
- a good performance level, especially for batch chains that more or less "scan" a lot of large tables,
- and table reorganisations remain not too frequent (about once a month).
To keep a good performance level with PostgreSQL, I expect that we will need more frequent reorganisation operations, with the drawbacks this generates for the production schedules. This is one of the very few regressions we need to address (or may be the only one).
 
Despite my currently limited knowledge of the postgres internals, I don't see why it should be difficult to simply adapt the logic used to determine the data row location at insert time, using something like :
- read the cluster index to find the tid of the row having the key value just less than the key value of the row to insert,
- if there is place enough in this same page (due to the use of FILLFACTOR or previous row deletion), use it,
- else use the first available place using fsm.
This doesn't change anything on MVCC mechanism, doesn't change index structure and management, and doesn't require data row move.
This doesn't not ensure that all rows are allways in the "right" order but if the FILLFACTOR are correctly set, most rows are well stored, requiring less reorganisation.
But I probably miss something ;-)
 
Regards. Philippe Beaudoin.
 

Re: cluster index on a table

From
"Kevin Grittner"
Date:
Scara Maccai <m_lists@yahoo.it> wrote:

> - create table mytable as select * from <parent_table> where time
> <in last month> (this gets all the data of last month ordered in the
> "almost" correct order, because all the single tables were
> clustered)

Be sure to include an ORDER BY clause.  Without that, there is no
guarantee that even two successive SELECTs from the same table, with
no modifications between, will return rows in the same order.  For
example, if someone else starts a query which the planner determines
is best handled with a table scan, and that is still running when you
issue your INSERT/SELECT, your query will join the current scan at
it's point of progress, and "wrap around" when it hits the end.  Also,
there would be no guarantee of what order the child tables were read.

-Kevin

Re: cluster index on a table

From
Scara Maccai
Date:
> Be sure to include an ORDER BY clause.  For
> example, if someone else starts a query which the planner
> determines
> is best handled with a table scan, and that is still
> running when you
> issue your INSERT/SELECT, your query will join the current
> scan at
> it's point of progress, and "wrap around" when it hits the
> end.  Also,
> there would be no guarantee of what order the child tables
> were read.

Isn't it going to be much slower?
I'm asking because I could get away in my case without the order by, I guess: I'm not trying to create a completely
clusteredtable. The important thing is that most of the records are stored "close" enough one to the other in the right
order.







Re: cluster index on a table

From
"Kevin Grittner"
Date:
Scara Maccai <m_lists@yahoo.it> wrote:

>> Be sure to include an ORDER BY clause.

> Isn't it going to be much slower?

It might be; you'd have to test to know for sure.

> The important thing is that most of the records are stored "close"
> enough one to the other in the right order.

Then, yeah, it might not be worth the cost of sorting.

-Kevin