Thread: Grouped Index Tuples / Clustered Indexes

Grouped Index Tuples / Clustered Indexes

From
Heikki Linnakangas
Date:
I've updated the GIT patch at http://community.enterprisedb.com/git/. 
Bitrot caused by the findinsertloc-patch has been fixed, making that 
part of the GIT patch a little bit smaller and cleaner. I also did some 
refactoring, and minor cleanup and commenting.

Any comments on the design or patch? For your convenience, I copied the 
same text I added to access/nbtree/README to 
http://community.enterprisedb.com/git/git-readme.txt

Should we start playing the name game at this point? I've been thinking 
we should call this feature just Clustered Indexes, even though it's not 
exactly the same thing as clustered indexes in other DBMSs. From user 
point of view, they behave similarly enough that it may be best to use 
the existing term.

As a next step, I'm hoping to get the indexam API changes from the 
bitmap index patch committed soon, and in a way that supports GIT as well.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Grouped Index Tuples / Clustered Indexes

From
Grzegorz Jaskiewicz
Date:
my only question would be. 
Why isn't that in core already ?


Re: Grouped Index Tuples / Clustered Indexes

From
"Luke Lonergan"
Date:
+1


On 3/7/07 6:53 AM, "Grzegorz Jaskiewicz" <gj@pointblue.com.pl> wrote:

> my only question would be.
> Why isn't that in core already ?
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
> 




Re: Grouped Index Tuples / Clustered Indexes

From
"Simon Riggs"
Date:
On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> I've been thinking 
> we should call this feature just Clustered Indexes 

Works for me.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Grouped Index Tuples / Clustered Indexes

From
Gregory Stark
Date:
> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
>> I've been thinking 
>> we should call this feature just Clustered Indexes 

So we would have "clustered tables" which are tables whose heap is ordered
according to an index and separately "clustered indexes" which are indexes
optimized for such tables?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Grouped Index Tuples / Clustered Indexes

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
>> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
>>> I've been thinking 
>>> we should call this feature just Clustered Indexes 
> 
> So we would have "clustered tables" which are tables whose heap is ordered
> according to an index and separately "clustered indexes" which are indexes
> optimized for such tables?

Yes, that's what I was thinking.

There's a third related term in use as well. When you issue CLUSTER, the 
table will be clustered on an index. And that index is then the "index 
the table is clustered on". That's a bit cumbersome but that's the 
terminology we're using at the moment. Maybe we should to come up with a 
new term for that to avoid confusion..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Grouped Index Tuples / Clustered Indexes

From
"Florian G. Pflug"
Date:
Heikki Linnakangas wrote:
> There's a third related term in use as well. When you issue CLUSTER, the 
> table will be clustered on an index. And that index is then the "index 
> the table is clustered on". That's a bit cumbersome but that's the 
> terminology we're using at the moment. Maybe we should to come up with a 
> new term for that to avoid confusion..

This reminds me of something i've been wondering about for quite some
time. Why is it that one has to write "cluster <index> on <table>",
and not "cluster <table> on <index>"?

To me, the second variant would seem more logical, but then I'm
not a native english speaker...

I'm not suggesting that this should be changed, I'm just wondering
why it is the way it is.

greetings, Florian Pflug




Re: Grouped Index Tuples / Clustered Indexes

From
"Simon Riggs"
Date:
On Sun, 2007-03-11 at 11:22 +0000, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> >> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> >>> I've been thinking 
> >>> we should call this feature just Clustered Indexes 
> > 
> > So we would have "clustered tables" which are tables whose heap is ordered
> > according to an index and separately "clustered indexes" which are indexes
> > optimized for such tables?
> 
> Yes, that's what I was thinking.
> 
> There's a third related term in use as well. When you issue CLUSTER, the 
> table will be clustered on an index. And that index is then the "index 
> the table is clustered on". That's a bit cumbersome but that's the 
> terminology we're using at the moment. Maybe we should to come up with a 
> new term for that to avoid confusion..

First thought: we can use the term "cluster*ing* index" for CLUSTER and
use the term "clustered" to refer to what has happened to the table and
the index. That will probably be confused with high availability
clustering, so perhaps not. 

Better thought: say that CLUSTER requires an "order-defining index".
That better explains the point that it is the table being clustered,
using the index to define the physical order of the rows in the heap. We
then use the word "clustered" to refer to what has happened to the
table, and with this patch, for the index also.

That way we can have new syntax for CLUSTER
CLUSTER table ORDER BY indexname

which is then the preferred syntax, rather than the perverse
CLUSTER index ON table

which gives the wrong impression about what is happening, since it is
the table that is changed, not the index.

- - -

- Are you suggesting that we have an explicit new syntax

CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...

or just that we refer to this feature as Clustered Indexes?

- Do we still need the index WITH option, in either case?

- Do you think that all Primary Keys should be clustered?

- Are you thinking to rename docs, catalog etc to reflect the new
naming/meaning?

My thinking would be: CLUSTERED, no, yes, yes
but I'd like to know what you think?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Grouped Index Tuples / Clustered Indexes

From
"Simon Riggs"
Date:
On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
> > There's a third related term in use as well. When you issue CLUSTER, the 
> > table will be clustered on an index. And that index is then the "index 
> > the table is clustered on". That's a bit cumbersome but that's the 
> > terminology we're using at the moment. Maybe we should to come up with a 
> > new term for that to avoid confusion..
> 
> This reminds me of something i've been wondering about for quite some
> time. Why is it that one has to write "cluster <index> on <table>",
> and not "cluster <table> on <index>"?
> 
> To me, the second variant would seem more logical, but then I'm
> not a native english speaker...
> 
> I'm not suggesting that this should be changed, I'm just wondering
> why it is the way it is.

No idea, but I agree it conveys exactly the opposite view of what
happens when the command is issued.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Grouped Index Tuples / Clustered Indexes

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Better thought: say that CLUSTER requires an "order-defining index".
> That better explains the point that it is the table being clustered,
> using the index to define the physical order of the rows in the heap. We
> then use the word "clustered" to refer to what has happened to the
> table, and with this patch, for the index also.
> 
> That way we can have new syntax for CLUSTER
> 
>     CLUSTER table ORDER BY indexname
> 
> which is then the preferred syntax, rather than the perverse
> 
>     CLUSTER index ON table
> 
> which gives the wrong impression about what is happening, since it is
> the table that is changed, not the index.

I like that, "order-defining index" conveys the point pretty well.

> - Are you suggesting that we have an explicit new syntax
> 
> CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...
> 
> or just that we refer to this feature as Clustered Indexes?

I'm not proposing new syntax, just a WITH-parameter. Makes more sense to 
me that way, the clusteredness has no user-visible effects except 
performance, and it's b-tree specific (though I guess you could apply 
the same concept to other indexams as well).

> - Do you think that all Primary Keys should be clustered?

No. There's a significant CPU overhead when the index and table are in 
memory and you're doing simple one-row lookups. And there's no promise 
that a table is physically in primary key order anyway.

There might be some interesting cases where we could enable it 
automatically. I've been thinking that if you explicitly CLUSTER a 
table, the order-defining index would definitely benefit from being a 
clustered index. If it's small enough that it fits in memory, there's no 
point in running CLUSTER in the first place. And if you run CLUSTER, we 
know it's in order. That seems like a pretty safe bet.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Grouped Index Tuples / Clustered Indexes

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:
> > Heikki Linnakangas wrote:
> > > There's a third related term in use as well. When you issue CLUSTER, the 
> > > table will be clustered on an index. And that index is then the "index 
> > > the table is clustered on". That's a bit cumbersome but that's the 
> > > terminology we're using at the moment. Maybe we should to come up with a 
> > > new term for that to avoid confusion..
> > 
> > This reminds me of something i've been wondering about for quite some
> > time. Why is it that one has to write "cluster <index> on <table>",
> > and not "cluster <table> on <index>"?
> > 
> > To me, the second variant would seem more logical, but then I'm
> > not a native english speaker...
> > 
> > I'm not suggesting that this should be changed, I'm just wondering
> > why it is the way it is.
> 
> No idea, but I agree it conveys exactly the opposite view of what
> happens when the command is issued.

We got the syntax from Berkely, and it has always seemed backwards to me
too.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Grouped Index Tuples / Clustered Indexes

From
Bruce Momjian
Date:
Added to TODO:
       o Add more logical syntax CLUSTER table ORDER BY index;         support current syntax for backward
compatibility

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sun, 2007-03-11 at 11:22 +0000, Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> > >> On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:
> > >>> I've been thinking 
> > >>> we should call this feature just Clustered Indexes 
> > > 
> > > So we would have "clustered tables" which are tables whose heap is ordered
> > > according to an index and separately "clustered indexes" which are indexes
> > > optimized for such tables?
> > 
> > Yes, that's what I was thinking.
> > 
> > There's a third related term in use as well. When you issue CLUSTER, the 
> > table will be clustered on an index. And that index is then the "index 
> > the table is clustered on". That's a bit cumbersome but that's the 
> > terminology we're using at the moment. Maybe we should to come up with a 
> > new term for that to avoid confusion..
> 
> First thought: we can use the term "cluster*ing* index" for CLUSTER and
> use the term "clustered" to refer to what has happened to the table and
> the index. That will probably be confused with high availability
> clustering, so perhaps not. 
> 
> Better thought: say that CLUSTER requires an "order-defining index".
> That better explains the point that it is the table being clustered,
> using the index to define the physical order of the rows in the heap. We
> then use the word "clustered" to refer to what has happened to the
> table, and with this patch, for the index also.
> 
> That way we can have new syntax for CLUSTER
> 
>     CLUSTER table ORDER BY indexname
> 
> which is then the preferred syntax, rather than the perverse
> 
>     CLUSTER index ON table
> 
> which gives the wrong impression about what is happening, since it is
> the table that is changed, not the index.
> 
> - - -
> 
> - Are you suggesting that we have an explicit new syntax
> 
> CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...
> 
> or just that we refer to this feature as Clustered Indexes?
> 
> - Do we still need the index WITH option, in either case?
> 
> - Do you think that all Primary Keys should be clustered?
> 
> - Are you thinking to rename docs, catalog etc to reflect the new
> naming/meaning?
> 
> My thinking would be: CLUSTERED, no, yes, yes
> but I'd like to know what you think?
> 
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Grouped Index Tuples / Clustered Indexes

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> I've updated the GIT patch at http://community.enterprisedb.com/git/. 
> Bitrot caused by the findinsertloc-patch has been fixed, making that 
> part of the GIT patch a little bit smaller and cleaner. I also did some 
> refactoring, and minor cleanup and commenting.
> 
> Any comments on the design or patch? For your convenience, I copied the 
> same text I added to access/nbtree/README to 
> http://community.enterprisedb.com/git/git-readme.txt
> 
> Should we start playing the name game at this point? I've been thinking 
> we should call this feature just Clustered Indexes, even though it's not 
> exactly the same thing as clustered indexes in other DBMSs. From user 
> point of view, they behave similarly enough that it may be best to use 
> the existing term.
> 
> As a next step, I'm hoping to get the indexam API changes from the 
> bitmap index patch committed soon, and in a way that supports GIT as well.
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +