Thread: What are the benefits of using a clustered index?
Hi guys! This'll hopefully be an easy question for one of the perf guys like Tom.. I'm wondering what the performance advantage is to using a clustered index. Take the following example: I have a table of recipes with a recipeid, and a table of comments that various users might have entered about those recipes ("this tastes like crap!", etc) with a commentid and a FK on recipeid. Naturally I want to be able to say "give me all the comments about recipe x". To me, this seems like the perfect opportunity to use a cluster. Inserts are fairly rare and reads need to be super fast. If all the comments were clustered by recipeid, all the data would be right next to each other on disk. It seems like the query planner could count on this and make a lot of cool optimizations. It could just find the first row with recipe x and traverse the rows until it got to the next recipe. However, the thing that really confuses me is I can insert as many rows as I want, and they're not clustered. In fact, the only time they'll cluster is if I run the "CLUSTER RecipeComments" command. Sure, I could write a trigger to do this on insert, or I could write a script that runs every midnight to do this, or what not. However, if the query planner can't assume the data is in a certain order on disk, what's the point of having this at all? I'm quite sure SQL Server doesn't work this way and I'm not sure about Oracle. Can someone enlighten me on the exact benefit of this? Thanks!! Mike
Mike Christensen wrote: > However, if the query > planner can't assume the data is in a certain order on disk, what's the > point of having this at all? One benefit is that it reduces the number of pages that must be read from disk to retrieve all tuples that have a particular value for the indexed field. Because the tuples tend to be clustered by (in your case) recipeid, you'll normally have quite a few relevant tuples on a page and won't have as much uninteresting and irrelevent data getting read in along with what you actually want. That'll also improve efficiency of memory/cache use. I think it also helps keep the index smaller, since it doesn't need to refer to as many pages for a given value of interest. If you set a non-default FILLFACTOR on the table (and proably index) Pg will leave gaps in the table and I think it will try to insert tuples to maintain clustering order for smaller indexes and faster index scans. The downside is that there are holes in the table that may slow down a sequential scan somewhat. -- Craig Ringer
Mike Christensen wrote: > I'm wondering what the performance advantage is to using a clustered > index. In Postgres you don't cluster indexes. You cluster tables. It only means that the table is written from scratch, following the index order. So what it gives you is locality of access for queries that follow ranges of that index, nothing more. It seems very obvious that in this implementation a new tuple is not going to follow the index order; it'll just be stored wherever there is free space. If you run CLUSTER again, they'll be put in the right place. (There was a patch to enhance this so that a new insertion would instead use space closer to where the tuple would be if it followed the order. But it was only a hint; if there wasn't enough free space in the right spot, it would be stored elsewhere. Still, the patch was not committed.) > I'm quite sure SQL Server doesn't work this way and I'm not sure about > Oracle. Can someone enlighten me on the exact benefit of this? Thanks!! Yeah, they use a completely different definition of "clustered index" from ours. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, 2009-03-17 at 09:16 -0400, Alvaro Herrera wrote: > > I'm quite sure SQL Server doesn't work this way and I'm not sure > about > > Oracle. Can someone enlighten me on the exact benefit of this? > Thanks!! > > Yeah, they use a completely different definition of "clustered index" > from ours. Hopefully we regard it as a missing feature rather than as a separate definition. We could cluster the index, we just don't, yet. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
I would only like this as a feature if the optimizer can really take advantage of this. Clustering on every insert or update just for the fun of it won't really give us anything but more expensive writes.
I kinda figured if SQL Server and Oracle have it, they probably take full advantage of it for reads..
Mike
Simon Riggs wrote:
I kinda figured if SQL Server and Oracle have it, they probably take full advantage of it for reads..
Mike
Simon Riggs wrote:
On Tue, 2009-03-17 at 09:16 -0400, Alvaro Herrera wrote:I'm quite sure SQL Server doesn't work this way and I'm not sureaboutOracle. Can someone enlighten me on the exact benefit of this?Thanks!! Yeah, they use a completely different definition of "clustered index" from ours.Hopefully we regard it as a missing feature rather than as a separate definition. We could cluster the index, we just don't, yet.
On Tue, 2009-03-17 at 15:26 -0700, Mike Christensen wrote: > I would only like this as a feature if the optimizer can really take > advantage of this. Clustering on every insert or update just for the > fun of it won't really give us anything but more expensive writes. > > I kinda figured if SQL Server and Oracle have it, they probably take > full advantage of it for reads.. With SQLServer and Oracle the optimization is implicit in the physical design of the index. The Postgres equivalent would be a grouped index, not necessarily clustered in the same way. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
> Hopefully we regard it as a missing feature rather than as a separate > definition. We could cluster the index, we just don't, yet. Wouldn't this require keeping around multiple versions of index pages for MVCC? Which would create performance degradations elsewhere? -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice
On Fri, 2009-03-20 at 09:33 -0600, Scott Ribe wrote: > > Hopefully we regard it as a missing feature rather than as a separate > > definition. We could cluster the index, we just don't, yet. > > Wouldn't this require keeping around multiple versions of index pages for > MVCC? No, it wouldn't require that. Just think one index tuple points to more than one heap row. We would still need to check visibility on the rows returned to ensure MVCC. Less index pointers, smaller index. The trick is: How? But that's a secondary issue to getting it on the TODO list, which is all I'm suggesting at present. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
> Just think one index tuple points to more than one heap row. We would > still need to check visibility on the rows returned to ensure MVCC. So you wind up with the heap rows stored in their own tree-like structure outside the index? OK. -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice
Simon Riggs <simon@2ndQuadrant.com> writes: > Just think one index tuple points to more than one heap row. Could you expand on that? Like, uh, I have no idea what you're saying. > Less index pointers, smaller index. Are you talking about Heikkie's grouped-items-index? > The trick is: How? But that's a secondary issue to getting it on the > TODO list, which is all I'm suggesting at present. Well I think we need to be clear enough at least on the "what" if not the "how". But there's a bit a of a fuzzy line between them I admit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!