Thread: Clustered indexes - When to use them?
Hi, Is there any "rule of thumb" on when to (not) use clustered indexes? What appen to the table/index? (any change on the physical organisation?) I've seen speed improvement on some queries but I'm not sure if I must use them or not... My rows are imported in batch of 100 (once the main script has collected them, this takes between 1 and 30min), then another script vacuums the table and aggregate the last imported rows, if I add a column with the commit timestamp and cluster on it, will I gain some perfs or not? Thanks, -- MaXX
The key expense in doing an index scan is the amount of randomness involved in reading the base table. If a table is in the same order as the index then reading the base table will be very fast. If the table is in a completely random order compared to an index (it's correlation is low), then an index scan becomes very expensive because every row you read out of the index means seeking to a random page in the table. So, if you do a lot of querying on the table that would work best with an index scan, it's probably worth it to cluster on that index. Note that I'm talking about index *scans* here, where you're pulling a decent number of rows. There's some other considerations as well, but this is probably the biggest one. On Wed, Nov 02, 2005 at 02:04:31PM +0100, MaXX wrote: > Hi, > Is there any "rule of thumb" on when to (not) use clustered indexes? > What appen to the table/index? (any change on the physical organisation?) > I've seen speed improvement on some queries but I'm not sure if I must use > them or not... > > My rows are imported in batch of 100 (once the main script has collected > them, this takes between 1 and 30min), then another script vacuums the > table and aggregate the last imported rows, if I add a column with the > commit timestamp and cluster on it, will I gain some perfs or not? > > Thanks, > -- > MaXX > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Ok thank you, so I can consider using clustered indexes when I need to 'reorder' random data to improve the speed of a particular query... In simple words: Clustered indexes are like the alphabetical index in a book, where term are randomly distibuted in the book and regular indexes are more like the table of content... Right? Thanks again, MaXX Jim C. Nasby wrote: > The key expense in doing an index scan is the amount of randomness > involved in reading the base table. If a table is in the same order as > the index then reading the base table will be very fast. If the table is > in a completely random order compared to an index (it's correlation is > low), then an index scan becomes very expensive because every row you > read out of the index means seeking to a random page in the table. > > So, if you do a lot of querying on the table that would work best with > an index scan, it's probably worth it to cluster on that index. > > Note that I'm talking about index *scans* here, where you're pulling a > decent number of rows. > > There's some other considerations as well, but this is probably the > biggest one. > -- MaXX
On Wed, 2005-11-02 at 13:50, MaXX wrote: > Ok thank you, > so I can consider using clustered indexes when I need to 'reorder' random > data to improve the speed of a particular query... > > In simple words: > Clustered indexes are like the alphabetical index in a book, where term are > randomly distibuted in the book and regular indexes are more like the table > of content... > Right? Not really. It's more like if someone reordered the book so that it was in the same order as whatever index you have in it. So, if the index was on the length of the words, the words in the book would be reordered to be smallest to largest (or reverse that). If the index were on just the words themselves, then the book would have the words reordered from A to Z etc... Note that clustered indexes do not dynamically update the table order in postgresql. If you're gonna insert to them, you need to recluster them every so often.
On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote: > Ok thank you, > so I can consider using clustered indexes when I need to 'reorder' random > data to improve the speed of a particular query... > > In simple words: > Clustered indexes are like the alphabetical index in a book, where term are > randomly distibuted in the book and regular indexes are more like the table > of content... > Right? You have that backwards. The TOC matches the ordering of the book (table). Think of it as the book is clustered on the TOC. Stuff from the index appears all over; it's not clustered. Keep in mind that for PostgreSQL it's simply a matter of correlation. You can actually see correlation in one of system views. The higher the correlation between an index and the table, the more efficient index scans will be. For some other databases, when you cluster on an index the table actually *becomes an index*. This means that doing an index scan is actually the same as a table/sequential scan, except that you can easily find an exact place to start. Because of this, a 'clustered table' (or an Index Organized Table as Oracle calls it) can be extremely fast for certain operations. In any case, remember the first rule of all performance tuning: don't. And the second rule: if you're going to, you better have metrics to measure your tuning with to make sure it's worth it. Feel free to call me at work if you still have questions. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote: [...] >> In simple words: >> Clustered indexes are like the alphabetical index in a book, where term >> are randomly distibuted in the book and regular indexes are more like the >> table of content... >> Right? > You have that backwards. The TOC matches the ordering of the book > (table). Think of it as the book is clustered on the TOC. Stuff from the > index appears all over; it's not clustered. > Keep in mind that for PostgreSQL it's simply a matter of correlation. > You can actually see correlation in one of system views. The higher the > correlation between an index and the table, the more efficient index > scans will be. pgAdmin shows a correlation value in the statistics panel when I click on a column... Not sure if it is the right one as all column have it, look more correlation between values in the column... Still have a *lot* of things to learn... > For some other databases, when you cluster on an index the table > actually *becomes an index*. This means that doing an index scan is > actually the same as a table/sequential scan, except that you can easily > find an exact place to start. Because of this, a 'clustered table' (or > an Index Organized Table as Oracle calls it) can be extremely fast for > certain operations. Ok now I know why I can only have a single clustered index per table... > In any case, remember the first rule of all performance tuning: don't. > And the second rule: if you're going to, you better have metrics to > measure your tuning with to make sure it's worth it. Yeah! EXPLAIN ANALYSE, time, systat and friends... Thanks again, -- MaXX
Scott Marlowe wrote: > On Wed, 2005-11-02 at 13:50, MaXX wrote: [...] >> In simple words: >> Clustered indexes are like the alphabetical index in a book, where term >> are randomly distibuted in the book and regular indexes are more like the >> table of content... >> Right? > Not really. It's more like if someone reordered the book so that it was > in the same order as whatever index you have in it. So, if the index > was on the length of the words, the words in the book would be reordered > to be smallest to largest (or reverse that). If the index were on just > the words themselves, then the book would have the words reordered from > A to Z etc... Things become more clear now... > Note that clustered indexes do not dynamically update the table order in > postgresql. If you're gonna insert to them, you need to recluster them > every so often. Good to know it... Thanks, -- MaXX
On Wed, Nov 02, 2005 at 10:55:36PM +0100, MaXX wrote: > Jim C. Nasby wrote: > > On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote: > [...] > >> In simple words: > >> Clustered indexes are like the alphabetical index in a book, where term > >> are randomly distibuted in the book and regular indexes are more like the > >> table of content... > >> Right? > > You have that backwards. The TOC matches the ordering of the book > > (table). Think of it as the book is clustered on the TOC. Stuff from the > > index appears all over; it's not clustered. > > Keep in mind that for PostgreSQL it's simply a matter of correlation. > > You can actually see correlation in one of system views. The higher the > > correlation between an index and the table, the more efficient index > > scans will be. > pgAdmin shows a correlation value in the statistics panel when I click on a > column... Not sure if it is the right one as all column have it, look more > correlation between values in the column... > Still have a *lot* of things to learn... That's because unfortunately PostgreSQL only keeps statistics on individual columns. There's no stats kept on multi-column indexes; the best the planner can do is use the stats for the first column. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > That's because unfortunately PostgreSQL only keeps statistics on > individual columns. There's no stats kept on multi-column indexes; the > best the planner can do is use the stats for the first column. That's not what we do at all: we do look at the stats for each column and combine them. The deficiency is that we don't have any stats about cross-column correlations, and therefore must assume that the columns are independent (in the statistical sense). This is a poor assumption in the real world. But it's a lot subtler than "we only consider the first column". regards, tom lane
Hi all, I cannot find what is the problem with my function below. The following line in the function : AND PD.PDPONUM = ANY (receivingIds) don't work. If I change this line by AND PD.PDPONUM = 1734 (Hardcode a given value) I get a result row. When I call the same function select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M', '2005-02-02', '2005-11-02', 'EN' ); with the real line I get no result? Any idea? What is the difference between AND PD.PDPONUM = 1734 and AND PD.PDPONUM = ANY (receivingIds) Thanks for your help .. I'm messing around this problem for several hours now .. and haven't found the answer yet. Best Regard /David CREATE OR REPLACE FUNCTION usp_Commande_Dues_Retourner(VARCHAR[], VARCHAR, DATE, DATE, VARCHAR) RETURNS refcursor AS $$ DECLARE receivingIds ALIAS FOR $1; companyId ALIAS FOR $2; fromReceptionDate ALIAS FOR $3; toReceptionDate ALIAS FOR $4; warehouseId ALIAS FOR $5; BEGIN OPEN ref FOR SELECT BD.BDNUM, BDYPNUM, BORRNUMC, RRDESC, BONUM, BD.BDDTDUEA, BD.BDICNUM, (BD.BDPRIXNET * BD.BDQAEXPV) AS Total, BD.BDQAEXPV, IQQSTOCK - IQQRESV AS IQQSTOCK, BD.BDDTDUEV, T_IC2.ICQTE FROM BD INNER JOIN ( SELECT BDICNUM, SUM(BDQAEXPV) AS ICQTE FROM BD INNER JOIN BO ON BD.BDBONUM = BO.BONUM AND BD.BDYPNUM = BO.BOYPNUM INNER JOIN PD ON BD.BDNUM = PD.PDBDNUM AND BD.BDYPNUM = PD.PDYPNUM WHERE BDSTATV = 3 AND BDAENUM = warehouseId AND BOTYPE = 0 AND BOSTATUT IN (0, 1) AND fromReceptionDate::DATE <= BODTCOM::DATE AND toReceptionDate::DATE >= BODTCOM::DATE AND PD.PDPONUM = ANY (receivingIds) AND BD.BDYPNUM = companyId GROUP BY BDICNUM ) AS T_IC2 ON BD.BDICNUM = T_IC2.BDICNUM INNER JOIN BO ON BD.BDBONUM = BO.BONUM AND BD.BDYPNUM = BO.BOYPNUM INNER JOIN RR ON BO.BORRNUMC = RR.RRNUM LEFT OUTER JOIN IQ ON BD.BDICNUM = IQ.IQICNUM AND BD.BDAENUM = IQ.IQAENUM WHERE BD.BDSTATV = 3 AND BDAENUM = warehouseId AND BOTYPE = 0 AND BOSTATUT IN (0, 1) AND fromReceptionDate::DATE <= BODTCOM::DATE AND toReceptionDate::DATE >= BODTCOM::DATE AND BD.BDYPNUM = companyId ORDER BY BONUM, BDICNUM, BDQAEXPV ; RETURN ref; END; $$ LANGUAGE 'plpgsql';
On Wed, Nov 02, 2005 at 10:19:53PM -0500, David Gagnon wrote: > I cannot find what is the problem with my function below. The > following line in the function : AND PD.PDPONUM = ANY (receivingIds) > don't work. If I change this line by AND PD.PDPONUM = 1734 (Hardcode > a given value) I get a result row. > > When I call the same function > > select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M', > '2005-02-02', '2005-11-02', 'EN' ); > > with the real line I get no result? Why are you quoting the value inside the quotes? That is, is there a reason you're using '{\'1734\'}' instead of '{1734}'? Does using the latter work? It does for me in simple tests. -- Michael Fuhr
Hi >>When I call the same function >> >>select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M', >>'2005-02-02', '2005-11-02', 'EN' ); >> >>with the real line I get no result? >> >> > >Why are you quoting the value inside the quotes? > Because I had a bug in my java application. I modify my code to get select * from usp_Commande_Dues_Retourner('{"1734"}', 'M', '2005-02-02', '2005-11-02', 'EN' ); and it works ... Thanks ! /David >That is, is there >a reason you're using '{\'1734\'}' instead of '{1734}'? Does using >the latter work? It does for me in simple tests. > > Thanks for your answer. That was my error.