Thread: Clustered indexes - When to use them?

Clustered indexes - When to use them?

From
MaXX
Date:
Hi,
Is there any "rule of thumb" on when to (not) use clustered indexes?
What appen to the table/index? (any change on the physical organisation?)
I've seen speed improvement on some queries but I'm not sure if I must use
them or not...

My rows are imported in batch of 100 (once the main script has collected
them, this takes between 1 and 30min), then another script vacuums the
table and aggregate the last imported rows,  if I add a column with the
commit timestamp and cluster on it, will I gain some perfs or not?

Thanks,
--
MaXX


Re: Clustered indexes - When to use them?

From
"Jim C. Nasby"
Date:
The key expense in doing an index scan is the amount of randomness
involved in reading the base table. If a table is in the same order as
the index then reading the base table will be very fast. If the table is
in a completely random order compared to an index (it's correlation is
low), then an index scan becomes very expensive because every row you
read out of the index means seeking to a random page in the table.

So, if you do a lot of querying on the table that would work best with
an index scan, it's probably worth it to cluster on that index.

Note that I'm talking about index *scans* here, where you're pulling a
decent number of rows.

There's some other considerations as well, but this is probably the
biggest one.

On Wed, Nov 02, 2005 at 02:04:31PM +0100, MaXX wrote:
> Hi,
> Is there any "rule of thumb" on when to (not) use clustered indexes?
> What appen to the table/index? (any change on the physical organisation?)
> I've seen speed improvement on some queries but I'm not sure if I must use
> them or not...
>
> My rows are imported in batch of 100 (once the main script has collected
> them, this takes between 1 and 30min), then another script vacuums the
> table and aggregate the last imported rows,  if I add a column with the
> commit timestamp and cluster on it, will I gain some perfs or not?
>
> Thanks,
> --
> MaXX
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Clustered indexes - When to use them?

From
MaXX
Date:
Ok thank you,
so I can consider using clustered indexes when I need to 'reorder' random
data to improve the speed of a particular query...

In simple words:
Clustered indexes are like the alphabetical index in a book, where term are
randomly distibuted in the book and regular indexes are more like the table
of content...
Right?

Thanks again,
MaXX

Jim C. Nasby wrote:
> The key expense in doing an index scan is the amount of randomness
> involved in reading the base table. If a table is in the same order as
> the index then reading the base table will be very fast. If the table is
> in a completely random order compared to an index (it's correlation is
> low), then an index scan becomes very expensive because every row you
> read out of the index means seeking to a random page in the table.
>
> So, if you do a lot of querying on the table that would work best with
> an index scan, it's probably worth it to cluster on that index.
>
> Note that I'm talking about index *scans* here, where you're pulling a
> decent number of rows.
>
> There's some other considerations as well, but this is probably the
> biggest one.
>

--
MaXX


Re: Clustered indexes - When to use them?

From
Scott Marlowe
Date:
On Wed, 2005-11-02 at 13:50, MaXX wrote:
> Ok thank you,
> so I can consider using clustered indexes when I need to 'reorder' random
> data to improve the speed of a particular query...
>
> In simple words:
> Clustered indexes are like the alphabetical index in a book, where term are
> randomly distibuted in the book and regular indexes are more like the table
> of content...
> Right?

Not really.  It's more like if someone reordered the book so that it was
in the same order as whatever index you have in it.  So, if the index
was on the length of the words, the words in the book would be reordered
to be smallest to largest (or reverse that).  If the index were on just
the words themselves, then the book would have the words reordered from
A to Z etc...

Note that clustered indexes do not dynamically update the table order in
postgresql.  If you're gonna insert to them, you need to recluster them
every so often.

Re: Clustered indexes - When to use them?

From
"Jim C. Nasby"
Date:
On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote:
> Ok thank you,
> so I can consider using clustered indexes when I need to 'reorder' random
> data to improve the speed of a particular query...
>
> In simple words:
> Clustered indexes are like the alphabetical index in a book, where term are
> randomly distibuted in the book and regular indexes are more like the table
> of content...
> Right?

You have that backwards. The TOC matches the ordering of the book
(table). Think of it as the book is clustered on the TOC. Stuff from the
index appears all over; it's not clustered.

Keep in mind that for PostgreSQL it's simply a matter of correlation.
You can actually see correlation in one of system views. The higher the
correlation between an index and the table, the more efficient index
scans will be.

For some other databases, when you cluster on an index the table
actually *becomes an index*. This means that doing an index scan is
actually the same as a table/sequential scan, except that you can easily
find an exact place to start. Because of this, a 'clustered table' (or
an Index Organized Table as Oracle calls it) can be extremely fast for
certain operations.

In any case, remember the first rule of all performance tuning: don't.
And the second rule: if you're going to, you better have metrics to
measure your tuning with to make sure it's worth it.

Feel free to call me at work if you still have questions.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Clustered indexes - When to use them?

From
MaXX
Date:
Jim C. Nasby wrote:
> On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote:
[...]
>> In simple words:
>> Clustered indexes are like the alphabetical index in a book, where term
>> are randomly distibuted in the book and regular indexes are more like the
>> table of content...
>> Right?
> You have that backwards. The TOC matches the ordering of the book
> (table). Think of it as the book is clustered on the TOC. Stuff from the
> index appears all over; it's not clustered.
> Keep in mind that for PostgreSQL it's simply a matter of correlation.
> You can actually see correlation in one of system views. The higher the
> correlation between an index and the table, the more efficient index
> scans will be.
pgAdmin shows a correlation value in the statistics panel when I click on a
column... Not sure if it is the right one as all column have it, look more
correlation between values in the column...
Still have a *lot* of things to learn...

> For some other databases, when you cluster on an index the table
> actually *becomes an index*. This means that doing an index scan is
> actually the same as a table/sequential scan, except that you can easily
> find an exact place to start. Because of this, a 'clustered table' (or
> an Index Organized Table as Oracle calls it) can be extremely fast for
> certain operations.
Ok now I know why I can only have a single clustered index per table...

> In any case, remember the first rule of all performance tuning: don't.
> And the second rule: if you're going to, you better have metrics to
> measure your tuning with to make sure it's worth it.
Yeah! EXPLAIN ANALYSE, time, systat and friends...

Thanks again,
--
MaXX

Re: Clustered indexes - When to use them?

From
MaXX
Date:
Scott Marlowe wrote:
> On Wed, 2005-11-02 at 13:50, MaXX wrote:
[...]
>> In simple words:
>> Clustered indexes are like the alphabetical index in a book, where term
>> are randomly distibuted in the book and regular indexes are more like the
>> table of content...
>> Right?
> Not really.  It's more like if someone reordered the book so that it was
> in the same order as whatever index you have in it.  So, if the index
> was on the length of the words, the words in the book would be reordered
> to be smallest to largest (or reverse that).  If the index were on just
> the words themselves, then the book would have the words reordered from
> A to Z etc...
Things become more clear now...

> Note that clustered indexes do not dynamically update the table order in
> postgresql.  If you're gonna insert to them, you need to recluster them
> every so often.
Good to know it...

Thanks,
--
MaXX


Re: Clustered indexes - When to use them?

From
"Jim C. Nasby"
Date:
On Wed, Nov 02, 2005 at 10:55:36PM +0100, MaXX wrote:
> Jim C. Nasby wrote:
> > On Wed, Nov 02, 2005 at 08:50:45PM +0100, MaXX wrote:
> [...]
> >> In simple words:
> >> Clustered indexes are like the alphabetical index in a book, where term
> >> are randomly distibuted in the book and regular indexes are more like the
> >> table of content...
> >> Right?
> > You have that backwards. The TOC matches the ordering of the book
> > (table). Think of it as the book is clustered on the TOC. Stuff from the
> > index appears all over; it's not clustered.
> > Keep in mind that for PostgreSQL it's simply a matter of correlation.
> > You can actually see correlation in one of system views. The higher the
> > correlation between an index and the table, the more efficient index
> > scans will be.
> pgAdmin shows a correlation value in the statistics panel when I click on a
> column... Not sure if it is the right one as all column have it, look more
> correlation between values in the column...
> Still have a *lot* of things to learn...

That's because unfortunately PostgreSQL only keeps statistics on
individual columns. There's no stats kept on multi-column indexes; the
best the planner can do is use the stats for the first column.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Clustered indexes - When to use them?

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> That's because unfortunately PostgreSQL only keeps statistics on
> individual columns. There's no stats kept on multi-column indexes; the
> best the planner can do is use the stats for the first column.

That's not what we do at all: we do look at the stats for each column
and combine them.

The deficiency is that we don't have any stats about cross-column
correlations, and therefore must assume that the columns are independent
(in the statistical sense).  This is a poor assumption in the real world.
But it's a lot subtler than "we only consider the first column".

            regards, tom lane

Problem with array in plpgsql function .. please help :-)

From
David Gagnon
Date:
Hi all,

  I cannot find what is the problem with my function below.  The
following line in the function  :    AND PD.PDPONUM = ANY (receivingIds)

don't work.  If I change this line by    AND PD.PDPONUM = 1734 (Hardcode
a given value) I get a result row.

When I call the same function

 select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M',
'2005-02-02', '2005-11-02', 'EN' );

with the real line I get no result?

Any idea? What is the difference between AND PD.PDPONUM = 1734      and
AND PD.PDPONUM = ANY (receivingIds)



Thanks for your help .. I'm messing around this problem for several
hours now .. and haven't found the answer yet.

Best Regard
/David



CREATE OR REPLACE FUNCTION usp_Commande_Dues_Retourner(VARCHAR[],
VARCHAR, DATE, DATE, VARCHAR) RETURNS refcursor  AS $$
DECLARE

     receivingIds ALIAS FOR $1;
     companyId ALIAS FOR $2;
     fromReceptionDate ALIAS FOR $3;
     toReceptionDate ALIAS FOR $4;
     warehouseId ALIAS FOR $5;
BEGIN

    OPEN ref FOR SELECT BD.BDNUM, BDYPNUM, BORRNUMC, RRDESC, BONUM,
BD.BDDTDUEA, BD.BDICNUM, (BD.BDPRIXNET * BD.BDQAEXPV) AS Total,
BD.BDQAEXPV, IQQSTOCK - IQQRESV AS IQQSTOCK, BD.BDDTDUEV, T_IC2.ICQTE
                    FROM BD INNER JOIN (

                  SELECT BDICNUM, SUM(BDQAEXPV) AS ICQTE
                  FROM BD
                  INNER JOIN BO ON BD.BDBONUM = BO.BONUM AND BD.BDYPNUM
= BO.BOYPNUM
                  INNER JOIN PD ON BD.BDNUM = PD.PDBDNUM AND BD.BDYPNUM
= PD.PDYPNUM
                  WHERE BDSTATV = 3
                        AND BDAENUM = warehouseId
                        AND BOTYPE = 0
                        AND BOSTATUT IN (0, 1)
                        AND fromReceptionDate::DATE  <=  BODTCOM::DATE
                        AND toReceptionDate::DATE  >=  BODTCOM::DATE
                           AND PD.PDPONUM = ANY (receivingIds)
                        AND BD.BDYPNUM = companyId
                  GROUP BY BDICNUM

                    ) AS T_IC2 ON BD.BDICNUM = T_IC2.BDICNUM
                    INNER JOIN BO ON BD.BDBONUM = BO.BONUM  AND
BD.BDYPNUM = BO.BOYPNUM
                    INNER JOIN RR ON BO.BORRNUMC = RR.RRNUM
                    LEFT OUTER JOIN IQ ON BD.BDICNUM = IQ.IQICNUM AND
BD.BDAENUM = IQ.IQAENUM
                WHERE BD.BDSTATV = 3
                AND BDAENUM = warehouseId
                AND BOTYPE = 0
                AND BOSTATUT IN (0, 1)
                AND fromReceptionDate::DATE <=  BODTCOM::DATE
                AND toReceptionDate::DATE >=  BODTCOM::DATE
                AND BD.BDYPNUM = companyId
                ORDER BY BONUM, BDICNUM, BDQAEXPV ;
    RETURN ref;

END;
$$ LANGUAGE 'plpgsql';

Re: Problem with array in plpgsql function .. please help :-)

From
Michael Fuhr
Date:
On Wed, Nov 02, 2005 at 10:19:53PM -0500, David Gagnon wrote:
>  I cannot find what is the problem with my function below.  The
> following line in the function  :    AND PD.PDPONUM = ANY (receivingIds)
> don't work.  If I change this line by    AND PD.PDPONUM = 1734 (Hardcode
> a given value) I get a result row.
>
> When I call the same function
>
> select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M',
> '2005-02-02', '2005-11-02', 'EN' );
>
> with the real line I get no result?

Why are you quoting the value inside the quotes?  That is, is there
a reason you're using '{\'1734\'}' instead of '{1734}'?  Does using
the latter work?  It does for me in simple tests.

--
Michael Fuhr

Re: Problem with array in plpgsql function .. please help

From
David Gagnon
Date:
Hi

>>When I call the same function
>>
>>select * from usp_Commande_Dues_Retourner('{\'1734\'}', 'M',
>>'2005-02-02', '2005-11-02', 'EN' );
>>
>>with the real line I get no result?
>>
>>
>
>Why are you quoting the value inside the quotes?
>
Because I had a bug in my java application.  I modify my code to get

select * from usp_Commande_Dues_Retourner('{"1734"}', 'M',
'2005-02-02', '2005-11-02', 'EN' );

and it works ...

Thanks !
/David



>That is, is there
>a reason you're using '{\'1734\'}' instead of '{1734}'?  Does using
>the latter work?  It does for me in simple tests.
>
>

Thanks for your answer.  That was my error.