Thread: More tablescanning fun

More tablescanning fun

From: "Jim C. Nasby"
On this table

   Column   |  Type   | Modifiers
------------+---------+-----------
 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer | not null
 work_units | bigint  | not null
Indexes: email_contrib_pkey primary key btree (project_id, id, date)

with this breakdown of data

 project_id |  count
------------+----------
          5 | 56427141
          8 |  1058843
         24 |   361595
         25 |  4092575
        205 | 58512516

Any kind of operation on an entire project wants to tablescan, even
though it's going to take way longer.

explain analyze select sum(work_units) from email_contrib where
project_id=8;

Index scan: 126, 56, 55 seconds
Seq. scan:  1517, 850, 897 seconds
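
The two plans can be forced and timed against each other like this (a
sketch; enable_seqscan is the standard planner switch, and the timings
above obviously depend on hardware and cache state):

  -- force the index scan for comparison
  SET enable_seqscan = off;
  EXPLAIN ANALYZE SELECT sum(work_units) FROM email_contrib WHERE project_id = 8;
  -- back to the planner's default choice (the seq. scan in this case)
  SET enable_seqscan = on;
  EXPLAIN ANALYZE SELECT sum(work_units) FROM email_contrib WHERE project_id = 8;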

It seems like the metrics used for the cost of index scanning v. table
scanning on large tables need to be revisited. It might be such a huge
difference in this case because the table is essentially clustered on
the primary key. I can test this by doing an aggregate for, say, a
specific team_id, which would be pretty well spread across the entire
table, but that'll have to wait a bit.
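
Roughly what I have in mind (a hypothetical sketch -- the team_id value
and the index name are made up, and an index on team_id would have to
exist before an index scan is even an option):

  CREATE INDEX email_contrib_team_id ON email_contrib (team_id);
  EXPLAIN ANALYZE SELECT sum(work_units) FROM email_contrib WHERE team_id = 1828;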

Anyone have any thoughts on this? Also, is there a TODO to implement
real clustered indexes? Doing stuff by project_id on this table in
Sybase was very efficient, because there was a real clustered index on
the PK. By clustered index, I mean an index where the leaf nodes of the
B-tree are the actual table rows. This means the only overhead in going
through the index is scanning the branches, which in this case would be
pretty lightweight.

Is this something that I should be using some PGSQL-specific feature
for, like inheritance?
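
(What I have in mind is something along these lines -- a purely
hypothetical sketch, with one child table per project that gets queried
directly when only one project is of interest:

  CREATE TABLE email_contrib_p8 () INHERITS (email_contrib);
  -- after moving project 8's rows into the child:
  SELECT sum(work_units) FROM email_contrib_p8;

Selecting from the parent still sees the children's rows via inheritance.)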

I've been really happy so far with PGSQL (coming from Sybase and DB2),
but it seems there are still some pretty big performance issues that want
to be addressed (or I should say, performance issues that hurt really
badly when you hit them :) ).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: More tablescanning fun

From: Tom Lane
"Jim C. Nasby" <jim@nasby.net> writes:
> It seems like the metrics used for the cost of index scanning v. table
> scanning on large tables need to be revisited. It might be such a huge
> difference in this case because the table is essentially clustered on
> the primary key.

Probably.  What does the correlation figure in pg_stats show as?

There's been some previous debate about the equation used to correct
for correlation, which is certainly bogus (I picked it more or less
out of the air ;-)).  But so far no one has proposed a replacement
equation with any better foundation ... take a look in
src/backend/optimizer/path/costsize.c if you want to get involved.

> Also, is there a TODO to implement
> real clustered indexes?

No.  It's not apparent to me how you could do that without abandoning
MVCC, which we're not likely to do.

            regards, tom lane


Re: More tablescanning fun

From: "Jim C. Nasby"
On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It seems like the metrics used for the cost of index scanning v. table
> > scanning on large tables need to be revisited. It might be such a huge
> > difference in this case because the table is essentially clustered on
> > the primary key.
>
> Probably.  What does the correlation figure in pg_stats show as?

stats=# select attname, correlation from pg_stats where
tablename='email_contrib';
  attname   | correlation
------------+-------------
 project_id |           1
 id         |    0.449204
 date       |    0.271775
 team_id    |    0.165588
 work_units |   0.0697928

> There's been some previous debate about the equation used to correct
> for correlation, which is certainly bogus (I picked it more or less
> out of the air ;-)).  But so far no one has proposed a replacement
> equation with any better foundation ... take a look in
> src/backend/optimizer/path/costsize.c if you want to get involved.

Are you referring to the PF formula?

> > Also, is there a TODO to implement
> > real clustered indexes?
>
> No.  It's not apparent to me how you could do that without abandoning
> MVCC, which we're not likely to do.

Hmm... does MVCC mandate inserts go at the end? My understanding is that
each tuple indicates its insert/last-modified time; if this is the
case, why would a true clustered index break MVCC? I guess an update
that moves the tuple would be tricky, but I'm guessing there's some kind
of magic that could happen there... worst case would be adding an
'expired' timestamp.

On the other hand, it might be possible to get the advantages of a
clustered index without doing a *true* clustered index. The real point
is to be able to use indexes; I've heard things like 'if you need to
access more than 10% of a table then using an index would be
disastrous', and that's not good... that number should really be over
50% for most reasonable ratios of fields indexed to fields in table (of
course field size plays a factor).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: More tablescanning fun

From: Tom Lane
"Jim C. Nasby" <jim@nasby.net> writes:
> On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
>> There's been some previous debate about the equation used to correct
>> for correlation, which is certainly bogus (I picked it more or less
>> out of the air ;-)).  But so far no one has proposed a replacement
>> equation with any better foundation ... take a look in
>> src/backend/optimizer/path/costsize.c if you want to get involved.

> Are you referring to the PF formula?

The PF formula is good as far as I know, but it assumes an uncorrelated
table order.  The debate is how to correct it for nonzero correlation.
Specifically, this bit:

     * When the index ordering is exactly correlated with the table ordering
     * (just after a CLUSTER, for example), the number of pages fetched should
     * be just sT.  What's more, these will be sequential fetches, not the
     * random fetches that occur in the uncorrelated case.  So, depending on
     * the extent of correlation, we should estimate the actual I/O cost
     * somewhere between s * T * 1.0 and PF * random_cost.  We currently
     * interpolate linearly between these two endpoints based on the
     * correlation squared (XXX is that appropriate?).

I believe the endpoints s*T and PF*random_cost are right, but the curve
between them is anyone's guess.  It's also quite possible that the
correlation stat that we currently compute is inadequate to model what's
going on.
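
Spelled out, the interpolation that comment describes amounts to roughly
(with c the correlation, s the selectivity, and T the number of table
pages):

  estimated I/O cost ~ c^2 * (s*T) + (1 - c^2) * (PF * random_cost)

so at c = 1 you get pure sequential-fetch cost, and at c = 0 the fully
random case.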

>> No.  It's not apparent to me how you could do that without abandoning
>> MVCC, which we're not likely to do.

> Hmm... does MVCC mandate inserts go at the end?

Anywhere that there's free space.  The point is that you can't promise
updates will fit on the same page as the original tuple.  So whatever
desirable physical ordering you may have started with will surely
degrade over time.

> On the other hand, it might be possible to get the advantages of a
> clustered index without doing a *true* clustered index. The real point
> is to be able to use indexes; I've heard things like 'if you need to
> access more than 10% of a table then using an index would be
> disastrous', and that's not good... that number should really be over
> 50% for most reasonable ratios of fields indexed to fields in table (of
> course field size plays a factor).

If you have to read 50% of a table, you certainly should be doing a
linear scan.  There will be hardly any pages you can skip (unless the
table is improbably well clustered), and the extra I/O needed to read
the index will buy you nothing.

            regards, tom lane


Re: More tablescanning fun

From: "Jim C. Nasby"
On Fri, Apr 25, 2003 at 01:23:10AM -0400, Tom Lane wrote:
> I believe the endpoints s*T and PF*random_cost are right, but the curve
> between them is anyone's guess.  It's also quite possible that the
> correlation stat that we currently compute is inadequate to model what's
> going on.

In this case, the interpolation can't be at fault, because correlation
is 1 (unless the interpolation is backwards, but that doesn't appear to
be the case).

One possibility is that IndexSelectivity isn't taking
most_common_(vals|freqs) into account.

Looking at this as an ideal case, most (or all) of this query should be
retrievable by simply stepping through the index and the tuples at the
same time. We should end up pulling 0.7% of the index and raw pages
combined. The planner thinks that using the index will be about 50% more
expensive, though (3258557 vs. 2274866).

A thought that comes to mind here is that it would be incredible if
pgsql could take metrics of how long things actually take on a live
system and incorporate them... basically learning as it goes. A first
step in this case would be to keep tabs on how close real page-read
counts come to what the optimizer predicted, and storing that for later
analysis. This would make it easier for you to verify your linear
correlation assumption, for example (it'd also make it easier to
validate the PF formula).

> >> No.  It's not apparent to me how you could do that without abandoning
> >> MVCC, which we're not likely to do.
>
> > Hmm... does MVCC mandate inserts go at the end?
>
> Anywhere that there's free space.  The point is that you can't promise
> updates will fit on the same page as the original tuple.  So whatever
> desirable physical ordering you may have started with will surely
> degrade over time.

Yes, updates are the tricky part to clustered indexes, and MVCC might
make it harder. What Sybase 11 (which only supports page locking) does
is see if the update moves the tuple off its current page. If it
doesn't, it just shuffles the page around as needed and goes on with
business. If it needs to move, it grabs (and locks) the page it needs to
move to, inserts it on that page (possibly incurring a page split), and
deletes it from the old page. My guess is that with MVCC, you can't
simply delete the old tuple... you'd have to leave some kind of 'bread
crumb' behind for older transactions to see (though, I guess this would
already have to be happening somehow).

The payoff in this case makes it well worth doing, though... we end up
with one table (which simplifies the code) that should essentially act as
if it were multiple tables (5 in this case), so performance should still
be very good.

> > On the other hand, it might be possible to get the advantages of a
> > clustered index without doing a *true* clustered index. The real point
> > is to be able to use indexes; I've heard things like 'if you need to
> > access more than 10% of a table then using an index would be
> > disastrous', and that's not good... that number should really be over
> > 50% for most reasonable ratios of fields indexed to fields in table (of
> > course field size plays a factor).
>
> If you have to read 50% of a table, you certainly should be doing a
> linear scan.  There will be hardly any pages you can skip (unless the
> table is improbably well clustered), and the extra I/O needed to read
> the index will buy you nothing.

Yes, and it's that 'improbably well clustered' case that I have here. :)
But even if you're only 25% clustered, I think you'll still see a huge
gain on a very large table, especially if the index tuples are
substantially smaller than the raw tuples (which they normally should
be).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: More tablescanning fun

From: Hannu Krosing
Jim C. Nasby wrote on Fri, 25.04.2003 at 07:59:
> > > Also, is there a TODO to implement
> > > real clustered indexes?
> >
> > No.  It's not apparent to me how you could do that without abandoning
> > MVCC, which we're not likely to do.
>
> Hmm... does MVCC mandate inserts go at the end?

I have been pondering whether keeping pages half-empty (or even 70% empty)
could solve both the clustering problem and long updates over a lot of data.

If we could place the copy in the same page as the original, most big
updates would be possible with one sweep of the disk heads, and clustering
order would also be easier to keep if pages were intentionally kept half
empty.

So "VACUUM FULL 65% EMPTY;" could make sense?


-------------
Hannu


Re: More tablescanning fun

From: "Jim C. Nasby"
On Fri, Apr 25, 2003 at 07:28:06PM +0300, Hannu Krosing wrote:
> I have been pondering whether keeping pages half-empty (or even 70% empty)
> could solve both the clustering problem and long updates over a lot of data.
>
> If we could place the copy in the same page as the original, most big
> updates would be possible with one sweep of the disk heads, and clustering
> order would also be easier to keep if pages were intentionally kept half
> empty.
>
> So "VACUUM FULL 65% EMPTY;" could make sense?

That's actually a recommended practice, at least for Sybase when you're
using a clustered index, depending on what you're using it for. If you
cluster a table in such a way that inserts will happen across the entire
table, you'll actually end up with a fill ratio (amount of data vs. empty
space on a page) of 75% over time, because of page splits. When Sybase
goes to insert, if it can't find room on the page it needs to insert
into (keep in mind this is a clustered table, so a given row *must* go
into a given position), it will split that single page into two pages,
each of which will then have a fill ratio of 50%. Of course they'll
eventually approach 100%, so the average fill ratio across all pages for
the table would be 75%.

I'm not familiar enough with pgsql's guts to know how big an impact
updates across pages have, or if they even happen often at all. If you're
not maintaining a clustered table, couldn't all updates just occur
in-place? Or are you thinking of the case where you have a lot of
variable-length stuff?
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: More tablescanning fun

From: Tom Lane
Hannu Krosing <hannu@tm.ee> writes:
> I have been pondering whether keeping pages half-empty (or even 70% empty)
> could solve both the clustering problem and long updates over a lot of data.

You could achieve that pretty easily if you simply don't ever VACUUM
FULL ;-)

UPDATE has always (AFAIR) attempted to place the new version on the same
page as the old, moving it elsewhere only if it doesn't fit.  So that
part of the logic is already there.

> So "VACUUM FULL 65% EMPTY;" could make sense ?

Not so much that, as a parameter to CLUSTER telling it to fill pages
only x% full.

            regards, tom lane


Re: More tablescanning fun

From: Hannu Krosing
Jim C. Nasby wrote on Fri, 25.04.2003 at 19:56:
> I'm not familiar enough with pgsql's guts to know how big an impact
> updates across pages have, or if they even happen often at all. If you're
> not maintaining a clustered table, couldn't all updates just occur
> in-place?

In Postgres _no_ updates happen in-place. MVCC concurrency works by
always inserting a new tuple on update.

-----------
Hannu


Re: More tablescanning fun

From: Manfred Koizar
On Fri, 25 Apr 2003 09:38:01 -0500, "Jim C. Nasby" <jim@nasby.net>
wrote:
>In this case, the interpolation can't be at fault, because correlation
>is 1 (unless the interpolation is backwards, but that doesn't appear to
>be the case).

But your index has 3 columns which causes the index correlation to be
assumed as 1/3.  So the interpolation uses 1/9 (correlation squared)
and you get a cost estimation that almost equals the upper bound.
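
In numbers, with the interpolation quoted earlier in the thread:

  c = 1/3  =>  c^2 = 1/9 ~ 0.11
  estimate ~ 0.11 * (s*T) + 0.89 * (PF * random_cost)

i.e. almost all the way to the uncorrelated upper bound.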

If you want to play around with other interpolation methods, you might
want to get this patch: http://www.pivot.at/pg/16-correlation-732.diff

A short description of the GUC parameters introduced by this patch can
be found here:
http://archives.postgresql.org/pgsql-performance/2002-11/msg00256.php

As a short-term workaround for an unmodified Postgres installation,
you can create an index on email_contrib(project_id).
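
Something like (the index name is arbitrary):

  CREATE INDEX email_contrib_project_id ON email_contrib (project_id);

With a single-column index, the correlation the planner uses is just
project_id's own correlation (1 here), so the index scan gets costed the
way you would expect.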

Servus
 Manfred


Re: More tablescanning fun

From: "Jim C. Nasby"
On Wed, Apr 30, 2003 at 04:14:46PM +0200, Manfred Koizar wrote:
> On Fri, 25 Apr 2003 09:38:01 -0500, "Jim C. Nasby" <jim@nasby.net>
> wrote:
> >In this case, the interpolation can't be at fault, because correlation
> >is 1 (unless the interpolation is backwards, but that doesn't appear to
> >be the case).
>
> But your index has 3 columns which causes the index correlation to be
> assumed as 1/3.  So the interpolation uses 1/9 (correlation squared)
> and you get a cost estimation that almost equals the upper bound.

Hmm... interesting... maybe it would also be a good idea to expand
ANALYZE so that it measures actual index correlation? I.e., in this
case, it would notice that the index on (project_id, id, date) is highly
correlated across all 3 columns.

Supporting something close to a real clustered index would also work,
since the optimizer could treat that case differently (essentially as an
index scan that does a seq. scan within each page).
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"