Thread: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
felix@crowfix.com
Date:
Having been surprised a few times myself by EXPLAIN showing a
sequential scan instead of using an index, and having seen so many
others surprised by it, I hope I am not asking a similar question.

We recently upgraded our db servers, both old and new running 8.0, and
one casualty was forgetting to re-add the nightly VACUUM ANALYZE.
Inserts had slowed to 7-8 seconds apiece, but are now back to normal,
under a second, since the tables were vacuumed.

However, in the process of investigating this, my boss found something
which we do not understand.  A table with a primary key 'id' takes 200
seconds to SELECT MAX(id), but is as close to instantaneous as you'd
want for SELECT ID ORDER BY ID DESC LIMIT 1.  I understand why
count(*) has to traverse all records, but why does MAX have to?  This
table has about 750,000 rows, rather puny.

I suspect there is either a FAQ which I missed, or no one can answer
without EXPLAIN printouts.  I'm hoping there is some generic answer to
something simple I have overlooked.

--
            ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
     Felix Finch: scarecrow repairman & rocket surgeon / felix@crowfix.com
  GPG = E987 4493 C860 246C 3B1E  6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
Scott Marlowe
Date:
On Mon, 2005-10-24 at 16:57, felix@crowfix.com wrote:
> Having been surprised a few times myself by EXPLAIN showing a
> sequential scan instead of using an index, and having seen so many
> others surprised by it, I hope I am not asking a similar question.
>
> We recently upgraded our db servers, both old and new running 8.0, and
> one casualty was forgetting to add the nightly VACUUM ANALYZE.
> Inserts were down to 7-8 seconds apiece, but are now back to normal
> under a second since the tables were vacuumed.
>
> However, in the process of investigating this, my boss found something
> which we do not understand.  A table with a primary key 'id' takes 200
> seconds to SELECT MAX(id), but is as close to instantaneous as you'd
> want for SELECT ID ORDER BY ID DESC LIMIT 1.  I understand why
> count(*) has to traverse all records, but why does MAX have to?  This
> table has about 750,000 rows, rather puny.
>
> I suspect there is either a FAQ which I missed, or no one can answer
> without EXPLAIN printouts.  I'm hoping there is some generic answer to
> something simple I have overlooked.

It may be, but it's a complex enough question that I'll toss out the
answer as I understand it.

The problems with aggregates stem from two design decisions made in
PostgreSQL.

1:  Generic aggregation

PostgreSQL uses a generic aggregation system.  I.e. there ain't no short
cuts in the query planner for one or another of the aggregates.  If you
make an aggregate function called std_dev() and implement it, it pretty
much gets treated the same by the query planner as the ones that are
built in.

So, to the query planner, max(fieldname) is about the same as
sum(fieldname).

Now, you and I can tell by looking at them that one of those could be
answered pretty quickly with an index, but how's the planner supposed to
know?
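That std_dev() scenario can be sketched concretely. This is a hypothetical illustration using Python's sqlite3 module as a stand-in engine, since it exposes the same kind of generic aggregate extension point (PostgreSQL's CREATE AGGREGATE is analogous); the table and data are made up. The point is that the engine only ever sees an opaque step/finalize pair:

```python
import sqlite3

class StdDev:
    """Naive population standard deviation aggregate.

    To the engine this is a black box: it calls step() per row and
    finalize() at the end, exactly as it would for any aggregate.
    """
    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        n = len(self.values)
        if n == 0:
            return None
        mean = sum(self.values) / n
        return (sum((v - mean) ** 2 for v in self.values) / n) ** 0.5

conn = sqlite3.connect(":memory:")
conn.create_aggregate("std_dev", 1, StdDev)
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(2.0,), (4.0,), (4.0,), (4.0,),
                  (5.0,), (5.0,), (7.0,), (9.0,)])
result = conn.execute("SELECT std_dev(x) FROM t").fetchone()[0]
print(result)  # 2.0
```

Because the planner sees only the opaque function, it has no way to know that max() (unlike std_dev() or sum()) could be answered from one end of an index.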

2:  MVCC

PostgreSQL's "in-store" implementation of Multi-Version Concurrency
Control means that an index entry only tells you where the row is
in the table, not whether that row version is visible to this particular
transaction.  Depending on when the tuple was last updated, when our
transaction started, and our transaction isolation level, a given
version of a given tuple may or may not be visible to our transaction.

So, while PostgreSQL can use an index to find things, it always has to
go back to the actual table to find out if the value is visible to the
current transaction.

Put simply, reads cost more, but writes don't make the performance of
the db plummet.

Put more simply, everyone gets medium-slow read performance for
certain ops, but writes keep streaming right along.    There Ain't No
Such Thing As A Free Lunch (TANSTAAFL).

Notice that you CAN use an index for most aggregates, as long as the
WHERE clause you're using is limiting enough.

select count(*) from sometable where somefield > 22 can use an index if
somefield > 22 is selective enough.  But again, if the db is going to
read some base percentage of the pages in the main table, it's cheaper
to just switch to a sequential scan than to use the index, since it HAS
TO READ all the values in the table to see if they're visible.
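As a hedged sketch of that point, using Python's sqlite3 as a stand-in engine and the hypothetical sometable/somefield names from the example above (the visibility-check cost itself is PostgreSQL-specific and not modeled here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (somefield INTEGER)")
conn.execute("CREATE INDEX sometable_somefield_idx ON sometable (somefield)")
conn.executemany("INSERT INTO sometable VALUES (?)",
                 [(i,) for i in range(100)])

# Selective predicate: only rows with somefield 23..99 qualify, so an
# index range scan visits a small fraction of the entries; an
# unqualified count would have to visit every row.
n = conn.execute(
    "SELECT COUNT(*) FROM sometable WHERE somefield > 22").fetchone()[0]
print(n)  # 77
```

Whether the planner actually picks the index depends on how selective the predicate looks in the statistics, which is exactly why the nightly ANALYZE mattered in the original report.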

The good news is that, generally speaking, everything else in PostgreSQL
is quite fast, and concurrency is very good.

Hope that's not too involved, and Tom, hope I didn't make too many
mistakes there...

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
felix@crowfix.com
Date:
Dang, that's a lot of answer! :-) and not what I was hoping for.  Max
and count both have to look up data records to skip values associated
with other transactions.  But count, by definition, has to scan every
single record from one end of the index to the other, so the index is
useless, whereas max will probably scan only a very few records before
finding the first valid one.

I can't see any difference between these two statements:

    SELECT MAX(id) FROM table;
    SELECT id FROM table ORDER BY id DESC LIMIT 1;

If the planner / optimizer / whatever doesn't optimize them to the
same end result, is there a reason not to?  Is there a case for
putting it on the TODO list?
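The two forms can at least be checked for equivalence of results (though not, on 8.0, of plans). A minimal sketch, with Python's sqlite3 standing in for PostgreSQL and a hypothetical txn table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO txn (id) VALUES (?)",
                 [(i,) for i in range(1, 1001)])

# Both statements return the same value; on PostgreSQL 8.0 only the
# second was planned as a backward index scan, hence the thread.
max_agg = conn.execute("SELECT MAX(id) FROM txn").fetchone()[0]
max_limit = conn.execute(
    "SELECT id FROM txn ORDER BY id DESC LIMIT 1").fetchone()[0]
print(max_agg, max_limit)  # 1000 1000
```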

In case it is any help, here are the EXPLAIN ANALYZE results:

EXPLAIN ANALYZE SELECT id FROM transaction ORDER BY id DESC LIMIT 1;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.98 rows=1 width=4) (actual time=22.482..22.485 rows=1 loops=1)
   ->  Index Scan Backward using transaction_pkey on "transaction"  (cost=0.00..1944638.42 rows=984531 width=4) (actual time=22.474..22.474 rows=1 loops=1)
 Total runtime: 22.546 ms
(3 rows)

----

EXPLAIN ANALYZE SELECT MAX(id) FROM transaction;
                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=52745.64..52745.64 rows=1 width=4) (actual time=11500.994..11500.998 rows=1 loops=1)
   ->  Seq Scan on "transaction"  (cost=0.00..50284.31 rows=984531 width=4) (actual time=57.164..8676.015 rows=738952 loops=1)
 Total runtime: 11501.096 ms

And that's a good one - I've seen it take as long as 200000 ms...




Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
Douglas McNaught
Date:
felix@crowfix.com writes:

> However, in the process of investigating this, my boss found something
> which we do not understand.  A table with a primary key 'id' takes 200
> seconds to SELECT MAX(id), but is as close to instantaneous as you'd
> want for SELECT ID ORDER BY ID DESC LIMIT 1.  I understand why
> count(*) has to traverse all records, but why does MAX have to?  This
> table has about 750,000 rows, rather puny.

As I understand it, because aggregates in PG are extensible (the query
planner just knows it's calling some function), MAX isn't specially
handled--the planner doesn't know it's equivalent to the other query.

There has been some talk of special-casing this, but I'm not sure
where it led--you might check the archives.

-Doug

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
Alex Turner
Date:
I believe, based on semi-recent posts, that MIN and MAX are now treated
as special cases in 8.1, and are synonymous with SELECT id FROM tab
ORDER BY id DESC LIMIT 1, etc.

Alex

On 10/24/05, Douglas McNaught <doug@mcnaught.org> wrote:
> felix@crowfix.com writes:
>
> > However, in the process of investigating this, my boss found something
> > which we do not understand.  A table with a primary key 'id' takes 200
> > seconds to SELECT MAX(id), but is as close to instantaneous as you'd
> > want for SELECT ID ORDER BY ID DESC LIMIT 1.  I understand why
> > count(*) has to traverse all records, but why does MAX have to?  This
> > table has about 750,000 rows, rather puny.
>
> As I understand it, because aggregates in PG are extensible (the query
> planner just knows it's calling some function), MAX isn't specially
> handled--the planner doesn't know it's equivalent to the other query.
>
> There has been some talk of special-casing this, but I'm not sure
> where it lead--you might check the archives.
>
> -Doug
>

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1 -- SOLVED

From
felix@crowfix.com
Date:
On Mon, Oct 24, 2005 at 07:14:43PM -0400, Alex Turner wrote:
> I believe based on semi-recent posts that MIN and MAX are now treated
> as special cases in 8.1, and are synonymous with select id order by id
> desc limit 1 etc..

Aha!  I looked it up in the release notes, and you are right.  It had
never occurred to me that they wouldn't be special-cased.

Thanks.


Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
Michael Fuhr
Date:
On Mon, Oct 24, 2005 at 03:50:57PM -0700, felix@crowfix.com wrote:
> I can't see any difference between these two statements:
>
>     SELECT MAX(id) FROM table;
>     SELECT id FROM table ORDER BY id DESC LIMIT 1;
>
> If the planner / optimizer / whatever doesn't optimize them to the
> same end result, is there a reason not to?  Is there a case for
> putting it on the TODO list?

Already done in 8.1.  Here's an excerpt from the Release Notes:

Automatically use indexes for MIN() and MAX() (Tom)

    In previous releases, the only way to use an index for MIN()
    or MAX() was to rewrite the query as SELECT col FROM tab ORDER
    BY col LIMIT 1.  Index usage now happens automatically.

--
Michael Fuhr

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1 -- SOLVED

From
Bruno Wolff III
Date:
On Mon, Oct 24, 2005 at 16:21:57 -0700,
  felix@crowfix.com wrote:
> On Mon, Oct 24, 2005 at 07:14:43PM -0400, Alex Turner wrote:
> > I believe based on semi-recent posts that MIN and MAX are now treated
> > as special cases in 8.1, and are synonymous with select id order by id
> > desc limit 1 etc..
>
> Aha!  I looked it up in the release notes, you are right.  I had never
> thought they would not be special cased.

They really aren't being special-cased in 8.1. There is a new way of
handling them in a general way that could be used by other functions
with similar properties.

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
Brendan Jurd
Date:
> Already done in 8.1.  Here's an excerpt from the Release Notes:
>
> Automatically use indexes for MIN() and MAX() (Tom)
>
>     In previous releases, the only way to use an index for MIN()
>     or MAX() was to rewrite the query as SELECT col FROM tab ORDER
>     BY col LIMIT 1.  Index usage now happens automatically.
>

Which query form will generally be faster in 8.1 (or will they be
exactly the same)?

BJ

Re: Generic Q about max(id) vs ORDER BY ID DESC LIMIT 1

From
"Jim C. Nasby"
Date:
On Sun, Nov 27, 2005 at 11:38:57PM +1100, Brendan Jurd wrote:
> > Already done in 8.1.  Here's an excerpt from the Release Notes:
> >
> > Automatically use indexes for MIN() and MAX() (Tom)
> >
> >     In previous releases, the only way to use an index for MIN()
> >     or MAX() was to rewrite the query as SELECT col FROM tab ORDER
> >     BY col LIMIT 1.  Index usage now happens automatically.
> >
>
> Which query form will generally be faster in 8.1 (or will they be
> exactly the same)?

They'll effectively be the same:

stats=# explain select id from stats_participant where id is not null order by id limit 1;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.40 rows=1 width=4)
   ->  Index Scan using stats_participant_pkey on stats_participant  (cost=0.00..1486391.76 rows=436912 width=4)
         Filter: (id IS NOT NULL)
(3 rows)

stats=# explain select min(id) from stats_participant;
                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3.40..3.41 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..3.40 rows=1 width=4)
           ->  Index Scan using stats_participant_pkey on stats_participant  (cost=0.00..1486391.76 rows=436912 width=4)
                 Filter: (id IS NOT NULL)
(5 rows)

stats=#

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461