Thread: Problems with order by, limit, and indices

Problems with order by, limit, and indices

From
Denis Perchine
Date:
Hello,

DB: PostgreSQL 7.1 (snap-06012001)
OS: Linux

I have the following table:

slygreetings=> \d users
                              Table "users"
  Attribute  |   Type    |                    Modifier
-------------+-----------+------------------------------------------------
 id          | integer   | not null default nextval('users_id_seq'::text)
 sendername  | text      |
 senderlname | text      |
 sendermail  | text      |
 sender_ip   | inet      |
 senderdate  | timestamp |
 rcptname    | text      |
 rcptmail    | text      |
 rcpt_ip     | inet      |
 reason      | text      |
 rcptdate    | timestamp |
 crypt       | text      |
 active      | boolean   | default 'false'
 variant_id  | integer   |
 info        | text      |
Indices: users_crypt_key,
         users_id_key,
         users_rcptdate_vid_key

Vacuum ouput:
slygreetings=> vacuum verbose analyze users;
NOTICE:  --Relation users--
NOTICE:  Pages 23572: Changed 0, reaped 0, Empty 0, New 0; Tup 660510: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 156, MaxLen 2032; Re-using:
Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 3.23s/0.26u sec.
NOTICE:  Index users_id_key: Pages 1450; Tuples 660510. CPU 0.18s/0.80u sec.
NOTICE:  Index users_crypt_key: Pages 3162; Tuples 660510. CPU 0.43s/0.82u
sec.
NOTICE:  Index users_rcptdate_vid_key: Pages 2321; Tuples 660510. CPU
0.29s/0.76u sec.
NOTICE:  --Relation pg_toast_18741--
NOTICE:  Pages 8: Changed 0, reaped 0, Empty 0, New 0; Tup 35: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 647, MaxLen 2034; Re-using:
Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 0.00s/0.00u sec.
NOTICE:  Index pg_toast_18741_idx: Pages 2; Tuples 35. CPU 0.00s/0.00u sec.
NOTICE:  Analyzing...
VACUUM

Index:

slygreetings=> \d users_rcptdate_vid_key
Index "users_rcptdate_vid_key"
 Attribute  |   Type
------------+-----------
 variant_id | integer
 active     | boolean
 rcptdate   | timestamp
btree

Statistics for variant_id (quite important, will see later)

slygreetings=> select count(variant_id),variant_id from users group by
variant_id;
 count  | variant_id
--------+------------
     11 |          1
 545067 |          2
  20080 |          3
  95351 |          4
      1 |          5
(5 rows)

I need to get 60 last not active users for specific variant_id.

Example with variant_id = 2

slygreetings=> explain select * from users where variant_id=2 AND active='f'
order by rcptdate limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=77117.18..77117.18 rows=60 width=145)
  ->  Sort  (cost=77117.18..77117.18 rows=162640 width=145)
        ->  Seq Scan on users  (cost=0.00..33479.65 rows=162640 width=145)

EXPLAIN

Executor stats:

EXECUTOR STATISTICS
! system usage stats:
!       40.441787 elapsed 19.090000 user 6.340000 system sec
!       [280.840000 user 85.980000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       0/201 [2290/2233] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:      23572 read,          0 written, buffer hit rate =
0.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate =
0.00%
!       Direct blocks:          0 read,          0 written

Example with variant_id = 5

slygreetings=> explain select * from users where variant_id=5 AND active='f'
order by rcptdate limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=13005.10..13005.10 rows=60 width=145)
  ->  Sort  (cost=13005.10..13005.10 rows=3445 width=145)
        ->  Index Scan using users_rcptdate_vid_key on users
(cost=0.00..12658.35 rows=3445 width=145)

EXPLAIN

Executor stats:
EXECUTOR STATISTICS
! system usage stats:
!       0.057452 elapsed 0.000000 user 0.000000 system sec
!       [280.840000 user 85.980000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       0/0 [2290/2233] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:          5 read,          0 written, buffer hit rate =
0.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate =
0.00%
!       Direct blocks:          0 read,          0 written

As far as you can see planner somehow ignores the possibility to use
users_rcptdate_vid_key for both where quals, and order by.

Can please someone help me with this? Is there any possibility to improve the
situation (or maybe I misunderstood something).

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Re: Problems with order by, limit, and indices

From
Tom Lane
Date:
Denis Perchine <dyp@perchine.com> writes:
> Example with variant_id = 2

> slygreetings=> explain select * from users where variant_id=2 AND active='f'
> order by rcptdate limit 60;
> NOTICE:  QUERY PLAN:

> Limit  (cost=77117.18..77117.18 rows=60 width=145)
>   -> Sort  (cost=77117.18..77117.18 rows=162640 width=145)
>     -> Seq Scan on users  (cost=0.00..33479.65 rows=162640 width=145)

This plan looks fine to me, considering that variant_id=2 is the vast
majority of the table.  An indexscan will be slower, except perhaps if
you've recently CLUSTERed the table on this index.  (If you don't
believe me, try it with ENABLE_SEQSCAN set to OFF.)

> Example with variant_id = 5

> slygreetings=> explain select * from users where variant_id=5 AND active='f'
> order by rcptdate limit 60;
> NOTICE:  QUERY PLAN:

> Limit  (cost=13005.10..13005.10 rows=60 width=145)
>   -> Sort  (cost=13005.10..13005.10 rows=3445 width=145)
>     -> Index Scan using users_rcptdate_vid_key on users (cost=0.00..12658.35 rows=3445 width=145)

You could probably get a plan without the sort step if you said
    ... order by variant_id, rcptdate;

            regards, tom lane

Re: Problems with order by, limit, and indices

From
Denis Perchine
Date:
> > Example with variant_id = 2
> >
> > slygreetings=> explain select * from users where variant_id=2 AND
> > active='f' order by rcptdate limit 60;
> > NOTICE:  QUERY PLAN:
> >
> > Limit  (cost=77117.18..77117.18 rows=60 width=145)
> >   -> Sort  (cost=77117.18..77117.18 rows=162640 width=145)
> >     -> Seq Scan on users  (cost=0.00..33479.65 rows=162640 width=145)
>
> This plan looks fine to me, considering that variant_id=2 is the vast
> majority of the table.  An indexscan will be slower, except perhaps if
> you've recently CLUSTERed the table on this index.  (If you don't
> believe me, try it with ENABLE_SEQSCAN set to OFF.)

I would agree with you if there was no limit specified. As far as I can
understand it is possible to traverse users_rcptdate_vid_key Forward,
and get 60 tuples, than finish. And that tuples will be already sorted (index
includes rcptdate also).

> > Example with variant_id = 5
> >
> > slygreetings=> explain select * from users where variant_id=5 AND
> > active='f' order by rcptdate limit 60;
> > NOTICE:  QUERY PLAN:
> >
> > Limit  (cost=13005.10..13005.10 rows=60 width=145)
> >   -> Sort  (cost=13005.10..13005.10 rows=3445 width=145)
> >     -> Index Scan using users_rcptdate_vid_key on users
> > (cost=0.00..12658.35 rows=3445 width=145)
>
> You could probably get a plan without the sort step if you said
>     ... order by variant_id, rcptdate;

No way, it just get all tuples for the qual, sort them, and the limiting.
That's horrible...

slygreetings=> explain select * from users where variant_id=5 AND active='f'
order by rcptdate,variant_id limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=13005.10..13005.10 rows=60 width=145)
  ->  Sort  (cost=13005.10..13005.10 rows=3445 width=145)
        ->  Index Scan using users_rcptdate_vid_key on users
(cost=0.00..12658.35 rows=3445 width=145)

EXPLAIN
slygreetings=> explain select * from users where variant_id=5 AND active='f'
order by rcptdate,variant_id,active limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=13005.10..13005.10 rows=60 width=145)
  ->  Sort  (cost=13005.10..13005.10 rows=3445 width=145)
        ->  Index Scan using users_rcptdate_vid_key on users
(cost=0.00..12658.35 rows=3445 width=145)

EXPLAIN

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Re: Problems with order by, limit, and indices

From
Denis Perchine
Date:
Hi,

another interesting thing...
This is current 7.1.

slygreetings=> explain select * from users where variant_id=5 AND active='f'
order by rcptdate,variant_id,active limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=13005.10..13005.10 rows=60 width=145)
  ->  Sort  (cost=13005.10..13005.10 rows=3445 width=145)
        ->  Index Scan using users_rcptdate_vid_key on users
(cost=0.00..12658.35 rows=3445 width=145)

EXPLAIN
slygreetings=> set enable_sort to off;
SET VARIABLE
slygreetings=> explain select * from users where variant_id=5 AND active='f'
order by rcptdate,variant_id,active limit 60;
NOTICE:  QUERY PLAN:

Limit  (cost=100013005.10..100013005.10 rows=60 width=145)
  ->  Sort  (cost=100013005.10..100013005.10 rows=3445 width=145)
        ->  Index Scan using users_rcptdate_vid_key on users
(cost=0.00..12658.35 rows=3445 width=145)

EXPLAIN

Cost is something really wierd.... Why?

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Re: Problems with order by, limit, and indices

From
Tom Lane
Date:
Denis Perchine <dyp@perchine.com> writes:
>> You could probably get a plan without the sort step if you said
>> ... order by variant_id, rcptdate;

> No way, it just get all tuples for the qual, sort them, and the limiting.
> That's horrible...

> slygreetings=> explain select * from users where variant_id=5 AND active='f'
> order by rcptdate,variant_id limit 60;
> NOTICE:  QUERY PLAN:

> Limit  (cost=13005.10..13005.10 rows=60 width=145)
>    -> Sort  (cost=13005.10..13005.10 rows=3445 width=145)
>       -> Index Scan using users_rcptdate_vid_key on users (cost=0.00..12658.35 rows=3445 width=145)


Now, that's not what I told you to do, is it?  It works fine for me:

regression=# create table users (variant_id int , active bool, rcptdate date);
CREATE
regression=# create index usersind on users( variant_id,rcptdate,active);
CREATE
regression=# explain  select * from users where variant_id=5 AND active='f'
regression-# order by rcptdate limit 1;
NOTICE:  QUERY PLAN:

Limit  (cost=8.22..8.22 rows=1 width=9)
  ->  Sort  (cost=8.22..8.22 rows=5 width=9)
        ->  Index Scan using usersind on users  (cost=0.00..8.16 rows=5 width=9)

EXPLAIN
regression=# explain  select * from users where variant_id=5 AND active='f'
regression-# order by variant_id,rcptdate limit 1;
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..1.63 rows=1 width=9)
  ->  Index Scan using usersind on users  (cost=0.00..8.16 rows=5 width=9)

EXPLAIN

The specified sort order has to match the index if you hope to avoid
a sort step.

            regards, tom lane

Re: Problems with order by, limit, and indices

From
Tom Lane
Date:
Denis Perchine <dyp@perchine.com> writes:
> Cost is something really wierd.... Why?

That's how the enable stuff works for plan types that can't be ignored
completely: it just adds a big constant to the estimated cost.  If there
is no other alternative plan, you get the unwanted plan type anyway.

            regards, tom lane

Re: Problems with order by, limit, and indices

From
Denis Perchine
Date:
> > slygreetings=> explain select * from users where variant_id=5 AND
> > active='f' order by rcptdate,variant_id limit 60;
> > NOTICE:  QUERY PLAN:
> >
> > Limit  (cost=13005.10..13005.10 rows=60 width=145)
> >    -> Sort  (cost=13005.10..13005.10 rows=3445 width=145)
> >       -> Index Scan using users_rcptdate_vid_key on users
> > (cost=0.00..12658.35 rows=3445 width=145)
>
> Now, that's not what I told you to do, is it?  It works fine for me:
>
> regression=# create table users (variant_id int , active bool, rcptdate
> date); CREATE
> regression=# create index usersind on users( variant_id,rcptdate,active);
> CREATE
> regression=# explain  select * from users where variant_id=5 AND active='f'
> regression-# order by rcptdate limit 1;
> NOTICE:  QUERY PLAN:
>
> Limit  (cost=8.22..8.22 rows=1 width=9)
>   ->  Sort  (cost=8.22..8.22 rows=5 width=9)
>         ->  Index Scan using usersind on users  (cost=0.00..8.16 rows=5
> width=9)
>
> EXPLAIN
> regression=# explain  select * from users where variant_id=5 AND active='f'
> regression-# order by variant_id,rcptdate limit 1;
> NOTICE:  QUERY PLAN:
>
> Limit  (cost=0.00..1.63 rows=1 width=9)
>   ->  Index Scan using usersind on users  (cost=0.00..8.16 rows=5 width=9)
>
> EXPLAIN
>
> The specified sort order has to match the index if you hope to avoid
> a sort step.

Dummy me... My aplogies for the time I stole...

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Tuning questions, and an offer

From
Webb Sprague
Date:
I have a bunch of tuning questions, and I would appreciate any answers to
any of them, in any order... :)  Thanks in advance.

1. Clustering: Are the current docs corecct in saying that once
you cluster on an index, subsequent inserts will _not_ respect that
cluster?  I only ask because a recent posting seemed to imply otherwise.

2. Speed difference between trigger functions in C vs PGPLSQL:  Given that
I will need to use identical SQL inside the C function as in the PLPGSQL,
is there a big speed difference?

3. Query Execution Times: How do I find the number of seconds that a query
takes to execute?  In MySQL, that info is printed after the results.

4. Inserting w/ Indexes vs. Selecting w/o Indexes:  Given an equal number
of inserts and selects on a table, (a whole lot of both--about 10/second),
is it better to take the penalty on inserts due to indexes or on selects
do to lack thereof?

5. Does PG have partial indices yet? I only saw a little about this in the
docs.

6. Does PG cache inserts when using Perl, when using `$statement_handle =
$dbh->prepare( "insert into foo (bar) values (?)");
$statement_handle->execute();'?

7. How can one do If-Else chains in PL/PGSQL?  If I am being braindead
about this one, please ignore it.

8. If I get good answers to this, should I patch the FAQ and send it in
for approval?  Is there any other way to help (besides writing C, which I
don't do very well...)?

--
Webb Sprague
Programmer
O1 Communications